Typo and Ranking rules

This is an explanation of the default rules used in MeiliDB.

First, we define some terms used throughout this document.

  • A query string is the full list of words that the end user is searching for.
  • A query word is one of the words that compose the query string.

Typo rules

The typo rules are applied before the documents are sorted. They select the candidate documents, that is, the documents that contain words similar to the query words.

We use a prefix Levenshtein algorithm to check whether words match. The only difference from a plain Levenshtein algorithm is that it also accepts every word that starts with the query word. A document word is therefore accepted if it matches the query word over its full length, or if it begins with a string that matches the query word.

The Levenshtein distance between two words M and P is the minimum cost of transforming M into P by performing the following elementary operations, each with a cost of 1:

  • substitution of a character of M by a different character from P (e.g. kitten → sitten)
  • insertion into M of a character of P (e.g. sittin → sitting)
  • deletion of a character from M (e.g. saturday → satuday)
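
The definition above can be sketched in a few lines of Python. This is a minimal illustration of the textbook dynamic-programming formulation, not MeiliDB's actual implementation; the function name `levenshtein` is chosen here for the example.

```python
def levenshtein(m: str, p: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning m into p."""
    # previous[j] is the distance between the characters of m read so far
    # and the first j characters of p. Before reading anything from m,
    # reaching p[:j] takes j insertions.
    previous = list(range(len(p) + 1))
    for i, cm in enumerate(m, start=1):
        # Turning the first i characters of m into "" takes i deletions.
        current = [i]
        for j, cp in enumerate(p, start=1):
            current.append(min(
                previous[j] + 1,               # delete cm from m
                current[j - 1] + 1,            # insert cp into m
                previous[j - 1] + (cm != cp),  # substitute cm by cp (free if equal)
            ))
        previous = current
    return previous[-1]


# The three examples from the list above are each one operation apart.
print(levenshtein("kitten", "sitten"))     # substitution
print(levenshtein("sittin", "sitting"))    # insertion
print(levenshtein("saturday", "satuday"))  # deletion
```

Only two rows of the table are kept at a time, so memory use grows with the length of P rather than with the product of both lengths.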

There are some rules about what can be considered "similar". These rules apply to each query word individually, not to the whole query string.

  • If the query word is between 1 and 4 characters long, no typo is allowed. Only documents that contain words that start with, or are exactly equal to, this query word are considered valid for the request.
  • If the query word is between 5 and 8 characters long, one typo is allowed. Documents that contain words matching with at most one typo are retained for the next steps.
  • If the query word contains more than 8 characters, we accept a maximum of two typos.

This means that "satuday", which is 7 characters long, falls under the second rule, so every document containing a word that matches it with at most one typo is accepted. For example:

  • "satuday" is accepted because it is exactly the same word.
  • "sat" is not accepted because it is a prefix of the query word, not the other way around.
  • "saturday" is accepted because it differs from the query word by one typo.
  • "suturday" is not accepted because it differs from the query word by two typos.
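
The length thresholds and the four examples above can be checked with a small sketch. It assumes that "prefix Levenshtein" means the smallest edit distance between the query word and any prefix of the document word; the helper names `allowed_typos`, `prefix_distance` and `is_accepted` are hypothetical and are not taken from the MeiliDB code base.

```python
def allowed_typos(query_word: str) -> int:
    """Typo budget for one query word, following the three length rules."""
    length = len(query_word)
    if length <= 4:
        return 0
    if length <= 8:
        return 1
    return 2


def prefix_distance(query_word: str, document_word: str) -> int:
    """Smallest edit distance between query_word and any prefix of document_word."""
    # row[j] is the distance between the query characters read so far
    # and the first j characters of document_word.
    row = list(range(len(document_word) + 1))
    for i, qc in enumerate(query_word, start=1):
        new_row = [i]
        for j, dc in enumerate(document_word, start=1):
            new_row.append(min(
                row[j] + 1,               # drop a query character
                new_row[j - 1] + 1,       # add a document character
                row[j - 1] + (qc != dc),  # substitute (free if equal)
            ))
        row = new_row
    # Taking the minimum over the whole row, rather than its last cell,
    # lets the document word continue past the end of the query word.
    return min(row)


def is_accepted(query_word: str, document_word: str) -> bool:
    """True if document_word matches query_word within its typo budget."""
    return prefix_distance(query_word, document_word) <= allowed_typos(query_word)


for word in ("satuday", "sat", "saturday", "suturday"):
    print(word, is_accepted("satuday", word))
```

Under the same assumption, a document word such as "saturdays" would also be accepted for the query word "satuday", because its prefix "saturday" is one typo away.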

Ranking rules

All documents that have been aggregated using the typo rules above can now be sorted. MeiliDB uses a bucket sort.

What is a bucket sort? We sort all the documents with the first rule. Documents that the first rule cannot separate are grouped together, and each group is sorted with the second rule, and so on through the remaining rules.
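
The procedure can be sketched as a short recursive function. This is an illustration only: the rule functions and the `typos` and `words` fields in the usage example are invented for the sketch, and each rule is assumed to return a value where lower means better.

```python
from itertools import groupby
from typing import Any, Callable, List, Sequence

Rule = Callable[[Any], Any]  # maps a document to a sortable value, lower is better


def bucket_sort(documents: Sequence[Any], rules: Sequence[Rule]) -> List[Any]:
    """Order documents by the first rule, breaking ties with the following rules."""
    if not rules or len(documents) < 2:
        return list(documents)
    first, *rest = rules
    ordered = sorted(documents, key=first)
    result: List[Any] = []
    # Documents that the current rule cannot separate form one bucket,
    # which is then ordered by the remaining rules.
    for _, bucket in groupby(ordered, key=first):
        result.extend(bucket_sort(list(bucket), rest))
    return result


# Hypothetical documents: fewer typos first, then more matched words first.
docs = [
    {"id": "a", "typos": 1, "words": 2},
    {"id": "b", "typos": 0, "words": 1},
    {"id": "c", "typos": 0, "words": 2},
    {"id": "d", "typos": 1, "words": 2},
]
rules = [lambda d: d["typos"], lambda d: -d["words"]]
print([d["id"] for d in bucket_sort(docs, rules)])
```

Documents "a" and "d" tie on both rules, so they keep their original relative order because Python's `sorted` is stable.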

Here are the default rules, in the order in which they are applied:

  • Number of Typos - The fewer typos there are between the query words and the document words, the better the document.
  • Number of Words - A document containing more of the query words is more important than one that contains fewer.
  • Words Proximity - The closer together the query words are in the document, the better the document.
  • Attribute - A document containing the query words in a more important attribute than another document is considered better.
  • Position - A document containing the query words at the start of an attribute is considered better than a document that contains them at the end.
  • Exact - A document containing the query words in their exact form, not only a prefix of them, is considered better.