MeiliSearch

mirror of https://github.com/meilisearch/MeiliSearch synced 2025-02-20 01:08:29 +01:00

Author	SHA1	Message	Date
Kerollmops	7c2f5f77b8	Make clippy and fmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	66b8cfd8c8	Introduce a way to store the HNSW on multiple LMDB entries	2023-06-27 12:32:42 +02:00
Kerollmops	ff3664431f	Make rustfmt happy	2023-06-27 12:32:42 +02:00
Kerollmops	531748c536	Return a user error when the _vectors type is invalid	2023-06-27 12:32:41 +02:00
Kerollmops	7aa1275337	Display the _semanticSimilarity even if the `_vectors` field is not displayed	2023-06-27 12:32:41 +02:00
Kerollmops	737aec1705	Expose an _semanticSimilarity as a dot product in the documents	2023-06-27 12:32:41 +02:00
Kerollmops	3e3c743392	Make Rustfmt happy	2023-06-27 12:32:41 +02:00
Kerollmops	5c5a4e075d	Make clippy happy	2023-06-27 12:32:41 +02:00
Kerollmops	ab9f2269aa	Normalize the vectors during indexation and search	2023-06-27 12:32:41 +02:00
Kerollmops	321ec5f3fa	Accept multiple vectors by documents using the _vectors field	2023-06-27 12:32:40 +02:00
Kerollmops	717d4fddd4	Remove the unused distance	2023-06-27 12:32:40 +02:00
Kerollmops	a7e0f0de89	Introduce a new error message for invalid vector dimensions	2023-06-27 12:32:40 +02:00
Kerollmops	3b560ef7d0	Make clippy happy	2023-06-27 12:32:40 +02:00
Kerollmops	2cf747cb89	Fix the tests	2023-06-27 12:32:40 +02:00
Kerollmops	3c31e1cdd1	Support more pages but in an ugly way	2023-06-27 12:32:39 +02:00
Kerollmops	23eaaf1001	Change the name of the distance module	2023-06-27 12:32:39 +02:00
Kerollmops	c2a402f3ae	Implement an ugly deletion of values in the HNSW	2023-06-27 12:32:39 +02:00
Kerollmops	436a10bef4	Replace the euclidean with a dot product	2023-06-27 12:32:39 +02:00
Kerollmops	8debf6fe81	Use a basic euclidean distance function	2023-06-27 12:32:39 +02:00
Kerollmops	c79e82c62a	Move back to the hnsw crate This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.	2023-06-27 12:32:39 +02:00
Kerollmops	aca305bb77	Log more to make sure we insert vectors in the hgg data-structure	2023-06-27 12:32:38 +02:00
Kerollmops	5816008139	Introduce an optimized version of the euclidean distance function	2023-06-27 12:32:38 +02:00
Kerollmops	268a9ef416	Move to the hgg crate	2023-06-27 12:32:38 +02:00
Clément Renault	642b0f3a1b	Expose a new vector field on the search route	2023-06-27 12:32:38 +02:00
Clément Renault	4571e512d2	Store the vectors in an HNSW in LMDB	2023-06-27 12:32:38 +02:00
Clément Renault	7ac2f1489d	Extract the vectors from the documents	2023-06-27 12:32:37 +02:00
Clément Renault	34349faeae	Create a new _vector extractor	2023-06-27 12:32:37 +02:00
meili-bors[bot]	2d34005965	Merge #3821 3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill # Pull Request ## Related issue Fixes #3771 ## What does this PR do? ### User standpoint <details> <summary>Request ranking score</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScore": true, "limit": 10, "attributesToRetrieve": ["title"] }' \| mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScore": 0.947520325203252 }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScore": 0.947520325203252 }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScore": 0.6657594086021505 }, { "title": "Legends of the Dark Knight: The History of Batman", "_rankingScore": 0.6654905913978495 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Angel and the Badman", "_rankingScore": 0.2196969696969697 }, { "title": "Batman", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Begins", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Returns", "_rankingScore": 0.11553030303030302 }, { "title": "Batman Forever", "_rankingScore": 0.11553030303030302 } ], "query": "Badman dark knight returns", "processingTimeMs": 12, "limit": 10, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - If adding a `showRankingScore` parameter to the search query, then documents returned by a search now contain an additional field `_rankingScore` that is a float bigger than 0 and lower or equal to 1.0. This field represents the relevancy of the document, relatively to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). 
- The `sort` and `geosort` ranking rules do not influence the `_rankingScore`. <details> <summary>Request detailed ranking scores</summary> ``` echo '{ "q": "Badman dark knight returns", "showRankingScoreDetails": true, "limit": 5, "attributesToRetrieve": ["title"] }' \| mieli search -i index-word-count-10-count ``` </details> <details> <summary>Response</summary> ```json { "hits": [ { "title": "Batman: The Dark Knight Returns, Part 1", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman: The Dark Knight Returns, Part 2", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 4, "maxMatchingWords": 4, "score": 1.0 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 4, "score": 0.8 }, "proximity": { "order": 2, "score": 0.9545454545454546 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.926829268292683, "score": 0.926829268292683 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.26666666666666666 } } }, { "title": "Batman Unmasked: The Psychology of the Dark Knight", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8064516129032258, "score": 0.8064516129032258 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Legends of the Dark 
Knight: The History of Batman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 3, "maxMatchingWords": 4, "score": 0.75 }, "typo": { "order": 1, "typoCount": 1, "maxTypoCount": 3, "score": 0.75 }, "proximity": { "order": 2, "score": 0.6666666666666666 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.7419354838709677, "score": 0.7419354838709677 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.25 } } }, { "title": "Angel and the Badman", "_rankingScoreDetails": { "words": { "order": 0, "matchingWords": 1, "maxMatchingWords": 4, "score": 0.25 }, "typo": { "order": 1, "typoCount": 0, "maxTypoCount": 1, "score": 1.0 }, "proximity": { "order": 2, "score": 1.0 }, "attribute": { "order": 3, "attributes_ranking_order": 1.0, "attributes_query_word_order": 0.8181818181818182, "score": 0.8181818181818182 }, "exactness": { "order": 4, "matchType": "noExactMatch", "score": 0.3333333333333333 } } } ], "query": "Badman dark knight returns", "processingTimeMs": 9, "limit": 5, "offset": 0, "estimatedTotalHits": 46 } ``` </details> - If adding a `showRankingScoreDetails` parameter to the search query, then the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields: - `order`: a number indicating the order this rule was applied (0 is the first applied ranking rule) - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule. - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document. - If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute not part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`. 
- Search queries that are part of a `multi-search` requests are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). ### Implementation standpoint - Fix difference in how the position of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check. - Fix the id reported by the sort ranking rule (very minor) - Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words. - When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score. - add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset - the Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0. - modified proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query - Add a new `milli::score_details` module containing all the types that are involved in score computation. 
- Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`. - Expose the scores in the REST API. - Add very light analytics for scoring. - Update the search tests to add the expected scores. Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-06-26 09:32:43 +00:00
meili-bors[bot]	040b5a5b6f	Merge #3842 3842: fix some typos r=dureuill a=cuishuang # Pull Request ## Related issue Fixes #<issue_number> ## What does this PR do? - fix some typos ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: cui fliter <imcusg@gmail.com>	2023-06-22 18:01:10 +00:00
cui fliter	530a3e2df3	fix some typos Signed-off-by: cui fliter <imcusg@gmail.com>	2023-06-22 21:59:00 +08:00
Louis Dureuil	d26e9a96ec	Add score details to new search tests	2023-06-22 12:39:14 +02:00
Louis Dureuil	49c8bc4de6	Fix tests	2023-06-22 12:39:14 +02:00
Louis Dureuil	da833eb095	Expose the scores and detailed scores in the API	2023-06-22 12:39:14 +02:00
Louis Dureuil	701d44bd91	Store the scores for each bucket Remove optimization where ranking rules are not executed on buckets of a single document when the score needs to be computed	2023-06-22 12:39:14 +02:00
Louis Dureuil	c621a250a7	Score for graph based ranking rules Count phrases in matchingWords and maxMatchingWords	2023-06-22 12:39:14 +02:00
Louis Dureuil	8939e85f60	Add rank_to_score for graph based ranking rules	2023-06-22 12:39:14 +02:00
Louis Dureuil	fa41d2489e	Score for sort	2023-06-22 12:39:14 +02:00
Louis Dureuil	59c5b992c2	Score for geosort	2023-06-22 12:39:14 +02:00
Louis Dureuil	2ea8194c18	Score for exact_attributes	2023-06-22 12:39:14 +02:00
Louis Dureuil	421df64602	RankingRuleOutput now contains a Score	2023-06-22 12:39:14 +02:00
Louis Dureuil	c0fca6f884	Add score_details	2023-06-22 12:39:14 +02:00
Louis Dureuil	f050634b1e	add virtual conditions to fid and position to always have the max cost	2023-06-20 10:07:18 +02:00
Louis Dureuil	becf1f066a	Change how the cost of removing words is computed	2023-06-20 09:45:43 +02:00
Louis Dureuil	701d299369	Remove out-of-date comment	2023-06-20 09:45:42 +02:00
Louis Dureuil	a20e4d447c	Position now takes into account the distance to the position of the word in the query it used to be based on the distance to the position 0	2023-06-20 09:45:42 +02:00
Louis Dureuil	af57c3c577	Proximity costs 0 for documents that are perfectly matching	2023-06-20 09:45:42 +02:00
Louis Dureuil	0c40ef6911	Fix sort id	2023-06-20 09:45:42 +02:00
meili-bors[bot]	45636d315c	Merge #3670 3670: Fix addition deletion bug r=irevoire a=irevoire The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enables the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed. It fixes https://github.com/meilisearch/meilisearch/issues/3440. ### What was happening? The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents. Instead of doing a clear merge between the external document IDs of the DB and the ones returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible. The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on whether the soft deletion was used or not. Here is a list of the changes introduced: 1. We now make a clear distinction between the `new_external_documents_ids` coming from the transform and only held in RAM and the `external_documents_ids` coming from the DB. 2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand 3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform. ### Other things introduced in this PR Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug. It's not perfect, but it's easy to improve in the future. It'll also run for as long as possible on every merge on the main branch. 
Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-19 09:09:30 +00:00
meili-bors[bot]	cb9d78fc7f	Merge #3835 3835: Add more documentation to graph-based ranking rule algorithms + comment cleanup r=Kerollmops a=loiclec In addition to documenting the `cheapest_path.rs` file, this PR cleans up a few outdated comments as well as some TODOs. These TODOs have been moved to https://github.com/meilisearch/meilisearch/issues/3776 Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>	2023-06-15 15:30:24 +00:00
Louis Dureuil	e0c4682758	Fix tests	2023-06-14 13:30:52 +02:00
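
Several of the vector commits listed above ("Normalize the vectors during indexation and search", "Replace the euclidean with a dot product", "Expose an _semanticSimilarity as a dot product in the documents") rest on one idea: once both vectors are scaled to unit length, their dot product equals their cosine similarity. The Rust sketch below only illustrates that idea. The function names `normalize` and `dot` are hypothetical and are not taken from the Meilisearch or milli code base, and the handling of zero vectors is an assumption.

```rust
/// Hypothetical helper: scale a vector to unit length.
/// A zero vector is returned unchanged (an assumption of this sketch).
fn normalize(v: &[f32]) -> Vec<f32> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm == 0.0 {
        return v.to_vec();
    }
    v.iter().map(|x| x / norm).collect()
}

/// Hypothetical helper: dot product of two vectors of the same dimension.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vector dimensions must match");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Both vectors point in the same direction, so after normalization
    // their dot product is close to 1.0.
    let document = normalize(&[3.0, 4.0]);
    let query = normalize(&[6.0, 8.0]);
    println!("similarity = {}", dot(&document, &query));
}
```

Normalizing once at indexing time and once at query time means the similarity of each candidate costs only a dot product, with no per-comparison division by the vector lengths.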


1758 Commits