MeiliSearch

mirror of https://github.com/meilisearch/MeiliSearch synced 2025-06-25 07:58:30 +02:00

Author	SHA1	Message	Date
meili-bors[bot]	2e49d6aec1	Merge #3768 3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec This PR contains three changes: ## 1. Don't call the `words` ranking rule if the term matching strategy is `All` This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph. ## 2. The `words` ranking rule is replaced by a graph-based ranking rule. This is for three reasons: 1. performance: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily. 2. consistency: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get. 3. surfacing bugs: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix. ## 3. Fix the `update_all_costs_before_nodes` function It is a bit difficult to explain what was wrong, but I'll try. 
The bug happened when we had graphs like: <img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7"> and we gave the node `is` as argument. Then, we'd walk backwards from the node breadth-first. We'd update the costs of: 1. `sun` 2. `thesun` 3. `start` 4. `the` which is an incorrect order. The correct order is: 1. `sun` 2. `thesun` 3. `the` 4. `start` That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-05-23 13:28:08 +00:00
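The ordering constraint described above (a node's cost may only be updated once all of its affected successors have been handled) can be sketched as follows. This is a minimal illustration, not the actual `traverse_breadth_first_backward` implementation: the function name `backward_update_order`, the adjacency-list representation, and the node numbering are assumptions made for the example, and the graph is assumed to be acyclic.

```rust
use std::collections::VecDeque;

/// Hypothetical helper: returns the ancestors of `from` in an order where
/// each node appears only after all of its affected successors.
/// `successors[n]` lists the nodes reachable from `n` by one edge (a DAG).
pub fn backward_update_order(successors: &[Vec<usize>], from: usize) -> Vec<usize> {
    let n = successors.len();

    // Build the reverse adjacency list so we can walk backwards.
    let mut predecessors = vec![Vec::new(); n];
    for (node, succs) in successors.iter().enumerate() {
        for &s in succs {
            predecessors[s].push(node);
        }
    }

    // Mark every node that can reach `from` (including `from` itself).
    let mut affected = vec![false; n];
    affected[from] = true;
    let mut stack = vec![from];
    while let Some(node) = stack.pop() {
        for &p in &predecessors[node] {
            if !affected[p] {
                affected[p] = true;
                stack.push(p);
            }
        }
    }

    // pending[node] = number of affected successors not yet processed.
    let mut pending: Vec<usize> = successors
        .iter()
        .map(|succs| succs.iter().filter(|&&s| affected[s]).count())
        .collect();

    // Breadth-first walk backwards, releasing a node only when all of its
    // affected successors have been processed.
    let mut order = Vec::new();
    let mut queue = VecDeque::from([from]);
    while let Some(node) = queue.pop_front() {
        if node != from {
            order.push(node);
        }
        for &p in &predecessors[node] {
            pending[p] -= 1;
            if pending[p] == 0 {
                queue.push_back(p);
            }
        }
    }
    order
}
```

With the nodes numbered `start = 0`, `the = 1`, `sun = 2`, `thesun = 3`, `is = 4`, calling the function on `is` yields `sun`, `thesun`, `the`, `start`: `start` is held back until both `the` and `thesun` have been processed.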
Louis Dureuil	51043f78f0	Remove trailing whitespace	2023-05-23 15:27:25 +02:00
Louis Dureuil	a490a11325	Add explanatory comment on the way we're recomputing costs	2023-05-23 15:24:24 +02:00
meili-bors[bot]	101f5a20d2	Merge #3757 3757: Adjust the cost of edges in the `position` ranking rule by bucketing positions more aggressively r=loiclec a=loiclec This PR significantly improves the performance of the `position` ranking rule when: 1. a query contains many words 2. the `position` ranking rule needs to be called many times 3. the score of the documents according to `position` is high These conditions greatly increase: 1. the number of edge traversals that are needed to find a valid path from the `start` node to the `end` node 2. the number of edges that need to be deleted from the graph, and therefore the number of times that we need to recompute all the possible costs from START to END As a result, a majority of the search time is spent in `visit_condition`, `visit_node`, and `update_all_costs_before_node`. This is frustrating because it often happens when the "universe" given to the rule consists of only a handful of document ids. By limiting the number of possible edges between two nodes from `20` to `10`, we: 1. reduce the number of possible costs from START to END 2. reduce the number of edges that will be deleted 3. make it faster to update the costs after deleting an edge 4. reduce the number of buckets that need to be computed In terms of relevancy, I don't think we lose or gain much. We still prefer terms that are in lower positions, with decreasing precision as we go further. The previous bucketing wasn't chosen in a principled way, and neither is this one. They both "feel" right to me. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> v1.2.0-rc.1	2023-05-17 11:43:59 +00:00
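The idea of bucketing positions "with decreasing precision as we go further" can be sketched as below. The function name `bucketed_position_cost` and every threshold are invented for illustration and are not the values used in the PR. The only properties taken from the description above are that lower positions get lower costs, that buckets widen with distance, and that there are at most 10 distinct costs.

```rust
/// Hypothetical bucketing of a word position into an edge cost.
/// All thresholds are assumptions; only the shape (exact costs for early
/// positions, wider buckets later, 10 distinct costs) follows the text.
pub fn bucketed_position_cost(position: u32) -> u32 {
    match position {
        0..=3 => position, // early positions keep full precision
        4..=5 => 4,
        6..=7 => 5,
        8..=11 => 6,
        12..=19 => 7,
        20..=35 => 8,
        _ => 9, // everything further away shares one bucket
    }
}
```

Fewer distinct costs means fewer candidate edges between two nodes, which is the mechanism the description credits for the speed-up.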
meili-bors[bot]	6ce1ce77e6	Merge #3738 3738: Add analytics on the get documents resource r=dureuill a=irevoire # Pull Request ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/3737 Related spec https://github.com/meilisearch/specifications/pull/234 ## What does this PR do? Add the analytics for the following routes: - `GET` - `/indexes/:uid/documents` - `GET` - `/indexes/:uid/documents/:doc_id` - `POST` - `/indexes/:uid/documents/fetch` These analytics are aggregated between two events: - `Documents Fetched GET` - `Documents Fetched POST` that share the same payload: \| Property name \| Description \| Example \| \|---------------\|-------------\|---------\| \| `requests.total_received` \| Total number of requests received in this batch \| 325 \| \| `per_document_id` \| `false` \| false \| \| `per_filter` \| `true` if `POST /indexes/:indexUid/documents/fetch` endpoint was used with a filter in this batch, otherwise `false` \| false \| \| `pagination.max_limit` \| Highest value given for the `limit` parameter in this batch \| 60 \| \| `pagination.max_offset` \| Highest value given for the `offset` parameter in this batch \| 1000 \| Co-authored-by: Tamo <tamo@meilisearch.com>	2023-05-16 19:37:41 +00:00
Loïc Lecrenier	ec8f685d84	Fix bug in cheapest path algorithm	2023-05-16 17:01:30 +02:00
Loïc Lecrenier	5758268866	Don't compute split_words for phrases	2023-05-16 17:01:18 +02:00
meili-bors[bot]	4d037e6693	Merge #3759 3759: Invalid error code when parsing filters r=dureuill a=irevoire # Pull Request ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/3753 ## What does this PR do? Fix the error code in case the error comes from the evaluate of the filter for the get, fetch and delete documents routes. Co-authored-by: Tamo <tamo@meilisearch.com>	2023-05-16 12:55:06 +00:00
Tamo	96da5130a4	fix the error code in case of not filterable attributes on the get / delete documents by filter routes	2023-05-16 13:56:18 +02:00
Loïc Lecrenier	3e19702de6	Update snapshot tests	2023-05-16 12:22:46 +02:00
meili-bors[bot]	1e762d151f	Merge #3755 3755: Re-add final dot r=curquiza a=ManyTheFish I removed the final dot of the error message in my last PR, this one re-adds it. related to https://github.com/meilisearch/meilisearch/pull/3749 > Oups 😬 Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-05-16 10:10:58 +00:00
Tamo	0b38f211ac	test the new introduced route	2023-05-16 12:07:44 +02:00
Loïc Lecrenier	f6524a6858	Adjust costs of edges in position ranking rule To ensure good performance	2023-05-16 11:28:56 +02:00
meili-bors[bot]	65ad8cce36	Merge #3741 3741: Add ngram support to the highlighter r=ManyTheFish a=loiclec This PR fixes a bug introduced by the search refactor, where ngrams were not highlighted. The solution was to add the ngrams to the vector of `LocatedQueryTerm` that is given to the `MatchingWords` structure. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-05-16 09:03:31 +00:00
ManyTheFish	42650f82e8	Re-add final dot	2023-05-16 10:57:26 +02:00
Loïc Lecrenier	a37da36766	Implement `words` as a graph-based ranking rule and fix some bugs	2023-05-16 10:42:11 +02:00
Loïc Lecrenier	85d96d35a8	Highlight ngram matches as well	2023-05-16 10:39:36 +02:00
meili-bors[bot]	bf66e97b48	Merge #3749 3749: Fix back: sort error message r=ManyTheFish a=ManyTheFish This PR reintroduces the error message modified in https://github.com/meilisearch/milli/pull/375. However, this added double-quotes around `sort` in the message. I don't think another message contains double-quotes, so I have added a separate commit replacing the double-quotes with back-ticks, which seems more consistent with the other error messages, this last change can be reverted easily. ## Detailed changes #### v1.2-rc0 ``` The sort ranking rule must be specified in the ranking rules settings to use the sort parameter at search time. ``` #### [Reintroduce fix (previous and expected behavior)](`23d1c86825`) ``` You must specify where "sort" is listed in the rankingRules setting to use the sort parameter at search time ``` #### [Replace double-quotes with back-ticks (my suggestion)](`4d691d071a`) ``` You must specify where `sort` is listed in the rankingRules setting to use the sort parameter at search time ``` ## Related Fixes #3722 ## Reviewers - technical review: `@irevoire` - to validate the replacement: `@macraig` Co-authored-by: ManyTheFish <many@meilisearch.com>	2023-05-15 14:55:51 +00:00
meili-bors[bot]	a7ea5ec748	Merge #3651 3651: Use the writemap flag to reduce the memory usage r=irevoire a=Kerollmops This draft PR is showing some stats about the memory usage of Meilisearch when [the LMDB `MDB_WRITEMAP` flag](`3947014aed/libraries/liblmdb/lmdb.h (L573-L581)`) is enabled and when it is not. As you can see there is a reduction of about 50% of the memory usage peak. The dataset used was [the Wikipedia one](https://www.notion.so/meilisearch/Wikipedia-8b1486e4b17547c5bda485d2d97767a0) with the first 30 000 CSV documents without settings. This PR depends on https://github.com/meilisearch/heed/pull/168. I just [opened a discussion](https://github.com/meilisearch/product/discussions/652) for people to understand the tradeoffs and give their feedback. - [x] Create an experiment flag `--experimental-reduce-indexing-memory-usage`. - [x] Add it to the config file. - [x] Explain the tradeoff and copy/link the LMDB documentation in the help message. - [x] Add analytics about the experimental flag. - [x] Document that this flag cannot be used on Windows, ~~or hide it~~.
<details> <summary>The command I used to run the tests</summary> #### Sign the binary to be able to use Instruments / xcrun ```sh codesign -s - -f --entitlements ~/ent.plist target/release/meilisearch ``` where `ent.plist` contains: ```xml <?xml version="1.0" encoding="UTF-8"?> <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> <plist version="1.0"> <dict> <key>com.apple.security.get-task-allow</key> <true/> </dict> </plist> ``` #### Run Meilisearch in measure-mode ```sh xcrun xctrace record --template 'Allocations' --launch -- target/release/meilisearch --max-indexing-memory 0MiB ``` #### Send the wiki dataset available on notion.so / Public ```sh for f in 0.csv 15000.csv; do echo sending $f; xh 'localhost:7700/indexes/wiki/documents' 'content-type:text/csv' @$f; done ``` #### Wait for the task to finish ```sh watch --color xh --pretty all 'localhost:7700/tasks?statuses=processing' ``` </details> Keep in mind that I tested that with the Instruments Apple tools on an iMac 5k 2019. More benchmarks must be done, especially on the indexation speed, as the flag is said to slow down writing into databases bigger than the amount of memory. On the left Meilisearch is running without the flag. On the right, it is running with the flag. <p align="center"> <img align="left" width="45%" alt="Instrument showing the memory usage of Meilisearch without the MDB_WRITEMAP flag" src="https://user-images.githubusercontent.com/3610253/234299524-7607f1df-6fc1-45d3-bd3d-4f9388002857.png"> <img align="right" width="45%" alt="Instrument showing the memory usage of Meilisearch with the MDB_WRITEMAP flag" src="https://user-images.githubusercontent.com/3610253/234299534-6cc3ae58-8bd9-426c-aa79-4c78f9e88b94.png"> </p> Co-authored-by: Kerollmops <clement@meilisearch.com> Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-05-15 14:10:07 +00:00
Kerollmops	dc7ba77e57	Add the option in the config file	2023-05-15 16:07:43 +02:00
Clément Renault	13f870e993	Fix typos and documentation issues	2023-05-15 15:11:45 +02:00
Kerollmops	1a79fd0c3c	Use the new heed v0.12.6	2023-05-15 11:42:30 +02:00
Kerollmops	f759ec7fad	Expose a flag to enable the MDB_WRITEMAP flag	2023-05-15 11:38:43 +02:00
ManyTheFish	4d691d071a	Change double-quotes by back-ticks in sort error message	2023-05-15 11:10:36 +02:00
ManyTheFish	23d1c86825	Re-introduce the sort error message fix	2023-05-15 11:07:23 +02:00
Kerollmops	c4a40e7110	Use the writemap flag to reduce the memory usage	2023-05-15 10:15:33 +02:00
meili-bors[bot]	e01980c6f4	Merge #3739 3739: fix: update `payload_too_large` error message to include human readable maximum acceptable payload size r=Kerollmops a=cymruu # Pull Request ## Related issue Fixes #3736 ## What does this PR do? - update `payload_too_large` error message as requested in ticket ## PR checklist Please check if your PR fulfills the following requirements: - [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [x] Have you read the contributing guidelines? - [x] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: Filip Bachul <filipbachul@gmail.com>	2023-05-11 09:37:19 +00:00
Filip Bachul	25209a3590	introduce `remaining` field in `Payload`	2023-05-10 20:55:18 +02:00
Filip Bachul	3064ea6495	fix: update payload_too_large error message to include human readable maximum acceptable payload size	2023-05-10 18:16:59 +02:00
Tamo	46ec8a97e9	rename the analytics according to the spec	2023-05-10 14:28:30 +02:00
Tamo	c42a65a297	Update meilisearch/src/analytics/segment_analytics.rs Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-05-10 14:28:30 +02:00
Tamo	d08f8690d2	add analytics on the get documents resource	2023-05-10 14:28:30 +02:00
meili-bors[bot]	ad5f25d880	Merge #3742 3742: Compute split words derivations of terms that don't accept typos r=ManyTheFish a=loiclec Allows looking for the split-word derivation for short words in the user's query (like `the -> "t he"` or `door -> do or`) as well as for 3grams. Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-05-10 12:12:52 +00:00
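The split-word derivations mentioned above (`the -> "t he"`, `door -> do or`) amount to cutting a word in two at some character boundary. The sketch below only enumerates the candidate cuts. The function name `split_word_candidates` is an assumption, and how Meilisearch picks among the candidates is not described here and not modelled.

```rust
/// Hypothetical helper: lists every way to cut `word` into two non-empty
/// parts, respecting UTF-8 character boundaries.
pub fn split_word_candidates(word: &str) -> Vec<(String, String)> {
    word.char_indices()
        .skip(1) // skip index 0, which would leave the left part empty
        .map(|(i, _)| {
            let (left, right) = word.split_at(i);
            (left.to_string(), right.to_string())
        })
        .collect()
}
```

For `door` this produces `d oor`, `do or`, and `doo r`; a one-character word has no valid split and yields an empty list.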
Loïc Lecrenier	4d352a21ac	Compute split words derivations of terms that don't accept typos	2023-05-10 13:31:19 +02:00
meili-bors[bot]	4a4210c116	Merge #3734 3734: Update version for the next release (v1.2.0) in Cargo.toml r=curquiza a=meili-bot ⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging. Co-authored-by: curquiza <curquiza@users.noreply.github.com> v1.2.0-rc.0	2023-05-09 07:35:48 +00:00
curquiza	3533d4f2bb	Update version for the next release (v1.2.0) in Cargo.toml	2023-05-08 17:52:33 +00:00
Loïc Lecrenier	3625389057	Highlight ngram matches as well	2023-05-08 15:35:41 +02:00
meili-bors[bot]	eace6df91b	Merge #3726 3726: Fix prefix highlighting r=loiclec a=ManyTheFish The prefix queries were not properly highlighted, this PR now highlights only the start of a word when it matched with a prefix Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>	2023-05-08 07:46:46 +00:00
Loïc Lecrenier	83ab8cf4e5	Remove dbg!(..) expression in highlighter tests	2023-05-08 09:45:23 +02:00
ManyTheFish	cd2573fcc3	Fix prefix highlighting	2023-05-04 16:53:50 +02:00
meili-bors[bot]	9f7981df28	Merge #3687 3687: Allow to disable specialized tokenizations (again) r=Kerollmops a=jirutka In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow meilisearch to be built without huge specialized tokenizations that took up 90% of the meilisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph. Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about. Co-authored-by: Jakub Jirutka <jakub@jirutka.cz>	2023-05-04 14:48:01 +00:00
Jakub Jirutka	e615fa5ec6	Fix unused_imports warning in milli when japanese is not enabled	2023-05-04 15:46:11 +02:00
Jakub Jirutka	13f1277637	Allow to disable specialized tokenizations (again) In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow meilisearch to be built without huge specialized tokenizations that took up 90% of the meilisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph. Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about.	2023-05-04 15:45:40 +02:00
meili-bors[bot]	4919774f2e	Merge #3570 3570: Get documents by filter r=irevoire a=dureuill # Pull Request ## Related issue Associated spec: https://github.com/meilisearch/specifications/pull/234 None really, this is more of an extension of #3477: since after this issue we'll be able to delete documents by filter, it makes sense to also be able to get documents by filter. ## What does this PR do? ### User standpoint - Add a new `filter` URL parameter to `GET /indexes/{:indexUid}/documents` and a new `POST /indexes/{:indexUid}/documents/fetch` route with the same `offset, limit, fields, filter` ### Implementation standpoint - Add a new `Index::iter_documents` method to iterate on a set of documents rather than return a vector of these documents. - Rewrite the other `Index::*documents` methods to use the new `Index::iter_documents` method. ## Usage <details> <summary> Sample request and response </summary> ``` curl -X POST 'http://localhost:7700/indexes/index-1101/documents/fetch' -H 'Content-Type: application/json' --data-binary '{ "filter": "genres = Comedy", "limit": 3, "offset": 8000}' \| jsonxf ``` ```json { "results": [ { "id": 326126, "title": "Bad Exorcists", "overview": "A trio of awkward teens intend to win a horror festival by making their own movie, but wind up getting their actress possessed in the process.", "genres": [ "Horror", "Comedy" ], "poster": "https://image.tmdb.org/t/p/w500/lwd65kPbjFacAw3QSXiwSsW6cFU.jpg", "release_date": 1425081600 }, { "id": 326215, "title": "Ooops! Noah is Gone...", "overview": "It's the end of the world. A flood is coming. Luckily for Dave and his son Finny, a couple of clumsy Nestrians, an Ark has been built to save all animals. But as it turns out, Nestrians aren't allowed. Sneaking on board with the involuntary help of Hazel and her daughter Leah, two Grymps, they think they're safe. Until the curious kids fall off the Ark. 
Now Finny and Leah struggle to survive the flood and hungry predators and attempt to reach the top of a mountain, while Dave and Hazel must put aside their differences, turn the Ark around and save their kids. It's definitely not going to be smooth sailing.", "genres": [ "Animation", "Adventure", "Comedy", "Family" ], "poster": "https://image.tmdb.org/t/p/w500/gEJXHgpiKh89Vwjc4XUY5CIgUdB.jpg", "release_date": 1427328000 }, { "id": 326241, "title": "For Here or to Go?", "overview": "An aspiring Indian tech entrepreneur in the Silicon Valley finds himself unexpectedly battling the bizarre American immigration system to keep his dream alive or prepare to return home forever.", "genres": [ "Drama", "Comedy" ], "poster": "https://image.tmdb.org/t/p/w500/ff8WaA7ItBgl36kdT232i0d0Fnq.jpg", "release_date": 1490918400 } ], "offset": 8000, "limit": 3, "total": 9331 } ``` <img width="1348" alt="Capture d’écran 2023-03-08 à 10 09 04" src="https://user-images.githubusercontent.com/41078892/223670905-6932b79b-f9b8-4a41-b59e-be2171705b7d.png"> </details> # Draft status - [ ] Route naming: having one route be `GET /indexes/{:indexUid}/documents` and the other `POST /indexes/{:indexUid}/documents/fetch` is suboptimal (also, technically a breaking change for documents with `fetch` as uid?), but `POST /indexes/{:indexUid}/documents` is already used to insert documents. Co-authored-by: Louis Dureuil <louis@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com>	2023-05-04 12:54:26 +00:00
Tamo	a3da680ce6	Update meilisearch/tests/documents/errors.rs Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-05-04 14:51:17 +02:00
Tamo	11e394dba1	merge the document fetch and get error codes	2023-05-04 15:39:49 +02:00
Tamo	469d2f2a9c	fix the fields field of the POST fetch document API	2023-05-04 15:34:09 +02:00
Tamo	ce6507d20c	improve the test of the get document by filter	2023-05-04 15:34:09 +02:00
Tamo	b92da5d15a	add a big test on the get document by filter of the get route	2023-05-04 15:34:09 +02:00
Tamo	ed3dfbe729	add error codes and tests	2023-05-04 15:34:08 +02:00


8008 Commits