MeiliSearch

mirror of https://github.com/meilisearch/MeiliSearch synced 2025-07-15 13:58:36 +02:00

Author	SHA1	Message	Date
ManyTheFish	467b49153d	Implement proximityPrecision setting on milli side	2023-12-06 15:49:02 +01:00
ManyTheFish	0c3fa8cbc4	Add tests on proximityPrecision setting	2023-12-06 14:59:23 +01:00
ManyTheFish	bddc168d83	List TODOs	2023-12-06 14:59:23 +01:00
meili-bors[bot]	84a36002d7	Merge #4239 4239: Remove the actix-web dependency from milli r=dureuill a=Kerollmops Just remove actix-web from milli. Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-11-29 10:19:40 +00:00
Clément Renault	170e063b80	Remove the actix-web dependency from milli	2023-11-28 17:19:57 +01:00
meili-bors[bot]	6376c342c1	Merge #4223 4223: Update to heed 0.20 r=dureuill a=Kerollmops This PR brings the v0.20-alpha.9 version of heed into Meilisearch 🎉 The main goal is to test it in a real environment to make the necessary changes if needed. We also want to merge it as soon as possible during the pre-release phase to ensure we catch bugs before the release. Most of the calls to heed are the same as before, except: - The `PolyDatabase` has been replaced with a `Database<Unspecified, Unspecified>`. We replaced the `get<T, U>()` calls with `remap<T, U>().get()` calls. - The `Database` `append(...)` method has been replaced with a `put_with_flags(PutFlags::APPEND, ...)`. - The `RwTxn<'e, 'p>` has been simplified into a `RwTxn<'e>`. - The `BytesEncode/Decode` traits return a `Result<_, BoxedError>` instead of an `Option<_>`. - We no longer need to wrap and unwrap the `BEU32` integer when storing/getting them from heed. ### TODO - [x] Create actual, simple error types instead of using strings in the codecs. ### Follow-up work - Move the codecs into another member crate (we depend on the uuid one in the meilitool crate). - Display the internal decoding error in the `SerializationError` internal error variant. Co-authored-by: Clément Renault <clement@meilisearch.com>	2023-11-28 13:39:44 +00:00
Clément Renault	5b563f872b	Move the clippy attribute on the problematic part of the code	2023-11-28 14:37:58 +01:00
Clément Renault	ec9b52d608	Rename copy_to_path to copy_to_file	2023-11-28 14:32:30 +01:00
Clément Renault	34c67ac389	Remove the possibility to fail fetching the env info	2023-11-28 14:31:23 +01:00
Clément Renault	d050c9b4ae	Only remap the main database once	2023-11-28 14:27:30 +01:00
Clément Renault	7dd1226faf	Clarify an unreachable unwrap	2023-11-28 14:26:31 +01:00
Clément Renault	1575456594	Further reduce an async block	2023-11-28 14:23:32 +01:00
Clément Renault	add2ceef67	Introduce error types to avoid panics	2023-11-28 14:21:49 +01:00
Clément Renault	548c8247c2	Create and use real error types in the codecs	2023-11-28 10:11:17 +01:00
meili-bors[bot]	181ca48482	Merge #4234 4234: Fix puffin in the index scheduler r=dureuill a=irevoire Currently, we can't compile the index scheduler without this feature. It could be cool to specify the dependencies in the main workspace cargo toml like quickwit does to avoid this kind of error in the future; https://github.com/quickwit-oss/quickwit/blob/main/quickwit/Cargo.toml#L41 Co-authored-by: Tamo <tamo@meilisearch.com>	2023-11-28 08:23:48 +00:00
Tamo	5751f5c640	fix puffin in the index scheduler	2023-11-27 15:18:33 +01:00
Clément Renault	d32eb11329	Move to the v0.20.0-alpha.9 of heed	2023-11-27 11:52:22 +01:00
meili-bors[bot]	3d23b388bc	Merge #4231 4231: Fixed payload limit setting being ignored for delete documents by batch r=Kerollmops a=Karribalu # Pull Request ## Related issue Fixes #4224 ## What does this PR do? - Added http_payload_size_limit to JsonConfig to allow deleting documents in batches with a payload size greater than 2MB, which is the default limit set in the JsonConfig crate. ## PR checklist Please check if your PR fulfills the following requirements: - [Y] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)? - [Y] Have you read the contributing guidelines? - [Y] Have you made sure that the title is accurate and descriptive of the changes? Thank you so much for contributing to Meilisearch! Co-authored-by: karribalu <karri.balu123456@gmail.com>	2023-11-27 09:26:21 +00:00
karribalu	85626cff8e	Fixed payload limit setting being ignored for delete documents by batch route	2023-11-25 18:41:16 +00:00
Clément Renault	58dac8af42	Remove the panics and unwraps	2023-11-23 15:00:48 +01:00
Clément Renault	0dbf1a16ff	Make clippy happy	2023-11-23 14:11:38 +01:00
Clément Renault	462b4c0080	Fix the tests	2023-11-23 12:07:35 +01:00
Clément Renault	0d4482625a	Make the changes to use heed v0.20-alpha.6	2023-11-23 11:43:58 +01:00
Clément Renault	56a0d91ecd	Update the heed dependency and lock file	2023-11-22 15:11:09 +01:00
meili-bors[bot]	b366acdae6	Merge #4220 4220: Bring back changes from v1.5.0 into main r=dureuill a=Kerollmops This will bring the fixes from v1.5.0 into main. By [following this guide](https://github.com/meilisearch/engine-team/blob/main/resources/meilisearch-release.md#after-the-release) I decided to create a temporary branch to fix the git conflicts and merge into main afterward. Co-authored-by: curquiza <curquiza@users.noreply.github.com> Co-authored-by: Vivek Kumar <vivek.26@outlook.com> Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com> Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com> Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Tamo <tamo@meilisearch.com> Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Louis Dureuil <louis.dureuil@xinra.net> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-11-22 07:46:22 +00:00
Clément Renault	7cb7e37ba8	Merge branch 'main' into tmp-release-v1.5.0	2023-11-21 16:30:46 +01:00
meili-bors[bot]	33b7c574ea	Merge #4090 4090: Diff indexing r=ManyTheFish a=ManyTheFish This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB. ## Why focus on reducing the writings in LMDB? The indexing in Meilisearch is split into 3 main phases: 1) The computing or the extraction of the data (Multi-threaded) 2) The writing of the data in LMDB (Mono-threaded) 3) The processing of the prefix databases (Mono-threaded) see below: ![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56) Because the writing is mono-threaded, it represents a bottleneck in the indexing. Reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing. ## Give Feedback We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues. ## Technical approach ### Part 1: merge the addition and the deletion process This part: a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example: - Updating the number of "likes" or "stars" of a song or a movie - Updating the "stock count" or the "price" of a product b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck. 
c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation - [x] Create a preprocessing function that creates the diff-based documents chunk (`OBKV<fid, OBKV<AddDel, value>>`) - [x] and clearly separate the faceted fields and the searchable fields in two different chunks - Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of `OBKV<fid, value>`. - [x] extract_docid_word_positions - [x] extract_geo_points - [x] extract_vector_points - [x] extract_fid_docid_facet_values - Adapt the searchable extractors to the new diff-chunks - [x] extract_fid_word_count_docids - [x] extract_word_pair_proximity_docids - [x] extract_word_position_docids - [x] extract_word_docids - Adapt the facet extractors to the new diff-chunks - [x] extract_facet_number_docids - [x] extract_facet_string_docids - [x] extract_fid_docid_facet_values - [x] FacetsUpdate - [x] Adapt the prefix database extractors ⚠️ ⚠️ - [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added - [x] Remove document deletion pipeline - [x] remove `new_documents_ids` entirely and `replaced_documents_ids` - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents` - [x] Remove deletion pipeline after autobatcher - [x] remove autobatcher deletion pipeline - [x] everything uses `IndexOperation::DocumentOperation` - [x] repair deletion by internal id for filter by delete - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids. 
- [x] Remove soft-deleted documents #### FIXME - [x] field distribution is not correctly updated after deletion - [x] missing documents in the tests of tokenizer_customization ### Part 2: Only compute the documents field by field This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one. - [ ] Make the preprocessing function only send the fields that changed to the extractors - [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances) - [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances) - [ ] Adapt the prefix database extractors ⚠️ ⚠️ ## Technical Concerns - The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded) - The part 2 implementation needs to change the databases which could have a significant impact on the search performances - The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing Co-authored-by: ManyTheFish <many@meilisearch.com> Co-authored-by: Clément Renault <clement@meilisearch.com> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-11-21 09:44:38 +00:00
ManyTheFish	d3575fb028	Make into_del_add_obkv parameters more human readable	2023-11-20 16:10:39 +01:00
ManyTheFish	39cbb499c2	Small fixes	2023-11-20 10:20:39 +01:00
ManyTheFish	ebef6bc24d	Simplify documents database writing	2023-11-20 10:14:57 +01:00
ManyTheFish	d59b7db8d0	remove unused code	2023-11-20 10:10:45 +01:00
ManyTheFish	263e825619	Fix typos in comments	2023-11-20 10:06:29 +01:00
Clément Renault	69354a6144	Add the benchmark name to the bot message	2023-11-15 13:56:54 +01:00
Many the fish	b0adc73ce6	Merge pull request #4207 from meilisearch/diff-indexing-prefix-databases Diff indexing prefix databases	2023-11-14 16:04:05 +01:00
meili-bors[bot]	2b5d9042d1	Merge #4208 4208: Makes the dump cancellable r=Kerollmops a=irevoire # Pull Request Make the dump tasks cancellable even when they have already started processing. ## Related issue Fixes https://github.com/meilisearch/meilisearch/issues/4157 Co-authored-by: Tamo <tamo@meilisearch.com>	2023-11-14 13:31:45 +00:00
Tamo	5b57fbab08	makes the dump cancellable	2023-11-14 11:23:13 +01:00
meili-bors[bot]	72d3fa4898	Merge #4203 4203: Extract external document docids from docs on deletion by filter r=Kerollmops a=dureuill This fixes some of the performance regression observed on `diff-indexing` when doing delete-by-filter with a filter matching many documents. To delete 19 768 771 documents (hackernews dataset, all documents matching `type = comment`), here are the observed times, listed as branch (commit sha1sum): time, speed-down factor (lower is better): - `main` (48865470d7aaf42fa5bbfd01cf73423afb77addf): 1212.885536s (~20min), x1.0 (baseline) - `diff-indexing` (523519fdbfd3a28ca15320641cb096f26230a7ca): 5385.550543s (90min), x4.44 - `diff-indexing-extract-primary-key` (f8289cd974d957d38645ca66c993ca518ec81955): 2582.323324s (43min), x2.13 So we're still suffering a speed-down of x2.13, but that's much better than x4.44. --- Changes: - Refactor the logic of PrimaryKey extraction to a struct - Add a trait to abstract the extraction of field id from a name between `DocumentBatch` and `FieldIdMap`. - Add `Index::external_id_of` to get the external ids of a bitmap of internal ids. - Use this new method to add new Transform and Batch methods to remove documents that are known to be from the DB. - Modify delete-by-filter to use the new method Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-11-13 13:02:10 +00:00
Louis Dureuil	772964125d	Factor removal of document from DB	2023-11-13 13:51:22 +01:00
Louis Dureuil	378deb0bef	Rename trait	2023-11-13 13:38:36 +01:00
ManyTheFish	1f36410541	Update tests	2023-11-13 13:36:39 +01:00
meili-bors[bot]	b11f85a635	Merge #4205 4205: Prevent search hang on the processing index r=Kerollmops a=dureuill Fixes #4206, an issue originally [reported on Discord](https://discord.com/channels/1006923006964154428/1148983671026618579/1148983671026618579) where having parallel search requests on more indexes than the index cache capacity would cause search requests on the currently updating index to hang until the index is done updating. ## Test setup - Create 20 empty indexes by sending settings to them - repeatedly send placeholder search requests to each of the indexes in a loop - Create another index and send a significant batch of documents to index. - Attempt to perform a search request on that last index. - Before this PR, the search request hangs while the index update task is processing - After this PR, the search request responds immediately even while the index update task is processing ## Changes - When getting the handle to an index for some potentially long running batches of tasks, save it in the index scheduler. - Drop the handle from the index-scheduler when the task is done so that we don't leak indexes. - When getting an index from outside the task queue processor, check if there is such a handle matching the requested index. If so, skip the cache entirely and clone the handle. Co-authored-by: Louis Dureuil <louis.dureuil@xinra.net> Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-11-13 10:36:01 +00:00
Louis Dureuil	a2d6dc8571	Fix typo, remove caching for the change of index	2023-11-13 10:44:36 +01:00
meili-bors[bot]	ee1701157f	Merge #4204 4204: Throw error when the vector search is sent with the wrong size r=Kerollmops a=dureuill # Pull Request ## Related issue Fixes #4201 Co-authored-by: Louis Dureuil <louis@meilisearch.com>	2023-11-13 09:43:20 +00:00
Louis Dureuil	8c649d8061	Throw error when the vector search is sent with the wrong size	2023-11-13 09:57:42 +01:00
Louis Dureuil	492fc086f0	cargo fmt	2023-11-12 21:53:11 +01:00
Louis Dureuil	a2d0c73b41	Save the currently updating index so that the search can access it at all times	2023-11-10 10:52:03 +01:00
Louis Dureuil	264b10ec20	Fixup documentation	2023-11-09 16:23:20 +01:00
Louis Dureuil	825257da76	Use more efficient method for deletion in benchmarks	2023-11-09 16:13:15 +01:00
Louis Dureuil	f8289cd974	Use it from delete-by-filter	2023-11-09 14:23:15 +01:00
Louis Dureuil	3053e01c05	Batch::remove_documents_from_db_no_batch	2023-11-09 14:23:02 +01:00


8768 commits