48 Commits

Author SHA1 Message Date
meili-bors[bot]
19e6f675b3
Merge #4900
4900: Indexer edition 2024 r=Kerollmops a=dureuill

This PR implements the indexer edition 2024, largely inspired by [the ideas from this blog post](https://blog.kerollmops.com/meilisearch-is-too-slow).

Fixes https://github.com/meilisearch/meilisearch/issues/4985

## Features
- Stream-first approach to reading documents.
- Minimum disk write operations.
- RAM-first approach: common bitmaps are modified in memory rather than on disk.
- Reduced LMDB fragmentation by writing entries only once...
- ...computing the final version of the entries in parallel...
- ...and storing them in write-optimized data structures before sending them to the BTree (LMDB).
- Indexing in multiple transactions to improve large dataset support (dumps).
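
As a rough illustration of the "compute in parallel, merge in memory, write each entry once" idea listed above, here is a minimal sketch using only the Rust standard library. It is not the code from this PR: `Store`, `extract`, and `index` are hypothetical names, a `BTreeMap` stands in for LMDB, and a `BTreeSet<u32>` stands in for a bitmap of document ids.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::thread;

/// Hypothetical stand-in for the on-disk B-tree; it counts how many writes it receives.
#[derive(Default)]
struct Store {
    entries: BTreeMap<String, BTreeSet<u32>>,
    writes: usize,
}

impl Store {
    fn put(&mut self, key: String, docids: BTreeSet<u32>) {
        self.entries.insert(key, docids);
        self.writes += 1;
    }
}

/// Builds the word -> document ids map for one chunk of documents, entirely in memory.
fn extract(chunk: &[(u32, &str)]) -> BTreeMap<String, BTreeSet<u32>> {
    let mut local = BTreeMap::new();
    for (docid, text) in chunk {
        for word in text.split_whitespace() {
            local
                .entry(word.to_lowercase())
                .or_insert_with(BTreeSet::new)
                .insert(*docid);
        }
    }
    local
}

/// Indexes `docs` in chunks of `chunk_size` (must be non-zero).
fn index(docs: &[(u32, &str)], chunk_size: usize) -> Store {
    // 1. Extract every chunk on its own scoped thread.
    let partials: Vec<BTreeMap<String, BTreeSet<u32>>> = thread::scope(|s| {
        let handles: Vec<_> = docs
            .chunks(chunk_size)
            .map(|chunk| s.spawn(move || extract(chunk)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });

    // 2. Merge the partial results in memory to get the final version of each entry.
    let mut merged: BTreeMap<String, BTreeSet<u32>> = BTreeMap::new();
    for partial in partials {
        for (word, ids) in partial {
            merged.entry(word).or_default().extend(ids);
        }
    }

    // 3. Write each final entry exactly once, in sorted key order.
    let mut store = Store::default();
    for (word, ids) in merged {
        store.put(word, ids);
    }
    store
}

fn main() {
    let docs: [(u32, &str); 3] = [
        (1, "fast search engine"),
        (2, "fast indexer"),
        (3, "search indexer"),
    ];
    let store = index(&docs, 2);
    println!("{} keys, {} writes", store.entries.len(), store.writes);
}
```

In this sketch the number of writes equals the number of distinct keys, however many documents mention each word. Under these assumptions, that is the property the "writing entries only once" bullet describes.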


Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-11-21 16:19:10 +00:00
Louis Dureuil
221e547e86
Slight changes 2024-11-21 16:47:44 +01:00
ManyTheFish
36962b943b
First batch of PR comment 2024-11-21 16:38:11 +01:00
Many the fish
ff38f29981
Update crates/index-scheduler/src/batch.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-11-21 14:18:39 +01:00
Louis Dureuil
1f9692cd04
Increase map size for tests 2024-11-20 17:52:21 +01:00
Tamo
1e694ae432
improve the count of the number of tasks in a batch 2024-11-20 17:48:26 +01:00
Tamo
71807cac6d
makes clippy happy 2024-11-20 17:40:58 +01:00
Tamo
21a2264782
improve the details and stats of the current batch processing 2024-11-20 17:25:55 +01:00
Louis Dureuil
6e6acfcf1b
Merge branch 'main' into indexer-edition-2024 2024-11-20 16:59:58 +01:00
Louis Dureuil
ff9c92c409
rename documents -> substep 2024-11-20 15:12:02 +01:00
Clément Renault
8380ddbdcd
Fix progress of into_changes 2024-11-20 15:10:09 +01:00
Louis Dureuil
84600a10d1
Add MSP to document_update.into_changes() 2024-11-20 14:53:37 +01:00
Tamo
ec06879d28
apply review changes 2024-11-20 14:40:36 +01:00
Tamo
83d1f858c1
Update crates/index-scheduler/src/lib.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-11-20 14:36:05 +01:00
Tamo
a7ac590e9e
implements the reverse query parameter for the batches 2024-11-20 13:29:52 +01:00
Tamo
8ad68dd708
stop leaking the update files of the canceled tasks 2024-11-20 13:17:54 +01:00
Tamo
7e379b3d14
remove useless prints 2024-11-20 12:27:12 +01:00
Tamo
bdb51a85fe
now that the task cancelation shares their started at with all the tasks of their batch we don't need the trick of retrieving the previous batch anymore 2024-11-20 10:51:07 +01:00
Tamo
e145d71a62
implements the two last TODOs 2024-11-20 10:51:06 +01:00
Tamo
d9a4e69990
push a missing snapshot 2024-11-20 10:51:06 +01:00
Tamo
b906e3ed70
improve the way we access the mutex 2024-11-20 10:51:06 +01:00
Tamo
4abcd9c04e
add some stats on the batches 2024-11-20 10:51:06 +01:00
Tamo
229fa0f902
implements the batch details 2024-11-20 10:51:06 +01:00
Tamo
62646af7b9
implements the automatic batch deletion 2024-11-20 10:51:06 +01:00
Tamo
1fcb9526f5
fix the task cancelation 2024-11-20 10:51:06 +01:00
Tamo
15eefa4fcc
fixes a lot of small issue, the test about the cancellation is still failing 2024-11-20 10:51:05 +01:00
Tamo
ad9763ffcd
copy multiple task query tests to batches. Currently, they fails 2024-11-20 10:49:25 +01:00
Tamo
d489f5635f
add the mapping between the task and batches 2024-11-20 10:49:23 +01:00
Tamo
a1251c3c83
Implements the get all batches route with filters working 2024-11-20 10:42:55 +01:00
Tamo
6062914654
add the batch_id to the tasks 2024-11-20 10:42:54 +01:00
Louis Dureuil
bfefaf71c2
Progress displayed in logs 2024-11-19 09:32:52 +01:00
Louis Dureuil
5f93651cef
fixes 2024-11-18 16:23:11 +01:00
Louis Dureuil
1f8b01a598
Fix snap since _vectors is no longer part of the field distributions 2024-11-18 12:50:59 +01:00
Louis Dureuil
e9d17136b2
Add deadline of 3 seconds to embedding requests made in the context of hybrid search 2024-11-18 12:15:11 +01:00
Clément Renault
5b4c06c24c
Plug the grenad max memory parameter 2024-11-18 11:28:04 +01:00
Louis Dureuil
c202f3dbe2
fix tests and revert change in behavior when primary_key_from_op != primary_key_from_db && index.is_empty() 2024-11-18 10:59:05 +01:00
Clément Renault
677d7293f5
Fix a lot of primary key related tests 2024-11-18 10:59:05 +01:00
Clément Renault
bd31ea2174
Check for at least one valid task after setting their statuses 2024-11-18 10:59:05 +01:00
Clément Renault
83865d2ebd
Expose intermediate errors when processing batches 2024-11-18 10:59:05 +01:00
Clément Renault
9e8367f1e6
Move the rayon thread pool outside the extract method 2024-11-14 10:40:32 +01:00
Louis Dureuil
1fcd5f091e
Remove progress from task 2024-11-12 12:23:13 +01:00
Louis Dureuil
e32677999f
Adapt some snapshots 2024-11-08 00:06:33 +01:00
Louis Dureuil
8a314ab81d
Fix primary key fid order 2024-11-08 00:05:12 +01:00
Tamo
2eb1801e85
reverse the order of the task queue 2024-11-07 19:19:44 +01:00
Louis Dureuil
ee03743355
Merge branch 'indexer-edition-2024' into indexer-edition-2024-doc-chunks 2024-11-06 15:50:53 +01:00
ManyTheFish
10feeb88f2
Merge branch 'main' into indexer-edition-2024 2024-11-06 15:19:18 +01:00
Tamo
cf6ad1ae5e
Merge branch 'main' into tmp-release-v1.11.0 2024-11-04 16:14:44 +01:00
Clément Renault
9c1e54a2c8
Move crates under a sub folder to clean up the code 2024-10-21 08:18:43 +02:00