Commit Graph

1091 Commits

Author SHA1 Message Date
Louis Dureuil
5b776556fe
Add ParallelIteratorExt 2024-10-01 11:10:53 +02:00
ManyTheFish
bb7a503e5d Compute prefix databases
We now compute the prefix FST and a prefix delta in the Merger thread.
After all the databases are written, the main thread recomputes the prefix databases from the prefix delta, without needing any grenad temporary files.
2024-10-01 09:57:06 +02:00
Louis Dureuil
64589278ac
Appease *some* of the clippy warnings 2024-09-30 16:08:29 +02:00
ManyTheFish
8df6daf308 Remove fid_wordcount_docids.rs 2024-09-30 11:52:31 +02:00
ManyTheFish
5b552caf42 Fix position in insertions 2024-09-30 11:46:32 +02:00
ManyTheFish
2b51a63418 Remove dead code 2024-09-30 11:42:36 +02:00
Louis Dureuil
3d8024fb2b
write the weighted fields ids map 2024-09-30 11:35:03 +02:00
Louis Dureuil
4b0da0ff24
Fix inversion of field_id and position 2024-09-30 11:34:50 +02:00
ManyTheFish
960060ebdf Fix fst builder when there is no previous FST 2024-09-25 16:53:00 +02:00
Clément Renault
3d244451df
Reduce the lru key size from 8 to 12 bytes 2024-09-25 16:14:13 +02:00
Clément Renault
5f53935c8a
Fix a bug in the Lru 2024-09-25 16:09:34 +02:00
Clément Renault
29a7623c3f
Fix some logs 2024-09-25 15:57:50 +02:00
Clément Renault
e97041f7d0
Replace the Lru free list by a simple increment 2024-09-25 15:55:52 +02:00
Clément Renault
52d7f3ed1c
Reduce the lru key size from 20 to 8 bytes 2024-09-25 15:37:13 +02:00
Clément Renault
86d5e6d9ff
Use the new Lru 2024-09-25 14:54:56 +02:00
Clément Renault
759b9b1546
Introduce a new custom Lru 2024-09-25 14:49:12 +02:00
ManyTheFish
3f7a500f3b Build prefix fst 2024-09-25 14:36:06 +02:00
ManyTheFish
974272f2e9 Merge branch 'main' into indexer-edition-2024 2024-09-25 07:41:16 +02:00
Clément Renault
7ad037841f
Move the tracing info to eprintln 2024-09-24 18:21:58 +02:00
Clément Renault
e0c7067355
Expose an IndexedParallelIterator to the index function 2024-09-24 17:24:59 +02:00
ManyTheFish
6e87332410 Change the way the FST is built 2024-09-24 16:28:31 +02:00
Clément Renault
2d1caf27df
Use eprintln to log 2024-09-24 15:59:50 +02:00
Clément Renault
7f148c127c
Measure the SmallVec efficacy 2024-09-24 15:32:15 +02:00
Clément Renault
4ce5d3d66d
Do not check before pushing in bitmaps 2024-09-24 09:43:16 +02:00
Clément Renault
42b093687d
Introduce the new PushOptimizedBitmap 2024-09-23 16:38:21 +02:00
Clément Renault
f00664247d
Add more stats about the channel message sent 2024-09-23 15:13:52 +02:00
Clément Renault
013acb3d93
Measure merger writer channel contention 2024-09-23 11:07:59 +02:00
Tamo
1113c42de0 fix broken comments 2024-09-19 16:18:36 +02:00
Tamo
b6b73fe41c
Update milli/src/update/settings.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 15:41:14 +02:00
Tamo
163f8023a1 remove debug println 2024-09-19 12:13:25 +02:00
Tamo
633537ccd7 fix updating documents without updating the settings 2024-09-19 12:00:58 +02:00
Tamo
3f6301dbc9 fix the missing embedder name in the error message when trying to disable the binary quantization 2024-09-19 12:00:58 +02:00
Tamo
2b6952eda1 rename the ArroyReader to an ArroyWrapper since it can read and write 2024-09-19 12:00:58 +02:00
Tamo
79f29eed3c fix the tests and the arroy_readers method 2024-09-19 12:00:58 +02:00
Tamo
cc45e264ca implement the binary quantization in meilisearch 2024-09-19 12:00:56 +02:00
Clément Renault
f4ab1f168e
Prefer using Rc<str> over String when cloning a lot 2024-09-16 15:41:29 +02:00
ManyTheFish
1a0e962299 Replace hashmap by vectors in wpp 2024-09-16 15:01:20 +02:00
ManyTheFish
f13e076b8a Use hashmap instead of Btree in wpp extractor 2024-09-16 14:40:40 +02:00
ManyTheFish
7ba49b849e Extract and write facet databases 2024-09-16 09:35:16 +02:00
Clément Renault
f7652186e1
WIP geo fields 2024-09-12 18:01:02 +02:00
Clément Renault
b2f4e67c9a
Do not store useless updates 2024-09-12 15:38:31 +02:00
Clément Renault
ff5d3b59f5
Move the document id extraction to the primary key code 2024-09-12 12:01:42 +02:00
ManyTheFish
aa69308e45 Use a BufWriter to build word FSTs 2024-09-12 11:48:00 +02:00
ManyTheFish
eb9a20ff0b Fix fid_word_docids extraction 2024-09-12 11:08:18 +02:00
Clément Renault
3e9198ebaa
Support guessing primary key again 2024-09-11 17:25:40 +02:00
Clément Renault
2a0ad0982f
Fix the document counter 2024-09-11 15:59:36 +02:00
ManyTheFish
2b317c681b Build mergers in parallel 2024-09-11 11:49:26 +02:00
ManyTheFish
39b5990f64 Mutualize tokenization 2024-09-11 10:22:38 +02:00
Clément Renault
8287c2644f
Support CSV again 2024-09-10 21:10:28 +01:00
Clément Renault
c1c44a0b81
Impl serialize on TopLevelMap 2024-09-10 19:32:03 +01:00
Clément Renault
04596f3616
Move the TopLevelMap into a dedicated module 2024-09-10 18:01:17 +01:00
Clément Renault
24cb5839ad
Move the document changes sorting logic to a new trait 2024-09-10 17:37:52 +01:00
ManyTheFish
f69688e8f7 Fix several warnings in extractors and remove unreachable macros 2024-09-09 14:52:50 +02:00
Clément Renault
8fd0afaaaa
Make sure we iterate over the payload documents in order 2024-09-06 08:09:08 +02:00
Clément Renault
72c6a21a30
Use raw JSON to read the payloads 2024-09-05 20:08:23 +02:00
Clément Renault
8412be4a7d
Cleanup CowStr and TopLevelMap struct 2024-09-05 18:32:55 +02:00
Louis Dureuil
10f09c531f
add some commented code to read from json with raw values 2024-09-05 18:22:16 +02:00
ManyTheFish
8fd99b111b Add tracing timers logs 2024-09-05 18:00:22 +02:00
Clément Renault
f6b3d1f9a5
Increase some channel sizes 2024-09-05 15:12:07 +02:00
Clément Renault
73ce67862d
Use the word pair proximity and fid word count docids extractors
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-05 10:56:22 +02:00
Clément Renault
0fc02f7351
Move the facet extraction to dedicated modules 2024-09-05 10:32:27 +02:00
ManyTheFish
34f11e3380 Implement word count and word pair proximity extractors 2024-09-05 10:30:39 +02:00
Clément Renault
27308eaab1
Import the facet extractors 2024-09-04 17:58:15 +02:00
Clément Renault
b33ec9ba3f
Introduce the FieldIdFacetIsNullDocidsExtractor 2024-09-04 17:50:08 +02:00
Clément Renault
9c0a1cd9fd
Introduce the FieldIdFacetExistsDocidsExtractor 2024-09-04 17:48:49 +02:00
Clément Renault
0b061f1e70
Introduce the FieldIdFacetIsEmptyDocidsExtractor 2024-09-04 17:40:24 +02:00
Clément Renault
19d937ab21
Introduce the facet extractors 2024-09-04 17:03:54 +02:00
Clément Renault
1d59c19cd2
Send the WordsFst by using an Mmap 2024-09-04 14:30:09 +02:00
Clément Renault
98e48371c3
Factorize some stuff 2024-09-04 12:17:13 +02:00
Clément Renault
6d74fb0229
Introduce the WordFidWordDocids database 2024-09-04 11:40:55 +02:00
ManyTheFish
1eb75a1040 remove milli/src/update/new/extract/tokenize_document.rs 2024-09-04 11:40:26 +02:00
Clément Renault
3b82d8b5b9
Fix the cache to serialize entries correctly 2024-09-04 10:55:36 +02:00
ManyTheFish
781a186f75 remove milli/src/update/new/extract/extract_word_docids.rs 2024-09-04 10:28:31 +02:00
ManyTheFish
6a399556b5 Implement more searchable extractor 2024-09-04 10:20:18 +02:00
Clément Renault
27b4cab857
Extract and write the documents and words fst in the database 2024-09-04 09:59:19 +02:00
Clément Renault
52d32b4ee9
Move the channel sender in the closure to stop the merger thread 2024-09-03 16:08:33 +02:00
ManyTheFish
da61408e52 Remove unimplemented from document changes 2024-09-03 15:14:16 +02:00
ManyTheFish
fe69385bd7 Fix tokenizer test 2024-09-03 14:24:37 +02:00
Louis Dureuil
1ac008926b
Add maxBytes parameter 2024-09-03 12:07:15 +02:00
Clément Renault
c1557734dc
Use the GlobalFieldsIdsMap everywhere and write it to disk
Co-authored-by: Dureuill <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-03 12:01:01 +02:00
ManyTheFish
c50d3edc4a Integrate first searchable extractor 2024-09-03 11:02:39 +02:00
Clément Renault
5369bf4a62
Change some lifetimes 2024-09-02 19:51:22 +02:00
Clément Renault
bcb1aa3d22
Find a temporary solution to par into iter on a HashMap
Spoiler: Do not use a HashMap but drain it into a Vec
2024-09-02 19:39:48 +02:00
Clément Renault
9b7858fb90
Expose the new indexer 2024-09-02 15:21:59 +02:00
Clément Renault
ab01679a8f
Remove the useless option from the document changes 2024-09-02 15:21:00 +02:00
Clément Renault
521775f788
I push for Many 2024-09-02 15:10:21 +02:00
Clément Renault
72e7b7846e
Renaming the indexers 2024-09-02 14:42:27 +02:00
Clément Renault
6526ce1208
Fix the merging of documents 2024-09-02 14:41:20 +02:00
Louis Dureuil
21296190a3
Reindex embedders 2024-09-02 13:00:53 +02:00
Louis Dureuil
580ea2f450
Pass the fields <-> ids map with metadata to render 2024-09-02 11:30:10 +02:00
Clément Renault
e639ec79d1
Move the indexers into their own modules 2024-09-02 10:42:19 +02:00
Clément Renault
bb885a5810
Fix the merge for roaring bitmap 2024-09-01 23:20:19 +02:00
Clément Renault
b625d31c7d
Introduce the PartialDumpIndexer indexer that generates document ids in parallel 2024-08-30 15:07:21 +02:00
Clément Renault
6487a67f2b
Introduce the ConcurrentAvailableIds struct and rename the other to AvailableIds 2024-08-30 15:06:50 +02:00
Clément Renault
271ce91b3b
Add the rayon Threadpool to the index function parameter 2024-08-30 14:34:24 +02:00
Clément Renault
54f2eb4507
Remove duplication of grenad merger 2024-08-30 14:34:05 +02:00
Clément Renault
794ebcd582
Replace grenad with the new grenad various-improvement branch 2024-08-30 11:53:59 +02:00
Clément Renault
b7c77c7a39
Use the latest version of the obkv crate 2024-08-30 11:53:59 +02:00
Clément Renault
0c57cf7565
Replace obkv with the temporary new version of it 2024-08-30 11:53:58 +02:00
Clément Renault
27df9e6c73
Introduce the indexer::index function that runs the indexation 2024-08-30 11:53:58 +02:00