Commit Graph

1050 Commits

Author SHA1 Message Date
Clément Renault
ff5d3b59f5
Move the document id extraction to the primary key code 2024-09-12 12:01:42 +02:00
ManyTheFish
aa69308e45 Use a bufWriter to build word FSTs 2024-09-12 11:48:00 +02:00
ManyTheFish
eb9a20ff0b Fix fid_word_docids extraction 2024-09-12 11:08:18 +02:00
Clément Renault
3e9198ebaa
Support guessing primary key again 2024-09-11 17:25:40 +02:00
Clément Renault
2a0ad0982f
Fix the document counter 2024-09-11 15:59:36 +02:00
ManyTheFish
2b317c681b Build mergers in parallel 2024-09-11 11:49:26 +02:00
ManyTheFish
39b5990f64 Mutualize tokenization 2024-09-11 10:22:38 +02:00
Clément Renault
8287c2644f
Support CSV again 2024-09-10 21:10:28 +01:00
Clément Renault
c1c44a0b81
Impl serialize on TopLevelMap 2024-09-10 19:32:03 +01:00
Clément Renault
04596f3616
Move the TopLevelMap into a dedicated module 2024-09-10 18:01:17 +01:00
Clément Renault
24cb5839ad
Move the document changes sorting logic to a new trait 2024-09-10 17:37:52 +01:00
ManyTheFish
f69688e8f7 Fix several warnings in extractors and remove unreachable macros 2024-09-09 14:52:50 +02:00
Clément Renault
8fd0afaaaa
Make sure we iterate over the payload documents in order 2024-09-06 08:09:08 +02:00
Clément Renault
72c6a21a30
Use raw JSON to read the payloads 2024-09-05 20:08:23 +02:00
Clément Renault
8412be4a7d
Cleanup CowStr and TopLevelMap struct 2024-09-05 18:32:55 +02:00
Louis Dureuil
10f09c531f
add some commented code to read from json with raw values 2024-09-05 18:22:16 +02:00
ManyTheFish
8fd99b111b Add tracing timers logs 2024-09-05 18:00:22 +02:00
Clément Renault
f6b3d1f9a5
Increase some channel sizes 2024-09-05 15:12:07 +02:00
Clément Renault
73ce67862d
Use the word pair proximity and fid word count docids extractors
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-05 10:56:22 +02:00
Clément Renault
0fc02f7351
Move the facet extraction to dedicated modules 2024-09-05 10:32:27 +02:00
ManyTheFish
34f11e3380 Implement word count and word pair proximity extractors 2024-09-05 10:30:39 +02:00
Clément Renault
27308eaab1
Import the facet extractors 2024-09-04 17:58:15 +02:00
Clément Renault
b33ec9ba3f
Introduce the FieldIdFacetIsNullDocidsExtractor 2024-09-04 17:50:08 +02:00
Clément Renault
9c0a1cd9fd
Introduce the FieldIdFacetExistsDocidsExtractor 2024-09-04 17:48:49 +02:00
Clément Renault
0b061f1e70
Introduce the FieldIdFacetIsEmptyDocidsExtractor 2024-09-04 17:40:24 +02:00
Clément Renault
19d937ab21
Introduce the facet extractors 2024-09-04 17:03:54 +02:00
Clément Renault
1d59c19cd2
Send the WordsFst by using an Mmap 2024-09-04 14:30:09 +02:00
Clément Renault
98e48371c3
Factorize some stuff 2024-09-04 12:17:13 +02:00
Clément Renault
6d74fb0229
Introduce the WordFidWordDocids database 2024-09-04 11:40:55 +02:00
ManyTheFish
1eb75a1040 remove milli/src/update/new/extract/tokenize_document.rs 2024-09-04 11:40:26 +02:00
Clément Renault
3b82d8b5b9
Fix the cache to serialize entries correctly 2024-09-04 10:55:36 +02:00
ManyTheFish
781a186f75 remove milli/src/update/new/extract/extract_word_docids.rs 2024-09-04 10:28:31 +02:00
ManyTheFish
6a399556b5 Implement more searchable extractor 2024-09-04 10:20:18 +02:00
Clément Renault
27b4cab857
Extract and write the documents and words fst in the database 2024-09-04 09:59:19 +02:00
Clément Renault
52d32b4ee9
Move the channel sender in the closure to stop the merger thread 2024-09-03 16:08:33 +02:00
ManyTheFish
da61408e52 Remove unimplemented from document changes 2024-09-03 15:14:16 +02:00
ManyTheFish
fe69385bd7 Fix tokenizer test 2024-09-03 14:24:37 +02:00
Louis Dureuil
1ac008926b
Add maxBytes parameter 2024-09-03 12:07:15 +02:00
Clément Renault
c1557734dc
Use the GlobalFieldsIdsMap everywhere and write it to disk
Co-authored-by: Dureuill <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-03 12:01:01 +02:00
ManyTheFish
c50d3edc4a Integrate first searchable exctrator 2024-09-03 11:02:39 +02:00
Clément Renault
5369bf4a62
Change some lifetimes 2024-09-02 19:51:22 +02:00
Clément Renault
bcb1aa3d22
Find a temporary solution to par into iter on an HashMap
Spoiler: Do not use an HashMap but drain it into a Vec
2024-09-02 19:39:48 +02:00
Clément Renault
9b7858fb90
Expose the new indexer 2024-09-02 15:21:59 +02:00
Clément Renault
ab01679a8f
Remove the useless option from the document changes 2024-09-02 15:21:00 +02:00
Clément Renault
521775f788
I push for Many 2024-09-02 15:10:21 +02:00
Clément Renault
72e7b7846e
Renaming the indexers 2024-09-02 14:42:27 +02:00
Clément Renault
6526ce1208
Fix the merging of documents 2024-09-02 14:41:20 +02:00
Louis Dureuil
21296190a3
Reindex embedders 2024-09-02 13:00:53 +02:00
Louis Dureuil
580ea2f450
Pass the fields <-> ids map with metadata to render 2024-09-02 11:30:10 +02:00
Clément Renault
e639ec79d1
Move the indexers into their own modules 2024-09-02 10:42:19 +02:00
Clément Renault
bb885a5810
Fix the merge for roaring bitmap 2024-09-01 23:20:19 +02:00
Clément Renault
b625d31c7d
Introduce the PartialDumpIndexer indexer that generates document ids in parallel 2024-08-30 15:07:21 +02:00
Clément Renault
6487a67f2b
Introduce the ConcurrentAvailableIds struct and rename the other to AvailableIds 2024-08-30 15:06:50 +02:00
Clément Renault
271ce91b3b
Add the rayon Threadpool to the index function parameter 2024-08-30 14:34:24 +02:00
Clément Renault
54f2eb4507
Remove duplication of grenad merger 2024-08-30 14:34:05 +02:00
Clément Renault
794ebcd582
Replace grenad with the new grenad various-improvement branch 2024-08-30 11:53:59 +02:00
Clément Renault
b7c77c7a39
Use the latest version of the obkv crate 2024-08-30 11:53:59 +02:00
Clément Renault
0c57cf7565
Replace obkv with the temporary new version of it 2024-08-30 11:53:58 +02:00
Clément Renault
27df9e6c73
Introduce the indexer::index function that runs the indexation 2024-08-30 11:53:58 +02:00
Clément Renault
45c060831e
Introduce typed channels and the merger loop 2024-08-30 11:53:58 +02:00
Clément Renault
874c1ac538
First channels types 2024-08-30 11:53:58 +02:00
Clément Renault
e6ffa4d454
Implement the document merge function for the replace method 2024-08-30 11:53:58 +02:00
Clément Renault
637a9c8bdd
Implement the document merge function for the update method 2024-08-30 11:53:58 +02:00
Louis Dureuil
c683fa98e6
WIP
Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-30 11:53:57 +02:00
meili-bors[bot]
ee62d9ce30
Merge #4845
4845: Fix perf regression facet strings r=ManyTheFish a=dureuill

Benchmarks between v1.9 and v1.10 show a performance regression of about x2 (+3dB regression) for most indexing workloads (+44s for hackernews).

[Benchmark interpretation in the engine weekly meeting](https://www.notion.so/meilisearch/Engine-weekly-4d49560d374c4a87b4e3d126a261d4a0?pvs=4#98a709683276450295fcfe1f8ea5cef3).

- Initial investigation pointed to #4819 as the origin of the regression.
- Further investigation points towards the hypernormalization of each facet value in `extract_facet_string_docids`
- Most of the slowdown is in `normalize_facet_strings`, and precisely in `detection.language()`.

This PR improves the situation (-10s compared with `main` for hackernews, so only +34s regression compared with `v1.9`) by skipping normalization when it can be skipped.

I'm not sure how to fix the root cause though. Should we skip facet locale normalization for now? Cc `@ManyTheFish` 

---

Tentative resolution options:

1. remove locale normalization from facet. I'm not sure why this is required, I believe we weren't doing this before, so maybe we can stop doing that again.
2. don't do language detection when it can be helped: won't help with the regressions in benchmark, but maybe we can skip language detection when the locales contain only one language?
3. use a faster language detection library: `@Kerollmops` told me about https://github.com/quickwit-oss/whichlang which bolsters x10 to x100 throughput compared with whatlang. Should we consider replacing whatlang with whichlang? Now I understand whichlang supports fewer languages than whatlang, so I also suggest:
4. use whichlang when the list of locales is empty (autodetection), or when it only contains locales that whichlang can detect. If the list of locales contains locales that whichlang *cannot* detect, **then** use whatlang instead.

---

> [!CAUTION]
> this PR contains a commit that adds detailed spans, that were used to detect which part of `extract_facet_string_docids` was taking too much time. As this commit adds spans that are called too often and adds 7s overhead, it should be removed before landing.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-19 06:29:48 +00:00
ManyTheFish
0f965d3574 Remove hotloop's spans 2024-08-14 14:33:36 +02:00
ManyTheFish
ade54493ab Only detect language for a facet if several locales have been specified by the user in the settings 2024-08-14 12:03:52 +02:00
Louis Dureuil
c3cdc407ec
Avoid unnecessary clone() 2024-08-08 14:57:02 +02:00
Louis Dureuil
2f10273d14
Group by normalized values, make sure you don't remove a value where there remains at still one value that normalizes towards it 2024-08-08 14:02:53 +02:00
Louis Dureuil
e64d0e0ca8
use insert instead of push for bitmaps 2024-08-01 18:32:45 +02:00
Louis Dureuil
0e68718027
Add detailed spans 2024-07-31 13:05:47 +02:00
Louis Dureuil
7c3fc8c655
Split settings and document facet string extractions 2024-07-31 10:57:46 +02:00
Louis Dureuil
8acd3f50bb
skip normalization when the locales and values are the same 2024-07-31 09:53:00 +02:00
Louis Dureuil
d4ea7cc2a9
fix clippy 👉👈 2024-07-25 12:10:32 +02:00
Louis Dureuil
2413592bbf
Display docid when there are documents without manual embeddings for a manual embedder 2024-07-25 12:10:32 +02:00
Louis Dureuil
553440632e
Introduce Setting::some_or_not_set 2024-07-25 12:01:52 +02:00
Louis Dureuil
7a347966da
Allow explicit dimensions for ollama 2024-07-25 12:01:51 +02:00
Louis Dureuil
4654d51e05
Add custom headers for REST embedder 2024-07-25 12:01:51 +02:00
ManyTheFish
a918561ac1
Fix PR comments 2024-07-25 10:52:56 +02:00
ManyTheFish
04fa44e7eb
Implement localized attributes settings 2024-07-25 10:51:27 +02:00
ManyTheFish
cc02920f2b
Update charabia 2024-07-25 10:51:27 +02:00
Tamo
988552e178
add tests on the rest embedder 2024-07-24 14:34:17 +02:00
Louis Dureuil
0d8199f3b7
Change parameters in milli settings 2024-07-24 14:34:17 +02:00
Louis Dureuil
24240934f9
Improve errors when indexing documents with a user provided embedder 2024-07-16 13:39:01 +02:00
Louis Dureuil
65d0c32aa7
Allow overriding OpenAI's url 2024-07-16 13:39:00 +02:00
Clément Renault
6e80364c50
Apply review comments 2024-07-11 11:00:27 +02:00
Clément Renault
837274f853
Restrict even more the Rhai engine 2024-07-10 16:30:18 +02:00
Clément Renault
aace587dd1
Create errors for the internal processing ones 2024-07-10 16:29:18 +02:00
Clément Renault
81ec0abad1
Use the new rayon-par-bridge library 2024-07-10 16:29:04 +02:00
Clément Renault
b67d385cf0
Parallelize the edition functions 2024-07-10 16:28:54 +02:00
Clément Renault
2eae2015d7
Support aborting documents edition by function 2024-07-10 16:28:15 +02:00
Clément Renault
33fa17bf12
Support deleting documents with functions 2024-07-10 16:28:15 +02:00
Clément Renault
400e6b93ce
Support user-provided context for documents edition 2024-07-10 16:28:15 +02:00
Clément Renault
f4add93043
Limit the number of script operations 2024-07-10 16:28:14 +02:00
Clément Renault
2fae96ac14
Show the actual number of actually edited documents 2024-07-10 16:28:14 +02:00
Clément Renault
45af18ae9c
Check the Rhai syntax before accepting the script 2024-07-10 16:28:13 +02:00
Clément Renault
2d97164d9f
It works perfectly with some Rhai 2024-07-10 16:28:13 +02:00
Clément Renault
efc156a4a4
Executing Lua works correctly 2024-07-10 16:27:36 +02:00
meili-bors[bot]
2099b4f0dd
Merge #4786
4786: Update dependencies r=Kerollmops a=irevoire

# Pull Request

## Related issue
Fixes #4753

## What does this PR do?
- Update all dependencies except rustls
- [x] Release charabia
- [x] Update charabia
- [x] Double check that the docker build works after updating charabia



Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-07-10 13:23:54 +00:00
Tamo
4d5005b01a make clippy happy 2024-07-10 10:06:59 +02:00