Commit Graph

1046 Commits

Author SHA1 Message Date
Clément Renault
2a0ad0982f
Fix the document counter 2024-09-11 15:59:36 +02:00
ManyTheFish
2b317c681b Build mergers in parallel 2024-09-11 11:49:26 +02:00
ManyTheFish
39b5990f64 Mutualize tokenization 2024-09-11 10:22:38 +02:00
Clément Renault
8287c2644f
Support CSV again 2024-09-10 21:10:28 +01:00
Clément Renault
c1c44a0b81
Impl serialize on TopLevelMap 2024-09-10 19:32:03 +01:00
Clément Renault
04596f3616
Move the TopLevelMap into a dedicated module 2024-09-10 18:01:17 +01:00
Clément Renault
24cb5839ad
Move the document changes sorting logic to a new trait 2024-09-10 17:37:52 +01:00
ManyTheFish
f69688e8f7 Fix several warnings in extractors and remove unreachable macros 2024-09-09 14:52:50 +02:00
Clément Renault
8fd0afaaaa
Make sure we iterate over the payload documents in order 2024-09-06 08:09:08 +02:00
Clément Renault
72c6a21a30
Use raw JSON to read the payloads 2024-09-05 20:08:23 +02:00
Clément Renault
8412be4a7d
Cleanup CowStr and TopLevelMap struct 2024-09-05 18:32:55 +02:00
Louis Dureuil
10f09c531f
add some commented code to read from json with raw values 2024-09-05 18:22:16 +02:00
ManyTheFish
8fd99b111b Add tracing timers logs 2024-09-05 18:00:22 +02:00
Clément Renault
f6b3d1f9a5
Increase some channel sizes 2024-09-05 15:12:07 +02:00
Clément Renault
73ce67862d
Use the word pair proximity and fid word count docids extractors
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-05 10:56:22 +02:00
Clément Renault
0fc02f7351
Move the facet extraction to dedicated modules 2024-09-05 10:32:27 +02:00
ManyTheFish
34f11e3380 Implement word count and word pair proximity extractors 2024-09-05 10:30:39 +02:00
Clément Renault
27308eaab1
Import the facet extractors 2024-09-04 17:58:15 +02:00
Clément Renault
b33ec9ba3f
Introduce the FieldIdFacetIsNullDocidsExtractor 2024-09-04 17:50:08 +02:00
Clément Renault
9c0a1cd9fd
Introduce the FieldIdFacetExistsDocidsExtractor 2024-09-04 17:48:49 +02:00
Clément Renault
0b061f1e70
Introduce the FieldIdFacetIsEmptyDocidsExtractor 2024-09-04 17:40:24 +02:00
Clément Renault
19d937ab21
Introduce the facet extractors 2024-09-04 17:03:54 +02:00
Clément Renault
1d59c19cd2
Send the WordsFst by using an Mmap 2024-09-04 14:30:09 +02:00
Clément Renault
98e48371c3
Factorize some stuff 2024-09-04 12:17:13 +02:00
Clément Renault
6d74fb0229
Introduce the WordFidWordDocids database 2024-09-04 11:40:55 +02:00
ManyTheFish
1eb75a1040 remove milli/src/update/new/extract/tokenize_document.rs 2024-09-04 11:40:26 +02:00
Clément Renault
3b82d8b5b9
Fix the cache to serialize entries correctly 2024-09-04 10:55:36 +02:00
ManyTheFish
781a186f75 remove milli/src/update/new/extract/extract_word_docids.rs 2024-09-04 10:28:31 +02:00
ManyTheFish
6a399556b5 Implement more searchable extractor 2024-09-04 10:20:18 +02:00
Clément Renault
27b4cab857
Extract and write the documents and words fst in the database 2024-09-04 09:59:19 +02:00
Clément Renault
52d32b4ee9
Move the channel sender in the closure to stop the merger thread 2024-09-03 16:08:33 +02:00
ManyTheFish
da61408e52 Remove unimplemented from document changes 2024-09-03 15:14:16 +02:00
ManyTheFish
fe69385bd7 Fix tokenizer test 2024-09-03 14:24:37 +02:00
Louis Dureuil
1ac008926b
Add maxBytes parameter 2024-09-03 12:07:15 +02:00
Clément Renault
c1557734dc
Use the GlobalFieldsIdsMap everywhere and write it to disk
Co-authored-by: Dureuill <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-03 12:01:01 +02:00
ManyTheFish
c50d3edc4a Integrate first searchable exctrator 2024-09-03 11:02:39 +02:00
Clément Renault
5369bf4a62
Change some lifetimes 2024-09-02 19:51:22 +02:00
Clément Renault
bcb1aa3d22
Find a temporary solution to par into iter on an HashMap
Spoiler: Do not use an HashMap but drain it into a Vec
2024-09-02 19:39:48 +02:00
Clément Renault
9b7858fb90
Expose the new indexer 2024-09-02 15:21:59 +02:00
Clément Renault
ab01679a8f
Remove the useless option from the document changes 2024-09-02 15:21:00 +02:00
Clément Renault
521775f788
I push for Many 2024-09-02 15:10:21 +02:00
Clément Renault
72e7b7846e
Renaming the indexers 2024-09-02 14:42:27 +02:00
Clément Renault
6526ce1208
Fix the merging of documents 2024-09-02 14:41:20 +02:00
Louis Dureuil
21296190a3
Reindex embedders 2024-09-02 13:00:53 +02:00
Louis Dureuil
580ea2f450
Pass the fields <-> ids map with metadata to render 2024-09-02 11:30:10 +02:00
Clément Renault
e639ec79d1
Move the indexers into their own modules 2024-09-02 10:42:19 +02:00
Clément Renault
bb885a5810
Fix the merge for roaring bitmap 2024-09-01 23:20:19 +02:00
Clément Renault
b625d31c7d
Introduce the PartialDumpIndexer indexer that generates document ids in parallel 2024-08-30 15:07:21 +02:00
Clément Renault
6487a67f2b
Introduce the ConcurrentAvailableIds struct and rename the other to AvailableIds 2024-08-30 15:06:50 +02:00
Clément Renault
271ce91b3b
Add the rayon Threadpool to the index function parameter 2024-08-30 14:34:24 +02:00
Clément Renault
54f2eb4507
Remove duplication of grenad merger 2024-08-30 14:34:05 +02:00
Clément Renault
794ebcd582
Replace grenad with the new grenad various-improvement branch 2024-08-30 11:53:59 +02:00
Clément Renault
b7c77c7a39
Use the latest version of the obkv crate 2024-08-30 11:53:59 +02:00
Clément Renault
0c57cf7565
Replace obkv with the temporary new version of it 2024-08-30 11:53:58 +02:00
Clément Renault
27df9e6c73
Introduce the indexer::index function that runs the indexation 2024-08-30 11:53:58 +02:00
Clément Renault
45c060831e
Introduce typed channels and the merger loop 2024-08-30 11:53:58 +02:00
Clément Renault
874c1ac538
First channels types 2024-08-30 11:53:58 +02:00
Clément Renault
e6ffa4d454
Implement the document merge function for the replace method 2024-08-30 11:53:58 +02:00
Clément Renault
637a9c8bdd
Implement the document merge function for the update method 2024-08-30 11:53:58 +02:00
Louis Dureuil
c683fa98e6
WIP
Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-30 11:53:57 +02:00
meili-bors[bot]
ee62d9ce30
Merge #4845
4845: Fix perf regression facet strings r=ManyTheFish a=dureuill

Benchmarks between v1.9 and v1.10 show a performance regression of about x2 (+3dB regression) for most indexing workloads (+44s for hackernews).

[Benchmark interpretation in the engine weekly meeting](https://www.notion.so/meilisearch/Engine-weekly-4d49560d374c4a87b4e3d126a261d4a0?pvs=4#98a709683276450295fcfe1f8ea5cef3).

- Initial investigation pointed to #4819 as the origin of the regression.
- Further investigation points towards the hypernormalization of each facet value in `extract_facet_string_docids`
- Most of the slowdown is in `normalize_facet_strings`, and precisely in `detection.language()`.

This PR improves the situation (-10s compared with `main` for hackernews, so only +34s regression compared with `v1.9`) by skipping normalization when it can be skipped.

I'm not sure how to fix the root cause though. Should we skip facet locale normalization for now? Cc `@ManyTheFish` 

---

Tentative resolution options:

1. remove locale normalization from facet. I'm not sure why this is required, I believe we weren't doing this before, so maybe we can stop doing that again.
2. don't do language detection when it can be helped: won't help with the regressions in benchmark, but maybe we can skip language detection when the locales contain only one language?
3. use a faster language detection library: `@Kerollmops` told me about https://github.com/quickwit-oss/whichlang which bolsters x10 to x100 throughput compared with whatlang. Should we consider replacing whatlang with whichlang? Now I understand whichlang supports fewer languages than whatlang, so I also suggest:
4. use whichlang when the list of locales is empty (autodetection), or when it only contains locales that whichlang can detect. If the list of locales contains locales that whichlang *cannot* detect, **then** use whatlang instead.

---

> [!CAUTION]
> this PR contains a commit that adds detailed spans, that were used to detect which part of `extract_facet_string_docids` was taking too much time. As this commit adds spans that are called too often and adds 7s overhead, it should be removed before landing.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-19 06:29:48 +00:00
ManyTheFish
0f965d3574 Remove hotloop's spans 2024-08-14 14:33:36 +02:00
ManyTheFish
ade54493ab Only detect language for a facet if several locales have been specified by the user in the settings 2024-08-14 12:03:52 +02:00
Louis Dureuil
c3cdc407ec
Avoid unnecessary clone() 2024-08-08 14:57:02 +02:00
Louis Dureuil
2f10273d14
Group by normalized values, make sure you don't remove a value where there remains at still one value that normalizes towards it 2024-08-08 14:02:53 +02:00
Louis Dureuil
e64d0e0ca8
use insert instead of push for bitmaps 2024-08-01 18:32:45 +02:00
Louis Dureuil
0e68718027
Add detailed spans 2024-07-31 13:05:47 +02:00
Louis Dureuil
7c3fc8c655
Split settings and document facet string extractions 2024-07-31 10:57:46 +02:00
Louis Dureuil
8acd3f50bb
skip normalization when the locales and values are the same 2024-07-31 09:53:00 +02:00
Louis Dureuil
d4ea7cc2a9
fix clippy 👉👈 2024-07-25 12:10:32 +02:00
Louis Dureuil
2413592bbf
Display docid when there are documents without manual embeddings for a manual embedder 2024-07-25 12:10:32 +02:00
Louis Dureuil
553440632e
Introduce Setting::some_or_not_set 2024-07-25 12:01:52 +02:00
Louis Dureuil
7a347966da
Allow explicit dimensions for ollama 2024-07-25 12:01:51 +02:00
Louis Dureuil
4654d51e05
Add custom headers for REST embedder 2024-07-25 12:01:51 +02:00
ManyTheFish
a918561ac1
Fix PR comments 2024-07-25 10:52:56 +02:00
ManyTheFish
04fa44e7eb
Implement localized attributes settings 2024-07-25 10:51:27 +02:00
ManyTheFish
cc02920f2b
Update charabia 2024-07-25 10:51:27 +02:00
Tamo
988552e178
add tests on the rest embedder 2024-07-24 14:34:17 +02:00
Louis Dureuil
0d8199f3b7
Change parameters in milli settings 2024-07-24 14:34:17 +02:00
Louis Dureuil
24240934f9
Improve errors when indexing documents with a user provided embedder 2024-07-16 13:39:01 +02:00
Louis Dureuil
65d0c32aa7
Allow overriding OpenAI's url 2024-07-16 13:39:00 +02:00
Clément Renault
6e80364c50
Apply review comments 2024-07-11 11:00:27 +02:00
Clément Renault
837274f853
Restrict even more the Rhai engine 2024-07-10 16:30:18 +02:00
Clément Renault
aace587dd1
Create errors for the internal processing ones 2024-07-10 16:29:18 +02:00
Clément Renault
81ec0abad1
Use the new rayon-par-bridge library 2024-07-10 16:29:04 +02:00
Clément Renault
b67d385cf0
Parallelize the edition functions 2024-07-10 16:28:54 +02:00
Clément Renault
2eae2015d7
Support aborting documents edition by function 2024-07-10 16:28:15 +02:00
Clément Renault
33fa17bf12
Support deleting documents with functions 2024-07-10 16:28:15 +02:00
Clément Renault
400e6b93ce
Support user-provided context for documents edition 2024-07-10 16:28:15 +02:00
Clément Renault
f4add93043
Limit the number of script operations 2024-07-10 16:28:14 +02:00
Clément Renault
2fae96ac14
Show the actual number of actually edited documents 2024-07-10 16:28:14 +02:00
Clément Renault
45af18ae9c
Check the Rhai syntax before accepting the script 2024-07-10 16:28:13 +02:00
Clément Renault
2d97164d9f
It works perfectly with some Rhai 2024-07-10 16:28:13 +02:00
Clément Renault
efc156a4a4
Executing Lua works correctly 2024-07-10 16:27:36 +02:00
meili-bors[bot]
2099b4f0dd
Merge #4786
4786: Update dependencies r=Kerollmops a=irevoire

# Pull Request

## Related issue
Fixes #4753

## What does this PR do?
- Update all dependencies except rustls
- [x] Release charabia
- [x] Update charabia
- [x] Double check that the docker build works after updating charabia



Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-07-10 13:23:54 +00:00
Tamo
4d5005b01a make clippy happy 2024-07-10 10:06:59 +02:00
hanbings
0a40a98bb6
Make milli use edition 2021 (#4770)
* Make milli use edition 2021

* Add lifetime annotations to milli.

* Run cargo fmt
2024-07-09 17:25:39 +02:00
Tamo
cd46ebd6b5 remove insta deprecating 2024-07-08 18:38:05 +02:00
Tamo
1693332cab Update arroy and always build the tree that need to be built 2024-06-24 10:14:03 +02:00
meili-bors[bot]
ddd564665b
Merge #4713
4713: Speed up facet distribution r=ManyTheFish a=Kerollmops

This PR is akin to #4682, but this time, the same logic is applied to the facets. Bitmaps are not decoded, and we do an intersection on the bytes with the search candidates instead of materializing the RoaringBitmap to destroy it just after the operation.

A prospect raised some slow requests when performing facet searches, and I found out that the disk optimization intersection wasn't performed on the facets.

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-06-24 05:23:46 +00:00