Clément Renault
bb885a5810
Fix the merge for roaring bitmap
2024-09-01 23:20:19 +02:00
Clément Renault
b625d31c7d
Introduce the PartialDumpIndexer indexer that generates document ids in parallel
2024-08-30 15:07:21 +02:00
Clément Renault
6487a67f2b
Introduce the ConcurrentAvailableIds struct and rename the other to AvailableIds
2024-08-30 15:06:50 +02:00
Clément Renault
271ce91b3b
Add the rayon Threadpool to the index function parameter
2024-08-30 14:34:24 +02:00
Clément Renault
54f2eb4507
Remove duplication of grenad merger
2024-08-30 14:34:05 +02:00
Clément Renault
794ebcd582
Replace grenad with the new grenad various-improvement branch
2024-08-30 11:53:59 +02:00
Clément Renault
b7c77c7a39
Use the latest version of the obkv crate
2024-08-30 11:53:59 +02:00
Clément Renault
0c57cf7565
Replace obkv with the temporary new version of it
2024-08-30 11:53:58 +02:00
Clément Renault
27df9e6c73
Introduce the indexer::index function that runs the indexation
2024-08-30 11:53:58 +02:00
Clément Renault
45c060831e
Introduce typed channels and the merger loop
2024-08-30 11:53:58 +02:00
Clément Renault
874c1ac538
First channels types
2024-08-30 11:53:58 +02:00
Clément Renault
e6ffa4d454
Implement the document merge function for the replace method
2024-08-30 11:53:58 +02:00
Clément Renault
637a9c8bdd
Implement the document merge function for the update method
2024-08-30 11:53:58 +02:00
Louis Dureuil
c683fa98e6
WIP
...
Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-30 11:53:57 +02:00
meili-bors[bot]
ee62d9ce30
Merge #4845
...
4845: Fix perf regression facet strings r=ManyTheFish a=dureuill
Benchmarks between v1.9 and v1.10 show a performance regression of about x2 (+3dB regression) for most indexing workloads (+44s for hackernews).
[Benchmark interpretation in the engine weekly meeting](https://www.notion.so/meilisearch/Engine-weekly-4d49560d374c4a87b4e3d126a261d4a0?pvs=4#98a709683276450295fcfe1f8ea5cef3).
- Initial investigation pointed to #4819 as the origin of the regression.
- Further investigation points towards the hypernormalization of each facet value in `extract_facet_string_docids`
- Most of the slowdown is in `normalize_facet_strings`, and precisely in `detection.language()`.
This PR improves the situation (-10s compared with `main` for hackernews, so only +34s regression compared with `v1.9`) by skipping normalization when it can be skipped.
I'm not sure how to fix the root cause though. Should we skip facet locale normalization for now? Cc `@ManyTheFish`
---
Tentative resolution options:
1. remove locale normalization from facet. I'm not sure why this is required, I believe we weren't doing this before, so maybe we can stop doing that again.
2. don't do language detection when it can be avoided: won't help with the regressions in benchmark, but maybe we can skip language detection when the locales contain only one language?
3. use a faster language detection library: `@Kerollmops` told me about https://github.com/quickwit-oss/whichlang which boasts x10 to x100 throughput compared with whatlang. Should we consider replacing whatlang with whichlang? Now I understand whichlang supports fewer languages than whatlang, so I also suggest:
4. use whichlang when the list of locales is empty (autodetection), or when it only contains locales that whichlang can detect. If the list of locales contains locales that whichlang *cannot* detect, **then** use whatlang instead.
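Option 2 above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that landed: `Language`, `language_for_facet`, and the `detect` callback are hypothetical names standing in for the real locale type and the expensive detector call. Returning `None` for an empty locale list is also an assumption, chosen to match the idea of detecting only when several locales are configured.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a language identifier (not the real milli/charabia type).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Language {
    Eng,
    Fra,
    Jpn,
}

/// Picks the language used to normalize a facet value.
///
/// `detect` represents the expensive detector; it is only called
/// when the configured locales contain several distinct languages.
pub fn language_for_facet(
    locales: &[Language],
    value: &str,
    detect: impl FnOnce(&str) -> Option<Language>,
) -> Option<Language> {
    let distinct: HashSet<Language> = locales.iter().copied().collect();
    match distinct.len() {
        // Assumption: no locale configured means no detection at all.
        0 => None,
        // A single language: there is nothing to detect.
        1 => distinct.into_iter().next(),
        // Several languages: detect, and keep the result only if it is allowed.
        _ => detect(value).filter(|language| distinct.contains(language)),
    }
}
```

With a single configured locale the detector is never invoked, which is where the hot-loop cost would be saved.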
---
> [!CAUTION]
> this PR contains a commit that adds detailed spans, which were used to detect which part of `extract_facet_string_docids` was taking too much time. As this commit adds spans that are called too often and add 7s of overhead, it should be removed before landing.
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-19 06:29:48 +00:00
ManyTheFish
0f965d3574
Remove hotloop's spans
2024-08-14 14:33:36 +02:00
ManyTheFish
ade54493ab
Only detect language for a facet if several locales have been specified by the user in the settings
2024-08-14 12:03:52 +02:00
Louis Dureuil
c3cdc407ec
Avoid unnecessary clone()
2024-08-08 14:57:02 +02:00
Louis Dureuil
2f10273d14
Group by normalized values, make sure you don't remove a value when there remains at least one value that normalizes towards it
2024-08-08 14:02:53 +02:00
Louis Dureuil
e64d0e0ca8
use insert instead of push for bitmaps
2024-08-01 18:32:45 +02:00
Louis Dureuil
0e68718027
Add detailed spans
2024-07-31 13:05:47 +02:00
Louis Dureuil
7c3fc8c655
Split settings and document facet string extractions
2024-07-31 10:57:46 +02:00
Louis Dureuil
8acd3f50bb
skip normalization when the locales and values are the same
2024-07-31 09:53:00 +02:00
Louis Dureuil
d4ea7cc2a9
fix clippy 👉 👈
2024-07-25 12:10:32 +02:00
Louis Dureuil
2413592bbf
Display docid when there are documents without manual embeddings for a manual embedder
2024-07-25 12:10:32 +02:00
Louis Dureuil
553440632e
Introduce Setting::some_or_not_set
2024-07-25 12:01:52 +02:00
Louis Dureuil
7a347966da
Allow explicit dimensions for ollama
2024-07-25 12:01:51 +02:00
Louis Dureuil
4654d51e05
Add custom headers for REST embedder
2024-07-25 12:01:51 +02:00
ManyTheFish
a918561ac1
Fix PR comments
2024-07-25 10:52:56 +02:00
ManyTheFish
04fa44e7eb
Implement localized attributes settings
2024-07-25 10:51:27 +02:00
ManyTheFish
cc02920f2b
Update charabia
2024-07-25 10:51:27 +02:00
Tamo
988552e178
add tests on the rest embedder
2024-07-24 14:34:17 +02:00
Louis Dureuil
0d8199f3b7
Change parameters in milli settings
2024-07-24 14:34:17 +02:00
Louis Dureuil
24240934f9
Improve errors when indexing documents with a user provided embedder
2024-07-16 13:39:01 +02:00
Louis Dureuil
65d0c32aa7
Allow overriding OpenAI's url
2024-07-16 13:39:00 +02:00
Clément Renault
6e80364c50
Apply review comments
2024-07-11 11:00:27 +02:00
Clément Renault
837274f853
Restrict even more the Rhai engine
2024-07-10 16:30:18 +02:00
Clément Renault
aace587dd1
Create errors for the internal processing ones
2024-07-10 16:29:18 +02:00
Clément Renault
81ec0abad1
Use the new rayon-par-bridge library
2024-07-10 16:29:04 +02:00
Clément Renault
b67d385cf0
Parallelize the edition functions
2024-07-10 16:28:54 +02:00
Clément Renault
2eae2015d7
Support aborting documents edition by function
2024-07-10 16:28:15 +02:00
Clément Renault
33fa17bf12
Support deleting documents with functions
2024-07-10 16:28:15 +02:00
Clément Renault
400e6b93ce
Support user-provided context for documents edition
2024-07-10 16:28:15 +02:00
Clément Renault
f4add93043
Limit the number of script operations
2024-07-10 16:28:14 +02:00
Clément Renault
2fae96ac14
Show the actual number of edited documents
2024-07-10 16:28:14 +02:00
Clément Renault
45af18ae9c
Check the Rhai syntax before accepting the script
2024-07-10 16:28:13 +02:00
Clément Renault
2d97164d9f
It works perfectly with some Rhai
2024-07-10 16:28:13 +02:00
Clément Renault
efc156a4a4
Executing Lua works correctly
2024-07-10 16:27:36 +02:00
meili-bors[bot]
2099b4f0dd
Merge #4786
...
4786: Update dependencies r=Kerollmops a=irevoire
# Pull Request
## Related issue
Fixes #4753
## What does this PR do?
- Update all dependencies except rustls
- [x] Release charabia
- [x] Update charabia
- [x] Double check that the docker build works after updating charabia
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-07-10 13:23:54 +00:00
Tamo
4d5005b01a
make clippy happy
2024-07-10 10:06:59 +02:00
hanbings
0a40a98bb6
Make milli use edition 2021 (#4770)
...
* Make milli use edition 2021
* Add lifetime annotations to milli.
* Run cargo fmt
2024-07-09 17:25:39 +02:00
Tamo
cd46ebd6b5
remove insta deprecating
2024-07-08 18:38:05 +02:00
Tamo
1693332cab
Update arroy and always build the trees that need to be built
2024-06-24 10:14:03 +02:00
meili-bors[bot]
ddd564665b
Merge #4713
...
4713: Speed up facet distribution r=ManyTheFish a=Kerollmops
This PR is akin to #4682, but this time, the same logic is applied to the facets. Bitmaps are not decoded, and we do an intersection on the bytes with the search candidates instead of materializing the RoaringBitmap to destroy it just after the operation.
A prospect raised some slow requests when performing facet searches, and I found out that the disk optimization intersection wasn't performed on the facets.
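The idea of intersecting on the bytes rather than decoding first can be illustrated with a deliberately simplified sketch. It assumes a made-up on-disk layout (a flat sequence of little-endian `u32` document ids), which is NOT the real RoaringBitmap serialization format, and `intersection_len_on_bytes` is a hypothetical helper name. The point is only that the stored ids are streamed against the candidates without building an intermediate set.

```rust
use std::collections::BTreeSet;

/// Counts the ids stored in `serialized` that are also in `candidates`.
///
/// Assumption: `serialized` is a flat list of little-endian u32 ids,
/// a simplified stand-in for the real serialized bitmap format.
/// No intermediate collection is allocated for the stored ids.
pub fn intersection_len_on_bytes(serialized: &[u8], candidates: &BTreeSet<u32>) -> u64 {
    serialized
        .chunks_exact(4)
        // Decode one id at a time, straight from the byte slice.
        .map(|chunk| u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
        .filter(|id| candidates.contains(id))
        .count() as u64
}
```

Trailing bytes that do not form a complete id are ignored by `chunks_exact`; a real implementation would more likely treat them as a corruption error.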
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-06-24 05:23:46 +00:00
Clément Renault
9736e16a88
Make clippy happy
2024-06-20 13:02:44 +02:00
Louis Dureuil
a04041c8f2
Only spawn the pool once
2024-06-19 16:25:33 +02:00
Louis Dureuil
0a8f50695e
Fixes for Rust v1.79
2024-06-13 17:47:44 +02:00
Louis Dureuil
e35ef31738
Small changes following review
2024-06-13 14:20:48 +02:00
Louis Dureuil
3bc8f81abc
user_provided => regenerate
2024-06-12 18:12:20 +02:00
Louis Dureuil
a89eea233b
Fix vectors injection
2024-06-12 17:10:19 +02:00
Louis Dureuil
f5cf01e7d1
Rework extraction to use EmbedderAction
2024-06-12 14:50:55 +02:00
Louis Dureuil
d1dd7e5d09
In transform for removed embedders, write back their user provided vectors in documents, and clear the writers
2024-06-12 14:50:55 +02:00
Louis Dureuil
d18c1f77d7
Update embedder configs with a finer granularity
...
- no longer clear vector DB between any two embedder changes
2024-06-12 14:50:55 +02:00
Louis Dureuil
7cef2299cf
Fix behavior when removing a document
2024-06-11 09:45:08 +02:00
Tamo
2cdcb703d9
fix the deletion of vectors and add a test
2024-06-06 11:39:29 +02:00
Tamo
d85ab23b82
rename all occurrences of user_defined to user_provided for consistency
2024-06-06 11:39:29 +02:00
Tamo
b7349910d9
implements more review comments
2024-06-06 11:39:29 +02:00
Tamo
376b3a19a7
makes clippy and fmt happy
2024-06-06 11:39:29 +02:00
Tamo
5d50850e12
always push the user defined vectors in arroy
2024-06-06 11:39:29 +02:00
Tamo
a73ccc78a6
forward the embedding config to the extractors
2024-06-06 11:39:28 +02:00
Tamo
9eb6f522ea
wraps the index embedding config in a struct
2024-06-06 11:37:30 +02:00
Tamo
84e498299b
Remove the vectors from the documents database
2024-06-06 11:36:11 +02:00
Tamo
7a84697570
never store the _vectors as searchable or faceted fields
2024-06-06 11:36:11 +02:00
ManyTheFish
30293883e0
Fix condition mistake
2024-06-05 17:30:07 +02:00
ManyTheFish
b833be46b9
Avoid running proximity when only the exact attributes change
2024-06-05 17:30:07 +02:00
ManyTheFish
0a4118329e
Put only_additional_fields to None if the difference gives an empty result.
2024-06-05 17:30:07 +02:00
ManyTheFish
261e92d7e6
Skip iterating over documents when the faceted field list doesn't change
2024-06-05 17:30:07 +02:00
ManyTheFish
5cd08979b1
iterate over the faceted fields instead of over the whole document
2024-06-05 17:30:07 +02:00
Clément Renault
a998b881f6
Cache a lot of operations to know if a field must be indexed
2024-06-05 17:30:07 +02:00
Clément Renault
b81953a65d
Add a span for the prepare_for_documents_reindexing
2024-06-05 17:30:07 +02:00
Clément Renault
091bb157f1
Add a span for the settings diff creation
2024-06-05 17:30:07 +02:00
Clément Renault
1b639ce44b
Reduce the number of complex calls to settings diff functions
2024-06-05 17:30:07 +02:00
Clément Renault
87cf8a3c94
Introduce a new way to determine the operations to perform on the fields
2024-06-05 17:30:07 +02:00
Clément Renault
0f578348f1
Introduce a dedicated function to write proximity entries in database
2024-06-05 17:30:07 +02:00
Clément Renault
fad4675abe
Give the settings diff to the write_typed_chunk_into_index function
2024-06-05 17:30:07 +02:00
Clément Renault
1ab03c4ede
Fix an issue with settings diff and * in the searchable attributes
2024-06-05 17:30:07 +02:00
Clément Renault
0c6e4b2f00
Introducing a new into_del_add_obkv_conditional_operation function
2024-06-05 17:30:07 +02:00
Clément Renault
42b3f52ef9
Introduce the SettingDiff only_additional_fields method
2024-06-05 17:30:07 +02:00
ManyTheFish
1ab88e10b9
Merge branch 'main' into merge-release-v1.8.1-in-main
2024-05-29 16:24:00 +02:00
Many the fish
e1fbfde6c4
Merge branch 'main' into merge-release-v1.8.1-in-main
2024-05-29 11:31:03 +02:00
ManyTheFish
27b75ec648
merge main into v1.8.1
2024-05-29 11:26:07 +02:00
Louis Dureuil
d35278320e
Add support functions for accessing arroy writers and readers
2024-05-28 15:27:43 +02:00
Clément Renault
dc949ab46a
Remove puffin usage
2024-05-27 15:59:14 +02:00
meili-bors[bot]
19acc65ad2
Merge #4646
...
4646: Reduce `Transform`'s disk usage r=Kerollmops a=Kerollmops
This PR implements what is described in #4485. It reduces the number of disk writes and disk usage.
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-23 16:06:50 +00:00
Clément Renault
fe17c0f52e
Construct the minimal OBKVs according to the settings diff
2024-05-23 11:23:57 +02:00
Clément Renault
bc5663e673
FieldIdsMap no longer useful thanks to #4631
2024-05-22 16:06:15 +02:00
Louis Dureuil
8a941c0241
Smaller review changes
2024-05-22 14:44:42 +02:00
Louis Dureuil
16037e2169
Don't remove embedders that are not in the config from the document DB
2024-05-22 12:24:51 +02:00
Clément Renault
500ddc76b5
Make the flattened sorter optional
2024-05-21 16:16:36 +02:00
Clément Renault
1aa8ed9ef7
Make the original sorter optional
2024-05-21 14:53:26 +02:00
ManyTheFish
f762307838
Fix clippy
2024-05-21 13:44:20 +02:00
ManyTheFish
3e94a90722
Fixes
2024-05-21 13:39:46 +02:00
ManyTheFish
fc7e817221
Index geo points based on the settings differences
2024-05-20 12:27:26 +02:00
Louis Dureuil
d05d49ffd8
Fix tests
2024-05-20 10:36:18 +02:00
Louis Dureuil
0462ebbe58
Don't write an empty _vectors field
2024-05-20 10:36:18 +02:00
Louis Dureuil
2f7a8a4efb
Don't write vectors that weren't autogenerated in document DB
2024-05-20 10:36:18 +02:00
Louis Dureuil
52d9cb6e5a
Refactor vector indexing
...
- use the parsed_vectors module
- only parse `_vectors` once per document, instead of once per embedder per document
2024-05-20 10:36:17 +02:00
Tamo
897d25780e
update milli to latest version
2024-05-16 18:31:32 +02:00
Tamo
f2d0a59f1d
when no searchable attributes are defined, make all the weights equal to zero
2024-05-16 01:06:33 +02:00
Tamo
ad4d8502b3
stops storing the whole fieldids weights map when no searchable attributes are defined
2024-05-15 17:16:10 +02:00
Tamo
7ec4e2a3fb
apply all style review comments
2024-05-15 15:02:26 +02:00
Tamo
caa6a7149a
make the attribute ranking rule use the weights and fix the tests
2024-05-14 17:36:32 +02:00
Tamo
b0afe0972e
stop updating the fields ids map when fields are only swapped
2024-05-14 17:00:02 +02:00
Tamo
685f452fb2
Fix the indexing of the searchable
2024-05-14 17:00:02 +02:00
Tamo
4e4a1ddff7
gate a test behind the required feature
2024-05-14 17:00:02 +02:00
Tamo
c22460045c
Stops returning an option in the internal searchable fields
2024-05-14 17:00:02 +02:00
meili-bors[bot]
4d5971f343
Merge #4621
...
4621: Bring back changes from v1.8.0 into main r=curquiza a=curquiza
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-06 13:46:39 +00:00
meili-bors[bot]
ebca29f3de
Merge #4597
...
4597: Fix embeddings settings update r=ManyTheFish a=ManyTheFish
# Pull Request
- add some conditions reducing the work done when changing the settings
- add some benchmarks on embedders
## Related issue
Fixes #4585
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-04-25 16:37:28 +00:00
Clément Renault
d4aeff92d0
Introduce the ThreadPoolNoAbort wrapper
2024-04-24 16:40:12 +02:00
Clément Renault
96cc5319c8
Introduce a new internal error type to categorize panics
2024-04-22 18:09:33 +02:00
Clément Renault
0c7003c5df
Introduce an atomic to catch panics in thread pools
2024-04-22 18:09:33 +02:00
ManyTheFish
a1aa999026
Add conditions reducing work
2024-04-22 14:18:35 +02:00
ManyTheFish
df29ba709a
Make some cleaning in Arcs
2024-04-17 12:33:25 +02:00
ManyTheFish
3acfab2eb7
Fix PR comments
2024-04-17 10:55:51 +02:00
ManyTheFish
87a93ba47d
fix clippy
2024-04-16 14:39:30 +02:00
ManyTheFish
eaf113ef34
Fix word pair proximity error when nothing has to be extracted
2024-04-16 14:39:30 +02:00
ManyTheFish
e5ae337aae
Come back to sorters in extract_word_docids
...
using buffers and merging the keys manually is less efficient
2024-04-16 14:39:30 +02:00
ManyTheFish
a489b406b4
fix test
2024-04-16 14:39:06 +02:00
ManyTheFish
02c3d6b265
finish work
2024-04-16 14:39:06 +02:00
ManyTheFish
b5e4a55af6
refactor faceted and searchable pipeline
2024-04-16 14:39:06 +02:00
ManyTheFish
a7e368aaa6
Create InnerIndexSettingsDiffs struct and populate it
2024-04-16 14:39:06 +02:00
ManyTheFish
893200ab87
Avoid clearing documents in transform
2024-04-16 14:39:06 +02:00
ManyTheFish
aabce52b1b
Fix test
2024-04-16 14:39:06 +02:00
ManyTheFish
8fff5fc281
update tests
2024-04-16 14:39:06 +02:00
yudrywet
cf864a1c2e
chore: fix some typos in comments
...
Signed-off-by: yudrywet <yudeyao@yeah.net>
2024-04-14 20:11:34 +08:00
Louis Dureuil
466d718a05
Fix test
2024-04-04 15:58:19 +02:00
meili-bors[bot]
56bf8503db
Merge #4537
...
4537: Expose distribution shift in settings r=ManyTheFish a=dureuill
See [usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#d652adc0890445658aaf36352dbc8802)
# Changes
- Distribution shift added to all embedders.
- Exposed in settings
- Changed the reindexing logic to not trigger a reindex operation when only the distribution shift or API key changes
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-03 09:08:58 +00:00
redistay
182cb42953
chore: fix some typos in comments
...
Signed-off-by: redistay <wujunjing@outlook.com>
2024-04-02 19:37:55 +08:00
meili-bors[bot]
92a049c2dd
Merge #4543
...
4543: Bring back changes from v1.7.4 into main r=Kerollmops a=dureuill
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2024-03-28 16:53:51 +00:00
Louis Dureuil
796213af9a
Merge branch 'main' into tmp-release-v1.7.4
2024-03-28 10:51:49 +01:00
Louis Dureuil
ee8cbea810
Don't optimize reindexing when fields contain dots
2024-03-27 17:04:45 +01:00
Louis Dureuil
572fb3a51d
Finer granularity for embedder needs reindex
2024-03-27 12:01:34 +01:00
Louis Dureuil
afd1da5642
Add distribution to all embedders
2024-03-27 11:50:22 +01:00
Louis Dureuil
817ccc089a
also allow api_key
2024-03-25 11:50:00 +01:00
Louis Dureuil
4136630ea5
Use constants instead of raw strings in set_*set()
2024-03-25 11:39:33 +01:00
Louis Dureuil
58972f35cb
Allow url parameter for ollama embedder
2024-03-25 11:32:55 +01:00
Louis Dureuil
dfa5e41ea6
Check validity of the URL setting
2024-03-25 11:23:16 +01:00
Louis Dureuil
a1db342f01
Expose REST embedder to the API
2024-03-25 11:23:15 +01:00
Louis Dureuil
f87747f4d3
Remove unwraps
2024-03-25 11:23:04 +01:00
Louis Dureuil
ac52c857e8
Update ollama and openai impls to use the rest embedder internally
2024-03-25 11:23:03 +01:00