MeiliSearch/milli/src
meili-bors[bot] ee62d9ce30
Merge #4845
4845: Fix perf regression facet strings r=ManyTheFish a=dureuill

Benchmarks between v1.9 and v1.10 show a performance regression of about 2x (+3 dB) for most indexing workloads (+44s for hackernews).

[Benchmark interpretation in the engine weekly meeting](https://www.notion.so/meilisearch/Engine-weekly-4d49560d374c4a87b4e3d126a261d4a0?pvs=4#98a709683276450295fcfe1f8ea5cef3).

- Initial investigation pointed to #4819 as the origin of the regression.
- Further investigation points to the hypernormalization of each facet value in `extract_facet_string_docids`.
- Most of the slowdown is in `normalize_facet_strings`, and more precisely in `detection.language()`.

This PR improves the situation (-10s compared with `main` for hackernews, leaving a +34s regression compared with `v1.9`) by skipping normalization wherever it is not needed.
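
The description does not spell out which condition lets normalization be skipped, so the following is only a minimal sketch of the general idea: check a cheap predicate first and borrow the input unchanged when it holds. The predicate `is_trivially_normalized`, the stand-in `expensive_normalize`, and the wrapper `normalize_facet` are all hypothetical names, not functions from milli.

```rust
use std::borrow::Cow;

/// Hypothetical fast-path check (an assumption, not the PR's actual condition):
/// the value is pure ASCII with no uppercase letters and no control characters.
fn is_trivially_normalized(value: &str) -> bool {
    value
        .bytes()
        .all(|b| b.is_ascii() && !b.is_ascii_uppercase() && !b.is_ascii_control())
}

/// Stand-in for the expensive path. In the real code this is where language
/// detection and full normalization would run; here it only lowercases.
fn expensive_normalize(value: &str) -> String {
    value.to_lowercase()
}

/// Returns the input untouched (no allocation) when the cheap check passes,
/// and falls back to the expensive path otherwise.
fn normalize_facet(value: &str) -> Cow<'_, str> {
    if is_trivially_normalized(value) {
        Cow::Borrowed(value)
    } else {
        Cow::Owned(expensive_normalize(value))
    }
}
```

With this shape, `normalize_facet("rust")` borrows its argument, while `normalize_facet("Rust")` allocates a new lowercase string. Returning a `Cow` keeps the skipped case allocation-free, which matters when the function is called once per facet value.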

I'm not sure how to fix the root cause though. Should we skip facet locale normalization for now? Cc `@ManyTheFish` 

---

Tentative resolution options:

1. Remove locale normalization from facets. I'm not sure why it is required; I believe we weren't doing it before, so maybe we can stop doing it again.
2. Skip language detection whenever possible. This won't help with the regressions seen in the benchmarks, but maybe we can skip language detection when the locales contain only one language?
3. Use a faster language detection library. `@Kerollmops` told me about https://github.com/quickwit-oss/whichlang, which boasts x10 to x100 the throughput of whatlang. Should we consider replacing whatlang with whichlang? I understand that whichlang supports fewer languages than whatlang, so I also suggest:
4. Use whichlang when the list of locales is empty (autodetection) or contains only locales that whichlang can detect. If the list contains locales that whichlang *cannot* detect, **then** use whatlang instead.
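
Options 2 and 4 combine into a small decision rule, sketched below under stated assumptions. The `Locale`, `Detector`, and `Strategy` types, the `FAST_SUPPORTED` list, and `choose_strategy` are all hypothetical and only illustrate the selection logic; they do not call whichlang or whatlang, and the supported-language list is illustrative rather than the real one.

```rust
/// Hypothetical locale identifiers; real code would use milli's own language type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Locale {
    Eng,
    Fra,
    Jpn,
    Epo,
}

/// Which detection library to run: `Fast` stands for whichlang, `Full` for whatlang.
#[derive(Debug, PartialEq, Eq)]
enum Detector {
    Fast,
    Full,
}

/// Outcome of the decision: skip detection entirely, or detect with a given library.
#[derive(Debug, PartialEq, Eq)]
enum Strategy {
    SkipDetection(Locale),
    Detect(Detector),
}

/// Assumed subset handled by the fast detector (illustrative only).
const FAST_SUPPORTED: &[Locale] = &[Locale::Eng, Locale::Fra, Locale::Jpn];

fn choose_strategy(locales: &[Locale]) -> Strategy {
    match locales {
        // Option 2: a single configured locale leaves nothing to detect.
        [only] => Strategy::SkipDetection(*only),
        // Option 4: an empty list (autodetection) or a list fully covered by
        // the fast detector. `all` returns true on an empty slice.
        _ if locales.iter().all(|l| FAST_SUPPORTED.contains(l)) => {
            Strategy::Detect(Detector::Fast)
        }
        // Option 4, fallback: at least one locale needs the full detector.
        _ => Strategy::Detect(Detector::Full),
    }
}
```

The decision depends only on the configured locales, so it could be computed once per attribute rather than once per facet value.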

---

> [!CAUTION]
> This PR contains a commit that adds detailed spans, which were used to detect which part of `extract_facet_string_docids` was taking too much time. Because these spans are entered too often and add 7s of overhead, the commit should be removed before landing.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-19 06:29:48 +00:00
| Name | Last commit | Date |
| --- | --- | --- |
| documents | Make milli use edition 2021 (#4770) | 2024-07-09 17:25:39 +02:00 |
| facet | Make milli use edition 2021 (#4770) | 2024-07-09 17:25:39 +02:00 |
| heed_codec | Implement localized attributes settings | 2024-07-25 10:51:27 +02:00 |
| prompt | Remove prompt strategy and fallback | 2023-12-14 16:08:41 +01:00 |
| search | Merge #4845 | 2024-08-19 06:29:48 +00:00 |
| snapshots/index.rs | always push the user defined vectors in arroy | 2024-06-06 11:39:29 +02:00 |
| update | Merge #4845 | 2024-08-19 06:29:48 +00:00 |
| vector | Specialize authorized error message depending on config source | 2024-07-31 15:03:44 +02:00 |
| asc_desc.rs | fmt | 2023-03-30 23:37:26 +02:00 |
| criterion.rs | update the syntax of the geoboundingbox filter to uses brackets instead of parens around lat and lng | 2023-02-06 16:50:27 +01:00 |
| error.rs | Improve errors when indexing documents with a user provided embedder | 2024-07-16 13:39:01 +02:00 |
| external_documents_ids.rs | Make milli use edition 2021 (#4770) | 2024-07-09 17:25:39 +02:00 |
| fieldids_weights_map.rs | makes clippy and fmt happy | 2024-06-06 11:39:29 +02:00 |
| fields_ids_map.rs | provide a method to get all the nested fields ids from a name | 2024-06-06 11:36:11 +02:00 |
| index.rs | Use wrapper that forces the desired date format | 2024-07-31 17:12:19 +02:00 |
| lib.rs | fix clippy | 2024-07-25 10:52:56 +02:00 |
| localized_attributes_rules.rs | Fix PR comments | 2024-07-25 10:52:56 +02:00 |
| order_by_map.rs | Revert "Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"" | 2024-03-20 10:08:28 +01:00 |
| proximity.rs | Change the naming of attributeScale and wordScale into byAttribute and byWord | 2023-12-14 16:31:00 +01:00 |
| score_details.rs | Do not fail sort comparisons when the field name or target point are different | 2024-07-11 16:28:14 +02:00 |
| snapshot_tests.rs | Make milli use edition 2021 (#4770) | 2024-07-09 17:25:39 +02:00 |
| thread_pool_no_abort.rs | Introduce the ThreadPoolNoAbort wrapper | 2024-04-24 16:40:12 +02:00 |