9809 Commits

Author SHA1 Message Date
ManyTheFish
b12e997c8a Add pinyin flag 2024-08-21 14:38:04 +02:00
ManyTheFish
8bf89ec394 Infer locales from index settings 2024-08-21 10:47:40 +02:00
meili-bors[bot]
ee62d9ce30
Merge #4845
4845: Fix perf regression facet strings r=ManyTheFish a=dureuill

Benchmarks between v1.9 and v1.10 show a performance regression of about x2 (+3dB regression) for most indexing workloads (+44s for hackernews).

[Benchmark interpretation in the engine weekly meeting](https://www.notion.so/meilisearch/Engine-weekly-4d49560d374c4a87b4e3d126a261d4a0?pvs=4#98a709683276450295fcfe1f8ea5cef3).

- Initial investigation pointed to #4819 as the origin of the regression.
- Further investigation points towards the hypernormalization of each facet value in `extract_facet_string_docids`
- Most of the slowdown is in `normalize_facet_strings`, and precisely in `detection.language()`.

This PR improves the situation (-10s compared with `main` for hackernews, so only +34s regression compared with `v1.9`) by skipping normalization when it can be skipped.

I'm not sure how to fix the root cause though. Should we skip facet locale normalization for now? Cc `@ManyTheFish` 

---

Tentative resolution options:

1. remove locale normalization from facet. I'm not sure why this is required, I believe we weren't doing this before, so maybe we can stop doing that again.
2. don't do language detection when it can be helped: won't help with the regressions in benchmark, but maybe we can skip language detection when the locales contain only one language?
3. use a faster language detection library: `@Kerollmops` told me about https://github.com/quickwit-oss/whichlang which boasts x10 to x100 throughput compared with whatlang. Should we consider replacing whatlang with whichlang? Now I understand whichlang supports fewer languages than whatlang, so I also suggest:
4. use whichlang when the list of locales is empty (autodetection), or when it only contains locales that whichlang can detect. If the list of locales contains locales that whichlang *cannot* detect, **then** use whatlang instead.
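
Options 2 and 4 above can be combined into a single dispatch on the configured locales. The following is only a sketch of that idea, not the code that landed: `Detector`, `choose_detector` and the `fast_supported` list are hypothetical names, and the language codes in the usage note are placeholders rather than the actual coverage of either library.

```rust
/// Which detector to run for a facet value (all names here are illustrative).
#[derive(Debug, PartialEq, Eq)]
enum Detector {
    /// The language is already known, so detection is skipped.
    Skip,
    /// Faster detector covering fewer languages (whichlang in the text above).
    Fast,
    /// Slower detector covering more languages (whatlang in the text above).
    Broad,
}

/// Picks a detector from the locales configured for an attribute.
/// `fast_supported` is a made-up stand-in for the set of languages the fast
/// detector can recognize.
fn choose_detector(locales: &[&str], fast_supported: &[&str]) -> Detector {
    match locales {
        // No locale configured: autodetect with the fast detector (option 4).
        [] => Detector::Fast,
        // A single locale leaves nothing to detect (option 2).
        [_] => Detector::Skip,
        // Several locales: use the fast detector only if it covers all of them.
        many if many.iter().all(|l| fast_supported.contains(l)) => Detector::Fast,
        // At least one locale is out of reach of the fast detector.
        _ => Detector::Broad,
    }
}
```

With a placeholder `fast_supported` of `["eng", "fra"]`, the locales `["eng", "tha"]` would select `Detector::Broad`, while `["jpn"]` alone would select `Detector::Skip`.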

---

> [!CAUTION]
> This PR contains a commit that adds detailed spans, which were used to find which part of `extract_facet_string_docids` was taking too much time. Because these spans are called too often and add 7s of overhead, that commit should be removed before landing.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-08-19 06:29:48 +00:00
ManyTheFish
0f965d3574 Remove hotloop's spans 2024-08-14 14:33:36 +02:00
ManyTheFish
ade54493ab Only detect language for a facet if several locales have been specified by the user in the settings 2024-08-14 12:03:52 +02:00
meili-bors[bot]
07c8ed0459
Merge #4864
4864: Don't remove facet value when multiple original values map to the same normalized value r=ManyTheFish a=dureuill

# Pull Request

## Related issue

Fixes #4860 

> [!WARNING]  
> This PR contains a fix to the immediate issue, but it looks like the underlying data model is faulty: there is only one possible "original" value for each normalized value in a facet of a document, while because of array values (or manually written nested fields, if you're evil), it is technically possible to have multiple, distinct original values mapping to the same normalized value.
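
One way to picture the data model issue is to keep, for each normalized value, the set of every original value that maps to it, and to drop the normalized value only once that set is empty. This is a minimal sketch under that assumption: `FacetValues` and its methods are hypothetical, and `normalize` is a toy lowercase function standing in for the real normalizer.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Toy normalization: lowercase only. The real normalizer does far more.
fn normalize(value: &str) -> String {
    value.to_lowercase()
}

/// Maps each normalized facet value to every original value producing it.
#[derive(Default)]
struct FacetValues {
    originals_by_normalized: BTreeMap<String, BTreeSet<String>>,
}

impl FacetValues {
    /// Records an original value under its normalized form.
    fn add(&mut self, original: &str) {
        self.originals_by_normalized
            .entry(normalize(original))
            .or_default()
            .insert(original.to_string());
    }

    /// Removes one original value. Returns `true` only when the normalized
    /// value lost its last original and was dropped as well.
    fn remove(&mut self, original: &str) -> bool {
        let normalized = normalize(original);
        let Some(originals) = self.originals_by_normalized.get_mut(&normalized) else {
            return false;
        };
        originals.remove(original);
        if originals.is_empty() {
            self.originals_by_normalized.remove(&normalized);
            true
        } else {
            false
        }
    }

    /// Tells whether any original value still maps to `normalized`.
    fn contains_normalized(&self, normalized: &str) -> bool {
        self.originals_by_normalized.contains_key(normalized)
    }
}
```

After adding `"Blue"` and `"blue"`, removing `"Blue"` leaves the normalized value `"blue"` in place because one original still maps to it.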

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-08-13 14:04:17 +00:00
Louis Dureuil
c3cdc407ec
Avoid unnecessary clone() 2024-08-08 14:57:02 +02:00
Louis Dureuil
2f10273d14
Group by normalized values, make sure you don't remove a value when at least one remaining value still normalizes to it 2024-08-08 14:02:53 +02:00
meili-bors[bot]
b44e17c4c3
Merge #4858
4858: also intersect the universe for searchOnAttributes r=irevoire a=dureuill

# Pull Request

## Related issue
Fixes #4857 

## What does this PR do?
- intersect with the universe (which does not contain the filtered out ids) when looking up documents for words, even when using `searchOnAttributes`
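
The intersection itself is a plain set operation. As a rough illustration, with `BTreeSet<u32>` standing in for the bitmaps the engine uses and a hypothetical `docids_in_universe` helper:

```rust
use std::collections::BTreeSet;

/// Stand-in for a document id bitmap.
type DocIds = BTreeSet<u32>;

/// Returns the documents containing a word, restricted to the universe.
/// `word_docids` stands in for whatever the index lookup returned, whether
/// or not the search was limited to specific attributes.
fn docids_in_universe(word_docids: &DocIds, universe: &DocIds) -> DocIds {
    // Ids filtered out of the universe can never appear in the result.
    word_docids.intersection(universe).copied().collect()
}
```

If a word matches documents 1 to 4 and the universe is {2, 4, 6}, only documents 2 and 4 remain candidates.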


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-08-07 13:15:26 +00:00
Louis Dureuil
e3ef0ae19e
also intersect the universe for searchOnAttributes 2024-08-06 14:06:56 +02:00
meili-bors[bot]
57f7af77c7
Merge #4846
4846: Add OpenAI tests r=dureuill a=dureuill

# Pull Request

## Related issue
Part of fixing #4757 

## What does this PR do?
- OpenAI embedder: don't pass apiKey when it is empty (slightly improves error messages)
- rest embedder and rest-based embedders: specialize the authorization denied error message depending on the configuration source
- fix existing tests
- Adds assets containing prerecorded texts to embed and the embeddings obtained from OpenAI
- Adds an asset containing a tokenized long document and the embedding obtained from OpenAI for this token
- Uses the wiremock crate to mock the OpenAI API: parse the openai request, lookup the response in assets, craft an openai response
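
For the first item, the idea of leaving the key out when it is empty might look like the sketch below. `request_headers` is a hypothetical helper, not the embedder's actual code; the only assumption carried over is that the key travels as a bearer token in the `Authorization` header.

```rust
/// Builds request headers for a hypothetical embedder client.
fn request_headers(api_key: &str) -> Vec<(String, String)> {
    let mut headers = vec![("Content-Type".to_string(), "application/json".to_string())];
    // An empty key means no key was configured: omit the header entirely
    // rather than sending `Bearer ` followed by nothing.
    if !api_key.is_empty() {
        headers.push(("Authorization".to_string(), format!("Bearer {api_key}")));
    }
    headers
}
```

With an empty key the request carries no `Authorization` header at all, so the remote error is about missing credentials rather than a malformed one.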


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-08-05 10:49:28 +00:00
meili-bors[bot]
c817718e07
Merge #4853
4853: Fix rhai deletion r=irevoire a=dureuill

# Pull Request

## Related issue
Fixes #4849 

## What does this PR do?
- insert inside of the bitmap instead of pushing into it.
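
The difference matters when ids do not arrive in increasing order. The toy type below is not the bitmap used by the engine; it only mimics the two behaviours, assuming that pushing is append-only (a value is accepted only when it is greater than the current maximum, as `RoaringBitmap::push` does in the roaring crate) while inserting accepts any value.

```rust
/// Toy sorted id set mimicking the two ways of adding a value to a bitmap.
#[derive(Default)]
struct SortedIds(Vec<u32>);

impl SortedIds {
    /// Append-only: accepts `value` only if it is greater than the current maximum.
    fn push(&mut self, value: u32) -> bool {
        if self.0.last().map_or(true, |&max| value > max) {
            self.0.push(value);
            true
        } else {
            // Out-of-order values are silently rejected.
            false
        }
    }

    /// Accepts any value, keeping the ids sorted and unique.
    fn insert(&mut self, value: u32) -> bool {
        match self.0.binary_search(&value) {
            Ok(_) => false,
            Err(pos) => {
                self.0.insert(pos, value);
                true
            }
        }
    }
}
```

Adding 5 and then 3 with `push` keeps only 5, whereas `insert` keeps both ids.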


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-08-01 16:34:31 +00:00
Louis Dureuil
e64d0e0ca8
use insert instead of push for bitmaps 2024-08-01 18:32:45 +02:00
Louis Dureuil
21aa430b5e
Fix openai tests 2024-07-31 17:57:55 +02:00
Louis Dureuil
8535dc0be2
Fix existing tests 2024-07-31 17:57:32 +02:00
Louis Dureuil
72b9005344
Redact uid for Value 2024-07-31 17:57:13 +02:00
meili-bors[bot]
420c33132c
Merge #4850
4850: Use a fixed date format regardless of features r=irevoire a=dureuill

# Pull Request

## Related issue
Fixes #4844 

## What does this PR do?

Given the following script: 
```
cargo run -- --db-path meili.ms
sleep 3
curl -s -X POST http://127.0.0.1:7700/indexes -H 'Content-Type: application/json' --data-binary '{"uid": "movies", "primaryKey": "id"}'
sleep 3
cargo run -p meilisearch -- --db-path meili.ms
sleep 3
curl -s -X POST http://127.0.0.1:7700/indexes/movies/search -H 'Content-Type: application/json' --data-binary '{}'
```

- Before this PR, the final search returns a decoding error.
- After this PR, the search completes successfully

### Technical standpoint

This PR fixes two locations where the formatting of dates was dependent on the feature set of the `time` crate.

1. The `IndexStats` had two fields without the serialization format specified
2. More subtly, the index dates (`createdAt`, `updatedAt`) were using value remapping in the main DB to `SerdeJson<OffsetDateTime>`, which was using whatever default format was available. This was fixed by creating a local `OffsetDateTime` wrapper that specifies the serialization format.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-07-31 15:32:26 +00:00
Louis Dureuil
9ef710cad4
Use wrapper that forces the desired date format 2024-07-31 17:12:19 +02:00
Louis Dureuil
48f7329a83
Specify index_mapper on IndexStats 2024-07-31 17:11:28 +02:00
Louis Dureuil
ab1ec9ca21
Add tokenized test 2024-07-31 15:03:45 +02:00
Louis Dureuil
9d6efd92d2
new assets for tokenized test 2024-07-31 15:03:45 +02:00
Louis Dureuil
abdb337fd6
Add openai tests 2024-07-31 15:03:45 +02:00
Louis Dureuil
1c755c8899
Add openai responses 2024-07-31 15:03:45 +02:00
Louis Dureuil
3a42c3134e
update tests after changing authorized error message 2024-07-31 15:03:45 +02:00
Louis Dureuil
5aa6cb3600
Specialize authorized error message depending on config source 2024-07-31 15:03:44 +02:00
Louis Dureuil
9b7764575b
openai: don't pass apiKey when it is empty 2024-07-31 15:03:44 +02:00
Louis Dureuil
0e68718027
Add detailed spans 2024-07-31 13:05:47 +02:00
Louis Dureuil
7c3fc8c655
Split settings and document facet string extractions 2024-07-31 10:57:46 +02:00
Louis Dureuil
8acd3f50bb
skip normalization when the locales and values are the same 2024-07-31 09:53:00 +02:00
meili-bors[bot]
25791e3f46
Merge #4836
4836: Attach declared localized-attributes subroutes r=dureuill a=dureuill

RC.0 unexpectedly doesn't contain the `GET /indexes/{indexUid}/localized-attributes` and `PUT /indexes/{indexUid}/localized-attributes` subroutes.

This PR makes them available.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-07-30 19:01:54 +00:00
Tamo
b1b3a1a98b add a get, set and put test for the localized attributes setting 2024-07-30 15:51:02 +02:00
meili-bors[bot]
143d6cde10
Merge #4835
4835: Log error from main using tracing r=irevoire a=dureuill

Engine follow-up to https://github.com/meilisearch/meilisearch-support/issues/252#issuecomment-2251288276 (private link)

> `@meilisearch/engine-team` we need to open a PR to tracing::error! when an error occurs in the Meilisearch main. It would be nice to have it included in the second RC

<img width="1349" alt="Error logged when launching Meilisearch to import dump on path where the dump doesn't exist" src="https://github.com/user-attachments/assets/e5d2ae6e-f810-4029-9787-3b6ea9d47cfd">

---

<img width="1349" alt="Error logged when launching Meilisearch with a db path that is not writeable" src="https://github.com/user-attachments/assets/f672d78d-04b0-4d02-9402-259eaa6e2b62">



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-07-30 13:43:50 +00:00
Louis Dureuil
9719dec443
Attach declared attributes-localized subroutes 2024-07-29 16:19:35 +02:00
Louis Dureuil
fa77a949aa
Log error from main using tracing 2024-07-29 14:58:39 +02:00
meili-bors[bot]
abe128476f
Merge #4830
4830: Use the dtolnay's Rust Toolchain r=dureuill a=Kerollmops

Fixes the CI by using another rust-toolchain GitHub repo.

Note: the [helix-editor/rust-toolchain repository](https://github.com/helix-editor/rust-toolchain) has been deleted, so we moved to the [dtolnay/rust-toolchain](https://github.com/dtolnay/rust-toolchain) one. However, dtolnay's action doesn't support `rust-toolchain.toml`; the version is given directly in the action reference (`rust-toolchain@version`). We keep the `rust-toolchain.toml` for local builds only.

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-07-29 08:33:59 +00:00
Clément Renault
a663e408ad
Move to the right rust toolchain version 2024-07-29 10:06:34 +02:00
Clément Renault
986991277f
Use the dtolnay rust toolchain 2024-07-29 10:00:40 +02:00
meili-bors[bot]
c2c1ba39ee
Merge #4826
4826: Update Charabia v0.9.0 r=dureuill a=ManyTheFish

# Pull Request

## Related Changelog
https://github.com/meilisearch/charabia/releases/tag/v0.9.0

## Notable Change for Meilisearch
Adds all math symbols from https://www.compart.com/en/unicode/category/Sm to the default separator list.



Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-07-25 14:08:38 +00:00
ManyTheFish
35567b2137 Update Charabia v0.9.0 2024-07-25 16:02:14 +02:00
meili-bors[bot]
00c97c7152
Merge #4818
4818: Custom headers and QoL improvements r=ManyTheFish a=dureuill

# Pull Request

## Related issue
Fixes #4734 
Depends on #4815 

## What does this PR do?
- Adds custom headers for rest embedders ([public usage](https://meilisearch.notion.site/v1-10-AI-search-changes-737c9d7d010d4dd685582bf5dab579e2#41354652885242c899def07e36a66d49))
- Quality of life: allow specifying `dimensions` for `ollama` embedders ([public usage](https://meilisearch.notion.site/v1-10-AI-search-changes-737c9d7d010d4dd685582bf5dab579e2#37218531431343dab3d2d3a9a1937e9d)). As for `rest` embedders, specifying `dimensions` disables the "test" embedding when the embedder is spawned.
- Improve error message again when indexing documents that don't have a vector for a user-provided vector
  1. Remove the contents of the document
  2. Display the docid of the first document that triggered the error
  3. Indicate how many documents in that chunk suffered from the same issue for that embedder
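
The three points above amount to summarizing a chunk by its first offending document plus a count. A possible shape for such a message, with a hypothetical `missing_vectors_message` helper and made-up wording that is not the text Meilisearch actually emits:

```rust
/// Summarizes documents lacking a user-provided vector for one embedder.
/// Returns `None` when no document is affected.
fn missing_vectors_message(embedder: &str, missing_docids: &[&str]) -> Option<String> {
    // Only the first docid is shown; document contents are never included.
    let (first, rest) = missing_docids.split_first()?;
    Some(format!(
        "document `{first}` has no vector for user-provided embedder `{embedder}` ({} other document(s) in this chunk have the same issue)",
        rest.len()
    ))
}
```

For a chunk where documents `42`, `43` and `44` lack a vector, the message names `42` and reports 2 other documents.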


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-07-25 13:33:11 +00:00
Louis Dureuil
d4ea7cc2a9
fix clippy 👉👈 2024-07-25 12:10:32 +02:00
Louis Dureuil
8532fe8afc
Fix tests 2024-07-25 12:10:32 +02:00
Louis Dureuil
2413592bbf
Display docid when there are documents without manual embeddings for a manual embedder 2024-07-25 12:10:32 +02:00
Louis Dureuil
553440632e
Introduce Setting::some_or_not_set 2024-07-25 12:01:52 +02:00
Louis Dureuil
7a347966da
Allow explicit dimensions for ollama 2024-07-25 12:01:51 +02:00
Louis Dureuil
6c598fa06d
test custom headers 2024-07-25 12:01:51 +02:00
Louis Dureuil
8338df0dbe
Fix tests 2024-07-25 12:01:51 +02:00
Louis Dureuil
4654d51e05
Add custom headers for REST embedder 2024-07-25 12:01:51 +02:00
Louis Dureuil
22ef2d877f
Ensure test server has a single indexing thread 2024-07-25 12:01:51 +02:00
meili-bors[bot]
76bc2c18e8
Merge #4819
4819: Language settings r=dureuill a=ManyTheFish

# Pull Request

## Related issue
Fixes #4749 

## What does this PR do?
- [Implement localized search](c0c6955c0d)
- [Implement localized attributes settings](bde827b055)

## Related PRD

- [PRD](https://www.notion.so/meilisearch/Define-language-settings-to-impact-relevancy-bee62e18b7584c4f87d18a7654855329)
- [Public usage](https://www.notion.so/meilisearch/v1-10-Language-settings-usage-26c5d98b553349d9abacbe7aff698e4e)


Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-07-25 09:00:33 +00:00