Commit Graph

2243 Commits

Author SHA1 Message Date
Louis Dureuil 6ebb6b55a6
Lazily embed, don't fail hybrid search on embedding failure 2024-04-04 15:58:17 +02:00
Louis Dureuil fabc9cf14a
milli: add Embedder::embed_one 2024-04-04 15:57:29 +02:00
Louis Dureuil 00c4ed3bc2
milli: refactor getting embedder and embedder name 2024-04-04 15:57:29 +02:00
Louis Dureuil 928e6e4c05
Breaking change: remove vector for score details 2024-04-04 15:57:29 +02:00
meili-bors[bot] 339a5e3431
Merge #4549
4549: Hugging Face embedder improvements r=dureuill a=dureuill

Architectural changes/Internal improvements

### 1. Prefer safetensors weights over pytorch weights when available

safetensors weights are memory mapped, which reduces memory usage of supported models.

### 2. Update candle

Updates candle to `0.4.1`, now targeting crates.io and the tokenizers to `v0.15.2` (still on github).

This might fix https://github.com/meilisearch/meilisearch/issues/4399 thanks to the now included https://github.com/huggingface/candle/issues/1454

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-04 13:47:18 +00:00
meili-bors[bot] 5509bafff8
Merge #4535
4535: Support Negative Keywords r=ManyTheFish a=Kerollmops

This PR fixes #4422 by supporting `-` before any word in the query.

The minus symbol `-`, from the ASCII table, is not the only character that can be considered the negative operator. You can see the two other matching characters under the `Based on "-" (U+002D)` section on [this unicode reference website](https://www.compart.com/en/unicode/U+002D).

It's important to notice the strange behavior when a query includes and excludes the same word; only the derivative ( synonyms and split) will be kept:
 - If you input `progamer -progamer`, the engine will still search for `pro gamer`.
 - If you have the synonym `like = love` and you input `like -like`, it will still search for `love`.

## TODO
 - [x] Add analytics
 - [x] Add support to the `-` operator
 - [x] Make sure to support spaces around `-` well
 - [x] Support phrase negation
 - [x] Add tests


Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-04-04 13:10:27 +00:00
Louis Dureuil 58cafcc824
Update candle 2024-04-03 13:11:56 +02:00
meili-bors[bot] 56bf8503db
Merge #4537
4537: Expose distribution shift in settings r=ManyTheFish a=dureuill

See [usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#d652adc0890445658aaf36352dbc8802)

# Changes

- Distribution shift added to all embedders.
- Exposed in settings
- Changed the reindexing logic to not trigger a reindex operation when only the distribution shift or API key change

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-03 09:08:58 +00:00
Louis Dureuil a1eccc762a
Prefer safetensors to pytorch when both are available 2024-04-03 11:05:59 +02:00
meili-bors[bot] 75f81a0bab
Merge #4547
4547: Fix milli/Cargo.toml for usage as dependency via git r=dureuill a=Toromyx

# Pull Request

## Related issues/discussions
This enables th usage of `milli` [via git repository](https://doc.rust-lang.org/cargo/reference/specifying-dependencies.html#specifying-dependencies-from-git-repositories) as mentioned in <https://github.com/meilisearch/meilisearch/issues/3367#issuecomment-1422613815>, <https://github.com/meilisearch/meilisearch/discussions/1523#discussioncomment-1039338>, and <https://github.com/meilisearch/meilisearch/discussions/1981#discussioncomment-1771568>

## What does this PR do?
Trying to depend on `milli` like

```
[dependencies.milli]
git = "https://github.com/meilisearch/meilisearch.git"
tag = "v1.7.4"
```

leads to the following error:

```
error: failed to select a version for the requirement `candle-core = "^0.3.1"`
candidate versions found which didn't match: 0.4.2
location searched: Git repository https://github.com/huggingface/candle.git
required by package `milli v1.7.4 (https://github.com/meilisearch/meilisearch.git?tag=v1.7.4#0259ad60)`
```

because the default branch of <https://github.com/huggingface/candle> does not contain the correct version.

To fix this, i added a `rev="..."` entry in the relevant dependencies, specifiyng the commit already present in the `Cargo.lock` file.
I also updated the version to the one in the Cargo.lock. This also updated `candle-kernels` sub-dependency from 0.3.1 to 0.3.3 which is probably correct?

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Thomas Gauges <thomas.gauges@gmail.com>
2024-04-03 07:31:36 +00:00
Thomas Gauges d55d496250 Fix milli/Cargo.toml for usage as dependency via git 2024-04-02 15:19:30 +02:00
redistay 182cb42953 chore: fix some typos in conments
Signed-off-by: redistay <wujunjing@outlook.com>
2024-04-02 19:37:55 +08:00
meili-bors[bot] 92a049c2dd
Merge #4543
4543: Bring back changes from v1.7.4 into main r=Kerollmops a=dureuill



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2024-03-28 16:53:51 +00:00
Clément Renault 877f4b1045
Support negative phrases 2024-03-28 15:51:43 +01:00
Louis Dureuil 796213af9a
Merge branch 'main' into tmp-release-v1.7.4 2024-03-28 10:51:49 +01:00
Clément Renault 69f8b2730d
Fix the tests 2024-03-28 10:47:04 +01:00
Louis Dureuil ee8cbea810
Don't optimize reindexing when fields contain dots 2024-03-27 17:04:45 +01:00
Louis Dureuil 572fb3a51d
Finer granularity for embedder needs reindex 2024-03-27 12:01:34 +01:00
Louis Dureuil 4ff0255783
remove unused function 2024-03-27 11:51:14 +01:00
Louis Dureuil a25456120d
Expose distribution in settings 2024-03-27 11:51:04 +01:00
Louis Dureuil 168ded3b9d
Deserr for distribution 2024-03-27 11:50:33 +01:00
Louis Dureuil afd1da5642
Add distribution to all embedders 2024-03-27 11:50:22 +01:00
Clément Renault 34262c7a0d
Add analytics for the negative operator 2024-03-26 18:01:27 +01:00
Clément Renault 1da9e0f246
Better support space around the negative operator (-) 2024-03-26 17:47:13 +01:00
Clément Renault e4a3e603b3
Expose a first working version of the negative keyword 2024-03-26 17:47:13 +01:00
Louis Dureuil 817ccc089a
also allow `api_key` 2024-03-25 11:50:00 +01:00
Louis Dureuil 4136630ea5
Use constants instead of raw strings in set_*set() 2024-03-25 11:39:33 +01:00
Louis Dureuil 58972f35cb
Allow `url` parameter for ollama embedder 2024-03-25 11:32:55 +01:00
Louis Dureuil dfa5e41ea6
Check validity of the URL setting 2024-03-25 11:23:16 +01:00
Louis Dureuil a1db342f01
Expose REST embedder to the API 2024-03-25 11:23:15 +01:00
Louis Dureuil f87747f4d3
Remove unwraps 2024-03-25 11:23:04 +01:00
Louis Dureuil b6b4b6bab7
Remove the tokio and the reqwests 2024-03-25 11:23:03 +01:00
Louis Dureuil ac52c857e8
Update ollama and openai impls to use the rest embedder internally 2024-03-25 11:23:03 +01:00
Louis Dureuil 8708cbef25
Add RestEmbedder 2024-03-25 11:23:03 +01:00
Louis Dureuil c3d02f092d
OpenAI sync 2024-03-25 11:23:03 +01:00
Louis Dureuil bc58e8a310
Documentation for the vector module 2024-03-25 11:23:03 +01:00
meili-bors[bot] ec81c2bf1a
Merge #4511
4511: Bump charabia to 0.8.8 r=ManyTheFish a=6543

... and update lock file

this will add the fix (https://github.com/meilisearch/charabia/pull/275) to support markdown formatted codeblocks

Co-authored-by: 6543 <6543@obermui.de>
2024-03-25 09:26:11 +00:00
meili-bors[bot] fc1c3f4a29
Merge #4466
4466: Implements the search cutoff r=irevoire a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4488

## What does this PR do?
- Adds a cutoff to the bucket sort after 150ms has been spent
- Adds a new setting to customize the default value of 150ms
- When the time is exceeded, we exit early with what we had the time to sort
- If the cutoff has been reached, the search details are updated with a new `Skip` ranking details for the ranking rules that were skipped
- Adds analytics to measure the total number of degraded search requests
- Adds the number of degraded search requests to the Prometheus metrics and Grafana dashboard
- The cutoff **must not** skip the filters; otherwise, we would leak documents to people who don’t have the right to see them


Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-20 13:06:53 +00:00
6543 4628b7b7bd bump charabia to 0.8.8
and update lock file
2024-03-20 13:39:00 +01:00
Tamo c5322df519
Revert "Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"" 2024-03-20 10:08:28 +01:00
Tamo 6079141ea6 snapshot the scores side by side with the score details 2024-03-19 18:30:14 +01:00
Tamo 2c3af8e513 query the detailed score detail in the test 2024-03-19 18:09:02 +01:00
Louis Dureuil 098ab594eb
A score of 0.0 is now lesser than a sort result
handles the niche case 🐩 in the hybrid search where:
1. a sort ranking rule is the first rule.
2. the keyword search is skipped at the first rule.
3. the semantic search is not skipped at the first rule.

Previously, we would have the skipped search winning, whereas we want the non skipped one winning.
2024-03-19 17:32:32 +01:00
Tamo 567194b925 Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"
This reverts commit bd74cce86a, reversing
changes made to d2f77e88bd.
2024-03-19 16:56:21 +01:00
Tamo d8fe4fe49d return the order in the score details 2024-03-19 15:45:04 +01:00
Tamo 7b9e0d2944 forward the degraded parameter to the hybrid search 2024-03-19 15:11:21 +01:00
Tamo bfec9468d4
Update milli/src/search/mod.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-19 14:49:15 +01:00
Clément Renault bd74cce86a
Merge remote-tracking branch 'origin/main' into release-v1.7.1 2024-03-19 13:39:17 +01:00
Tamo b8cda6c300 fix the search cutoff and add a test 2024-03-19 10:35:47 +01:00
Tamo d1db495119 add a settings for the search cutoff 2024-03-19 10:28:23 +01:00
Tamo 4a467739cd implements a first version of the cutoff without settings 2024-03-19 10:28:21 +01:00
Louis Dureuil a302e258bd
Don't display dimensions as 0 when it is not set 2024-03-18 16:10:12 +01:00
shuangcui 5c95b5c933 chore: remove repetitive words
Signed-off-by: shuangcui <fliter@qq.com>
2024-03-14 21:28:55 +08:00
meili-bors[bot] abd954755d
Merge #4476
4476: Make the `/facet-search` route use the `sortFacetValuesBy` setting r=irevoire a=Kerollmops

This PR fixes #4423 by ensuring that the `/facet-search` route uses the `sortFacetValuesBy` setting.

Note for the documentation team (to be moved in the tracking issue): Using the new `sortFacetValuesBy` setting can slow down the facet-search requests as Meilisearch iterates over the whole list of facet values and computes the count of documents on every entry. That is hardly or even impossible to optimize correctly.

### TODO
 - [x] Create a custom HashMap wrapper for the facet `OrderBy` settings.
         This wrapper will return the `OrderBy` setting of the facet, if not defined will use the default `*` one, and if not there either (strange) will fall back on the lexicographic one.
- [x] Create a `ValuesCollection` wrapper that implements the logic for the lexicographic and count order by.
  - [x] Use it when there is no search query.
  - [x] Use it when there is a search query with and without allowed typos.
  - [x] Do not change the original logic, only use a wrapper.
- [x] Add tests

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-03-13 14:36:14 +00:00
Clément Renault f3fc2bd01f
Address some issues with preallocations 2024-03-13 15:22:14 +01:00
Clément Renault e0dac5a22f
Simplify the algorithm by using the new facet values collection wrapper 2024-03-13 11:31:34 +01:00
Clément Renault b918b55c6b
Introduce a new facet value collection wrapper to simply the usage 2024-03-13 11:31:34 +01:00
Clément Renault 306b25ad3a
Move the searchForFacetValues struct into a dedicated module 2024-03-13 10:24:21 +01:00
Clément Renault 9f7a4fbfeb
Return the facets of a placeholder facet-search sorted by count 2024-03-13 10:09:01 +01:00
meili-bors[bot] 5ed7b6a0b2
Merge #4456
4456: Add Ollama as an embeddings provider r=dureuill a=jakobklemm

# Pull Request

## Related issue
[Related Discord Thread](https://discord.com/channels/1006923006964154428/1211977150316683305)

## What does this PR do?
- Adds Ollama as a provider of Embeddings besides HuggingFace and OpenAI under the name `ollama`
- Adds the environment variable `MEILI_OLLAMA_URL` to set the embeddings URL of an Ollama instance with a default value of `http://localhost:11434/api/embeddings` if no variable is set
- Changes some of the structs and functions in `openai.rs` to be public so that they can be shared.
- Added more error variants for Ollama specific errors
- It uses the model `nomic-embed-text` as default, but any string value is allowed, however it won't automatically check if the model actually exists or is an embedding model

Tested against Ollama version `v0.1.27` and the `nomic-embed-text` model.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Co-authored-by: Jakob Klemm <jakob@jeykey.net>
Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>
2024-03-13 08:48:47 +00:00
Louis Dureuil ae67d5eef0
Update milli/src/vector/error.rs
Fix Meilisearch capitalization
2024-03-13 09:45:04 +01:00
Jakob Klemm 88bc9556a9
Add Ollama dimension inference and add clearer errors
Instead of the user manually specifying the model dimensions it will now automatically get determined
Just like with hf.rs the word "test" gets embedded to determine the dimensions of the output
Add a dedicated error type for if the model doesn't exist (don't automatically pull it though) and set the fault of that error to be the user
2024-03-12 19:59:11 +01:00
Clément Renault ca4876fd10
Do not reindex when modifying unknown faceted field 2024-03-12 16:18:58 +01:00
Clément Renault d3a95ea2f6
Introduce a new OrderByMap struct to simplify the sort by usage 2024-03-12 13:56:56 +01:00
Clément Renault 69c118ef76
Extract the facet order before extracting the facets values 2024-03-12 10:35:39 +01:00
meili-bors[bot] ee3076d5ba
Merge #4462
4462: Divide threshold by ten r=dureuill a=ManyTheFish

Change the facet incremental vs bulk indexing threshold to better fit our user needs, it might be changed in the future if we have more insights


Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-03-06 13:05:38 +00:00
meili-bors[bot] ab1224bfa7
Merge #4458
4458: Replace logging timer by spans r=Kerollmops a=dureuill

- Remove logging timer dependency.
- Remplace last uses in search by spans

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-05 16:43:23 +00:00
meili-bors[bot] eefc1c421e
Merge #4459
4459: Put a bound on OpenAI timeout r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #4460 

## What does this PR do?
- Makes sure that the timeout of the openai embedder is limited to max 1min, rather than the prior 15min+



Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-05 15:18:51 +00:00
Louis Dureuil 0c216048b5
Cap timeout duration 2024-03-05 12:19:25 +01:00
Louis Dureuil 36d17110d8
openai: Handle BAD_GETAWAY, be more resilient to failure 2024-03-05 12:18:54 +01:00
Louis Dureuil 25f64ce7df
Replace logging timer by spans 2024-03-05 11:05:42 +01:00
Louis Dureuil b11df7ec34
Meilisearch: fix some wrong spans 2024-03-05 10:11:43 +01:00
ManyTheFish eada6de261 Divide threshold by ten 2024-03-04 18:02:54 +01:00
Jakob Klemm d3004d8040
Implemented Ollama as an embeddings provider
Initial prototype of Ollama embeddings actually working, error handlign / retries still missing.

Allow model to be any String and require dimensions parameter

Fixed rustfmt formatting issues

There were some formatting issues in the initial PR and this should not make the changes comply with the Rust style guidelines

Because I accidentally didn't follow the style guide for commits in my commit messages I squashed them into one to comply
2024-03-04 15:09:43 +01:00
Louis Dureuil 452a343a2b
Fix imports 2024-02-28 18:09:40 +01:00
meili-bors[bot] b87485e80d
Merge #4433
4433: Enhance facet incremental r=Kerollmops a=ManyTheFish

# Pull Request

## Related issue
Fixes #4367
Fixes #4409

## What does this PR do?

- Add a test reproducing #4409
- Fix #4409 by removing a document from a level only if it is no more present in all the linked sub-level nodes
- Optimize facet Incremental indexing by creating or deleting a complete level once per field id instead of for each facet value
- Optimize facet Incremental indexing by doing the additions and the deletions in the same process instead of doing them separately


Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-02-28 15:28:46 +00:00
ManyTheFish 5e83bac448 Fix PR comments 2024-02-26 15:40:15 +01:00
Louis Dureuil 55796406c5
Add GPU analytics 2024-02-26 10:41:47 +01:00
ManyTheFish a493a50825 Fix clippy 2024-02-22 14:53:33 +01:00
ManyTheFish 9d1f489a37 Fix facet incremental indexing 2024-02-21 18:42:16 +01:00
meili-bors[bot] d34692e30b
Merge #4365
4365: Update charabia r=dureuill a=ManyTheFish

Update Charabia v0.8.7,

- Add Vietnamese Normalization (Ð and Đ into d)

Fixes #4357

Charabia versions:
- https://github.com/meilisearch/charabia/releases/tag/v0.8.6
- https://github.com/meilisearch/charabia/releases/tag/v0.8.7

Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-02-14 16:57:25 +00:00
ManyTheFish 78e04520fc Update charabia version 2024-02-14 15:16:16 +01:00
ManyTheFish 03bb6372af Change is_batchable_with by mergeable_with 2024-02-14 11:50:22 +01:00
ManyTheFish 3beda8833d Fix and add logs 2024-02-14 11:46:30 +01:00
ManyTheFish 55e942cd45 buggy 2024-02-13 15:26:30 +01:00
ManyTheFish 48026aa75c fix PR comments 2024-02-13 15:19:01 +01:00
Many the fish e5e811e2c9
Update milli/src/update/index_documents/extract/mod.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-02-13 14:22:21 +01:00
Many the fish 55de96f74e
Update milli/src/update/facet/mod.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-02-13 14:22:10 +01:00
ManyTheFish 39c83cb3d9 fix clippy 2024-02-12 09:12:54 +01:00
Louis Dureuil 7efb1cae11 yield in loop when the channel is not disconnected 2024-02-12 09:12:54 +01:00
Louis Dureuil 7877788510 fix logs 2024-02-12 09:12:54 +01:00
ManyTheFish be1b054b05
Compute chunk size based on the input data size ant the number of indexing threads 2024-02-08 17:28:37 +01:00
meili-bors[bot] 023c2d755f
Merge #4391
4391: Tracing r=dureuill a=irevoire

# Pull Request

- [ ] Hide the parameters of the process batch
- [x] Make actix-web trace every call on every route
- [x] Remove all `env_logger`/`logs` dependencies
- [x] Be able to enable or disable the memory measurement using the `/logs` route parameters

See the following product discussion: https://github.com/orgs/meilisearch/discussions/721

Supersedes https://github.com/meilisearch/meilisearch/pull/4338

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4317

## What does this PR do?

Update the format of the logs from:
```
[2024-02-06T14:54:11Z INFO  actix_server::builder] starting 10 workers
```

to

```
2024-02-06T13:58:14.710803Z  INFO actix_server::builder: 200: starting 10 workers
```

First, run meilisearch with the route enabled via the feature flag:
- `cargo run --experimental-enable-logs-route`
- Or at runtime by sending the following payload:
```
curl \
  -X PATCH 'http://localhost:7700/experimental-features/' \
  -H 'Content-Type: application/json'  \
--data-binary '{
    "logsRoute": true
  }'
```

Then gather data from meilisearch by calling for example:
```
curl \
	-X POST http://localhost:7700/logs \
	-H 'Content-Type: application/json' \
	--data-binary '{
	    "mode": "fmt",
            "target": "milli=trace"
    }'
```

Once your operation is over, tell meilisearch to stop the route:
```
curl \
	-X DELETE http://localhost:7700/logs
```

----

In the case you’re profiling code, you will be interested by the next command that converts the output of the route to a format that the firefox profiler can understand.

```bash
cargo run --release --bin trace-to-firefox -- 2024-01-17_17:07:55-indexing-trace.json
```

Then go to https://profiler.firefox.com and load it.
Note that we can also share the profiles using the https://share.firefox.dev website.


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-02-08 14:16:56 +00:00
Louis Dureuil 407ad753ed
rust fmt 2024-02-08 15:11:42 +01:00
Tamo bf43a3f60a
fix typo 2024-02-08 15:04:06 +01:00
Tamo 1502382316
use debug instead of debug_span 2024-02-08 15:04:06 +01:00
Tamo 08af0e690c
Structures a bunch of logs 2024-02-08 15:04:06 +01:00
Louis Dureuil db722d201a
Write entries into database downgraded to trace level 2024-02-08 15:04:05 +01:00
Tamo e773dfa9ba
get rids of log in milli and add logs for the bucket sort 2024-02-08 15:04:05 +01:00
Louis Dureuil 5d7061682e
Add tracing to milli 2024-02-08 15:03:31 +01:00