Clément Renault
0f578348f1
Introduce a dedicated function to write proximity entries in database
2024-06-05 17:30:07 +02:00
Clément Renault
fad4675abe
Give the settings diff to the write_typed_chunk_into_index function
2024-06-05 17:30:07 +02:00
Clément Renault
1ab03c4ede
Fix an issue with settings diff and * in the searchable attributes
2024-06-05 17:30:07 +02:00
Clément Renault
0c6e4b2f00
Introducing a new into_del_add_obkv_conditional_operation function
2024-06-05 17:30:07 +02:00
Clément Renault
42b3f52ef9
Introduce the SettingDiff only_additional_fields method
2024-06-05 17:30:07 +02:00
ManyTheFish
1ab88e10b9
Merge branch 'main' into merge-release-v1.8.1-in-main
2024-05-29 16:24:00 +02:00
Many the fish
e1fbfde6c4
Merge branch 'main' into merge-release-v1.8.1-in-main
2024-05-29 11:31:03 +02:00
ManyTheFish
27b75ec648
merge main into v1.8.1
2024-05-29 11:26:07 +02:00
Louis Dureuil
d35278320e
Add support functions for accessing arroy writers and readers
2024-05-28 15:27:43 +02:00
Clément Renault
dc949ab46a
Remove puffin usage
2024-05-27 15:59:14 +02:00
meili-bors[bot]
19acc65ad2
Merge #4646
...
4646: Reduce `Transform`'s disk usage r=Kerollmops a=Kerollmops
This PR implements what is described in #4485 . It reduces the number of disk writes and disk usage.
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-23 16:06:50 +00:00
Clément Renault
fe17c0f52e
Construct the minimal OBKVs according to the settings diff
2024-05-23 11:23:57 +02:00
Clément Renault
bc5663e673
FieldIdsMap no longer useful thanks to #4631
2024-05-22 16:06:15 +02:00
Louis Dureuil
8a941c0241
Smaller review changes
2024-05-22 14:44:42 +02:00
Louis Dureuil
16037e2169
Don't remove embedders that are not in the config from the document DB
2024-05-22 12:24:51 +02:00
Clément Renault
500ddc76b5
Make the flattened sorter optional
2024-05-21 16:16:36 +02:00
Clément Renault
1aa8ed9ef7
Make the original sorter optional
2024-05-21 14:53:26 +02:00
ManyTheFish
f762307838
Fix clippy
2024-05-21 13:44:20 +02:00
ManyTheFish
3e94a90722
Fixes
2024-05-21 13:39:46 +02:00
ManyTheFish
fc7e817221
Index geo points based on the settings differences
2024-05-20 12:27:26 +02:00
Louis Dureuil
d05d49ffd8
Fix tests
2024-05-20 10:36:18 +02:00
Louis Dureuil
0462ebbe58
Don't write an empty _vectors field
2024-05-20 10:36:18 +02:00
Louis Dureuil
2f7a8a4efb
Don't write vectors that weren't autogenerated in document DB
2024-05-20 10:36:18 +02:00
Louis Dureuil
52d9cb6e5a
Refactor vector indexing
...
- use the parsed_vectors module
- only parse `_vectors` once per document, instead of once per embedder per document
2024-05-20 10:36:17 +02:00
Tamo
897d25780e
update milli to latest version
2024-05-16 18:31:32 +02:00
Tamo
f2d0a59f1d
when no searchable attributes are defined, makes all the weight equals to zero
2024-05-16 01:06:33 +02:00
Tamo
ad4d8502b3
stops storing the whole fieldids weights map when no searchable are defined
2024-05-15 17:16:10 +02:00
Tamo
7ec4e2a3fb
apply all style review comments
2024-05-15 15:02:26 +02:00
Tamo
caa6a7149a
make the attribute ranking rule use the weights and fix the tests
2024-05-14 17:36:32 +02:00
Tamo
b0afe0972e
stop updating the fields ids map when fields are only swapped
2024-05-14 17:00:02 +02:00
Tamo
685f452fb2
Fix the indexing of the searchable
2024-05-14 17:00:02 +02:00
Tamo
4e4a1ddff7
gate a test behind the required feature
2024-05-14 17:00:02 +02:00
Tamo
c22460045c
Stops returning an option in the internal searchable fields
2024-05-14 17:00:02 +02:00
meili-bors[bot]
4d5971f343
Merge #4621
...
4621: Bring back changes from v1.8.0 into main r=curquiza a=curquiza
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-06 13:46:39 +00:00
meili-bors[bot]
ebca29f3de
Merge #4597
...
4597: Fix embeddings settings update r=ManyTheFish a=ManyTheFish
# Pull Request
- add some conditions reducing the work done when changing the settings
- add some benchmarks on embedders
## Related issue
Fixes #4585
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-04-25 16:37:28 +00:00
Clément Renault
d4aeff92d0
Introduce the ThreadPoolNoAbort wrapper
2024-04-24 16:40:12 +02:00
Clément Renault
96cc5319c8
Introduce a new internal error type to categorize panics
2024-04-22 18:09:33 +02:00
Clément Renault
0c7003c5df
Introduce an atomic to catch panics in thread pools
2024-04-22 18:09:33 +02:00
ManyTheFish
a1aa999026
Add conditions reducing wrok
2024-04-22 14:18:35 +02:00
ManyTheFish
df29ba709a
Make some cleaning in Arcs
2024-04-17 12:33:25 +02:00
ManyTheFish
3acfab2eb7
Fix PR comments
2024-04-17 10:55:51 +02:00
ManyTheFish
87a93ba47d
fix clippy
2024-04-16 14:39:30 +02:00
ManyTheFish
eaf113ef34
Fix wod pair proximity error when nothing has to be extracted
2024-04-16 14:39:30 +02:00
ManyTheFish
e5ae337aae
Comeback to sorters in extract_word_docids
...
using buffers and merge the keys manually is less efficient
2024-04-16 14:39:30 +02:00
ManyTheFish
a489b406b4
fix test
2024-04-16 14:39:06 +02:00
ManyTheFish
02c3d6b265
finish work
2024-04-16 14:39:06 +02:00
ManyTheFish
b5e4a55af6
refactor faceted and searchable pipeline
2024-04-16 14:39:06 +02:00
ManyTheFish
a7e368aaa6
Create InnerIndexSettingsDiffs struct and populate it
2024-04-16 14:39:06 +02:00
ManyTheFish
893200ab87
Avoid clearing documents in transform
2024-04-16 14:39:06 +02:00
ManyTheFish
aabce52b1b
Fix test
2024-04-16 14:39:06 +02:00
ManyTheFish
8fff5fc281
update tests
2024-04-16 14:39:06 +02:00
yudrywet
cf864a1c2e
chore: fix some typos in comments
...
Signed-off-by: yudrywet <yudeyao@yeah.net>
2024-04-14 20:11:34 +08:00
Louis Dureuil
466d718a05
Fix test
2024-04-04 15:58:19 +02:00
meili-bors[bot]
56bf8503db
Merge #4537
...
4537: Expose distribution shift in settings r=ManyTheFish a=dureuill
See [usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#d652adc0890445658aaf36352dbc8802 )
# Changes
- Distribution shift added to all embedders.
- Exposed in settings
- Changed the reindexing logic to not trigger a reindex operation when only the distribution shift or API key change
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-03 09:08:58 +00:00
redistay
182cb42953
chore: fix some typos in conments
...
Signed-off-by: redistay <wujunjing@outlook.com>
2024-04-02 19:37:55 +08:00
meili-bors[bot]
92a049c2dd
Merge #4543
...
4543: Bring back changes from v1.7.4 into main r=Kerollmops a=dureuill
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: dureuill <dureuill@users.noreply.github.com>
2024-03-28 16:53:51 +00:00
Louis Dureuil
796213af9a
Merge branch 'main' into tmp-release-v1.7.4
2024-03-28 10:51:49 +01:00
Louis Dureuil
ee8cbea810
Don't optimize reindexing when fields contain dots
2024-03-27 17:04:45 +01:00
Louis Dureuil
572fb3a51d
Finer granularity for embedder needs reindex
2024-03-27 12:01:34 +01:00
Louis Dureuil
afd1da5642
Add distribution to all embedders
2024-03-27 11:50:22 +01:00
Louis Dureuil
817ccc089a
also allow api_key
2024-03-25 11:50:00 +01:00
Louis Dureuil
4136630ea5
Use constants instead of raw strings in set_*set()
2024-03-25 11:39:33 +01:00
Louis Dureuil
58972f35cb
Allow url
parameter for ollama embedder
2024-03-25 11:32:55 +01:00
Louis Dureuil
dfa5e41ea6
Check validity of the URL setting
2024-03-25 11:23:16 +01:00
Louis Dureuil
a1db342f01
Expose REST embedder to the API
2024-03-25 11:23:15 +01:00
Louis Dureuil
f87747f4d3
Remove unwraps
2024-03-25 11:23:04 +01:00
Louis Dureuil
ac52c857e8
Update ollama and openai impls to use the rest embedder internally
2024-03-25 11:23:03 +01:00
meili-bors[bot]
fc1c3f4a29
Merge #4466
...
4466: Implements the search cutoff r=irevoire a=irevoire
# Pull Request
## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4488
## What does this PR do?
- Adds a cutoff to the bucket sort after 150ms has been spent
- Adds a new setting to customize the default value of 150ms
- When the time is exceeded, we exit early with what we had the time to sort
- If the cutoff has been reached, the search details are updated with a new `Skip` ranking details for the ranking rules that were skipped
- Adds analytics to measure the total number of degraded search requests
- Adds the number of degraded search requests to the Prometheus metrics and Grafana dashboard
- The cutoff **must not** skip the filters; otherwise, we would leak documents to people who don’t have the right to see them
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-20 13:06:53 +00:00
Tamo
c5322df519
Revert "Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1""
2024-03-20 10:08:28 +01:00
Tamo
567194b925
Revert "Merge remote-tracking branch 'origin/main' into release-v1.7.1"
...
This reverts commit bd74cce86a
, reversing
changes made to d2f77e88bd
.
2024-03-19 16:56:21 +01:00
Clément Renault
bd74cce86a
Merge remote-tracking branch 'origin/main' into release-v1.7.1
2024-03-19 13:39:17 +01:00
Tamo
d1db495119
add a settings for the search cutoff
2024-03-19 10:28:23 +01:00
meili-bors[bot]
abd954755d
Merge #4476
...
4476: Make the `/facet-search` route use the `sortFacetValuesBy` setting r=irevoire a=Kerollmops
This PR fixes #4423 by ensuring that the `/facet-search` route uses the `sortFacetValuesBy` setting.
Note for the documentation team (to be moved in the tracking issue): Using the new `sortFacetValuesBy` setting can slow down the facet-search requests as Meilisearch iterates over the whole list of facet values and computes the count of documents on every entry. That is hardly or even impossible to optimize correctly.
### TODO
- [x] Create a custom HashMap wrapper for the facet `OrderBy` settings.
This wrapper will return the `OrderBy` setting of the facet, if not defined will use the default `*` one, and if not there either (strange) will fall back on the lexicographic one.
- [x] Create a `ValuesCollection` wrapper that implements the logic for the lexicographic and count order by.
- [x] Use it when there is no search query.
- [x] Use it when there is a search query with and without allowed typos.
- [x] Do not change the original logic, only use a wrapper.
- [x] Add tests
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-03-13 14:36:14 +00:00
meili-bors[bot]
5ed7b6a0b2
Merge #4456
...
4456: Add Ollama as an embeddings provider r=dureuill a=jakobklemm
# Pull Request
## Related issue
[Related Discord Thread](https://discord.com/channels/1006923006964154428/1211977150316683305 )
## What does this PR do?
- Adds Ollama as a provider of Embeddings besides HuggingFace and OpenAI under the name `ollama`
- Adds the environment variable `MEILI_OLLAMA_URL` to set the embeddings URL of an Ollama instance with a default value of `http://localhost:11434/api/embeddings ` if no variable is set
- Changes some of the structs and functions in `openai.rs` to be public so that they can be shared.
- Added more error variants for Ollama specific errors
- It uses the model `nomic-embed-text` as default, but any string value is allowed, however it won't automatically check if the model actually exists or is an embedding model
Tested against Ollama version `v0.1.27` and the `nomic-embed-text` model.
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Co-authored-by: Jakob Klemm <jakob@jeykey.net>
Co-authored-by: Louis Dureuil <louis.dureuil@gmail.com>
2024-03-13 08:48:47 +00:00
Jakob Klemm
88bc9556a9
Add Ollama dimension inference and add clearer errors
...
Instead of the user manually specifying the model dimensions it will now automatically get determined
Just like with hf.rs the word "test" gets embedded to determine the dimensions of the output
Add a dedicated error type for if the model doesn't exist (don't automatically pull it though) and set the fault of that error to be the user
2024-03-12 19:59:11 +01:00
Clément Renault
ca4876fd10
Do not reindex when modifying unknown faceted field
2024-03-12 16:18:58 +01:00
Clément Renault
d3a95ea2f6
Introduce a new OrderByMap struct to simplify the sort by usage
2024-03-12 13:56:56 +01:00
meili-bors[bot]
ee3076d5ba
Merge #4462
...
4462: Divide threshold by ten r=dureuill a=ManyTheFish
Change the facet incremental vs bulk indexing threshold to better fit our user needs, it might be changed in the future if we have more insights
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-03-06 13:05:38 +00:00
Louis Dureuil
b11df7ec34
Meilisearch: fix some wrong spans
2024-03-05 10:11:43 +01:00
ManyTheFish
eada6de261
Divide threshold by ten
2024-03-04 18:02:54 +01:00
Jakob Klemm
d3004d8040
Implemented Ollama as an embeddings provider
...
Initial prototype of Ollama embeddings actually working, error handlign / retries still missing.
Allow model to be any String and require dimensions parameter
Fixed rustfmt formatting issues
There were some formatting issues in the initial PR and this should not make the changes comply with the Rust style guidelines
Because I accidentally didn't follow the style guide for commits in my commit messages I squashed them into one to comply
2024-03-04 15:09:43 +01:00
ManyTheFish
5e83bac448
Fix PR comments
2024-02-26 15:40:15 +01:00
ManyTheFish
a493a50825
Fix clippy
2024-02-22 14:53:33 +01:00
ManyTheFish
9d1f489a37
Fix facet incremental indexing
2024-02-21 18:42:16 +01:00
ManyTheFish
03bb6372af
Change is_batchable_with by mergeable_with
2024-02-14 11:50:22 +01:00
ManyTheFish
3beda8833d
Fix and add logs
2024-02-14 11:46:30 +01:00
ManyTheFish
48026aa75c
fix PR comments
2024-02-13 15:19:01 +01:00
Many the fish
e5e811e2c9
Update milli/src/update/index_documents/extract/mod.rs
...
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-02-13 14:22:21 +01:00
Many the fish
55de96f74e
Update milli/src/update/facet/mod.rs
...
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-02-13 14:22:10 +01:00
ManyTheFish
39c83cb3d9
fix clippy
2024-02-12 09:12:54 +01:00
Louis Dureuil
7efb1cae11
yield in loop when the channel is not disconnected
2024-02-12 09:12:54 +01:00
Louis Dureuil
7877788510
fix logs
2024-02-12 09:12:54 +01:00
ManyTheFish
be1b054b05
Compute chunk size based on the input data size ant the number of indexing threads
2024-02-08 17:28:37 +01:00
meili-bors[bot]
023c2d755f
Merge #4391
...
4391: Tracing r=dureuill a=irevoire
# Pull Request
- [ ] Hide the parameters of the process batch
- [x] Make actix-web trace every call on every route
- [x] Remove all `env_logger`/`logs` dependencies
- [x] Be able to enable or disable the memory measurement using the `/logs` route parameters
See the following product discussion: https://github.com/orgs/meilisearch/discussions/721
Supersedes https://github.com/meilisearch/meilisearch/pull/4338
## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4317
## What does this PR do?
Update the format of the logs from:
```
[2024-02-06T14:54:11Z INFO actix_server::builder] starting 10 workers
```
to
```
2024-02-06T13:58:14.710803Z INFO actix_server::builder: 200: starting 10 workers
```
First, run meilisearch with the route enabled via the feature flag:
- `cargo run --experimental-enable-logs-route`
- Or at runtime by sending the following payload:
```
curl \
-X PATCH 'http://localhost:7700/experimental-features/ ' \
-H 'Content-Type: application/json' \
--data-binary '{
"logsRoute": true
}'
```
Then gather data from meilisearch by calling for example:
```
curl \
-X POST http://localhost:7700/logs \
-H 'Content-Type: application/json' \
--data-binary '{
"mode": "fmt",
"target": "milli=trace"
}'
```
Once your operation is over, tell meilisearch to stop the route:
```
curl \
-X DELETE http://localhost:7700/logs
```
----
In the case you’re profiling code, you will be interested by the next command that converts the output of the route to a format that the firefox profiler can understand.
```bash
cargo run --release --bin trace-to-firefox -- 2024-01-17_17:07:55-indexing-trace.json
```
Then go to https://profiler.firefox.com and load it.
Note that we can also share the profiles using the https://share.firefox.dev website.
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-02-08 14:16:56 +00:00
Louis Dureuil
407ad753ed
rust fmt
2024-02-08 15:11:42 +01:00
Tamo
bf43a3f60a
fix typo
2024-02-08 15:04:06 +01:00
Tamo
1502382316
use debug instead of debug_span
2024-02-08 15:04:06 +01:00
Tamo
08af0e690c
Structures a bunch of logs
2024-02-08 15:04:06 +01:00
Louis Dureuil
db722d201a
Write entries into database downgraded to trace level
2024-02-08 15:04:05 +01:00
Tamo
e773dfa9ba
get rids of log in milli and add logs for the bucket sort
2024-02-08 15:04:05 +01:00
Louis Dureuil
5d7061682e
Add tracing to milli
2024-02-08 15:03:31 +01:00
meili-bors[bot]
72ebac1fbb
Merge #4388
...
4388: Cap the maximum memory of the grenad sorters r=curquiza a=Kerollmops
This PR clamps the memory usage of the grenad sorters to a reasonable maximum. Grenad sorters are opened on multiple threads at a time. This can result in higher memory usage than expected, even though it shouldn't consume more than the memory available.
Fixes #4152 .
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-02-08 13:19:28 +00:00
Louis Dureuil
88d03c56ab
Don't accept dimensions of 0 (ever) or dimensions greater than the default dimensions of the model
2024-02-07 11:52:09 +01:00
Louis Dureuil
517f5332d6
Allow actually passing dimensions
for OpenAI source
...
-> make sure the settings change is rejected or the settings task fails when the specified model doesn't support
overriding `dimensions` and the passed `dimensions` differs from the model's default dimensions.
2024-02-07 11:51:44 +01:00
Clément Renault
053306c0e7
Try with 500MiB
2024-02-07 11:24:43 +01:00
Clément Renault
9eeb75d501
Clamp the max memory of the grenad sorters to a reasonable maximum
2024-02-06 10:47:04 +01:00
Louis Dureuil
fbf5f2a392
Don't use a runtime in extract_embedder, use it only for OpenAI
2024-02-01 10:33:27 +01:00
Tamo
9f8f3105d5
make clippy happy
2024-02-01 10:33:27 +01:00
Tamo
318843aacd
add a bunch of tests and fix the error message when adding the geosearch as filterable/sortable while there is malformed documents in the DB
2024-02-01 10:33:27 +01:00
Tamo
c1bf33a112
Revert "Remove panic on the geosearch"
2024-01-25 18:51:19 +01:00
Tamo
0887186ecf
make clippy happy
2024-01-17 16:07:10 +01:00
Tamo
7d190d8078
add a bunch of tests and fix the error message when adding the geosearch as filterable/sortable while there is malformed documents in the DB
2024-01-17 15:51:52 +01:00
Clément Renault
01e2c3d6bb
Bump arroy to v0.2.0
2024-01-16 16:45:55 +01:00
Clément Renault
9f9ad4cc05
Fix Clippy warnings
2024-01-16 15:27:24 +01:00
Clément Renault
3ee7682fa7
Fix some integer comparisons
2024-01-16 15:22:23 +01:00
Tamo
54ae6951eb
fix warning
2024-01-02 15:19:30 +01:00
Louis Dureuil
6ff81de401
Fix tests
2023-12-20 17:16:46 +01:00
Louis Dureuil
9123370e90
Validate fused settings in settings task after fusing with existing setting
2023-12-20 17:16:46 +01:00
Louis Dureuil
e249e4db7b
Change Setting::apply function signature
2023-12-20 17:15:24 +01:00
Many the fish
9e1b458010
Merge branch 'main' into change-proximity-precision-settings
2023-12-18 09:08:47 +01:00
ManyTheFish
6425996e36
Change the naming of attributeScale and wordScale into byAttribute and byWord
2023-12-14 16:31:00 +01:00
Louis Dureuil
87bba98bd8
Various changes
...
- fixed seed for arroy
- check vector dimensions as soon as it is provided to search
- don't embed whitespace
2023-12-14 16:08:42 +01:00
Louis Dureuil
b8e4709dfa
Remove prompt strategy and fallback
2023-12-14 16:08:41 +01:00
Louis Dureuil
806e5b6899
Tests pass
2023-12-14 16:08:41 +01:00
Louis Dureuil
e0cc775dc4
Various changes
...
- DistributionShift in Search object (to be set from model in embed?)
- Fix issue where embedder index wasn't computed at search time
- Accept as default embedder either the "default" one, or the only embedder when there is only one
2023-12-14 16:08:41 +01:00
Louis Dureuil
12940d79a9
WIP
...
- manual embedder
- multi embedders OK
- clippy + tests OK
2023-12-14 16:08:41 +01:00
Louis Dureuil
922a640188
WIP multi embedders
...
fixed template bugs
2023-12-14 16:08:41 +01:00
Louis Dureuil
65e49b7092
Remove stuff, add distribution shift (WIP)
2023-12-14 16:08:38 +01:00
Louis Dureuil
e56f160032
Actually pass embedders on reindex
2023-12-14 16:07:49 +01:00
Louis Dureuil
687d92f217
prompt bifluor+
2023-12-14 16:07:49 +01:00
Louis Dureuil
fb539f61fe
WIP
2023-12-14 16:07:49 +01:00
Louis Dureuil
cb4ebe163e
WIP
2023-12-14 16:07:49 +01:00
Louis Dureuil
dde3a04679
WIP arroy integration
2023-12-14 16:07:49 +01:00
Louis Dureuil
13c2c6c16b
Small commit to add hybrid search and autoembedding
2023-12-14 16:07:48 +01:00
ManyTheFish
467b49153d
Implement proximityPrecision setting on milli side
2023-12-06 15:49:02 +01:00
ManyTheFish
bddc168d83
List TODOs
2023-12-06 14:59:23 +01:00
Clément Renault
d32eb11329
Move to the v0.20.0-alpha.9 of heed
2023-11-27 11:52:22 +01:00
Clément Renault
0dbf1a16ff
Make clippy happy
2023-11-23 14:11:38 +01:00
Clément Renault
462b4c0080
Fix the tests
2023-11-23 12:07:35 +01:00
Clément Renault
0d4482625a
Make the changes to use heed v0.20-alpha.6
2023-11-23 11:43:58 +01:00
ManyTheFish
d3575fb028
Make into_del_add_obkv parameters more human readable
2023-11-20 16:10:39 +01:00
ManyTheFish
39cbb499c2
Small fixes
2023-11-20 10:20:39 +01:00
ManyTheFish
ebef6bc24d
Simplify documents database writing
2023-11-20 10:14:57 +01:00
ManyTheFish
d59b7db8d0
remove unused code
2023-11-20 10:10:45 +01:00
ManyTheFish
263e825619
Fix typos in comments
2023-11-20 10:06:29 +01:00
Many the fish
b0adc73ce6
Merge pull request #4207 from meilisearch/diff-indexing-prefix-databases
...
Diff indexing prefix databases
2023-11-14 16:04:05 +01:00
Louis Dureuil
772964125d
Factor removal of document from DB
2023-11-13 13:51:22 +01:00
Louis Dureuil
264b10ec20
Fixup documentation
2023-11-09 16:23:20 +01:00
Louis Dureuil
3053e01c05
Batch::remove_documents_from_db_no_batch
2023-11-09 14:23:02 +01:00
Louis Dureuil
9cef800b2a
Enrich uses the new type
2023-11-09 14:22:05 +01:00
ManyTheFish
882ab9cc85
remove warnings
2023-11-09 11:35:33 +01:00
ManyTheFish
5a9c96e1db
Compute word integer prefix cache
2023-11-09 11:34:26 +01:00
ManyTheFish
70ce40828c
Compute word docids prefix cache
2023-11-08 17:01:00 +01:00
ManyTheFish
688266c83e
Remove word pair proximity prefix cache and compute it at search time
2023-11-08 14:16:01 +01:00
ManyTheFish
6dab826908
Reactivate prefix databases
2023-11-08 13:58:01 +01:00
ManyTheFish
1e2fbc6a42
revert "REVERT ME: ignore prefix pair databases tests"
...
This reverts commit 1b2ea6cf19
.
2023-11-08 11:50:52 +01:00
Louis Dureuil
cbaa54cafd
Fix clippy issues
2023-11-06 11:19:31 +01:00
Louis Dureuil
1bccf2079e
Correctly mark non-tests as non-tests
2023-11-06 11:03:56 +01:00
ManyTheFish
1b2ea6cf19
REVERT ME: ignore prefix pair databases tests
2023-11-06 10:46:22 +01:00
Louis Dureuil
1ad1fcc8c8
Remove all warnings
2023-11-06 10:31:14 +01:00
ManyTheFish
87610a5f98
Don't try to delete a document that is not in the database
2023-11-02 16:49:03 +01:00
Clément Renault
ff522c919d
Fix the vector extractions for the diff indexing
2023-11-02 15:58:08 +01:00
ManyTheFish
bf0651f23c
Implement iter method on ExternalDocumentsIds
2023-11-02 15:38:00 +01:00
ManyTheFish
5b20e625f3
fix merge
2023-11-02 15:31:37 +01:00
ManyTheFish
bc51d6157a
Fix transform reindexing path
2023-11-02 15:26:20 +01:00
ManyTheFish
1b4ff991c0
update typed chunks
2023-11-02 15:26:20 +01:00
ManyTheFish
4b64c33aa2
update vector extractor
2023-11-02 15:26:20 +01:00
ManyTheFish
12323d610e
Change the original document sorter key from the internal docid to a concatenation of the internal and the external docid
2023-11-02 15:26:20 +01:00
Clément Renault
4d864f0702
Always sort internal Sorter entries in parallel
2023-11-02 14:47:43 +01:00
Clément Renault
c71b1d33ae
Sort entries using rayon in the transform sorters
2023-11-01 11:07:16 +01:00
Clément Renault
0fc446c62f
Add more timing logs to the Transform
2023-11-01 11:07:16 +01:00
Louis Dureuil
0fb6acefc3
Add snapshots for facets
2023-10-31 17:11:08 +01:00
Louis Dureuil
b1d1355b69
remove tests on soft-deleted
2023-10-31 16:36:27 +01:00
Louis Dureuil
f19332466e
Extract field value as values instead of Option<Value>
2023-10-31 16:36:27 +01:00
Louis Dureuil
03ddb4f310
use deladd in facet update tests
2023-10-31 16:36:27 +01:00
Louis Dureuil
da0503ef80
Fix document count
2023-10-31 16:36:27 +01:00
Louis Dureuil
b40253bf18
update snapshots
2023-10-31 10:30:48 +01:00
Louis Dureuil
d8bf3f3fc2
Remove unused snapshots
2023-10-31 10:12:49 +01:00
Louis Dureuil
9d59e8011a
fix some tests
2023-10-31 10:08:36 +01:00
Louis Dureuil
dad78cbf8d
Bulk facet remove deletes keys from DB when value empty
2023-10-31 09:53:55 +01:00
Louis Dureuil
4e91707a06
Rename test
2023-10-31 09:41:17 +01:00
Louis Dureuil
de10f20732
Fix field distribution again
2023-10-30 17:47:22 +01:00
Louis Dureuil
be395c7944
Change order of arguments to tokenizer_builder
2023-10-30 16:26:29 +01:00
Louis Dureuil
9fedd8101a
Fix tests
2023-10-30 15:11:07 +01:00
Louis Dureuil
54d07a8da3
Update field distribution taking into account both deletions and additions
2023-10-30 14:47:51 +01:00
Louis Dureuil
58690dfb19
Fix tests compilation after changes to ExternalDocumentsIds API
2023-10-30 13:34:07 +01:00
Louis Dureuil
abf424ebfc
Remove unused FromIterator
2023-10-30 11:41:56 +01:00
Clément Renault
dfab6293c9
Use an LMDB database to store the external documents ids
2023-10-30 11:41:23 +01:00
Louis Dureuil
fdf3f7f627
Fix facet distribution test
2023-10-30 11:41:23 +01:00
Louis Dureuil
6260cff65f
Actually delete documents from DB when the merge function says so
2023-10-30 11:41:22 +01:00
Louis Dureuil
8e0d9c9a5e
Recover delete_documents tests that were too eagerly deleted
2023-10-30 11:41:22 +01:00
Louis Dureuil
a35988550c
Fix some snapshots
2023-10-30 11:41:22 +01:00
Louis Dureuil
e78281785c
Actually execute the transform even if there are only documents to delete
2023-10-30 11:41:22 +01:00
Louis Dureuil
290e773d23
remove more warnings and fix some tests
2023-10-30 11:41:22 +01:00
Louis Dureuil
113527f466
Remove soft-deleted related methods from Index
2023-10-30 11:41:22 +01:00
Louis Dureuil
c534a1b687
Stop using delete documents pipeline in batch runner
2023-10-30 11:41:22 +01:00
Louis Dureuil
2263dff02b
Stop using removed delete pipelines almost everywhere
2023-10-30 11:41:22 +01:00
Louis Dureuil
d651b3ef01
Remove delete documents files
2023-10-30 11:41:20 +01:00
ManyTheFish
762b0b47e6
Use deladd merging function in chunks mergers
2023-10-30 11:40:20 +01:00
Louis Dureuil
01d5eedf2f
Remove some warnings
2023-10-30 11:40:20 +01:00
Louis Dureuil
073f89db79
Fix facet tests
2023-10-30 11:40:20 +01:00
Louis Dureuil
85f42fbc03
Handle external to internal id mapping from TypedChunk::Documents
2023-10-30 11:40:20 +01:00
Louis Dureuil
c6b3c18c85
WIP: Comment out document deletion in other pipelines than update
...
TODO: fix calls to DELETE route
2023-10-30 11:40:20 +01:00
Louis Dureuil
946c762d28
WIP: reset documents in TypedChunk::Documents
2023-10-30 11:40:20 +01:00
Louis Dureuil
cda6ca1ee6
Remove TypedChunk::NewDocumentIds
2023-10-30 11:40:18 +01:00
Louis Dureuil
696fcf4d18
Fix document insertion into LMDB
2023-10-30 11:39:31 +01:00
ManyTheFish
476e4d3dbe
Use value buffer instead of the initial value when writting the final result in the sorter
2023-10-30 11:39:31 +01:00
Clément Renault
576fa9c6da
Remove useless comment
2023-10-30 11:39:31 +01:00
Kerollmops
77dcbff6b2
Remove and Insert the DelAdd geo points
2023-10-30 11:39:31 +01:00
Kerollmops
544440c363
Ignore geo fields when the Del and Add content is the same
2023-10-30 11:39:31 +01:00
Clément Renault
a3dae4db9b
Extract the geo fields DelAdd and generate a new DelAdd obkv with it
2023-10-30 11:39:31 +01:00
ManyTheFish
ba90a5ec0e
update extract fid word count docids
2023-10-30 11:39:31 +01:00
Louis Dureuil
b26dc9aabe
Explanatory code comment
2023-10-30 11:39:31 +01:00
Louis Dureuil
66abac9364
Use specialized KvReaderDelAdd
type
...
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-10-30 11:39:31 +01:00
Louis Dureuil
59f88c14b3
Simplify facet update after removing Index::faceted_documents_ids
2023-10-30 11:39:29 +01:00
Louis Dureuil
14832cb324
Remove Index::faceted_documents_ids
2023-10-30 11:37:32 +01:00
Louis Dureuil
04ec293024
Facet Incremental update
2023-10-30 11:37:30 +01:00
Louis Dureuil
f67ff3a738
Facets Bulk update
2023-10-30 11:36:40 +01:00
Clément Renault
560e8f5613
Introduce the CboRoaringBitmapCodec merge_deladd_into and use it
2023-10-30 11:34:55 +01:00
Clément Renault
2d3f15f82c
Introduce a function to only serialize the Add side of a DelAdd obkv
2023-10-30 11:34:55 +01:00
Clément Renault
40186bf403
Rename FieldIdWordCountDocids correctly
2023-10-30 11:34:50 +01:00
ManyTheFish
87e3d27878
update extract word pair proximity to support deladd obkvs
2023-10-30 11:34:02 +01:00
ManyTheFish
6bcf8b4f8c
update extract word position docids
2023-10-30 11:34:02 +01:00
ManyTheFish
46aa75abdb
update extract word docids
2023-10-30 11:34:02 +01:00
ManyTheFish
2597bbd107
Make script language docids map taking a tuple of roaring bitmaps expressing the deletions and the additions
2023-10-30 11:34:00 +01:00
Clément Renault
e2bc054604
Update extract_facet_string_docids to support deladd obkvs
2023-10-30 11:32:36 +01:00
Clément Renault
fcd3a1434d
Update extract_facet_number_docids to support deladd obkvs
2023-10-30 11:31:04 +01:00
Clément Renault
a82dee21e0
Rename docid_fid into fid_docid
2023-10-30 11:31:02 +01:00
Clément Renault
bc45c1206d
Implement all the facet extraction paths and simplify them
2023-10-30 11:29:08 +01:00
Clément Renault
6ae4100f07
Generate the DelAdd for is_null, is_empty, and exists
2023-10-30 11:29:08 +01:00
Clément Renault
0c47defeee
Work on fid docid facet values rewrite
2023-10-30 11:29:06 +01:00
ManyTheFish
313b16bec2
Support diff indexing on extract_docid_word_positions
2023-10-30 11:24:19 +01:00
ManyTheFish
1dd97578a8
Make the transform struct return diff-based documents obkvs
2023-10-30 11:22:07 +01:00
ManyTheFish
f5ef69293b
deactivate prefix dbs
2023-10-30 11:22:07 +01:00
ManyTheFish
1c5705c164
clean PR warnings
2023-10-30 11:22:05 +01:00
ManyTheFish
66c2c82a18
Split wpp in several sorters
2023-10-30 11:15:02 +01:00
ManyTheFish
28a8d0ccda
Fix word pair proximity
2023-10-30 11:15:02 +01:00
ManyTheFish
96be85396d
Use a vecDeque in wpp database
2023-10-30 11:15:02 +01:00
ManyTheFish
df9e5c8651
Generalize usage of CboRoaringBitmap codec to ease the use
2023-10-30 11:15:02 +01:00
ManyTheFish
b541d48847
Add buffer to the obkv writter
2023-10-30 11:15:02 +01:00
ManyTheFish
8ccf32d1a0
Compute word_fid_docids before word_docids and exact_word_docids
2023-10-30 11:15:02 +01:00
ManyTheFish
db1ca21231
add puffin in sorter into reeder function
2023-10-30 11:15:00 +01:00
ManyTheFish
11ea5acff9
Fix
2023-10-30 11:13:10 +01:00
ManyTheFish
8d77736a67
Fix fid_word_docids
2023-10-30 11:13:10 +01:00
ManyTheFish
748b333161
Add usefull debug assert before key insertion in database
2023-10-30 11:13:10 +01:00
ManyTheFish
17b647dfe5
Wip
2023-10-30 11:13:08 +01:00
meili-bors[bot]
5e0485d8dd
Merge #4131
...
4131: Reduce proximity range from 7 to 3 r=Kerollmops a=ManyTheFish
## Summary
This PR aims to reduce the impact of the proximity databases on the indexing time and on the database size by reducing the maximum distance between two words to be indexed in the proximity database.
## Stats
### Impact on database size and indexing time
![Impact on datasets](https://github.com/meilisearch/meilisearch/assets/6482087/28ed3d96-bdde-41c1-bdac-e90c1b1dbb23 )
### Impact on search relevancy
<details>
| dataset_name | host_name | Relevancy rate (Precision) | completion_rate 25.00% | completion_rate 50.00% | completion_rate 75.00% | completion_rate 100.00% |
|--------------|------------------|------------------------------------|-----------------|-----------------|-----------------|-----------------|
| FBIS | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | 1_4_0 | percentile-50 | 0.00% | 0.00% | 5.00% | 5.56% |
| FBIS | 1_4_0 | percentile-75 | 0.00% | 12.50% | 35.00% | 45.00% |
| FBIS | 1_4_0 | percentile-90 | 20.00% | 40.00% | | 100.00% |
| FBIS | 1_4_0 | average | 5.78% | 11.16% | 21.90% | 26.29% |
| FBIS | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FBIS | reduce_proximity | percentile-50 | 0.00% | 0.00% | 5.00% | 5.56% |
| FBIS | reduce_proximity | percentile-75 | 0.00% | 15.00% | 35.00% | 40.00% |
| FBIS | reduce_proximity | percentile-90 | 20.00% | 40.00% | 85.00% | 100.00% |
| FBIS | reduce_proximity | average | 5.55% | 11.34% | 21.75% | 26.14% |
| FR94 | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | 1_4_0 | percentile-50 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | 1_4_0 | percentile-75 | 0.00% | 5.00% | 15.00% | 42.11% |
| FR94 | 1_4_0 | percentile-90 | 15.00% | 54.55% | 100.00% | 100.00% |
| FR94 | 1_4_0 | average | 5.95% | 12.07% | 18.70% | 25.57% |
| FR94 | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | reduce_proximity | percentile-50 | 0.00% | 0.00% | 0.00% | 0.00% |
| FR94 | reduce_proximity | percentile-75 | 0.00% | 5.00% | 15.00% | 42.11% |
| FR94 | reduce_proximity | percentile-90 | 15.00% | 54.55% | 100.00% | 100.00% |
| FR94 | reduce_proximity | average | 5.79% | 12.00% | 18.70% | 25.53% |
| FT | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | 1_4_0 | percentile-50 | 0.00% | 0.00% | 5.00% | 10.00% |
| FT | 1_4_0 | percentile-75 | 0.00% | 15.00% | 30.00% | 40.00% |
| FT | 1_4_0 | percentile-90 | 20.00% | 50.00% | 65.00% | 100.00% |
| FT | 1_4_0 | average | 5.08% | 12.58% | 20.00% | 25.49% |
| FT | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| FT | reduce_proximity | percentile-50 | 0.00% | 0.00% | 5.00% | 10.00% |
| FT | reduce_proximity | percentile-75 | 0.00% | 15.00% | 30.00% | 40.00% |
| FT | reduce_proximity | percentile-90 | 10.00% | 45.00% | 60.00% | 100.00% |
| FT | reduce_proximity | average | 5.01% | 12.64% | 20.10% | 25.53% |
| LAT | 1_4_0 | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | 1_4_0 | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | 1_4_0 | percentile-50 | 0.00% | 0.00% | 5.00% | 5.00% |
| LAT | 1_4_0 | percentile-75 | 5.00% | 15.00% | 30.00% | 30.00% |
| LAT | 1_4_0 | percentile-90 | 15.00% | 45.00% | 60.00% | 80.00% |
| LAT | 1_4_0 | average | 4.80% | 11.80% | 17.88% | 21.62% |
| LAT | reduce_proximity | percentile-10 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | reduce_proximity | percentile-25 | 0.00% | 0.00% | 0.00% | 0.00% |
| LAT | reduce_proximity | percentile-50 | 0.00% | 0.00% | 5.00% | 5.00% |
| LAT | reduce_proximity | percentile-75 | 0.00% | 11.11% | 25.00% | 35.00% |
| LAT | reduce_proximity | percentile-90 | 15.00% | 45.00% | 55.00% | 80.00% |
| LAT | reduce_proximity | average | 4.43% | 11.23% | 17.32% | 21.45% |
</details>
### Impact on Search time
| dataset_name | host_name | 25.00% | 50.00% | 75.00% | 100.00% | Average |
|--------------|------------------|------------:|------------:|------------:|------------:|-------------|
| FBIS | 1_4_0 | 3.45 | 7.446666667 | 9.773489933 | 9.620300752 | 7.572614338 |
| FBIS | reduce_proximity | 2.983333333 | 5.316666667 | 6.911073826 | 7.637218045 | 5.712072968 |
| FR94 | 1_4_0 | 2.236666667 | 4.45 | 5.523489933 | 4.560150376 | 4.192576744 |
| FR94 | reduce_proximity | 2.09 | 3.991666667 | 4.981543624 | 4.266917293 | 3.832531896 |
| FT | 1_4_0 | 5.956666667 | 9.656666667 | 13.86912752 | 10.83270677 | 10.0787919 |
| FT | reduce_proximity | 4.51 | 5.981666667 | 7.701342282 | 6.766917293 | 6.23998156 |
| LAT | 1_4_0 | 5.856666667 | 9.233333333 | 12.98322148 | 10.78759398 | 9.715203865 |
| LAT | reduce_proximity | 6.91 | 6.706666667 | 8.463087248 | 8.265037594 | 7.586197877 |
## Technical approach
- Ensure the MAX_DISTANCE constant is used everywhere needed
- Reduce the MAX_DISTANCE from 8 to 4
## Related
TBD
Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-10-18 14:56:08 +00:00
ManyTheFish
27eec21415
Fix tests
2023-10-18 16:03:22 +02:00
Clément Renault
62dfd09dc6
Add more puffin logs to the deletion functions
2023-10-13 13:11:09 +02:00
Tamo
c0f2724c2d
get rids of the new introduced error code in favor of an io::Error
2023-10-10 15:12:23 +02:00