MeiliSearch/milli
bors[bot] d546f6f40e
Merge #563
563: Improve the `estimatedNbHits` when a `distinctAttribute` is specified r=irevoire a=Kerollmops

This PR is related to https://github.com/meilisearch/meilisearch/issues/2532 but it doesn't fix it entirely. It improves it by computing the excluded documents (the ones with an already-seen distinct value) before stopping the loop, I think it was a mistake and should always have been this way.

The reason it doesn't fix the issue is that Meilisearch is lazy, just to be sure not to compute too many things and answer by taking too much time. When we deduplicate the documents by their distinct value we must do it along the water, everytime we see a new document we check that its distinct value of it doesn't collide with an already returned document. 

The reason we can see the correct result when enough documents are fetched is that we were lucky to see all of the different distinct values possible in the dataset and all of the deduplication was done, no document can be returned.

If we wanted to implement that to have a correct `extimatedNbHits` every time we should have done a pass on the whole set of possible distinct values for the distinct attribute and do a big intersection, this could cost a lot of CPU cycles.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-06-22 12:39:44 +00:00
..
fuzz fix the indexing fuzzer 2022-04-25 18:32:06 +02:00
src Improve the estimatedNbHits when distinct is enabled 2022-06-22 11:39:21 +02:00
tests Add a test to check for the returned facet distribution 2022-04-26 18:12:58 +02:00
Cargo.toml Bump the milli version to 0.31.0 2022-06-22 12:08:16 +02:00
README.md update the readme + dependencies 2022-01-12 18:30:11 +01:00

Milli

Fuzzing milli

Currently you can only fuzz the indexation. To execute the fuzzer run:

cargo +nightly fuzz run indexing

To execute the fuzzer on multiple thread you can also run:

cargo +nightly fuzz run -j4 indexing

Since the fuzzer is going to create a lot of temporary file to let milli index its documents I would also recommand to execute it on a ramdisk. Here is how to setup a ramdisk on linux:

sudo mount -t tmpfs none path/to/your/ramdisk

And then set the TMPDIR environment variable to make the fuzzer create its file in it:

export TMPDIR=path/to/your/ramdisk