588: Fix name of "release_date" facet in movies benchmarks r=ManyTheFish a=loiclec
## What does this PR do?
The `movies.json` file in the benchmark datasets contains a filterable field called "release_date", but the indexing benchmarks wrongly called the field "released_date" instead. This PR fixes that.
Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
583: Use BufReader to read datasets in benchmarks r=ManyTheFish a=loiclec
## What does this PR do?
Ensure that the datasets used by the benchmarks are read efficiently by using a `BufReader`.
## Why?
Using a `BufReader` is more representative of how `meilisearch` works. It will also make performance comparisons between different branches of `milli` more accurate.
Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
572: Add reindexing benchmarks r=Kerollmops a=irevoire
With #557 coming, we should add benchmarks that measure our impact on the reindexing process.
Co-authored-by: Tamo <tamo@meilisearch.com>
557: Fasten documents deletion and update r=Kerollmops a=irevoire
When a document deletion occurs, instead of deleting the document we mark it as deleted in the new “soft deleted” bitmap. It is then removed from the search and all the other endpoints.
I ran the benchmarks against main;
```
% ./compare.sh indexing_main_83ad1aaf.json indexing_fasten-document-deletion_abab51fb.json
group indexing_fasten-document-deletion_abab51fb indexing_main_83ad1aaf
----- ------------------------------------------ ----------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable- 1.05 2.0±0.40ms ? ?/sec 1.00 1904.9±190.00µs ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable- 1.00 10.3±2.64ms ? ?/sec 961.61 9.9±0.12s ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested- 1.00 15.1±3.90ms ? ?/sec 554.63 8.4±0.12s ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable- 1.00 45.1±7.53ms ? ?/sec 710.15 32.0±0.10s ? ?/sec
indexing/-wiki-delete-searchable- 1.00 277.8±7.97ms ? ?/sec 1946.57 540.8±3.15s ? ?/sec
indexing/Indexing geo_point 1.00 12.0±0.20s ? ?/sec 1.03 12.4±0.19s ? ?/sec
indexing/Indexing movies in three batches 1.00 19.3±0.30s ? ?/sec 1.01 19.4±0.16s ? ?/sec
indexing/Indexing movies with default settings 1.00 18.8±0.09s ? ?/sec 1.00 18.9±0.10s ? ?/sec
indexing/Indexing nested movies with default settings 1.00 25.9±0.19s ? ?/sec 1.00 25.9±0.12s ? ?/sec
indexing/Indexing nested movies without any facets 1.00 24.8±0.17s ? ?/sec 1.00 24.8±0.18s ? ?/sec
indexing/Indexing songs in three batches with default settings 1.00 65.9±0.96s ? ?/sec 1.03 67.8±0.82s ? ?/sec
indexing/Indexing songs with default settings 1.00 58.8±1.11s ? ?/sec 1.02 59.9±2.09s ? ?/sec
indexing/Indexing songs without any facets 1.00 53.4±0.72s ? ?/sec 1.01 54.2±0.88s ? ?/sec
indexing/Indexing songs without faceted numbers 1.00 57.9±1.17s ? ?/sec 1.01 58.3±1.20s ? ?/sec
indexing/Indexing wiki 1.00 1065.2±13.26s ? ?/sec 1.00 1065.8±12.66s ? ?/sec
indexing/Indexing wiki in three batches 1.00 1182.4±6.20s ? ?/sec 1.01 1190.8±8.48s ? ?/sec
```
Most things do not change, we lost 0.1ms on the indexing of geo point (I don’t get why), and then we are between 500 and 1900 times faster when we delete documents.
Co-authored-by: Tamo <tamo@meilisearch.com>
577: Fix deserialisation of NDJson documents in benchmarks r=irevoire a=loiclec
Previously, the first document in the NDJson file was read over and over again. So the `geo_point` benchmark was not working properly: it only indexed one document.
Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
When a document deletion occurs, instead of deleting the document we mark it as deleted
in the new “soft deleted” bitmap. It is then removed from the search, and all the other
endpoints.
568: Fix not equal filter when field contains both number and strings r=Kerollmops a=GraDKh
Related to https://github.com/meilisearch/meilisearch/issues/2516
Looks like the issue should be moved to this repo, but I'm not sure what the right procedure for it.
Co-authored-by: Dmytro Gordon <dmytro@bigstream.co>
566: Introduce the copy_to_path method on the Index r=irevoire a=Kerollmops
Meilisearch needs this method to do snapshots.
Co-authored-by: Kerollmops <clement@meilisearch.com>
564: Rename the limitedTo parameter into maxTotalHits r=curquiza a=Kerollmops
This PR is related to https://github.com/meilisearch/meilisearch/issues/2542, it renames the `limitedTo` parameter into `maxTotalHits`.
Co-authored-by: Kerollmops <clement@meilisearch.com>
563: Improve the `estimatedNbHits` when a `distinctAttribute` is specified r=irevoire a=Kerollmops
This PR is related to https://github.com/meilisearch/meilisearch/issues/2532 but it doesn't fix it entirely. It improves it by computing the excluded documents (the ones with an already-seen distinct value) before stopping the loop, I think it was a mistake and should always have been this way.
The reason it doesn't fix the issue is that Meilisearch is lazy, just to be sure not to compute too many things and answer by taking too much time. When we deduplicate the documents by their distinct value we must do it along the water, everytime we see a new document we check that its distinct value of it doesn't collide with an already returned document.
The reason we can see the correct result when enough documents are fetched is that we were lucky to see all of the different distinct values possible in the dataset and all of the deduplication was done, no document can be returned.
If we wanted to implement that to have a correct `extimatedNbHits` every time we should have done a pass on the whole set of possible distinct values for the distinct attribute and do a big intersection, this could cost a lot of CPU cycles.
Co-authored-by: Kerollmops <clement@meilisearch.com>