3461: Bring v1 changes into main r=curquiza a=Kerollmops
Also bring back changes in milli (the remote repository) into main done during the pre-release
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
776: Reduce incremental indexing time of `words_prefix_position_docids` DB r=curquiza a=loiclec
Fixes partially https://github.com/meilisearch/milli/issues/605
The `words_prefix_position_docids` can easily contain millions of entries. Thus, iterating
over it can be very expensive. But we do so needlessly for every document addition tasks.
It can sometimes cause indexing performance issues when :
- a user sends many `documentAdditionOrUpdate` tasks that cannot be all batched together (for example if they are interspersed with `documentDeletion` tasks)
- the documents contain long, diverse text fields, thus increasing the number of entries in `words_prefix_position_docids`
- the index has accumulated many soft-deleted documents, further increasing the size of `words_prefix_position_docids`
- the machine running Meilisearch does not have great IO performance (e.g. slow SSD, or quota-limited by the cloud provider)
Note, before approving the PR: the only changed file should be `milli/src/update/words_prefix_position_docids.rs`.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
774: Update version for the next release (v0.41.1) in Cargo.toml files r=curquiza a=meili-bot
⚠️ This PR is automatically generated. Check the new version is the expected one before merging.
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
This database can easily contain millions of entries. Thus, iterating
over it can be very expensive.
For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words`
will always be empty. Thus, we can save a significant amount of time
by adding this `if !del_prefix_fst_words.is_empty()` condition.
The code's behaviour remains completely unchanged.
763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner
# Pull Request
## Related issue
Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007)
## What does this PR do?
- Changes function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string and not parseable to f64
- implemented some unittests
- Derived PartialEq for GeoError to use assert_eq! in tests
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
767: Update version for the next release (v0.39.2) in Cargo.toml files r=curquiza a=meili-bot
⚠️ This PR is automatically generated. Check the new version is the expected one before merging.
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
765: Update version for the next release (v0.39.1) in Cargo.toml files r=curquiza a=meili-bot
⚠️ This PR is automatically generated. Check the new version is the expected one before merging.
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
764: Update deserr to latest version r=irevoire a=loiclec
Update deserr to 0.1.5, which changes the `DeserializeFromValue` trait, getting rid of the `default()` method.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
761: Integrate deserr r=irevoire a=loiclec
1. `Setting<T>` now implements `DeserializeFromValue`
2. The settings now store ranking rules as strongly typed `Criterion` instead of `String`, since the validation of the ranking rules will be done on meilisearch's side from now on
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
759: Change primary key inference error messages r=Kerollmops a=dureuill
# Pull Request
## Related issue
Milli part of https://github.com/meilisearch/meilisearch/issues/3301
## What does this PR do?
- Change error message strings
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec
# Pull Request
## Related issue
Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118
## What does this PR do?
When a query ends with a word and a prefix, such as:
```
word pr
```
Then we first determine whether `pre` *could possibly* be in the proximity prefix database before querying it. There are then three possibilities:
1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases.
2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. **In this case, we partially disable the proximity ranking rule for the pair `word pre`.** This is done as follows:
1. Only find the documents where `word` is in proximity to `pre` **exactly** (no derivations)
2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8
3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.
Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:
1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
2. For common prefixes of more than two letters: we no longer distinguish between any proximities
3. For uncommon prefixes: nothing changes
Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):
```json
[
{ "text": "I heard there is a faster proximity criterion" },
{ "text": "I heard there is a faster but less relevant proximity criterion" }
]
```
Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":
```json
[
{ "text": "I heard there is a faster but less relevant proximity criterion" }
{ "text": "I heard there is a faster proximity criterion" },
]
```
But the following document would be considered more relevant than the two documents above:
```json
{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }
```
Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything.
---
## Performance
I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.
```
1. 10x 'a':
- 640ms ⟹ 630ms = no significant difference
2. 10x 'b':
- set-based: 4.47s ⟹ 7.42 = bad, ~2x regression
- dynamic: 1s ⟹ 870 ms = no significant difference
3. 'Someone I l':
- set-based: 250ms ⟹ 12 ms = very good, x20 speedup
- dynamic: 21ms ⟹ 11 ms = good, x2 speedup
4. 'billie e':
- set-based: 623ms ⟹ 2ms = very good, x300 speedup
- dynamic: ~4ms ⟹ 4ms = no difference
5. 'billie ei':
- set-based: 57ms ⟹ 20ms = good, ~2x speedup
- dynamic: ~4ms ⟹ ~2ms. = no significant difference
6. 'i am getting o'
- set-based: 300ms ⟹ 60ms = very good, 5x speedup
- dynamic: 30ms ⟹ 6ms = very good, 5x speedup
7. 'prologue 1 a 1:
- set-based: 3.36s ⟹ 120ms = very good, 30x speedup
- dynamic: 200ms ⟹ 30ms = very good, 6x speedup
8. 'prologue 1 a 10':
- set-based: 590ms ⟹ 18ms = very good, 30x speedup
- dynamic: 82ms ⟹ 35ms = good, ~2x speedup
```
Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
732: Interpret synonyms as phrases r=loiclec a=loiclec
# Pull Request
## Related issue
Fixes (when merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3125
## What does this PR do?
We now map multi-word synonyms to phrases instead of loose words. Such that the request:
```
btw I am going to nyc soon
```
is interpreted as (when the synonym interpretation is chosen for both `btw` and `nyc`):
```
"by the way" I am going to "New York City" soon
```
instead of:
```
by the way I am going to New York City soon
```
This prevents queries containing multi-word synonyms to exceed to word length limit and degrade the search performance.
In terms of relevancy, there is a debate to have. I personally think this could be considered an improvement, since it would be strange for a user to search for:
```
good DIY project
```
and have a result such as:
```
{
"text": "whether it is a good project to do, you'll have to decide for yourself"
}
```
However, for synonyms such as `NYC -> New York City`, then we will stop matching documents where `New York` is separated from `City`. This is however solvable by adding an additional mapping: `NYC -> New York`.
## Performance
With the old behaviour, some long search requests making heavy uses of synonyms could take minutes to be executed. This is no longer the case, these search requests now take an average amount of time to be resolved.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
736: Update charabia r=curquiza a=ManyTheFish
Update Charabia to the last version.
> We are now Romanizing Chinese characters into Pinyin.
> Note that we keep the accent because they are in fact never typed directly by the end-user, moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context.
Co-authored-by: ManyTheFish <many@meilisearch.com>
709: Optimise the `ExactWords` sub-criterion within `Exactness` r=loiclec a=loiclec
# Pull Request
## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3116
## What does this PR do?
1. Reduces the algorithmic complexity of finding the documents containing N exact words from something that is exponential to something that is polynomial.
2. Cache intermediary results between different calls to the `exactness` criterion.
## Performance Results
On the `smol_songs.csv` dataset, a request containing 10 common words now takes about 60ms instead of 5 seconds to execute. For example, this is the case with this (admittedly nonsensical) request: `Rock You Hip Hop Folk World Country Electronic Love The`.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Displays log message in the form:
```
[2022-12-21T09:19:42Z INFO milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id'
```
742: Add a "Criterion implementation strategy" parameter to Search r=irevoire a=loiclec
Add a parameter to search requests which determines the implementation strategy of the criteria. This can be either `set-based`, `iterative`, or `dynamic` (ie choosing between set-based or iterative at search time). See https://github.com/meilisearch/milli/issues/755 for more context about this change.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
747: Soft-deletion computation no longer depends on the mapsize r=irevoire a=dureuill
# Pull Request
## Related issue
Related to https://github.com/meilisearch/meilisearch/issues/3231: After removing `--max-index-size`, the `mapsize` will always be unrelated to the actual max size the user wants for their DB, so it doesn't make sense to use these values any longer.
This implements solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824
## What does this PR do?
### User-visible
- Soft-deleted are no longer deleted when there is less than 10% of the mapsize available or when they take more than 10% of the mapsize
- Instead, they are deleted when they are more soft deleted than regular documents, or when they take more than 1GiB disk space (estimated).
### Implementation standpoint
1. Adds a `DeletionStrategy` struct to replace the boolean `disable_soft_deletion` that we had up until now. This enum allows us to specify that we want "always hard", "always soft", or to use the dynamic soft-deletion strategy (default).
2. Uses the current strategy when deleting documents, with the new heuristics being used in the `DeletionStrategy::Dynamic` variant.
3. Updates the tests to use the appropriate DeletionStrategy whenever needed (one of `AlwaysHard` or `AlwaysSoft` depending on the test)
Note to reviewers: this PR is optimized for a commit-by-commit review.
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
743: Fix finite pagination with placeholder search r=Kerollmops a=ManyTheFish
this bug is reproducible on real datasets and is hard to isolate in a simple test.
related to: https://github.com/meilisearch/meilisearch/issues/3200
poke `@curquiza`
Co-authored-by: ManyTheFish <many@meilisearch.com>
728: Add some integration tests on the sort criterion r=ManyTheFish a=loiclec
This is simply an integration test ensuring that the sort criterion works properly.
However, only one version of the algorithm is tested here (the iterative one). To test the version that uses the facet DB, one has to manually set the `CANDIDATES_THRESHOLD` constant to `0`. I have done that and ensured that the test still succeeds. However, in the future, we will probably want to have an option to force which algorithm is used at runtime, for testing purposes.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
737: Fix typo initial candidates computation r=Kerollmops a=ManyTheFish
When `Typo` criterion was after a different criterion than `Words` and the previous criterion wasn't returning any candidates at the first iteration of the bucket sort, then the `initial_candidates` were lost.
Now, `Typo`ensure to keep the `initial_candidates` between iterations.
related to https://github.com/meilisearch/meilisearch/issues/3200#issuecomment-1345179578
related to https://github.com/meilisearch/meilisearch/issues/3228
Co-authored-by: ManyTheFish <many@meilisearch.com>
By creating snapshots and updating the format of the existing
snapshots. The next commit will apply the fix, which will show
its effects cleanly on the old and new snapshot tests
723: Fix bug in handling of soft deleted documents when updating settings r=Kerollmops a=loiclec
# Pull Request
## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3021
## What does this PR do?
This PR fixes the bug where a `missing key in documents database` internal error message could appear when indexing documents.
When updating the settings, before clearing the database and before creating the transform output, we now modify the `ExternalDocumentsIds` structure to get rid of all references to soft deleted document ids in its FSTs.
It used to be that updating the settings would clear the soft-deleted document ids, but keep the original `ExternalDocumentsIds` structure. As a consequence of this, when processing a future document addition, we could wrongly believe that a document was being replaced when, in fact, it was a completely new document. See the tests `bug_3021_first`, `bug_3021_second`, and `bug_3021` for a minimal test case that would have reproduced the issue.
We need to take special care to:
- evaluate how users should update to v0.30.1 (containing this fix): dump? reimporting all documents from scratch?
- understand IF/HOW this bug could have caused duplicate documents to be returned
- and evaluate the correctness of the fix, of course :)
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
719: Add more members of `filter_parser` to `milli::` & `From<&str>` implementation for `Token` r=Kerollmops a=GregoryConrad
## What does this PR do?
The current `milli::Filter` and `milli::FilterCondition` APIs require working with some members of `filter_parser` directly that `milli::` does *not* re-export to its users (at least when not parsing input using `parse`). Also, using `filter_parser` does not make sense when using milli from an embedded context where there is no query to parse.
Instead of reworking `milli::Filter` and `milli::FilterCondition`, this PR adds two non-breaking changes that ease the use of milli:
- Re-exports more members of the dependent version of `filter_parser` in `milli`
- Implements `From<&str>` for `filter_parser::Token`
- This will also allow some basic tests that need to create a `Token` from a string to avoid some boilerplate.
In conjunction, both of these will allow milli users to easily create a `Token` from a `&str` without needing to add `filter_parser` as an extra dependency.
Note: I wanted to use `FromStr` for the `From` implementation; however, it requires returning a `Result` which is not needed for the conversion. Thus, I just left it as `From<&str>`.
Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad
## What does this PR do?
When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls.
For reference, here is a snippet from the current `execute` method in `update::Settings`:
```rust
// ...
if stop_words_updated
|| faceted_updated
|| synonyms_updated
|| searchable_updated
|| exact_attributes_updated
{
self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?;
}
```
- [x] `faceted_updated` - looks good as-is ✅
- [x] `stop_words_updated` - looks good as-is ✅
- [x] `synonyms_updated` - looks good as-is ✅
- [x] `searchable_updated` - fixed in this PR
- [x] `exact_attributes_updated` - fixed in this PR
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec
# Pull Request
## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115
## What does this PR do?
1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x.
This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built.
2. Avoid the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
712: Fix bulk facet indexing bug r=Kerollmops a=loiclec
# Pull Request
## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3165
## What does this PR do?
Fixes a bug where indexing certain numbers of filterable attribute values in bulk led to corrupted facet databases. This was due to a lossy integer conversion which would ultimately prevent entire levels of the facet database to be written into LMDB.
More specifically, this change was made:
```diff
- if cur_writer_len as u8 >= self.min_level_size {
+ if cur_writer_len >= self.min_level_size as usize {
```
I also checked other comparisons to `min_level_size` and other conversions such as `x as u8` in this part of the codebase.
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
697: Fix bug in prefix DB indexing r=loiclec a=loiclec
Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities.
Closes partially https://github.com/meilisearch/meilisearch/issues/3043
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
1. Handle keys with variable length correctly
This fixes https://github.com/meilisearch/meilisearch/issues/3042 and
is easily reproducible with the updated fuzz tests, which now generate
keys with variable lengths.
2. Prevent adding facets to the database if their encoded value does
not satisfy `valid_lmdb_key`.
This fixes an indexing failure when a document had a filterable
attribute containing a value whose length is higher than ~500 bytes.
693: use the lmdb-master.3 branch r=Kerollmops a=irevoire
After investigating https://github.com/meilisearch/meilisearch/issues/3017, we found out that it was due to lmdb and that, without any code change on our side, bumping using the lmdb-master-3 branch fix our issues.
But, we’re not really confident about what changed between the `mdb.master` and `mdb.master3` branches; thus this is a temporary change, and we hope we’ll be able to move to the new version of heed asap (either before the end of the pre-release or for the next release).
--------
The bug is hard to reproduce; I can reproduce it 100% of the time on my archlinux personal computer. But on a scaleway archlinux bare-metal machine, it doesn’t reproduce. It’s flaky on our test suite, but `@loiclec` was able to write a minimal test that reproduces it every time on macOS.
Basically, what happens is when there are multiple threads opening databases in a different directory at the same time.
If there are 10 or more threads running at the same time, lmdb starts throwing the `Invalid argument (os error 22)` error for no reason, we believe.
I would like to submit an issue to lmdb, but I don’t really have the time to write a test in C without heed currently.
`@hyc,` if you want to take a look at it, here is the repo that reproduces the issue on macOS: https://github.com/irevoire/heed-bug
Co-authored-by: Irevoire <tamo@meilisearch.com>
689: Handle non-finite floats consistently in filters r=irevoire a=dureuill
# Pull Request
## Related issue
Related meilisearch/meilisearch#3000
## What does this PR do?
### User
- Filters using `field = inf`, (or `infinite`, `NaN`) now match the value as a string rather than returning an internal error.
- Filters using `field < inf` (or other comparison operators) now return an invalid_filter error rather than returning an internal error, much like when using `field < aaa`.
### Implementation
- Add new `NonFiniteFloat` error variants to the filter-parser errors
- Add `Token::parse_as_finite_float` that can fail both when the string is not a float and when the float is not finite
- Refactor `Filter::inner_evaluate` to always use `parse_as_finite_float` instead of just `parse`
- Add corresponding tests
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
659: Fix clippy error to add clippy job on Ci r=Kerollmops a=unvalley
## Related PR
This PR is for #673
## What does this PR do?
- ~~add `Run Clippy` job to CI (rust.yml)~~
- apply `cargo clippy --fix` command
- fix some `cargo clippy` error manually (but warnings still remain on tests)
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Co-authored-by: unvalley <kirohi.code@gmail.com>
Co-authored-by: unvalley <38400669+unvalley@users.noreply.github.com>
664: Fix phrase search containing stop words r=ManyTheFish a=Samyak2
# Pull Request
This a WIP draft PR I wanted to create to let other potential contributors know that I'm working on this issue. I'll be completing this in a few hours from opening this.
## Related issue
Fixes#661 and towards fixing meilisearch/meilisearch#2905
## What does this PR do?
- [x] Change Phrase Operation to use a `Vec<Option<String>>` instead of `Vec<String>` where `None` corresponds to a stop word
- [x] Update all other uses of phrase operation
- [x] Update `resolve_phrase`
- [x] Update `create_primitive_query`?
- [x] Add test
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
Co-authored-by: Samyak Sarnayak <samyak201@gmail.com>