Commit Graph

2326 Commits

Author SHA1 Message Date
Loïc Lecrenier
e3ee553dcc Remove soft deleted ids from ExternalDocumentIds during document import
If the document import replaces a document using hard deletion
2022-12-12 14:16:09 +01:00
Loïc Lecrenier
bebd050961 Add new test for bug 3021 2022-12-08 19:19:40 +01:00
bors[bot]
1f1beae077
Merge #729
729: Fix distincted exhaustive hits r=Kerollmops a=ManyTheFish

This PR changes the name and behavior of `bucket_candidates`:
- `bucket_candidates` become `initial_candidates` that is less confusing
- `initial_candidates` is no more a simple `RoaringBitmap` but an enum allowing us to precise if the candidates are exhaustive or not
- this enum ensures that any modification is allowed only if the candidates are not already exhaustive.

The bug occurred because `initial_candidates` are modified during the bucket sort allowing the estimation to be more and more precise along the search, and this was an issue when the `initial_candidates` were already exhaustive, now, if candidates are exhaustive, then no modifications are made.

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-12-08 09:26:34 +00:00
ManyTheFish
55724f2412 Introduce an initial candidates set that makes the difference between an exhaustive count and an estimation 2022-12-08 09:41:34 +01:00
ManyTheFish
6d50ea0830 add tests 2022-12-08 08:56:57 +01:00
bors[bot]
098c410612
Merge #727
727: Fix bug in filter search r=Kerollmops a=loiclec

# Pull Request

## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3178

## What does this PR do?
The most important change is this one:
```rust
    // in milli/src/search/facet/facet_range_search.rs, line 239
    let should_stop = {
        match self.right {
            Bound::Included(right) => right < previous_key.left_bound,
            Bound::Excluded(right) => right <= previous_key.left_bound,
            Bound::Unbounded => false,
        }
    };
```
where the operations `<` and `<=` between the two branches were switched. This caused (very few) documents to be missing from filter results.

The second change is a simplification of the algorithm for filters such as `field = value`, where we now perform a direct query into the "Level 0" of the facet db to retrieve the docids instead of invoking the full facet search algorithm. This change is done in `milli/src/search/facet/filter.rs`.

I have added yet more insta-snapshot tests, rechecked the content of the snapshots, and added some integration tests as well. 

This is purely a fix in the search algorithms. Based on this PR alone, a dump will not be necessary to switch from v0.30.1 (where this bug is present) to v0.30.2 (where this PR is merged).


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-07 14:34:59 +00:00
bors[bot]
ee10cb8c87
Merge #726
726: Update the contributing.md r=curquiza a=irevoire



Co-authored-by: Tamo <tamo@meilisearch.com>
2022-12-07 13:59:04 +00:00
Loïc Lecrenier
d38cc73630 Add one more filter "integration" test 2022-12-07 14:38:25 +01:00
Loïc Lecrenier
e688581c36 Add tests for facet range search on different field ids 2022-12-07 14:38:21 +01:00
Loïc Lecrenier
4ac8f96342 Simplify implementation of equality condition in filters 2022-12-07 14:38:18 +01:00
Loïc Lecrenier
1c9555566e Fix bug in facet range search 2022-12-07 14:38:14 +01:00
Loïc Lecrenier
303d740245 Prepare fix within facet range search
By creating snapshots and updating the format of the existing
snapshots. The next commit will apply the fix, which will show
its effects cleanly on the old and new snapshot tests
2022-12-07 14:38:10 +01:00
Tamo
250743885d
add a sentence about installing rust-nightly 2022-12-07 12:31:43 +01:00
Tamo
5eecb8489d
Update CONTRIBUTING.md
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-12-07 12:23:12 +01:00
Tamo
0e5c3b1f64
Update CONTRIBUTING.md
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-12-07 12:23:06 +01:00
Tamo
f53bdc4320
update the contributing.md 2022-12-06 17:41:05 +01:00
bors[bot]
0a301b5f88
Merge #723
723: Fix bug in handling of soft deleted documents when updating settings r=Kerollmops a=loiclec

# Pull Request

## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3021

## What does this PR do?
This PR fixes the bug where a `missing key in documents database` internal error message could appear when indexing documents.

When updating the settings, before clearing the database and before creating the transform output, we now modify the `ExternalDocumentsIds` structure to get rid of all references to soft deleted document ids in its FSTs.

It used to be that updating the settings would clear the soft-deleted document ids, but keep the original `ExternalDocumentsIds` structure. As a consequence of this, when processing a future document addition, we could wrongly believe that a document was being replaced when, in fact, it was a completely new document. See the tests `bug_3021_first`, `bug_3021_second`, and `bug_3021` for a minimal test case that would have reproduced the issue.
 
We need to take special care to:
- evaluate how users should update to v0.30.1 (containing this fix): dump? reimporting all documents from scratch?
- understand IF/HOW this bug could have caused duplicate documents to be returned 
- and evaluate the correctness of the fix, of course :)


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-06 14:37:38 +00:00
Loïc Lecrenier
a993b68684 Cargo fmt >:-( 2022-12-06 15:22:10 +01:00
Loïc Lecrenier
80c7a00567 Fix compilation error in tests of settings update 2022-12-06 15:19:26 +01:00
Loïc Lecrenier
67d8cec209 Fix bug in handling of soft deleted documents when updating settings 2022-12-06 15:09:19 +01:00
bors[bot]
2a846aaae7
Merge #719
719: Add more members of `filter_parser` to `milli::` & `From<&str>` implementation for `Token` r=Kerollmops a=GregoryConrad

## What does this PR do?
The current `milli::Filter` and `milli::FilterCondition` APIs require working with some members of `filter_parser` directly that `milli::` does *not* re-export to its users (at least when not parsing input using `parse`). Also, using `filter_parser` does not make sense when using milli from an embedded context where there is no query to parse.

Instead of reworking `milli::Filter` and `milli::FilterCondition`, this PR adds two non-breaking changes that ease the use of milli:
- Re-exports more members of the dependent version of `filter_parser` in `milli`
- Implements `From<&str>` for `filter_parser::Token`
  - This will also allow some basic tests that need to create a `Token` from a string to avoid some boilerplate.

In conjunction, both of these will allow milli users to easily create a `Token` from a `&str` without needing to add `filter_parser` as an extra dependency.

Note: I wanted to use `FromStr` for the `From` implementation; however, it requires returning a `Result` which is not needed for the conversion. Thus, I just left it as `From<&str>`.

Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
2022-12-06 10:36:00 +00:00
bors[bot]
d6eacb2aac
Merge #722
722: Geosearch for zero radius r=irevoire a=amab8901

# Pull Request

## Related issue
Fixes #3167 (https://github.com/meilisearch/meilisearch/issues/3167)

## What does this PR do?
- allows Geosearch with zero radius to return the specified location when the coordinates match perfectly (instead of returning nothing). See link for more details.
- new attempt on https://github.com/meilisearch/milli/pull/713

## PR checklist
Please check if your PR fulfills the following requirements:
- [ X ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ X ] Have you read the contributing guidelines?
- [ X ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: amab8901 <amab8901@protonmail.com>
Co-authored-by: Tamo <irevoire@protonmail.ch>
2022-12-05 19:57:08 +00:00
Tamo
212dbfa3b5
Update milli/src/search/facet/filter.rs 2022-12-05 20:56:21 +01:00
amab8901
456da5de9c Geosearch for zero radius 2022-12-05 20:11:46 +01:00
bors[bot]
46e26ab550
Merge #720
720: Make soft deletion optional in document addition and deletion + add lots of tests r=irevoire a=loiclec

# Pull Request

## What does this PR do?
When debugging recent issues, I created a few unit tests in the hopes reproducing the bugs I was looking for. In the end, I didn't find any, but I thought it would still be good to keep those tests. 

More importantly, I added a field to the `DeleteDocuments` and `IndexDocuments` builders, called `disable_soft_deletion`. If set to `true`, the indexing/deletion will never add documents to the `soft_deleted_documents_ids` and instead perform a real deletion of the documents from the databases.

For the new tests, I have:
- Improved the insta-snapshot format of the `external_documents_ids` structure
- Added more tests for the facet DB indexing, deletion, and search algorithms, making sure to test them when the facet DB contains strings (instead of numbers) as well.
- Added more tests for the incremental indexing of the prefix proximity databases. For example, to see if documents are replaced correctly and if common prefixes are deleted correctly.
- Added tests that mix soft deletion and hard deletion, including when processing batches of document updates. 


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-05 18:26:01 +00:00
Loïc Lecrenier
cda4ba2bb6 Add document import tests 2022-12-05 12:02:49 +01:00
Loïc Lecrenier
ae59d37b75 Improve insta-snap of the external document ids 2022-12-05 10:51:02 +01:00
Loïc Lecrenier
f2cf981641 Add more tests and allow disabling of soft-deletion outside of tests
Also allow disabling soft-deletion in the IndexDocumentsConfig
2022-12-05 10:51:01 +01:00
Gregory Conrad
50954d31fa feat: Re-export Span and Token to milli:: 2022-12-03 13:37:33 -05:00
Gregory Conrad
1b5b5778c1 feat: Add From<&str> implementation for Token 2022-12-03 13:13:41 -05:00
bors[bot]
d3731dda48
Merge #706
706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad

## What does this PR do?
When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls.

For reference, here is a snippet from the current `execute` method in `update::Settings`:
```rust
// ...
if stop_words_updated
    || faceted_updated
    || synonyms_updated
    || searchable_updated
    || exact_attributes_updated
{
    self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?;
}
```

- [x] `faceted_updated` - looks good as-is 
- [x] `stop_words_updated` - looks good as-is 
- [x] `synonyms_updated` - looks good as-is 
- [x] `searchable_updated` - fixed in this PR
- [x] `exact_attributes_updated` - fixed in this PR

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
2022-12-01 13:58:02 +00:00
bors[bot]
51a2613c5c
Merge #715
715: Fix benchmark CI r=irevoire a=curquiza

Fixes #714 

Tested with our actions: https://github.com/meilisearch/milli/actions/runs/3591527753/jobs/6046157141

Co-authored-by: curquiza <clementine@meilisearch.com>
2022-12-01 10:39:38 +00:00
bors[bot]
82e1c4f468
Merge #716
716: Bump Swatinem/rust-cache from 2.0.1 to 2.2.0 r=curquiza a=dependabot[bot]

Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.0.1 to 2.2.0.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/Swatinem/rust-cache/releases">Swatinem/rust-cache's releases</a>.</em></p>
<blockquote>
<h2>v2.2.0</h2>
<ul>
<li>Add new <code>save-if</code> option to always restore, but only conditionally save the cache.</li>
</ul>
<h2>v2.1.0</h2>
<ul>
<li>Only hash <code>Cargo.{lock,toml}</code> files in the configured workspace directories.</li>
</ul>
<h2>v2.0.2</h2>
<ul>
<li>Avoid calling cargo metadata on pre-cleanup.</li>
<li>Added <code>prefix-key</code>, <code>cache-directories</code> and <code>cache-targets</code> options.</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md">Swatinem/rust-cache's changelog</a>.</em></p>
<blockquote>
<h2>2.2.0</h2>
<ul>
<li>Add new <code>save-if</code> option to always restore, but only conditionally save the cache.</li>
</ul>
<h2>2.1.0</h2>
<ul>
<li>Only hash <code>Cargo.{lock,toml}</code> files in the configured workspace directories.</li>
</ul>
<h2>2.0.2</h2>
<ul>
<li>Avoid calling <code>cargo metadata</code> on pre-cleanup.</li>
<li>Added <code>prefix-key</code>, <code>cache-directories</code> and <code>cache-targets</code> options.</li>
</ul>
<h2>2.0.1</h2>
<ul>
<li>Primarily just updating dependencies to fix GitHub deprecation notices.</li>
</ul>
<h2>2.0.0</h2>
<ul>
<li>The action code was refactored to allow for caching multiple workspaces and
different <code>target</code> directory layouts.</li>
<li>The <code>working-directory</code> and <code>target-dir</code> input options were replaced by a
single <code>workspaces</code> option that has the form of <code>$workspace -&gt; $target</code>.</li>
<li>Support for considering <code>env-vars</code> as part of the cache key.</li>
<li>The <code>sharedKey</code> input option was renamed to <code>shared-key</code> for consistency.</li>
</ul>
<h2>1.4.0</h2>
<ul>
<li>Clean both <code>debug</code> and <code>release</code> target directories.</li>
</ul>
<h2>1.3.0</h2>
<ul>
<li>Use Rust toolchain file as additional cache key.</li>
<li>Allow for a configurable target-dir.</li>
</ul>
<h2>1.2.0</h2>
<ul>
<li>Cache <code>~/.cargo/bin</code>.</li>
<li>Support for custom <code>$CARGO_HOME</code>.</li>
<li>Add a <code>cache-hit</code> output.</li>
<li>Add a new <code>sharedKey</code> option that overrides the automatic job-name based key.</li>
</ul>
<h2>1.1.0</h2>
<ul>
<li>Add a new <code>working-directory</code> input.</li>
<li>Support caching git dependencies.</li>
<li>Lots of other improvements.</li>
</ul>
<h2>1.0.2</h2>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="359a70e43a"><code>359a70e</code></a> 2.2.0</li>
<li><a href="ecee04e7b3"><code>ecee04e</code></a> feat: add save-if option, closes <a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/66">#66</a> (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/91">#91</a>)</li>
<li><a href="b894d59a8d"><code>b894d59</code></a> 2.1.0</li>
<li><a href="e78327dd9e"><code>e78327d</code></a> small code style improvements, README and CHANGELOG updates</li>
<li><a href="ccdddcc049"><code>ccdddcc</code></a> only hash Cargo.toml/Cargo.lock that belong to a configured workspace (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/90">#90</a>)</li>
<li><a href="b5ec9edd91"><code>b5ec9ed</code></a> 2.0.2</li>
<li><a href="3f2513fdf4"><code>3f2513f</code></a> avoid calling cargo metadata on pre-cleanup</li>
<li><a href="19c46583c5"><code>19c4658</code></a> update dependencies</li>
<li><a href="b8e72aae83"><code>b8e72aa</code></a> Added <code>prefix-key</code> <code>cache-directories</code> and <code>cache-targets</code> options (<a href="https://github-redirect.dependabot.com/Swatinem/rust-cache/issues/85">#85</a>)</li>
<li>See full diff in <a href="https://github.com/Swatinem/rust-cache/compare/v2.0.1...v2.2.0">compare view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=Swatinem/rust-cache&package-manager=github_actions&previous-version=2.0.1&new-version=2.2.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

You can trigger a rebase of this PR by commenting ``@dependabot` rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- ``@dependabot` rebase` will rebase this PR
- ``@dependabot` recreate` will recreate this PR, overwriting any edits that have been made to it
- ``@dependabot` merge` will merge this PR after your CI passes on it
- ``@dependabot` squash and merge` will squash and merge this PR after your CI passes on it
- ``@dependabot` cancel merge` will cancel a previously requested merge and block automerging
- ``@dependabot` reopen` will reopen this PR if it is closed
- ``@dependabot` close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- ``@dependabot` ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- ``@dependabot` ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- ``@dependabot` ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)


</details>

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2022-12-01 10:08:58 +00:00
curquiza
5bdf5c0aaf Update the steps to set variables 2022-12-01 11:07:54 +01:00
dependabot[bot]
282b2e3b98
Bump Swatinem/rust-cache from 2.0.1 to 2.2.0
Bumps [Swatinem/rust-cache](https://github.com/Swatinem/rust-cache) from 2.0.1 to 2.2.0.
- [Release notes](https://github.com/Swatinem/rust-cache/releases)
- [Changelog](https://github.com/Swatinem/rust-cache/blob/master/CHANGELOG.md)
- [Commits](https://github.com/Swatinem/rust-cache/compare/v2.0.1...v2.2.0)

---
updated-dependencies:
- dependency-name: Swatinem/rust-cache
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
2022-12-01 10:02:54 +00:00
bors[bot]
5e754b3ee0
Merge #708
708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec

# Pull Request

## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115 

## What does this PR do?
1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x. 
   This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a  `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built.
 
2. Avoid the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-11-30 17:47:34 +00:00
bors[bot]
e1612fcb01
Merge #712
712: Fix bulk facet indexing bug r=Kerollmops a=loiclec

# Pull Request

## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3165

## What does this PR do?
Fixes a bug where indexing certain numbers of filterable attribute values in bulk led to corrupted facet databases. This was due to a lossy integer conversion which would ultimately prevent entire levels of the facet database to be written into LMDB.

More specifically, this change was made:
```diff
      - if cur_writer_len as u8 >= self.min_level_size {
      + if cur_writer_len >= self.min_level_size as usize {
```
I also checked other comparisons to `min_level_size` and other conversions such as `x as u8` in this part of the codebase.



Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-11-30 16:51:48 +00:00
Loïc Lecrenier
9dd4b33a9a Fix bulk facet indexing bug 2022-11-30 14:27:36 +01:00
bors[bot]
de22116b3d
Merge #711
711: Replace deprecated gh actions r=curquiza a=pnhatminh

# Pull Request

## Related issue
Fixes #678

## What does this PR do?
- Replace deprecated github action command with newly defined command.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Minh Pham <minh.pham@codelink.io>
2022-11-29 09:56:22 +00:00
Minh Pham
5f78522044 Updagte 2022-11-29 10:11:38 +07:00
Gregory Conrad
87e2bc3bed fix(reindex): reindex in a few more cases
Cases: whenever searchable_fields OR user_defined_searchable_fields is modified
2022-11-28 13:12:19 -05:00
Loïc Lecrenier
61b58b115a Don't create partial matching words for synonyms in ngrams 2022-11-28 16:32:28 +01:00
Gregory Conrad
d3182f3830 refactor: Change return type to keep consistency with others 2022-11-28 10:02:03 -05:00
bors[bot]
f698e6cfdf
Merge #707
707: Add all_obkv_to_json function r=Kerollmops a=GregoryConrad

## What does this PR do?
When embedding milli in an application (other than Meilisearch), it often makes sense to not use the `displayed_attributes` functionality and instead just use milli as a full document store. Thus, this PR adds a function, `all_obkv_to_json`, to supplement the already exposed `milli::obkv_to_json` so that those embedding milli *do not* need to deal with `displayed_attributes` if they don't need to.

~This PR also introduces a slight breaking change: `obkv_to_json` now accepts a reference to `obkv::KvReaderU16` instead of taking ownership of it. As far as I can tell, this seems like a change for the better (`obkv_to_json` only acts upon `obkv` rather than consuming it), but I can change it back if you so desire.~ (reverted in [935a724](935a724c57))

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
2022-11-28 14:52:45 +00:00
Loïc Lecrenier
f70856bab1 Remove memory usage test that fails when many tests are run in parallel 2022-11-28 12:55:28 +01:00
Loïc Lecrenier
80588daae5 Fix compilation error in formatting benches 2022-11-28 10:27:15 +01:00
Loïc Lecrenier
e2ebed62b1 Don't create partial matching words for synonyms, split words, phrases 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
8284bd760f Relax memory ordering of operations within the test CountingAlloc 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
8d0ace2d64 Avoid creating a MatchingWord for words that exceed the length limit 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
86c34a996b Deduplicate matching words 2022-11-28 10:20:13 +01:00