1280 Commits

Author SHA1 Message Date
Tamo
18796d6e6a Consider null as a valid geo object 2023-02-20 13:45:51 +01:00
Tamo
895ab2906c apply review suggestions 2023-02-16 18:42:47 +01:00
bors[bot]
143e3cf948
Merge #3490
3490: Fix attributes set candidates r=curquiza a=ManyTheFish

# Pull Request

Fix attributes set candidates for v1.1.0

## details

The attribute criterion was not returning the remaining candidates when its internal algorithm was exhausted.
The criterion therefore lost candidates, which led to the bug reported in the issue linked below.
After some investigation, it seems that it was the only criterion that had this behavior.

We are now returning the remaining candidates instead of an empty bitmap.
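
As a rough illustration of the fix described above, here is a minimal sketch of a criterion that hands back the remaining candidates once its internal algorithm has nothing left to rank. The type `AttributeCriterionSketch` and its method are hypothetical names, and `BTreeSet<u32>` stands in for a bitmap of document ids; this is not milli's actual code.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the attribute criterion (not milli's real type).
/// `BTreeSet<u32>` plays the role of a bitmap of document ids.
struct AttributeCriterionSketch {
    /// Buckets the internal algorithm can still produce, best first.
    ranked: std::vec::IntoIter<BTreeSet<u32>>,
    /// Candidates that have not been returned in any bucket yet.
    remaining: BTreeSet<u32>,
}

impl AttributeCriterionSketch {
    fn new(candidates: BTreeSet<u32>, ranked: Vec<BTreeSet<u32>>) -> Self {
        Self { ranked: ranked.into_iter(), remaining: candidates }
    }

    fn next_bucket(&mut self) -> Option<BTreeSet<u32>> {
        if let Some(bucket) = self.ranked.next() {
            // Only return ids that are still candidates, and mark them as consumed.
            let bucket: BTreeSet<u32> =
                bucket.intersection(&self.remaining).copied().collect();
            for id in &bucket {
                self.remaining.remove(id);
            }
            return Some(bucket);
        }
        // The internal algorithm is exhausted: return what is left
        // instead of an empty set, so no candidate is lost.
        if self.remaining.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.remaining))
        }
    }
}
```

With candidates `{1, 2, 3, 4}` and ranked buckets `{1}` and `{2}`, the last non-empty bucket returned by this sketch is `{3, 4}` rather than an empty set.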

## Related issue

Fixes #3483
PR on milli for v1.0.1: https://github.com/meilisearch/milli/pull/777


Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-02-15 17:38:07 +00:00
bors[bot]
91ce8a5e67
Merge #3492
3492: Bump deserr r=Kerollmops a=irevoire

Bump deserr to the latest version:
- We now use the default actix-web extractors that deserr provides (which were copy/pasted from meilisearch)
- We also use the default `JsonError` message provided by deserr instead of defining our own in meilisearch
- Finally, we get the new `did you mean?` error message. Fix #3493

Co-authored-by: Tamo <tamo@meilisearch.com>
2023-02-15 10:05:05 +00:00
Tamo
8fb7b1d10f
bump deserr 2023-02-14 20:04:30 +01:00
Tamo
74dcfe9676
Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform 2023-02-14 19:09:40 +01:00
Tamo
1b1703a609
make a small optimization to merge obkvs a little bit faster 2023-02-14 18:32:41 +01:00
Tamo
fb5e4957a6
fix and test the early exit in case a grenad ends with a deletion 2023-02-14 18:23:57 +01:00
Tamo
8de3c9f737
Update milli/src/update/index_documents/transform.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-02-14 17:57:14 +01:00
Tamo
43a19d0709
document the operation enum + the grenads 2023-02-14 17:55:26 +01:00
Filip Bachul
a53536836b fmt 2023-02-14 17:04:22 +01:00
Filip Bachul
d7ad39ad77 fix: clippy error 2023-02-14 00:15:35 +01:00
Filip Bachul
849de089d2 add thiserror for AscDescError 2023-02-14 00:15:35 +01:00
filip
7f25007d31 Update milli/src/asc_desc.rs
Co-authored-by: Tamo <irevoire@protonmail.ch>
2023-02-14 00:15:35 +01:00
Filip Bachul
c810af3ebf implement From<ParseGeoError> for AscDescError 2023-02-14 00:15:35 +01:00
Filip Bachul
c0b77773ba fmt asc_desc 2023-02-14 00:15:35 +01:00
Filip Bachul
7481559e8b move BadGeo to FilterError 2023-02-14 00:15:35 +01:00
Filip Bachul
83c765ce6c implement From<ParseGeoError> for FilterError 2023-02-14 00:15:35 +01:00
Filip Bachul
4c91037602 use ParseGeoError in sort parser 2023-02-14 00:15:35 +01:00
Filip Bachul
825923f6fc export ParseGeoError 2023-02-14 00:15:35 +01:00
Filip Bachul
e405702733 chore: introduce new error ParseGeoError type 2023-02-14 00:15:35 +01:00
ManyTheFish
6fa877efb0 Fix attributes set candidates 2023-02-13 17:49:52 +01:00
Tamo
746b31c1ce
makes clippy happy 2023-02-09 12:23:01 +01:00
Tamo
93db755d57
add a test to ensure we correctly handle deleting the same document multiple times 2023-02-08 21:03:34 +01:00
Tamo
93f130a400
fix all warnings 2023-02-08 20:57:35 +01:00
Tamo
421a9cf05e
provide a new method on the transform to remove documents 2023-02-08 16:06:09 +01:00
Tamo
8f64fba1ce
rewrite the current transform to handle a new byte specifying the kind of operation it's merging 2023-02-08 12:53:38 +01:00
bors[bot]
c88c3637b4
Merge #3461
3461: Bring v1 changes into main r=curquiza a=Kerollmops

Also bring back changes in milli (the remote repository) into main done during the pre-release

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2023-02-07 11:27:27 +00:00
Tamo
42114325cd
Apply suggestions from code review
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-02-06 18:07:00 +01:00
Tamo
7a38fe624f
throw an error if the top left corner is found below the bottom right corner 2023-02-06 17:50:47 +01:00
Tamo
1b005f697d
update the syntax of the geoboundingbox filter to use brackets instead of parens around lat and lng 2023-02-06 16:50:27 +01:00
Kerollmops
fbec48f56e
Merge remote-tracking branch 'milli/main' into bring-v1-changes 2023-02-06 16:48:10 +01:00
Tamo
3ebc99473f
Apply suggestions from code review
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-02-06 13:29:37 +01:00
Tamo
d27007005e
comments the geoboundingbox + forbid the usage of the lexeme method which could introduce bugs 2023-02-06 11:36:49 +01:00
Tamo
fcb09ccc3d
add tests on the geoBoundingBox 2023-02-02 18:19:56 +01:00
Louis Dureuil
ae8660e585
Add Token::original_span rather than making Token::span pub 2023-02-02 15:03:34 +01:00
Guillaume Mourier
b297b5deb0
cargo fmt 2023-02-02 12:34:49 +01:00
Guillaume Mourier
0d71c80ba6
add tests 2023-02-02 12:31:27 +01:00
Guillaume Mourier
65a3086cf1
fix test 2023-02-02 12:27:58 +01:00
Guillaume Mourier
426d63b01b
Update insta test suite 2023-02-02 12:27:56 +01:00
Guillaume Mourier
b078477d80
Add error handling and earth lap collision with bounding box 2023-02-02 12:17:38 +01:00
ManyTheFish
0bc1a18f52 Use Languages list detected during indexing at search time 2023-02-01 18:57:43 +01:00
ManyTheFish
643d99e0f9 Add expectancy test 2023-02-01 18:39:54 +01:00
ManyTheFish
064158e4e2 Update test 2023-02-01 15:34:01 +01:00
ManyTheFish
77d32d0ee8 Fix codec deserialization 2023-02-01 15:26:26 +01:00
Loïc Lecrenier
a2690ea8d4 Reduce incremental indexing time of words_prefix_position_docids DB
This database can easily contain millions of entries. Thus, iterating
over it can be very expensive.

For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words`
will always be empty. Thus, we can save a significant amount of time
by adding this `if !del_prefix_fst_words.is_empty()` condition.

The code's behaviour remains completely unchanged.
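
The guard described in this commit message could look roughly like the sketch below. The function name, the map layout, and the returned counter are assumptions made for illustration; only the `is_empty()` early exit mirrors the commit.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: removes every entry whose prefix was deleted and
/// returns how many entries had to be visited to do so.
fn delete_prefix_entries_sketch(
    db: &mut BTreeMap<(String, u16), Vec<u32>>,
    del_prefix_fst_words: &BTreeSet<String>,
) -> usize {
    // For a regular document addition nothing was deleted,
    // so the full scan of the database is skipped entirely.
    if del_prefix_fst_words.is_empty() {
        return 0;
    }
    let mut visited = 0;
    db.retain(|(prefix, _position), _docids| {
        visited += 1;
        !del_prefix_fst_words.contains(prefix)
    });
    visited
}
```

The result is the same with or without the guard; the guard only avoids iterating over a potentially very large database when there is nothing to delete.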
2023-01-31 11:42:24 +01:00
f3r10
2922c5c899 Fix code format 2023-01-31 11:28:05 +01:00
f3r10
7681be5367 Format code 2023-01-31 11:28:05 +01:00
f3r10
50bc156257 Fix tests 2023-01-31 11:28:05 +01:00
f3r10
d8207356f4 Skip script,language insertion if language is undetected 2023-01-31 11:28:05 +01:00
f3r10
2d58b28f43 Improve script language codec 2023-01-31 11:28:05 +01:00
f3r10
fd60a39f1c Format code 2023-01-31 11:28:05 +01:00
f3r10
369c05732e Add test checking that deleted docids were removed from the script_language_docids database
2023-01-31 11:28:05 +01:00
f3r10
34d04f3d3f Filter soft deleted documents from the script_language_docids database 2023-01-31 11:28:05 +01:00
f3r10
a27f329e3a Add tests for checking that detected script and language associated with document(s) were stored during indexing 2023-01-31 11:28:05 +01:00
f3r10
b216ddba63 Delete and clear data from the new database 2023-01-31 11:28:05 +01:00
f3r10
d97fb6117e Extract and index data 2023-01-31 11:28:05 +01:00
f3r10
c45d1e3610 Create a new database on index and add a specialized codec for it 2023-01-31 11:28:05 +01:00
Louis Dureuil
20f05efb3c
clippy: needless_lifetimes 2023-01-31 11:12:59 +01:00
Louis Dureuil
cbf029f64c
clippy: --fix 2023-01-31 11:12:59 +01:00
Louis Dureuil
3296cf7ae6
clippy: remove needless lifetimes 2023-01-31 09:32:40 +01:00
Louis Dureuil
89675e5f15
clippy: Replace seek 0 by rewind 2023-01-31 09:32:40 +01:00
Tamo
de3c4f1986 throw an error on unknown fields specified in the _geo field 2023-01-24 12:23:24 +01:00
bors[bot]
3521a3a0b2
Merge #763
763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner

# Pull Request

## Related issue
Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007)

## What does this PR do?
- Changes the function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string that cannot be parsed to f64
- Implements some unit tests
- Derives PartialEq for GeoError so that assert_eq! can be used in tests
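
A minimal sketch of this kind of validation follows. All names ending in `Sketch` are hypothetical, the value enum is a simplified stand-in for a JSON value, and the exact mapping from failures to error variants is an assumption rather than milli's real behaviour.

```rust
/// Simplified stand-in for a JSON value holding a coordinate (assumption).
#[derive(Debug, Clone, PartialEq)]
enum JsonValueSketch {
    Number(f64),
    Text(String),
}

/// Hypothetical error type; deriving `PartialEq` allows `assert_eq!` in tests.
#[derive(Debug, PartialEq)]
enum GeoErrorSketch {
    BadLatitude(JsonValueSketch),
    BadLongitude(JsonValueSketch),
    BadLatitudeAndLongitude { lat: JsonValueSketch, lng: JsonValueSketch },
}

/// Accepts numbers directly and strings that parse to `f64`.
fn to_f64(value: &JsonValueSketch) -> Option<f64> {
    match value {
        JsonValueSketch::Number(n) => Some(*n),
        JsonValueSketch::Text(s) => s.trim().parse::<f64>().ok(),
    }
}

fn validate_geo_sketch(
    lat: JsonValueSketch,
    lng: JsonValueSketch,
) -> Result<(f64, f64), GeoErrorSketch> {
    match (to_f64(&lat), to_f64(&lng)) {
        (Some(lat), Some(lng)) => Ok((lat, lng)),
        (None, Some(_)) => Err(GeoErrorSketch::BadLatitude(lat)),
        (Some(_), None) => Err(GeoErrorSketch::BadLongitude(lng)),
        // Assumption: both coordinates unparseable yields the combined error.
        (None, None) => Err(GeoErrorSketch::BadLatitudeAndLongitude { lat, lng }),
    }
}
```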

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
2023-01-19 15:15:46 +00:00
Philipp Ahlner
f5ca421227
Superfluous test removed 2023-01-19 15:39:21 +01:00
Louis Dureuil
4fd6fd9bef
Indicate filterable attributes when the user set a non filterable attribute in facet distributions 2023-01-19 12:25:18 +01:00
Philipp Ahlner
a2cd7214f0
Fixes error message when lat/lng are unparseable 2023-01-19 10:10:26 +01:00
ManyTheFish
d1fc42b53a Use compatibility decomposition normalizer in facets 2023-01-18 15:02:13 +01:00
Philipp Ahlner
497187083b
Add test for bug #3007: Wrong error message
Adds a test for #3007: Wrong error message when lat and lng are
unparseable
2023-01-18 13:24:26 +01:00
Clément Renault
1d507c84b2
Fix the formatting 2023-01-17 18:25:55 +01:00
Clément Renault
1b78231e18
Make clippy happy 2023-01-17 18:25:54 +01:00
bors[bot]
63af1e9f28
Merge #764
764: Update deserr to latest version r=irevoire a=loiclec

Update deserr to 0.1.5, which changes the `DeserializeFromValue` trait, getting rid of the `default()` method.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-17 10:39:36 +00:00
Loïc Lecrenier
f073a86387 Update deserr to latest version 2023-01-17 11:28:19 +01:00
bors[bot]
302d6cccd7
Merge #761
761: Integrate deserr r=irevoire a=loiclec

1. `Setting<T>` now implements `DeserializeFromValue`
2. The settings now store ranking rules as strongly typed `Criterion` instead of `String`, since the validation of the ranking rules will be done on meilisearch's side from now on


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-11 14:35:15 +00:00
bors[bot]
21b7d709ad
Merge #759
759: Change primary key inference error messages r=Kerollmops a=dureuill

# Pull Request

## Related issue
Milli part of https://github.com/meilisearch/meilisearch/issues/3301

## What does this PR do?
- Change error message strings

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-01-11 14:04:25 +00:00
Loïc Lecrenier
02fd06ea0b Integrate deserr 2023-01-11 13:56:47 +01:00
Louis Dureuil
00746b32c0
Add Index::map_size 2023-01-10 11:16:51 +01:00
Louis Dureuil
be9786bed9
Change primary key inference error messages 2023-01-05 10:40:09 +01:00
bors[bot]
c3f4835e8e
Merge #733
733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec

# Pull Request

## Related issue
Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118

## What does this PR do?
When a query ends with a word and a prefix, such as:
```
word pr
```
Then we first determine whether `pr` *could possibly* be in the proximity prefix database before querying it. There are then three possibilities:

1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pr` through the FST and query the regular proximity databases.

2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. **In this case, we partially disable the proximity ranking rule for the pair `word pr`.** This is done as follows:
   1. Only find the documents where `word` is in proximity to `pr` **exactly** (no derivations)
   2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8

3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.
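
The three cases above can be sketched as a small decision function. The enum, the function names, and the boolean inputs are hypothetical; the 2-byte limit is taken from this PR description, not from milli's source.

```rust
/// Hypothetical names for the three cases listed above.
#[derive(Debug, PartialEq)]
enum PrefixProximityPlanSketch {
    /// Case 1: expand the prefix into word derivations via the FST
    /// and query the regular proximity databases.
    ExpandDerivations,
    /// Case 2: look up the exact pair only, then assume proximity >= 8.
    ExactPairThenAssumeFar,
    /// Case 3: query the proximity prefix databases directly.
    QueryProximityPrefixDb,
}

/// Per the PR text, prefixes longer than 2 bytes cannot be in the
/// proximity prefix databases.
fn may_be_in_proximity_prefix_db(prefix: &str) -> bool {
    prefix.len() <= 2
}

fn plan_for_prefix(
    prefix: &str,
    in_prefix_cache: bool,
    found_in_proximity_prefix_db: bool,
) -> PrefixProximityPlanSketch {
    if !in_prefix_cache {
        PrefixProximityPlanSketch::ExpandDerivations
    } else if may_be_in_proximity_prefix_db(prefix) && found_in_proximity_prefix_db {
        PrefixProximityPlanSketch::QueryProximityPrefixDb
    } else {
        PrefixProximityPlanSketch::ExactPairThenAssumeFar
    }
}
```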

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
2. For common prefixes of more than two letters: we no longer distinguish between any proximities
3. For uncommon prefixes: nothing changes

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):
```json
[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]
```

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query `faster pro`:
```json
[
    { "text": "I heard there is a faster but less relevant proximity criterion" },
    { "text": "I heard there is a faster proximity criterion" }
]
```
But the following document would be considered more relevant than the two documents above:
```json
{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }
```

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. 

---

## Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.   

```
1. 10x 'a': 
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42s       = bad, ~2x regression
	- dynamic: 1s ⟹ 870 ms           = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12 ms       = very good, x20 speedup
	- dynamic: 21ms ⟹ 11 ms          = good, x2 speedup 
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, x300 speedup 
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~2x speedup
	- dynamic: ~4ms ⟹ ~2ms           = no significant difference
6. 'i am getting o' 
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1':
	- set-based: 3.36s ⟹ 120ms       = very good, 30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, 6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, 30x speedup 
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup
```

Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-04 09:00:50 +00:00
bors[bot]
49f58b2c47
Merge #732
732: Interpret synonyms as phrases r=loiclec a=loiclec

# Pull Request

## Related issue
Fixes (when merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3125

## What does this PR do?
We now map multi-word synonyms to phrases instead of loose words, so that the request:
```
btw I am going to nyc soon
```
is interpreted as (when the synonym interpretation is chosen for both `btw` and `nyc`):
```
"by the way" I am going to "New York City" soon
```
instead of:
```
by the way I am going to New York City soon
```

This prevents queries containing multi-word synonyms from exceeding the word length limit and degrading search performance.
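
A minimal sketch of the rewriting shown above, operating on plain strings. The function name and the `HashMap` of synonyms are assumptions for illustration; milli works on a query tree rather than on text.

```rust
use std::collections::HashMap;

/// Hypothetical helper: replaces each word that has a synonym,
/// quoting the synonym as a phrase when it spans several words.
fn expand_synonyms_as_phrases(query: &str, synonyms: &HashMap<&str, &str>) -> String {
    query
        .split_whitespace()
        .map(|word| match synonyms.get(word) {
            // Multi-word synonym: keep its words together as a phrase.
            Some(expansion) if expansion.split_whitespace().count() > 1 => {
                format!("\"{expansion}\"")
            }
            // Single-word synonym: a plain substitution is enough.
            Some(expansion) => expansion.to_string(),
            None => word.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```

With `btw` mapped to `by the way` and `nyc` mapped to `New York City`, the request from the example becomes two phrases plus five loose words instead of eleven loose words.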

In terms of relevancy, there is a debate to have. I personally think this could be considered an improvement, since it would be strange for a user to search for:
```
good DIY project
```
and have a result such as:
```
{
    "text": "whether it is a good project to do, you'll have to decide for yourself"
}
```
However, for synonyms such as `NYC -> New York City`, we will stop matching documents where `New York` is separated from `City`. This is however solvable by adding an additional mapping: `NYC -> New York`.

## Performance

With the old behaviour, some long search requests making heavy use of synonyms could take minutes to be executed. This is no longer the case: these search requests now take an average amount of time to be resolved.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-04 08:34:18 +00:00
bors[bot]
6a10e85707
Merge #736
736: Update charabia r=curquiza a=ManyTheFish

Update Charabia to the latest version.

> We are now Romanizing Chinese characters into Pinyin.
> Note that we keep the accents because they are in fact never typed directly by the end-user. Moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context.

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-01-03 15:44:41 +00:00
bors[bot]
9519e60f97
Merge #709
709: Optimise the `ExactWords` sub-criterion within `Exactness` r=loiclec a=loiclec

# Pull Request

## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3116

## What does this PR do?
1. Reduces the algorithmic complexity of finding the documents containing N exact words from something that is exponential to something that is polynomial.
2. Cache intermediary results between different calls to the `exactness` criterion.

## Performance Results
On the `smol_songs.csv` dataset, a request containing 10 common words now takes about 60ms instead of 5 seconds to execute. For example, this is the case with this (admittedly nonsensical) request: `Rock You Hip Hop Folk World Country Electronic Love The`.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-02 12:28:30 +00:00
Loïc Lecrenier
b5df889dcb Apply review suggestions: simplify implementation of exactness criterion 2023-01-02 13:11:47 +01:00
Loïc Lecrenier
8d36570958 Add explicit criterion impl strategy to proximity search tests 2023-01-02 10:37:01 +01:00
Loïc Lecrenier
32c6062e65 Optimise exactness criterion
1. Cache some results between calls to next()
2. Compute the combinations of exact words more efficiently
2022-12-22 12:28:45 +01:00
Loïc Lecrenier
f097aafa1c Add unit test for prefix handling by the proximity criterion 2022-12-22 12:08:00 +01:00
Loïc Lecrenier
777b387dc4 Avoid a prefix-related worst-case scenario in the proximity criterion 2022-12-22 12:08:00 +01:00
Loïc Lecrenier
b0f3dc2c06 Interpret synonyms as phrases 2022-12-22 12:07:51 +01:00
Louis Dureuil
4b166bea2b
Add primary_key_inference test 2022-12-21 15:13:38 +01:00
Louis Dureuil
5943100754
Fix existing tests 2022-12-21 15:13:38 +01:00
Louis Dureuil
b24def3281
Add logging when inference took place.
Displays log message in the form:
```
[2022-12-21T09:19:42Z INFO  milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id'
```
2022-12-21 15:13:38 +01:00
Louis Dureuil
402dcd6b2f
Simplify primary key inference 2022-12-21 15:13:38 +01:00
Louis Dureuil
13c95d25aa
Remove uses of UserError::MissingPrimaryKey not related to inference 2022-12-21 15:13:36 +01:00
bors[bot]
a8defb585b
Merge #742
742: Add a "Criterion implementation strategy" parameter to Search r=irevoire a=loiclec

Add a parameter to search requests which determines the implementation strategy of the criteria. This can be either `set-based`, `iterative`, or `dynamic` (ie choosing between set-based or iterative at search time). See https://github.com/meilisearch/milli/issues/755 for more context about this change.
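
One way to picture the parameter is as an enum that gets resolved to a concrete implementation at search time. The names below and the threshold of 1000 candidates are assumptions made for this sketch (the figure echoes the one quoted in the description of #733); they are not milli's actual definitions.

```rust
/// Hypothetical mirror of the strategy parameter described above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StrategySketch {
    SetBased,
    Iterative,
    Dynamic,
}

/// The implementation actually run by a criterion.
#[derive(Debug, PartialEq, Eq)]
enum ResolvedSketch {
    SetBased,
    Iterative,
}

/// Assumed cut-off below which the iterative version is preferred.
const CANDIDATES_THRESHOLD_SKETCH: usize = 1000;

fn resolve_strategy(strategy: StrategySketch, candidates: usize) -> ResolvedSketch {
    match strategy {
        StrategySketch::SetBased => ResolvedSketch::SetBased,
        StrategySketch::Iterative => ResolvedSketch::Iterative,
        // `Dynamic` picks an implementation from the number of candidates.
        StrategySketch::Dynamic if candidates < CANDIDATES_THRESHOLD_SKETCH => {
            ResolvedSketch::Iterative
        }
        StrategySketch::Dynamic => ResolvedSketch::SetBased,
    }
}
```

Forcing `SetBased` or `Iterative` is mainly useful in tests, where both code paths should be exercised regardless of dataset size.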


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-21 12:18:49 +00:00
Loïc Lecrenier
339a4b0789 Make clippy happy 2022-12-21 12:49:34 +01:00
Loïc Lecrenier
229405aeb9 Choose implementation strategy of criterion at runtime 2022-12-21 09:29:39 +01:00
Loïc Lecrenier
fc0e7382fe Fix hard-deletion of an external id that was soft-deleted 2022-12-20 15:33:31 +01:00
Tamo
69edbf9f6d
Update milli/src/update/delete_documents.rs 2022-12-19 18:23:50 +01:00
Louis Dureuil
916c23e7be
Tests: rename snapshots 2022-12-19 10:07:17 +01:00
Louis Dureuil
ad9937c755
Fix tests after adding DeletionStrategy 2022-12-19 10:07:17 +01:00
Louis Dureuil
171c942282
Soft-deletion computation no longer takes into account the mapsize
Implemented solution 2.3 from https://github.com/meilisearch/meilisearch/issues/3231#issuecomment-1348628824
2022-12-19 10:07:17 +01:00
Louis Dureuil
e2ae3b24aa
Hard or soft delete according to the deletion strategy 2022-12-19 10:00:13 +01:00
Louis Dureuil
fc7618d49b
Add DeletionStrategy 2022-12-19 09:49:58 +01:00
ManyTheFish
7f88c4ff2f Fix #1714 test 2022-12-15 18:22:28 +01:00
ManyTheFish
96d4242b93 Update charabia 2022-12-15 18:22:22 +01:00
bors[bot]
5114686394
Merge #743
743: Fix finite pagination with placeholder search r=Kerollmops a=ManyTheFish

this bug is reproducible on real datasets and is hard to isolate in a simple test.

related to: https://github.com/meilisearch/meilisearch/issues/3200

poke `@curquiza` 

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-12-15 09:31:47 +00:00
ManyTheFish
3322018c06 Fix placeholder search 2022-12-14 20:09:47 +01:00
bors[bot]
0276d5212a
Merge #728
728: Add some integration tests on the sort criterion r=ManyTheFish a=loiclec

This is simply an integration test ensuring that the sort criterion works properly. 

However, only one version of the algorithm is tested here (the iterative one). To test the version that uses the facet DB, one has to manually set the `CANDIDATES_THRESHOLD` constant to `0`. I have done that and ensured that the test still succeeds. However, in the future, we will probably want to have an option to force which algorithm is used at runtime, for testing purposes.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-14 09:27:12 +00:00
bors[bot]
406ee31d1a
Merge #737
737: Fix typo initial candidates computation r=Kerollmops a=ManyTheFish

When the `Typo` criterion came after a criterion other than `Words`, and that previous criterion returned no candidates at the first iteration of the bucket sort, the `initial_candidates` were lost.

Now, `Typo` ensures that the `initial_candidates` are kept between iterations.


related to https://github.com/meilisearch/meilisearch/issues/3200#issuecomment-1345179578
related to https://github.com/meilisearch/meilisearch/issues/3228

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-12-13 10:29:28 +00:00
ManyTheFish
2d8d0af1a6 Rename short name bc by ic for initial_candidates 2022-12-13 10:56:38 +01:00
Loïc Lecrenier
be3b00350c Apply review suggestions: naming and documentation 2022-12-13 10:15:22 +01:00
ManyTheFish
80d34a4169 Fix typo initial candidates computation 2022-12-12 19:02:48 +01:00
Loïc Lecrenier
e3ee553dcc Remove soft deleted ids from ExternalDocumentIds during document import
If the document import replaces a document using hard deletion
2022-12-12 14:16:09 +01:00
Loïc Lecrenier
bebd050961 Add new test for bug 3021 2022-12-08 19:19:40 +01:00
ManyTheFish
55724f2412 Introduce an initial candidates set that makes the difference between an exhaustive count and an estimation 2022-12-08 09:41:34 +01:00
Loïc Lecrenier
f37c86e0b2 Add some integration tests on the sort criterion 2022-12-07 15:59:33 +01:00
Loïc Lecrenier
d38cc73630 Add one more filter "integration" test 2022-12-07 14:38:25 +01:00
Loïc Lecrenier
e688581c36 Add tests for facet range search on different field ids 2022-12-07 14:38:21 +01:00
Loïc Lecrenier
4ac8f96342 Simplify implementation of equality condition in filters 2022-12-07 14:38:18 +01:00
Loïc Lecrenier
1c9555566e Fix bug in facet range search 2022-12-07 14:38:14 +01:00
Loïc Lecrenier
303d740245 Prepare fix within facet range search
By creating snapshots and updating the format of the existing
snapshots. The next commit will apply the fix, which will show
its effects cleanly on the old and new snapshot tests
2022-12-07 14:38:10 +01:00
bors[bot]
0a301b5f88
Merge #723
723: Fix bug in handling of soft deleted documents when updating settings r=Kerollmops a=loiclec

# Pull Request

## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3021

## What does this PR do?
This PR fixes the bug where a `missing key in documents database` internal error message could appear when indexing documents.

When updating the settings, before clearing the database and before creating the transform output, we now modify the `ExternalDocumentsIds` structure to get rid of all references to soft deleted document ids in its FSTs.

It used to be that updating the settings would clear the soft-deleted document ids, but keep the original `ExternalDocumentsIds` structure. As a consequence of this, when processing a future document addition, we could wrongly believe that a document was being replaced when, in fact, it was a completely new document. See the tests `bug_3021_first`, `bug_3021_second`, and `bug_3021` for a minimal test case that would have reproduced the issue.
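
The idea can be sketched with ordinary maps. Everything here is an assumption made for illustration: the real `ExternalDocumentsIds` structure is backed by FSTs, and the helper names are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical sketch: drops every external id that still points
/// to a soft-deleted internal document id.
fn purge_soft_deleted(
    external_ids: &mut BTreeMap<String, u32>,
    soft_deleted: &BTreeSet<u32>,
) {
    external_ids.retain(|_external, internal| !soft_deleted.contains(&*internal));
}

/// A document addition is treated as a replacement only when
/// its external id is still known.
fn is_replacement(external_ids: &BTreeMap<String, u32>, external_id: &str) -> bool {
    external_ids.contains_key(external_id)
}
```

Without the purge, a stale mapping would make `is_replacement` answer `true` for a document that no longer exists, which is the confusion described above.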
 
We need to take special care to:
- evaluate how users should update to v0.30.1 (containing this fix): dump? reimporting all documents from scratch?
- understand IF/HOW this bug could have caused duplicate documents to be returned 
- and evaluate the correctness of the fix, of course :)


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-12-06 14:37:38 +00:00
Loïc Lecrenier
a993b68684 Cargo fmt >:-( 2022-12-06 15:22:10 +01:00
Loïc Lecrenier
80c7a00567 Fix compilation error in tests of settings update 2022-12-06 15:19:26 +01:00
Loïc Lecrenier
67d8cec209 Fix bug in handling of soft deleted documents when updating settings 2022-12-06 15:09:19 +01:00
bors[bot]
2a846aaae7
Merge #719
719: Add more members of `filter_parser` to `milli::` & `From<&str>` implementation for `Token` r=Kerollmops a=GregoryConrad

## What does this PR do?
The current `milli::Filter` and `milli::FilterCondition` APIs require working with some members of `filter_parser` directly that `milli::` does *not* re-export to its users (at least when not parsing input using `parse`). Also, using `filter_parser` does not make sense when using milli from an embedded context where there is no query to parse.

Instead of reworking `milli::Filter` and `milli::FilterCondition`, this PR adds two non-breaking changes that ease the use of milli:
- Re-exports more members of the dependent version of `filter_parser` in `milli`
- Implements `From<&str>` for `filter_parser::Token`
  - This will also allow some basic tests that need to create a `Token` from a string to avoid some boilerplate.

In conjunction, both of these will allow milli users to easily create a `Token` from a `&str` without needing to add `filter_parser` as an extra dependency.

Note: I wanted to use `FromStr` for the `From` implementation; however, it requires returning a `Result` which is not needed for the conversion. Thus, I just left it as `From<&str>`.
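
For readers unfamiliar with the pattern, here is a minimal sketch of an infallible `From<&str>` conversion. `TokenSketch` is a hypothetical, much simplified stand-in and does not reflect the fields of `filter_parser::Token`.

```rust
/// Hypothetical simplified token that borrows its text from the input.
#[derive(Debug, PartialEq)]
struct TokenSketch<'a> {
    text: &'a str,
}

impl<'a> From<&'a str> for TokenSketch<'a> {
    // The conversion cannot fail, so no `Result` is involved.
    fn from(text: &'a str) -> Self {
        TokenSketch { text }
    }
}

/// Helper showing that callers can pass either a `&str` or a token.
fn token_len<'a>(token: impl Into<TokenSketch<'a>>) -> usize {
    token.into().text.len()
}
```

Besides requiring a `Result`, `FromStr::from_str` does not tie the lifetime of its output to the input string, so a type that borrows from the input, as this sketch does, could not implement it anyway.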

Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
2022-12-06 10:36:00 +00:00
Tamo
212dbfa3b5
Update milli/src/search/facet/filter.rs 2022-12-05 20:56:21 +01:00
amab8901
456da5de9c Geosearch for zero radius 2022-12-05 20:11:46 +01:00
Loïc Lecrenier
cda4ba2bb6 Add document import tests 2022-12-05 12:02:49 +01:00
Loïc Lecrenier
ae59d37b75 Improve insta-snap of the external document ids 2022-12-05 10:51:02 +01:00
Loïc Lecrenier
f2cf981641 Add more tests and allow disabling of soft-deletion outside of tests
Also allow disabling soft-deletion in the IndexDocumentsConfig
2022-12-05 10:51:01 +01:00
Gregory Conrad
50954d31fa feat: Re-export Span and Token to milli:: 2022-12-03 13:37:33 -05:00
bors[bot]
d3731dda48
Merge #706
706: Limit the reindexing caused by updating settings when not needed r=curquiza a=GregoryConrad

## What does this PR do?
When updating index settings using `update::Settings`, sometimes a `reindex` of `update::Settings` is triggered when it doesn't need to be. This PR aims to prevent those unnecessary `reindex` calls.

For reference, here is a snippet from the current `execute` method in `update::Settings`:
```rust
// ...
if stop_words_updated
    || faceted_updated
    || synonyms_updated
    || searchable_updated
    || exact_attributes_updated
{
    self.reindex(&progress_callback, &should_abort, old_fields_ids_map)?;
}
```

- [x] `faceted_updated` - looks good as-is 
- [x] `stop_words_updated` - looks good as-is 
- [x] `synonyms_updated` - looks good as-is 
- [x] `searchable_updated` - fixed in this PR
- [x] `exact_attributes_updated` - fixed in this PR
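
The kind of check that avoids a needless reindex can be sketched as a comparison between the stored value and the requested update. `SettingSketch` and `searchable_updated_sketch` are hypothetical names, and the comparison rule is an assumption about the intent of this PR, not a copy of its code.

```rust
/// Hypothetical three-state setting update.
#[derive(Debug, Clone, PartialEq)]
enum SettingSketch<T> {
    Set(T),
    Reset,
    NotSet,
}

/// Returns `true` only when the update really changes the stored value,
/// which is the only case where a reindex would be needed.
fn searchable_updated_sketch(
    current: Option<&[String]>,
    update: &SettingSketch<Vec<String>>,
) -> bool {
    match update {
        // Setting the same list again changes nothing.
        SettingSketch::Set(new) => current != Some(new.as_slice()),
        // Resetting only matters if something was set before.
        SettingSketch::Reset => current.is_some(),
        SettingSketch::NotSet => false,
    }
}
```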

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Gregory Conrad <gregorysconrad@gmail.com>
2022-12-01 13:58:02 +00:00
bors[bot]
5e754b3ee0
Merge #708
708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec

# Pull Request

## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115 

## What does this PR do?
1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x. 
   This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built.
 
2. Avoids the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli.
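
The deduplication in point 1 can be sketched with `Rc` and a `HashMap`. The struct fields, the cache type, and the choice of `(word, typos, is_prefix)` as the signature are assumptions for illustration; the real `MatchingWord` holds a DFA that this sketch leaves out.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for a heavy matching word (the DFA is omitted).
#[derive(Debug)]
struct MatchingWordSketch {
    word: String,
    typos: u8,
    is_prefix: bool,
}

/// Cache keyed by an assumed signature of (word, typos, is_prefix).
#[derive(Default)]
struct MatchingWordCacheSketch {
    map: HashMap<(String, u8, bool), Rc<MatchingWordSketch>>,
    /// Number of values actually built, for demonstration purposes.
    built: usize,
}

impl MatchingWordCacheSketch {
    fn get_or_build(&mut self, word: &str, typos: u8, is_prefix: bool) -> Rc<MatchingWordSketch> {
        let key = (word.to_string(), typos, is_prefix);
        if let Some(existing) = self.map.get(&key) {
            // Same signature: share the existing value instead of rebuilding it.
            return Rc::clone(existing);
        }
        self.built += 1;
        let value = Rc::new(MatchingWordSketch { word: word.to_string(), typos, is_prefix });
        self.map.insert(key, Rc::clone(&value));
        value
    }
}
```

Cloning an `Rc` only bumps a reference count, so a query tree that mentions the same word many times pays for the expensive construction once.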

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-11-30 17:47:34 +00:00
bors[bot]
e1612fcb01
Merge #712
712: Fix bulk facet indexing bug r=Kerollmops a=loiclec

# Pull Request

## Related issue
Fixes (partially, until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3165

## What does this PR do?
Fixes a bug where indexing certain numbers of filterable attribute values in bulk led to corrupted facet databases. This was due to a lossy integer conversion which would ultimately prevent entire levels of the facet database from being written into LMDB.

More specifically, this change was made:
```diff
      - if cur_writer_len as u8 >= self.min_level_size {
      + if cur_writer_len >= self.min_level_size as usize {
```
I also checked other comparisons to `min_level_size` and other conversions such as `x as u8` in this part of the codebase.
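
To see why the original comparison was lossy, consider the two standalone functions below. Their names are hypothetical; the bodies reproduce the two sides of the diff above. In Rust, `usize as u8` keeps only the low 8 bits, so a length of 256 becomes 0.

```rust
/// The comparison before the fix: the length is truncated to 8 bits first.
fn level_is_full_buggy(cur_writer_len: usize, min_level_size: u8) -> bool {
    cur_writer_len as u8 >= min_level_size
}

/// The comparison after the fix: the small value is widened instead.
fn level_is_full_fixed(cur_writer_len: usize, min_level_size: u8) -> bool {
    cur_writer_len >= min_level_size as usize
}
```

With `min_level_size = 5`, a writer holding 256 entries is reported as not full by the first function (256 truncates to 0) and as full by the second, while both agree for lengths below 256.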



Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-11-30 16:51:48 +00:00
Loïc Lecrenier
9dd4b33a9a Fix bulk facet indexing bug 2022-11-30 14:27:36 +01:00
Gregory Conrad
87e2bc3bed fix(reindex): reindex in a few more cases
Cases: whenever searchable_fields OR user_defined_searchable_fields is modified
2022-11-28 13:12:19 -05:00
Loïc Lecrenier
61b58b115a Don't create partial matching words for synonyms in ngrams 2022-11-28 16:32:28 +01:00
Gregory Conrad
d3182f3830 refactor: Change return type to keep consistency with others 2022-11-28 10:02:03 -05:00
Loïc Lecrenier
f70856bab1 Remove memory usage test that fails when many tests are run in parallel 2022-11-28 12:55:28 +01:00
Loïc Lecrenier
e2ebed62b1 Don't create partial matching words for synonyms, split words, phrases 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
8284bd760f Relax memory ordering of operations within the test CountingAlloc 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
8d0ace2d64 Avoid creating a MatchingWord for words that exceed the length limit 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
86c34a996b Deduplicate matching words 2022-11-28 10:20:13 +01:00
Gregory Conrad
e0d24104a3 refactor: Rewrite another method chain to be more readable 2022-11-26 13:33:19 -05:00
Gregory Conrad
2db738dbac refactor: rewrite method chain to be more readable 2022-11-26 13:26:39 -05:00
Gregory Conrad
935a724c57 revert: Revert pass by reference API change 2022-11-24 10:08:23 -05:00
Gregory Conrad
ed29cceae9 perf: Prevent reindex in searchable set case when not needed 2022-11-23 22:33:06 -05:00
Gregory Conrad
bb9e33bf85 perf: Prevent reindex in searchable reset case when not needed 2022-11-23 22:01:46 -05:00
Gregory Conrad
7c0e544839 feat: Add all_obkv_to_json function 2022-11-23 21:18:58 -05:00
Gregory Conrad
d19c8672bb perf: limit reindex to when exact_attributes changes 2022-11-23 15:50:53 -05:00
bors[bot]
57c9f03e51
Merge #697
697: Fix bug in prefix DB indexing r=loiclec a=loiclec

Where the batch's information was not properly updated in cases where only the proximity changed between two consecutive word pair proximities.

Closes partially https://github.com/meilisearch/meilisearch/issues/3043



Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-11-17 15:22:01 +00:00
Loïc Lecrenier
777eb3fa00 Add insta-snaps for test of bug 3043 2022-11-17 12:21:27 +01:00
Loïc Lecrenier
0caadedd3b Make clippy happy 2022-11-17 12:17:53 +01:00
Loïc Lecrenier
ac3baafbe8 Truncate facet values that are too long before indexing them 2022-11-17 11:29:42 +01:00
Loïc Lecrenier
990a861241 Add test for indexing a document with a long facet value 2022-11-17 11:29:42 +01:00
Loïc Lecrenier
d95d02cb8a Fix Facet Indexing bugs
1. Handle keys with variable length correctly

This fixes https://github.com/meilisearch/meilisearch/issues/3042 and
is easily reproducible with the updated fuzz tests, which now generate
keys with variable lengths.

2. Prevent adding facets to the database if their encoded value does
not satisfy `valid_lmdb_key`.

This fixes an indexing failure when a document had a filterable
attribute containing a value whose length is higher than ~500 bytes.
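
A rough sketch of the idea, assuming a key length limit of 500 bytes. The
constant and both helper names are hypothetical and are not milli's actual
`valid_lmdb_key`:

```rust
/// Assumed maximum key length in bytes, for illustration only.
const MAX_KEY_LENGTH: usize = 500;

/// Hypothetical validity check: a key must be non-empty and within the limit.
pub fn is_valid_key(key: &[u8]) -> bool {
    !key.is_empty() && key.len() <= MAX_KEY_LENGTH
}

/// Truncates `value` to at most `max_len` bytes without splitting a UTF-8
/// character, so the result is still a valid string.
pub fn truncate_to_char_boundary(value: &str, max_len: usize) -> &str {
    if value.len() <= max_len {
        return value;
    }
    let mut end = max_len;
    // Index 0 is always a char boundary, so this loop terminates.
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    &value[..end]
}
```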
2022-11-17 11:29:42 +01:00
Loïc Lecrenier
f00108d2ec Fix name of bug in reproduction test 2022-11-17 11:29:18 +01:00
Loïc Lecrenier
f7c8730d09 Fix bug in prefix DB indexing
Where the batch's information was not properly updated in cases
where only the proximity changed between two consecutive word pair
proximities.

Closes https://github.com/meilisearch/meilisearch/issues/3043
2022-11-17 11:29:18 +01:00
bors[bot]
24a298a83c
Merge #690
690: Fix soft deleted bug settings r=ManyTheFish a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-11-08 13:45:10 +00:00
bors[bot]
d85cd9bf1a
Merge #689
689: Handle non-finite floats consistently in filters r=irevoire a=dureuill

# Pull Request

## Related issue

Related meilisearch/meilisearch#3000

## What does this PR do?

### User

- Filters using `field = inf` (or `infinite`, `NaN`) now match the value as a string rather than returning an internal error.
- Filters using `field < inf` (or other comparison operators) now return an invalid_filter error rather than returning an internal error, much like when using `field < aaa`.

### Implementation

- Add new `NonFiniteFloat` error variants to the filter-parser errors
- Add `Token::parse_as_finite_float` that can fail both when the string is not a float and when the float is not finite
- Refactor `Filter::inner_evaluate` to always use `parse_as_finite_float` instead of just `parse`
- Add corresponding tests
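
As a rough illustration of the two failure modes, here is a standalone sketch. It is a free function rather than the actual `Token::parse_as_finite_float` method, and the `FiniteFloatError` type is a hypothetical stand-in for the new filter-parser error variants:

```rust
use std::num::ParseFloatError;

/// Hypothetical error type standing in for the filter-parser error variants.
#[derive(Debug, PartialEq)]
pub enum FiniteFloatError {
    /// The string is not a float at all (for example `aaa`).
    NotAFloat(ParseFloatError),
    /// The string parses as a float, but the value is infinite or NaN.
    NonFinite,
}

/// Parses `s` as an `f64` and rejects non-finite values.
pub fn parse_as_finite_float(s: &str) -> Result<f64, FiniteFloatError> {
    let value: f64 = s.parse().map_err(FiniteFloatError::NotAFloat)?;
    if value.is_finite() {
        Ok(value)
    } else {
        Err(FiniteFloatError::NonFinite)
    }
}
```

Keeping the two cases apart lets the caller choose a different behaviour for each, such as falling back to a string comparison for `=` and reporting an invalid filter for `<`.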

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2022-11-08 13:24:38 +00:00
Kerollmops
37b3c5c323
Fix transform to use all_documents and ignore soft_deleted documents 2022-11-08 14:23:16 +01:00
Kerollmops
1b1ad1923b
Add a test to check that we take care of soft deleted documents 2022-11-08 14:23:14 +01:00
Louis Dureuil
a836b8e703
tests: Tests filter with non-finite floats 2022-11-08 13:56:55 +01:00
Louis Dureuil
3328560788
fix: allow filters on = inf, = NaN, return InvalidFilter for < inf, < NaN
Fixes meilisearch/meilisearch#3000
2022-11-08 13:27:15 +01:00
unvalley
abf1cf9cd5 Fix clippy errors 2022-11-04 09:27:46 +09:00
unvalley
70465aa5ce Execute cargo fmt 2022-11-04 08:59:58 +09:00
unvalley
3009981d31 Fix clippy errors
Add clippy job

Add clippy job to CI
2022-11-04 08:58:14 +09:00
bors[bot]
6add470805
Merge #659
659: Fix clippy error to add clippy job on Ci r=Kerollmops a=unvalley

## Related PR
This PR is for #673 

## What does this PR do?
- ~~add `Run Clippy` job to CI (rust.yml)~~
- apply `cargo clippy --fix` command
- fix some `cargo clippy` errors manually (but warnings still remain on tests)

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: unvalley <kirohi.code@gmail.com>
Co-authored-by: unvalley <38400669+unvalley@users.noreply.github.com>
2022-11-03 15:24:38 +00:00
unvalley
13175f2339 refactor: match for filterCondition 2022-11-03 17:34:33 +09:00
Shashank Kashyap
a07f0a4a43
Delete facet_string_zero_bounds_value_codec.rs 2022-10-30 08:59:04 +05:30
Shashank Kashyap
2dec6e86e9
Delete facet_string_level_zero_value_codec.rs 2022-10-30 08:58:36 +05:30
bors[bot]
c965200010
Merge #664
664: Fix phrase search containing stop words r=ManyTheFish a=Samyak2

# Pull Request

This is a WIP draft PR that I created to let other potential contributors know that I'm working on this issue. I'll complete it within a few hours of opening it.

## Related issue
Fixes #661 and towards fixing meilisearch/meilisearch#2905

## What does this PR do?
- [x] Change Phrase Operation to use a `Vec<Option<String>>` instead of `Vec<String>` where `None` corresponds to a stop word
- [x] Update all other uses of phrase operation
- [x] Update `resolve_phrase`
- [x] Update `create_primitive_query`?
- [x] Add test
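
A small sketch of the new phrase representation, under the assumption that stop words are looked up in a plain `HashSet`. Both function names are hypothetical and are not milli's API:

```rust
use std::collections::HashSet;

/// Builds a phrase where each stop word is replaced by `None`, so the
/// positions of the remaining words are preserved.
pub fn phrase_words(phrase: &str, stop_words: &HashSet<&str>) -> Vec<Option<String>> {
    phrase
        .split_whitespace()
        .map(|w| if stop_words.contains(w) { None } else { Some(w.to_owned()) })
        .collect()
}

/// Returns `(position, word)` pairs for the non-stop words. Enumerating
/// before filtering keeps the original positions in the phrase.
pub fn indexed_words(phrase: &[Option<String>]) -> Vec<(usize, &str)> {
    phrase
        .iter()
        .enumerate()
        .filter_map(|(i, w)| w.as_deref().map(|w| (i, w)))
        .collect()
}
```

Because the `None` slots keep their place, the distance between two words separated by a stop word can still be computed from their indices.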

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?


Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
Co-authored-by: Samyak Sarnayak <samyak201@gmail.com>
2022-10-29 13:42:52 +00:00
unvalley
d55f0e2e53 Execute cargo fmt 2022-10-28 23:42:23 +09:00
unvalley
d53a80b408 Fix clippy error 2022-10-28 23:41:35 +09:00
Samyak Sarnayak
ecb88143f9
Run cargo fmt 2022-10-28 19:37:02 +05:30
Samyak Sarnayak
03eb5d87c1
Only call plane_sweep on subgroups when 2 or more are present 2022-10-28 19:32:05 +05:30
unvalley
a1d7ed1258 fix clippy error and remove clippy job from ci
Remove clippy job

Fix clippy error type_complexity

Restore ambiguous change
2022-10-28 22:33:50 +09:00
unvalley
f3c0b05ae8 Fix rust fmt 2022-10-28 09:32:31 +09:00
unvalley
f4ec1abb9b Fix all clippy error after conflicts 2022-10-27 23:58:13 +09:00
Samyak S Sarnayak
d35afa0cf5
Change consecutive phrase search grouping logic
Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-10-26 23:10:48 +05:30
unvalley
c7322f704c Fix cargo clippy errors
Don't apply clippy for tests for now

Fix clippy warnings of filter-parser package

parent 8352febd646ec4bcf56a44161e5c4dce0e55111f
author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900
committer unvalley <kirohi.code@gmail.com> 1666791316 +0900

Update .github/workflows/rust.yml

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>

Allow clippy lint too_many_arguments

Allow clippy lint needless_collect

Allow clippy lint too_many_arguments and type_complexity

Fix for clippy warnings comparison_chains

Fix for clippy warnings vec_init_then_push

Allow clippy lint should_implement_trait

Allow clippy lint drop_non_drop

Fix lifetime clippy warnings in filter-parser

Execute cargo fmt

Fix clippy remaining warnings

Fix clippy remaining warnings again and allow lint on each place
2022-10-27 01:04:23 +09:00
unvalley
811f156031 Execute cargo clippy --fix 2022-10-27 01:00:00 +09:00
Samyak S Sarnayak
af33d22f25
Consecutive is false when at least 1 stop word is surrounded by words 2022-10-26 19:09:45 +05:30
Samyak S Sarnayak
77f1ff019b
Simplify stop word checking in create_primitive_query 2022-10-26 19:09:44 +05:30
Samyak S Sarnayak
2aa11afb87
Fix panic when phrase contains only one stop word and nothing else 2022-10-26 19:09:42 +05:30
Samyak S Sarnayak
bb9ce3c5c5
Run cargo fmt 2022-10-26 19:09:03 +05:30
Samyak S Sarnayak
d187b32a28
Fix snapshots to use new phrase type 2022-10-26 19:09:03 +05:30
Samyak S Sarnayak
c8c666c6a6
Use resolve_phrase in exactness and typo criteria 2022-10-26 19:09:01 +05:30
Samyak S Sarnayak
3e190503e6
Search for closest non-stop words in proximity criteria 2022-10-26 19:08:34 +05:30
Samyak S Sarnayak
709ab3c14c
Increment position even when it's a stop word in exactness criteria 2022-10-26 19:08:33 +05:30
Samyak S Sarnayak
ef13c6a5b6
Perform filter after enumerate to keep origin indices 2022-10-26 19:08:33 +05:30
Samyak S Sarnayak
62816dddde
[WIP] Fix phrase search containing stop words
Fixes #661 and meilisearch/meilisearch#2905
2022-10-26 19:08:06 +05:30
Loïc Lecrenier
54c0cf93fe Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 15:13:34 +02:00
bors[bot]
365f44c39b
Merge #668
668: Fix many Clippy errors part 2 r=ManyTheFish a=ehiggs

This brings us a step closer to enforcing clippy on each build.

# Pull Request

## Related issue
This does not fix any issue outright, but it is a second round of fixes for clippy after https://github.com/meilisearch/milli/pull/665. This should contribute to fixing https://github.com/meilisearch/milli/pull/659.

## What does this PR do?

Satisfies many issues for clippy. The complaints are mostly:

* Passing a reference where a variable is already a reference.
* Using clone where a struct already implements `Copy`
* Using `ok_or_else` with a closure that merely returns a value rather than calling a function (hence we use `ok_or`)
* Unambiguous lifetimes don't need names, so we can just use `'_`
* Using `return` when it is not needed as we are on the last expression of a function.
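
For reference, a short sketch of what code looks like after applying these kinds of fixes. The `DocId` and `Ids` types are hypothetical and are not taken from the milli codebase:

```rust
/// Hypothetical small `Copy` type, used only for this illustration.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct DocId(pub u32);

/// Returns the first id, or a constant error when the slice is empty.
pub fn first_id(ids: &[DocId]) -> Result<DocId, &'static str> {
    // `copied()` instead of `cloned()`: `DocId` implements `Copy`.
    // `ok_or` instead of `ok_or_else`: the error is a plain value, no call needed.
    // No `return` keyword: this is the last expression of the function.
    ids.first().copied().ok_or("no document ids")
}

/// Hypothetical wrapper borrowing a slice of ids.
pub struct Ids<'a>(pub &'a [DocId]);

// The lifetime is unambiguous here, so it does not need a name: `'_` is enough.
impl Ids<'_> {
    pub fn count(&self) -> usize {
        self.0.len()
    }
}
```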

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Ewan Higgs <ewan.higgs@gmail.com>
2022-10-26 12:16:24 +00:00
Loïc Lecrenier
2741756248 Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 14:03:23 +02:00
Loïc Lecrenier
b7f2428961 Fix formatting and warning after rebasing from main 2022-10-26 13:49:33 +02:00
Loïc Lecrenier
3b1f908e5e Revert behaviour of facet distribution to what it was before
Where the docid that is used to get the original facet string value
definitely belongs to the candidates
2022-10-26 13:48:01 +02:00
Loïc Lecrenier
14ca8048a8 Add some documentation on how to run the facet db fuzzer 2022-10-26 13:48:01 +02:00
Loïc Lecrenier
206a3e00e5 cargo fmt 2022-10-26 13:48:01 +02:00