Commit Graph

1097 Commits

Author SHA1 Message Date
Tamo
74dcfe9676
Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform 2023-02-14 19:09:40 +01:00
Tamo
1b1703a609
make a small optimization to merge obkvs a little bit faster 2023-02-14 18:32:41 +01:00
Tamo
fb5e4957a6
fix and test the early exit in case a grenad ends with a deletion 2023-02-14 18:23:57 +01:00
Tamo
8de3c9f737
Update milli/src/update/index_documents/transform.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-02-14 17:57:14 +01:00
Tamo
43a19d0709
document the operation enum + the grenads 2023-02-14 17:55:26 +01:00
Tamo
746b31c1ce
makes clippy happy 2023-02-09 12:23:01 +01:00
Tamo
93db755d57
add a test to ensure we handle correctly a deletion of multiple time the same document 2023-02-08 21:03:34 +01:00
Tamo
93f130a400
fix all warnings 2023-02-08 20:57:35 +01:00
Tamo
421a9cf05e
provide a new method on the transform to remove documents 2023-02-08 16:06:09 +01:00
Tamo
8f64fba1ce
rewrite the current transform to handle a new byte specifying the kind of operation it's merging 2023-02-08 12:53:38 +01:00
bors[bot]
c88c3637b4
Merge #3461
3461: Bring v1 changes into main r=curquiza a=Kerollmops

Also bring back changes in milli (the remote repository) into main done during the pre-release

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2023-02-07 11:27:27 +00:00
Tamo
42114325cd
Apply suggestions from code review
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-02-06 18:07:00 +01:00
Tamo
7a38fe624f
throw an error if the top left corner is found below the bottom right corner 2023-02-06 17:50:47 +01:00
Tamo
1b005f697d
update the syntax of the geoboundingbox filter to uses brackets instead of parens around lat and lng 2023-02-06 16:50:27 +01:00
Kerollmops
fbec48f56e
Merge remote-tracking branch 'milli/main' into bring-v1-changes 2023-02-06 16:48:10 +01:00
Tamo
3ebc99473f
Apply suggestions from code review
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-02-06 13:29:37 +01:00
Tamo
d27007005e
comments the geoboundingbox + forbid the usage of the lexeme method which could introduce bugs 2023-02-06 11:36:49 +01:00
Tamo
fcb09ccc3d
add tests on the geoBoundingBox 2023-02-02 18:19:56 +01:00
Louis Dureuil
ae8660e585
Add Token::original_span rather than making Token::span pub 2023-02-02 15:03:34 +01:00
Guillaume Mourier
b297b5deb0
cargo fmt 2023-02-02 12:34:49 +01:00
Guillaume Mourier
0d71c80ba6
add tests 2023-02-02 12:31:27 +01:00
Guillaume Mourier
65a3086cf1
fix test 2023-02-02 12:27:58 +01:00
Guillaume Mourier
426d63b01b
Update insta test suite 2023-02-02 12:27:56 +01:00
Guillaume Mourier
b078477d80
Add error handling and earth lap collision with bounding box 2023-02-02 12:17:38 +01:00
Loïc Lecrenier
a2690ea8d4 Reduce incremental indexing time of words_prefix_position_docids DB
This database can easily contain millions of entries. Thus, iterating
over it can be very expensive.

For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words`
will always be empty. Thus, we can save a significant amount of time
by adding this `if !del_prefix_fst_words.is_empty()` condition.

The code's behaviour remains completely unchanged.
2023-01-31 11:42:24 +01:00
Louis Dureuil
20f05efb3c
clippy: needless_lifetimes 2023-01-31 11:12:59 +01:00
Louis Dureuil
cbf029f64c
clippy: --fix 2023-01-31 11:12:59 +01:00
Louis Dureuil
3296cf7ae6
clippy: remove needless lifetimes 2023-01-31 09:32:40 +01:00
Louis Dureuil
89675e5f15
clippy: Replace seek 0 by rewind 2023-01-31 09:32:40 +01:00
Tamo
de3c4f1986 throw an error on unknown fields specified in the _geo field 2023-01-24 12:23:24 +01:00
bors[bot]
3521a3a0b2
Merge #763
763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner

# Pull Request

## Related issue
Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007)

## What does this PR do?
- Changes function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string and not parseable to f64
- implemented some unittests
- Derived PartialEq for GeoError to use assert_eq! in tests

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
2023-01-19 15:15:46 +00:00
Philipp Ahlner
f5ca421227
Superfluous test removed 2023-01-19 15:39:21 +01:00
Louis Dureuil
4fd6fd9bef
Indicate filterable attributes when the user set a non filterable attribute in facet distributions 2023-01-19 12:25:18 +01:00
Philipp Ahlner
a2cd7214f0
Fixes error message when lat/lng are unparseable 2023-01-19 10:10:26 +01:00
ManyTheFish
d1fc42b53a Use compatibility decomposition normalizer in facets 2023-01-18 15:02:13 +01:00
Philipp Ahlner
497187083b
Add test for bug #3007: Wrong error message
Adds a test for #3007: Wrong error message when lat and lng are
unparseable
2023-01-18 13:24:26 +01:00
Clément Renault
1d507c84b2
Fix the formatting 2023-01-17 18:25:55 +01:00
Clément Renault
1b78231e18
Make clippy happy 2023-01-17 18:25:54 +01:00
bors[bot]
63af1e9f28
Merge #764
764: Update deserr to latest version r=irevoire a=loiclec

Update deserr to 0.1.5, which changes the `DeserializeFromValue` trait, getting rid of the `default()` method.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-17 10:39:36 +00:00
Loïc Lecrenier
f073a86387 Update deserr to latest version 2023-01-17 11:28:19 +01:00
bors[bot]
302d6cccd7
Merge #761
761: Integrate deserr r=irevoire a=loiclec

1. `Setting<T>` now implements `DeserializeFromValue`
2. The settings now store ranking rules as strongly typed `Criterion` instead of `String`, since the validation of the ranking rules will be done on meilisearch's side from now on


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-11 14:35:15 +00:00
bors[bot]
21b7d709ad
Merge #759
759: Change primary key inference error messages r=Kerollmops a=dureuill

# Pull Request

## Related issue
Milli part of https://github.com/meilisearch/meilisearch/issues/3301

## What does this PR do?
- Change error message strings

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-01-11 14:04:25 +00:00
Loïc Lecrenier
02fd06ea0b Integrate deserr 2023-01-11 13:56:47 +01:00
Louis Dureuil
00746b32c0
Add Index::map_size 2023-01-10 11:16:51 +01:00
Louis Dureuil
be9786bed9
Change primary key inference error messages 2023-01-05 10:40:09 +01:00
bors[bot]
c3f4835e8e
Merge #733
733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec

# Pull Request

## Related issue
Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118

## What does this PR do?
When a query ends with a word and a prefix, such as:
```
word pr
```
Then we first determine whether `pre` *could possibly* be in the proximity prefix database before querying it. There are then three possibilities:

1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases.

2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. **In this case, we partially disable the proximity ranking rule for the pair `word pre`.** This is done as follows:
   1. Only find the documents where `word` is in proximity to `pre` **exactly** (no derivations)
   2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8

3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
2. For common prefixes of more than two letters: we no longer distinguish between any proximities
3. For uncommon prefixes: nothing changes

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):
```json
[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]
```

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":
```json
[
    { "text": "I heard there is a faster but less relevant proximity criterion" }
    { "text": "I heard there is a faster proximity criterion" },
]
```
But the following document would be considered more relevant than the two documents above:
```json
{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }
```

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. 

---

## Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.   

```
1. 10x 'a': 
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42        = bad, ~2x regression
	- dynamic: 1s ⟹ 870 ms           = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12 ms       = very good, x20 speedup
	- dynamic: 21ms ⟹ 11 ms          = good, x2 speedup 
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, x300 speedup 
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~2x speedup
	- dynamic: ~4ms ⟹ ~2ms.          = no significant difference
6. 'i am getting o' 
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1:
	- set-based: 3.36s ⟹ 120ms       = very good, 30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, 6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, 30x speedup 
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup
```

Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-04 09:00:50 +00:00
bors[bot]
49f58b2c47
Merge #732
732: Interpret synonyms as phrases r=loiclec a=loiclec

# Pull Request

## Related issue
Fixes (when merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3125

## What does this PR do?
We now map multi-word synonyms to phrases instead of loose words. Such that the request:
```
btw I am going to nyc soon
```
is interpreted as (when the synonym interpretation is chosen for both `btw` and `nyc`):
```
"by the way" I am going to "New York City" soon
```
instead of:
```
by the way I am going to New York City soon
```

This prevents queries containing multi-word synonyms to exceed to word length limit and degrade the search performance.

In terms of relevancy, there is a debate to have. I personally think this could be considered an improvement, since it would be strange for a user to search for:
```
good DIY project
```
and have a result such as:
```
{
    "text": "whether it is a good project to do, you'll have to decide for yourself"
}
```
However, for synonyms such as `NYC -> New York City`, then we will stop matching documents where `New York` is separated from `City`. This is however solvable by adding an additional mapping: `NYC -> New York`.

## Performance

With the old behaviour, some long search requests making heavy uses of synonyms could take minutes to be executed. This is no longer the case, these search requests now take an average amount of time to be resolved.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-04 08:34:18 +00:00
bors[bot]
6a10e85707
Merge #736
736: Update charabia r=curquiza a=ManyTheFish

Update Charabia to the last version.

> We are now Romanizing Chinese characters into Pinyin.
> Note that we keep the accent because they are in fact never typed directly by the end-user, moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context.

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-01-03 15:44:41 +00:00
bors[bot]
9519e60f97
Merge #709
709: Optimise the `ExactWords` sub-criterion within `Exactness` r=loiclec a=loiclec

# Pull Request

## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3116

## What does this PR do?
1. Reduces the algorithmic complexity of finding the documents containing N exact words from something that is exponential to something that is polynomial.
2. Cache intermediary results between different calls to the `exactness` criterion.

## Performance Results
On the `smol_songs.csv` dataset, a request containing 10 common words now takes about 60ms instead of 5 seconds to execute. For example, this is the case with this (admittedly nonsensical) request: `Rock You Hip Hop Folk World Country Electronic Love The`.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-02 12:28:30 +00:00
Loïc Lecrenier
b5df889dcb Apply review suggestions: simplify implementation of exactness criterion 2023-01-02 13:11:47 +01:00