1151 Commits

Author SHA1 Message Date
Tamo
7a38fe624f
throw an error if the top left corner is found below the bottom right corner 2023-02-06 17:50:47 +01:00
Tamo
1b005f697d
update the syntax of the geoboundingbox filter to uses brackets instead of parens around lat and lng 2023-02-06 16:50:27 +01:00
Kerollmops
fbec48f56e
Merge remote-tracking branch 'milli/main' into bring-v1-changes 2023-02-06 16:48:10 +01:00
Tamo
3ebc99473f
Apply suggestions from code review
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-02-06 13:29:37 +01:00
Tamo
d27007005e
comments the geoboundingbox + forbid the usage of the lexeme method which could introduce bugs 2023-02-06 11:36:49 +01:00
Tamo
fcb09ccc3d
add tests on the geoBoundingBox 2023-02-02 18:19:56 +01:00
Louis Dureuil
ae8660e585
Add Token::original_span rather than making Token::span pub 2023-02-02 15:03:34 +01:00
Guillaume Mourier
b297b5deb0
cargo fmt 2023-02-02 12:34:49 +01:00
Guillaume Mourier
0d71c80ba6
add tests 2023-02-02 12:31:27 +01:00
Guillaume Mourier
65a3086cf1
fix test 2023-02-02 12:27:58 +01:00
Guillaume Mourier
426d63b01b
Update insta test suite 2023-02-02 12:27:56 +01:00
Guillaume Mourier
b078477d80
Add error handling and earth lap collision with bounding box 2023-02-02 12:17:38 +01:00
ManyTheFish
0bc1a18f52 Use Languages list detected during indexing at search time 2023-02-01 18:57:43 +01:00
ManyTheFish
643d99e0f9 Add expectancy test 2023-02-01 18:39:54 +01:00
ManyTheFish
064158e4e2 Update test 2023-02-01 15:34:01 +01:00
ManyTheFish
77d32d0ee8 Fix codec deserialization 2023-02-01 15:26:26 +01:00
Loïc Lecrenier
a2690ea8d4 Reduce incremental indexing time of words_prefix_position_docids DB
This database can easily contain millions of entries. Thus, iterating
over it can be very expensive.

For regular `documentAdditionOrUpdate` tasks, `del_prefix_fst_words`
will always be empty. Thus, we can save a significant amount of time
by adding this `if !del_prefix_fst_words.is_empty()` condition.

The code's behaviour remains completely unchanged.
2023-01-31 11:42:24 +01:00
f3r10
2922c5c899 Fix code format 2023-01-31 11:28:05 +01:00
f3r10
7681be5367 Format code 2023-01-31 11:28:05 +01:00
f3r10
50bc156257 Fix tests 2023-01-31 11:28:05 +01:00
f3r10
d8207356f4 Skip script,language insertion if language is undetected 2023-01-31 11:28:05 +01:00
f3r10
2d58b28f43 Improve script language codec 2023-01-31 11:28:05 +01:00
f3r10
fd60a39f1c Format code 2023-01-31 11:28:05 +01:00
f3r10
369c05732e Add test checking if from script_language_docids database were removed
deleted docids
2023-01-31 11:28:05 +01:00
f3r10
34d04f3d3f Filter from script_language_docids database soft deleted documents 2023-01-31 11:28:05 +01:00
f3r10
a27f329e3a Add tests for checking that detected script and language associated with document(s) were stored during indexing 2023-01-31 11:28:05 +01:00
f3r10
b216ddba63 Delete and clear data from the new database 2023-01-31 11:28:05 +01:00
f3r10
d97fb6117e Extract and index data 2023-01-31 11:28:05 +01:00
f3r10
c45d1e3610 Create a new database on index and add a specialized codec for it 2023-01-31 11:28:05 +01:00
Louis Dureuil
20f05efb3c
clippy: needless_lifetimes 2023-01-31 11:12:59 +01:00
Louis Dureuil
cbf029f64c
clippy: --fix 2023-01-31 11:12:59 +01:00
Louis Dureuil
3296cf7ae6
clippy: remove needless lifetimes 2023-01-31 09:32:40 +01:00
Louis Dureuil
89675e5f15
clippy: Replace seek 0 by rewind 2023-01-31 09:32:40 +01:00
Tamo
de3c4f1986 throw an error on unknown fields specified in the _geo field 2023-01-24 12:23:24 +01:00
bors[bot]
3521a3a0b2
Merge #763
763: Fixes error message when lat and lng are unparseable r=loiclec a=ahlner

# Pull Request

## Related issue
Fixes partially [#3007](https://github.com/meilisearch/meilisearch/issues/3007)

## What does this PR do?
- Changes function validate_geo_from_json to return a BadLatitudeAndLongitude if lat or lng is a string and not parseable to f64
- implemented some unittests
- Derived PartialEq for GeoError to use assert_eq! in tests

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Philipp Ahlner <philipp@ahlner.com>
2023-01-19 15:15:46 +00:00
Philipp Ahlner
f5ca421227
Superfluous test removed 2023-01-19 15:39:21 +01:00
Louis Dureuil
4fd6fd9bef
Indicate filterable attributes when the user set a non filterable attribute in facet distributions 2023-01-19 12:25:18 +01:00
Philipp Ahlner
a2cd7214f0
Fixes error message when lat/lng are unparseable 2023-01-19 10:10:26 +01:00
ManyTheFish
d1fc42b53a Use compatibility decomposition normalizer in facets 2023-01-18 15:02:13 +01:00
Philipp Ahlner
497187083b
Add test for bug #3007: Wrong error message
Adds a test for #3007: Wrong error message when lat and lng are
unparseable
2023-01-18 13:24:26 +01:00
Clément Renault
1d507c84b2
Fix the formatting 2023-01-17 18:25:55 +01:00
Clément Renault
1b78231e18
Make clippy happy 2023-01-17 18:25:54 +01:00
bors[bot]
63af1e9f28
Merge #764
764: Update deserr to latest version r=irevoire a=loiclec

Update deserr to 0.1.5, which changes the `DeserializeFromValue` trait, getting rid of the `default()` method.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-17 10:39:36 +00:00
Loïc Lecrenier
f073a86387 Update deserr to latest version 2023-01-17 11:28:19 +01:00
bors[bot]
302d6cccd7
Merge #761
761: Integrate deserr r=irevoire a=loiclec

1. `Setting<T>` now implements `DeserializeFromValue`
2. The settings now store ranking rules as strongly typed `Criterion` instead of `String`, since the validation of the ranking rules will be done on meilisearch's side from now on


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-11 14:35:15 +00:00
bors[bot]
21b7d709ad
Merge #759
759: Change primary key inference error messages r=Kerollmops a=dureuill

# Pull Request

## Related issue
Milli part of https://github.com/meilisearch/meilisearch/issues/3301

## What does this PR do?
- Change error message strings

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-01-11 14:04:25 +00:00
Loïc Lecrenier
02fd06ea0b Integrate deserr 2023-01-11 13:56:47 +01:00
Louis Dureuil
00746b32c0
Add Index::map_size 2023-01-10 11:16:51 +01:00
Louis Dureuil
be9786bed9
Change primary key inference error messages 2023-01-05 10:40:09 +01:00
bors[bot]
c3f4835e8e
Merge #733
733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec

# Pull Request

## Related issue
Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118

## What does this PR do?
When a query ends with a word and a prefix, such as:
```
word pr
```
Then we first determine whether `pre` *could possibly* be in the proximity prefix database before querying it. There are then three possibilities:

1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases.

2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. **In this case, we partially disable the proximity ranking rule for the pair `word pre`.** This is done as follows:
   1. Only find the documents where `word` is in proximity to `pre` **exactly** (no derivations)
   2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8

3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
2. For common prefixes of more than two letters: we no longer distinguish between any proximities
3. For uncommon prefixes: nothing changes

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):
```json
[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]
```

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":
```json
[
    { "text": "I heard there is a faster but less relevant proximity criterion" }
    { "text": "I heard there is a faster proximity criterion" },
]
```
But the following document would be considered more relevant than the two documents above:
```json
{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }
```

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. 

---

## Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.   

```
1. 10x 'a': 
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42        = bad, ~2x regression
	- dynamic: 1s ⟹ 870 ms           = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12 ms       = very good, x20 speedup
	- dynamic: 21ms ⟹ 11 ms          = good, x2 speedup 
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, x300 speedup 
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~2x speedup
	- dynamic: ~4ms ⟹ ~2ms.          = no significant difference
6. 'i am getting o' 
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1:
	- set-based: 3.36s ⟹ 120ms       = very good, 30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, 6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, 30x speedup 
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup
```

Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-01-04 09:00:50 +00:00