Commit Graph

822 Commits

Author SHA1 Message Date
Kerollmops
742543091e
Constify the default primary key name 2022-07-12 14:55:52 +02:00
Kerollmops
5f1bfb73ee
Extract the primary key name and make it accessible 2022-07-12 14:55:52 +02:00
Kerollmops
6a0a0ae94f
Make the Transform read from an EnrichedDocumentsBatchReader 2022-07-12 14:55:52 +02:00
Kerollmops
dc3f092d07
Do not leak an internal grenad Error 2022-07-12 14:55:52 +02:00
Kerollmops
8ebf5eed0d
Make the nested primary key work 2022-07-12 14:55:52 +02:00
Kerollmops
19eb3b4708
Make sur that we do not accept floats as documents ids 2022-07-12 14:55:52 +02:00
Kerollmops
2ceeb51c37
Support the auto-generated ids when validating documents 2022-07-12 14:55:51 +02:00
Kerollmops
399eec5c01
Fix the indexation tests 2022-07-12 14:55:51 +02:00
Kerollmops
fcfc4caf8c
Move the Object type in the lib.rs file and use it everywhere 2022-07-12 14:55:51 +02:00
Kerollmops
0146175fe6
Introduce the validate_documents_batch function 2022-07-12 14:55:51 +02:00
Kerollmops
bdc4263883
Introduce the validate_documents_batch function 2022-07-12 14:55:51 +02:00
Kerollmops
e8297ad27e
Fix the tests for the new DocumentsBatchBuilder/Reader 2022-07-12 14:52:56 +02:00
Kerollmops
419ce3966c
Rework the DocumentsBatchBuilder/Reader to use grenad 2022-07-12 14:52:55 +02:00
Kerollmops
048e174efb
Do not allocate when parsing CSV headers 2022-07-12 14:52:55 +02:00
ManyTheFish
5d79617a56 Chores: Enhance smart-crop code comments 2022-07-07 16:28:09 +02:00
bors[bot]
ebddfdb9a3
Merge #578
578: Bump uuid to 1.1.2 r=ManyTheFish a=Kerollmops

Just to [align the version with Meilisearch](https://github.com/meilisearch/meilisearch/pull/2584).

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-07-05 14:56:08 +00:00
Kerollmops
1bfdcfc84f
Bump uuid to 1.1.2 2022-07-05 16:23:36 +02:00
Tamo
250be9fe6c
put the threshold back to 10k 2022-07-05 15:57:44 +02:00
Tamo
b61efd09fc
Makes the internal soft deleted error a UserError 2022-07-05 15:34:45 +02:00
Tamo
eaf28b0628
Apply review suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-07-05 15:30:33 +02:00
Tamo
3b309f654a
Fasten the document deletion
When a document deletion occurs, instead of deleting the document we mark it as deleted
in the new “soft deleted” bitmap. It is then removed from the search, and all the other
endpoints.
2022-07-05 15:30:33 +02:00
Dmytro Gordon
3ff03a3f5f Fix not equal filter when field contains both number and strings 2022-06-27 15:55:17 +03:00
Kerollmops
238692a8e7
Introduce the copy_to_path method on the Index 2022-06-22 16:49:47 +02:00
bors[bot]
290a40b7a5
Merge #564
564: Rename the limitedTo parameter into maxTotalHits r=curquiza a=Kerollmops

This PR is related to https://github.com/meilisearch/meilisearch/issues/2542, it renames the `limitedTo` parameter into `maxTotalHits`.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-06-22 13:48:33 +00:00
Kerollmops
d7c248042b
Rename the limitedTo parameter into maxTotalHits 2022-06-22 12:00:48 +02:00
Kerollmops
d2f84a9d9e
Improve the estimatedNbHits when distinct is enabled 2022-06-22 11:39:21 +02:00
ManyTheFish
a0ab90a4d7 Avoid having an ending separator before crop marker 2022-06-16 18:23:57 +02:00
ManyTheFish
177154828c Extends deletion tests 2022-06-13 17:34:16 +02:00
ManyTheFish
0d1d354052 Ensure that Index methods are not bypassed by Meilisearch 2022-06-13 17:34:11 +02:00
bors[bot]
f1d848bb9a
Merge #552
552: Fix escaped quotes in filter r=Kerollmops a=irevoire

Will fix https://github.com/meilisearch/meilisearch/issues/2380

The issue was that in the evaluation of the filter, I was using the deref implementation instead of calling the `value` method of my token.

To avoid the problem happening again, I removed the deref implementation; now, you need to either call the `lexeme` or the `value` methods but can't rely on a « default » implementation to get a string out of a token.

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-06-09 14:56:44 +00:00
Tamo
90afde435b
fix escaped quotes in filter 2022-06-09 16:03:49 +02:00
Kerollmops
445d5474cc
Add the pagination_limited_to setting to the database 2022-06-08 18:14:27 +02:00
Kerollmops
69931e50d2
Add the max_values_by_facet setting to the database 2022-06-08 17:54:56 +02:00
Kerollmops
52a494bd3b
Add the new pagination.limited_to and faceting.max_values_per_facet settings 2022-06-08 17:15:36 +02:00
Kerollmops
2a505503b3
Change the number of facet values returned by default to 100 2022-06-08 15:58:57 +02:00
Kerollmops
bae4007447
Remove the hard limit on the number of facet values returned 2022-06-08 15:58:57 +02:00
Tamo
d0aaa7ff00
Fix wrong internal ids assignments 2022-06-07 15:49:33 +02:00
ad hoc
31776fdc3f
add failing test 2022-06-07 15:49:33 +02:00
ManyTheFish
d212dc6b8b Remove useless newline 2022-06-02 18:22:56 +02:00
ManyTheFish
7aabe42ae0 Refactor matching words 2022-06-02 17:59:04 +02:00
ManyTheFish
86ac8568e6 Use Charabia in milli 2022-06-02 16:59:11 +02:00
bors[bot]
74d1914a64
Merge #535
535: Reintroduce the max values by facet limit r=ManyTheFish a=Kerollmops

This PR reintroduces the max values by facet limit this is related to https://github.com/meilisearch/meilisearch/issues/2349.

~I would like some help in deciding on whether I keep the default 100 max values in milli and set up the `FacetDistribution` settings in Meilisearch to use 1000 as the new value, I expose the `max_values_by_facet` for this purpose.~

I changed the default value to 1000 and the max to 10000, thank you `@ManyTheFish` for the help!

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-06-01 14:30:50 +00:00
bors[bot]
582930dbbb
Merge #538
538: speedup exact words r=Kerollmops a=MarinPostma

This PR make `exact_words` return an `Option` instead of an empty set, since set creation is costly, as noticed by `@kerollmops.`

I was not convinces that this was the cause for all of the performance drop we measured, and then realized that methods that initialized it were called recursively which caused initialization times to add up. While the first fix solves the issue when not using exact words, using exact word remained way more expensive that it should be. To address this issue, the exact words are cached into the `Context`, so they are only initialized once.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-05-30 08:20:34 +00:00
ad hoc
25fc576696
review changes 2022-05-24 14:15:33 +02:00
ad hoc
69dc4de80f
change &Option<Set> to Option<&Set> 2022-05-24 12:14:55 +02:00
ad hoc
ac975cc747
cache context's exact words 2022-05-24 09:43:17 +02:00
ad hoc
8993fec8a3
return optional exact words 2022-05-24 09:15:49 +02:00
Matthias Wright
754f48a4fb Improves ranking rules error message 2022-05-20 21:25:43 +02:00
Kerollmops
cd7c6e19ed
Reintroduce the max values by facet limit 2022-05-18 15:57:57 +02:00
ManyTheFish
137434a1c8 Add some implementation on MatchBounds 2022-05-17 15:57:09 +02:00
bors[bot]
08c6d50cd1
Merge #531
531: fix the mixed dataset geosearch indexing bug r=Kerollmops a=irevoire

port #529 to main

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-16 16:06:36 +00:00
bors[bot]
cf3e574cb4
Merge #530
530: fix the searchable fields bug when a field is nested r=Kerollmops a=irevoire

port #528 to main

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-16 15:52:30 +00:00
Tamo
0af399a6d7
fix the mixed dataset geosearch indexing bug 2022-05-16 17:37:45 +02:00
Tamo
f586028f9a
fix the searchable fields bug when a field is nested
Update milli/src/index.rs

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-05-16 17:24:36 +02:00
bors[bot]
e1e85267fd
Merge #526
526: remove useless comment r=irevoire a=MarinPostma



Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-05-16 10:01:43 +00:00
bors[bot]
51809eb260
Merge #525
525: Simplify the error creation with thiserror r=irevoire a=irevoire

I introduced [`thiserror`](https://docs.rs/thiserror/latest/thiserror/) to implements all the `Display` trait and most of the `impl From<xxx> for yyy` in way less lines.
And then I introduced a cute macro to implements the `impl<X, Y, Z> From<X> for Z where Y: From<X>, Z: From<X>` more easily.

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-04 15:47:32 +00:00
Tamo
484a9ddb27
Simplify the error creation with thiserror and a smol friendly macro 2022-05-04 17:24:00 +02:00
bors[bot]
65e6aa0de2
Merge #523
523: Improve geosearch error messages r=irevoire a=irevoire

Improve the geosearch error messages (#488).
And try to parse the string as specified in https://github.com/meilisearch/meilisearch/issues/2354

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-04 13:36:11 +00:00
Tamo
c55368ddd4
apply code suggestion
Co-authored-by: Kerollmops <kero@meilisearch.com>
2022-05-04 14:11:03 +02:00
ad hoc
5ad5d56f7e
remove useless comment 2022-05-04 10:43:54 +02:00
bors[bot]
0c2c8af44e
Merge #520
520: fix mistake in Settings initialization r=irevoire a=MarinPostma

fix settings not being correctly initialized and add a test to make sure that they are in the future.

fix https://github.com/meilisearch/meilisearch/issues/2358


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-05-03 15:32:18 +00:00
Kerollmops
211c8763b9
Make sure that we do not generate too long keys 2022-05-03 10:03:15 +02:00
Kerollmops
7e47031bdc
Add a test for long keys in LMDB 2022-05-03 10:03:13 +02:00
Tamo
3cb1f6d0a1
improve geosearch error messages 2022-05-02 19:20:47 +02:00
ad hoc
1ee3d6ae33
fix mistake in Settings initialization 2022-04-29 16:24:25 +02:00
bors[bot]
9db86aac51
Merge #518
518: Return facets even when there is no value associated to it r=Kerollmops a=Kerollmops

This PR is related to https://github.com/meilisearch/meilisearch/issues/2352 and should fix the issue when Meilisearch is up-to-date with this PR.

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-04-28 09:04:36 +00:00
Kerollmops
7d1c2d97bf
Return facets even when there is no values associated to it 2022-04-26 17:59:53 +02:00
bors[bot]
d388ea0f9d
Merge #506
506: fix cargo warnings r=Kerollmops a=MarinPostma

fix cargo warnings


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-04-26 15:45:20 +00:00
ad hoc
5c29258e8e
fix cargo warnings 2022-04-26 17:33:11 +02:00
Tamo
f19d2dc548
Only flatten the required fields
apply review comments

Co-authored-by: Kerollmops <kero@meilisearch.com>
2022-04-26 12:33:46 +02:00
bors[bot]
8010eca9c7
Merge #505
505: normalize exact words r=curquiza a=MarinPostma

Normalize the exact words, as specified in the specification.


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-04-25 09:35:32 +00:00
ad hoc
2e0089d5ff
normalize exact words 2022-04-21 15:38:40 +02:00
ad hoc
3a2451fcba
add test normalize exact words 2022-04-21 13:52:09 +02:00
Clément Renault
eb5830aa40
Add a test to make sure that long words are handled 2022-04-21 13:45:28 +02:00
ad hoc
8b14090927
fix min-word-len-for-typo not reset properly 2022-04-19 15:20:16 +02:00
bors[bot]
ea4bb9402f
Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance milli word-matcher making it handle match computing and cropping.

# Implementation

## Computing best matches for cropping

Before we were considering that the first match of the attribute was the best one, this was accurate when only one word was searched but was missing the target when more than one word was searched.

Now we are searching for the best matches interval to crop around, the chosen interval is the one:
1) that have the highest count of unique matches
> for example, if we have a query `split the world`, then the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`) where the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) that have the minimum distance between matches
> for example, if we have a query `split the world`, then the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`) where the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) that have the highest count of ordered matches
> for example, if we have a query `split the world`, then the interval `the world split` has 2 ordered words where the interval `split the world` has 3. So the interval `split the world` is considered better.

## Cropping around the best matches interval

Before we were cropping around the interval without checking the context.

Now we are cropping around words in the same context as matching words.
This means that we will keep words that are farther from the matching words but are in the same phrase, than words that are nearer but separated by a dot.

> For instance, for the matching word `Split` the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and  not like:
`Natalie risk her future. Split The World is a book …`


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00
ManyTheFish
f1115e274f Use Copy impl of FormatOption instead of clonning 2022-04-19 10:35:50 +02:00
Tamo
00f78d6b5a
Apply code suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-04-14 11:14:08 +02:00
Tamo
399fba16bb
only flatten an object if it's nested 2022-04-14 11:14:08 +02:00
Tamo
ee64f4a936
Use smartstring to store the external id in our hashmap
We need to store all the external id (primary key) in a hashmap
associated to their internal id during.
The smartstring remove heap allocation / memory usage and should
improve the cache locality.
2022-04-13 21:22:07 +02:00
ad hoc
dda28d7415
exclude excluded canditates from search result candidates 2022-04-13 12:10:35 +02:00
ad hoc
bbb6728d2f
add distinct attributes to cli 2022-04-13 12:10:35 +02:00
ManyTheFish
5809d3ae0d Add first benchmarks on formatting 2022-04-12 16:31:58 +02:00
ManyTheFish
827cedcd15 Add format option structure 2022-04-12 13:42:14 +02:00
ManyTheFish
011f8210ed Make compute_matches more rust idiomatic 2022-04-12 10:19:02 +02:00
ManyTheFish
a16de5de84 Symplify format and remove intermediate function 2022-04-08 11:20:41 +02:00
ManyTheFish
a769e09dfa Make token_crop_bounds more rust idiomatic 2022-04-07 20:15:14 +02:00
ManyTheFish
c8ed1675a7 Add some documentation 2022-04-07 17:32:13 +02:00
ManyTheFish
b1905dfa24 Make split_best_frequency returns references instead of owned data 2022-04-07 17:05:44 +02:00
Irevoire
4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
ad hoc
b799f3326b
rename merge_nothing to merge_ignore_values 2022-04-05 18:44:35 +02:00
ManyTheFish
fa7d3a37c0 Make some cleaning and add comments 2022-04-05 17:48:56 +02:00
ManyTheFish
3bb1e35ada Fix match count 2022-04-05 17:48:45 +02:00
ManyTheFish
56e0edd621 Put crop markers direclty around words 2022-04-05 17:41:32 +02:00
ManyTheFish
a93cd8c61c Fix prefix highlight with special chars 2022-04-05 17:41:32 +02:00
ManyTheFish
b3f0f39106 Make some cleaning 2022-04-05 17:41:32 +02:00
ManyTheFish
6dc345bc53 Test and Fix prefix highlight 2022-04-05 17:41:32 +02:00
ManyTheFish
bd30ee97b8 Keep separators at start of the croped string 2022-04-05 17:41:32 +02:00
ManyTheFish
29c5f76d7f Use new matcher in http-ui 2022-04-05 17:41:32 +02:00
ManyTheFish
734d0899d3 Publish Matcher 2022-04-05 17:41:32 +02:00
ManyTheFish
4428cb5909 Add some tests and fix some corner cases 2022-04-05 17:41:32 +02:00
ManyTheFish
844f546a8b Add matches algorithm V1 2022-04-05 17:41:32 +02:00
ManyTheFish
3be1790803 Add crop algorithm with naive match algorithm 2022-04-05 17:41:32 +02:00
ManyTheFish
d96e72e5dc Create formater with some tests 2022-04-05 17:41:32 +02:00
ad hoc
201fea0fda
limit extract_word_docids memory usage 2022-04-05 14:14:15 +02:00
ad hoc
5cfd3d8407
add exact attributes documentation 2022-04-05 14:10:22 +02:00
ad hoc
b85cd4983e
remove field_id_from_position 2022-04-05 09:50:34 +02:00
ad hoc
ab185a59b5
fix infos 2022-04-05 09:46:56 +02:00
ad hoc
1810927dbd
rephrase exact_attributes doc 2022-04-04 21:04:49 +02:00
ad hoc
b7694c34f5
remove println 2022-04-04 21:00:07 +02:00
ad hoc
6cabd47c32
fix typo in comment 2022-04-04 20:59:20 +02:00
ad hoc
6b2c2509b2
fix bug in exact search 2022-04-04 20:54:03 +02:00
ad hoc
56b4f5dce2
add exact prefix to query_docids 2022-04-04 20:54:03 +02:00
ad hoc
21ae4143b1
add exact_word_prefix to Context 2022-04-04 20:54:03 +02:00
ad hoc
e8f06f6c06
extract exact_word_prefix_docids 2022-04-04 20:54:03 +02:00
ad hoc
6dd2e4ffbd
introduce exact_word_prefix database in index 2022-04-04 20:54:03 +02:00
ad hoc
ba0bb29cd8
refactor WordPrefixDocids to take dbs instead of indexes 2022-04-04 20:54:02 +02:00
ad hoc
c4c6e35352
query exact_word_docids in resolve_query_tree 2022-04-04 20:54:02 +02:00
ad hoc
8d46a5b0b5
extract exact word docids 2022-04-04 20:54:02 +02:00
ad hoc
0a77be4ec0
introduce exact_word_docids db 2022-04-04 20:54:02 +02:00
ad hoc
5f9f82757d
refactor spawn_extraction_task 2022-04-04 20:54:02 +02:00
ad hoc
f82d4b36eb
introduce exact attribute setting 2022-04-04 20:54:02 +02:00
ad hoc
c882d8daf0
add test for exact words 2022-04-04 20:54:01 +02:00
ad hoc
7e9d56a9e7
disable typos on exact words 2022-04-04 20:54:01 +02:00
ad hoc
30a2711bac
rename serde module to serde_impl module
needed because of issues with rustfmt
2022-04-04 20:10:55 +02:00
ad hoc
0fd55db21c
fmt 2022-04-04 20:10:55 +02:00
ad hoc
559e46be5e
fix bad rebase bug 2022-04-04 20:10:55 +02:00
ad hoc
8b1e5d9c6d
add test for exact words 2022-04-04 20:10:55 +02:00
ad hoc
774fa8f065
disable typos on exact words 2022-04-04 20:10:55 +02:00
ad hoc
9bbffb8fee
add exact words setting 2022-04-04 20:10:54 +02:00
ad hoc
853b4a520f
fmt 2022-04-04 10:41:46 +02:00
ad hoc
1941072bb2
implement Copy on Setting 2022-04-04 10:41:46 +02:00
ad hoc
fdaf45aab2
replace hardcoded value with constant in TestContext 2022-04-04 10:41:46 +02:00
ad hoc
950a740bd4
refactor typos for readability 2022-04-04 10:41:46 +02:00
ad hoc
66020cd923
rename min_word_len* to use plain letter numbers 2022-04-04 10:41:46 +02:00
ad hoc
4c4b336ecb
rename min word len for typo error 2022-04-01 11:17:03 +02:00
ad hoc
286dd7b2e4
rename min_word_len_2_typo 2022-04-01 11:17:03 +02:00
ad hoc
55af85db3c
add tests for min_word_len_for_typo 2022-04-01 11:17:02 +02:00
ad hoc
9102de5500
fix error message 2022-04-01 11:17:02 +02:00
ad hoc
a1a3a49bc9
dynamic minimum word len for typos in query tree builder 2022-04-01 11:17:02 +02:00
ad hoc
5a24e60572
introduce word len for typo setting 2022-04-01 11:17:02 +02:00
ad hoc
9fe40df960
add word derivations tests 2022-04-01 11:05:18 +02:00
ad hoc
d5ddc6b080
fix 2 typos word derivation bug 2022-04-01 10:51:22 +02:00
ad hoc
3e34981d9b
add test for authorize_typos in update 2022-03-31 14:12:00 +02:00
ad hoc
6ef3bb9d83
fmt 2022-03-31 14:06:23 +02:00
ad hoc
f782fe2062
add authorize_typo_test 2022-03-31 10:08:39 +02:00
ad hoc
c4653347fd
add authorize typo setting 2022-03-31 10:05:44 +02:00
bors[bot]
90276d9a2d
Merge #472
472: Remove useless variables in proximity r=Kerollmops a=ManyTheFish

Was passing by plane sweep algorithm to find some inspiration, and I discover that we have useless variables that were not detected because of the recursive function.

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-03-16 15:33:11 +00:00
ManyTheFish
49d59d88c2 Remove useless variables in proximity 2022-03-16 16:12:52 +01:00
Bruno Casali
adc71742c8 Move string concat to the struct instead of in the calling 2022-03-16 10:26:12 -03:00