791 Commits

Author SHA1 Message Date
ad hoc 8b14090927
fix min-word-len-for-typo not reset properly 2022-04-19 15:20:16 +02:00
bors[bot] ea4bb9402f
Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance the milli word matcher so that it handles both match computation and cropping.

# Implementation

## Computing best matches for cropping

Previously, the first match in the attribute was assumed to be the best one. This was accurate when a single word was searched, but it missed the target when the query contained more than one word.

The matcher now searches for the best interval of matches to crop around. The chosen interval is the one that has:
1) the highest count of unique matches
> for example, with the query `split the world`, the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`), whereas the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) the minimum distance between matches
> for example, with the query `split the world`, the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`), whereas the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) the highest count of ordered matches
> for example, with the query `split the world`, the interval `the world split` has 2 ordered words, whereas the interval `split the world` has 3. So the interval `split the world` is considered better.
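
The ranking can be pictured as a tuple comparison. The snippet below is a minimal sketch, not milli's implementation: `Match`, `find_matches`, and `interval_score` are hypothetical names introduced for illustration, matching is a plain case-insensitive whole-word comparison, and it assumes the three criteria act as successive tie-breakers in the order listed. The ordered count is taken as one plus the number of adjacent pairs that follow query order, which reproduces the numbers in the examples above.

```rust
/// One matched word in the text (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
struct Match {
    /// Position of the matched word in the text, counted in words.
    word_position: usize,
    /// Index of the query word that produced the match.
    query_index: usize,
}

/// Naive matcher used only to build inputs for the sketch:
/// a text word matches when it equals a query word, ignoring ASCII case.
fn find_matches(text: &str, query: &str) -> Vec<Match> {
    let query_words: Vec<&str> = query.split_whitespace().collect();
    text.split_whitespace()
        .enumerate()
        .filter_map(|(word_position, word)| {
            query_words
                .iter()
                .position(|q| q.eq_ignore_ascii_case(word))
                .map(|query_index| Match { word_position, query_index })
        })
        .collect()
}

/// Score an interval as (unique matches, negated distance, ordered matches).
/// Tuples compare lexicographically, so a larger tuple is a better interval.
/// Assumes `matches` is sorted by `word_position`.
fn interval_score(matches: &[Match]) -> (usize, i64, usize) {
    // Criterion 1: number of distinct query words matched.
    let mut seen: Vec<usize> = matches.iter().map(|m| m.query_index).collect();
    seen.sort_unstable();
    seen.dedup();
    let unique = seen.len();

    // Criteria 2 and 3: walk adjacent pairs of matches.
    let mut distance: i64 = 0;
    let mut ordered = if matches.is_empty() { 0 } else { 1 };
    for pair in matches.windows(2) {
        distance += pair[1].word_position as i64 - pair[0].word_position as i64;
        if pair[1].query_index > pair[0].query_index {
            ordered += 1;
        }
    }

    // The distance is negated so that a smaller distance gives a larger score.
    (unique, -distance, ordered)
}
```

With the query `split the world`, this sketch scores `split of the world` as `(3, -3, 3)` and `split the world` as `(3, -2, 3)`, so the second interval wins on distance after tying on unique matches.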

## Cropping around the best matches interval

Previously, the crop was taken around the interval without checking the context.

The crop now favors words that are in the same context as the matching words.
This means that words that are farther from the matching words but in the same phrase are kept in preference to words that are nearer but separated by a dot.

> For instance, for the matching word `Split` the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and not like:
`Natalie risk her future. Split The World is a book …`
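
One way to picture the context-aware crop is a window that grows one word at a time and only crosses a sentence boundary when it has no other choice. The snippet below is a rough sketch under stated assumptions, not the code from this PR: `crop_bounds` and `crop` are hypothetical names, the text is split on whitespace, a context boundary is any word ending with `.`, and the crop size is counted in words. Marker placement is simplified, so the leading separator shown in the example above is not reproduced.

```rust
/// Grow a window of at most `crop_size` words around `match_index`,
/// preferring words in the same sentence as the match.
/// Returns inclusive (first, last) word indexes. Hypothetical helper.
fn crop_bounds(words: &[&str], match_index: usize, crop_size: usize) -> (usize, usize) {
    let (mut first, mut last) = (match_index, match_index);
    while last - first + 1 < crop_size {
        let can_left = first > 0;
        let can_right = last + 1 < words.len();
        // Growing left stays in context unless the previous word ends a sentence;
        // growing right stays in context unless the current last word does.
        let left_same = can_left && !words[first - 1].ends_with('.');
        let right_same = can_right && !words[last].ends_with('.');

        let grow_left = match (left_same, right_same) {
            // Both sides are in context: keep the window balanced around the match.
            (true, true) => match_index - first < last - match_index,
            (true, false) => true,
            (false, true) => false,
            // No in-context word left: cross a boundary only as a last resort.
            (false, false) if can_right => false,
            (false, false) if can_left => true,
            // The whole text is already inside the window.
            _ => break,
        };
        if grow_left {
            first -= 1;
        } else {
            last += 1;
        }
    }
    (first, last)
}

/// Crop `text` around the word at `match_index`, adding `…` markers
/// on each side where words were dropped.
fn crop(text: &str, match_index: usize, crop_size: usize) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    if match_index >= words.len() {
        return String::new();
    }
    let (first, last) = crop_bounds(&words, match_index, crop_size);
    let mut out = String::new();
    if first > 0 {
        out.push_str("… ");
    }
    out.push_str(&words[first..=last].join(" "));
    if last + 1 < words.len() {
        out.push_str(" …");
    }
    out
}
```

Calling `crop` on the example text with `match_index = 4` (the word `Split`) and `crop_size = 10` returns `… Split The World is a book written by Emily Henry. …`: the window extends to the end of the sentence on the right instead of reaching back across the dot to `future.`.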


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00
ManyTheFish f1115e274f Use Copy impl of FormatOption instead of cloning 2022-04-19 10:35:50 +02:00
Clémentine Urquizar 8d630a6f62
Update version for the next release (v0.26.1) 2022-04-14 11:44:06 +02:00
Tamo 00f78d6b5a
Apply code suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-04-14 11:14:08 +02:00
Tamo 399fba16bb
only flatten an object if it's nested 2022-04-14 11:14:08 +02:00
Tamo ee64f4a936
Use smartstring to store the external id in our hashmap
We need to store all the external ids (primary keys) in a hashmap
associated with their internal ids during.
The smartstring removes heap allocation / memory usage and should
improve the cache locality.
2022-04-13 21:22:07 +02:00
ad hoc dda28d7415
exclude excluded candidates from search result candidates 2022-04-13 12:10:35 +02:00
ad hoc cd83014fff
add test for distinct nb hits 2022-04-13 12:10:35 +02:00
ad hoc bbb6728d2f
add distinct attributes to cli 2022-04-13 12:10:35 +02:00
ManyTheFish 5809d3ae0d Add first benchmarks on formatting 2022-04-12 16:31:58 +02:00
ManyTheFish 827cedcd15 Add format option structure 2022-04-12 13:42:14 +02:00
ManyTheFish 011f8210ed Make compute_matches more rust idiomatic 2022-04-12 10:19:02 +02:00
ManyTheFish a16de5de84 Simplify format and remove intermediate function 2022-04-08 11:20:41 +02:00
ManyTheFish a769e09dfa Make token_crop_bounds more rust idiomatic 2022-04-07 20:15:14 +02:00
bors[bot] 9ac2fd1c37
Merge #487
487: Update version (v0.26.0) r=Kerollmops a=curquiza

breaking because of #458 

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-04-07 17:10:24 +00:00
Tamo bab898ce86
move the flatten-serde-json crate inside of milli 2022-04-07 18:20:44 +02:00
ManyTheFish c8ed1675a7 Add some documentation 2022-04-07 17:32:13 +02:00
ManyTheFish b1905dfa24 Make split_best_frequency return references instead of owned data 2022-04-07 17:05:44 +02:00
Tamo ab458d8840
fix tests after rebase 2022-04-07 17:00:00 +02:00
Irevoire 4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
Clémentine Urquizar ee1d627803
Update version (v0.26.0) 2022-04-07 15:56:10 +02:00
bors[bot] 4ae7aea3b2
Merge #486
486: Update version (v0.25.0) r=curquiza a=curquiza

v0.25.0 will be released once #478 is merged

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-04-06 11:40:41 +00:00
ad hoc b799f3326b
rename merge_nothing to merge_ignore_values 2022-04-05 18:44:35 +02:00
ManyTheFish fa7d3a37c0 Do some cleanup and add comments 2022-04-05 17:48:56 +02:00
ManyTheFish 3bb1e35ada Fix match count 2022-04-05 17:48:45 +02:00
ManyTheFish 56e0edd621 Put crop markers directly around words 2022-04-05 17:41:32 +02:00
ManyTheFish a93cd8c61c Fix prefix highlight with special chars 2022-04-05 17:41:32 +02:00
ManyTheFish b3f0f39106 Do some cleanup 2022-04-05 17:41:32 +02:00
ManyTheFish 6dc345bc53 Test and Fix prefix highlight 2022-04-05 17:41:32 +02:00
ManyTheFish bd30ee97b8 Keep separators at start of the cropped string 2022-04-05 17:41:32 +02:00
ManyTheFish 29c5f76d7f Use new matcher in http-ui 2022-04-05 17:41:32 +02:00
ManyTheFish 734d0899d3 Publish Matcher 2022-04-05 17:41:32 +02:00
ManyTheFish 4428cb5909 Add some tests and fix some corner cases 2022-04-05 17:41:32 +02:00
ManyTheFish 844f546a8b Add matches algorithm V1 2022-04-05 17:41:32 +02:00
ManyTheFish 3be1790803 Add crop algorithm with naive match algorithm 2022-04-05 17:41:32 +02:00
ManyTheFish d96e72e5dc Create formatter with some tests 2022-04-05 17:41:32 +02:00
ad hoc 201fea0fda
limit extract_word_docids memory usage 2022-04-05 14:14:15 +02:00
ad hoc 5cfd3d8407
add exact attributes documentation 2022-04-05 14:10:22 +02:00
Clémentine Urquizar 9eec44dd98
Update version (v0.25.0) 2022-04-05 12:06:42 +02:00
ad hoc b85cd4983e
remove field_id_from_position 2022-04-05 09:50:34 +02:00
ad hoc ab185a59b5
fix infos 2022-04-05 09:46:56 +02:00
ad hoc 59e41d98e3
add comments to integration test 2022-04-04 21:17:06 +02:00
ad hoc 1810927dbd
rephrase exact_attributes doc 2022-04-04 21:04:49 +02:00
ad hoc b7694c34f5
remove println 2022-04-04 21:00:07 +02:00
ad hoc 6cabd47c32
fix typo in comment 2022-04-04 20:59:20 +02:00
ad hoc c8d3a09af8
add integration test for disable typo on attributes 2022-04-04 20:54:03 +02:00
ad hoc 6b2c2509b2
fix bug in exact search 2022-04-04 20:54:03 +02:00
ad hoc 56b4f5dce2
add exact prefix to query_docids 2022-04-04 20:54:03 +02:00
ad hoc 21ae4143b1
add exact_word_prefix to Context 2022-04-04 20:54:03 +02:00