Commit Graph

105 Commits

Author SHA1 Message Date
Kerollmops
66b8cfd8c8
Introduce a way to store the HNSW on multiple LMDB entries 2023-06-27 12:32:42 +02:00
Kerollmops
7aa1275337
Display the _semanticSimilarity even if the _vectors field is not displayed 2023-06-27 12:32:41 +02:00
Kerollmops
737aec1705
Expose an _semanticSimilarity as a dot product in the documents 2023-06-27 12:32:41 +02:00
Kerollmops
5c5a4e075d
Make clippy happy 2023-06-27 12:32:41 +02:00
Kerollmops
ab9f2269aa
Normalize the vectors during indexation and search 2023-06-27 12:32:41 +02:00
Kerollmops
23eaaf1001
Change the name of the distance module 2023-06-27 12:32:39 +02:00
Kerollmops
c79e82c62a
Move back to the hnsw crate
This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.
2023-06-27 12:32:39 +02:00
Kerollmops
268a9ef416
Move to the hgg crate 2023-06-27 12:32:38 +02:00
Clément Renault
4571e512d2
Store the vectors in an HNSW in LMDB 2023-06-27 12:32:38 +02:00
Louis Dureuil
c0fca6f884
Add score_details 2023-06-22 12:39:14 +02:00
Loïc Lecrenier
8628a0c856 Remove docid_word_positions_db + fix deletion bug
That would happen when a word was deleted from all exact attributes
but not all regular attributes.
2023-06-07 10:52:50 +02:00
Loïc Lecrenier
48f5bb1693 Implements the geo-sort ranking rule 2023-04-29 11:02:16 +02:00
Loïc Lecrenier
1f813a6f3b Simplify implementation of the detailed (=visual) logger 2023-04-12 16:32:53 +02:00
Loïc Lecrenier
e7bb8c940f Merge branch 'search-refactor-highlighter' into search-refactor-highlighter-merged 2023-04-11 12:22:34 +02:00
Loïc Lecrenier
a81165f0d8 Merge remote-tracking branch 'origin/main' into search-refactor 2023-04-07 10:15:55 +02:00
ManyTheFish
9c5f64769a Integrate the new Highlighter in the search 2023-04-06 13:58:56 +02:00
ManyTheFish
efea1e5837 Fix facet normalization 2023-03-29 12:02:24 +02:00
Louis Dureuil
9b83b1deb0
Expose SearchLogger trait 2023-03-27 17:49:18 +02:00
Loïc Lecrenier
862714a18b Remove criterion_implementation_strategy param of Search 2023-03-23 09:44:12 +01:00
Loïc Lecrenier
9b2653427d Split position DB into fid and relative position DB 2023-03-23 09:22:01 +01:00
Loïc Lecrenier
fbb1ba3de0 Cargo fmt 2023-03-20 09:41:56 +01:00
Loïc Lecrenier
8b4e07e1a3 WIP 2023-03-20 09:41:56 +01:00
Loïc Lecrenier
4e266211bf Small code reorganisation 2023-03-20 09:41:56 +01:00
Loïc Lecrenier
57fa689131 Cargo fmt 2023-03-20 09:41:56 +01:00
Loïc Lecrenier
c27ea2677f Rewrite cheapest path algorithm and empty path cache
It is now much simpler and has much better performance.
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
600e3dd1c5 Remove warnings 2023-03-20 09:41:56 +01:00
Loïc Lecrenier
6c659dc12f Use MiMalloc in milli tests 2023-03-20 09:41:37 +01:00
Loïc Lecrenier
229405aeb9 Choose implementation strategy of criterion at runtime 2022-12-21 09:29:39 +01:00
Gregory Conrad
50954d31fa feat: Re-export Span and Token to milli:: 2022-12-03 13:37:33 -05:00
bors[bot]
5e754b3ee0
Merge #708
708: Reduce memory usage of the MatchingWords structure r=ManyTheFish a=loiclec

# Pull Request

## Related issue
Fixes (partially) https://github.com/meilisearch/meilisearch/issues/3115 

## What does this PR do?
1. Reduces the memory usage caused by the creation of a 10-word query tree by 20x.
   This is done by deduplicating the `MatchingWord` values, which are heavy because of their inner DFA. The deduplication works by wrapping each `MatchingWord` in a reference-counted box and using a hash map to determine whether a `MatchingWord` DFA already exists for a certain signature, or whether a new one needs to be built.

2. Avoids the worst-case scenario of creating a `MatchingWord` for extremely long words that cannot be indexed by milli.
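
The deduplication described in this list could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the PR: the `MatchingWord` fields, the `MatchingWordCache` type, the `(word, typos, prefix)` signature, and the `max_len` parameter are all hypothetical stand-ins, and the real type holds a DFA rather than plain fields.

```rust
use std::collections::HashMap;
use std::rc::Rc;

/// Hypothetical stand-in for milli's `MatchingWord`.
/// The real value is expensive because it contains a DFA.
#[derive(Debug)]
pub struct MatchingWord {
    pub word: String,
    pub typos: u8,
    pub prefix: bool,
}

/// Hypothetical cache keyed by an assumed (word, typos, prefix) signature.
#[derive(Default)]
pub struct MatchingWordCache {
    map: HashMap<(String, u8, bool), Rc<MatchingWord>>,
    /// Number of values actually built, to make the sharing observable.
    pub built: usize,
}

impl MatchingWordCache {
    /// Returns a shared `MatchingWord` for the signature, building it only once.
    /// Words longer than `max_len` bytes are skipped entirely (item 2 above).
    pub fn get_or_build(
        &mut self,
        word: &str,
        typos: u8,
        prefix: bool,
        max_len: usize,
    ) -> Option<Rc<MatchingWord>> {
        if word.len() > max_len {
            return None;
        }
        let key = (word.to_string(), typos, prefix);
        if let Some(existing) = self.map.get(&key) {
            // Same signature: hand out another reference to the existing value.
            return Some(Rc::clone(existing));
        }
        // New signature: this is where the expensive DFA would be built.
        self.built += 1;
        let value = Rc::new(MatchingWord { word: word.to_string(), typos, prefix });
        self.map.insert(key, Rc::clone(&value));
        Some(value)
    }
}
```

With such a cache, requesting the same signature twice yields two `Rc` handles to one allocation, so a query tree that repeats a word pays for its DFA only once.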

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2022-11-30 17:47:34 +00:00
Loïc Lecrenier
8d0ace2d64 Avoid creating a MatchingWord for words that exceed the length limit 2022-11-28 10:20:13 +01:00
Gregory Conrad
935a724c57 revert: Revert pass by reference API change 2022-11-24 10:08:23 -05:00
Gregory Conrad
7c0e544839 feat: Add all_obkv_to_json function 2022-11-23 21:18:58 -05:00
unvalley
c7322f704c Fix cargo clippy errors
Don't apply clippy for tests for now

Fix clippy warnings of filter-parser package

parent 8352febd646ec4bcf56a44161e5c4dce0e55111f
author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900
committer unvalley <kirohi.code@gmail.com> 1666791316 +0900

Update .github/workflows/rust.yml

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>

Allow clippy lint too_many_arguments

Allow clippy lint needless_collect

Allow clippy lint too_many_arguments and type_complexity

Fix for clippy warnings comparison_chains

Fix for clippy warnings vec_init_then_push

Allow clippy lint should_implement_trait

Allow clippy lint drop_non_drop

Fix lifetime clippy warnings in filter-parser

Execute cargo fmt

Fix clippy remaining warnings

Fix clippy remaining warnings again and allow lint on each place
2022-10-27 01:04:23 +09:00
Loïc Lecrenier
54c0cf93fe Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 15:13:34 +02:00
Loïc Lecrenier
86d9f50b9c Fix bugs in incremental facet indexing with variable parameters
e.g. add one facet value incrementally with a group_size = X and then
add another one with group_size = Y

It is not actually possible to do so with the public API of milli,
but I wanted to make sure the algorithm worked well in those cases
anyway.

The bugs were found by fuzzing the code with fuzzcheck, which I've added
to milli as a conditional dev-dependency. But it can be removed later.
2022-10-26 13:47:04 +02:00
Ewan Higgs
9d27ac8a2e Ignore too many arguments to functions. 2022-10-25 21:22:53 +02:00
Ewan Higgs
42cdc38c7b Allow weird ranges like 1..=0 to pass clippy.
Everything else is just a warning and exit code will be 0.
2022-10-25 21:12:59 +02:00
Loïc Lecrenier
d76d0cb1bf Merge branch 'main' into word-pair-proximity-docids-refactor 2022-10-24 15:23:00 +02:00
Loïc Lecrenier
1dbbd8694f Rename StrStrU8Codec to U8StrStrCodec and reorder its fields 2022-10-18 10:37:34 +02:00
Ewan Higgs
beb987d3d1 Fixing piles of clippy errors.
Most of these are calling clone when the struct supports Copy.

Many are using & and &mut on `self` when the function they are called
from already has an immutable or mutable borrow so this isn't needed.

I tried to stay away from actual changes or places where I'd have to
name fresh variables.
2022-10-13 22:02:54 +02:00
ManyTheFish
9640976c79 Rename TermMatchingPolicies 2022-08-18 17:36:08 +02:00
Loïc Lecrenier
306593144d Refactor word prefix pair proximity indexation 2022-08-17 11:59:00 +02:00
Loïc Lecrenier
334098a7e0 Add index snapshot test helper function 2022-08-10 15:53:46 +02:00
Loïc Lecrenier
07003704a8 Merge branch 'filter/field-exist' 2022-07-21 14:51:41 +02:00
Loïc Lecrenier
30bd4db0fc Simplify indexing task for facet_exists_docids database 2022-07-19 10:07:33 +02:00
Kerollmops
fcfc4caf8c
Move the Object type in the lib.rs file and use it everywhere 2022-07-12 14:55:51 +02:00
Kerollmops
69931e50d2
Add the max_values_by_facet setting to the database 2022-06-08 17:54:56 +02:00
ManyTheFish
86ac8568e6 Use Charabia in milli 2022-06-02 16:59:11 +02:00
bors[bot]
ea4bb9402f
Merge #483
483: Enhance matching words r=Kerollmops a=ManyTheFish

# Summary

Enhance milli's word matcher so that it handles match computing and cropping.

# Implementation

## Computing best matches for cropping

Previously, we assumed that the first match in the attribute was the best one. This was accurate when only one word was searched, but it missed the target when more than one word was searched.

Now we search for the best interval of matches to crop around. The chosen interval is the one:
1) that has the highest count of unique matches
> for example, if we have a query `split the world`, then the interval `the split the split the` has 5 matches but only 2 unique matches (1 for `split` and 1 for `the`), whereas the interval `split of the world` has 3 matches and 3 unique matches. So the interval `split of the world` is considered better.
2) that has the minimum distance between matches
> for example, if we have a query `split the world`, then the interval `split of the world` has a distance of 3 (2 between `split` and `the`, and 1 between `the` and `world`), whereas the interval `split the world` has a distance of 2. So the interval `split the world` is considered better.
3) that has the highest count of ordered matches
> for example, if we have a query `split the world`, then the interval `the world split` has 2 ordered words, whereas the interval `split the world` has 3. So the interval `split the world` is considered better.
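
The three criteria above can be sketched as a lexicographic score. This is a minimal illustration, not the PR's actual implementation: the `Match` struct, `interval_score`, and `best_interval` are hypothetical names, and the way ordered matches are counted (start at 1, add 1 for each match whose query index directly follows the previous one) is an assumption chosen because it reproduces the numbers in the examples.

```rust
use std::cmp::Reverse;
use std::collections::HashSet;

/// Hypothetical match record: which query word matched, and at which
/// word position in the attribute.
#[derive(Clone, Copy, Debug)]
pub struct Match {
    pub query_index: usize,
    pub position: usize,
}

/// (unique matches, reversed distance, ordered matches).
/// Tuples compare lexicographically, so a greater score is a better interval.
pub type Score = (usize, Reverse<usize>, usize);

/// Scores an interval whose matches are assumed to be sorted by position.
pub fn interval_score(matches: &[Match]) -> Score {
    // Criterion 1: number of distinct query words matched.
    let unique = matches
        .iter()
        .map(|m| m.query_index)
        .collect::<HashSet<_>>()
        .len();

    let mut distance = 0;
    let mut ordered = if matches.is_empty() { 0 } else { 1 };
    for pair in matches.windows(2) {
        // Criterion 2: summed gap between consecutive matches (smaller is better).
        distance += pair[1].position.abs_diff(pair[0].position);
        // Criterion 3 (assumed counting rule): the next match continues the query order.
        if pair[1].query_index == pair[0].query_index + 1 {
            ordered += 1;
        }
    }
    (unique, Reverse(distance), ordered)
}

/// Picks the candidate interval with the greatest score, if any.
pub fn best_interval<'a>(candidates: &[&'a [Match]]) -> Option<&'a [Match]> {
    candidates.iter().copied().max_by_key(|c| interval_score(c))
}
```

Wrapping the distance in `Reverse` lets a single tuple comparison express "more unique matches, then shorter distance, then more ordered matches" without a hand-written comparator.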

## Cropping around the best matches interval

Previously, we cropped around the interval without checking the context.

Now we crop around words that are in the same context as the matching words.
This means that we prefer to keep words that are farther from the matching words but in the same phrase over words that are nearer but separated by a period.

> For instance, for the matching word `Split` the text:
`Natalie risk her future. Split The World is a book written by Emily Henry. I never read it.`
will be cropped like:
`…. Split The World is a book written by Emily Henry. …`
and not like:
`Natalie risk her future. Split The World is a book …`


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-04-19 11:42:32 +00:00