Commit Graph

478 Commits

Author SHA1 Message Date
unvalley
abf1cf9cd5 Fix clippy errors 2022-11-04 09:27:46 +09:00
unvalley
70465aa5ce Execute cargo fmt 2022-11-04 08:59:58 +09:00
unvalley
3009981d31 Fix clippy errors
Add clippy job

Add clippy job to CI
2022-11-04 08:58:14 +09:00
unvalley
d55f0e2e53 Execute cargo fmt 2022-10-28 23:42:23 +09:00
unvalley
d53a80b408 Fix clippy error 2022-10-28 23:41:35 +09:00
unvalley
a1d7ed1258 fix clippy error and remove clippy job from ci
Remove clippy job

Fix clippy error type_complexity

Restore ambiguous change
2022-10-28 22:33:50 +09:00
unvalley
f4ec1abb9b Fix all clippy error after conflicts 2022-10-27 23:58:13 +09:00
unvalley
c7322f704c Fix cargo clippy errors
Dont apply clippy for tests for now

Fix clippy warnings of filter-parser package

parent 8352febd646ec4bcf56a44161e5c4dce0e55111f
author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900
committer unvalley <kirohi.code@gmail.com> 1666791316 +0900

Update .github/workflows/rust.yml

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>

Allow clippy lint too_many_argments

Allow clippy lint needless_collect

Allow clippy lint too_many_arguments and type_complexity

Fix for clippy warnings comparison_chains

Fix for clippy warnings vec_init_then_push

Allow clippy lint should_implement_trait

Allow clippy lint drop_non_drop

Fix lifetime clipy warnings in filter-paprser

Execute cargo fmt

Fix clippy remaining warnings

Fix clippy remaining warnings again and allow lint on each place
2022-10-27 01:04:23 +09:00
unvalley
811f156031 Execute cargo clippy --fix 2022-10-27 01:00:00 +09:00
Loïc Lecrenier
54c0cf93fe Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 15:13:34 +02:00
bors[bot]
365f44c39b
Merge #668
668: Fix many Clippy errors part 2 r=ManyTheFish a=ehiggs

This brings us a step closer to enforcing clippy on each build.

# Pull Request

## Related issue
This does not fix any issue outright, but it is a second round of fixes for clippy after https://github.com/meilisearch/milli/pull/665. This should contribute to fixing https://github.com/meilisearch/milli/pull/659.

## What does this PR do?

Satisfies many issues for clippy. The complaints are mostly:

* Passing reference where a variable is already a reference.
* Using clone where a struct already implements `Copy`
* Using `ok_or_else` when it is a closure that returns a value instead of using the closure to call function (hence we use `ok_or`)
* Unambiguous lifetimes don't need names, so we can just use `'_`
* Using `return` when it is not needed as we are on the last expression of a function.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Ewan Higgs <ewan.higgs@gmail.com>
2022-10-26 12:16:24 +00:00
Loïc Lecrenier
2741756248 Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 14:03:23 +02:00
Loïc Lecrenier
b7f2428961 Fix formatting and warning after rebasing from main 2022-10-26 13:49:33 +02:00
Loïc Lecrenier
14ca8048a8 Add some documentation on how to run the facet db fuzzer 2022-10-26 13:48:01 +02:00
Loïc Lecrenier
f198b20c42 Add facet deletion tests that use both the incremental and bulk methods
+ update deletion snapshots to the new database format
2022-10-26 13:47:46 +02:00
Loïc Lecrenier
e3ba1fc883 Make deletion tests for both soft-deletion and hard-deletion 2022-10-26 13:47:46 +02:00
Loïc Lecrenier
ab5e56fd16 Add document deletion snapshot tests and tests for hard-deletion 2022-10-26 13:47:46 +02:00
Loïc Lecrenier
d885de1600 Add option to avoid soft deletion of documents 2022-10-26 13:47:46 +02:00
Loïc Lecrenier
2295e0e3ce Use real delete function in facet indexing fuzz tests
By deleting multiple docids at once instead of one-by-one
2022-10-26 13:47:46 +02:00
Loïc Lecrenier
acc8caebe6 Add link to GitHub PR to document of update/facet module 2022-10-26 13:47:46 +02:00
Loïc Lecrenier
a034a1e628 Move StrRefCodec and ByteSliceRefCodec to their own files 2022-10-26 13:47:46 +02:00
Loïc Lecrenier
1165ba2171 Make facet deletion incremental 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
51961e1064 Polish some details 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
cb8442a119 Further unify facet databases of f64s and strings 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
3baa34d842 Fix compiler errors/warnings 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
86d9f50b9c Fix bugs in incremental facet indexing with variable parameters
e.g. add one facet value incrementally with a group_size = X and then
add another one with group_size = Y

It is not actually possible to do so with the public API of milli,
but I wanted to make sure the algorithm worked well in those cases
anyway.

The bugs were found by fuzzing the code with fuzzcheck, which I've added
to milli as a conditional dev-dependency. But it can be removed later.
2022-10-26 13:47:04 +02:00
Loïc Lecrenier
985a94adfc cargo fmt 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
b1ab09196c Remove outdated TODOs 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
27454e9828 Document and refine facet indexing algorithms 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
bee3c23b45 Add comparison benchmark between bulk and incremental facet indexing 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
b2f01ad204 Refactor facet database tests 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
9026867d17 Give same interface to bulk and incremental facet indexing types
+ cargo fmt, oops, sorry for the bad history :(
2022-10-26 13:47:04 +02:00
Loïc Lecrenier
330c9eb1b2 Rename facet codecs and refine FacetsUpdate API 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
485a72306d Refactor facet-related codecs 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
9b55e582cd Add FacetsUpdate type that wraps incremental and bulk indexing methods 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
3d145d7f48 Merge the two <facetttype>_faceted_documents_ids methods into one 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
079ed4a992 Add more snapshots 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
afdf87f6f7 Fix bugs in asc/desc criterion and facet indexing 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
a7201ece04 cargo fmt 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
36296bbb20 Add facet incremental indexing snapshot tests + fix bug 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
07ff92c663 Add more snapshots from facet tests 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
61252248fb Fix some facet indexing bugs 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
68cbcdf08b Fix compile errors/warnings in http-ui and infos 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
85824ee203 Try to make facet indexing incremental 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
d30c89e345 Fix compile error+warnings in new tests 2022-10-26 13:46:46 +02:00
Loïc Lecrenier
e8a156d682 Reorganise facets database indexing code 2022-10-26 13:46:46 +02:00
Loïc Lecrenier
bd2c0e1ab6 Remove unused code 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
39a4a0a362 Reintroduce filter range search and facet extractors 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
22d80eeaf9 Reintroduce facet deletion functionality 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
6cc91824c1 Remove unused heed codec files 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
63ef0aba18 Start porting facet distribution and sort to new database structure 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
7913d6365c Update Facets indexing to be compatible with new database structure 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
c3f49f766d Prepare refactor of facets database
Prepare refactor of facets database
2022-10-26 13:46:14 +02:00
bors[bot]
c8f16530d5
Merge #616
616: Introduce an indexation abortion function when indexing documents r=Kerollmops a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-10-26 11:41:18 +00:00
Ewan Higgs
2ce025a906 Fixes after rebase to fix new issues. 2022-10-25 20:58:31 +02:00
Ewan Higgs
17f7922bfc Remove unneeded lifetimes. 2022-10-25 20:49:04 +02:00
Ewan Higgs
6b2fe94192 Fixes for clippy bringing us down to 18 remaining issues.
This brings us a step closer to enforcing clippy on each build.
2022-10-25 20:49:02 +02:00
Loïc Lecrenier
9a569d73d1 Minor code style change 2022-10-24 15:30:43 +02:00
Loïc Lecrenier
d76d0cb1bf Merge branch 'main' into word-pair-proximity-docids-refactor 2022-10-24 15:23:00 +02:00
Loïc Lecrenier
a983129613 Apply suggestions from code review 2022-10-20 09:49:37 +02:00
Loïc Lecrenier
ab2f6f3aa4 Refine some details in word_prefix_pair_proximity indexing code 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
178d00f93a Cargo fmt 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
072b576514 Fix proximity value in keys of prefix_word_pair_proximity_docids 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
6c3a5d69e1 Update snapshots 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
a7de4f5b85 Don't add swapped word pairs to the word_pair_proximity_docids db 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
264a04922d Add prefix_word_pair_proximity database
Similar to the word_prefix_pair_proximity one but instead the keys are:
(proximity, prefix, word2)
2022-10-18 10:37:34 +02:00
Loïc Lecrenier
1dbbd8694f Rename StrStrU8Codec to U8StrStrCodec and reorder its fields 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
bdeb47305e Change encoding of word_pair_proximity DB to (proximity, word1, word2)
Same for word_prefix_pair_proximity
2022-10-18 10:37:34 +02:00
Kerollmops
6603437cb1
Introduce an indexation abortion function when indexing documents 2022-10-17 17:28:03 +02:00
Ewan Higgs
beb987d3d1 Fixing piles of clippy errors.
Most of these are calling clone when the struct supports Copy.

Many are using & and &mut on `self` when the function they are called
from already has an immutable or mutable borrow so this isn't needed.

I tried to stay away from actual changes or places where I'd have to
name fresh variables.
2022-10-13 22:02:54 +02:00
msvaljek
762e320c35
Add proximity calculation for the same word 2022-10-07 12:59:12 +02:00
vishalsodani
00c02d00f3 Add missing logging timer to extractors 2022-09-30 22:17:06 +05:30
bors[bot]
15d478cf4d
Merge #635
635: Use an unstable algorithm for `grenad::Sorter` when possible r=Kerollmops a=loiclec

# Pull Request
## What does this PR do?

Use an unstable algorithm to sort the internal vector used by `grenad::Sorter` whenever possible to speed up indexing.

In practice, every time the merge function creates a `RoaringBitmap`, we use an unstable sort. For every other merge function, such as `keep_first`, `keep_last`, etc., a stable sort is used.


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-09-14 12:00:52 +00:00
Loïc Lecrenier
3794962330 Use an unstable algorithm for grenad::Sorter when possible 2022-09-13 14:49:53 +02:00
Kerollmops
d4d7c9d577
We avoid skipping errors in the indexing pipeline 2022-09-13 14:03:00 +02:00
Kerollmops
fe3973a51c
Make sure that long words are correctly skipped 2022-09-07 15:03:32 +02:00
Kerollmops
c83c3cd796
Add a test to make sure that long words are correctly skipped 2022-09-07 14:12:36 +02:00
ManyTheFish
5391e3842c replace optional_words by term_matching_strategy 2022-08-22 17:47:19 +02:00
ManyTheFish
9640976c79 Rename TermMatchingPolicies 2022-08-18 17:36:08 +02:00
Irevoire
4aae07d5f5
expose the size methods 2022-08-17 17:07:38 +02:00
Irevoire
e96b852107
bump heed 2022-08-17 17:05:50 +02:00
bors[bot]
087da5621a
Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00
bors[bot]
fb95e67a2a
Merge #608
608: Fix soft deleted documents r=ManyTheFish a=ManyTheFish

When we replaced or updated some documents, the indexing was skipping the replaced documents.

Related to https://github.com/meilisearch/meilisearch/issues/2672

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-17 13:38:10 +00:00
ManyTheFish
e9e2349ce6 Fix typo in comment 2022-08-17 15:09:48 +02:00
ManyTheFish
2668f841d1 Fix update indexing 2022-08-17 15:03:37 +02:00
ManyTheFish
7384650d85 Update test to showcase the bug 2022-08-17 15:03:08 +02:00
Loïc Lecrenier
6cc975704d Add some documentation to facets.rs 2022-08-17 12:59:52 +02:00
Loïc Lecrenier
93252769af Apply review suggestions 2022-08-17 12:41:22 +02:00
Loïc Lecrenier
39687908f1 Add documentation and comments to facets.rs 2022-08-17 12:26:49 +02:00
Loïc Lecrenier
8d4b21a005 Switch string facet levels indexation to new algo
Write the algorithm once for both numbers and strings
2022-08-17 12:26:49 +02:00
Loïc Lecrenier
cf0cd92ed4 Refactor Facets::execute to increase performance 2022-08-17 12:26:49 +02:00
Loïc Lecrenier
78d9f0622d cargo fmt 2022-08-17 12:21:24 +02:00
Loïc Lecrenier
4f9edf13d7 Remove commented-out function 2022-08-17 12:21:24 +02:00
Loïc Lecrenier
405555b401 Add some documentation to PrefixTrieNode 2022-08-17 12:21:24 +02:00
Loïc Lecrenier
1bc4788e59 Remove cached Allocations struct from wpppd indexing 2022-08-17 12:18:22 +02:00
Loïc Lecrenier
ef75a77464 Fix undefined behaviour caused by reusing key from the database
New full snapshot:
---
source: milli/src/update/word_prefix_pair_proximity_docids.rs
---
5                a    1  [101, ]
5                a    2  [101, ]
5                am   1  [101, ]
5                b    4  [101, ]
5                be   4  [101, ]
am               a    3  [101, ]
amazing          a    1  [100, ]
amazing          a    2  [100, ]
amazing          a    3  [100, ]
amazing          an   1  [100, ]
amazing          an   2  [100, ]
amazing          b    2  [100, ]
amazing          be   2  [100, ]
an               a    1  [100, ]
an               a    2  [100, 202, ]
an               am   1  [100, ]
an               an   2  [100, ]
an               b    3  [100, ]
an               be   3  [100, ]
and              a    2  [100, ]
and              a    3  [100, ]
and              a    4  [100, ]
and              am   2  [100, ]
and              an   3  [100, ]
and              b    1  [100, ]
and              be   1  [100, ]
at               a    1  [100, 202, ]
at               a    2  [100, 101, ]
at               a    3  [100, ]
at               am   2  [100, 101, ]
at               an   1  [100, 202, ]
at               an   3  [100, ]
at               b    3  [101, ]
at               b    4  [100, ]
at               be   3  [101, ]
at               be   4  [100, ]
beautiful        a    2  [100, ]
beautiful        a    3  [100, ]
beautiful        a    4  [100, ]
beautiful        am   3  [100, ]
beautiful        an   2  [100, ]
beautiful        an   4  [100, ]
bell             a    2  [101, ]
bell             a    4  [101, ]
bell             am   4  [101, ]
extraordinary    a    2  [202, ]
extraordinary    a    3  [202, ]
extraordinary    an   2  [202, ]
house            a    3  [100, 202, ]
house            a    4  [100, 202, ]
house            am   4  [100, ]
house            an   3  [100, 202, ]
house            b    2  [100, ]
house            be   2  [100, ]
rings            a    1  [101, ]
rings            a    3  [101, ]
rings            am   3  [101, ]
rings            b    2  [101, ]
rings            be   2  [101, ]
the              a    3  [101, ]
the              b    1  [101, ]
the              be   1  [101, ]
2022-08-17 12:17:45 +02:00
Loïc Lecrenier
7309111433 Don't run block code in doc tests of word_pair_proximity_docids 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
f6f8f543e1 Run cargo fmt 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
34c991ea02 Add newlines in documentation of word_prefix_pair_proximity_docids 2022-08-17 12:17:18 +02:00
Loïc Lecrenier
06f3fd8c6d Add more comments to WordPrefixPairProximityDocids::execute 2022-08-17 12:17:18 +02:00