Commit Graph

2071 Commits

Author SHA1 Message Date
Clémentine Urquizar
44192d754f
Add dependabot for GHA 2022-09-06 17:54:05 +02:00
bors[bot]
1fa851a8d0
Merge #622
622: Minor fixes in the just added update-version CI r=ManyTheFish a=curquiza

These fixes are minor, and do not prevent us to use the current CI

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-09-06 13:14:23 +00:00
Clémentine Urquizar
61abc61a69
Minor fixes in the just added update-version CI 2022-09-05 16:01:32 +02:00
bors[bot]
efee0e3f43
Merge #621
621: Add CI to update the Milli version r=ManyTheFish a=curquiza

Add a CI we can trigger manually to create a PR updating the Milli version
The next step is to create a Slack bot that will trigger this CI
In the meantime, we can trigger this CI manually in the [Actions tab](https://github.com/meilisearch/milli/actions)

The `MEILI_BOT_GH_PAT` secrets has been added to the organization level, and is accessible for the following repositories (so far): Meilisearch, Milli and Charabia

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-09-05 08:31:48 +00:00
Clémentine Urquizar
0639b14906
Add CI to update the Milli version 2022-09-04 11:49:50 +02:00
bors[bot]
f7c352a32d
Merge #620
620: Fix word criterion r=Kerollmops a=ManyTheFish

related to https://github.com/meilisearch/meilisearch/issues/2722

- fix the word strategy bug
- update milli version to v0.33.2

Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-09-01 10:14:35 +00:00
ManyTheFish
bf750e45a1 Fix word removal issue 2022-09-01 12:10:47 +02:00
ManyTheFish
a38608fe59 Add test mixing phrased and no-phrased words 2022-09-01 12:02:10 +02:00
ManyTheFish
97a04887a3 Update version for next release (v0.33.2) in Cargo.toml 2022-09-01 11:47:23 +02:00
bors[bot]
17d020e996
Merge #618
618: Update version for next release (v0.33.1) in Cargo.toml r=Kerollmops a=curquiza

No breaking for this release

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-08-31 10:43:45 +00:00
Clémentine Urquizar
c3363706c5
Update version for next release (v0.33.1) in Cargo.toml 2022-08-31 11:37:27 +02:00
bors[bot]
2c2f3d38cc
Merge #617
617: Accept integers as document ids again r=irevoire a=Kerollmops

This PR is related to https://github.com/meilisearch/meilisearch/issues/2723 and will fix when this PR will be merged, a new release deployed and used in Meilisearch itself.

This PR makes the indexer to try to parse the values of the fields identified as numbers i.e. `id:number` as integer first then as float if it fails.

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-08-31 09:25:17 +00:00
Clément Renault
7f92116b51
Accept again integers as document ids 2022-08-31 10:56:39 +02:00
bors[bot]
0b55e7ce6a
Merge #615
615: Remove the artifacts of the past r=Kerollmops a=irevoire



Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-08-23 14:22:43 +00:00
Irevoire
f6024b3269
Remove the artifacts of the past 2022-08-23 16:10:38 +02:00
bors[bot]
a79ff8a1a9
Merge #611
611: Upgrade charabia v0.6.0 r=curquiza a=ManyTheFish

# Pull Request

## What does this PR do?

- Update `log`
- Upgrade `charabia`

related to https://github.com/meilisearch/meilisearch/issues/2686


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-23 10:17:29 +00:00
bors[bot]
e314423653
Merge #613
613: Update version for next release (v0.33.0) r=Kerollmops a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-08-23 10:01:20 +00:00
bors[bot]
d0521e493f
Merge #612
612: Remove Bors required test for Windows r=Kerollmops a=curquiza

Remove the required windows test for merging due to the issue with Lindera
https://github.com/meilisearch/milli/runs/7970141278?check_suite_focus=true

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-08-23 09:47:51 +00:00
Clémentine Urquizar
9ed7324995
Update version for next release (v0.33.0) 2022-08-23 11:47:48 +02:00
Clémentine Urquizar
e140227065
Remove Bors required test for Windows 2022-08-23 11:45:29 +02:00
bors[bot]
18886dc6b7
Merge #598
598: Matching query terms policy r=Kerollmops a=ManyTheFish

## Summary

Implement several optional words strategy.

## Content

Replace `optional_words` boolean with an enum containing several term matching strategies:
```rust
pub enum TermsMatchingStrategy {
    // remove last word first
    Last,
    // remove first word first
    First,
    // remove more frequent word first
    Frequency,
    // remove smallest word first
    Size,
    // only one of the word is mandatory
    Any,
    // all words are mandatory
    All,
}
```

All strategies implemented during the prototype are kept, but only `Last` and `All` will be published by Meilisearch in the `v0.29.0` release.

## Related

spec: https://github.com/meilisearch/specifications/pull/173
prototype discussion: https://github.com/meilisearch/meilisearch/discussions/2639#discussioncomment-3447699


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-08-22 15:51:37 +00:00
ManyTheFish
5391e3842c replace optional_words by term_matching_strategy 2022-08-22 17:47:19 +02:00
ManyTheFish
f9029727e0 Fix benchmarks 2022-08-22 14:55:53 +02:00
ManyTheFish
a5b9a35c50 Activate char_map for highlighting 2022-08-22 14:39:16 +02:00
ManyTheFish
ba5ca8a362 Upgrade charabia v0.6.0 2022-08-22 14:38:00 +02:00
ManyTheFish
5943e1c3b2 Update log dependency 2022-08-22 13:55:01 +02:00
bors[bot]
b46225070f
Merge #610
610: Share heed between all sub-crates r=Kerollmops a=irevoire

# Pull Request

## What does this PR do?
Use the reexported version of heed in the benchmarks and the fuzzer

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-08-22 08:44:31 +00:00
Irevoire
e7624abe63
share heed between all sub-crates 2022-08-19 11:23:41 +02:00
ManyTheFish
993aa1321c Fix query tree building 2022-08-18 17:56:06 +02:00
ManyTheFish
bff9653050 Fix remove count 2022-08-18 17:36:30 +02:00
ManyTheFish
9640976c79 Rename TermMatchingPolicies 2022-08-18 17:36:08 +02:00
bors[bot]
60a7221827
Merge #609
609: Retry downloading the benchmarks datasets r=Kerollmops a=irevoire

Downloading the benchmarks datasets is failing [more and more](https://github.com/meilisearch/milli/pull/607#pullrequestreview-1076023074) often; thus, instead of fixing the issue, I thought we could retry multiple times.


Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-08-18 11:47:09 +00:00
bors[bot]
afc10acd19
Merge #596
596: Filter operators: NOT + IN[..] r=irevoire a=loiclec

# Pull Request

## What does this PR do?
Implements the changes described in https://github.com/meilisearch/meilisearch/issues/2580
It is based on top of #556 

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-18 11:24:32 +00:00
Loïc Lecrenier
c7a86b56ef Fix filter parser compilation error 2022-08-18 13:16:56 +02:00
Loïc Lecrenier
9b6602cba2 Avoid cloning FilterCondition in filter array parsing 2022-08-18 13:06:57 +02:00
Loïc Lecrenier
8a271223a9 Change a macro_rules to a function in filter parser 2022-08-18 13:03:55 +02:00
Loïc Lecrenier
dd34dbaca5 Add more filter parser tests 2022-08-18 11:55:01 +02:00
Loïc Lecrenier
5d74ebd5e5 Cargo fmt 2022-08-18 11:36:38 +02:00
Loïc Lecrenier
9af69c151b Limit the maximum depth of filters
This should have no impact on the user but is there to safeguard
meilisearch against malicious inputs.
2022-08-18 11:31:38 +02:00
Loïc Lecrenier
c51dcad51b Don't recompute filterable fields in evaluation of IN[] filter 2022-08-18 10:59:21 +02:00
Loïc Lecrenier
98f0da6b38 Simplify representation of nested NOT filters 2022-08-18 10:58:24 +02:00
Loïc Lecrenier
b030efdc83 Fix parsing of IN[] filter followed by whitespace + factorise its impl 2022-08-18 10:58:04 +02:00
Irevoire
84a784834e
retry downloading the benchmarks datasets 2022-08-17 19:25:05 +02:00
bors[bot]
79094bcbcf
Merge #607
607: Better threshold r=Kerollmops a=irevoire

# Pull Request

## What does this PR do?
Fixes #570 

This PR tries to improve the threshold used to trigger the real deletion of documents.
The deletion is now triggered in two cases;
- 10% of the total available space is used by soft deleted documents
- 90% of the total available space is used.

In this context, « total available space » means the `map_size` of lmdb.
And the size used by the soft deleted documents is actually an estimation. We can't determine precisely the size used by one document thus what we do is; take the total space used, divide it by the number of documents + soft deleted documents to estimate the size of one average document. Then multiply the size of one avg document by the number of soft deleted document.

--------

<img width="808" alt="image" src="https://user-images.githubusercontent.com/7032172/185083075-92cf379e-8ae1-4bfc-9ca6-93b54e6ab4e9.png">

Here we can see we have a ~10GB drift in the end between the space used by the soft deleted and the real space used by the documents.
Personally I don’t think that's a big issue because once the red line reach 90GB everything will be freed but now you know.

If you have an idea on how to improve this estimation I would love to hear it.
It look like the difference is linear so maybe we could simply multiply the current estimation by two?

Co-authored-by: Irevoire <tamo@meilisearch.com>
2022-08-17 16:31:04 +00:00
Loïc Lecrenier
497f9817a2 Use snapshot testing for the filter parser 2022-08-17 17:35:01 +02:00
Irevoire
4aae07d5f5
expose the size methods 2022-08-17 17:07:38 +02:00
Irevoire
e96b852107
bump heed 2022-08-17 17:05:50 +02:00
Loïc Lecrenier
238a7be58d Fix filter parser handling of keywords and surrounding spaces
Now the following fragments are allowed:

AND(field =

AND'field' =

AND"field" =
2022-08-17 16:53:40 +02:00
Loïc Lecrenier
b09a8f1b91 Filters: add explicit error message when using a keyword as value 2022-08-17 16:07:00 +02:00
bors[bot]
087da5621a
Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00