Go to file
bors[bot] 087da5621a
Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00
.github deny warnings in CI 2022-04-28 15:35:12 +02:00
benchmarks Sort Cargo.toml dependencies 2022-08-17 11:46:56 +02:00
cli Sort Cargo.toml dependencies 2022-08-17 11:46:56 +02:00
filter-parser Update filter-parser/fuzz/.gitignore 2022-07-21 16:12:01 +02:00
flatten-serde-json Merge branch 'filter/field-exist' 2022-07-21 14:51:41 +02:00
helpers Sort Cargo.toml dependencies 2022-08-17 11:46:56 +02:00
http-ui Sort Cargo.toml dependencies 2022-08-17 11:46:56 +02:00
infos Merge #587 2022-08-17 14:06:12 +00:00
json-depth-checker Update version for next release (v0.32.0) 2022-07-21 13:20:02 +04:00
milli Merge #587 2022-08-17 14:06:12 +00:00
script format the whole project 2021-06-16 18:33:33 +02:00
.gitignore Add snapshot tests for Facets::execute 2022-08-10 15:53:46 +02:00
.rustfmt.toml format the whole project 2021-06-16 18:33:33 +02:00
bors.toml Update bors toml 2022-04-26 17:36:04 +02:00
Cargo.toml Make binaries faster on release profile through better compile options 2022-08-17 10:29:33 +02:00
CONTRIBUTING.md Remove the wip section part of the contributing file 2022-05-04 14:44:51 +02:00
LICENSE Update LICENSE 2022-02-15 15:52:50 +01:00
README.md Update README.md 2022-04-25 18:14:43 +02:00

the milli logo

a concurrent indexer combined with fast and relevant search algorithms

Introduction

This repository contains the core engine used in Meilisearch.

It contains a library that can manage one and only one index. Meilisearch manages the multi-index itself. Milli is unable to store updates in a store: it is the job of something else above and this is why it is only able to process one update at a time.

This repository contains crates to quickly debug the engine:

  • There are benchmarks located in the benchmarks crate.
  • The cli crate is a simple command-line interface that helps run flamegraph on top of it.
  • The filter-parser crate contains the parser for the Meilisearch filter syntax.
  • The flatten-serde-json crate contains the library that flattens serde-json Value objects like Elasticsearch does.
  • The helpers crate is only used to do operations on the database.
  • The http-ui crate is a simple HTTP dashboard to test the features like for real!
  • The infos crate is used to dump the internal data-structure and ensure correctness.
  • The json-depth-checker crate is used to indicate if a JSON must be flattened.

How to use it?

Milli is a library that does search things, it must be embedded in a program. You can compute the documentation of it by using cargo doc --open.

Here is an example usage of the library where we insert documents into the engine and search for one of them right after.

let path = tempfile::tempdir().unwrap();
let mut options = EnvOpenOptions::new();
options.map_size(10 * 1024 * 1024); // 10 MB
let index = Index::new(options, &path).unwrap();

let mut wtxn = index.write_txn().unwrap();
let content = documents!([
    {
        "id": 2,
        "title": "Prideand Prejudice",
        "au{hor": "Jane Austin",
        "genre": "romance",
        "price$": "3.5$",
    },
    {
        "id": 456,
        "title": "Le Petit Prince",
        "au{hor": "Antoine de Saint-Exupéry",
        "genre": "adventure",
        "price$": "10.0$",
    },
    {
        "id": 1,
        "title": "Wonderland",
        "au{hor": "Lewis Carroll",
        "genre": "fantasy",
        "price$": "25.99$",
    },
    {
        "id": 4,
        "title": "Harry Potter ing fantasy\0lood Prince",
        "au{hor": "J. K. Rowling",
        "genre": "fantasy\0",
    },
]);

let config = IndexerConfig::default();
let indexing_config = IndexDocumentsConfig::default();
let mut builder =
    IndexDocuments::new(&mut wtxn, &index, &config, indexing_config.clone(), |_| ())
        .unwrap();
builder.add_documents(content).unwrap();
builder.execute().unwrap();
wtxn.commit().unwrap();


// You can search in the index now!
let mut rtxn = index.read_txn().unwrap();
let mut search = Search::new(&rtxn, &index);
search.query("horry");
search.limit(10);

let result = search.execute().unwrap();
assert_eq!(result.documents_ids.len(), 1);

Contributing

We're glad you're thinking about contributing to this repository! Feel free to pick an issue, and to ask any question you need. Some points might not be clear and we are available to help you!

Also, we recommend following the CONTRIBUTING.md to create your PR.