45 Commits

Author SHA1 Message Date
Irevoire
4aae07d5f5
expose the size methods 2022-08-17 17:07:38 +02:00
bors[bot]
087da5621a
Merge #587
587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec

# Pull Request

## What does this PR do?
Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test.

## Why is it faster?
Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 

1. reusing allocations
2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word
3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. 

## What bug was fixed?
When reindexing, the `new_prefix_fst_words` prefixes may look like:
```
["ant",  "axo", "bor"]
```
which we group by first letter:
```
[["ant", "axo"], ["bor"]]
```

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`.

## StrStrU8Codec
I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

## Requests when reviewing this PR
I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. 

I also have a couple questions regarding the code itself:
- Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
- I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead?

## Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. 

```
group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec
```

Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-08-17 14:06:12 +00:00
Loïc Lecrenier
306593144d Refactor word prefix pair proximity indexation 2022-08-17 11:59:00 +02:00
Loïc Lecrenier
20be69e1b9 Always use mimalloc as the global allocator 2022-08-16 20:09:36 +02:00
Loïc Lecrenier
392472f4bb Apply suggestions from code review
Co-authored-by: Tamo <tamo@meilisearch.com>
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
453d593ce8 Add a database containing the docids where each field exists 2022-07-19 10:07:33 +02:00
ManyTheFish
447195a27a Replace format by to_string 2022-06-14 10:32:44 +02:00
ManyTheFish
0d1d354052 Ensure that Index methods are not bypassed by Meilisearch 2022-06-13 17:34:11 +02:00
ad hoc
ab185a59b5
fix infos 2022-04-05 09:46:56 +02:00
ad hoc
9963f11172
fix infos crate compilation issue 2022-04-04 20:54:03 +02:00
many
5ed75de0db
Update infos crate 2021-10-05 13:56:12 +02:00
many
5c962c03dd
Fix and optimize word_prefix_pair_proximity_docids database 2021-09-01 16:48:40 +02:00
Clément Renault
0227254a65
Return the original string values for the inverted facet index database 2021-07-21 16:59:39 +02:00
Kerollmops
0a78107525
Fix the infos crate to make it read u16 field ids 2021-07-06 11:58:03 +02:00
Tamo
9716fb3b36
format the whole project 2021-06-16 18:33:33 +02:00
Kerollmops
28c004aa2c
Prefer using constant for the database names 2021-06-15 11:13:04 +02:00
many
e857ca4d7d
Fix PR comments 2021-06-01 18:06:46 +02:00
many
b8e6db0feb
Add database in infos crate 2021-05-31 16:29:27 +02:00
Kerollmops
28bd9e183e
Fix the infos crate to support split facet databases 2021-05-25 11:31:06 +02:00
Many
0daa0e170a
Fix PR comments
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-04-27 14:39:53 +02:00
Kerollmops
7ff4a2a708
Display the number of entries in the infos crate 2021-04-27 14:39:52 +02:00
Kerollmops
1aad66bdaa
Compute stats about the word prefix level positions database in the infos crate 2021-04-27 14:39:52 +02:00
Kerollmops
e65bad16cc
Compute the words prefixes at the end of an update 2021-04-27 14:39:52 +02:00
Kerollmops
89ee2cf576
Introduce the TreeLevel struct 2021-04-27 14:25:35 +02:00
Kerollmops
8bd4f5d93e
Compute the biggest values of the words_level_positions_docids 2021-04-27 14:25:34 +02:00
Kerollmops
3069bf4f4a
Fix and improve the words-level-positions computation 2021-04-27 14:25:34 +02:00
Kerollmops
6b1b42b928
Introduce an infos wordsLevelPositionsDocids subcommand 2021-04-27 14:25:34 +02:00
Kerollmops
b0a417f342
Introduce the word_level_position_docids Index database 2021-04-27 14:25:34 +02:00
Kerollmops
51767725b2
Simplify integer and float functions trait bounds 2021-04-20 10:23:31 +02:00
Clément Renault
18844d60b5
Simplify the output of database sizes in the infos crate 2021-03-08 18:47:33 +01:00
Clément Renault
3d02b19fbd
Introduce the docids-words-positions subcommand to the infos crate 2021-03-08 18:47:33 +01:00
Kerollmops
bd63da0a0e
Add missing databases to the infos subcommand 2021-03-08 18:47:33 +01:00
Kerollmops
07784c8990
Tune the words prefixes threshold to compute for 1/1000 instead 2021-03-03 15:51:28 +01:00
Clément Renault
1eb7ce5cdb
Improve the export-documents infos command by accepting internal ids 2021-03-01 19:48:01 +01:00
Clément Renault
4884b324e6
Remove the useless external ids patch method in the infos crate 2021-03-01 19:48:01 +01:00
Clément Renault
78bede1ffb
Fix error displaying of the workspace members 2021-03-01 19:48:01 +01:00
Clément Renault
45330a5e47
Avoid creating a default empty database in the infos crate 2021-03-01 19:48:00 +01:00
Kerollmops
aa4d9882d2
Introduce the new words-prefixes-docids infos subcomand 2021-02-17 11:22:27 +01:00
Kerollmops
49aee6d02c
Fix the database-stats infos subcommand 2021-02-17 11:22:27 +01:00
Kerollmops
7a0f86a04f
Introduce an infos command to extract the words prefixes fst 2021-02-17 11:22:27 +01:00
Kerollmops
8788485924
Take the prefix databases into account in the infos subcommand 2021-02-17 11:22:26 +01:00
Kerollmops
9b03b0a1b2
Introduce the word prefix pair proximity docids database 2021-02-17 11:12:38 +01:00
Clément Renault
5e7b26791b
Take the words-prefixes into account while computing the biggest values 2021-02-17 10:45:17 +01:00
Clément Renault
b3a21d5a50
Introduce the getters and setters for the words prefixes FST 2021-02-17 10:45:17 +01:00
Clément Renault
fecf3d6fc1
Move the command lines helpers into different crates 2021-02-14 18:55:15 +01:00