619: Refactor the Facets databases to enable incremental indexing r=curquiza a=loiclec

# Pull Request

## What does this PR do?
Partly fixes https://github.com/meilisearch/milli/issues/605 by making the indexing of the facet databases (i.e. `facet_id_f64_docids` and `facet_id_string_docids`) incremental. It also closes #327 and https://github.com/meilisearch/meilisearch/issues/2820. Two other, previously untracked bugs were fixed as well:
1. The facet distribution algorithm did not respect the `maxFacetValues` parameter when there were only a few candidate document ids.
2. The structure of the levels > 0 of the facet databases was not updated following the deletion of documents.

## How to review this PR

First, read this comment to get an overview of the changes.

Then, based on this comment, raise any concerns you might have about:
1. the new structure of the databases
2. the algorithms for sort, facet distribution, and range search
3. the new/removed heed codecs

Then, weigh in on the following concerns:
1. adding `fuzzcheck` as a fuzz-only dependency may add too much complexity for the benefits it provides
2. the `ByteSliceRefCodec` and `StrRefCodec` are misnamed or should not exist
3. the new behaviour of facet distributions can be considered incorrect
4. incremental deletion is useless given that documents are always deleted in bulk

## What's left for me to do

1. Re-read everything once to make sure I haven't forgotten anything
2. Wait for the results of the benchmarks and see whether (1) they provide enough information and (2) there was any change in performance, especially for search queries. Then, maybe, spend some time optimising the code.
3. Test whether the `info`/`http-ui` crates survived the refactor

## Old structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Previously, these two databases had different but conceptually similar structures. For each field id, the facet number database had the following format:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │            1.2 – 2            │           3.4 – 100           │   102 – 104   │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   1.2 – 1.3   │    1.6 – 2    │   3.4 – 12    │  12.3 – 100   │   102 – 104   │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  1.2  │  1.3  │  1.6  │   2   │  3.4  │   12  │  12.3 │  100  │  102  │  104  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where the first line is the key of the database, consisting of:
- the field id
- the level height
- the left and right bounds of the group

and the second line is the value of the database, consisting of:
- a bitmap of all the docids that have a facet value within the bounds
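
As an illustration of why this key shape works with LMDB's lexicographic key ordering, here is a minimal sketch of such a key layout. It is a simplification of the old `FacetLevelValueF64Codec` (which additionally appended the raw `f64` bytes so the bounds could be read back); `f64_into_ordered_bytes` is a stand-in for milli's `f64_into_bytes` helper, not the real function.

```rust
// Simplified sketch of the old facet-number key layout (not the exact codec).
fn old_facet_number_key(field_id: u16, level: u8, left: f64, right: f64) -> Vec<u8> {
    let mut key = Vec::new();
    key.extend_from_slice(&field_id.to_be_bytes()); // 2 bytes
    key.push(level); // 1 byte
    key.extend_from_slice(&f64_into_ordered_bytes(left)); // 8 bytes
    key.extend_from_slice(&f64_into_ordered_bytes(right)); // 8 bytes
    key
}

// An order-preserving encoding of f64: the byte order of the output matches
// the numeric order of the input (for non-NaN values).
fn f64_into_ordered_bytes(f: f64) -> [u8; 8] {
    let bits = f.to_bits();
    // negative floats: invert all bits; positive floats: set the sign bit
    let ordered = if bits >> 63 == 1 { !bits } else { bits | (1 << 63) };
    ordered.to_be_bytes()
}
```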

The `facet_id_string_docids` database had a similar structure:
```
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where, **at level 0**, the key is:
* the normalised facet value (string)

and the value is:
* the original facet value (string)
* a bitmap of all the docids that have this normalised string facet value

**At level 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* the left bound of the range as a normalised string
* the right bound of the range as a normalised string
* a bitmap of all the docids that have a string facet value within the bounds

**At level > 1**, the key is:
* the left bound of the range as an index in level 0
* the right bound of the range as an index in level 0

and the value is:
* a bitmap of all the docids that have a string facet value within the bounds
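
Summarised as illustrative Rust types, reconstructed from the old codecs shown later in the diff (these aliases are not in the codebase):

```rust
use std::num::NonZeroU8;

use roaring::RoaringBitmap;

type FieldId = u16;

// level 0: normalised string -> (original string, docids)
type Level0Key<'a> = (FieldId, /* level */ u8, /* normalised */ &'a str);
type Level0Value<'a> = (/* original */ &'a str, RoaringBitmap);

// levels > 0: (left, right) indices into level 0 -> docids;
// the string bounds are additionally stored in the value at level 1
// (`Some(..)` at level 1, `None` at levels > 1)
type GroupKey = (FieldId, NonZeroU8, /* left, right level-0 indices */ u32, u32);
type GroupValue<'a> = (Option<(/* left */ &'a str, /* right */ &'a str)>, RoaringBitmap);
```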

## New structure of the `facet_id_f64_docids` and `facet_id_string_docids` databases

Now both the `facet_id_f64_docids` and `facet_id_string_docids` databases have the exact same structure:
```                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
where for all levels, the key is a `FacetGroupKey<T>` containing:
* the field id (`u16`)
* the level height (`u8`)
* the left bound of the range (`T`)

and the value is a `FacetGroupValue` containing:
* the number of elements from the level below that are part of the range (`u8`, =0 for level 0)
* a bitmap of all the docids that have a facet value within the bounds (`RoaringBitmap`)

The right bound of the range is now implicit; it is equal to `Excluded(next_left_bound)`.
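
Concretely, the key and value types are defined like this (excerpted from the new `heed_codec::facet` module shown later in the diff):

```rust
/// The key in the `facet_id_string_docids` and `facet_id_f64_docids` databases.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct FacetGroupKey<T> {
    pub field_id: u16,
    pub level: u8,
    pub left_bound: T,
}

/// The value in those databases: `size` is the number of elements from the
/// level below covered by this group (0 at level 0).
#[derive(Debug)]
pub struct FacetGroupValue {
    pub size: u8,
    pub bitmap: RoaringBitmap,
}
```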

In the code, the key is always encoded using `FacetGroupKeyCodec<C>` where `C` is the codec used to encode the facet value (either `OrderedF64Codec` or `StrRefCodec`) and the value is encoded with `FacetGroupValueCodec`.

Since both databases share the same structure, we can implement almost all operations only once by treating the facet value as a byte slice (i.e. `FacetGroupKey<&[u8]>` encoded as `FacetGroupKeyCodec<ByteSliceRefCodec>`). This is, in my opinion, a big simplification.

The reason for changing the structure of the databases is to make it possible to incrementally add a facet value to an existing database. Since the `facet_id_string_docids` used to store indices to `level 0` in all levels > 0, adding an element to level 0 would potentially invalidate all the indices.

Note that the original string value of a facet is no longer stored in this database.

## Incrementally adding a facet value

Here I describe how we can add a facet value to the new database incrementally. If we want to add the document with id `z` and facet value `gap`, then we want to add/modify the elements highlighted below in pink:
<img width="946" alt="Screenshot 2022-09-12 at 10 14 54" src="https://user-images.githubusercontent.com/6040237/189605532-fe4b0f52-e13d-4b3c-92d9-10c705953e3d.png">

which results in:
<img width="662" alt="Screenshot 2022-09-12 at 10 23 29" src="https://user-images.githubusercontent.com/6040237/189607015-c3a37588-b825-43c2-878a-f8f85c000b94.png">

* one element was added in level 0
* one key/value was modified in level 1
* one value was modified in level 2

Adding this element was easy since we could simply add it to level 0 and then increase the `group_size` part of the value for the level above. However, in order to keep the structure balanced, we can't always do this. If the group size reaches a threshold (`max_group_size`), then we split the node into two. For example, let's imagine that `max_group_size` is `4` and we add the docid `y` with facet value `gas`. First, we add it in level 0:
<img width="904" alt="Screenshot 2022-09-12 at 10 30 40" src="https://user-images.githubusercontent.com/6040237/189608391-531f9df1-3424-4f1f-8344-73eb194570e5.png">
Then, we realise that the group size of its parent is going to reach the maximum group size (=4) and thus we split the parent into two nodes:
<img width="919" alt="Screenshot 2022-09-12 at 10 33 16" src="https://user-images.githubusercontent.com/6040237/189608884-66f87635-1fc6-41d2-a459-87c995491ac4.png">
and since we inserted an element in level 1, we also update level 2 accordingly, by increasing the group size of the parent:
<img width="915" alt="Screenshot 2022-09-12 at 10 34 42" src="https://user-images.githubusercontent.com/6040237/189609233-d4a893ff-254a-48a7-a5ad-c0dc337f23ca.png">

We also have two other parameters:
* `group_size` is the default group size when building the database from scratch
* `min_level_size` is the minimum number of elements that a level should contain

When the highest level size is greater than `group_size * min_level_size`, then we create an additional level above it.

There is one more edge case for the insertion algorithm. While we normally don't modify the existing left bounds of a key, we have to do it if the facet value being inserted is smaller than the first left bound. For example, inserting `"aa"` with the docid `w` would change the database to:
<img width="756" alt="Screenshot 2022-09-12 at 10 41 56" src="https://user-images.githubusercontent.com/6040237/189610637-a043ef71-7159-4bf1-b4fd-9903134fc095.png">

The root of the code for incremental indexing is the `FacetUpdateIncremental` builder.
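
To make this concrete, here is a simplified, in-memory model of the insertion logic, including the split and the left-bound edge case (strings only, docid bitmaps omitted). This is an illustration of the algorithm, not the actual `FacetUpdateIncremental` code, which operates on the LMDB databases.

```rust
// Each level is a sorted list of (left_bound, nb_children_in_level_below).
type Level = Vec<(String, usize)>;

fn insert(levels: &mut Vec<Level>, value: &str, max_group_size: usize) {
    // 1. Insert the value in level 0 (a level-0 entry has no children).
    let pos = levels[0].partition_point(|(k, _)| k.as_str() < value);
    levels[0].insert(pos, (value.to_owned(), 0));
    let mut level_below_grew = true;

    // 2. Walk up the levels, updating the group that covers `value`.
    for h in 1..levels.len() {
        let g = levels[h].partition_point(|(k, _)| k.as_str() <= value).saturating_sub(1);

        // Edge case: the new value is smaller than every existing left bound,
        // so the first group's left bound must be moved.
        if value < levels[h][g].0.as_str() {
            levels[h][g].0 = value.to_owned();
        }
        if !level_below_grew {
            continue; // only left bounds may still need fixing above
        }

        // The level below gained one element, so this group grew by one.
        levels[h][g].1 += 1;
        level_below_grew = false;

        if levels[h][g].1 > max_group_size {
            // Split the group in two; the second half starts at the left
            // bound of its first child in the level below.
            let first_child: usize = levels[h][..g].iter().map(|(_, size)| size).sum();
            let size = levels[h][g].1;
            let new_bound = levels[h - 1][first_child + size / 2].0.clone();
            levels[h][g].1 = size / 2;
            levels[h].insert(g + 1, (new_bound, size - size / 2));
            level_below_grew = true; // this level gained a group
        }
    }
    // If the top level now exceeds `group_size * min_level_size` elements,
    // a new level is created above it (not shown here).
}
```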

## Incrementally removing a facet value
TODO: the algorithm was implemented and works, but its current API is: `fn delete(self, facet_value, single_docid)`. It removes the given document id from all keys containing the given facet value. I don't think it is the right way to implement it anymore. Perhaps a bitmap of docids should be given instead. This is fairly easy to do. But since we batch document deletions together (because of soft deletion), it's not clear to me anymore that incremental deletion should be implemented at all.  

## Bulk insertion
While it's faster to incrementally add a single facet value to the database, it is sometimes **slower** to repeatedly add facet values one-by-one instead of doing it in bulk. For example, during initial indexing, we'd like to build the database from a list of facet values and associated document ids in one go. The `FacetUpdateBulk` builder provides a way to do so. It works by:
1. clearing all levels > 0 from the DB
2. adding all new elements in level 0
3. rebuilding the higher levels from scratch 

The algorithm used to build the higher levels is the same as the one previously used to build the databases from scratch; a simplified sketch of step 3 follows.
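
This sketch uses the same in-memory model as the insertion sketch above; the real `FacetUpdateBulk` streams the entries from LMDB.

```rust
// Rebuild all levels > 0 from a fully written, sorted level 0.
fn build_levels(
    level0: Vec<(String, usize)>,
    group_size: usize,
    min_level_size: usize,
) -> Vec<Vec<(String, usize)>> {
    let mut levels = vec![level0];
    // create a new level as long as the highest one is large enough
    while levels.last().unwrap().len() > group_size * min_level_size {
        let below = levels.last().unwrap();
        let above: Vec<(String, usize)> = below
            .chunks(group_size)
            // a group's left bound is the left bound of its first child
            .map(|group| (group[0].0.clone(), group.len()))
            .collect();
        levels.push(above);
    }
    levels
}
```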

## Choosing between incremental and bulk insertion
On my computer, I measured that it is about 50× slower to add N facet values incrementally than it is to re-build a database with N facet values in level 0. Therefore, we dynamically choose to use either incremental insertion or bulk insertion based on (1) the number of existing elements in level 0 of the database and (2) the number of facet values from the new documents.

This is imprecise but is mainly aimed at avoiding the worst-case scenario where the incremental insertion method is used repeatedly millions of times.
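As a sketch, with illustrative names (the exact threshold logic in the PR may differ):

```rust
// `existing` is the number of elements in level 0 of the database and
// `new` the number of facet values coming from the new documents.
fn should_use_bulk_insertion(existing: u64, new: u64) -> bool {
    // rebuilding everything costs ~(existing + new) units, while inserting
    // incrementally costs ~50 units per new value (the factor measured above)
    existing + new <= 50 * new
}
```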

## Fuzz-testing

**Potentially controversial:**
I fuzz-tested incremental addition and deletion using fuzzcheck, which found many bugs. The fuzz test consists of inserting/deleting facet values and docids in succession; each operation is processed with different parameters for `group_size`, `max_group_size`, and `min_level_size`. After all the operations are processed, the content of level 0 is compared to the content of an equivalent structure with a simple and easily checked implementation. Furthermore, we check that the database has a correct structure (all groups from levels > 0 correctly combine the content of their children). I also visualised the code coverage achieved by the fuzz test. It covered 100% of the relevant code, except for `unreachable`/`panic` statements and errors returned by `heed`.
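
The structural check can be sketched as follows, again on the in-memory model from earlier (the real test compares the LMDB databases against a trivial reference implementation):

```rust
use std::collections::BTreeSet;

fn check_invariants(levels: &[Vec<(String, usize)>], model: &BTreeSet<String>) {
    // level 0 must contain exactly the model's values, in sorted order
    assert!(levels[0].iter().map(|(k, _)| k.as_str()).eq(model.iter().map(|s| s.as_str())));

    // every group must start at its first child's left bound, and the group
    // sizes at each level must exactly cover the level below
    for h in 1..levels.len() {
        let mut child = 0;
        for (left_bound, size) in &levels[h] {
            assert_eq!(left_bound, &levels[h - 1][child].0);
            child += size;
        }
        assert_eq!(child, levels[h - 1].len());
    }
}
```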

The fuzz-test and the fuzzcheck dependency are only compiled when `cargo fuzzcheck` is used. For now, the dependency is from a local path on my computer, but it can be changed to a crate version if we decide to keep it. 

## Algorithms operating on the facet databases

There are four important algorithms making use of the facet databases:
1. Sort, ascending
2. Sort, descending
3. Facet distribution
4. Range search

Previously, the implementation of all four algorithms was based on a number of iterators specific to each database kind (number or string): `FacetNumberRange`, `FacetNumberRevRange`, `FacetNumberIter` (with a reversed and reducing/non-reducing option), `FacetStringGroupRange`, `FacetStringGroupRevRange`, `FacetStringLevel0Range`, `FacetStringLevel0RevRange`, and `FacetStringIter` (reversed + reducing/non-reducing). 

Now, all four algorithms have a unique implementation shared by both the string and number databases. There are four functions:
1. `ascending_facet_sort` in `search/facet/facet_sort_ascending.rs`
2. `descending_facet_sort` in `search/facet/facet_sort_descending.rs`
3. `iterate_over_facet_distribution` in `search/facet/facet_distribution_iter.rs`
4. `find_docids_of_facet_within_bounds` in `search/facet/facet_range_search.rs`

I have tried to test them with some snapshot tests but more testing could still be done. I don't *think* that the performance of these algorithms regressed, but that will need to be confirmed by benchmarks.
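
For example, this is how the refactored `asc_desc` criterion (shown later in the diff) calls the sort functions, remapping both databases to the byte-slice key codec so one implementation serves numbers and strings alike:

```rust
let make_iter = if is_ascending { ascending_facet_sort } else { descending_facet_sort };

let number_iter = make_iter(
    rtxn,
    index.facet_id_f64_docids.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
    field_id,
    candidates.clone(),
)?;
```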

## Change of behaviour for facet distributions

Previously, the original string value of a facet was stored in level 0 of `facet_id_string_docids`. This is no longer the case. The original string value was used in the implementation of the facet distribution algorithm. Now, to recover it, we pick a random document id which contains the normalised string value and look up the original one in `field_id_docid_facet_strings`. As a consequence, the string value returned in the facet distribution may not appear in any of the candidates. For example,
```json
{ "id": 0, "colour": "RED" }
{ "id": 1, "colour": "red" }
```
Facet distribution for the `colour` field among the candidates `[1]`:
```
{ "RED": 1 }
```
Here, "RED" was given as the original facet value even though it does not appear in the document id `1`.

## Heed codecs

A number of heed codecs related to the facet databases were removed:
* `FacetLevelValueF64Codec`
* `FacetLevelValueU32Codec`
* `FacetStringLevelZeroCodec`
* `StringValueCodec`
* `FacetStringZeroBoundsValueCodec`
* `FacetValueStringCodec`
* `FieldDocIdFacetStringCodec`
* `FieldDocIdFacetF64Codec`

They were replaced by:
* `FacetGroupKeyCodec<C>` (replaces all key codecs for the facet databases)
* `FacetGroupValueCodec` (replaces all value codecs for the facet databases)
* `FieldDocIdFacetCodec<C>` (replaces `FieldDocIdFacetStringCodec` and `FieldDocIdFacetF64Codec`)

Since the associated encoded item of `FacetGroupKeyCodec<C>` is `FacetGroupKey<T>` and we often work with `FacetGroupKey<&[u8]>` and `FacetGroupKey<&str>`, we need codecs that encode values of type `&str` and `&[u8]`. The existing `ByteSlice` and `Str` codecs do not work for that purpose (their `EItem` are `[u8]` and `str`), so I created two new codecs:
* `ByteSliceRefCodec`, a codec with `EItem = DItem = &[u8]`
* `StrRefCodec`, a codec with `EItem = DItem = &str`

I have also factored out the code used to encode an ordered f64 into its own `OrderedF64Codec`.
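
A quick sketch of the property this codec provides (the import path is an assumption; inside milli itself it would be a `crate::` path):

```rust
#[test]
fn ordered_f64_roundtrip() {
    use heed::{BytesDecode, BytesEncode};
    use milli::heed_codec::facet::OrderedF64Codec; // assumed path as of this PR

    let a = OrderedF64Codec::bytes_encode(&-32.0).unwrap();
    let b = OrderedF64Codec::bytes_encode(&1.5).unwrap();
    // lexicographic byte order matches numeric order...
    assert!(a < b);
    // ...and the original float can be read back
    assert_eq!(OrderedF64Codec::bytes_decode(&b).unwrap(), 1.5);
}
```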


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>

Merged by bors[bot] on 2022-10-26 as commit 2e539249cb.
167 changed files with 6362 additions and 2887 deletions.

## Changed files

(File paths below are inferred from the diff contents.)

### `.gitignore`

```diff
@@ -2,6 +2,8 @@
 /target
 /Cargo.lock
+milli/target/
 
 # datasets
 *.csv
 *.mmdb
@@ -11,6 +13,8 @@
 # Snapshots
 ## ... large
 *.full.snap
-## ... unreviewed
+# ... unreviewed
 *.snap.new
 
+# Fuzzcheck data for the facet indexing fuzz test
+milli/fuzz/update::facet::incremental::fuzz::fuzz/
```

### `milli/Cargo.toml`

```diff
@@ -54,7 +54,10 @@ big_s = "1.0.2"
 insta = "1.21.0"
 maplit = "1.0.2"
 md5 = "0.7.0"
-rand = "0.8.5"
+rand = {version = "0.8.5", features = ["small_rng"] }
+
+[target.'cfg(fuzzing)'.dev-dependencies]
+fuzzcheck = "0.12.1"
 
 [features]
 default = [ "charabia/default" ]
```

### `heed_codec/byte_slice_ref.rs` (new file)

```rust
use std::borrow::Cow;

use heed::{BytesDecode, BytesEncode};

/// A codec for values of type `&[u8]`. Unlike `ByteSlice`, its `EItem` and `DItem` associated
/// types are equivalent (= `&'a [u8]`) and these values can reside within another structure.
pub struct ByteSliceRefCodec;

impl<'a> BytesEncode<'a> for ByteSliceRefCodec {
    type EItem = &'a [u8];

    fn bytes_encode(item: &'a Self::EItem) -> Option<Cow<'a, [u8]>> {
        Some(Cow::Borrowed(item))
    }
}

impl<'a> BytesDecode<'a> for ByteSliceRefCodec {
    type DItem = &'a [u8];

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        Some(bytes)
    }
}
```

### `heed_codec/facet/facet_level_value_f64_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::convert::TryInto;

use crate::facet::value_encoding::f64_into_bytes;
use crate::{try_split_array_at, FieldId};

// TODO do not de/serialize right bound when level = 0
pub struct FacetLevelValueF64Codec;

impl<'a> heed::BytesDecode<'a> for FacetLevelValueF64Codec {
    type DItem = (FieldId, u8, f64, f64);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);
        let (level, bytes) = bytes.split_first()?;

        let (left, right) = if *level != 0 {
            let left = bytes[16..24].try_into().ok().map(f64::from_be_bytes)?;
            let right = bytes[24..].try_into().ok().map(f64::from_be_bytes)?;
            (left, right)
        } else {
            let left = bytes[8..].try_into().ok().map(f64::from_be_bytes)?;
            (left, left)
        };

        Some((field_id, *level, left, right))
    }
}

impl heed::BytesEncode<'_> for FacetLevelValueF64Codec {
    type EItem = (FieldId, u8, f64, f64);

    fn bytes_encode((field_id, level, left, right): &Self::EItem) -> Option<Cow<[u8]>> {
        let mut buffer = [0u8; 32];

        let len = if *level != 0 {
            // Write the globally ordered floats.
            let bytes = f64_into_bytes(*left)?;
            buffer[..8].copy_from_slice(&bytes[..]);

            let bytes = f64_into_bytes(*right)?;
            buffer[8..16].copy_from_slice(&bytes[..]);

            // Then the f64 values just to be able to read them back.
            let bytes = left.to_be_bytes();
            buffer[16..24].copy_from_slice(&bytes[..]);

            let bytes = right.to_be_bytes();
            buffer[24..].copy_from_slice(&bytes[..]);

            32 // length
        } else {
            // Write the globally ordered floats.
            let bytes = f64_into_bytes(*left)?;
            buffer[..8].copy_from_slice(&bytes[..]);

            // Then the f64 values just to be able to read them back.
            let bytes = left.to_be_bytes();
            buffer[8..16].copy_from_slice(&bytes[..]);

            16 // length
        };

        let mut bytes = Vec::with_capacity(len + 3);
        bytes.extend_from_slice(&field_id.to_be_bytes());
        bytes.push(*level);
        bytes.extend_from_slice(&buffer[..len]);
        Some(Cow::Owned(bytes))
    }
}

#[cfg(test)]
mod tests {
    use heed::{BytesDecode, BytesEncode};

    use super::*;

    #[test]
    fn globally_ordered_f64() {
        let bytes = FacetLevelValueF64Codec::bytes_encode(&(3, 0, 32.0, 0.0)).unwrap();
        let (name, level, left, right) = FacetLevelValueF64Codec::bytes_decode(&bytes).unwrap();
        assert_eq!((name, level, left, right), (3, 0, 32.0, 32.0));

        let bytes = FacetLevelValueF64Codec::bytes_encode(&(3, 1, -32.0, 32.0)).unwrap();
        let (name, level, left, right) = FacetLevelValueF64Codec::bytes_decode(&bytes).unwrap();
        assert_eq!((name, level, left, right), (3, 1, -32.0, 32.0));
    }
}
```

### `heed_codec/facet/facet_level_value_u32_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::convert::TryInto;
use std::num::NonZeroU8;

use crate::{try_split_array_at, FieldId};

/// A codec that stores the field id, level 1 and higher and the groups ids.
///
/// It can only be used to encode the facet string of the level 1 or higher.
pub struct FacetLevelValueU32Codec;

impl<'a> heed::BytesDecode<'a> for FacetLevelValueU32Codec {
    type DItem = (FieldId, NonZeroU8, u32, u32);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);
        let (level, bytes) = bytes.split_first()?;
        let level = NonZeroU8::new(*level)?;
        let left = bytes[8..12].try_into().ok().map(u32::from_be_bytes)?;
        let right = bytes[12..].try_into().ok().map(u32::from_be_bytes)?;
        Some((field_id, level, left, right))
    }
}

impl heed::BytesEncode<'_> for FacetLevelValueU32Codec {
    type EItem = (FieldId, NonZeroU8, u32, u32);

    fn bytes_encode((field_id, level, left, right): &Self::EItem) -> Option<Cow<[u8]>> {
        let mut buffer = [0u8; 16];

        // Write the big-endian integers.
        let bytes = left.to_be_bytes();
        buffer[..4].copy_from_slice(&bytes[..]);

        let bytes = right.to_be_bytes();
        buffer[4..8].copy_from_slice(&bytes[..]);

        // Then the u32 values just to be able to read them back.
        let bytes = left.to_be_bytes();
        buffer[8..12].copy_from_slice(&bytes[..]);

        let bytes = right.to_be_bytes();
        buffer[12..].copy_from_slice(&bytes[..]);

        let mut bytes = Vec::with_capacity(buffer.len() + 2 + 1);
        bytes.extend_from_slice(&field_id.to_be_bytes());
        bytes.push(level.get());
        bytes.extend_from_slice(&buffer);

        Some(Cow::Owned(bytes))
    }
}
```

### `heed_codec/facet/facet_string_level_zero_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::str;

use crate::{try_split_array_at, FieldId};

/// A codec that stores the field id, level 0, and facet string.
///
/// It can only be used to encode the facet string of the level 0,
/// as it hardcodes the level.
///
/// We encode the level 0 to not break the lexicographical ordering of the LMDB keys,
/// and make sure that the levels are not mixed-up. The level 0 is special, the key
/// are strings, other levels represent groups and keys are simply two integers.
pub struct FacetStringLevelZeroCodec;

impl FacetStringLevelZeroCodec {
    pub fn serialize_into(field_id: FieldId, value: &str, out: &mut Vec<u8>) {
        out.reserve(value.len() + 2);
        out.extend_from_slice(&field_id.to_be_bytes());
        out.push(0); // the level zero (for LMDB ordering only)
        out.extend_from_slice(value.as_bytes());
    }
}

impl<'a> heed::BytesDecode<'a> for FacetStringLevelZeroCodec {
    type DItem = (FieldId, &'a str);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);
        let (level, bytes) = bytes.split_first()?;

        if *level != 0 {
            return None;
        }

        let value = str::from_utf8(bytes).ok()?;
        Some((field_id, value))
    }
}

impl<'a> heed::BytesEncode<'a> for FacetStringLevelZeroCodec {
    type EItem = (FieldId, &'a str);

    fn bytes_encode((field_id, value): &Self::EItem) -> Option<Cow<[u8]>> {
        let mut bytes = Vec::new();
        FacetStringLevelZeroCodec::serialize_into(*field_id, value, &mut bytes);
        Some(Cow::Owned(bytes))
    }
}
```

### `heed_codec/facet/facet_string_level_zero_value_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::convert::TryInto;
use std::{marker, str};

use crate::error::SerializationError;
use crate::heed_codec::RoaringBitmapCodec;
use crate::{try_split_array_at, try_split_at, Result};

pub type FacetStringLevelZeroValueCodec = StringValueCodec<RoaringBitmapCodec>;

/// A codec that encodes a string in front of a value.
///
/// The usecase is for the facet string levels algorithm where we must know the
/// original string of a normalized facet value, the original values are stored
/// in the value to not break the lexicographical ordering of the LMDB keys.
pub struct StringValueCodec<C>(marker::PhantomData<C>);

impl<'a, C> heed::BytesDecode<'a> for StringValueCodec<C>
where
    C: heed::BytesDecode<'a>,
{
    type DItem = (&'a str, C::DItem);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (string, bytes) = decode_prefix_string(bytes)?;
        C::bytes_decode(bytes).map(|item| (string, item))
    }
}

impl<'a, C> heed::BytesEncode<'a> for StringValueCodec<C>
where
    C: heed::BytesEncode<'a>,
{
    type EItem = (&'a str, C::EItem);

    fn bytes_encode((string, value): &'a Self::EItem) -> Option<Cow<[u8]>> {
        let value_bytes = C::bytes_encode(value)?;

        let mut bytes = Vec::with_capacity(2 + string.len() + value_bytes.len());
        encode_prefix_string(string, &mut bytes).ok()?;
        bytes.extend_from_slice(&value_bytes[..]);

        Some(Cow::Owned(bytes))
    }
}

pub fn decode_prefix_string(value: &[u8]) -> Option<(&str, &[u8])> {
    let (original_length_bytes, bytes) = try_split_array_at(value)?;
    let original_length = u16::from_be_bytes(original_length_bytes) as usize;
    let (string, bytes) = try_split_at(bytes, original_length)?;
    let string = str::from_utf8(string).ok()?;
    Some((string, bytes))
}

pub fn encode_prefix_string(string: &str, buffer: &mut Vec<u8>) -> Result<()> {
    let string_len: u16 =
        string.len().try_into().map_err(|_| SerializationError::InvalidNumberSerialization)?;
    buffer.extend_from_slice(&string_len.to_be_bytes());
    buffer.extend_from_slice(string.as_bytes());
    Ok(())
}

#[cfg(test)]
mod tests {
    use heed::types::Unit;
    use heed::{BytesDecode, BytesEncode};
    use roaring::RoaringBitmap;

    use super::*;

    #[test]
    fn deserialize_roaring_bitmaps() {
        let string = "abc";
        let docids: RoaringBitmap = (0..100).chain(3500..4398).collect();
        let key = (string, docids.clone());
        let bytes = StringValueCodec::<RoaringBitmapCodec>::bytes_encode(&key).unwrap();
        let (out_string, out_docids) =
            StringValueCodec::<RoaringBitmapCodec>::bytes_decode(&bytes).unwrap();
        assert_eq!((out_string, out_docids), (string, docids));
    }

    #[test]
    fn deserialize_unit() {
        let string = "def";
        let key = (string, ());
        let bytes = StringValueCodec::<Unit>::bytes_encode(&key).unwrap();
        let (out_string, out_unit) = StringValueCodec::<Unit>::bytes_decode(&bytes).unwrap();
        assert_eq!((out_string, out_unit), (string, ()));
    }
}
```

### `heed_codec/facet/facet_string_zero_bounds_value_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::convert::TryInto;
use std::{marker, str};

use super::try_split_at;

/// A codec that optionally encodes two strings in front of the value.
///
/// The usecase is for the facet string levels algorithm where we must
/// know the origin of a group, the group left and right bounds are stored
/// in the value to not break the lexicographical ordering of the LMDB keys.
pub struct FacetStringZeroBoundsValueCodec<C>(marker::PhantomData<C>);

impl<'a, C> heed::BytesDecode<'a> for FacetStringZeroBoundsValueCodec<C>
where
    C: heed::BytesDecode<'a>,
{
    type DItem = (Option<(&'a str, &'a str)>, C::DItem);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (contains_bounds, bytes) = bytes.split_first()?;

        if *contains_bounds != 0 {
            let (left_len, bytes) = try_split_at(bytes, 2)?;
            let (right_len, bytes) = try_split_at(bytes, 2)?;

            let left_len = left_len.try_into().ok().map(u16::from_be_bytes)?;
            let right_len = right_len.try_into().ok().map(u16::from_be_bytes)?;

            let (left, bytes) = try_split_at(bytes, left_len as usize)?;
            let (right, bytes) = try_split_at(bytes, right_len as usize)?;

            let left = str::from_utf8(left).ok()?;
            let right = str::from_utf8(right).ok()?;

            C::bytes_decode(bytes).map(|item| (Some((left, right)), item))
        } else {
            C::bytes_decode(bytes).map(|item| (None, item))
        }
    }
}

impl<'a, C> heed::BytesEncode<'a> for FacetStringZeroBoundsValueCodec<C>
where
    C: heed::BytesEncode<'a>,
{
    type EItem = (Option<(&'a str, &'a str)>, C::EItem);

    fn bytes_encode((bounds, value): &'a Self::EItem) -> Option<Cow<[u8]>> {
        let mut bytes = Vec::new();

        match bounds {
            Some((left, right)) => {
                bytes.push(u8::max_value());

                if left.is_empty() || right.is_empty() {
                    return None;
                }

                let left_len: u16 = left.len().try_into().ok()?;
                let right_len: u16 = right.len().try_into().ok()?;

                bytes.extend_from_slice(&left_len.to_be_bytes());
                bytes.extend_from_slice(&right_len.to_be_bytes());

                bytes.extend_from_slice(left.as_bytes());
                bytes.extend_from_slice(right.as_bytes());

                let value_bytes = C::bytes_encode(value)?;
                bytes.extend_from_slice(&value_bytes[..]);

                Some(Cow::Owned(bytes))
            }
            None => {
                bytes.push(0);
                let value_bytes = C::bytes_encode(value)?;
                bytes.extend_from_slice(&value_bytes[..]);
                Some(Cow::Owned(bytes))
            }
        }
    }
}

#[cfg(test)]
mod tests {
    use heed::types::Unit;
    use heed::{BytesDecode, BytesEncode};
    use roaring::RoaringBitmap;

    use super::*;
    use crate::CboRoaringBitmapCodec;

    #[test]
    fn deserialize_roaring_bitmaps() {
        let bounds = Some(("abc", "def"));
        let docids: RoaringBitmap = (0..100).chain(3500..4398).collect();
        let key = (bounds, docids.clone());
        let bytes =
            FacetStringZeroBoundsValueCodec::<CboRoaringBitmapCodec>::bytes_encode(&key).unwrap();
        let (out_bounds, out_docids) =
            FacetStringZeroBoundsValueCodec::<CboRoaringBitmapCodec>::bytes_decode(&bytes).unwrap();
        assert_eq!((out_bounds, out_docids), (bounds, docids));
    }

    #[test]
    fn deserialize_unit() {
        let bounds = Some(("abc", "def"));
        let key = (bounds, ());
        let bytes = FacetStringZeroBoundsValueCodec::<Unit>::bytes_encode(&key).unwrap();
        let (out_bounds, out_unit) =
            FacetStringZeroBoundsValueCodec::<Unit>::bytes_decode(&bytes).unwrap();
        assert_eq!((out_bounds, out_unit), (bounds, ()));
    }
}
```

### `heed_codec/facet/facet_value_string_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::str;

use crate::{try_split_array_at, FieldId};

pub struct FacetValueStringCodec;

impl FacetValueStringCodec {
    pub fn serialize_into(field_id: FieldId, value: &str, out: &mut Vec<u8>) {
        out.reserve(value.len() + 2);
        out.extend_from_slice(&field_id.to_be_bytes());
        out.extend_from_slice(value.as_bytes());
    }
}

impl<'a> heed::BytesDecode<'a> for FacetValueStringCodec {
    type DItem = (FieldId, &'a str);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);
        let value = str::from_utf8(bytes).ok()?;
        Some((field_id, value))
    }
}

impl<'a> heed::BytesEncode<'a> for FacetValueStringCodec {
    type EItem = (FieldId, &'a str);

    fn bytes_encode((field_id, value): &Self::EItem) -> Option<Cow<[u8]>> {
        let mut bytes = Vec::new();
        FacetValueStringCodec::serialize_into(*field_id, value, &mut bytes);
        Some(Cow::Owned(bytes))
    }
}
```

### `heed_codec/facet/field_doc_id_facet_codec.rs` (new file)

```rust
use std::borrow::Cow;
use std::marker::PhantomData;

use heed::{BytesDecode, BytesEncode};

use crate::{try_split_array_at, DocumentId, FieldId};

pub struct FieldDocIdFacetCodec<C>(PhantomData<C>);

impl<'a, C> BytesDecode<'a> for FieldDocIdFacetCodec<C>
where
    C: BytesDecode<'a>,
{
    type DItem = (FieldId, DocumentId, C::DItem);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);

        let (document_id_bytes, bytes) = try_split_array_at(bytes)?;
        let document_id = u32::from_be_bytes(document_id_bytes);

        let value = C::bytes_decode(bytes)?;

        Some((field_id, document_id, value))
    }
}

impl<'a, C> BytesEncode<'a> for FieldDocIdFacetCodec<C>
where
    C: BytesEncode<'a>,
{
    type EItem = (FieldId, DocumentId, C::EItem);

    fn bytes_encode((field_id, document_id, value): &'a Self::EItem) -> Option<Cow<'a, [u8]>> {
        let mut bytes = Vec::with_capacity(32);
        bytes.extend_from_slice(&field_id.to_be_bytes()); // 2 bytes
        bytes.extend_from_slice(&document_id.to_be_bytes()); // 4 bytes
        let value_bytes = C::bytes_encode(value)?;
        // variable length, if f64 -> 16 bytes, if string -> large, potentially
        bytes.extend_from_slice(&value_bytes);
        Some(Cow::Owned(bytes))
    }
}
```

### `heed_codec/facet/field_doc_id_facet_f64_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::convert::TryInto;

use crate::facet::value_encoding::f64_into_bytes;
use crate::{try_split_array_at, DocumentId, FieldId};

pub struct FieldDocIdFacetF64Codec;

impl<'a> heed::BytesDecode<'a> for FieldDocIdFacetF64Codec {
    type DItem = (FieldId, DocumentId, f64);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);

        let (document_id_bytes, bytes) = try_split_array_at(bytes)?;
        let document_id = u32::from_be_bytes(document_id_bytes);

        let value = bytes[8..16].try_into().map(f64::from_be_bytes).ok()?;

        Some((field_id, document_id, value))
    }
}

impl<'a> heed::BytesEncode<'a> for FieldDocIdFacetF64Codec {
    type EItem = (FieldId, DocumentId, f64);

    fn bytes_encode((field_id, document_id, value): &Self::EItem) -> Option<Cow<[u8]>> {
        let mut bytes = Vec::with_capacity(2 + 4 + 8 + 8);
        bytes.extend_from_slice(&field_id.to_be_bytes());
        bytes.extend_from_slice(&document_id.to_be_bytes());
        let value_bytes = f64_into_bytes(*value)?;
        bytes.extend_from_slice(&value_bytes);
        bytes.extend_from_slice(&value.to_be_bytes());
        Some(Cow::Owned(bytes))
    }
}
```

### `heed_codec/facet/field_doc_id_facet_string_codec.rs` (removed)

```rust
use std::borrow::Cow;
use std::str;

use crate::{try_split_array_at, DocumentId, FieldId};

pub struct FieldDocIdFacetStringCodec;

impl FieldDocIdFacetStringCodec {
    pub fn serialize_into(
        field_id: FieldId,
        document_id: DocumentId,
        normalized_value: &str,
        out: &mut Vec<u8>,
    ) {
        out.reserve(2 + 4 + normalized_value.len());
        out.extend_from_slice(&field_id.to_be_bytes());
        out.extend_from_slice(&document_id.to_be_bytes());
        out.extend_from_slice(normalized_value.as_bytes());
    }
}

impl<'a> heed::BytesDecode<'a> for FieldDocIdFacetStringCodec {
    type DItem = (FieldId, DocumentId, &'a str);

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let (field_id_bytes, bytes) = try_split_array_at(bytes)?;
        let field_id = u16::from_be_bytes(field_id_bytes);

        let (document_id_bytes, bytes) = try_split_array_at(bytes)?;
        let document_id = u32::from_be_bytes(document_id_bytes);

        let normalized_value = str::from_utf8(bytes).ok()?;
        Some((field_id, document_id, normalized_value))
    }
}

impl<'a> heed::BytesEncode<'a> for FieldDocIdFacetStringCodec {
    type EItem = (FieldId, DocumentId, &'a str);

    fn bytes_encode((field_id, document_id, normalized_value): &Self::EItem) -> Option<Cow<[u8]>> {
        let mut bytes = Vec::new();
        FieldDocIdFacetStringCodec::serialize_into(
            *field_id,
            *document_id,
            normalized_value,
            &mut bytes,
        );
        Some(Cow::Owned(bytes))
    }
}
```

### `heed_codec/facet/mod.rs`

```diff
@@ -1,23 +1,22 @@
-mod facet_level_value_f64_codec;
-mod facet_level_value_u32_codec;
-mod facet_string_level_zero_codec;
-mod facet_string_level_zero_value_codec;
-mod facet_string_zero_bounds_value_codec;
-mod field_doc_id_facet_f64_codec;
-mod field_doc_id_facet_string_codec;
+mod field_doc_id_facet_codec;
+mod ordered_f64_codec;
 
-use heed::types::OwnedType;
+use std::borrow::Cow;
+use std::convert::TryFrom;
+use std::marker::PhantomData;
 
-pub use self::facet_level_value_f64_codec::FacetLevelValueF64Codec;
-pub use self::facet_level_value_u32_codec::FacetLevelValueU32Codec;
-pub use self::facet_string_level_zero_codec::FacetStringLevelZeroCodec;
-pub use self::facet_string_level_zero_value_codec::{
-    decode_prefix_string, encode_prefix_string, FacetStringLevelZeroValueCodec,
-};
-pub use self::facet_string_zero_bounds_value_codec::FacetStringZeroBoundsValueCodec;
-pub use self::field_doc_id_facet_f64_codec::FieldDocIdFacetF64Codec;
-pub use self::field_doc_id_facet_string_codec::FieldDocIdFacetStringCodec;
-use crate::BEU16;
+use heed::types::{DecodeIgnore, OwnedType};
+use heed::{BytesDecode, BytesEncode};
+use roaring::RoaringBitmap;
+
+pub use self::field_doc_id_facet_codec::FieldDocIdFacetCodec;
+pub use self::ordered_f64_codec::OrderedF64Codec;
+use super::StrRefCodec;
+use crate::{CboRoaringBitmapCodec, BEU16};
+
+pub type FieldDocIdFacetF64Codec = FieldDocIdFacetCodec<OrderedF64Codec>;
+pub type FieldDocIdFacetStringCodec = FieldDocIdFacetCodec<StrRefCodec>;
+pub type FieldDocIdFacetIgnoreCodec = FieldDocIdFacetCodec<DecodeIgnore>;
 
 pub type FieldIdCodec = OwnedType<BEU16>;
@@ -30,3 +29,76 @@ pub fn try_split_at(slice: &[u8], mid: usize) -> Option<(&[u8], &[u8])> {
         None
     }
 }
+
+/// The key in the [`facet_id_string_docids` and `facet_id_f64_docids`][`Index::facet_id_string_docids`]
+/// databases.
+#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)] // TODO: try removing PartialOrd and Ord
+pub struct FacetGroupKey<T> {
+    pub field_id: u16,
+    pub level: u8,
+    pub left_bound: T,
+}
+
+/// The value in the [`facet_id_string_docids` and `facet_id_f64_docids`][`Index::facet_id_string_docids`]
+/// databases.
+#[derive(Debug)]
+pub struct FacetGroupValue {
+    pub size: u8,
+    pub bitmap: RoaringBitmap,
+}
+
+pub struct FacetGroupKeyCodec<T> {
+    _phantom: PhantomData<T>,
+}
+
+impl<'a, T> heed::BytesEncode<'a> for FacetGroupKeyCodec<T>
+where
+    T: BytesEncode<'a>,
+    T::EItem: Sized,
+{
+    type EItem = FacetGroupKey<T::EItem>;
+
+    fn bytes_encode(value: &'a Self::EItem) -> Option<Cow<'a, [u8]>> {
+        let mut v = vec![];
+        v.extend_from_slice(&value.field_id.to_be_bytes());
+        v.extend_from_slice(&[value.level]);
+        let bound = T::bytes_encode(&value.left_bound)?;
+        v.extend_from_slice(&bound);
+        Some(Cow::Owned(v))
+    }
+}
+
+impl<'a, T> heed::BytesDecode<'a> for FacetGroupKeyCodec<T>
+where
+    T: BytesDecode<'a>,
+{
+    type DItem = FacetGroupKey<T::DItem>;
+
+    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
+        let fid = u16::from_be_bytes(<[u8; 2]>::try_from(&bytes[0..=1]).ok()?);
+        let level = bytes[2];
+        let bound = T::bytes_decode(&bytes[3..])?;
+        Some(FacetGroupKey { field_id: fid, level, left_bound: bound })
+    }
+}
+
+pub struct FacetGroupValueCodec;
+
+impl<'a> heed::BytesEncode<'a> for FacetGroupValueCodec {
+    type EItem = FacetGroupValue;
+
+    fn bytes_encode(value: &'a Self::EItem) -> Option<Cow<'a, [u8]>> {
+        let mut v = vec![];
+        v.push(value.size);
+        CboRoaringBitmapCodec::serialize_into(&value.bitmap, &mut v);
+        Some(Cow::Owned(v))
+    }
+}
+
+impl<'a> heed::BytesDecode<'a> for FacetGroupValueCodec {
+    type DItem = FacetGroupValue;
+
+    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
+        let size = bytes[0];
+        let bitmap = CboRoaringBitmapCodec::deserialize_from(&bytes[1..]).ok()?;
+        Some(FacetGroupValue { size, bitmap })
+    }
+}
```

### `heed_codec/facet/ordered_f64_codec.rs` (new file)

```rust
use std::borrow::Cow;
use std::convert::TryInto;

use heed::BytesDecode;

use crate::facet::value_encoding::f64_into_bytes;

pub struct OrderedF64Codec;

impl<'a> BytesDecode<'a> for OrderedF64Codec {
    type DItem = f64;

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        if bytes.len() < 16 {
            return None;
        }
        let f = bytes[8..].try_into().ok().map(f64::from_be_bytes)?;
        Some(f)
    }
}

impl heed::BytesEncode<'_> for OrderedF64Codec {
    type EItem = f64;

    fn bytes_encode(f: &Self::EItem) -> Option<Cow<[u8]>> {
        let mut buffer = [0u8; 16];

        // write the globally ordered float
        let bytes = f64_into_bytes(*f)?;
        buffer[..8].copy_from_slice(&bytes[..]);
        // Then the f64 value just to be able to read it back
        let bytes = f.to_be_bytes();
        buffer[8..16].copy_from_slice(&bytes[..]);

        Some(Cow::Owned(buffer.to_vec()))
    }
}
```

### `heed_codec/mod.rs`

```diff
@@ -1,12 +1,17 @@
 mod beu32_str_codec;
+mod byte_slice_ref;
 pub mod facet;
 mod field_id_word_count_codec;
 mod obkv_codec;
 mod roaring_bitmap;
 mod roaring_bitmap_length;
 mod str_beu32_codec;
+mod str_ref;
 mod str_str_u8_codec;
 
+pub use byte_slice_ref::ByteSliceRefCodec;
+pub use str_ref::StrRefCodec;
+
 pub use self::beu32_str_codec::BEU32StrCodec;
 pub use self::field_id_word_count_codec::FieldIdWordCountCodec;
 pub use self::obkv_codec::ObkvCodec;
```

### `heed_codec/str_ref.rs` (new file)

```rust
use std::borrow::Cow;

use heed::{BytesDecode, BytesEncode};

/// A codec for values of type `&str`. Unlike `Str`, its `EItem` and `DItem` associated
/// types are equivalent (= `&'a str`) and these values can reside within another structure.
pub struct StrRefCodec;

impl<'a> BytesEncode<'a> for StrRefCodec {
    type EItem = &'a str;

    fn bytes_encode(item: &'a &'a str) -> Option<Cow<'a, [u8]>> {
        Some(Cow::Borrowed(item.as_bytes()))
    }
}

impl<'a> BytesDecode<'a> for StrRefCodec {
    type DItem = &'a str;

    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
        let s = std::str::from_utf8(bytes).ok()?;
        Some(s)
    }
}
```

### `index.rs`

```diff
@@ -12,11 +12,13 @@ use rstar::RTree;
 use time::OffsetDateTime;
 
 use crate::error::{InternalError, UserError};
+use crate::facet::FacetType;
 use crate::fields_ids_map::FieldsIdsMap;
 use crate::heed_codec::facet::{
-    FacetLevelValueF64Codec, FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec,
-    FieldDocIdFacetF64Codec, FieldDocIdFacetStringCodec, FieldIdCodec,
+    FacetGroupKeyCodec, FacetGroupValueCodec, FieldDocIdFacetF64Codec, FieldDocIdFacetStringCodec,
+    FieldIdCodec, OrderedF64Codec,
 };
+use crate::heed_codec::StrRefCodec;
 use crate::{
     default_criteria, BEU32StrCodec, BoRoaringBitmapCodec, CboRoaringBitmapCodec, Criterion,
     DocumentId, ExternalDocumentsIds, FacetDistribution, FieldDistribution, FieldId,
@@ -123,10 +125,10 @@ pub struct Index {
     /// Maps the facet field id and the docids for which this field exists
     pub facet_id_exists_docids: Database<FieldIdCodec, CboRoaringBitmapCodec>,
 
-    /// Maps the facet field id, level and the number with the docids that corresponds to it.
-    pub facet_id_f64_docids: Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
-    /// Maps the facet field id and the string with the original string and docids that corresponds to it.
-    pub facet_id_string_docids: Database<FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>,
+    /// Maps the facet field id and ranges of numbers with the docids that corresponds to them.
+    pub facet_id_f64_docids: Database<FacetGroupKeyCodec<OrderedF64Codec>, FacetGroupValueCodec>,
+    /// Maps the facet field id and ranges of strings with the docids that corresponds to them.
+    pub facet_id_string_docids: Database<FacetGroupKeyCodec<StrRefCodec>, FacetGroupValueCodec>,
 
     /// Maps the document id, the facet field id and the numbers.
     pub field_id_docid_facet_f64s: Database<FieldDocIdFacetF64Codec, Unit>,
@@ -775,68 +777,38 @@
     /* faceted documents ids */
 
-    /// Writes the documents ids that are faceted with numbers under this field id.
-    pub(crate) fn put_number_faceted_documents_ids(
+    /// Writes the documents ids that are faceted under this field id for the given facet type.
+    pub fn put_faceted_documents_ids(
         &self,
         wtxn: &mut RwTxn,
         field_id: FieldId,
+        facet_type: FacetType,
         docids: &RoaringBitmap,
     ) -> heed::Result<()> {
-        let mut buffer =
-            [0u8; main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.len() + size_of::<FieldId>()];
-        buffer[..main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.len()]
-            .copy_from_slice(main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.as_bytes());
-        buffer[main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.len()..]
-            .copy_from_slice(&field_id.to_be_bytes());
+        let key = match facet_type {
+            FacetType::String => main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX,
+            FacetType::Number => main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX,
+        };
+        let mut buffer = vec![0u8; key.len() + size_of::<FieldId>()];
+        buffer[..key.len()].copy_from_slice(key.as_bytes());
+        buffer[key.len()..].copy_from_slice(&field_id.to_be_bytes());
         self.main.put::<_, ByteSlice, RoaringBitmapCodec>(wtxn, &buffer, docids)
     }
 
-    /// Retrieve all the documents ids that faceted with numbers under this field id.
-    pub fn number_faceted_documents_ids(
+    /// Retrieve all the documents ids that are faceted under this field id for the given facet type.
+    pub fn faceted_documents_ids(
        &self,
        rtxn: &RoTxn,
        field_id: FieldId,
+       facet_type: FacetType,
     ) -> heed::Result<RoaringBitmap> {
-        let mut buffer =
-            [0u8; main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.len() + size_of::<FieldId>()];
-        buffer[..main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.len()]
-            .copy_from_slice(main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.as_bytes());
-        buffer[main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX.len()..]
-            .copy_from_slice(&field_id.to_be_bytes());
-        match self.main.get::<_, ByteSlice, RoaringBitmapCodec>(rtxn, &buffer)? {
-            Some(docids) => Ok(docids),
-            None => Ok(RoaringBitmap::new()),
-        }
-    }
-
-    /// Writes the documents ids that are faceted with strings under this field id.
-    pub(crate) fn put_string_faceted_documents_ids(
-        &self,
-        wtxn: &mut RwTxn,
-        field_id: FieldId,
-        docids: &RoaringBitmap,
-    ) -> heed::Result<()> {
-        let mut buffer =
-            [0u8; main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.len() + size_of::<FieldId>()];
-        buffer[..main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.len()]
-            .copy_from_slice(main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.as_bytes());
-        buffer[main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.len()..]
-            .copy_from_slice(&field_id.to_be_bytes());
-        self.main.put::<_, ByteSlice, RoaringBitmapCodec>(wtxn, &buffer, docids)
-    }
-
-    /// Retrieve all the documents ids that faceted with strings under this field id.
-    pub fn string_faceted_documents_ids(
-        &self,
-        rtxn: &RoTxn,
-        field_id: FieldId,
-    ) -> heed::Result<RoaringBitmap> {
-        let mut buffer =
-            [0u8; main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.len() + size_of::<FieldId>()];
-        buffer[..main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.len()]
-            .copy_from_slice(main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.as_bytes());
-        buffer[main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX.len()..]
-            .copy_from_slice(&field_id.to_be_bytes());
+        let key = match facet_type {
+            FacetType::String => main_key::STRING_FACETED_DOCUMENTS_IDS_PREFIX,
+            FacetType::Number => main_key::NUMBER_FACETED_DOCUMENTS_IDS_PREFIX,
+        };
+        let mut buffer = vec![0u8; key.len() + size_of::<FieldId>()];
+        buffer[..key.len()].copy_from_slice(key.as_bytes());
+        buffer[key.len()..].copy_from_slice(&field_id.to_be_bytes());
         match self.main.get::<_, ByteSlice, RoaringBitmapCodec>(rtxn, &buffer)? {
             Some(docids) => Ok(docids),
             None => Ok(RoaringBitmap::new()),
```

### `lib.rs`

```diff
@@ -1,3 +1,4 @@
+#![cfg_attr(all(test, fuzzing), feature(no_coverage))]
 #![allow(clippy::reversed_empty_ranges)]
 #![allow(clippy::too_many_arguments)]
 #[macro_use]
```

### `search/criteria/asc_desc.rs`

```diff
@@ -6,8 +6,11 @@ use ordered_float::OrderedFloat;
 use roaring::RoaringBitmap;
 
 use super::{Criterion, CriterionParameters, CriterionResult};
+use crate::facet::FacetType;
+use crate::heed_codec::facet::FacetGroupKeyCodec;
+use crate::heed_codec::ByteSliceRefCodec;
 use crate::search::criteria::{resolve_query_tree, CriteriaBuilder};
-use crate::search::facet::{FacetNumberIter, FacetStringIter};
+use crate::search::facet::{ascending_facet_sort, descending_facet_sort};
 use crate::search::query_tree::Operation;
 use crate::{FieldId, Index, Result};
@@ -59,8 +62,10 @@ impl<'t> AscDesc<'t> {
         let field_id = fields_ids_map.id(&field_name);
         let faceted_candidates = match field_id {
             Some(field_id) => {
-                let number_faceted = index.number_faceted_documents_ids(rtxn, field_id)?;
-                let string_faceted = index.string_faceted_documents_ids(rtxn, field_id)?;
+                let number_faceted =
+                    index.faceted_documents_ids(rtxn, field_id, FacetType::Number)?;
+                let string_faceted =
+                    index.faceted_documents_ids(rtxn, field_id, FacetType::String)?;
                 number_faceted | string_faceted
             }
             None => RoaringBitmap::default(),
@@ -186,21 +191,21 @@ fn facet_ordered<'t>(
             iterative_facet_string_ordered_iter(index, rtxn, field_id, is_ascending, candidates)?;
         Ok(Box::new(number_iter.chain(string_iter).map(Ok)) as Box<dyn Iterator<Item = _>>)
     } else {
-        let facet_number_fn = if is_ascending {
-            FacetNumberIter::new_reducing
-        } else {
-            FacetNumberIter::new_reverse_reducing
-        };
-        let number_iter = facet_number_fn(rtxn, index, field_id, candidates.clone())?
-            .map(|res| res.map(|(_, docids)| docids));
-
-        let facet_string_fn = if is_ascending {
-            FacetStringIter::new_reducing
-        } else {
-            FacetStringIter::new_reverse_reducing
-        };
-        let string_iter = facet_string_fn(rtxn, index, field_id, candidates)?
-            .map(|res| res.map(|(_, _, docids)| docids));
+        let make_iter = if is_ascending { ascending_facet_sort } else { descending_facet_sort };
+
+        let number_iter = make_iter(
+            rtxn,
+            index.facet_id_f64_docids.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
+            field_id,
+            candidates.clone(),
+        )?;
+
+        let string_iter = make_iter(
+            rtxn,
+            index.facet_id_string_docids.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
+            field_id,
+            candidates,
+        )?;
 
         Ok(Box::new(number_iter.chain(string_iter)))
     }
```

### `search/distinct/facet_distinct.rs`

```diff
@@ -6,7 +6,7 @@ use roaring::RoaringBitmap;
 use super::{Distinct, DocIter};
 use crate::error::InternalError;
-use crate::heed_codec::facet::*;
+use crate::heed_codec::facet::{FacetGroupKey, *};
 use crate::index::db_name;
 use crate::{DocumentId, FieldId, Index, Result};
@@ -47,13 +47,16 @@ impl<'a> FacetDistinctIter<'a> {
     fn facet_string_docids(&self, key: &str) -> heed::Result<Option<RoaringBitmap>> {
         self.index
             .facet_id_string_docids
-            .get(self.txn, &(self.distinct, key))
-            .map(|result| result.map(|(_original, docids)| docids))
+            .get(self.txn, &FacetGroupKey { field_id: self.distinct, level: 0, left_bound: key })
+            .map(|opt| opt.map(|v| v.bitmap))
     }
 
     fn facet_number_docids(&self, key: f64) -> heed::Result<Option<RoaringBitmap>> {
         // get facet docids on level 0
-        self.index.facet_id_f64_docids.get(self.txn, &(self.distinct, 0, key, key))
+        self.index
+            .facet_id_f64_docids
+            .get(self.txn, &FacetGroupKey { field_id: self.distinct, level: 0, left_bound: key })
+            .map(|opt| opt.map(|v| v.bitmap))
     }
 
     fn distinct_string(&mut self, id: DocumentId) -> Result<()> {
```

### `search/facet/facet_distribution.rs`

```diff
@@ -1,16 +1,19 @@
 use std::collections::{BTreeMap, HashSet};
-use std::ops::Bound::Unbounded;
+use std::ops::ControlFlow;
 use std::{fmt, mem};
 
 use heed::types::ByteSlice;
+use heed::BytesDecode;
 use roaring::RoaringBitmap;
 
 use crate::error::UserError;
 use crate::facet::FacetType;
 use crate::heed_codec::facet::{
-    FacetStringLevelZeroCodec, FieldDocIdFacetF64Codec, FieldDocIdFacetStringCodec,
+    FacetGroupKeyCodec, FacetGroupValueCodec, FieldDocIdFacetF64Codec, FieldDocIdFacetStringCodec,
+    OrderedF64Codec,
 };
-use crate::search::facet::{FacetNumberIter, FacetNumberRange, FacetStringIter};
+use crate::heed_codec::{ByteSliceRefCodec, StrRefCodec};
+use crate::search::facet::facet_distribution_iter;
 use crate::{FieldId, Index, Result};
 
 /// The default number of values by facets that will
@@ -94,7 +97,7 @@ impl<'a> FacetDistribution<'a> {
         let mut key_buffer: Vec<_> = field_id.to_be_bytes().to_vec();
 
         let db = self.index.field_id_docid_facet_strings;
-        for docid in candidates.into_iter() {
+        'outer: for docid in candidates.into_iter() {
             key_buffer.truncate(mem::size_of::<FieldId>());
             key_buffer.extend_from_slice(&docid.to_be_bytes());
 
             let iter = db
@@ -110,7 +113,7 @@ impl<'a> FacetDistribution<'a> {
                     *count += 1;
 
                     if normalized_distribution.len() == self.max_values_per_facet {
-                        break;
+                        break 'outer;
                     }
                 }
             }
@@ -133,21 +136,23 @@ impl<'a> FacetDistribution<'a> {
         candidates: &RoaringBitmap,
         distribution: &mut BTreeMap<String, u64>,
     ) -> heed::Result<()> {
-        let iter =
-            FacetNumberIter::new_non_reducing(self.rtxn, self.index, field_id, candidates.clone())?;
-
-        for result in iter {
-            let (value, mut docids) = result?;
-            docids &= candidates;
-            if !docids.is_empty() {
-                distribution.insert(value.to_string(), docids.len());
-            }
-            if distribution.len() == self.max_values_per_facet {
-                break;
-            }
-        }
-
-        Ok(())
+        facet_distribution_iter::iterate_over_facet_distribution(
+            self.rtxn,
+            self.index
+                .facet_id_f64_docids
+                .remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
+            field_id,
+            candidates,
+            |facet_key, nbr_docids, _| {
+                let facet_key = OrderedF64Codec::bytes_decode(facet_key).unwrap();
+                distribution.insert(facet_key.to_string(), nbr_docids);
+                if distribution.len() == self.max_values_per_facet {
+                    Ok(ControlFlow::Break(()))
+                } else {
+                    Ok(ControlFlow::Continue(()))
+                }
+            },
+        )
     }
 
     fn facet_strings_distribution_from_facet_levels(
@@ -156,21 +161,32 @@ impl<'a> FacetDistribution<'a> {
         candidates: &RoaringBitmap,
         distribution: &mut BTreeMap<String, u64>,
     ) -> heed::Result<()> {
-        let iter =
-            FacetStringIter::new_non_reducing(self.rtxn, self.index, field_id, candidates.clone())?;
-
-        for result in iter {
-            let (_normalized, original, mut docids) = result?;
-            docids &= candidates;
-            if !docids.is_empty() {
-                distribution.insert(original.to_string(), docids.len());
-            }
-            if distribution.len() == self.max_values_per_facet {
-                break;
-            }
-        }
-
-        Ok(())
+        facet_distribution_iter::iterate_over_facet_distribution(
+            self.rtxn,
+            self.index
+                .facet_id_string_docids
+                .remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
+            field_id,
+            candidates,
+            |facet_key, nbr_docids, any_docid| {
+                let facet_key = StrRefCodec::bytes_decode(facet_key).unwrap();
+
+                let key: (FieldId, _, &str) = (field_id, any_docid, facet_key);
+                let original_string = self
+                    .index
+                    .field_id_docid_facet_strings
+                    .get(self.rtxn, &key)?
+                    .unwrap()
+                    .to_owned();
+
+                distribution.insert(original_string, nbr_docids);
+                if distribution.len() == self.max_values_per_facet {
+                    Ok(ControlFlow::Break(()))
+                } else {
+                    Ok(ControlFlow::Continue(()))
+                }
+            },
+        )
     }
 
     /// Placeholder search, a.k.a. no candidates were specified. We iterate throught the
@@ -182,11 +198,18 @@ impl<'a> FacetDistribution<'a> {
         let mut distribution = BTreeMap::new();
 
         let db = self.index.facet_id_f64_docids;
-        let range = FacetNumberRange::new(self.rtxn, db, field_id, 0, Unbounded, Unbounded)?;
-
-        for result in range {
-            let ((_, _, value, _), docids) = result?;
-            distribution.insert(value.to_string(), docids.len());
+        let mut prefix = vec![];
+        prefix.extend_from_slice(&field_id.to_be_bytes());
+        prefix.push(0); // read values from level 0 only
+
+        let iter = db
+            .as_polymorph()
+            .prefix_iter::<_, ByteSlice, ByteSlice>(self.rtxn, prefix.as_slice())?
+            .remap_types::<FacetGroupKeyCodec<OrderedF64Codec>, FacetGroupValueCodec>();
+
+        for result in iter {
+            let (key, value) = result?;
+            distribution.insert(key.left_bound.to_string(), value.bitmap.len());
             if distribution.len() == self.max_values_per_facet {
                 break;
             }
@@ -195,24 +218,24 @@ impl<'a> FacetDistribution<'a> {
         let iter = self
             .index
             .facet_id_string_docids
-            .remap_key_type::<ByteSlice>()
-            .prefix_iter(self.rtxn, &field_id.to_be_bytes())?
-            .remap_key_type::<FacetStringLevelZeroCodec>();
+            .as_polymorph()
+            .prefix_iter::<_, ByteSlice, ByteSlice>(self.rtxn, prefix.as_slice())?
+            .remap_types::<FacetGroupKeyCodec<StrRefCodec>, FacetGroupValueCodec>();
 
-        let mut normalized_distribution = BTreeMap::new();
         for result in iter {
-            let ((_, normalized_value), (original_value, docids)) = result?;
-            normalized_distribution.insert(normalized_value, (original_value, docids.len()));
-            if normalized_distribution.len() == self.max_values_per_facet {
+            let (key, value) = result?;
+
+            let docid = value.bitmap.iter().next().unwrap();
+            let key: (FieldId, _, &'a str) = (field_id, docid, key.left_bound);
+            let original_string =
+                self.index.field_id_docid_facet_strings.get(self.rtxn, &key)?.unwrap().to_owned();
+
+            distribution.insert(original_string, value.bitmap.len());
+            if distribution.len() == self.max_values_per_facet {
                 break;
             }
         }
 
-        let iter = normalized_distribution
-            .into_iter()
-            .map(|(_normalized, (original, count))| (original.to_string(), count));
-        distribution.extend(iter);
-
         Ok(distribution)
     }
@@ -301,3 +324,216 @@ impl fmt::Debug for FacetDistribution<'_> {
             .finish()
     }
 }
+
+#[cfg(test)]
+mod tests {
+    use big_s::S;
+    use maplit::hashset;
+
+    use crate::documents::documents_batch_reader_from_objects;
+    use crate::index::tests::TempIndex;
+    use crate::{milli_snap, FacetDistribution};
+
+    #[test]
+    fn few_candidates_few_facet_values() {
+        // All the tests here avoid using the code in `facet_distribution_iter` because there aren't
+        // enough candidates.
+        let mut index = TempIndex::new();
+        index.index_documents_config.autogenerate_docids = true;
+
+        index
+            .update_settings(|settings| settings.set_filterable_fields(hashset! { S("colour") }))
+            .unwrap();
+
+        let documents = documents!([
+            { "colour": "Blue" },
+            { "colour": " blue" },
+            { "colour": "RED" }
+        ]);
+
+        index.add_documents(documents).unwrap();
+
+        let txn = index.read_txn().unwrap();
+
+        let map = FacetDistribution::new(&txn, &index)
+            .facets(std::iter::once("colour"))
+            .execute()
+            .unwrap();
+
+        milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 2, "RED": 1}}"###);
+
+        let map = FacetDistribution::new(&txn, &index)
+            .facets(std::iter::once("colour"))
```
.candidates([0, 1, 2].iter().copied().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 2, "RED": 1}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates([1, 2].iter().copied().collect())
.execute()
.unwrap();
// I think it would be fine if " blue" was "Blue" instead.
// We just need to get any non-normalised string I think, even if it's not in
// the candidates
milli_snap!(format!("{map:?}"), @r###"{"colour": {" blue": 1, "RED": 1}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates([2].iter().copied().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"RED": 1}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates([0, 1, 2].iter().copied().collect())
.max_values_per_facet(1)
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 1}}"###);
}
#[test]
fn many_candidates_few_facet_values() {
let mut index = TempIndex::new_with_map_size(4096 * 10_000);
index.index_documents_config.autogenerate_docids = true;
index
.update_settings(|settings| settings.set_filterable_fields(hashset! { S("colour") }))
.unwrap();
let facet_values = ["Red", "RED", " red ", "Blue", "BLUE"];
let mut documents = vec![];
for i in 0..10_000 {
let document = serde_json::json!({
"colour": facet_values[i % 5],
})
.as_object()
.unwrap()
.clone();
documents.push(document);
}
let documents = documents_batch_reader_from_objects(documents);
index.add_documents(documents).unwrap();
let txn = index.read_txn().unwrap();
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 4000, "Red": 6000}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.max_values_per_facet(1)
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 4000}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates((0..10_000).into_iter().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 4000, "Red": 6000}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates((0..5_000).into_iter().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 2000, "Red": 3000}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates((0..5_000).into_iter().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 2000, "Red": 3000}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates((0..5_000).into_iter().collect())
.max_values_per_facet(1)
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), @r###"{"colour": {"Blue": 2000}}"###);
}
#[test]
fn many_candidates_many_facet_values() {
let mut index = TempIndex::new_with_map_size(4096 * 10_000);
index.index_documents_config.autogenerate_docids = true;
index
.update_settings(|settings| settings.set_filterable_fields(hashset! { S("colour") }))
.unwrap();
let facet_values = (0..1000).into_iter().map(|x| format!("{x:x}")).collect::<Vec<_>>();
let mut documents = vec![];
for i in 0..10_000 {
let document = serde_json::json!({
"colour": facet_values[i % 1000],
})
.as_object()
.unwrap()
.clone();
documents.push(document);
}
let documents = documents_batch_reader_from_objects(documents);
index.add_documents(documents).unwrap();
let txn = index.read_txn().unwrap();
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), "no_candidates", @"ac9229ed5964d893af96a7076e2f8af5");
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.max_values_per_facet(2)
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), "no_candidates_with_max_2", @r###"{"colour": {"0": 10, "1": 10}}"###);
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates((0..10_000).into_iter().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), "candidates_0_10_000", @"ac9229ed5964d893af96a7076e2f8af5");
let map = FacetDistribution::new(&txn, &index)
.facets(std::iter::once("colour"))
.candidates((0..5_000).into_iter().collect())
.execute()
.unwrap();
milli_snap!(format!("{map:?}"), "candidates_0_5_000", @"825f23a4090d05756f46176987b7d992");
}
}

View File

@ -0,0 +1,196 @@
use std::ops::ControlFlow;
use heed::Result;
use roaring::RoaringBitmap;
use super::{get_first_facet_value, get_highest_level};
use crate::heed_codec::facet::{FacetGroupKey, FacetGroupKeyCodec, FacetGroupValueCodec};
use crate::heed_codec::ByteSliceRefCodec;
use crate::DocumentId;
/// Call the given closure on the facet distribution of the candidate documents.
///
/// The arguments to the closure are:
/// - the facet value, as a byte slice
/// - the number of documents among the candidates that contain this facet value
/// - the id of a document which contains the facet value. Note that this document
/// is not necessarily from the list of candidates; it is simply *any* document which
/// contains this facet value.
///
/// The return value of the closure is a `ControlFlow<()>` which indicates whether we should
/// keep iterating over the different facet values or stop.
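///
/// ### Example (hedged sketch of a typical call; the decoding step assumes a
/// number facet and is not part of this function)
///
/// ```ignore
/// let mut distribution = BTreeMap::new();
/// iterate_over_facet_distribution(rtxn, db, field_id, &candidates, |facet_key, count, _| {
///     // the facet value is still encoded; decode it with the codec matching the field type
///     let value = OrderedF64Codec::bytes_decode(facet_key).unwrap();
///     distribution.insert(value.to_string(), count);
///     Ok(ControlFlow::Continue(()))
/// })?;
/// ```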
pub fn iterate_over_facet_distribution<'t, CB>(
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
candidates: &RoaringBitmap,
callback: CB,
) -> Result<()>
where
CB: FnMut(&'t [u8], u64, DocumentId) -> Result<ControlFlow<()>>,
{
let mut fd = FacetDistribution { rtxn, db, field_id, callback };
let highest_level = get_highest_level(
rtxn,
db.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
field_id,
)?;
if let Some(first_bound) = get_first_facet_value::<ByteSliceRefCodec>(rtxn, db, field_id)? {
fd.iterate(candidates, highest_level, first_bound, usize::MAX)?;
return Ok(());
} else {
return Ok(());
}
}
struct FacetDistribution<'t, CB>
where
CB: FnMut(&'t [u8], u64, DocumentId) -> Result<ControlFlow<()>>,
{
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
callback: CB,
}
impl<'t, CB> FacetDistribution<'t, CB>
where
CB: FnMut(&'t [u8], u64, DocumentId) -> Result<ControlFlow<()>>,
{
fn iterate_level_0(
&mut self,
candidates: &RoaringBitmap,
starting_bound: &'t [u8],
group_size: usize,
) -> Result<ControlFlow<()>> {
let starting_key =
FacetGroupKey { field_id: self.field_id, level: 0, left_bound: starting_bound };
let iter = self.db.range(self.rtxn, &(starting_key..))?.take(group_size);
for el in iter {
let (key, value) = el?;
// The range is unbounded on the right and the group size for the highest level is MAX,
// so we need to check that we are not iterating over the next field id
if key.field_id != self.field_id {
return Ok(ControlFlow::Break(()));
}
let docids_in_common = value.bitmap & candidates;
if !docids_in_common.is_empty() {
let any_docid_in_common = docids_in_common.min().unwrap();
match (self.callback)(key.left_bound, docids_in_common.len(), any_docid_in_common)?
{
ControlFlow::Continue(_) => (),
ControlFlow::Break(_) => return Ok(ControlFlow::Break(())),
}
}
}
return Ok(ControlFlow::Continue(()));
}
fn iterate(
&mut self,
candidates: &RoaringBitmap,
level: u8,
starting_bound: &'t [u8],
group_size: usize,
) -> Result<ControlFlow<()>> {
if level == 0 {
return self.iterate_level_0(candidates, starting_bound, group_size);
}
let starting_key =
FacetGroupKey { field_id: self.field_id, level, left_bound: starting_bound };
let iter = self.db.range(self.rtxn, &(&starting_key..))?.take(group_size);
for el in iter {
let (key, value) = el?;
// The range is unbounded on the right and the group size for the highest level is MAX,
// so we need to check that we are not iterating over the next field id
if key.field_id != self.field_id {
return Ok(ControlFlow::Break(()));
}
let docids_in_common = value.bitmap & candidates;
if !docids_in_common.is_empty() {
let cf = self.iterate(
&docids_in_common,
level - 1,
key.left_bound,
value.size as usize,
)?;
match cf {
ControlFlow::Continue(_) => {}
ControlFlow::Break(_) => return Ok(ControlFlow::Break(())),
}
}
}
return Ok(ControlFlow::Continue(()));
}
}
#[cfg(test)]
mod tests {
use std::ops::ControlFlow;
use heed::BytesDecode;
use roaring::RoaringBitmap;
use super::iterate_over_facet_distribution;
use crate::heed_codec::facet::OrderedF64Codec;
use crate::milli_snap;
use crate::search::facet::tests::{get_random_looking_index, get_simple_index};
#[test]
fn filter_distribution_all() {
let indexes = [get_simple_index(), get_random_looking_index()];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let candidates = (0..=255).into_iter().collect::<RoaringBitmap>();
let mut results = String::new();
iterate_over_facet_distribution(
&txn,
index.content,
0,
&candidates,
|facet, count, _| {
let facet = OrderedF64Codec::bytes_decode(facet).unwrap();
results.push_str(&format!("{facet}: {count}\n"));
Ok(ControlFlow::Continue(()))
},
)
.unwrap();
milli_snap!(results, i);
txn.commit().unwrap();
}
}
#[test]
fn filter_distribution_all_stop_early() {
let indexes = [get_simple_index(), get_random_looking_index()];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let candidates = (0..=255).into_iter().collect::<RoaringBitmap>();
let mut results = String::new();
let mut nbr_facets = 0;
iterate_over_facet_distribution(
&txn,
index.content,
0,
&candidates,
|facet, count, _| {
let facet = OrderedF64Codec::bytes_decode(facet).unwrap();
if nbr_facets == 100 {
return Ok(ControlFlow::Break(()));
} else {
nbr_facets += 1;
results.push_str(&format!("{facet}: {count}\n"));
Ok(ControlFlow::Continue(()))
}
},
)
.unwrap();
milli_snap!(results, i);
txn.commit().unwrap();
}
}
}

View File

@ -1,248 +0,0 @@
use std::ops::Bound::{self, Excluded, Included, Unbounded};
use either::Either::{self, Left, Right};
use heed::types::{ByteSlice, DecodeIgnore};
use heed::{Database, LazyDecode, RoRange, RoRevRange};
use roaring::RoaringBitmap;
use crate::heed_codec::facet::FacetLevelValueF64Codec;
use crate::heed_codec::CboRoaringBitmapCodec;
use crate::{FieldId, Index};
pub struct FacetNumberRange<'t> {
iter: RoRange<'t, FacetLevelValueF64Codec, LazyDecode<CboRoaringBitmapCodec>>,
end: Bound<f64>,
}
impl<'t> FacetNumberRange<'t> {
pub fn new(
rtxn: &'t heed::RoTxn,
db: Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
field_id: FieldId,
level: u8,
left: Bound<f64>,
right: Bound<f64>,
) -> heed::Result<FacetNumberRange<'t>> {
let left_bound = match left {
Included(left) => Included((field_id, level, left, f64::MIN)),
Excluded(left) => Excluded((field_id, level, left, f64::MIN)),
Unbounded => Included((field_id, level, f64::MIN, f64::MIN)),
};
let right_bound = Included((field_id, level, f64::MAX, f64::MAX));
let iter = db.lazily_decode_data().range(rtxn, &(left_bound, right_bound))?;
Ok(FacetNumberRange { iter, end: right })
}
}
impl<'t> Iterator for FacetNumberRange<'t> {
type Item = heed::Result<((FieldId, u8, f64, f64), RoaringBitmap)>;
fn next(&mut self) -> Option<Self::Item> {
match self.iter.next() {
Some(Ok(((fid, level, left, right), docids))) => {
let must_be_returned = match self.end {
Included(end) => right <= end,
Excluded(end) => right < end,
Unbounded => true,
};
if must_be_returned {
match docids.decode() {
Ok(docids) => Some(Ok(((fid, level, left, right), docids))),
Err(e) => Some(Err(e)),
}
} else {
None
}
}
Some(Err(e)) => Some(Err(e)),
None => None,
}
}
}
pub struct FacetNumberRevRange<'t> {
iter: RoRevRange<'t, FacetLevelValueF64Codec, LazyDecode<CboRoaringBitmapCodec>>,
end: Bound<f64>,
}
impl<'t> FacetNumberRevRange<'t> {
pub fn new(
rtxn: &'t heed::RoTxn,
db: Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
field_id: FieldId,
level: u8,
left: Bound<f64>,
right: Bound<f64>,
) -> heed::Result<FacetNumberRevRange<'t>> {
let left_bound = match left {
Included(left) => Included((field_id, level, left, f64::MIN)),
Excluded(left) => Excluded((field_id, level, left, f64::MIN)),
Unbounded => Included((field_id, level, f64::MIN, f64::MIN)),
};
let right_bound = Included((field_id, level, f64::MAX, f64::MAX));
let iter = db.lazily_decode_data().rev_range(rtxn, &(left_bound, right_bound))?;
Ok(FacetNumberRevRange { iter, end: right })
}
}
impl<'t> Iterator for FacetNumberRevRange<'t> {
type Item = heed::Result<((FieldId, u8, f64, f64), RoaringBitmap)>;
fn next(&mut self) -> Option<Self::Item> {
loop {
match self.iter.next() {
Some(Ok(((fid, level, left, right), docids))) => {
let must_be_returned = match self.end {
Included(end) => right <= end,
Excluded(end) => right < end,
Unbounded => true,
};
if must_be_returned {
match docids.decode() {
Ok(docids) => return Some(Ok(((fid, level, left, right), docids))),
Err(e) => return Some(Err(e)),
}
}
continue;
}
Some(Err(e)) => return Some(Err(e)),
None => return None,
}
}
}
}
pub struct FacetNumberIter<'t> {
rtxn: &'t heed::RoTxn<'t>,
db: Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
field_id: FieldId,
level_iters: Vec<(RoaringBitmap, Either<FacetNumberRange<'t>, FacetNumberRevRange<'t>>)>,
must_reduce: bool,
}
impl<'t> FacetNumberIter<'t> {
/// Create a `FacetNumberIter` that will iterate on the different facet entries
/// (facet value + documents ids) and that will reduce the given documents ids
/// while iterating on the different facet levels.
pub fn new_reducing(
rtxn: &'t heed::RoTxn,
index: &'t Index,
field_id: FieldId,
documents_ids: RoaringBitmap,
) -> heed::Result<FacetNumberIter<'t>> {
let db = index.facet_id_f64_docids.remap_key_type::<FacetLevelValueF64Codec>();
let highest_level = Self::highest_level(rtxn, db, field_id)?.unwrap_or(0);
let highest_iter =
FacetNumberRange::new(rtxn, db, field_id, highest_level, Unbounded, Unbounded)?;
let level_iters = vec![(documents_ids, Left(highest_iter))];
Ok(FacetNumberIter { rtxn, db, field_id, level_iters, must_reduce: true })
}
/// Create a `FacetNumberIter` that will iterate on the different facet entries in reverse
/// (facet value + documents ids) and that will reduce the given documents ids
/// while iterating on the different facet levels.
pub fn new_reverse_reducing(
rtxn: &'t heed::RoTxn,
index: &'t Index,
field_id: FieldId,
documents_ids: RoaringBitmap,
) -> heed::Result<FacetNumberIter<'t>> {
let db = index.facet_id_f64_docids;
let highest_level = Self::highest_level(rtxn, db, field_id)?.unwrap_or(0);
let highest_iter =
FacetNumberRevRange::new(rtxn, db, field_id, highest_level, Unbounded, Unbounded)?;
let level_iters = vec![(documents_ids, Right(highest_iter))];
Ok(FacetNumberIter { rtxn, db, field_id, level_iters, must_reduce: true })
}
/// Create a `FacetNumberIter` that will iterate on the different facet entries
/// (facet value + documents ids) and that will not reduce the given documents ids
/// while iterating on the different facet levels, possibly returning multiple times
/// a document id associated with multiple facet values.
pub fn new_non_reducing(
rtxn: &'t heed::RoTxn,
index: &'t Index,
field_id: FieldId,
documents_ids: RoaringBitmap,
) -> heed::Result<FacetNumberIter<'t>> {
let db = index.facet_id_f64_docids.remap_key_type::<FacetLevelValueF64Codec>();
let highest_level = Self::highest_level(rtxn, db, field_id)?.unwrap_or(0);
let highest_iter =
FacetNumberRange::new(rtxn, db, field_id, highest_level, Unbounded, Unbounded)?;
let level_iters = vec![(documents_ids, Left(highest_iter))];
Ok(FacetNumberIter { rtxn, db, field_id, level_iters, must_reduce: false })
}
fn highest_level<X>(
rtxn: &'t heed::RoTxn,
db: Database<FacetLevelValueF64Codec, X>,
fid: FieldId,
) -> heed::Result<Option<u8>> {
let level = db
.remap_types::<ByteSlice, DecodeIgnore>()
.prefix_iter(rtxn, &fid.to_be_bytes())?
.remap_key_type::<FacetLevelValueF64Codec>()
.last()
.transpose()?
.map(|((_, level, _, _), _)| level);
Ok(level)
}
}
impl<'t> Iterator for FacetNumberIter<'t> {
type Item = heed::Result<(f64, RoaringBitmap)>;
fn next(&mut self) -> Option<Self::Item> {
'outer: loop {
let (documents_ids, last) = self.level_iters.last_mut()?;
let is_ascending = last.is_left();
for result in last {
// If the last iterator must find an empty set of documents it means
// that we found all the documents in the sub level iterations already,
// we can pop this level iterator.
if documents_ids.is_empty() {
break;
}
match result {
Ok(((_fid, level, left, right), mut docids)) => {
docids &= &*documents_ids;
if !docids.is_empty() {
if self.must_reduce {
*documents_ids -= &docids;
}
if level == 0 {
return Some(Ok((left, docids)));
}
let rtxn = self.rtxn;
let db = self.db;
let fid = self.field_id;
let left = Included(left);
let right = Included(right);
let result = if is_ascending {
FacetNumberRange::new(rtxn, db, fid, level - 1, left, right)
.map(Left)
} else {
FacetNumberRevRange::new(rtxn, db, fid, level - 1, left, right)
.map(Right)
};
match result {
Ok(iter) => {
self.level_iters.push((docids, iter));
continue 'outer;
}
Err(e) => return Some(Err(e)),
}
}
}
Err(e) => return Some(Err(e)),
}
}
self.level_iters.pop();
}
}
}

View File

@ -0,0 +1,487 @@
use std::ops::{Bound, RangeBounds};
use heed::BytesEncode;
use roaring::RoaringBitmap;
use super::{get_first_facet_value, get_highest_level, get_last_facet_value};
use crate::heed_codec::facet::{FacetGroupKey, FacetGroupKeyCodec, FacetGroupValueCodec};
use crate::heed_codec::ByteSliceRefCodec;
use crate::Result;
/// Find all the document ids for which the given field contains a value contained within
/// the two bounds.
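///
/// ### Example (illustrative sketch; mirrors the calls made in the tests below)
///
/// ```ignore
/// let mut docids = RoaringBitmap::new();
/// find_docids_of_facet_within_bounds::<OrderedF64Codec>(
///     &rtxn,
///     db.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
///     field_id,
///     &Bound::Included(1.5),
///     &Bound::Excluded(10.0),
///     &mut docids,
/// )?;
/// // `docids` now holds every document whose facet value lies in [1.5, 10.0)
/// ```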
pub fn find_docids_of_facet_within_bounds<'t, BoundCodec>(
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<BoundCodec>, FacetGroupValueCodec>,
field_id: u16,
left: &'t Bound<<BoundCodec as BytesEncode<'t>>::EItem>,
right: &'t Bound<<BoundCodec as BytesEncode<'t>>::EItem>,
docids: &mut RoaringBitmap,
) -> Result<()>
where
BoundCodec: for<'a> BytesEncode<'a>,
for<'a> <BoundCodec as BytesEncode<'a>>::EItem: Sized,
{
let inner;
let left = match left {
Bound::Included(left) => {
inner = BoundCodec::bytes_encode(left).ok_or(heed::Error::Encoding)?;
Bound::Included(inner.as_ref())
}
Bound::Excluded(left) => {
inner = BoundCodec::bytes_encode(left).ok_or(heed::Error::Encoding)?;
Bound::Excluded(inner.as_ref())
}
Bound::Unbounded => Bound::Unbounded,
};
let inner;
let right = match right {
Bound::Included(right) => {
inner = BoundCodec::bytes_encode(right).ok_or(heed::Error::Encoding)?;
Bound::Included(inner.as_ref())
}
Bound::Excluded(right) => {
inner = BoundCodec::bytes_encode(right).ok_or(heed::Error::Encoding)?;
Bound::Excluded(inner.as_ref())
}
Bound::Unbounded => Bound::Unbounded,
};
let db = db.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>();
let mut f = FacetRangeSearch { rtxn, db, field_id, left, right, docids };
let highest_level = get_highest_level(rtxn, db, field_id)?;
if let Some(starting_left_bound) =
get_first_facet_value::<ByteSliceRefCodec>(rtxn, db, field_id)?
{
let rightmost_bound = Bound::Included(
get_last_facet_value::<ByteSliceRefCodec>(rtxn, db, field_id)?.unwrap(),
); // will not fail because get_first_facet_value succeeded
let group_size = usize::MAX;
f.run(highest_level, starting_left_bound, rightmost_bound, group_size)?;
Ok(())
} else {
return Ok(());
}
}
/// Fetch the document ids that have a facet with a value between the two given bounds
struct FacetRangeSearch<'t, 'b, 'bitmap> {
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
left: Bound<&'b [u8]>,
right: Bound<&'b [u8]>,
docids: &'bitmap mut RoaringBitmap,
}
impl<'t, 'b, 'bitmap> FacetRangeSearch<'t, 'b, 'bitmap> {
fn run_level_0(&mut self, starting_left_bound: &'t [u8], group_size: usize) -> Result<()> {
let left_key =
FacetGroupKey { field_id: self.field_id, level: 0, left_bound: starting_left_bound };
let iter = self.db.range(&self.rtxn, &(left_key..))?.take(group_size);
for el in iter {
let (key, value) = el?;
// the right side of the iter range is unbounded, so we need to make sure that we are not iterating
// on the next field id
if key.field_id != self.field_id {
return Ok(());
}
let should_skip = {
match self.left {
Bound::Included(left) => left > key.left_bound,
Bound::Excluded(left) => left >= key.left_bound,
Bound::Unbounded => false,
}
};
if should_skip {
continue;
}
let should_stop = {
match self.right {
Bound::Included(right) => right < key.left_bound,
Bound::Excluded(right) => right <= key.left_bound,
Bound::Unbounded => false,
}
};
if should_stop {
break;
}
if RangeBounds::<&[u8]>::contains(&(self.left, self.right), &key.left_bound) {
*self.docids |= value.bitmap;
}
}
Ok(())
}
/// Recursive part of the algorithm for level > 0.
///
/// It works by visiting a slice of a level and checking whether the range associated
/// with each visited element is contained within the bounds.
///
/// 1. So long as the element's range is less than the left bound, we do nothing and keep iterating
/// 2. If the element's range is fully contained by the bounds, then all of its docids are added to
/// the roaring bitmap.
/// 3. If the element's range merely intersects the bounds, then we call the algorithm recursively
/// on the children of the element from the level below.
/// 4. If the element's range is greater than the right bound, we do nothing and stop iterating.
/// Note that the right bound is found through either the `left_bound` of the *next* element,
/// or from the `rightmost_bound` argument
///
/// ## Arguments
/// - `level`: the level being visited
/// - `starting_left_bound`: the left_bound of the first element to visit
/// - `rightmost_bound`: the right bound of the last element that should be visited
/// - `group_size`: the number of elements that should be visited
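///
/// ### Illustration (hedged sketch, not part of the original documentation)
///
/// Assuming the bounds `2..=10` and a level-1 slice whose groups cover `[0, 4)`,
/// `[4, 8)`, and `[8, 12)`:
/// - `[0, 4)` merely intersects the bounds, so we recurse into its children (case 3);
/// - `[4, 8)` is fully contained within the bounds, so its whole bitmap is added (case 2);
/// - `[8, 12)` intersects the right bound, so we recurse into its children as well (case 3).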
fn run(
&mut self,
level: u8,
starting_left_bound: &'t [u8],
rightmost_bound: Bound<&'t [u8]>,
group_size: usize,
) -> Result<()> {
if level == 0 {
return self.run_level_0(starting_left_bound, group_size);
}
let left_key =
FacetGroupKey { field_id: self.field_id, level, left_bound: starting_left_bound };
let mut iter = self.db.range(&self.rtxn, &(left_key..))?.take(group_size);
// We iterate over the range while keeping in memory the previous value
let (mut previous_key, mut previous_value) = iter.next().unwrap()?;
for el in iter {
let (next_key, next_value) = el?;
// the right of the iter range is potentially unbounded (e.g. if `group_size` is usize::MAX),
// so we need to make sure that we are not iterating on the next field id
if next_key.field_id != self.field_id {
break;
}
// now, do we skip, stop, or visit?
let should_skip = {
match self.left {
Bound::Included(left) => left >= next_key.left_bound,
Bound::Excluded(left) => left >= next_key.left_bound,
Bound::Unbounded => false,
}
};
if should_skip {
previous_key = next_key;
previous_value = next_value;
continue;
}
// should we stop?
let should_stop = {
match self.right {
Bound::Included(right) => right < previous_key.left_bound,
Bound::Excluded(right) => right <= previous_key.left_bound,
Bound::Unbounded => false,
}
};
if should_stop {
return Ok(());
}
// should we take the whole thing, without recursing down?
let should_take_whole_group = {
let left_condition = match self.left {
Bound::Included(left) => previous_key.left_bound >= left,
Bound::Excluded(left) => previous_key.left_bound > left,
Bound::Unbounded => true,
};
let right_condition = match self.right {
Bound::Included(right) => next_key.left_bound <= right,
Bound::Excluded(right) => next_key.left_bound <= right,
Bound::Unbounded => true,
};
left_condition && right_condition
};
if should_take_whole_group {
*self.docids |= &previous_value.bitmap;
previous_key = next_key;
previous_value = next_value;
continue;
}
// from here, we should visit the children of the previous element and
// call the function recursively
let level = level - 1;
let starting_left_bound = previous_key.left_bound;
let rightmost_bound = Bound::Excluded(next_key.left_bound);
let group_size = previous_value.size as usize;
self.run(level, starting_left_bound, rightmost_bound, group_size)?;
previous_key = next_key;
previous_value = next_value;
}
// previous_key/previous_value are the last element's key/value
// now, do we skip, stop, or visit?
let should_skip = {
match (self.left, rightmost_bound) {
(Bound::Included(left), Bound::Included(right)) => left > right,
(Bound::Included(left), Bound::Excluded(right)) => left >= right,
(Bound::Excluded(left), Bound::Included(right) | Bound::Excluded(right)) => {
left >= right
}
(Bound::Unbounded, _) => false,
(_, Bound::Unbounded) => false, // should never run?
}
};
if should_skip {
return Ok(());
}
// should we stop?
let should_stop = {
match self.right {
Bound::Included(right) => right <= previous_key.left_bound,
Bound::Excluded(right) => right < previous_key.left_bound,
Bound::Unbounded => false,
}
};
if should_stop {
return Ok(());
}
// should we take the whole thing, without recursing down?
let should_take_whole_group = {
let left_condition = match self.left {
Bound::Included(left) => previous_key.left_bound >= left,
Bound::Excluded(left) => previous_key.left_bound > left,
Bound::Unbounded => true,
};
let right_condition = match (self.right, rightmost_bound) {
(Bound::Included(right), Bound::Included(rightmost)) => {
// we need to stay within the bound ..=right
// the element's range goes to ..=rightmost
// so the element fits entirely within the bound if rightmost <= right
rightmost <= right
}
(Bound::Included(right), Bound::Excluded(rightmost)) => {
// we need to stay within the bound ..=right
// the element's range goes to ..rightmost
// so the element fits entirely within the bound if rightmost <= right
rightmost <= right
}
(Bound::Excluded(right), Bound::Included(rightmost)) => {
// we need to stay within the bound ..right
// the element's range goes to ..=rightmost
// so the element fits entirely within the bound if rightmost < right
rightmost < right
}
(Bound::Excluded(right), Bound::Excluded(rightmost)) => {
// we need to stay within the bound ..right
// the element's range goes to ..rightmost
// so the element fits entirely within the bound if rightmost <= right
rightmost <= right
}
(Bound::Unbounded, _) => {
// we need to stay within the bound ..inf
// so the element always fits entirely within the bound
true
}
(_, Bound::Unbounded) => {
// we need to stay within a finite bound
// but the element's range goes to ..inf
// so the element never fits entirely within the bound
false
}
};
left_condition && right_condition
};
if should_take_whole_group {
*self.docids |= &previous_value.bitmap;
} else {
let level = level - 1;
let starting_left_bound = previous_key.left_bound;
let group_size = previous_value.size as usize;
self.run(level, starting_left_bound, rightmost_bound, group_size)?;
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use std::ops::Bound;
use roaring::RoaringBitmap;
use super::find_docids_of_facet_within_bounds;
use crate::heed_codec::facet::{FacetGroupKeyCodec, OrderedF64Codec};
use crate::milli_snap;
use crate::search::facet::tests::{
get_random_looking_index, get_random_looking_index_with_multiple_field_ids,
get_simple_index, get_simple_index_with_multiple_field_ids,
};
use crate::snapshot_tests::display_bitmap;
#[test]
fn random_looking_index_snap() {
let index = get_random_looking_index();
milli_snap!(format!("{index}"));
}
#[test]
fn filter_range_increasing() {
let indexes = [
get_simple_index(),
get_random_looking_index(),
get_simple_index_with_multiple_field_ids(),
get_random_looking_index_with_multiple_field_ids(),
];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let mut results = String::new();
for i in 0..=255 {
let i = i as f64;
let start = Bound::Included(0.);
let end = Bound::Included(i);
let mut docids = RoaringBitmap::new();
find_docids_of_facet_within_bounds::<OrderedF64Codec>(
&txn,
index.content.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
0,
&start,
&end,
&mut docids,
)
.unwrap();
results.push_str(&format!("{}\n", display_bitmap(&docids)));
}
milli_snap!(results, format!("included_{i}"));
let mut results = String::new();
for i in 0..=255 {
let i = i as f64;
let start = Bound::Excluded(0.);
let end = Bound::Excluded(i);
let mut docids = RoaringBitmap::new();
find_docids_of_facet_within_bounds::<OrderedF64Codec>(
&txn,
index.content.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
0,
&start,
&end,
&mut docids,
)
.unwrap();
results.push_str(&format!("{}\n", display_bitmap(&docids)));
}
milli_snap!(results, format!("excluded_{i}"));
txn.commit().unwrap();
}
}
#[test]
fn filter_range_decreasing() {
let indexes = [
get_simple_index(),
get_random_looking_index(),
get_simple_index_with_multiple_field_ids(),
get_random_looking_index_with_multiple_field_ids(),
];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let mut results = String::new();
for i in (0..=255).into_iter().rev() {
let i = i as f64;
let start = Bound::Included(i);
let end = Bound::Included(255.);
let mut docids = RoaringBitmap::new();
find_docids_of_facet_within_bounds::<OrderedF64Codec>(
&txn,
index.content.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
0,
&start,
&end,
&mut docids,
)
.unwrap();
results.push_str(&format!("{}\n", display_bitmap(&docids)));
}
milli_snap!(results, format!("included_{i}"));
let mut results = String::new();
for i in (0..=255).into_iter().rev() {
let i = i as f64;
let start = Bound::Excluded(i);
let end = Bound::Excluded(255.);
let mut docids = RoaringBitmap::new();
find_docids_of_facet_within_bounds::<OrderedF64Codec>(
&txn,
index.content.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
0,
&start,
&end,
&mut docids,
)
.unwrap();
results.push_str(&format!("{}\n", display_bitmap(&docids)));
}
milli_snap!(results, format!("excluded_{i}"));
txn.commit().unwrap();
}
}
#[test]
fn filter_range_pinch() {
let indexes = [
get_simple_index(),
get_random_looking_index(),
get_simple_index_with_multiple_field_ids(),
get_random_looking_index_with_multiple_field_ids(),
];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let mut results = String::new();
for i in (0..=128).into_iter().rev() {
let i = i as f64;
let start = Bound::Included(i);
let end = Bound::Included(255. - i);
let mut docids = RoaringBitmap::new();
find_docids_of_facet_within_bounds::<OrderedF64Codec>(
&txn,
index.content.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
0,
&start,
&end,
&mut docids,
)
.unwrap();
results.push_str(&format!("{}\n", display_bitmap(&docids)));
}
milli_snap!(results, format!("included_{i}"));
let mut results = String::new();
for i in (0..=128).into_iter().rev() {
let i = i as f64;
let start = Bound::Excluded(i);
let end = Bound::Excluded(255. - i);
let mut docids = RoaringBitmap::new();
find_docids_of_facet_within_bounds::<OrderedF64Codec>(
&txn,
index.content.remap_key_type::<FacetGroupKeyCodec<OrderedF64Codec>>(),
0,
&start,
&end,
&mut docids,
)
.unwrap();
results.push_str(&format!("{}\n", display_bitmap(&docids)));
}
milli_snap!(results, format!("excluded_{i}"));
txn.commit().unwrap();
}
}
}

View File

@ -0,0 +1,136 @@
use heed::Result;
use roaring::RoaringBitmap;
use super::{get_first_facet_value, get_highest_level};
use crate::heed_codec::facet::{
FacetGroupKey, FacetGroupKeyCodec, FacetGroupValue, FacetGroupValueCodec,
};
use crate::heed_codec::ByteSliceRefCodec;
/// Return an iterator which iterates over the given candidate documents in
/// ascending order of their facet value for the given field id.
///
/// The documents returned by the iterator are grouped by the facet values that
/// determined their rank. For example, given the documents:
///
/// ```ignore
/// 0: { "colour": ["blue", "green"] }
/// 1: { "colour": ["blue", "red"] }
/// 2: { "colour": ["orange", "red"] }
/// 3: { "colour": ["green", "red"] }
/// 4: { "colour": ["blue", "orange", "red"] }
/// ```
/// Then calling the function on the candidates `[0, 2, 3, 4]` will return an iterator
/// over the following elements:
/// ```ignore
/// [0, 4] // corresponds to all the documents within the candidates that have the facet value "blue"
/// [3] // same for "green"
/// [2] // same for "orange"
/// END
/// ```
/// Note that once a document id is returned by the iterator, it is never returned again.
pub fn ascending_facet_sort<'t>(
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
candidates: RoaringBitmap,
) -> Result<Box<dyn Iterator<Item = Result<RoaringBitmap>> + 't>> {
let highest_level = get_highest_level(rtxn, db, field_id)?;
if let Some(first_bound) = get_first_facet_value::<ByteSliceRefCodec>(rtxn, db, field_id)? {
let first_key = FacetGroupKey { field_id, level: highest_level, left_bound: first_bound };
let iter = db.range(rtxn, &(first_key..))?.take(usize::MAX);
Ok(Box::new(AscendingFacetSort { rtxn, db, field_id, stack: vec![(candidates, iter)] }))
} else {
Ok(Box::new(std::iter::empty()))
}
}
struct AscendingFacetSort<'t, 'e> {
rtxn: &'t heed::RoTxn<'e>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
stack: Vec<(
RoaringBitmap,
std::iter::Take<
heed::RoRange<'t, FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
>,
)>,
}
impl<'t, 'e> Iterator for AscendingFacetSort<'t, 'e> {
type Item = Result<RoaringBitmap>;
fn next(&mut self) -> Option<Self::Item> {
'outer: loop {
let (documents_ids, deepest_iter) = self.stack.last_mut()?;
for result in deepest_iter {
let (
FacetGroupKey { level, left_bound, field_id },
FacetGroupValue { size: group_size, mut bitmap },
) = match result {
Ok(entry) => entry,
Err(e) => return Some(Err(e)),
};
// The range is unbounded on the right and the group size for the highest level is MAX,
// so we need to check that we are not iterating over the next field id
if field_id != self.field_id {
return None;
}
// If the last iterator found an empty set of documents it means
// that we found all the documents in the sub level iterations already,
// we can pop this level iterator.
if documents_ids.is_empty() {
break;
}
bitmap &= &*documents_ids;
if !bitmap.is_empty() {
*documents_ids -= &bitmap;
if level == 0 {
return Some(Ok(bitmap));
}
let starting_key_below =
FacetGroupKey { field_id: self.field_id, level: level - 1, left_bound };
let iter = match self.db.range(&self.rtxn, &(starting_key_below..)) {
Ok(iter) => iter,
Err(e) => return Some(Err(e.into())),
}
.take(group_size as usize);
self.stack.push((bitmap, iter));
continue 'outer;
}
}
self.stack.pop();
}
}
}
#[cfg(test)]
mod tests {
use roaring::RoaringBitmap;
use crate::milli_snap;
use crate::search::facet::facet_sort_ascending::ascending_facet_sort;
use crate::search::facet::tests::{get_random_looking_index, get_simple_index};
use crate::snapshot_tests::display_bitmap;
#[test]
fn filter_sort() {
let indexes = [get_simple_index(), get_random_looking_index()];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let candidates = (200..=300).into_iter().collect::<RoaringBitmap>();
let mut results = String::new();
let iter = ascending_facet_sort(&txn, index.content, 0, candidates).unwrap();
for el in iter {
let docids = el.unwrap();
results.push_str(&display_bitmap(&docids));
results.push('\n');
}
milli_snap!(results, i);
txn.commit().unwrap();
}
}
}

View File

@ -0,0 +1,151 @@
use std::ops::Bound;
use heed::Result;
use roaring::RoaringBitmap;
use super::{get_first_facet_value, get_highest_level, get_last_facet_value};
use crate::heed_codec::facet::{
FacetGroupKey, FacetGroupKeyCodec, FacetGroupValue, FacetGroupValueCodec,
};
use crate::heed_codec::ByteSliceRefCodec;
/// See the documentation for [`ascending_facet_sort`](super::ascending_facet_sort).
///
/// This function does the same thing, but in the opposite order.
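///
/// ### Example (hedged sketch, reusing the documents from the ascending example)
///
/// With the same five documents and the candidates `[0, 2, 3, 4]`, the iterator
/// would yield:
/// ```ignore
/// [2, 3, 4] // all the candidates that have the facet value "red"
/// [0]       // same for "green"; "orange" and "blue" contribute no new documents
/// END
/// ```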
pub fn descending_facet_sort<'t>(
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
candidates: RoaringBitmap,
) -> Result<Box<dyn Iterator<Item = Result<RoaringBitmap>> + 't>> {
let highest_level = get_highest_level(rtxn, db, field_id)?;
if let Some(first_bound) = get_first_facet_value::<ByteSliceRefCodec>(rtxn, db, field_id)? {
let first_key = FacetGroupKey { field_id, level: highest_level, left_bound: first_bound };
let last_bound = get_last_facet_value::<ByteSliceRefCodec>(rtxn, db, field_id)?.unwrap();
let last_key = FacetGroupKey { field_id, level: highest_level, left_bound: last_bound };
let iter = db.rev_range(rtxn, &(first_key..=last_key))?.take(usize::MAX);
Ok(Box::new(DescendingFacetSort {
rtxn,
db,
field_id,
stack: vec![(candidates, iter, Bound::Included(last_bound))],
}))
} else {
Ok(Box::new(std::iter::empty()))
}
}
struct DescendingFacetSort<'t> {
rtxn: &'t heed::RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
stack: Vec<(
RoaringBitmap,
std::iter::Take<
heed::RoRevRange<'t, FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
>,
Bound<&'t [u8]>,
)>,
}
impl<'t> Iterator for DescendingFacetSort<'t> {
type Item = Result<RoaringBitmap>;
fn next(&mut self) -> Option<Self::Item> {
'outer: loop {
let (documents_ids, deepest_iter, right_bound) = self.stack.last_mut()?;
while let Some(result) = deepest_iter.next() {
let (
FacetGroupKey { level, left_bound, field_id },
FacetGroupValue { size: group_size, mut bitmap },
) = match result {
Ok(entry) => entry,
Err(e) => return Some(Err(e)),
};
// The range is unbounded on the right and the group size for the highest level is MAX,
// so we need to check that we are not iterating over the next field id
if field_id != self.field_id {
return None;
}
// If the last iterator found an empty set of documents it means
// that we found all the documents in the sub level iterations already,
// we can pop this level iterator.
if documents_ids.is_empty() {
break;
}
bitmap &= &*documents_ids;
if !bitmap.is_empty() {
*documents_ids -= &bitmap;
if level == 0 {
return Some(Ok(bitmap));
}
let starting_key_below =
FacetGroupKey { field_id, level: level - 1, left_bound };
let end_key_below = match *right_bound {
Bound::Included(right) => Bound::Included(FacetGroupKey {
field_id,
level: level - 1,
left_bound: right,
}),
Bound::Excluded(right) => Bound::Excluded(FacetGroupKey {
field_id,
level: level - 1,
left_bound: right,
}),
Bound::Unbounded => Bound::Unbounded,
};
let prev_right_bound = *right_bound;
*right_bound = Bound::Excluded(left_bound);
let iter = match self
.db
.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>()
.rev_range(
&self.rtxn,
&(Bound::Included(starting_key_below), end_key_below),
) {
Ok(iter) => iter,
Err(e) => return Some(Err(e.into())),
}
.take(group_size as usize);
self.stack.push((bitmap, iter, prev_right_bound));
continue 'outer;
}
*right_bound = Bound::Excluded(left_bound);
}
self.stack.pop();
}
}
}
#[cfg(test)]
mod tests {
use roaring::RoaringBitmap;
use crate::heed_codec::facet::FacetGroupKeyCodec;
use crate::heed_codec::ByteSliceRefCodec;
use crate::milli_snap;
use crate::search::facet::facet_sort_descending::descending_facet_sort;
use crate::search::facet::tests::{get_random_looking_index, get_simple_index};
use crate::snapshot_tests::display_bitmap;
#[test]
fn filter_sort_descending() {
let indexes = [get_simple_index(), get_random_looking_index()];
for (i, index) in indexes.iter().enumerate() {
let txn = index.env.read_txn().unwrap();
let candidates = (200..=300).into_iter().collect::<RoaringBitmap>();
let mut results = String::new();
let db = index.content.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>();
let iter = descending_facet_sort(&txn, db, 0, candidates).unwrap();
for el in iter {
let docids = el.unwrap();
results.push_str(&display_bitmap(&docids));
results.push('\n');
}
milli_snap!(results, i);
txn.commit().unwrap();
}
}
}

View File

@ -1,652 +0,0 @@
//! This module contains helper iterators for facet strings.
//!
//! The purpose is to help iterate over the quite complex system of facet strings. A simple
//! description of the system would be that every facet string value is stored in an LMDB database
//! and that every value is associated with the document ids which are associated with this facet
//! string value.
//!
//! In reality it is a little bit more complex, as we have to create aggregations of runs of facet
//! string values; these aggregations help in choosing the right groups of facets to follow.
//!
//! ## A typical algorithm run
//!
//! If a group of aggregated facets values contains one of the documents ids, we must continue
//! iterating over the sub-groups.
//!
//! If this group is the lowest level and contains at least one document id, we yield the associated
//! facet documents ids.
//!
//! If the group doesn't contain one of our documents ids, we continue to the next group at this
//! same level.
//!
//! ## The complexity comes from the strings
//!
//! This algorithm is exactly the one that we use for facet numbers. It is quite easy to create
//! aggregated facet numbers: groups of facets are easy to define in the LMDB key; we just put the
//! two number bounds, the left and the right bound of the group, both inclusive.
//!
//! It is easy to make sure that the groups are ordered: LMDB sorts its keys lexicographically, and
//! putting two big-endian encoded numbers one after the other gives us ordered groups. The values
//! are simple unions of the documents ids coming from the groups below.
//!
//! ### Example of what a facet number LMDB database contain
//!
//! | level | left-bound | right-bound | documents ids |
//! |-------|------------|-------------|------------------|
//! | 0 | 0 | _skipped_ | 1, 2 |
//! | 0 | 1 | _skipped_ | 6, 7 |
//! | 0 | 3 | _skipped_ | 4, 7 |
//! | 0 | 5 | _skipped_ | 2, 3, 4 |
//! | 1 | 0 | 1 | 1, 2, 6, 7 |
//! | 1 | 3 | 5 | 2, 3, 4, 7 |
//! | 2 | 0 | 5 | 1, 2, 3, 4, 6, 7 |
//!
//! As you can see, the level 0 has two equal bounds, therefore we skip serializing the second
//! bound; that's the base level where you can directly fetch the documents ids associated with an
//! exact number.
//!
//! The next levels have two different bounds and the associated documents ids are simply the result
//! of a union of all the documents ids associated with the aggregated groups above.
//!
//! ## The complexity of defining groups for facet strings
//!
//! As explained above, defining groups of facet numbers is easy: LMDB stores the keys in
//! lexicographical order, which means that whatever the key represents, the bytes are read in their
//! raw form and a simple `strcmp` defines the order in which keys will be read from the store.
//!
//! That's easy for types with a known size, like floats or integers: they are 64 bits long, and
//! appending one after the other in big-endian is consistent. LMDB will simply sort the keys by the
//! first number, then by the second if the first numbers are equal on two keys.
//!
//! For strings it is a lot more complex, as those types are unsized, meaning that the size of facet
//! strings is different for each facet value.
//!
//! ### Basic approach: padding the keys
//!
//! A first approach would be to simply define the maximum size of a facet string and pad the keys
//! with zeroes. The big problem of this approach is that it:
//! 1. reduces the maximum size of facet strings by half, as we need to put two keys one after the
//! other.
//! 2. makes the keys of facet strings very big (approximately 250 bytes), heavily impacting LMDB
//! performance.
//!
//! ### Better approach: number the facet groups
//!
//! A better approach would be to number the groups; this way we don't have the downsides of the
//! previously described approach, but we need to be able to describe the groups by using a number.
//!
//! #### Example of facet strings with numbered groups
//!
//! | level | left-bound | right-bound | left-string | right-string | documents ids |
//! |-------|------------|-------------|-------------|--------------|------------------|
//! | 0 | alpha | _skipped_ | _skipped_ | _skipped_ | 1, 2 |
//! | 0 | beta | _skipped_ | _skipped_ | _skipped_ | 6, 7 |
//! | 0 | gamma | _skipped_ | _skipped_ | _skipped_ | 4, 7 |
//! | 0 | omega | _skipped_ | _skipped_ | _skipped_ | 2, 3, 4 |
//! | 1 | 0 | 1 | alpha | beta | 1, 2, 6, 7 |
//! | 1 | 2 | 3 | gamma | omega | 2, 3, 4, 7 |
//! | 2 | 0 | 3 | _skipped_ | _skipped_ | 1, 2, 3, 4, 6, 7 |
//!
//! As you can see, the level 0 doesn't actually change much: we skip nearly everything, as we do not
//! need to store the facet string value two times.
//!
//! The numbers in the left-bound and right-bound columns are incremental numbers representing the
//! level 0 strings, i.e. alpha is 0, beta is 1. Those numbers are just here to keep the ordering
//! of the LMDB keys.
//!
//! In the value, not in the key, you can see that we added two new values: the left-string and the
//! right-string, which define the original facet strings associated with the given group.
//!
//! We put those two strings inside of the value; this way we do not limit the maximum size of the
//! facet string values, and the impact on performance is not important as, IIRC, LMDB puts big
//! values on another page, which helps in iterating over keys fast enough and only fetching the page
//! with the values when required.
//!
//! The other little advantage of this solution is that there is no big overhead: compared with
//! the facet number levels, we only duplicate the facet strings once, for level 1.
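//!
//! As an illustrative sketch (the key/value types are simplified here), the level 1 entry
//! covering the groups 0 and 1 from the table above would conceptually be built like this:
//!
//! ```ignore
//! // key: (field_id, level, left group number, right group number)
//! let key = (field_id, 1u8, 0u32, 1u32);
//! // value: the original left and right strings, plus the union of the docids below
//! let value = (Some(("alpha", "beta")), RoaringBitmap::from_iter([1u32, 2, 6, 7]));
//! ```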
//!
//! #### A typical algorithm run
//!
//! Note that the algorithm always moves from the highest level to the lowest one, one level at a
//! time; this is why it is ok to only store the facet strings on level 1.
//!
//! If a group of aggregated facet values (a group with numbers) contains one of the documents ids,
//! we must continue iterating over the sub-groups. To do so:
//! - If we are at a level >= 2, we just do the same as with the facet numbers, get both bounds
//! and iterate over the facet groups defined by these numbers over the current level - 1.
//! - If we are at level 1, we retrieve both keys, the left-string and right-string, from the
//! value and just do the same as with the facet numbers but with strings: iterate over the
//! current level - 1 with both keys.
//!
//! If this group is the lowest level (level 0) and contains at least one document id, we yield the
//! associated facet documents ids.
//!
//! If the group doesn't contain one of our documents ids, we continue to the next group at this
//! same level.
//!
use std::num::NonZeroU8;
use std::ops::Bound;
use std::ops::Bound::{Excluded, Included, Unbounded};
use either::{Either, Left, Right};
use heed::types::{ByteSlice, DecodeIgnore};
use heed::{Database, LazyDecode, RoRange, RoRevRange};
use roaring::RoaringBitmap;
use crate::heed_codec::facet::{
FacetLevelValueU32Codec, FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec,
FacetStringZeroBoundsValueCodec,
};
use crate::heed_codec::CboRoaringBitmapCodec;
use crate::{FieldId, Index};
/// An iterator that is used to explore the facet string levels
/// from level 1 to infinity.
///
/// It yields the level, the range of group ids that an entry covers, the
/// optional level 0 group strings that it covers (only if it is an entry from
/// level 1), and the associated roaring bitmap.
pub struct FacetStringGroupRange<'t> {
iter: RoRange<
't,
FacetLevelValueU32Codec,
LazyDecode<FacetStringZeroBoundsValueCodec<CboRoaringBitmapCodec>>,
>,
end: Bound<u32>,
}
impl<'t> FacetStringGroupRange<'t> {
pub fn new<X, Y>(
rtxn: &'t heed::RoTxn,
db: Database<X, Y>,
field_id: FieldId,
level: NonZeroU8,
left: Bound<u32>,
right: Bound<u32>,
) -> heed::Result<FacetStringGroupRange<'t>> {
let db = db.remap_types::<
FacetLevelValueU32Codec,
FacetStringZeroBoundsValueCodec<CboRoaringBitmapCodec>,
>();
let left_bound = match left {
Included(left) => Included((field_id, level, left, u32::MIN)),
Excluded(left) => Excluded((field_id, level, left, u32::MIN)),
Unbounded => Included((field_id, level, u32::MIN, u32::MIN)),
};
let right_bound = Included((field_id, level, u32::MAX, u32::MAX));
let iter = db.lazily_decode_data().range(rtxn, &(left_bound, right_bound))?;
Ok(FacetStringGroupRange { iter, end: right })
}
}
impl<'t> Iterator for FacetStringGroupRange<'t> {
type Item = heed::Result<((NonZeroU8, u32, u32), (Option<(&'t str, &'t str)>, RoaringBitmap))>;
fn next(&mut self) -> Option<Self::Item> {
match self.iter.next() {
Some(Ok(((_fid, level, left, right), docids))) => {
let must_be_returned = match self.end {
Included(end) => right <= end,
Excluded(end) => right < end,
Unbounded => true,
};
if must_be_returned {
match docids.decode() {
Ok((bounds, docids)) => Some(Ok(((level, left, right), (bounds, docids)))),
Err(e) => Some(Err(e)),
}
} else {
None
}
}
Some(Err(e)) => Some(Err(e)),
None => None,
}
}
}
pub struct FacetStringGroupRevRange<'t> {
iter: RoRevRange<
't,
FacetLevelValueU32Codec,
LazyDecode<FacetStringZeroBoundsValueCodec<CboRoaringBitmapCodec>>,
>,
end: Bound<u32>,
}
impl<'t> FacetStringGroupRevRange<'t> {
pub fn new<X, Y>(
rtxn: &'t heed::RoTxn,
db: Database<X, Y>,
field_id: FieldId,
level: NonZeroU8,
left: Bound<u32>,
right: Bound<u32>,
) -> heed::Result<FacetStringGroupRevRange<'t>> {
let db = db.remap_types::<
FacetLevelValueU32Codec,
FacetStringZeroBoundsValueCodec<CboRoaringBitmapCodec>,
>();
let left_bound = match left {
Included(left) => Included((field_id, level, left, u32::MIN)),
Excluded(left) => Excluded((field_id, level, left, u32::MIN)),
Unbounded => Included((field_id, level, u32::MIN, u32::MIN)),
};
let right_bound = Included((field_id, level, u32::MAX, u32::MAX));
let iter = db.lazily_decode_data().rev_range(rtxn, &(left_bound, right_bound))?;
Ok(FacetStringGroupRevRange { iter, end: right })
}
}
impl<'t> Iterator for FacetStringGroupRevRange<'t> {
type Item = heed::Result<((NonZeroU8, u32, u32), (Option<(&'t str, &'t str)>, RoaringBitmap))>;
fn next(&mut self) -> Option<Self::Item> {
loop {
match self.iter.next() {
Some(Ok(((_fid, level, left, right), docids))) => {
let must_be_returned = match self.end {
Included(end) => right <= end,
Excluded(end) => right < end,
Unbounded => true,
};
if must_be_returned {
match docids.decode() {
Ok((bounds, docids)) => {
return Some(Ok(((level, left, right), (bounds, docids))))
}
Err(e) => return Some(Err(e)),
}
}
continue;
}
Some(Err(e)) => return Some(Err(e)),
None => return None,
}
}
}
}
/// An iterator that is used to explore the level 0 of the facets string database.
///
/// It yields the facet string and the roaring bitmap associated with it.
pub struct FacetStringLevelZeroRange<'t> {
iter: RoRange<'t, FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>,
}
impl<'t> FacetStringLevelZeroRange<'t> {
pub fn new<X, Y>(
rtxn: &'t heed::RoTxn,
db: Database<X, Y>,
field_id: FieldId,
left: Bound<&str>,
right: Bound<&str>,
) -> heed::Result<FacetStringLevelZeroRange<'t>> {
fn encode_value<'a>(buffer: &'a mut Vec<u8>, field_id: FieldId, value: &str) -> &'a [u8] {
buffer.extend_from_slice(&field_id.to_be_bytes());
buffer.push(0);
buffer.extend_from_slice(value.as_bytes());
&buffer[..]
}
let mut left_buffer = Vec::new();
let left_bound = match left {
Included(value) => Included(encode_value(&mut left_buffer, field_id, value)),
Excluded(value) => Excluded(encode_value(&mut left_buffer, field_id, value)),
Unbounded => {
left_buffer.extend_from_slice(&field_id.to_be_bytes());
left_buffer.push(0);
Included(&left_buffer[..])
}
};
let mut right_buffer = Vec::new();
let right_bound = match right {
Included(value) => Included(encode_value(&mut right_buffer, field_id, value)),
Excluded(value) => Excluded(encode_value(&mut right_buffer, field_id, value)),
Unbounded => {
right_buffer.extend_from_slice(&field_id.to_be_bytes());
right_buffer.push(1); // we must only get the level 0
Excluded(&right_buffer[..])
}
};
let iter = db
.remap_key_type::<ByteSlice>()
.range(rtxn, &(left_bound, right_bound))?
.remap_types::<FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>();
Ok(FacetStringLevelZeroRange { iter })
}
}
impl<'t> Iterator for FacetStringLevelZeroRange<'t> {
type Item = heed::Result<(&'t str, &'t str, RoaringBitmap)>;
fn next(&mut self) -> Option<Self::Item> {
match self.iter.next() {
Some(Ok(((_fid, normalized), (original, docids)))) => {
Some(Ok((normalized, original, docids)))
}
Some(Err(e)) => Some(Err(e)),
None => None,
}
}
}
pub struct FacetStringLevelZeroRevRange<'t> {
iter: RoRevRange<'t, FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>,
}
impl<'t> FacetStringLevelZeroRevRange<'t> {
pub fn new<X, Y>(
rtxn: &'t heed::RoTxn,
db: Database<X, Y>,
field_id: FieldId,
left: Bound<&str>,
right: Bound<&str>,
) -> heed::Result<FacetStringLevelZeroRevRange<'t>> {
fn encode_value<'a>(buffer: &'a mut Vec<u8>, field_id: FieldId, value: &str) -> &'a [u8] {
buffer.extend_from_slice(&field_id.to_be_bytes());
buffer.push(0);
buffer.extend_from_slice(value.as_bytes());
&buffer[..]
}
let mut left_buffer = Vec::new();
let left_bound = match left {
Included(value) => Included(encode_value(&mut left_buffer, field_id, value)),
Excluded(value) => Excluded(encode_value(&mut left_buffer, field_id, value)),
Unbounded => {
left_buffer.extend_from_slice(&field_id.to_be_bytes());
left_buffer.push(0);
Included(&left_buffer[..])
}
};
let mut right_buffer = Vec::new();
let right_bound = match right {
Included(value) => Included(encode_value(&mut right_buffer, field_id, value)),
Excluded(value) => Excluded(encode_value(&mut right_buffer, field_id, value)),
Unbounded => {
right_buffer.extend_from_slice(&field_id.to_be_bytes());
right_buffer.push(1); // we must only get the level 0
Excluded(&right_buffer[..])
}
};
let iter = db
.remap_key_type::<ByteSlice>()
.rev_range(rtxn, &(left_bound, right_bound))?
.remap_types::<FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec>();
Ok(FacetStringLevelZeroRevRange { iter })
}
}
impl<'t> Iterator for FacetStringLevelZeroRevRange<'t> {
type Item = heed::Result<(&'t str, &'t str, RoaringBitmap)>;
fn next(&mut self) -> Option<Self::Item> {
match self.iter.next() {
Some(Ok(((_fid, normalized), (original, docids)))) => {
Some(Ok((normalized, original, docids)))
}
Some(Err(e)) => Some(Err(e)),
None => None,
}
}
}
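// A standalone sketch (assumed, not part of this file) of the key layout the
// two `encode_value` helpers above rely on: the field id as two big-endian
// bytes, then one level byte, then the UTF-8 value. Because the level byte
// sits right after the field id, an exclusive right bound of `field_id | 1`
// stops an `Unbounded` range right after the last level-0 entry of the field.
fn level_zero_bounds(field_id: u16) -> (Vec<u8>, Vec<u8>) {
    let mut left = field_id.to_be_bytes().to_vec();
    left.push(0); // lowest possible level-0 key for this field
    let mut right = field_id.to_be_bytes().to_vec();
    right.push(1); // exclusive sentinel: the first possible level-1 key
    (left, right)
}

#[cfg(test)]
#[test]
fn level_zero_bounds_bracket_the_field() {
    let (left, right) = level_zero_bounds(7);
    assert_eq!(left, vec![0, 7, 0]);
    assert_eq!(right, vec![0, 7, 1]);
    assert!(left < right);
}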
type EitherStringRange<'t> = Either<FacetStringGroupRange<'t>, FacetStringLevelZeroRange<'t>>;
type EitherStringRevRange<'t> =
Either<FacetStringGroupRevRange<'t>, FacetStringLevelZeroRevRange<'t>>;
/// An iterator that is used to explore the facet strings level by level.
/// It only returns the facet strings that are associated with the
/// given candidate document ids.
pub struct FacetStringIter<'t> {
rtxn: &'t heed::RoTxn<'t>,
db: Database<ByteSlice, ByteSlice>,
field_id: FieldId,
level_iters: Vec<(RoaringBitmap, Either<EitherStringRange<'t>, EitherStringRevRange<'t>>)>,
must_reduce: bool,
}
impl<'t> FacetStringIter<'t> {
pub fn new_reducing(
rtxn: &'t heed::RoTxn,
index: &'t Index,
field_id: FieldId,
documents_ids: RoaringBitmap,
) -> heed::Result<FacetStringIter<'t>> {
let db = index.facet_id_string_docids.remap_types::<ByteSlice, ByteSlice>();
let highest_iter = Self::highest_iter(rtxn, index, db, field_id)?;
Ok(FacetStringIter {
rtxn,
db,
field_id,
level_iters: vec![(documents_ids, Left(highest_iter))],
must_reduce: true,
})
}
pub fn new_reverse_reducing(
rtxn: &'t heed::RoTxn,
index: &'t Index,
field_id: FieldId,
documents_ids: RoaringBitmap,
) -> heed::Result<FacetStringIter<'t>> {
let db = index.facet_id_string_docids.remap_types::<ByteSlice, ByteSlice>();
let highest_reverse_iter = Self::highest_reverse_iter(rtxn, index, db, field_id)?;
Ok(FacetStringIter {
rtxn,
db,
field_id,
level_iters: vec![(documents_ids, Right(highest_reverse_iter))],
must_reduce: true,
})
}
pub fn new_non_reducing(
rtxn: &'t heed::RoTxn,
index: &'t Index,
field_id: FieldId,
documents_ids: RoaringBitmap,
) -> heed::Result<FacetStringIter<'t>> {
let db = index.facet_id_string_docids.remap_types::<ByteSlice, ByteSlice>();
let highest_iter = Self::highest_iter(rtxn, index, db, field_id)?;
Ok(FacetStringIter {
rtxn,
db,
field_id,
level_iters: vec![(documents_ids, Left(highest_iter))],
must_reduce: false,
})
}
fn highest_level<X, Y>(
rtxn: &'t heed::RoTxn,
db: Database<X, Y>,
fid: FieldId,
) -> heed::Result<Option<u8>> {
Ok(db
.remap_types::<ByteSlice, DecodeIgnore>()
.prefix_iter(rtxn, &fid.to_be_bytes())? // the field id is the first two bytes
.last()
.transpose()?
.map(|(key_bytes, _)| key_bytes[2])) // the level is the third byte
}
fn highest_iter<X, Y>(
rtxn: &'t heed::RoTxn,
index: &'t Index,
db: Database<X, Y>,
field_id: FieldId,
) -> heed::Result<Either<FacetStringGroupRange<'t>, FacetStringLevelZeroRange<'t>>> {
let highest_level = Self::highest_level(rtxn, db, field_id)?.unwrap_or(0);
match NonZeroU8::new(highest_level) {
Some(highest_level) => FacetStringGroupRange::new(
rtxn,
index.facet_id_string_docids,
field_id,
highest_level,
Unbounded,
Unbounded,
)
.map(Left),
None => FacetStringLevelZeroRange::new(
rtxn,
index.facet_id_string_docids,
field_id,
Unbounded,
Unbounded,
)
.map(Right),
}
}
fn highest_reverse_iter<X, Y>(
rtxn: &'t heed::RoTxn,
index: &'t Index,
db: Database<X, Y>,
field_id: FieldId,
) -> heed::Result<Either<FacetStringGroupRevRange<'t>, FacetStringLevelZeroRevRange<'t>>> {
let highest_level = Self::highest_level(rtxn, db, field_id)?.unwrap_or(0);
match NonZeroU8::new(highest_level) {
Some(highest_level) => FacetStringGroupRevRange::new(
rtxn,
index.facet_id_string_docids,
field_id,
highest_level,
Unbounded,
Unbounded,
)
.map(Left),
None => FacetStringLevelZeroRevRange::new(
rtxn,
index.facet_id_string_docids,
field_id,
Unbounded,
Unbounded,
)
.map(Right),
}
}
}
impl<'t> Iterator for FacetStringIter<'t> {
type Item = heed::Result<(&'t str, &'t str, RoaringBitmap)>;
fn next(&mut self) -> Option<Self::Item> {
'outer: loop {
let (documents_ids, last) = self.level_iters.last_mut()?;
let is_ascending = last.is_left();
// We remap the different iterator types to make
// the algorithm easier to follow.
let last = match last {
Left(ascending) => match ascending {
Left(group) => Left(Left(group)),
Right(zero_level) => Right(Left(zero_level)),
},
Right(descending) => match descending {
Left(group) => Left(Right(group)),
Right(zero_level) => Right(Right(zero_level)),
},
};
match last {
Left(group) => {
for result in group {
match result {
Ok(((level, left, right), (string_bounds, mut docids))) => {
docids &= &*documents_ids;
if !docids.is_empty() {
if self.must_reduce {
*documents_ids -= &docids;
}
let result = if is_ascending {
match string_bounds {
Some((left, right)) => FacetStringLevelZeroRange::new(
self.rtxn,
self.db,
self.field_id,
Included(left),
Included(right),
)
.map(Right),
None => FacetStringGroupRange::new(
self.rtxn,
self.db,
self.field_id,
NonZeroU8::new(level.get() - 1).unwrap(),
Included(left),
Included(right),
)
.map(Left),
}
.map(Left)
} else {
match string_bounds {
Some((left, right)) => {
FacetStringLevelZeroRevRange::new(
self.rtxn,
self.db,
self.field_id,
Included(left),
Included(right),
)
.map(Right)
}
None => FacetStringGroupRevRange::new(
self.rtxn,
self.db,
self.field_id,
NonZeroU8::new(level.get() - 1).unwrap(),
Included(left),
Included(right),
)
.map(Left),
}
.map(Right)
};
match result {
Ok(iter) => {
self.level_iters.push((docids, iter));
continue 'outer;
}
Err(e) => return Some(Err(e)),
}
}
}
Err(e) => return Some(Err(e)),
}
}
}
Right(zero_level) => {
// level zero only
for result in zero_level {
match result {
Ok((normalized, original, mut docids)) => {
docids &= &*documents_ids;
if !docids.is_empty() {
if self.must_reduce {
*documents_ids -= &docids;
}
return Some(Ok((normalized, original, docids)));
}
}
Err(e) => return Some(Err(e)),
}
}
}
}
self.level_iters.pop();
}
}
}
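A hedged sketch of how the reducing iterator above is meant to be driven (crate-internal paths are assumptions; the signatures come from this file). `new_reducing` subtracts every yielded bitmap from the remaining candidates, so each document is reported under exactly one facet string:

```rust
use heed::RoTxn;
use roaring::RoaringBitmap;

use crate::{FieldId, Index}; // assumed crate-internal paths

// Illustration only: walk the facet strings of one field in ascending order
// and print each value with the number of candidate documents it accounts for.
fn print_facet_strings(
    rtxn: &RoTxn,
    index: &Index,
    field_id: FieldId,
    candidates: RoaringBitmap,
) -> heed::Result<()> {
    for result in FacetStringIter::new_reducing(rtxn, index, field_id, candidates)? {
        let (normalized, original, docids) = result?;
        println!("{normalized} ({original}): {} documents", docids.len());
    }
    Ok(())
}
```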

View File

@@ -5,15 +5,14 @@ use std::ops::Bound::{self, Excluded, Included};
 use either::Either;
 pub use filter_parser::{Condition, Error as FPError, FilterCondition, Span, Token};
 use heed::types::DecodeIgnore;
-use log::debug;
 use roaring::RoaringBitmap;
 
-use super::FacetNumberRange;
+use super::facet_range_search;
 use crate::error::{Error, UserError};
-use crate::heed_codec::facet::FacetLevelValueF64Codec;
-use crate::{
-    distance_between_two_points, lat_lng_to_xyz, CboRoaringBitmapCodec, FieldId, Index, Result,
-};
+use crate::heed_codec::facet::{
+    FacetGroupKey, FacetGroupKeyCodec, FacetGroupValueCodec, OrderedF64Codec,
+};
+use crate::{distance_between_two_points, lat_lng_to_xyz, FieldId, Index, Result};
 
 /// The maximum number of filters the filter AST can process.
 const MAX_FILTER_DEPTH: usize = 2000;
@@ -145,112 +144,14 @@ impl<'a> Filter<'a> {
     }
 }
 
 impl<'a> Filter<'a> {
-    /// Aggregates the documents ids that are part of the specified range automatically
-    /// going deeper through the levels.
-    fn explore_facet_number_levels(
-        rtxn: &heed::RoTxn,
-        db: heed::Database<FacetLevelValueF64Codec, CboRoaringBitmapCodec>,
-        field_id: FieldId,
-        level: u8,
-        left: Bound<f64>,
-        right: Bound<f64>,
-        output: &mut RoaringBitmap,
-    ) -> Result<()> {
-        match (left, right) {
-            // If the request is an exact value we must go directly to the deepest level.
-            (Included(l), Included(r)) if l == r && level > 0 => {
-                return Self::explore_facet_number_levels(
-                    rtxn, db, field_id, 0, left, right, output,
-                );
-            }
-            // lower TO upper when lower > upper must return no result
-            (Included(l), Included(r)) if l > r => return Ok(()),
-            (Included(l), Excluded(r)) if l >= r => return Ok(()),
-            (Excluded(l), Excluded(r)) if l >= r => return Ok(()),
-            (Excluded(l), Included(r)) if l >= r => return Ok(()),
-            (_, _) => (),
-        }
-
-        let mut left_found = None;
-        let mut right_found = None;
-
-        // We must create a custom iterator to be able to iterate over the
-        // requested range as the range iterator cannot express some conditions.
-        let iter = FacetNumberRange::new(rtxn, db, field_id, level, left, right)?;
-
-        debug!("Iterating between {:?} and {:?} (level {})", left, right, level);
-
-        for (i, result) in iter.enumerate() {
-            let ((_fid, level, l, r), docids) = result?;
-            debug!("{:?} to {:?} (level {}) found {} documents", l, r, level, docids.len());
-            *output |= docids;
-            // We save the leftest and rightest bounds we actually found at this level.
-            if i == 0 {
-                left_found = Some(l);
-            }
-            right_found = Some(r);
-        }
-
-        // Can we go deeper?
-        let deeper_level = match level.checked_sub(1) {
-            Some(level) => level,
-            None => return Ok(()),
-        };
-
-        // We must refine the left and right bounds of this range by retrieving the
-        // missing part in a deeper level.
-        match left_found.zip(right_found) {
-            Some((left_found, right_found)) => {
-                // If the bound is satisfied we avoid calling this function again.
-                if !matches!(left, Included(l) if l == left_found) {
-                    let sub_right = Excluded(left_found);
-                    debug!(
-                        "calling left with {:?} to {:?} (level {})",
-                        left, sub_right, deeper_level
-                    );
-                    Self::explore_facet_number_levels(
-                        rtxn,
-                        db,
-                        field_id,
-                        deeper_level,
-                        left,
-                        sub_right,
-                        output,
-                    )?;
-                }
-                if !matches!(right, Included(r) if r == right_found) {
-                    let sub_left = Excluded(right_found);
-                    debug!(
-                        "calling right with {:?} to {:?} (level {})",
-                        sub_left, right, deeper_level
-                    );
-                    Self::explore_facet_number_levels(
-                        rtxn,
-                        db,
-                        field_id,
-                        deeper_level,
-                        sub_left,
-                        right,
-                        output,
-                    )?;
-                }
-            }
-            None => {
-                // If we found nothing at this level it means that we must find
-                // the same bounds but at a deeper, more precise level.
-                Self::explore_facet_number_levels(
-                    rtxn,
-                    db,
-                    field_id,
-                    deeper_level,
-                    left,
-                    right,
-                    output,
-                )?;
-            }
-        }
-
-        Ok(())
-    }
+    pub fn evaluate(&self, rtxn: &heed::RoTxn, index: &Index) -> Result<RoaringBitmap> {
+        // to avoid doing this for each recursive call we're going to do it ONCE ahead of time
+        let soft_deleted_documents = index.soft_deleted_documents_ids(rtxn)?;
+        let filterable_fields = index.filterable_fields(rtxn)?;
+
+        // and finally we delete all the soft_deleted_documents, again, only once at the very end
+        self.inner_evaluate(rtxn, index, &filterable_fields)
+            .map(|result| result - soft_deleted_documents)
+    }
 
     fn evaluate_operator(
@@ -277,8 +178,16 @@ impl<'a> Filter<'a> {
                     return Ok(exist);
                 }
                 Condition::Equal(val) => {
-                    let (_original_value, string_docids) = strings_db
-                        .get(rtxn, &(field_id, &val.value().to_lowercase()))?
+                    let string_docids = strings_db
+                        .get(
+                            rtxn,
+                            &FacetGroupKey {
+                                field_id,
+                                level: 0,
+                                left_bound: &val.value().to_lowercase(),
+                            },
+                        )?
+                        .map(|v| v.bitmap)
                         .unwrap_or_default();
                     let number = val.parse::<f64>().ok();
                     let number_docids = match number {
@@ -312,8 +221,19 @@ impl<'a> Filter<'a> {
                 // that's fine if it don't, the value just before will be returned instead.
                 let biggest_level = numbers_db
                     .remap_data_type::<DecodeIgnore>()
-                    .get_lower_than_or_equal_to(rtxn, &(field_id, u8::MAX, f64::MAX, f64::MAX))?
-                    .and_then(|((id, level, _, _), _)| if id == field_id { Some(level) } else { None });
+                    .get_lower_than_or_equal_to(
+                        rtxn,
+                        &FacetGroupKey { field_id, level: u8::MAX, left_bound: f64::MAX },
+                    )?
+                    .and_then(
+                        |(FacetGroupKey { field_id: id, level, .. }, _)| {
+                            if id == field_id {
+                                Some(level)
+                            } else {
+                                None
+                            }
+                        },
+                    );
 
                 match biggest_level {
                     Some(level) => {
@@ -333,14 +253,36 @@ impl<'a> Filter<'a> {
         }
     }
 
-    pub fn evaluate(&self, rtxn: &heed::RoTxn, index: &Index) -> Result<RoaringBitmap> {
-        // to avoid doing this for each recursive call we're going to do it ONCE ahead of time
-        let soft_deleted_documents = index.soft_deleted_documents_ids(rtxn)?;
-        let filterable_fields = index.filterable_fields(rtxn)?;
-
-        // and finally we delete all the soft_deleted_documents, again, only once at the very end
-        self.inner_evaluate(rtxn, index, &filterable_fields)
-            .map(|result| result - soft_deleted_documents)
+    /// Aggregates the documents ids that are part of the specified range automatically
+    /// going deeper through the levels.
+    fn explore_facet_number_levels(
+        rtxn: &heed::RoTxn,
+        db: heed::Database<FacetGroupKeyCodec<OrderedF64Codec>, FacetGroupValueCodec>,
+        field_id: FieldId,
+        level: u8,
+        left: Bound<f64>,
+        right: Bound<f64>,
+        output: &mut RoaringBitmap,
+    ) -> Result<()> {
+        match (left, right) {
+            // If the request is an exact value we must go directly to the deepest level.
+            (Included(l), Included(r)) if l == r && level > 0 => {
+                return Self::explore_facet_number_levels(
+                    rtxn, db, field_id, 0, left, right, output,
+                );
+            }
+            // lower TO upper when lower > upper must return no result
+            (Included(l), Included(r)) if l > r => return Ok(()),
+            (Included(l), Excluded(r)) if l >= r => return Ok(()),
+            (Excluded(l), Excluded(r)) if l >= r => return Ok(()),
+            (Excluded(l), Included(r)) if l >= r => return Ok(()),
+            (_, _) => (),
+        }
+        facet_range_search::find_docids_of_facet_within_bounds::<OrderedF64Codec>(
+            rtxn, db, field_id, &left, &right, output,
+        )?;
+
+        Ok(())
     }
 
     fn inner_evaluate(

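The early-return arms at the top of `explore_facet_number_levels` encode when a numeric range is necessarily empty. A standalone sketch of the same rules (illustration only, not PR code):

```rust
use std::ops::Bound::{self, Excluded, Included};

// Mirrors the guard clauses above: an inverted or empty numeric range
// must match no documents at all.
fn range_is_empty(left: Bound<f64>, right: Bound<f64>) -> bool {
    match (left, right) {
        (Included(l), Included(r)) => l > r,
        (Included(l), Excluded(r))
        | (Excluded(l), Included(r))
        | (Excluded(l), Excluded(r)) => l >= r,
        _ => false, // at least one unbounded side: never trivially empty
    }
}

#[test]
fn inverted_ranges_are_empty() {
    assert!(range_is_empty(Included(3.0), Included(2.0)));
    assert!(range_is_empty(Excluded(2.0), Excluded(2.0)));
    assert!(!range_is_empty(Included(2.0), Included(2.0))); // exact value, not empty
}
```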
View File

@@ -1,9 +1,153 @@
-pub use self::facet_distribution::{FacetDistribution, DEFAULT_VALUES_PER_FACET};
-pub use self::facet_number::{FacetNumberIter, FacetNumberRange, FacetNumberRevRange};
-pub use self::facet_string::FacetStringIter;
-pub use self::filter::Filter;
+pub use facet_sort_ascending::ascending_facet_sort;
+pub use facet_sort_descending::descending_facet_sort;
+use heed::types::{ByteSlice, DecodeIgnore};
+use heed::{BytesDecode, RoTxn};
+
+pub use self::facet_distribution::{FacetDistribution, DEFAULT_VALUES_PER_FACET};
+pub use self::filter::Filter;
+use crate::heed_codec::facet::{FacetGroupKeyCodec, FacetGroupValueCodec};
+use crate::heed_codec::ByteSliceRefCodec;
 
 mod facet_distribution;
-mod facet_number;
-mod facet_string;
+mod facet_distribution_iter;
+mod facet_range_search;
+mod facet_sort_ascending;
+mod facet_sort_descending;
 mod filter;
/// Get the first facet value in the facet database
pub(crate) fn get_first_facet_value<'t, BoundCodec>(
txn: &'t RoTxn,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
) -> heed::Result<Option<BoundCodec::DItem>>
where
BoundCodec: BytesDecode<'t>,
{
let mut level0prefix = vec![];
level0prefix.extend_from_slice(&field_id.to_be_bytes());
level0prefix.push(0);
let mut level0_iter_forward = db
.as_polymorph()
.prefix_iter::<_, ByteSlice, DecodeIgnore>(txn, level0prefix.as_slice())?;
if let Some(first) = level0_iter_forward.next() {
let (first_key, _) = first?;
let first_key = FacetGroupKeyCodec::<BoundCodec>::bytes_decode(first_key)
.ok_or(heed::Error::Encoding)?;
Ok(Some(first_key.left_bound))
} else {
Ok(None)
}
}
/// Get the last facet value in the facet database
pub(crate) fn get_last_facet_value<'t, BoundCodec>(
txn: &'t RoTxn,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
) -> heed::Result<Option<BoundCodec::DItem>>
where
BoundCodec: BytesDecode<'t>,
{
let mut level0prefix = vec![];
level0prefix.extend_from_slice(&field_id.to_be_bytes());
level0prefix.push(0);
let mut level0_iter_backward = db
.as_polymorph()
.rev_prefix_iter::<_, ByteSlice, DecodeIgnore>(txn, level0prefix.as_slice())?;
if let Some(last) = level0_iter_backward.next() {
let (last_key, _) = last?;
let last_key = FacetGroupKeyCodec::<BoundCodec>::bytes_decode(last_key)
.ok_or(heed::Error::Encoding)?;
Ok(Some(last_key.left_bound))
} else {
Ok(None)
}
}
/// Get the height of the highest level in the facet database
pub(crate) fn get_highest_level<'t>(
txn: &'t RoTxn<'t>,
db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
field_id: u16,
) -> heed::Result<u8> {
let field_id_prefix = &field_id.to_be_bytes();
Ok(db
.as_polymorph()
.rev_prefix_iter::<_, ByteSlice, DecodeIgnore>(&txn, field_id_prefix)?
.next()
.map(|el| {
let (key, _) = el.unwrap();
let key = FacetGroupKeyCodec::<ByteSliceRefCodec>::bytes_decode(key).unwrap();
key.level
})
.unwrap_or(0))
}
#[cfg(test)]
pub(crate) mod tests {
use rand::{Rng, SeedableRng};
use roaring::RoaringBitmap;
use crate::heed_codec::facet::OrderedF64Codec;
use crate::update::facet::tests::FacetIndex;
pub fn get_simple_index() -> FacetIndex<OrderedF64Codec> {
let index = FacetIndex::<OrderedF64Codec>::new(4, 8, 5);
let mut txn = index.env.write_txn().unwrap();
for i in 0..256u16 {
let mut bitmap = RoaringBitmap::new();
bitmap.insert(i as u32);
index.insert(&mut txn, 0, &(i as f64), &bitmap);
}
txn.commit().unwrap();
index
}
pub fn get_random_looking_index() -> FacetIndex<OrderedF64Codec> {
let index = FacetIndex::<OrderedF64Codec>::new(4, 8, 5);
let mut txn = index.env.write_txn().unwrap();
let mut rng = rand::rngs::SmallRng::from_seed([0; 32]);
let keys =
std::iter::from_fn(|| Some(rng.gen_range(0..256))).take(128).collect::<Vec<u32>>();
for (_i, key) in keys.into_iter().enumerate() {
let mut bitmap = RoaringBitmap::new();
bitmap.insert(key);
bitmap.insert(key + 100);
index.insert(&mut txn, 0, &(key as f64), &bitmap);
}
txn.commit().unwrap();
index
}
pub fn get_simple_index_with_multiple_field_ids() -> FacetIndex<OrderedF64Codec> {
let index = FacetIndex::<OrderedF64Codec>::new(4, 8, 5);
let mut txn = index.env.write_txn().unwrap();
for fid in 0..2 {
for i in 0..256u16 {
let mut bitmap = RoaringBitmap::new();
bitmap.insert(i as u32);
index.insert(&mut txn, fid, &(i as f64), &bitmap);
}
}
txn.commit().unwrap();
index
}
pub fn get_random_looking_index_with_multiple_field_ids() -> FacetIndex<OrderedF64Codec> {
let index = FacetIndex::<OrderedF64Codec>::new(4, 8, 5);
let mut txn = index.env.write_txn().unwrap();
let mut rng = rand::rngs::SmallRng::from_seed([0; 32]);
let keys =
std::iter::from_fn(|| Some(rng.gen_range(0..256))).take(128).collect::<Vec<u32>>();
for fid in 0..2 {
for (_i, &key) in keys.iter().enumerate() {
let mut bitmap = RoaringBitmap::new();
bitmap.insert(key);
bitmap.insert(key + 100);
index.insert(&mut txn, fid, &(key as f64), &bitmap);
}
}
txn.commit().unwrap();
index
}
}
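A hedged sketch of how the three helpers above can be combined inside the crate (signatures taken from this file; `describe_field` itself is illustrative and assumes `OrderedF64Codec` decodes level-0 bounds to `f64`):

```rust
use heed::RoTxn;

use crate::heed_codec::facet::{FacetGroupKeyCodec, FacetGroupValueCodec, OrderedF64Codec};
use crate::heed_codec::ByteSliceRefCodec;

// Summarise one field of a number facet database: how many levels it has and
// what its smallest and largest level-0 values are.
fn describe_field(
    txn: &RoTxn,
    db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
    field_id: u16,
) -> heed::Result<String> {
    let highest = get_highest_level(txn, db, field_id)?;
    let first = get_first_facet_value::<OrderedF64Codec>(txn, db, field_id)?;
    let last = get_last_facet_value::<OrderedF64Codec>(txn, db, field_id)?;
    Ok(format!("levels 0..={highest}, level-0 bounds {first:?} to {last:?}"))
}
```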

View File

@ -0,0 +1,260 @@
---
source: milli/src/search/facet/facet_distribution_iter.rs
---
0: 1
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
7: 1
8: 1
9: 1
10: 1
11: 1
12: 1
13: 1
14: 1
15: 1
16: 1
17: 1
18: 1
19: 1
20: 1
21: 1
22: 1
23: 1
24: 1
25: 1
26: 1
27: 1
28: 1
29: 1
30: 1
31: 1
32: 1
33: 1
34: 1
35: 1
36: 1
37: 1
38: 1
39: 1
40: 1
41: 1
42: 1
43: 1
44: 1
45: 1
46: 1
47: 1
48: 1
49: 1
50: 1
51: 1
52: 1
53: 1
54: 1
55: 1
56: 1
57: 1
58: 1
59: 1
60: 1
61: 1
62: 1
63: 1
64: 1
65: 1
66: 1
67: 1
68: 1
69: 1
70: 1
71: 1
72: 1
73: 1
74: 1
75: 1
76: 1
77: 1
78: 1
79: 1
80: 1
81: 1
82: 1
83: 1
84: 1
85: 1
86: 1
87: 1
88: 1
89: 1
90: 1
91: 1
92: 1
93: 1
94: 1
95: 1
96: 1
97: 1
98: 1
99: 1
100: 1
101: 1
102: 1
103: 1
104: 1
105: 1
106: 1
107: 1
108: 1
109: 1
110: 1
111: 1
112: 1
113: 1
114: 1
115: 1
116: 1
117: 1
118: 1
119: 1
120: 1
121: 1
122: 1
123: 1
124: 1
125: 1
126: 1
127: 1
128: 1
129: 1
130: 1
131: 1
132: 1
133: 1
134: 1
135: 1
136: 1
137: 1
138: 1
139: 1
140: 1
141: 1
142: 1
143: 1
144: 1
145: 1
146: 1
147: 1
148: 1
149: 1
150: 1
151: 1
152: 1
153: 1
154: 1
155: 1
156: 1
157: 1
158: 1
159: 1
160: 1
161: 1
162: 1
163: 1
164: 1
165: 1
166: 1
167: 1
168: 1
169: 1
170: 1
171: 1
172: 1
173: 1
174: 1
175: 1
176: 1
177: 1
178: 1
179: 1
180: 1
181: 1
182: 1
183: 1
184: 1
185: 1
186: 1
187: 1
188: 1
189: 1
190: 1
191: 1
192: 1
193: 1
194: 1
195: 1
196: 1
197: 1
198: 1
199: 1
200: 1
201: 1
202: 1
203: 1
204: 1
205: 1
206: 1
207: 1
208: 1
209: 1
210: 1
211: 1
212: 1
213: 1
214: 1
215: 1
216: 1
217: 1
218: 1
219: 1
220: 1
221: 1
222: 1
223: 1
224: 1
225: 1
226: 1
227: 1
228: 1
229: 1
230: 1
231: 1
232: 1
233: 1
234: 1
235: 1
236: 1
237: 1
238: 1
239: 1
240: 1
241: 1
242: 1
243: 1
244: 1
245: 1
246: 1
247: 1
248: 1
249: 1
250: 1
251: 1
252: 1
253: 1
254: 1
255: 1

View File

@ -0,0 +1,105 @@
---
source: milli/src/search/facet/facet_distribution_iter.rs
---
3: 2
5: 2
6: 2
9: 2
10: 2
11: 2
14: 2
18: 2
19: 2
24: 2
26: 2
28: 2
29: 2
32: 2
33: 2
35: 2
36: 2
37: 2
38: 2
39: 2
41: 2
46: 2
47: 2
49: 2
52: 2
53: 2
55: 2
59: 2
61: 2
64: 2
68: 2
71: 2
74: 2
75: 2
76: 2
81: 2
83: 2
85: 2
86: 2
88: 2
90: 2
91: 2
92: 2
98: 2
99: 2
101: 2
102: 2
103: 2
107: 2
111: 2
115: 2
119: 2
123: 2
124: 2
130: 2
131: 2
133: 2
135: 2
136: 2
137: 2
139: 2
141: 2
143: 2
144: 2
147: 2
150: 2
156: 1
158: 1
160: 1
162: 1
163: 1
164: 1
167: 1
169: 1
173: 1
177: 1
178: 1
179: 1
181: 1
182: 1
186: 1
189: 1
192: 1
193: 1
195: 1
197: 1
205: 1
206: 1
207: 1
208: 1
209: 1
210: 1
216: 1
219: 1
220: 1
223: 1
226: 1
235: 1
236: 1
238: 1
243: 1

View File

@ -0,0 +1,104 @@
---
source: milli/src/search/facet/facet_distribution_iter.rs
---
0: 1
1: 1
2: 1
3: 1
4: 1
5: 1
6: 1
7: 1
8: 1
9: 1
10: 1
11: 1
12: 1
13: 1
14: 1
15: 1
16: 1
17: 1
18: 1
19: 1
20: 1
21: 1
22: 1
23: 1
24: 1
25: 1
26: 1
27: 1
28: 1
29: 1
30: 1
31: 1
32: 1
33: 1
34: 1
35: 1
36: 1
37: 1
38: 1
39: 1
40: 1
41: 1
42: 1
43: 1
44: 1
45: 1
46: 1
47: 1
48: 1
49: 1
50: 1
51: 1
52: 1
53: 1
54: 1
55: 1
56: 1
57: 1
58: 1
59: 1
60: 1
61: 1
62: 1
63: 1
64: 1
65: 1
66: 1
67: 1
68: 1
69: 1
70: 1
71: 1
72: 1
73: 1
74: 1
75: 1
76: 1
77: 1
78: 1
79: 1
80: 1
81: 1
82: 1
83: 1
84: 1
85: 1
86: 1
87: 1
88: 1
89: 1
90: 1
91: 1
92: 1
93: 1
94: 1
95: 1
96: 1
97: 1
98: 1
99: 1

View File

@ -0,0 +1,104 @@
---
source: milli/src/search/facet/facet_distribution_iter.rs
---
3: 2
5: 2
6: 2
9: 2
10: 2
11: 2
14: 2
18: 2
19: 2
24: 2
26: 2
28: 2
29: 2
32: 2
33: 2
35: 2
36: 2
37: 2
38: 2
39: 2
41: 2
46: 2
47: 2
49: 2
52: 2
53: 2
55: 2
59: 2
61: 2
64: 2
68: 2
71: 2
74: 2
75: 2
76: 2
81: 2
83: 2
85: 2
86: 2
88: 2
90: 2
91: 2
92: 2
98: 2
99: 2
101: 2
102: 2
103: 2
107: 2
111: 2
115: 2
119: 2
123: 2
124: 2
130: 2
131: 2
133: 2
135: 2
136: 2
137: 2
139: 2
141: 2
143: 2
144: 2
147: 2
150: 2
156: 1
158: 1
160: 1
162: 1
163: 1
164: 1
167: 1
169: 1
173: 1
177: 1
178: 1
179: 1
181: 1
182: 1
186: 1
189: 1
192: 1
193: 1
195: 1
197: 1
205: 1
206: 1
207: 1
208: 1
209: 1
210: 1
216: 1
219: 1
220: 1
223: 1
226: 1
235: 1
236: 1
238: 1

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
fcedc563a82c1c61f50174a5f3f982b6

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
6cc26e77fc6bd9145deedf14cf422b03

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
fcedc563a82c1c61f50174a5f3f982b6

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
6cc26e77fc6bd9145deedf14cf422b03

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
57d35cfa419a19a1a1f8d7c8ef096e0f

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
3dbe0547b42759795e9b16989df72cee

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
57d35cfa419a19a1a1f8d7c8ef096e0f

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
3dbe0547b42759795e9b16989df72cee

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
c1c7a0bb91d53d33724583b6d4a99f16

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
12213d3f1047a0c3d08e4670a7d688e7

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
c1c7a0bb91d53d33724583b6d4a99f16

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
12213d3f1047a0c3d08e4670a7d688e7

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
ca59f20e043a4d52c49e15b10adf96bb

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
cb69e0fe10fb299bafe77514204379cb

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
ca59f20e043a4d52c49e15b10adf96bb

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
cb69e0fe10fb299bafe77514204379cb

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
3456db9a1bb94c33c1e9f656184ee711

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
2127cd818b457e0611e0c8e1a871602a

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
3456db9a1bb94c33c1e9f656184ee711

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
2127cd818b457e0611e0c8e1a871602a

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
b976551ceff412bfb2ec9bfbda320bbb

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
7620ca1a96882c7147d3fd996570f9b3

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
b976551ceff412bfb2ec9bfbda320bbb

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
7620ca1a96882c7147d3fd996570f9b3

View File

@ -0,0 +1,4 @@
---
source: milli/src/search/facet/facet_range_search.rs
---
3256c76a7c1b768a013e78d5fa6e9ff9

View File

@ -0,0 +1,60 @@
---
source: milli/src/search/facet/facet_sort_ascending.rs
---
[200, ]
[201, ]
[202, ]
[203, ]
[204, ]
[205, ]
[206, ]
[207, ]
[208, ]
[209, ]
[210, ]
[211, ]
[212, ]
[213, ]
[214, ]
[215, ]
[216, ]
[217, ]
[218, ]
[219, ]
[220, ]
[221, ]
[222, ]
[223, ]
[224, ]
[225, ]
[226, ]
[227, ]
[228, ]
[229, ]
[230, ]
[231, ]
[232, ]
[233, ]
[234, ]
[235, ]
[236, ]
[237, ]
[238, ]
[239, ]
[240, ]
[241, ]
[242, ]
[243, ]
[244, ]
[245, ]
[246, ]
[247, ]
[248, ]
[249, ]
[250, ]
[251, ]
[252, ]
[253, ]
[254, ]
[255, ]

View File

@ -0,0 +1,54 @@
---
source: milli/src/search/facet/facet_sort_ascending.rs
---
[201, ]
[202, ]
[203, ]
[207, ]
[211, ]
[215, ]
[219, ]
[223, ]
[224, ]
[230, ]
[231, ]
[233, ]
[235, ]
[236, ]
[237, ]
[239, ]
[241, ]
[243, ]
[244, ]
[247, ]
[250, ]
[256, ]
[258, ]
[260, ]
[262, ]
[263, ]
[264, ]
[267, ]
[269, ]
[273, ]
[277, ]
[278, ]
[279, ]
[281, ]
[282, ]
[286, ]
[289, ]
[292, ]
[293, ]
[295, ]
[297, ]
[205, ]
[206, ]
[208, ]
[209, ]
[210, ]
[216, ]
[220, ]
[226, ]
[238, ]

View File

@ -0,0 +1,60 @@
---
source: milli/src/search/facet/facet_sort_descending.rs
---
[255, ]
[254, ]
[253, ]
[252, ]
[251, ]
[250, ]
[249, ]
[248, ]
[247, ]
[246, ]
[245, ]
[244, ]
[243, ]
[242, ]
[241, ]
[240, ]
[239, ]
[238, ]
[237, ]
[236, ]
[235, ]
[234, ]
[233, ]
[232, ]
[231, ]
[230, ]
[229, ]
[228, ]
[227, ]
[226, ]
[225, ]
[224, ]
[223, ]
[222, ]
[221, ]
[220, ]
[219, ]
[218, ]
[217, ]
[216, ]
[215, ]
[214, ]
[213, ]
[212, ]
[211, ]
[210, ]
[209, ]
[208, ]
[207, ]
[206, ]
[205, ]
[204, ]
[203, ]
[202, ]
[201, ]
[200, ]

View File

@ -0,0 +1,54 @@
---
source: milli/src/search/facet/facet_sort_descending.rs
---
[243, ]
[238, ]
[236, ]
[235, ]
[226, ]
[223, ]
[220, ]
[219, ]
[216, ]
[210, ]
[209, ]
[208, ]
[207, ]
[206, ]
[205, ]
[297, ]
[295, ]
[293, ]
[292, ]
[289, ]
[286, ]
[282, ]
[281, ]
[279, ]
[278, ]
[277, ]
[273, ]
[269, ]
[267, ]
[264, ]
[263, ]
[262, ]
[260, ]
[258, ]
[256, ]
[250, ]
[247, ]
[244, ]
[241, ]
[239, ]
[237, ]
[233, ]
[231, ]
[230, ]
[224, ]
[215, ]
[211, ]
[203, ]
[202, ]
[201, ]

View File

@@ -15,7 +15,7 @@ use log::debug;
 use once_cell::sync::Lazy;
 use roaring::bitmap::RoaringBitmap;
 
-pub use self::facet::{FacetDistribution, FacetNumberIter, Filter, DEFAULT_VALUES_PER_FACET};
+pub use self::facet::{FacetDistribution, Filter, DEFAULT_VALUES_PER_FACET};
 use self::fst_utils::{Complement, Intersection, StartsWith, Union};
 pub use self::matches::{
     FormatOptions, MatchBounds, Matcher, MatcherBuilder, MatchingWord, MatchingWords,
@@ -32,7 +32,7 @@ static LEVDIST2: Lazy<LevBuilder> = Lazy::new(|| LevBuilder::new(2, true));
 mod criteria;
 mod distinct;
-mod facet;
+pub mod facet;
 mod fst_utils;
 mod matches;
 mod query_tree;

View File

@@ -2,18 +2,14 @@ use std::borrow::Cow;
 use std::fmt::Write;
 use std::path::Path;
 
-use heed::types::ByteSlice;
-use heed::BytesDecode;
 use roaring::RoaringBitmap;
 
-use crate::heed_codec::facet::{
-    FacetLevelValueU32Codec, FacetStringLevelZeroCodec, FacetStringLevelZeroValueCodec,
-    FacetStringZeroBoundsValueCodec,
-};
-use crate::{make_db_snap_from_iter, CboRoaringBitmapCodec, ExternalDocumentsIds, Index};
+use crate::facet::FacetType;
+use crate::heed_codec::facet::{FacetGroupKey, FacetGroupValue};
+use crate::{make_db_snap_from_iter, ExternalDocumentsIds, Index};
 
 #[track_caller]
-pub fn default_db_snapshot_settings_for_test(name: Option<&str>) -> insta::Settings {
+pub fn default_db_snapshot_settings_for_test(name: Option<&str>) -> (insta::Settings, String) {
     let mut settings = insta::Settings::clone_current();
     settings.set_prepend_module_to_snapshot(false);
     let path = Path::new(std::panic::Location::caller().file());
@@ -23,12 +19,63 @@ pub fn default_db_snapshot_settings_for_test(name: Option<&str>) -> insta::Setti
     if let Some(name) = name {
         settings
-            .set_snapshot_path(Path::new("snapshots").join(filename).join(test_name).join(name));
+            .set_snapshot_path(Path::new("snapshots").join(filename).join(&test_name).join(name));
     } else {
-        settings.set_snapshot_path(Path::new("snapshots").join(filename).join(test_name));
+        settings.set_snapshot_path(Path::new("snapshots").join(filename).join(&test_name));
     }
 
-    settings
+    (settings, test_name)
 }
+
+#[macro_export]
+macro_rules! milli_snap {
+    ($value:expr, $name:expr) => {
+        let (settings, _) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
+        settings.bind(|| {
+            let snap = $value;
+            let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(&format!("{}", $name), &snap, false);
+            for (name, snap) in snaps {
+                insta::assert_snapshot!(name, snap);
+            }
+        });
+    };
+    ($value:expr) => {
+        let (settings, test_name) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
+        settings.bind(|| {
+            let snap = $value;
+            let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(&format!("{}", test_name), &snap, false);
+            for (name, snap) in snaps {
+                insta::assert_snapshot!(name, snap);
+            }
+        });
+    };
+    ($value:expr, @$inline:literal) => {
+        let (settings, test_name) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
+        settings.bind(|| {
+            let snap = $value;
+            let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(&format!("{}", test_name), &snap, true);
+            for (name, snap) in snaps {
+                if !name.ends_with(".full") {
+                    insta::assert_snapshot!(snap, @$inline);
+                } else {
+                    insta::assert_snapshot!(name, snap);
+                }
+            }
+        });
+    };
+    ($value:expr, $name:expr, @$inline:literal) => {
+        let (settings, _) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
+        settings.bind(|| {
+            let snap = $value;
+            let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(&format!("{}", $name), &snap, true);
+            for (name, snap) in snaps {
+                if !name.ends_with(".full") {
+                    insta::assert_snapshot!(snap, @$inline);
+                } else {
+                    insta::assert_snapshot!(name, snap);
+                }
+            }
+        });
+    };
+}
 
 /**
@@ -99,7 +146,7 @@ db_snap!(index, word_docids, "some_identifier", @"");
 #[macro_export]
 macro_rules! db_snap {
     ($index:ident, $db_name:ident, $name:expr) => {
-        let settings = $crate::snapshot_tests::default_db_snapshot_settings_for_test(Some(
+        let (settings, _) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(Some(
             &format!("{}", $name),
         ));
         settings.bind(|| {
@@ -111,7 +158,7 @@ macro_rules! db_snap {
         });
     };
     ($index:ident, $db_name:ident) => {
-        let settings = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
+        let (settings, _) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
         settings.bind(|| {
             let snap = $crate::full_snap_of_db!($index, $db_name);
             let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(stringify!($db_name), &snap, false);
@@ -121,7 +168,7 @@ macro_rules! db_snap {
         });
     };
     ($index:ident, $db_name:ident, @$inline:literal) => {
-        let settings = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
+        let (settings, _) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(None);
         settings.bind(|| {
             let snap = $crate::full_snap_of_db!($index, $db_name);
             let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(stringify!($db_name), &snap, true);
@@ -134,8 +181,8 @@ macro_rules! db_snap {
             }
         });
     };
-    ($index:ident, $db_name:ident, $name:literal, @$inline:literal) => {
-        let settings = $crate::snapshot_tests::default_db_snapshot_settings_for_test(Some(&format!("{}", $name)));
+    ($index:ident, $db_name:ident, $name:expr, @$inline:literal) => {
+        let (settings, _) = $crate::snapshot_tests::default_db_snapshot_settings_for_test(Some(&format!("{}", $name)));
         settings.bind(|| {
             let snap = $crate::full_snap_of_db!($index, $db_name);
             let snaps = $crate::snapshot_tests::convert_snap_to_hash_if_needed(stringify!($db_name), &snap, true);
@@ -233,44 +280,35 @@ pub fn snap_word_prefix_position_docids(index: &Index) -> String {
 }
 pub fn snap_facet_id_f64_docids(index: &Index) -> String {
     let snap = make_db_snap_from_iter!(index, facet_id_f64_docids, |(
-        (facet_id, level, left, right),
-        b,
+        FacetGroupKey { field_id, level, left_bound },
+        FacetGroupValue { size, bitmap },
     )| {
-        &format!("{facet_id:<3} {level:<2} {left:<6} {right:<6} {}", display_bitmap(&b))
+        &format!("{field_id:<3} {level:<2} {left_bound:<6} {size:<2} {}", display_bitmap(&bitmap))
     });
     snap
 }
+pub fn snap_facet_id_exists_docids(index: &Index) -> String {
+    let snap = make_db_snap_from_iter!(index, facet_id_exists_docids, |(facet_id, docids)| {
+        &format!("{facet_id:<3} {}", display_bitmap(&docids))
+    });
+    snap
+}
 pub fn snap_facet_id_string_docids(index: &Index) -> String {
-    let rtxn = index.read_txn().unwrap();
-    let bytes_db = index.facet_id_string_docids.remap_types::<ByteSlice, ByteSlice>();
-    let iter = bytes_db.iter(&rtxn).unwrap();
-    let mut snap = String::new();
-
-    for x in iter {
-        let (key, value) = x.unwrap();
-        if let Some((field_id, normalized_str)) = FacetStringLevelZeroCodec::bytes_decode(key) {
-            let (orig_string, docids) =
-                FacetStringLevelZeroValueCodec::bytes_decode(value).unwrap();
-            snap.push_str(&format!(
-                "{field_id:<3} {normalized_str:<8} {orig_string:<8} {}\n",
-                display_bitmap(&docids)
-            ));
-        } else if let Some((field_id, level, left, right)) =
-            FacetLevelValueU32Codec::bytes_decode(key)
-        {
-            snap.push_str(&format!("{field_id:<3} {level:<2} {left:<6} {right:<6} "));
-            let (bounds, docids) =
-                FacetStringZeroBoundsValueCodec::<CboRoaringBitmapCodec>::bytes_decode(value)
-                    .unwrap();
-            if let Some((left, right)) = bounds {
-                snap.push_str(&format!("{left:<8} {right:<8} "));
-            }
-            snap.push_str(&display_bitmap(&docids));
-            snap.push('\n');
-        } else {
-            panic!();
-        }
-    }
+    let snap = make_db_snap_from_iter!(index, facet_id_string_docids, |(
+        FacetGroupKey { field_id, level, left_bound },
+        FacetGroupValue { size, bitmap },
+    )| {
+        &format!("{field_id:<3} {level:<2} {left_bound:<12} {size:<2} {}", display_bitmap(&bitmap))
+    });
+    snap
+}
+pub fn snap_field_id_docid_facet_strings(index: &Index) -> String {
+    let snap = make_db_snap_from_iter!(index, field_id_docid_facet_strings, |(
+        (field_id, doc_id, string),
+        other_string,
+    )| {
+        &format!("{field_id:<3} {doc_id:<4} {string:<12} {other_string}")
+    });
     snap
 }
 pub fn snap_documents_ids(index: &Index) -> String {
@@ -339,7 +377,7 @@ pub fn snap_number_faceted_documents_ids(index: &Index) -> String {
     let mut snap = String::new();
     for field_id in fields_ids_map.ids() {
         let number_faceted_documents_ids =
-            index.number_faceted_documents_ids(&rtxn, field_id).unwrap();
+            index.faceted_documents_ids(&rtxn, field_id, FacetType::Number).unwrap();
         writeln!(&mut snap, "{field_id:<3} {}", display_bitmap(&number_faceted_documents_ids))
             .unwrap();
     }
@@ -352,7 +390,7 @@ pub fn snap_string_faceted_documents_ids(index: &Index) -> String {
     let mut snap = String::new();
     for field_id in fields_ids_map.ids() {
         let string_faceted_documents_ids =
-            index.string_faceted_documents_ids(&rtxn, field_id).unwrap();
+            index.faceted_documents_ids(&rtxn, field_id, FacetType::String).unwrap();
         writeln!(&mut snap, "{field_id:<3} {}", display_bitmap(&string_faceted_documents_ids))
             .unwrap();
     }
@@ -454,6 +492,12 @@ macro_rules! full_snap_of_db {
     ($index:ident, facet_id_string_docids) => {{
         $crate::snapshot_tests::snap_facet_id_string_docids(&$index)
     }};
+    ($index:ident, field_id_docid_facet_strings) => {{
+        $crate::snapshot_tests::snap_field_id_docid_facet_strings(&$index)
+    }};
+    ($index:ident, facet_id_exists_docids) => {{
+        $crate::snapshot_tests::snap_facet_id_exists_docids(&$index)
+    }};
     ($index:ident, documents_ids) => {{
        $crate::snapshot_tests::snap_documents_ids(&$index)
     }};
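A hedged usage sketch of the relaxed `db_snap!` name argument: the arm changed from `$name:literal` to `$name:expr`, so any expression now works as the snapshot name, which is what lets the deletion tests below snapshot once per configuration. `build_index` is a hypothetical helper, not part of the PR:

```rust
#[test]
fn facet_db_snapshots() {
    for disable_soft_deletion in [true, false] {
        // build_index is assumed: it indexes some documents and deletes a few,
        // with soft deletion enabled or disabled.
        let index = build_index(disable_soft_deletion);
        db_snap!(index, facet_id_f64_docids, disable_soft_deletion);
        db_snap!(index, facet_id_string_docids, disable_soft_deletion);
    }
}
```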

View File

@@ -1,6 +1,7 @@
 use roaring::RoaringBitmap;
 use time::OffsetDateTime;
 
+use crate::facet::FacetType;
 use crate::{ExternalDocumentsIds, FieldDistribution, Index, Result};
 
 pub struct ClearDocuments<'t, 'u, 'i> {
@@ -55,8 +56,18 @@ impl<'t, 'u, 'i> ClearDocuments<'t, 'u, 'i> {
 
         // We clean all the faceted documents ids.
         for field_id in faceted_fields {
-            self.index.put_number_faceted_documents_ids(self.wtxn, field_id, &empty_roaring)?;
-            self.index.put_string_faceted_documents_ids(self.wtxn, field_id, &empty_roaring)?;
+            self.index.put_faceted_documents_ids(
+                self.wtxn,
+                field_id,
+                FacetType::Number,
+                &empty_roaring,
+            )?;
+            self.index.put_faceted_documents_ids(
+                self.wtxn,
+                field_id,
+                FacetType::String,
+                &empty_roaring,
+            )?;
         }
 
         // Clear the other databases.
View File

@@ -1,23 +1,24 @@
 use std::collections::btree_map::Entry;
+use std::collections::{HashMap, HashSet};
 
 use fst::IntoStreamer;
-use heed::types::{ByteSlice, Str};
-use heed::{BytesDecode, BytesEncode, Database};
+use heed::types::{ByteSlice, DecodeIgnore, Str};
+use heed::Database;
 use roaring::RoaringBitmap;
 use serde::{Deserialize, Serialize};
 use serde_json::Value;
 use time::OffsetDateTime;
 
+use super::facet::delete::FacetsDelete;
 use super::ClearDocuments;
-use crate::error::{InternalError, SerializationError, UserError};
-use crate::heed_codec::facet::{
-    FacetLevelValueU32Codec, FacetStringLevelZeroValueCodec, FacetStringZeroBoundsValueCodec,
-};
+use crate::error::{InternalError, UserError};
+use crate::facet::FacetType;
+use crate::heed_codec::facet::FieldDocIdFacetCodec;
 use crate::heed_codec::CboRoaringBitmapCodec;
 use crate::index::{db_name, main_key};
 use crate::{
-    DocumentId, ExternalDocumentsIds, FieldId, FieldIdMapMissingEntry, Index, Result,
-    RoaringBitmapCodec, SmallString32, BEU32,
+    ExternalDocumentsIds, FieldId, FieldIdMapMissingEntry, Index, Result, RoaringBitmapCodec,
+    SmallString32, BEU32,
 };
 
 pub struct DeleteDocuments<'t, 'u, 'i> {
@@ -25,6 +26,8 @@ pub struct DeleteDocuments<'t, 'u, 'i> {
     index: &'i Index,
     external_documents_ids: ExternalDocumentsIds<'static>,
     to_delete_docids: RoaringBitmap,
+    #[cfg(test)]
+    disable_soft_deletion: bool,
 }
 
 #[derive(Debug, Clone, PartialEq, Eq, Serialize, Deserialize)]
@@ -45,9 +48,16 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
             index,
             external_documents_ids,
             to_delete_docids: RoaringBitmap::new(),
+            #[cfg(test)]
+            disable_soft_deletion: false,
         })
     }
 
+    #[cfg(test)]
+    pub fn disable_soft_deletion(&mut self, disable: bool) {
+        self.disable_soft_deletion = disable;
+    }
+
     pub fn delete_document(&mut self, docid: u32) {
         self.to_delete_docids.insert(docid);
     }
@@ -64,6 +74,7 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
     pub fn execute(mut self) -> Result<DocumentDeletionResult> {
         self.index.set_updated_at(self.wtxn, &OffsetDateTime::now_utc())?;
+
         // We retrieve the current documents ids that are in the database.
         let mut documents_ids = self.index.documents_ids(self.wtxn)?;
         let mut soft_deleted_docids = self.index.soft_deleted_documents_ids(self.wtxn)?;
@@ -127,7 +138,7 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
         // the `soft_deleted_documents_ids` bitmap and early exit.
         let size_used = self.index.used_size()?;
         let map_size = self.index.env.map_size()? as u64;
-        let nb_documents = self.index.number_of_documents(self.wtxn)?;
+        let nb_documents = self.index.number_of_documents(&self.wtxn)?;
         let nb_soft_deleted = soft_deleted_docids.len();
 
         let percentage_available = 100 - (size_used * 100 / map_size);
@@ -145,7 +156,20 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
         //    We run the deletion.
         // - With 100Go of disk and 50Go used including 15Go of soft-deleted documents
         //    We run the deletion.
-        if percentage_available > 10 && percentage_used_by_soft_deleted_documents < 10 {
+        let disable_soft_deletion = {
+            #[cfg(not(test))]
+            {
+                false
+            }
+            #[cfg(test)]
+            {
+                self.disable_soft_deletion
+            }
+        };
+        if !disable_soft_deletion
+            && percentage_available > 10
+            && percentage_used_by_soft_deleted_documents < 10
+        {
             self.index.put_soft_deleted_documents_ids(self.wtxn, &soft_deleted_docids)?;
             return Ok(DocumentDeletionResult {
                 deleted_documents: self.to_delete_docids.len(),
@@ -185,11 +209,11 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
             prefix_word_pair_proximity_docids,
             word_position_docids,
             word_prefix_position_docids,
-            facet_id_f64_docids,
+            facet_id_f64_docids: _,
+            facet_id_string_docids: _,
+            field_id_docid_facet_f64s: _,
+            field_id_docid_facet_strings: _,
             facet_id_exists_docids,
-            facet_id_string_docids,
-            field_id_docid_facet_f64s,
-            field_id_docid_facet_strings,
             documents,
         } = self.index;
@@ -440,54 +464,42 @@ impl<'t, 'u, 'i> DeleteDocuments<'t, 'u, 'i> {
             self.index.put_geo_faceted_documents_ids(self.wtxn, &geo_faceted_doc_ids)?;
         }
 
-        // We delete the documents ids that are under the facet field id values.
-        remove_docids_from_facet_field_id_docids(
-            self.wtxn,
-            facet_id_f64_docids,
-            &self.to_delete_docids,
-        )?;
+        for facet_type in [FacetType::Number, FacetType::String] {
+            let mut affected_facet_values = HashMap::new();
+            for field_id in self.index.faceted_fields_ids(self.wtxn)? {
+                // Remove docids from the number faceted documents ids
+                let mut docids =
+                    self.index.faceted_documents_ids(self.wtxn, field_id, facet_type)?;
+                docids -= &self.to_delete_docids;
+                self.index.put_faceted_documents_ids(self.wtxn, field_id, facet_type, &docids)?;
+
+                let facet_values = remove_docids_from_field_id_docid_facet_value(
+                    &self.index,
+                    self.wtxn,
+                    facet_type,
+                    field_id,
+                    &self.to_delete_docids,
+                )?;
+                if !facet_values.is_empty() {
+                    affected_facet_values.insert(field_id, facet_values);
+                }
+            }
+            FacetsDelete::new(
+                self.index,
+                facet_type,
+                affected_facet_values,
+                &self.to_delete_docids,
+            )
+            .execute(self.wtxn)?;
+        }
 
         // We delete the documents ids that are under the facet field id values.
-        remove_docids_from_facet_field_id_docids(
+        remove_docids_from_facet_id_exists_docids(
             self.wtxn,
             facet_id_exists_docids,
             &self.to_delete_docids,
         )?;
 
-        remove_docids_from_facet_field_id_string_docids(
-            self.wtxn,
-            facet_id_string_docids,
-            &self.to_delete_docids,
-        )?;
-
-        // Remove the documents ids from the faceted documents ids.
-        for field_id in self.index.faceted_fields_ids(self.wtxn)? {
-            // Remove docids from the number faceted documents ids
-            let mut docids = self.index.number_faceted_documents_ids(self.wtxn, field_id)?;
-            docids -= &self.to_delete_docids;
-            self.index.put_number_faceted_documents_ids(self.wtxn, field_id, &docids)?;
-
-            remove_docids_from_field_id_docid_facet_value(
-                self.wtxn,
-                field_id_docid_facet_f64s,
-                field_id,
-                &self.to_delete_docids,
-                |(_fid, docid, _value)| docid,
-            )?;
-
-            // Remove docids from the string faceted documents ids
-            let mut docids = self.index.string_faceted_documents_ids(self.wtxn, field_id)?;
-            docids -= &self.to_delete_docids;
-            self.index.put_string_faceted_documents_ids(self.wtxn, field_id, &docids)?;
-
-            remove_docids_from_field_id_docid_facet_value(
-                self.wtxn,
-                field_id_docid_facet_strings,
-                field_id,
-                &self.to_delete_docids,
-                |(_fid, docid, _value)| docid,
-            )?;
-        }
-
         Ok(DocumentDeletionResult {
             deleted_documents: self.to_delete_docids.len(),
             remaining_documents: documents_ids.len(),
@@ -553,95 +565,41 @@ fn remove_from_word_docids(
     Ok(())
 }
 
-fn remove_docids_from_field_id_docid_facet_value<'a, C, K, F, DC, V>(
+fn remove_docids_from_field_id_docid_facet_value<'i, 'a>(
+    index: &'i Index,
     wtxn: &'a mut heed::RwTxn,
-    db: &heed::Database<C, DC>,
+    facet_type: FacetType,
     field_id: FieldId,
     to_remove: &RoaringBitmap,
-    convert: F,
-) -> heed::Result<()>
-where
-    C: heed::BytesDecode<'a, DItem = K>,
-    DC: heed::BytesDecode<'a, DItem = V>,
-    F: Fn(K) -> DocumentId,
-{
+) -> heed::Result<HashSet<Vec<u8>>> {
+    let db = match facet_type {
+        FacetType::String => {
+            index.field_id_docid_facet_strings.remap_types::<ByteSlice, DecodeIgnore>()
+        }
+        FacetType::Number => {
+            index.field_id_docid_facet_f64s.remap_types::<ByteSlice, DecodeIgnore>()
+        }
+    };
+    let mut all_affected_facet_values = HashSet::default();
     let mut iter = db
-        .remap_key_type::<ByteSlice>()
         .prefix_iter_mut(wtxn, &field_id.to_be_bytes())?
-        .remap_key_type::<C>();
+        .remap_key_type::<FieldDocIdFacetCodec<ByteSlice>>();
 
     while let Some(result) = iter.next() {
-        let (key, _) = result?;
-        if to_remove.contains(convert(key)) {
+        let ((_, docid, facet_value), _) = result?;
+        if to_remove.contains(docid) {
+            if !all_affected_facet_values.contains(facet_value) {
+                all_affected_facet_values.insert(facet_value.to_owned());
+            }
             // safety: we don't keep references from inside the LMDB database.
             unsafe { iter.del_current()? };
         }
    }
 
-    Ok(())
+    Ok(all_affected_facet_values)
 }
 
-fn remove_docids_from_facet_field_id_string_docids<'a, C, D>(
-    wtxn: &'a mut heed::RwTxn,
-    db: &heed::Database<C, D>,
-    to_remove: &RoaringBitmap,
-) -> crate::Result<()> {
-    let db_name = Some(crate::index::db_name::FACET_ID_STRING_DOCIDS);
-    let mut iter = db.remap_types::<ByteSlice, ByteSlice>().iter_mut(wtxn)?;
-    while let Some(result) = iter.next() {
-        let (key, val) = result?;
-        match FacetLevelValueU32Codec::bytes_decode(key) {
-            Some(_) => {
-                // If we are able to parse this key it means it is a facet string group
-                // level key. We must then parse the value using the appropriate codec.
-                let (group, mut docids) =
-                    FacetStringZeroBoundsValueCodec::<CboRoaringBitmapCodec>::bytes_decode(val)
-                        .ok_or(SerializationError::Decoding { db_name })?;
-
-                let previous_len = docids.len();
-                docids -= to_remove;
-                if docids.is_empty() {
-                    // safety: we don't keep references from inside the LMDB database.
-                    unsafe { iter.del_current()? };
-                } else if docids.len() != previous_len {
-                    let key = key.to_owned();
-                    let val = &(group, docids);
-                    let value_bytes =
-                        FacetStringZeroBoundsValueCodec::<CboRoaringBitmapCodec>::bytes_encode(val)
-                            .ok_or(SerializationError::Encoding { db_name })?;
-
-                    // safety: we don't keep references from inside the LMDB database.
-                    unsafe { iter.put_current(&key, &value_bytes)? };
-                }
-            }
-            None => {
-                // The key corresponds to a level zero facet string.
-                let (original_value, mut docids) =
-                    FacetStringLevelZeroValueCodec::bytes_decode(val)
-                        .ok_or(SerializationError::Decoding { db_name })?;
-
-                let previous_len = docids.len();
-                docids -= to_remove;
-                if docids.is_empty() {
-                    // safety: we don't keep references from inside the LMDB database.
-                    unsafe { iter.del_current()? };
-                } else if docids.len() != previous_len {
-                    let key = key.to_owned();
-                    let val = &(original_value, docids);
-                    let value_bytes = FacetStringLevelZeroValueCodec::bytes_encode(val)
-                        .ok_or(SerializationError::Encoding { db_name })?;
-
-                    // safety: we don't keep references from inside the LMDB database.
-                    unsafe { iter.put_current(&key, &value_bytes)? };
-                }
-            }
-        }
-    }
-    Ok(())
-}
-
-fn remove_docids_from_facet_field_id_docids<'a, C>(
+fn remove_docids_from_facet_id_exists_docids<'a, C>(
     wtxn: &'a mut heed::RwTxn,
     db: &heed::Database<C, CboRoaringBitmapCodec>,
     to_remove: &RoaringBitmap,
@@ -675,12 +633,13 @@ mod tests {
 
     use super::*;
     use crate::index::tests::TempIndex;
-    use crate::Filter;
+    use crate::{db_snap, Filter};
 
     fn delete_documents<'t>(
         wtxn: &mut RwTxn<'t, '_>,
         index: &'t Index,
         external_ids: &[&str],
+        disable_soft_deletion: bool,
     ) -> Vec<u32> {
         let external_document_ids = index.external_documents_ids(&wtxn).unwrap();
         let ids_to_delete: Vec<u32> = external_ids
@@ -690,14 +649,14 @@ mod tests {
 
         // Delete some documents.
         let mut builder = DeleteDocuments::new(wtxn, index).unwrap();
+        builder.disable_soft_deletion(disable_soft_deletion);
        external_ids.iter().for_each(|id| drop(builder.delete_external_id(id)));
         builder.execute().unwrap();
 
         ids_to_delete
     }
 
-    #[test]
-    fn delete_documents_with_numbers_as_primary_key() {
+    fn delete_documents_with_numbers_as_primary_key_(disable_soft_deletion: bool) {
         let index = TempIndex::new();
 
         let mut wtxn = index.write_txn().unwrap();
@@ -717,19 +676,36 @@ mod tests {
         builder.delete_document(0);
         builder.delete_document(1);
         builder.delete_document(2);
+        builder.disable_soft_deletion(disable_soft_deletion);
         builder.execute().unwrap();
 
         wtxn.commit().unwrap();
 
+        // All these snapshots should be empty since the database was cleared
+        db_snap!(index, documents_ids, disable_soft_deletion);
+        db_snap!(index, word_docids, disable_soft_deletion);
+        db_snap!(index, word_pair_proximity_docids, disable_soft_deletion);
+        db_snap!(index, facet_id_exists_docids, disable_soft_deletion);
+        db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
+
         let rtxn = index.read_txn().unwrap();
 
         assert!(index.field_distribution(&rtxn).unwrap().is_empty());
     }
 
     #[test]
-    fn delete_documents_with_strange_primary_key() {
+    fn delete_documents_with_numbers_as_primary_key() {
+        delete_documents_with_numbers_as_primary_key_(true);
+        delete_documents_with_numbers_as_primary_key_(false);
+    }
+
+    fn delete_documents_with_strange_primary_key_(disable_soft_deletion: bool) {
         let index = TempIndex::new();
 
+        index
+            .update_settings(|settings| settings.set_searchable_fields(vec!["name".to_string()]))
+            .unwrap();
+
         let mut wtxn = index.write_txn().unwrap();
         index
             .add_documents_using_wtxn(
@@ -741,18 +717,33 @@ mod tests {
                 ]),
             )
             .unwrap();
+        wtxn.commit().unwrap();
 
+        let mut wtxn = index.write_txn().unwrap();
+
         // Delete not all of the documents but some of them.
         let mut builder = DeleteDocuments::new(&mut wtxn, &index).unwrap();
         builder.delete_external_id("0");
         builder.delete_external_id("1");
+        builder.disable_soft_deletion(disable_soft_deletion);
         builder.execute().unwrap();
 
         wtxn.commit().unwrap();
+
+        db_snap!(index, documents_ids, disable_soft_deletion);
+        db_snap!(index, word_docids, disable_soft_deletion);
+        db_snap!(index, word_pair_proximity_docids, disable_soft_deletion);
+        db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
     }
 
     #[test]
-    fn filtered_placeholder_search_should_not_return_deleted_documents() {
+    fn delete_documents_with_strange_primary_key() {
+        delete_documents_with_strange_primary_key_(true);
+        delete_documents_with_strange_primary_key_(false);
+    }
+
+    fn filtered_placeholder_search_should_not_return_deleted_documents_(
+        disable_soft_deletion: bool,
+    ) {
         let index = TempIndex::new();
 
         let mut wtxn = index.write_txn().unwrap();
@@ -760,7 +751,7 @@ mod tests {
         index
             .update_settings_using_wtxn(&mut wtxn, |settings| {
                 settings.set_primary_key(S("docid"));
-                settings.set_filterable_fields(hashset! { S("label") });
+                settings.set_filterable_fields(hashset! { S("label"), S("label2") });
             })
             .unwrap();
 
@@ -768,31 +759,34 @@ mod tests {
             .add_documents_using_wtxn(
                 &mut wtxn,
                 documents!([
-                    { "docid": "1_4",  "label": "sign" },
-                    { "docid": "1_5",  "label": "letter" },
-                    { "docid": "1_7",  "label": "abstract,cartoon,design,pattern" },
-                    { "docid": "1_36", "label": "drawing,painting,pattern" },
-                    { "docid": "1_37", "label": "art,drawing,outdoor" },
-                    { "docid": "1_38", "label": "aquarium,art,drawing" },
-                    { "docid": "1_39", "label": "abstract" },
-                    { "docid": "1_40", "label": "cartoon" },
-                    { "docid": "1_41", "label": "art,drawing" },
-                    { "docid": "1_42", "label": "art,pattern" },
-                    { "docid": "1_43", "label": "abstract,art,drawing,pattern" },
-                    { "docid": "1_44", "label": "drawing" },
-                    { "docid": "1_45", "label": "art" },
-                    { "docid": "1_46", "label": "abstract,colorfulness,pattern" },
-                    { "docid": "1_47", "label": "abstract,pattern" },
-                    { "docid": "1_52", "label": "abstract,cartoon" },
-                    { "docid": "1_57", "label": "abstract,drawing,pattern" },
-                    { "docid": "1_58", "label": "abstract,art,cartoon" },
-                    { "docid": "1_68", "label": "design" },
-                    { "docid": "1_69", "label": "geometry" }
+                    { "docid": "1_4",  "label": ["sign"] },
+                    { "docid": "1_5",  "label": ["letter"] },
+                    { "docid": "1_7",  "label": ["abstract","cartoon","design","pattern"] },
+                    { "docid": "1_36", "label": ["drawing","painting","pattern"] },
+                    { "docid": "1_37", "label": ["art","drawing","outdoor"] },
+                    { "docid": "1_38", "label": ["aquarium","art","drawing"] },
+                    { "docid": "1_39", "label": ["abstract"] },
+                    { "docid": "1_40", "label": ["cartoon"] },
+                    { "docid": "1_41", "label": ["art","drawing"] },
+                    { "docid": "1_42", "label": ["art","pattern"] },
+                    { "docid": "1_43", "label": ["abstract","art","drawing","pattern"] },
+                    { "docid": "1_44", "label": ["drawing"] },
+                    { "docid": "1_45", "label": ["art"] },
+                    { "docid": "1_46", "label": ["abstract","colorfulness","pattern"] },
+                    { "docid": "1_47", "label": ["abstract","pattern"] },
+                    { "docid": "1_52", "label": ["abstract","cartoon"] },
+                    { "docid": "1_57", "label": ["abstract","drawing","pattern"] },
+                    { "docid": "1_58", "label": ["abstract","art","cartoon"] },
+                    { "docid": "1_68", "label": ["design"] },
+                    { "docid": "1_69", "label": ["geometry"] },
+                    { "docid": "1_70", "label2": ["geometry", 1.2] },
+                    { "docid": "1_71", "label2": ["design", 2.2] },
+                    { "docid": "1_72", "label2": ["geometry", 1.2] }
                 ]),
             )
             .unwrap();
 
-        delete_documents(&mut wtxn, &index, &["1_4"]);
+        delete_documents(&mut wtxn, &index, &["1_4", "1_70", "1_72"], disable_soft_deletion);
 
         // Placeholder search with filter
         let filter = Filter::from_str("label = sign").unwrap().unwrap();
@@ -800,10 +794,22 @@ mod tests {
         assert!(results.documents_ids.is_empty());
 
         wtxn.commit().unwrap();
db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
db_snap!(index, word_docids, disable_soft_deletion);
db_snap!(index, facet_id_f64_docids, disable_soft_deletion);
db_snap!(index, word_pair_proximity_docids, disable_soft_deletion);
db_snap!(index, facet_id_exists_docids, disable_soft_deletion);
db_snap!(index, facet_id_string_docids, disable_soft_deletion);
} }
#[test] #[test]
fn placeholder_search_should_not_return_deleted_documents() { fn filtered_placeholder_search_should_not_return_deleted_documents() {
filtered_placeholder_search_should_not_return_deleted_documents_(true);
filtered_placeholder_search_should_not_return_deleted_documents_(false);
}
fn placeholder_search_should_not_return_deleted_documents_(disable_soft_deletion: bool) {
let index = TempIndex::new(); let index = TempIndex::new();
let mut wtxn = index.write_txn().unwrap(); let mut wtxn = index.write_txn().unwrap();
@ -817,31 +823,35 @@ mod tests {
.add_documents_using_wtxn( .add_documents_using_wtxn(
&mut wtxn, &mut wtxn,
documents!([ documents!([
{ "docid": "1_4", "label": "sign" }, { "docid": "1_4", "label": ["sign"] },
{ "docid": "1_5", "label": "letter" }, { "docid": "1_5", "label": ["letter"] },
{ "docid": "1_7", "label": "abstract,cartoon,design,pattern" }, { "docid": "1_7", "label": ["abstract","cartoon","design","pattern"] },
{ "docid": "1_36", "label": "drawing,painting,pattern" }, { "docid": "1_36", "label": ["drawing","painting","pattern"] },
{ "docid": "1_37", "label": "art,drawing,outdoor" }, { "docid": "1_37", "label": ["art","drawing","outdoor"] },
{ "docid": "1_38", "label": "aquarium,art,drawing" }, { "docid": "1_38", "label": ["aquarium","art","drawing"] },
{ "docid": "1_39", "label": "abstract" }, { "docid": "1_39", "label": ["abstract"] },
{ "docid": "1_40", "label": "cartoon" }, { "docid": "1_40", "label": ["cartoon"] },
{ "docid": "1_41", "label": "art,drawing" }, { "docid": "1_41", "label": ["art","drawing"] },
{ "docid": "1_42", "label": "art,pattern" }, { "docid": "1_42", "label": ["art","pattern"] },
{ "docid": "1_43", "label": "abstract,art,drawing,pattern" }, { "docid": "1_43", "label": ["abstract","art","drawing","pattern"] },
{ "docid": "1_44", "label": "drawing" }, { "docid": "1_44", "label": ["drawing"] },
{ "docid": "1_45", "label": "art" }, { "docid": "1_45", "label": ["art"] },
{ "docid": "1_46", "label": "abstract,colorfulness,pattern" }, { "docid": "1_46", "label": ["abstract","colorfulness","pattern"] },
{ "docid": "1_47", "label": "abstract,pattern" }, { "docid": "1_47", "label": ["abstract","pattern"] },
{ "docid": "1_52", "label": "abstract,cartoon" }, { "docid": "1_52", "label": ["abstract","cartoon"] },
{ "docid": "1_57", "label": "abstract,drawing,pattern" }, { "docid": "1_57", "label": ["abstract","drawing","pattern"] },
{ "docid": "1_58", "label": "abstract,art,cartoon" }, { "docid": "1_58", "label": ["abstract","art","cartoon"] },
{ "docid": "1_68", "label": "design" }, { "docid": "1_68", "label": ["design"] },
{ "docid": "1_69", "label": "geometry" } { "docid": "1_69", "label": ["geometry"] },
{ "docid": "1_70", "label2": ["geometry", 1.2] },
{ "docid": "1_71", "label2": ["design", 2.2] },
{ "docid": "1_72", "label2": ["geometry", 1.2] }
]), ]),
) )
.unwrap(); .unwrap();
let deleted_internal_ids = delete_documents(&mut wtxn, &index, &["1_4"]); let deleted_internal_ids =
delete_documents(&mut wtxn, &index, &["1_4"], disable_soft_deletion);
// Placeholder search // Placeholder search
let results = index.search(&wtxn).execute().unwrap(); let results = index.search(&wtxn).execute().unwrap();
@ -858,7 +868,12 @@ mod tests {
} }
#[test] #[test]
fn search_should_not_return_deleted_documents() { fn placeholder_search_should_not_return_deleted_documents() {
placeholder_search_should_not_return_deleted_documents_(true);
placeholder_search_should_not_return_deleted_documents_(false);
}
fn search_should_not_return_deleted_documents_(disable_soft_deletion: bool) {
let index = TempIndex::new(); let index = TempIndex::new();
let mut wtxn = index.write_txn().unwrap(); let mut wtxn = index.write_txn().unwrap();
@ -872,31 +887,35 @@ mod tests {
.add_documents_using_wtxn( .add_documents_using_wtxn(
&mut wtxn, &mut wtxn,
documents!([ documents!([
{"docid": "1_4", "label": "sign"}, { "docid": "1_4", "label": ["sign"] },
{"docid": "1_5", "label": "letter"}, { "docid": "1_5", "label": ["letter"] },
{"docid": "1_7", "label": "abstract,cartoon,design,pattern"}, { "docid": "1_7", "label": ["abstract","cartoon","design","pattern"] },
{"docid": "1_36","label": "drawing,painting,pattern"}, { "docid": "1_36", "label": ["drawing","painting","pattern"] },
{"docid": "1_37","label": "art,drawing,outdoor"}, { "docid": "1_37", "label": ["art","drawing","outdoor"] },
{"docid": "1_38","label": "aquarium,art,drawing"}, { "docid": "1_38", "label": ["aquarium","art","drawing"] },
{"docid": "1_39","label": "abstract"}, { "docid": "1_39", "label": ["abstract"] },
{"docid": "1_40","label": "cartoon"}, { "docid": "1_40", "label": ["cartoon"] },
{"docid": "1_41","label": "art,drawing"}, { "docid": "1_41", "label": ["art","drawing"] },
{"docid": "1_42","label": "art,pattern"}, { "docid": "1_42", "label": ["art","pattern"] },
{"docid": "1_43","label": "abstract,art,drawing,pattern"}, { "docid": "1_43", "label": ["abstract","art","drawing","pattern"] },
{"docid": "1_44","label": "drawing"}, { "docid": "1_44", "label": ["drawing"] },
{"docid": "1_45","label": "art"}, { "docid": "1_45", "label": ["art"] },
{"docid": "1_46","label": "abstract,colorfulness,pattern"}, { "docid": "1_46", "label": ["abstract","colorfulness","pattern"] },
{"docid": "1_47","label": "abstract,pattern"}, { "docid": "1_47", "label": ["abstract","pattern"] },
{"docid": "1_52","label": "abstract,cartoon"}, { "docid": "1_52", "label": ["abstract","cartoon"] },
{"docid": "1_57","label": "abstract,drawing,pattern"}, { "docid": "1_57", "label": ["abstract","drawing","pattern"] },
{"docid": "1_58","label": "abstract,art,cartoon"}, { "docid": "1_58", "label": ["abstract","art","cartoon"] },
{"docid": "1_68","label": "design"}, { "docid": "1_68", "label": ["design"] },
{"docid": "1_69","label": "geometry"} { "docid": "1_69", "label": ["geometry"] },
{ "docid": "1_70", "label2": ["geometry", 1.2] },
{ "docid": "1_71", "label2": ["design", 2.2] },
{ "docid": "1_72", "label2": ["geometry", 1.2] }
]), ]),
) )
.unwrap(); .unwrap();
let deleted_internal_ids = delete_documents(&mut wtxn, &index, &["1_7", "1_52"]); let deleted_internal_ids =
delete_documents(&mut wtxn, &index, &["1_7", "1_52"], disable_soft_deletion);
// search for abstract // search for abstract
let results = index.search(&wtxn).query("abstract").execute().unwrap(); let results = index.search(&wtxn).query("abstract").execute().unwrap();
@ -910,10 +929,19 @@ mod tests {
} }
wtxn.commit().unwrap(); wtxn.commit().unwrap();
db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
} }
#[test] #[test]
fn geo_filtered_placeholder_search_should_not_return_deleted_documents() { fn search_should_not_return_deleted_documents() {
search_should_not_return_deleted_documents_(true);
search_should_not_return_deleted_documents_(false);
}
fn geo_filtered_placeholder_search_should_not_return_deleted_documents_(
disable_soft_deletion: bool,
) {
let index = TempIndex::new(); let index = TempIndex::new();
let mut wtxn = index.write_txn().unwrap(); let mut wtxn = index.write_txn().unwrap();
@ -949,7 +977,8 @@ mod tests {
])).unwrap(); ])).unwrap();
let external_ids_to_delete = ["5", "6", "7", "12", "17", "19"]; let external_ids_to_delete = ["5", "6", "7", "12", "17", "19"];
let deleted_internal_ids = delete_documents(&mut wtxn, &index, &external_ids_to_delete); let deleted_internal_ids =
delete_documents(&mut wtxn, &index, &external_ids_to_delete, disable_soft_deletion);
// Placeholder search with geo filter // Placeholder search with geo filter
let filter = Filter::from_str("_geoRadius(50.6924, 3.1763, 20000)").unwrap().unwrap(); let filter = Filter::from_str("_geoRadius(50.6924, 3.1763, 20000)").unwrap().unwrap();
@ -964,10 +993,19 @@ mod tests {
} }
wtxn.commit().unwrap(); wtxn.commit().unwrap();
db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
db_snap!(index, facet_id_f64_docids, disable_soft_deletion);
db_snap!(index, facet_id_string_docids, disable_soft_deletion);
} }
#[test] #[test]
fn get_documents_should_not_return_deleted_documents() { fn geo_filtered_placeholder_search_should_not_return_deleted_documents() {
geo_filtered_placeholder_search_should_not_return_deleted_documents_(true);
geo_filtered_placeholder_search_should_not_return_deleted_documents_(false);
}
fn get_documents_should_not_return_deleted_documents_(disable_soft_deletion: bool) {
let index = TempIndex::new(); let index = TempIndex::new();
let mut wtxn = index.write_txn().unwrap(); let mut wtxn = index.write_txn().unwrap();
@ -981,32 +1019,36 @@ mod tests {
.add_documents_using_wtxn( .add_documents_using_wtxn(
&mut wtxn, &mut wtxn,
documents!([ documents!([
{ "docid": "1_4", "label": "sign" }, { "docid": "1_4", "label": ["sign"] },
{ "docid": "1_5", "label": "letter" }, { "docid": "1_5", "label": ["letter"] },
{ "docid": "1_7", "label": "abstract,cartoon,design,pattern" }, { "docid": "1_7", "label": ["abstract","cartoon","design","pattern"] },
{ "docid": "1_36", "label": "drawing,painting,pattern" }, { "docid": "1_36", "label": ["drawing","painting","pattern"] },
{ "docid": "1_37", "label": "art,drawing,outdoor" }, { "docid": "1_37", "label": ["art","drawing","outdoor"] },
{ "docid": "1_38", "label": "aquarium,art,drawing" }, { "docid": "1_38", "label": ["aquarium","art","drawing"] },
{ "docid": "1_39", "label": "abstract" }, { "docid": "1_39", "label": ["abstract"] },
{ "docid": "1_40", "label": "cartoon" }, { "docid": "1_40", "label": ["cartoon"] },
{ "docid": "1_41", "label": "art,drawing" }, { "docid": "1_41", "label": ["art","drawing"] },
{ "docid": "1_42", "label": "art,pattern" }, { "docid": "1_42", "label": ["art","pattern"] },
{ "docid": "1_43", "label": "abstract,art,drawing,pattern" }, { "docid": "1_43", "label": ["abstract","art","drawing","pattern"] },
{ "docid": "1_44", "label": "drawing" }, { "docid": "1_44", "label": ["drawing"] },
{ "docid": "1_45", "label": "art" }, { "docid": "1_45", "label": ["art"] },
{ "docid": "1_46", "label": "abstract,colorfulness,pattern" }, { "docid": "1_46", "label": ["abstract","colorfulness","pattern"] },
{ "docid": "1_47", "label": "abstract,pattern" }, { "docid": "1_47", "label": ["abstract","pattern"] },
{ "docid": "1_52", "label": "abstract,cartoon" }, { "docid": "1_52", "label": ["abstract","cartoon"] },
{ "docid": "1_57", "label": "abstract,drawing,pattern" }, { "docid": "1_57", "label": ["abstract","drawing","pattern"] },
{ "docid": "1_58", "label": "abstract,art,cartoon" }, { "docid": "1_58", "label": ["abstract","art","cartoon"] },
{ "docid": "1_68", "label": "design" }, { "docid": "1_68", "label": ["design"] },
{ "docid": "1_69", "label": "geometry" } { "docid": "1_69", "label": ["geometry"] },
{ "docid": "1_70", "label2": ["geometry", 1.2] },
{ "docid": "1_71", "label2": ["design", 2.2] },
{ "docid": "1_72", "label2": ["geometry", 1.2] }
]), ]),
) )
.unwrap(); .unwrap();
let deleted_external_ids = ["1_7", "1_52"]; let deleted_external_ids = ["1_7", "1_52"];
let deleted_internal_ids = delete_documents(&mut wtxn, &index, &deleted_external_ids); let deleted_internal_ids =
delete_documents(&mut wtxn, &index, &deleted_external_ids, disable_soft_deletion);
// list all documents // list all documents
let results = index.all_documents(&wtxn).unwrap(); let results = index.all_documents(&wtxn).unwrap();
@ -1036,10 +1078,17 @@ mod tests {
} }
wtxn.commit().unwrap(); wtxn.commit().unwrap();
db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
} }
#[test] #[test]
fn stats_should_not_return_deleted_documents() { fn get_documents_should_not_return_deleted_documents() {
get_documents_should_not_return_deleted_documents_(true);
get_documents_should_not_return_deleted_documents_(false);
}
fn stats_should_not_return_deleted_documents_(disable_soft_deletion: bool) {
let index = TempIndex::new(); let index = TempIndex::new();
let mut wtxn = index.write_txn().unwrap(); let mut wtxn = index.write_txn().unwrap();
@ -1051,29 +1100,29 @@ mod tests {
.unwrap(); .unwrap();
index.add_documents_using_wtxn(&mut wtxn, documents!([ index.add_documents_using_wtxn(&mut wtxn, documents!([
{ "docid": "1_4", "label": "sign"}, { "docid": "1_4", "label": ["sign"]},
{ "docid": "1_5", "label": "letter"}, { "docid": "1_5", "label": ["letter"]},
{ "docid": "1_7", "label": "abstract,cartoon,design,pattern", "title": "Mickey Mouse"}, { "docid": "1_7", "label": ["abstract","cartoon","design","pattern"], "title": "Mickey Mouse"},
{ "docid": "1_36", "label": "drawing,painting,pattern"}, { "docid": "1_36", "label": ["drawing","painting","pattern"]},
{ "docid": "1_37", "label": "art,drawing,outdoor"}, { "docid": "1_37", "label": ["art","drawing","outdoor"]},
{ "docid": "1_38", "label": "aquarium,art,drawing", "title": "Nemo"}, { "docid": "1_38", "label": ["aquarium","art","drawing"], "title": "Nemo"},
{ "docid": "1_39", "label": "abstract"}, { "docid": "1_39", "label": ["abstract"]},
{ "docid": "1_40", "label": "cartoon"}, { "docid": "1_40", "label": ["cartoon"]},
{ "docid": "1_41", "label": "art,drawing"}, { "docid": "1_41", "label": ["art","drawing"]},
{ "docid": "1_42", "label": "art,pattern"}, { "docid": "1_42", "label": ["art","pattern"]},
{ "docid": "1_43", "label": "abstract,art,drawing,pattern", "number": 32i32}, { "docid": "1_43", "label": ["abstract","art","drawing","pattern"], "number": 32i32},
{ "docid": "1_44", "label": "drawing", "number": 44i32}, { "docid": "1_44", "label": ["drawing"], "number": 44i32},
{ "docid": "1_45", "label": "art"}, { "docid": "1_45", "label": ["art"]},
{ "docid": "1_46", "label": "abstract,colorfulness,pattern"}, { "docid": "1_46", "label": ["abstract","colorfulness","pattern"]},
{ "docid": "1_47", "label": "abstract,pattern"}, { "docid": "1_47", "label": ["abstract","pattern"]},
{ "docid": "1_52", "label": "abstract,cartoon"}, { "docid": "1_52", "label": ["abstract","cartoon"]},
{ "docid": "1_57", "label": "abstract,drawing,pattern"}, { "docid": "1_57", "label": ["abstract","drawing","pattern"]},
{ "docid": "1_58", "label": "abstract,art,cartoon"}, { "docid": "1_58", "label": ["abstract","art","cartoon"]},
{ "docid": "1_68", "label": "design"}, { "docid": "1_68", "label": ["design"]},
{ "docid": "1_69", "label": "geometry"} { "docid": "1_69", "label": ["geometry"]}
])).unwrap(); ])).unwrap();
delete_documents(&mut wtxn, &index, &["1_7", "1_52"]); delete_documents(&mut wtxn, &index, &["1_7", "1_52"], disable_soft_deletion);
// count internal documents // count internal documents
let results = index.number_of_documents(&wtxn).unwrap(); let results = index.number_of_documents(&wtxn).unwrap();
@ -1086,5 +1135,13 @@ mod tests {
assert_eq!(Some(&2), results.get("number")); assert_eq!(Some(&2), results.get("number"));
wtxn.commit().unwrap(); wtxn.commit().unwrap();
db_snap!(index, soft_deleted_documents_ids, disable_soft_deletion);
}
#[test]
fn stats_should_not_return_deleted_documents() {
stats_should_not_return_deleted_documents_(true);
stats_should_not_return_deleted_documents_(false);
} }
} }
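All of the deletion tests above were refactored following the same pattern, so that each scenario exercises both the soft-deletion and the hard-deletion code paths. A minimal sketch of that pattern (the `my_test` names are placeholders for illustration, not part of the diff):

```
// Sketch of the test refactor applied above: the original #[test] becomes a
// private helper parameterised over `disable_soft_deletion`, and a #[test]
// wrapper runs it once per mode so that both deletion strategies are
// snapshot-tested (the db_snap! calls use the flag as a snapshot suffix).
fn my_test_(disable_soft_deletion: bool) {
    // ... build the index, call
    // `builder.disable_soft_deletion(disable_soft_deletion)` before
    // `builder.execute()`, then `db_snap!(index, ..., disable_soft_deletion)` ...
    let _ = disable_soft_deletion;
}

#[test]
fn my_test() {
    my_test_(true);
    my_test_(false);
}
```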

@@ -0,0 +1,438 @@
use std::borrow::Cow;
use std::fs::File;
use grenad::CompressionType;
use heed::types::ByteSlice;
use heed::{BytesEncode, Error, RoTxn, RwTxn};
use roaring::RoaringBitmap;
use super::{FACET_GROUP_SIZE, FACET_MIN_LEVEL_SIZE};
use crate::facet::FacetType;
use crate::heed_codec::facet::{
FacetGroupKey, FacetGroupKeyCodec, FacetGroupValue, FacetGroupValueCodec,
};
use crate::heed_codec::ByteSliceRefCodec;
use crate::update::index_documents::{create_writer, writer_into_reader};
use crate::{CboRoaringBitmapCodec, FieldId, Index, Result};
/// Algorithm to insert elements into the `facet_id_(string/f64)_docids` databases
/// by rebuilding the database "from scratch".
///
/// First, the new elements are inserted into the level 0 of the database. Then, the
/// higher levels are cleared and recomputed from the content of level 0.
///
/// Finally, the `faceted_documents_ids` value in the main database of `Index`
/// is updated to contain the new set of faceted documents.
pub struct FacetsUpdateBulk<'i> {
index: &'i Index,
group_size: u8,
min_level_size: u8,
facet_type: FacetType,
field_ids: Vec<FieldId>,
// None if level 0 does not need to be updated
new_data: Option<grenad::Reader<File>>,
}
impl<'i> FacetsUpdateBulk<'i> {
pub fn new(
index: &'i Index,
field_ids: Vec<FieldId>,
facet_type: FacetType,
new_data: grenad::Reader<File>,
group_size: u8,
min_level_size: u8,
) -> FacetsUpdateBulk<'i> {
FacetsUpdateBulk {
index,
field_ids,
group_size,
min_level_size,
facet_type,
new_data: Some(new_data),
}
}
pub fn new_not_updating_level_0(
index: &'i Index,
field_ids: Vec<FieldId>,
facet_type: FacetType,
) -> FacetsUpdateBulk<'i> {
FacetsUpdateBulk {
index,
field_ids,
group_size: FACET_GROUP_SIZE,
min_level_size: FACET_MIN_LEVEL_SIZE,
facet_type,
new_data: None,
}
}
#[logging_timer::time("FacetsUpdateBulk::{}")]
pub fn execute(self, wtxn: &mut heed::RwTxn) -> Result<()> {
let Self { index, field_ids, group_size, min_level_size, facet_type, new_data } = self;
let db = match facet_type {
FacetType::String => index
.facet_id_string_docids
.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
FacetType::Number => {
index.facet_id_f64_docids.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>()
}
};
let inner = FacetsUpdateBulkInner { db, new_data, group_size, min_level_size };
inner.update(wtxn, &field_ids, |wtxn, field_id, all_docids| {
index.put_faceted_documents_ids(wtxn, field_id, facet_type, &all_docids)?;
Ok(())
})?;
Ok(())
}
}
/// Implementation of `FacetsUpdateBulk` that is independent of milli's `Index` type
pub(crate) struct FacetsUpdateBulkInner<R: std::io::Read + std::io::Seek> {
pub db: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
pub new_data: Option<grenad::Reader<R>>,
pub group_size: u8,
pub min_level_size: u8,
}
impl<R: std::io::Read + std::io::Seek> FacetsUpdateBulkInner<R> {
pub fn update(
mut self,
wtxn: &mut RwTxn,
field_ids: &[u16],
mut handle_all_docids: impl FnMut(&mut RwTxn, FieldId, RoaringBitmap) -> Result<()>,
) -> Result<()> {
self.update_level0(wtxn)?;
for &field_id in field_ids.iter() {
self.clear_levels(wtxn, field_id)?;
}
for &field_id in field_ids.iter() {
let (level_readers, all_docids) = self.compute_levels_for_field_id(field_id, &wtxn)?;
handle_all_docids(wtxn, field_id, all_docids)?;
for level_reader in level_readers {
let mut cursor = level_reader.into_cursor()?;
while let Some((k, v)) = cursor.move_on_next()? {
self.db.remap_types::<ByteSlice, ByteSlice>().put(wtxn, k, v)?;
}
}
}
Ok(())
}
fn clear_levels(&self, wtxn: &mut heed::RwTxn, field_id: FieldId) -> Result<()> {
let left = FacetGroupKey::<&[u8]> { field_id, level: 1, left_bound: &[] };
let right = FacetGroupKey::<&[u8]> { field_id, level: u8::MAX, left_bound: &[] };
let range = left..=right;
self.db.delete_range(wtxn, &range).map(drop)?;
Ok(())
}
fn update_level0(&mut self, wtxn: &mut RwTxn) -> Result<()> {
let new_data = match self.new_data.take() {
Some(x) => x,
None => return Ok(()),
};
if self.db.is_empty(wtxn)? {
let mut buffer = Vec::new();
let mut database = self.db.iter_mut(wtxn)?.remap_types::<ByteSlice, ByteSlice>();
let mut cursor = new_data.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? {
buffer.clear();
// the group size for level 0
buffer.push(1);
// then we extend the buffer with the docids bitmap
buffer.extend_from_slice(value);
unsafe { database.append(key, &buffer)? };
}
} else {
let mut buffer = Vec::new();
let database = self.db.remap_types::<ByteSlice, ByteSlice>();
let mut cursor = new_data.into_cursor()?;
while let Some((key, value)) = cursor.move_on_next()? {
// the value is a CboRoaringBitmap, but I still need to prepend the
// group size for level 0 (= 1) to it
buffer.clear();
buffer.push(1);
// then we extend the buffer with the docids bitmap
match database.get(wtxn, key)? {
Some(prev_value) => {
let old_bitmap = &prev_value[1..];
CboRoaringBitmapCodec::merge_into(
&[Cow::Borrowed(value), Cow::Borrowed(old_bitmap)],
&mut buffer,
)?;
}
None => {
buffer.extend_from_slice(value);
}
};
database.put(wtxn, key, &buffer)?;
}
}
Ok(())
}
fn compute_levels_for_field_id(
&self,
field_id: FieldId,
txn: &RoTxn,
) -> Result<(Vec<grenad::Reader<File>>, RoaringBitmap)> {
let mut all_docids = RoaringBitmap::new();
let subwriters = self.compute_higher_levels(txn, field_id, 32, &mut |bitmaps, _| {
for bitmap in bitmaps {
all_docids |= bitmap;
}
Ok(())
})?;
Ok((subwriters, all_docids))
}
fn read_level_0<'t>(
&self,
rtxn: &'t RoTxn,
field_id: u16,
handle_group: &mut dyn FnMut(&[RoaringBitmap], &'t [u8]) -> Result<()>,
) -> Result<()> {
// we read the elements one by one and
// 1. keep track of the left bound
// 2. fill the `bitmaps` vector to give it to level 1 once `level_group_size` elements were read
let mut bitmaps = vec![];
let mut level_0_prefix = vec![];
level_0_prefix.extend_from_slice(&field_id.to_be_bytes());
level_0_prefix.push(0);
let level_0_iter = self
.db
.as_polymorph()
.prefix_iter::<_, ByteSlice, ByteSlice>(rtxn, level_0_prefix.as_slice())?
.remap_types::<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>();
let mut left_bound: &[u8] = &[];
let mut first_iteration_for_new_group = true;
for el in level_0_iter {
let (key, value) = el?;
let bound = key.left_bound;
let docids = value.bitmap;
if first_iteration_for_new_group {
left_bound = bound;
first_iteration_for_new_group = false;
}
bitmaps.push(docids);
if bitmaps.len() == self.group_size as usize {
handle_group(&bitmaps, left_bound)?;
first_iteration_for_new_group = true;
bitmaps.clear();
}
}
// don't forget to give the leftover bitmaps as well
if !bitmaps.is_empty() {
handle_group(&bitmaps, left_bound)?;
bitmaps.clear();
}
Ok(())
}
/// Compute the content of the database levels from its level 0 for the given field id.
///
/// ## Returns:
/// A vector of grenad::Reader. The reader at index `i` corresponds to the elements of level `i + 1`
/// that must be inserted into the database.
fn compute_higher_levels<'t>(
&self,
rtxn: &'t RoTxn,
field_id: u16,
level: u8,
handle_group: &mut dyn FnMut(&[RoaringBitmap], &'t [u8]) -> Result<()>,
) -> Result<Vec<grenad::Reader<File>>> {
if level == 0 {
self.read_level_0(rtxn, field_id, handle_group)?;
// Level 0 is already in the database
return Ok(vec![]);
}
// level >= 1
// we compute each element of this level based on the elements of the level below it
// once we have computed `level_group_size` elements, we give the left bound
// of those elements, and their bitmaps, to the level above
let mut cur_writer = create_writer(CompressionType::None, None, tempfile::tempfile()?);
let mut cur_writer_len: usize = 0;
let mut group_sizes = vec![];
let mut left_bounds = vec![];
let mut bitmaps = vec![];
// compute the levels below
// in the callback, we fill `cur_writer` with the correct elements for this level
let mut sub_writers = self.compute_higher_levels(
rtxn,
field_id,
level - 1,
&mut |sub_bitmaps, left_bound| {
let mut combined_bitmap = RoaringBitmap::default();
for bitmap in sub_bitmaps {
combined_bitmap |= bitmap;
}
group_sizes.push(sub_bitmaps.len() as u8);
left_bounds.push(left_bound);
bitmaps.push(combined_bitmap);
if bitmaps.len() != self.group_size as usize {
return Ok(());
}
let left_bound = left_bounds.first().unwrap();
handle_group(&bitmaps, left_bound)?;
for ((bitmap, left_bound), group_size) in
bitmaps.drain(..).zip(left_bounds.drain(..)).zip(group_sizes.drain(..))
{
let key = FacetGroupKey { field_id, level, left_bound };
let key = FacetGroupKeyCodec::<ByteSliceRefCodec>::bytes_encode(&key)
.ok_or(Error::Encoding)?;
let value = FacetGroupValue { size: group_size, bitmap };
let value =
FacetGroupValueCodec::bytes_encode(&value).ok_or(Error::Encoding)?;
cur_writer.insert(key, value)?;
cur_writer_len += 1;
}
Ok(())
},
)?;
// don't forget to insert the leftover elements into the writer as well
// but only do so if the current number of elements to be inserted into this
// level could grow to the minimum level size
if !bitmaps.is_empty() && (cur_writer_len >= self.min_level_size as usize - 1) {
// the length of bitmaps is between 0 and group_size
assert!(bitmaps.len() < self.group_size as usize);
assert!(cur_writer_len > 0);
let left_bound = left_bounds.first().unwrap();
handle_group(&bitmaps, left_bound)?;
// Note: how many bitmaps are there here?
for ((bitmap, left_bound), group_size) in
bitmaps.drain(..).zip(left_bounds.drain(..)).zip(group_sizes.drain(..))
{
let key = FacetGroupKey { field_id, level, left_bound };
let key = FacetGroupKeyCodec::<ByteSliceRefCodec>::bytes_encode(&key)
.ok_or(Error::Encoding)?;
let value = FacetGroupValue { size: group_size, bitmap };
let value = FacetGroupValueCodec::bytes_encode(&value).ok_or(Error::Encoding)?;
cur_writer.insert(key, value)?;
cur_writer_len += 1;
}
}
// if we inserted enough elements to reach the minimum level size, then we push the writer
if cur_writer_len as u8 >= self.min_level_size {
sub_writers.push(writer_into_reader(cur_writer)?);
} else {
// otherwise, if there are still leftover elements, we give them to the level above
// this is necessary in order to get the union of all docids
if !bitmaps.is_empty() {
handle_group(&bitmaps, left_bounds.first().unwrap())?;
}
}
return Ok(sub_writers);
}
}
#[cfg(test)]
mod tests {
use std::iter::once;
use roaring::RoaringBitmap;
use crate::heed_codec::facet::OrderedF64Codec;
use crate::milli_snap;
use crate::update::facet::tests::FacetIndex;
#[test]
fn insert() {
let test = |name: &str, group_size: u8, min_level_size: u8| {
let index =
FacetIndex::<OrderedF64Codec>::new(group_size, 0 /*NA*/, min_level_size);
let mut elements = Vec::<((u16, f64), RoaringBitmap)>::new();
for i in 0..1_000u32 {
// field id = 0, left_bound = i, docids = [i]
elements.push(((0, i as f64), once(i).collect()));
}
for i in 0..100u32 {
// field id = 1, left_bound = i, docids = [i]
elements.push(((1, i as f64), once(i).collect()));
}
let mut wtxn = index.env.write_txn().unwrap();
index.bulk_insert(&mut wtxn, &[0, 1], elements.iter());
index.verify_structure_validity(&wtxn, 0);
index.verify_structure_validity(&wtxn, 1);
wtxn.commit().unwrap();
milli_snap!(format!("{index}"), name);
};
test("default", 4, 5);
test("small_group_small_min_level", 2, 2);
test("small_group_large_min_level", 2, 128);
test("large_group_small_min_level", 16, 2);
test("odd_group_odd_min_level", 7, 3);
}
#[test]
fn insert_delete_field_insert() {
let test = |name: &str, group_size: u8, min_level_size: u8| {
let index =
FacetIndex::<OrderedF64Codec>::new(group_size, 0 /*NA*/, min_level_size);
let mut wtxn = index.env.write_txn().unwrap();
let mut elements = Vec::<((u16, f64), RoaringBitmap)>::new();
for i in 0..100u32 {
// field id = 0, left_bound = i, docids = [i]
elements.push(((0, i as f64), once(i).collect()));
}
for i in 0..100u32 {
// field id = 1, left_bound = i, docids = [i]
elements.push(((1, i as f64), once(i).collect()));
}
index.bulk_insert(&mut wtxn, &[0, 1], elements.iter());
index.verify_structure_validity(&wtxn, 0);
index.verify_structure_validity(&wtxn, 1);
// delete all the elements for the facet id 0
for i in 0..100u32 {
index.delete_single_docid(&mut wtxn, 0, &(i as f64), i);
}
index.verify_structure_validity(&wtxn, 0);
index.verify_structure_validity(&wtxn, 1);
let mut elements = Vec::<((u16, f64), RoaringBitmap)>::new();
// then add some elements again for the facet id 1
for i in 0..110u32 {
// field id = 1, left_bound = i, docids = [i]
elements.push(((1, i as f64), once(i).collect()));
}
index.verify_structure_validity(&wtxn, 0);
index.verify_structure_validity(&wtxn, 1);
index.bulk_insert(&mut wtxn, &[0, 1], elements.iter());
wtxn.commit().unwrap();
milli_snap!(format!("{index}"), name);
};
test("default", 4, 5);
test("small_group_small_min_level", 2, 2);
test("small_group_large_min_level", 2, 128);
test("large_group_small_min_level", 16, 2);
test("odd_group_odd_min_level", 7, 3);
}
}
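As a sanity check on the structure that `FacetsUpdateBulkInner` builds, here is a back-of-the-envelope sketch of how many levels end up in the database, assuming (as the code above does) that a level is only written when it contains at least `min_level_size` entries. The `highest_level` helper below is illustrative, not part of the PR:

```
// Illustrative helper (not in the PR): how many levels does the bulk indexer
// build on top of `len` level-0 entries?
fn highest_level(mut len: u64, group_size: u64, min_level_size: u64) -> u8 {
    // sanity bounds, matching the clamping done by FacetIndex::new
    assert!(group_size >= 2 && min_level_size >= 2);
    let mut level = 0;
    loop {
        // each group of `group_size` nodes becomes one node of the level above
        let next = (len + group_size - 1) / group_size;
        // a level is only written if it reaches `min_level_size` entries
        if next < min_level_size {
            return level;
        }
        len = next;
        level += 1;
    }
}

fn main() {
    // With the defaults used in the `insert` test above (group_size = 4,
    // min_level_size = 5), 1000 level-0 entries produce levels of 250, 63,
    // and 16 entries; a level 4 would only have 4 entries (< 5), so level 3
    // is the highest level stored.
    assert_eq!(highest_level(1000, 4, 5), 3);
}
```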

@@ -0,0 +1,239 @@
use std::collections::{HashMap, HashSet};
use heed::RwTxn;
use log::debug;
use roaring::RoaringBitmap;
use time::OffsetDateTime;
use super::{FACET_GROUP_SIZE, FACET_MAX_GROUP_SIZE, FACET_MIN_LEVEL_SIZE};
use crate::facet::FacetType;
use crate::heed_codec::facet::{FacetGroupKey, FacetGroupKeyCodec, FacetGroupValueCodec};
use crate::heed_codec::ByteSliceRefCodec;
use crate::update::{FacetsUpdateBulk, FacetsUpdateIncrementalInner};
use crate::{FieldId, Index, Result};
/// A builder used to remove elements from the `facet_id_string_docids` or `facet_id_f64_docids` databases.
///
/// Depending on the number of removed elements and the existing size of the database, we use either
/// a bulk delete method or an incremental delete method.
pub struct FacetsDelete<'i, 'b> {
index: &'i Index,
database: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
facet_type: FacetType,
affected_facet_values: HashMap<FieldId, HashSet<Vec<u8>>>,
docids_to_delete: &'b RoaringBitmap,
group_size: u8,
max_group_size: u8,
min_level_size: u8,
}
impl<'i, 'b> FacetsDelete<'i, 'b> {
pub fn new(
index: &'i Index,
facet_type: FacetType,
affected_facet_values: HashMap<FieldId, HashSet<Vec<u8>>>,
docids_to_delete: &'b RoaringBitmap,
) -> Self {
let database = match facet_type {
FacetType::String => index
.facet_id_string_docids
.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
FacetType::Number => {
index.facet_id_f64_docids.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>()
}
};
Self {
index,
database,
facet_type,
affected_facet_values,
docids_to_delete,
group_size: FACET_GROUP_SIZE,
max_group_size: FACET_MAX_GROUP_SIZE,
min_level_size: FACET_MIN_LEVEL_SIZE,
}
}
pub fn execute(self, wtxn: &mut RwTxn) -> Result<()> {
debug!("Computing and writing the facet values levels docids into LMDB on disk...");
self.index.set_updated_at(wtxn, &OffsetDateTime::now_utc())?;
for (field_id, affected_facet_values) in self.affected_facet_values {
// This is an incorrect condition, since we assume that the length of the database is equal
// to the number of facet values for the given field_id. It means that in some cases, we might
// wrongly choose the incremental indexer over the bulk indexer. But the only case where that could
// really be a performance problem is when we fully delete a large ratio of all facet values for
// each field id. This would almost never happen. Still, to be overly cautious, I have added a
// 2x penalty to the incremental indexer. That is, instead of assuming a 70x worst-case performance
// penalty for the incremental indexer, we assume a 150x worst-case penalty.
if affected_facet_values.len() >= (self.database.len(wtxn)? / 150) {
// Bulk delete
let mut modified = false;
for facet_value in affected_facet_values {
let key =
FacetGroupKey { field_id, level: 0, left_bound: facet_value.as_slice() };
let mut old = self.database.get(wtxn, &key)?.unwrap();
let previous_len = old.bitmap.len();
old.bitmap -= self.docids_to_delete;
if old.bitmap.is_empty() {
modified = true;
self.database.delete(wtxn, &key)?;
} else if old.bitmap.len() != previous_len {
modified = true;
self.database.put(wtxn, &key, &old)?;
}
}
if modified {
let builder = FacetsUpdateBulk::new_not_updating_level_0(
self.index,
vec![field_id],
self.facet_type,
);
builder.execute(wtxn)?;
}
} else {
// Incremental
let inc = FacetsUpdateIncrementalInner {
db: self.database,
group_size: self.group_size,
min_level_size: self.min_level_size,
max_group_size: self.max_group_size,
};
for facet_value in affected_facet_values {
inc.delete(wtxn, field_id, facet_value.as_slice(), &self.docids_to_delete)?;
}
}
}
Ok(())
}
}
#[cfg(test)]
mod tests {
use std::iter::FromIterator;
use big_s::S;
use maplit::hashset;
use roaring::RoaringBitmap;
use crate::db_snap;
use crate::documents::documents_batch_reader_from_objects;
use crate::index::tests::TempIndex;
use crate::update::DeleteDocuments;
#[test]
fn delete_mixed_incremental_and_bulk() {
// The point of this test is to create an index populated with documents
// containing different filterable attributes. Then, we delete a bunch of documents
// such that a mix of the incremental and bulk indexer is used (depending on the field id)
let index = TempIndex::new_with_map_size(4096 * 1000 * 100);
index
.update_settings(|settings| {
settings.set_filterable_fields(
hashset! { S("id"), S("label"), S("timestamp"), S("colour") },
);
})
.unwrap();
let mut documents = vec![];
for i in 0..1000 {
documents.push(
serde_json::json! {
{
"id": i,
"label": i / 10,
"colour": i / 100,
"timestamp": i / 2,
}
}
.as_object()
.unwrap()
.clone(),
);
}
let documents = documents_batch_reader_from_objects(documents);
index.add_documents(documents).unwrap();
db_snap!(index, facet_id_f64_docids, 1);
db_snap!(index, number_faceted_documents_ids, 1);
let mut wtxn = index.env.write_txn().unwrap();
let mut builder = DeleteDocuments::new(&mut wtxn, &index).unwrap();
builder.disable_soft_deletion(true);
builder.delete_documents(&RoaringBitmap::from_iter(0..100));
// by deleting the first 100 documents, we expect that:
// - the "id" part of the DB will be updated in bulk, since #affected_facet_value = 100 which is > database_len / 150 (= 13)
// - the "label" part will be updated incrementally, since #affected_facet_value = 10 which is < 13
// - the "colour" part will also be updated incrementally, since #affected_values = 1 which is < 13
// - the "timestamp" part will be updated in bulk, since #affected_values = 50 which is > 13
// This has to be verified manually by inserting breakpoint/adding print statements to the code when running the test
builder.execute().unwrap();
wtxn.commit().unwrap();
db_snap!(index, soft_deleted_documents_ids, @"[]");
db_snap!(index, facet_id_f64_docids, 2);
db_snap!(index, number_faceted_documents_ids, 2);
}
}
#[allow(unused)]
#[cfg(test)]
mod comparison_bench {
use std::iter::once;
use rand::Rng;
use roaring::RoaringBitmap;
use crate::heed_codec::facet::OrderedF64Codec;
use crate::update::facet::tests::FacetIndex;
// This is a simple test to get an intuition on the relative speed
// of the incremental vs. bulk indexer.
//
// The benchmark shows the worst-case scenario for the incremental indexer, since
// each facet value contains only one document ID.
//
// In that scenario, it appears that the incremental indexer is about 70 times slower than the
// bulk indexer.
// #[test]
fn benchmark_facet_indexing_delete() {
let mut r = rand::thread_rng();
for i in 1..=20 {
let size = 50_000 * i;
let index = FacetIndex::<OrderedF64Codec>::new(4, 8, 5);
let mut txn = index.env.write_txn().unwrap();
let mut elements = Vec::<((u16, f64), RoaringBitmap)>::new();
for i in 0..size {
// field id = 0, left_bound = i, docids = [i]
elements.push(((0, i as f64), once(i).collect()));
}
let timer = std::time::Instant::now();
index.bulk_insert(&mut txn, &[0], elements.iter());
let time_spent = timer.elapsed().as_millis();
println!("bulk {size} : {time_spent}ms");
txn.commit().unwrap();
for nbr_doc in [1, 100, 1000, 10_000] {
let mut txn = index.env.write_txn().unwrap();
let timer = std::time::Instant::now();
//
// delete `nbr_doc` documents, one randomly chosen docid at a time
//
for _ in 0..nbr_doc {
let deleted_u32 = r.gen::<u32>() % size;
let deleted_f64 = deleted_u32 as f64;
index.delete_single_docid(&mut txn, 0, &deleted_f64, deleted_u32)
}
let time_spent = timer.elapsed().as_millis();
println!(" delete {nbr_doc} : {time_spent}ms");
txn.abort().unwrap();
}
}
}
}
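The bulk-vs-incremental decision in `FacetsDelete::execute` boils down to a single comparison. As a sketch (the helper name is made up for illustration; the real code inlines this check):

```
// Illustrative restatement of the strategy choice in `FacetsDelete::execute`
// above: delete incrementally only when the number of affected facet values
// is small relative to the database, using the 150x worst-case factor
// discussed in the comments.
fn use_incremental_delete(affected_facet_values: u64, database_len: u64) -> bool {
    affected_facet_values < database_len / 150
}
```

Per the comments in `delete_mixed_incremental_and_bulk`, `database_len / 150` works out to roughly 13 for that test index, which is why the "label" and "colour" parts go through the incremental path while the "id" and "timestamp" parts are rebuilt in bulk.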

File diff suppressed because it is too large

@@ -0,0 +1,499 @@
/*!
This module implements two different algorithms for updating the `facet_id_string_docids`
and `facet_id_f64_docids` databases. The first algorithm is a "bulk" algorithm, meaning that
it recreates the database from scratch when new elements are added to it. The second algorithm
is incremental: it modifies the database as little as possible.
The databases must be able to return results for queries such as:
1. Filter: find all the document ids that have a facet value greater than X and/or smaller than Y
2. Min/Max: find the minimum/maximum facet value among these document ids
3. Sort: sort these document ids by increasing/decreasing facet values
4. Distribution: given some document ids, make a list of each facet value
found in these documents along with the number of documents that contain it
The algorithms that implement these queries are found in the `src/search/facet` folder.
To make these queries fast to compute, the database adopts a tree structure:
```ignore
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │           "ab" (2)            │           "gaf" (2)           │   "woz" (1)   │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │   "ab" (2)    │   "ba" (2)    │   "gaf" (2)   │  "form" (2)   │   "woz" (2)   │
│Level 1│   │               │               │               │               │               │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │ "ab"  │ "ac"  │ "ba"  │ "bac" │ "gaf" │ "gal" │"form" │ "wow" │ "woz" │ "zz"  │
│Level 0│   │       │       │       │       │       │       │       │       │       │       │
└───────┘   │[a, b] │[d, z] │[b, f] │[a, f] │[c, d] │  [g]  │  [e]  │[e, f] │  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘
```
In the diagram above, each cell corresponds to a node in the tree. The first line of the cell
contains the left bound of the range of facet values as well as the number of children of the node.
The second line contains the document ids which have a facet value within the range of the node.
The nodes at level 0 are the leaf nodes. They have 0 children and a single facet value in their range.
In the diagram above, the first cell of level 2 is `ab (2)`. Its range is `ab .. gaf` (because
`gaf` is the left bound of the next node) and it has two children. Its document ids are `[a,b,d,f,z]`.
These documents all contain a facet value that is contained within `ab .. gaf`.
In the database, each node is represented by a key/value pair encoded as a [`FacetGroupKey`] and a
[`FacetGroupValue`], which have the following format:
```ignore
FacetGroupKey:
- field id : u16
- level : u8
- left bound: [u8] // the facet value encoded using either OrderedF64Codec or Str
FacetGroupValue:
- #children : u8
- docids : RoaringBitmap
```
When the database is first created using the "bulk" method, each node has a fixed number of children
(except possibly the last one), given by the `group_size` parameter (which defaults to `FACET_GROUP_SIZE`).
The tree is also built such that the highest level contains at least `min_level_size`
(default: `FACET_MIN_LEVEL_SIZE`) elements.
When the database is incrementally updated, the number of children of a node can vary between
1 and `max_group_size`. This is done so that most incremental operations do not need to change
the structure of the tree. When the number of children of a node reaches `max_group_size`,
we split the node in two and update the number of children of its parent.
When adding documents to the databases, it is important to determine which method to use to
minimise indexing time. The incremental method is faster when adding few new facet values, but the
bulk method is faster when a large part of the database is modified. Empirically, incrementally
adding N facet values to an existing database takes about 50x longer than building a database of
N facet values from scratch. This is the heuristic used to choose between the
two methods.
Related PR: https://github.com/meilisearch/milli/pull/619
*/
pub const FACET_MAX_GROUP_SIZE: u8 = 8;
pub const FACET_GROUP_SIZE: u8 = 4;
pub const FACET_MIN_LEVEL_SIZE: u8 = 5;
use std::fs::File;
use log::debug;
use time::OffsetDateTime;
use self::incremental::FacetsUpdateIncremental;
use super::FacetsUpdateBulk;
use crate::facet::FacetType;
use crate::heed_codec::facet::{FacetGroupKeyCodec, FacetGroupValueCodec};
use crate::heed_codec::ByteSliceRefCodec;
use crate::{Index, Result};
pub mod bulk;
pub mod delete;
pub mod incremental;
/// A builder used to add new elements to the `facet_id_string_docids` or `facet_id_f64_docids` databases.
///
/// Depending on the number of new elements and the existing size of the database, we use either
/// a bulk update method or an incremental update method.
pub struct FacetsUpdate<'i> {
index: &'i Index,
database: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
facet_type: FacetType,
new_data: grenad::Reader<File>,
group_size: u8,
max_group_size: u8,
min_level_size: u8,
}
impl<'i> FacetsUpdate<'i> {
pub fn new(index: &'i Index, facet_type: FacetType, new_data: grenad::Reader<File>) -> Self {
let database = match facet_type {
FacetType::String => index
.facet_id_string_docids
.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>(),
FacetType::Number => {
index.facet_id_f64_docids.remap_key_type::<FacetGroupKeyCodec<ByteSliceRefCodec>>()
}
};
Self {
index,
database,
group_size: FACET_GROUP_SIZE,
max_group_size: FACET_MAX_GROUP_SIZE,
min_level_size: FACET_MIN_LEVEL_SIZE,
facet_type,
new_data,
}
}
pub fn execute(self, wtxn: &mut heed::RwTxn) -> Result<()> {
if self.new_data.is_empty() {
return Ok(());
}
debug!("Computing and writing the facet values levels docids into LMDB on disk...");
self.index.set_updated_at(wtxn, &OffsetDateTime::now_utc())?;
// See self::comparison_bench::benchmark_facet_indexing
if self.new_data.len() >= (self.database.len(wtxn)? as u64 / 50) {
let field_ids =
self.index.faceted_fields_ids(wtxn)?.iter().copied().collect::<Vec<_>>();
let bulk_update = FacetsUpdateBulk::new(
self.index,
field_ids,
self.facet_type,
self.new_data,
self.group_size,
self.min_level_size,
);
bulk_update.execute(wtxn)?;
} else {
let incremental_update = FacetsUpdateIncremental::new(
self.index,
self.facet_type,
self.new_data,
self.group_size,
self.min_level_size,
self.max_group_size,
);
incremental_update.execute(wtxn)?;
}
Ok(())
}
}
#[cfg(test)]
pub(crate) mod tests {
use std::cell::Cell;
use std::fmt::Display;
use std::iter::FromIterator;
use std::marker::PhantomData;
use std::rc::Rc;
use heed::types::ByteSlice;
use heed::{BytesDecode, BytesEncode, Env, RoTxn, RwTxn};
use roaring::RoaringBitmap;
use super::bulk::FacetsUpdateBulkInner;
use crate::heed_codec::facet::{
FacetGroupKey, FacetGroupKeyCodec, FacetGroupValue, FacetGroupValueCodec,
};
use crate::heed_codec::ByteSliceRefCodec;
use crate::search::facet::get_highest_level;
use crate::snapshot_tests::display_bitmap;
use crate::update::FacetsUpdateIncrementalInner;
use crate::CboRoaringBitmapCodec;
/// A dummy index that only contains the facet database, used for testing
pub struct FacetIndex<BoundCodec>
where
for<'a> BoundCodec:
BytesEncode<'a> + BytesDecode<'a, DItem = <BoundCodec as BytesEncode<'a>>::EItem>,
{
pub env: Env,
pub content: heed::Database<FacetGroupKeyCodec<ByteSliceRefCodec>, FacetGroupValueCodec>,
pub group_size: Cell<u8>,
pub min_level_size: Cell<u8>,
pub max_group_size: Cell<u8>,
_tempdir: Rc<tempfile::TempDir>,
_phantom: PhantomData<BoundCodec>,
}
impl<BoundCodec> FacetIndex<BoundCodec>
where
for<'a> BoundCodec:
BytesEncode<'a> + BytesDecode<'a, DItem = <BoundCodec as BytesEncode<'a>>::EItem>,
{
#[cfg(all(test, fuzzing))]
pub fn open_from_tempdir(
tempdir: Rc<tempfile::TempDir>,
group_size: u8,
max_group_size: u8,
min_level_size: u8,
) -> FacetIndex<BoundCodec> {
let group_size = std::cmp::min(16, std::cmp::max(group_size, 2)); // 2 <= x <= 16
let max_group_size = std::cmp::min(16, std::cmp::max(group_size * 2, max_group_size)); // 2*group_size <= x <= 16
let min_level_size = std::cmp::min(17, std::cmp::max(1, min_level_size)); // 1 <= x <= 17
let mut options = heed::EnvOpenOptions::new();
let options = options.map_size(4096 * 4 * 10 * 1000);
unsafe {
options.flag(heed::flags::Flags::MdbAlwaysFreePages);
}
let env = options.open(tempdir.path()).unwrap();
let content = env.open_database(None).unwrap().unwrap();
FacetIndex {
content,
group_size: Cell::new(group_size),
max_group_size: Cell::new(max_group_size),
min_level_size: Cell::new(min_level_size),
_tempdir: tempdir,
env,
_phantom: PhantomData,
}
}
pub fn new(
group_size: u8,
max_group_size: u8,
min_level_size: u8,
) -> FacetIndex<BoundCodec> {
let group_size = std::cmp::min(127, std::cmp::max(group_size, 2)); // 2 <= x <= 127
let max_group_size = std::cmp::min(127, std::cmp::max(group_size * 2, max_group_size)); // 2*group_size <= x <= 127
let min_level_size = std::cmp::max(1, min_level_size); // 1 <= x <= inf
let mut options = heed::EnvOpenOptions::new();
let options = options.map_size(4096 * 4 * 1000 * 100);
let tempdir = tempfile::TempDir::new().unwrap();
let env = options.open(tempdir.path()).unwrap();
let content = env.create_database(None).unwrap();
FacetIndex {
content,
group_size: Cell::new(group_size),
max_group_size: Cell::new(max_group_size),
min_level_size: Cell::new(min_level_size),
_tempdir: Rc::new(tempdir),
env,
_phantom: PhantomData,
}
}
#[cfg(all(test, fuzzing))]
pub fn set_group_size(&self, group_size: u8) {
// 2 <= x <= 64
self.group_size.set(std::cmp::min(64, std::cmp::max(group_size, 2)));
}
#[cfg(all(test, fuzzing))]
pub fn set_max_group_size(&self, max_group_size: u8) {
// 2*group_size <= x <= 128
let max_group_size = std::cmp::max(4, std::cmp::min(128, max_group_size));
self.max_group_size.set(max_group_size);
if self.group_size.get() < max_group_size / 2 {
self.group_size.set(max_group_size / 2);
}
}
#[cfg(all(test, fuzzing))]
pub fn set_min_level_size(&self, min_level_size: u8) {
// 1 <= x <= inf
self.min_level_size.set(std::cmp::max(1, min_level_size));
}
pub fn insert<'a>(
&self,
wtxn: &'a mut RwTxn,
field_id: u16,
key: &'a <BoundCodec as BytesEncode<'a>>::EItem,
docids: &RoaringBitmap,
) {
let update = FacetsUpdateIncrementalInner {
db: self.content,
group_size: self.group_size.get(),
min_level_size: self.min_level_size.get(),
max_group_size: self.max_group_size.get(),
};
let key_bytes = BoundCodec::bytes_encode(&key).unwrap();
update.insert(wtxn, field_id, &key_bytes, docids).unwrap();
}
pub fn delete_single_docid<'a>(
&self,
wtxn: &'a mut RwTxn,
field_id: u16,
key: &'a <BoundCodec as BytesEncode<'a>>::EItem,
docid: u32,
) {
self.delete(wtxn, field_id, key, &RoaringBitmap::from_iter(std::iter::once(docid)))
}
pub fn delete<'a>(
&self,
wtxn: &'a mut RwTxn,
field_id: u16,
key: &'a <BoundCodec as BytesEncode<'a>>::EItem,
docids: &RoaringBitmap,
) {
let update = FacetsUpdateIncrementalInner {
db: self.content,
group_size: self.group_size.get(),
min_level_size: self.min_level_size.get(),
max_group_size: self.max_group_size.get(),
};
let key_bytes = BoundCodec::bytes_encode(&key).unwrap();
update.delete(wtxn, field_id, &key_bytes, docids).unwrap();
}
pub fn bulk_insert<'a, 'b>(
&self,
wtxn: &'a mut RwTxn,
field_ids: &[u16],
els: impl IntoIterator<
Item = &'a ((u16, <BoundCodec as BytesEncode<'a>>::EItem), RoaringBitmap),
>,
) where
for<'c> <BoundCodec as BytesEncode<'c>>::EItem: Sized,
{
let mut new_data = vec![];
let mut writer = grenad::Writer::new(&mut new_data);
for ((field_id, left_bound), docids) in els {
let left_bound_bytes = BoundCodec::bytes_encode(left_bound).unwrap().into_owned();
let key: FacetGroupKey<&[u8]> =
FacetGroupKey { field_id: *field_id, level: 0, left_bound: &left_bound_bytes };
let key = FacetGroupKeyCodec::<ByteSliceRefCodec>::bytes_encode(&key).unwrap();
let value = CboRoaringBitmapCodec::bytes_encode(&docids).unwrap();
writer.insert(&key, &value).unwrap();
}
writer.finish().unwrap();
let reader = grenad::Reader::new(std::io::Cursor::new(new_data)).unwrap();
let update = FacetsUpdateBulkInner {
db: self.content,
new_data: Some(reader),
group_size: self.group_size.get(),
min_level_size: self.min_level_size.get(),
};
update.update(wtxn, field_ids, |_, _, _| Ok(())).unwrap();
}
pub fn verify_structure_validity(&self, txn: &RoTxn, field_id: u16) {
let mut field_id_prefix = vec![];
field_id_prefix.extend_from_slice(&field_id.to_be_bytes());
let highest_level = get_highest_level(txn, self.content, field_id).unwrap();
for level_no in (1..=highest_level).rev() {
let mut level_no_prefix = vec![];
level_no_prefix.extend_from_slice(&field_id.to_be_bytes());
level_no_prefix.push(level_no);
let mut iter = self
.content
.as_polymorph()
.prefix_iter::<_, ByteSlice, FacetGroupValueCodec>(txn, &level_no_prefix)
.unwrap();
while let Some(el) = iter.next() {
let (key, value) = el.unwrap();
let key = FacetGroupKeyCodec::<ByteSliceRefCodec>::bytes_decode(&key).unwrap();
let mut prefix_start_below = vec![];
prefix_start_below.extend_from_slice(&field_id.to_be_bytes());
prefix_start_below.push(level_no - 1);
prefix_start_below.extend_from_slice(&key.left_bound);
let start_below = {
let mut start_below_iter = self
.content
.as_polymorph()
.prefix_iter::<_, ByteSlice, FacetGroupValueCodec>(
txn,
&prefix_start_below,
)
.unwrap();
let (key_bytes, _) = start_below_iter.next().unwrap().unwrap();
FacetGroupKeyCodec::<ByteSliceRefCodec>::bytes_decode(&key_bytes).unwrap()
};
assert!(value.size > 0);
let mut actual_size = 0;
let mut values_below = RoaringBitmap::new();
let mut iter_below = self
.content
.range(txn, &(start_below..))
.unwrap()
.take(value.size as usize);
while let Some(el) = iter_below.next() {
let (_, value) = el.unwrap();
actual_size += 1;
values_below |= value.bitmap;
}
assert_eq!(actual_size, value.size, "{key:?} start_below: {start_below:?}");
assert_eq!(value.bitmap, values_below);
}
}
}
}
impl<BoundCodec> Display for FacetIndex<BoundCodec>
where
for<'a> <BoundCodec as BytesEncode<'a>>::EItem: Sized + Display,
for<'a> BoundCodec:
BytesEncode<'a> + BytesDecode<'a, DItem = <BoundCodec as BytesEncode<'a>>::EItem>,
{
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
let txn = self.env.read_txn().unwrap();
let mut iter = self.content.iter(&txn).unwrap();
while let Some(el) = iter.next() {
let (key, value) = el.unwrap();
let FacetGroupKey { field_id, level, left_bound: bound } = key;
let bound = BoundCodec::bytes_decode(bound).unwrap();
let FacetGroupValue { size, bitmap } = value;
writeln!(
f,
"{field_id:<2} {level:<2} k{bound:<8} {size:<4} {values:?}",
values = display_bitmap(&bitmap)
)?;
}
Ok(())
}
}
}
#[allow(unused)]
#[cfg(test)]
mod comparison_bench {
use std::iter::once;
use rand::Rng;
use roaring::RoaringBitmap;
use super::tests::FacetIndex;
use crate::heed_codec::facet::OrderedF64Codec;
// This is a simple test to get an intuition on the relative speed
// of the incremental vs. bulk indexer.
//
// The benchmark shows the worst-case scenario for the incremental indexer, since
// each facet value contains only one document ID.
//
// In that scenario, it appears that the incremental indexer is about 50 times slower than the
// bulk indexer.
// #[test]
fn benchmark_facet_indexing() {
let mut facet_value = 0;
let mut r = rand::thread_rng();
for i in 1..=20 {
let size = 50_000 * i;
let index = FacetIndex::<OrderedF64Codec>::new(4, 8, 5);
let mut txn = index.env.write_txn().unwrap();
let mut elements = Vec::<((u16, f64), RoaringBitmap)>::new();
for i in 0..size {
// field id = 0, left_bound = i, docids = [i]
elements.push(((0, facet_value as f64), once(i).collect()));
facet_value += 1;
}
let timer = std::time::Instant::now();
index.bulk_insert(&mut txn, &[0], elements.iter());
let time_spent = timer.elapsed().as_millis();
println!("bulk {size} : {time_spent}ms");
txn.commit().unwrap();
for nbr_doc in [1, 100, 1000, 10_000] {
let mut txn = index.env.write_txn().unwrap();
let timer = std::time::Instant::now();
//
// insert `nbr_doc` documents, one at a time
//
for _ in 0..nbr_doc {
index.insert(&mut txn, 0, &r.gen(), &once(1).collect());
}
let time_spent = timer.elapsed().as_millis();
println!(" add {nbr_doc} : {time_spent}ms");
txn.abort().unwrap();
}
}
}
}
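For reviewers who want to relate the snapshot files below to the tree structure above: the keys of the facet databases are plain ordered byte strings. A minimal sketch of the layout described in the module documentation (illustrative only; the actual encoding lives in `FacetGroupKeyCodec`, and this mirrors the prefixes built by hand in `read_level_0` and `verify_structure_validity`):

```
// Illustrative reconstruction of the FacetGroupKey byte layout: field id as
// a big-endian u16, then the level byte, then the raw left bound.
fn facet_group_key_bytes(field_id: u16, level: u8, left_bound: &[u8]) -> Vec<u8> {
    let mut key = Vec::with_capacity(2 + 1 + left_bound.len());
    key.extend_from_slice(&field_id.to_be_bytes());
    key.push(level);
    key.extend_from_slice(left_bound);
    key
}

fn main() {
    // Field id 0, level 1, left bound "ab": because the level byte comes
    // right after the field id, this key sorts after every level-0 key of
    // field 0, which is what the prefix iterators rely on.
    let key = facet_group_key_bytes(0, 1, b"ab");
    assert_eq!(key, vec![0, 0, 1, b'a', b'b']);
}
```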

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
b40dd31a65e033ffc6b35c027ce19506

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
7ee22d8e9387e72758f00918eb67e4c6

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
60f567359382507afdaf45fb075740c3

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
b986d6e6cbf425685f409a8b417010e1

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
ee10dd2ae2b5c6621a89a5d0a9aa8ccc

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
fa877559eef78b383b496c15a364a2dc

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
16a96353bc42f2ff3e91611ca4d5b184

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
be1b08073b9d9788d18080c1320151d7

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
16a96353bc42f2ff3e91611ca4d5b184

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/bulk.rs
---
32a45d555df2e001420fea149818d376

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/delete.rs
---
550cd138d6fe31ccdd42cd5392fbd576

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/delete.rs
---
9a0ea88e7c9dcf6dc0ef0b601736ffcf

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/delete.rs
---
d4d5f14e7f1e1f09b86821a0b6defcc6

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/delete.rs
---
3570e0ac0fdb21be9ebe433f59264b56

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
5dbfa134cc44abeb3ab6242fc182e48e

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
6ed7bf5d440599b3b10b37549a271fdf

@@ -0,0 +1,19 @@
---
source: milli/src/update/facet/incremental.rs
---
0 0 k0 1 "[0, ]"
0 0 k1 1 "[1, ]"
0 0 k2 1 "[2, ]"
0 0 k3 1 "[3, ]"
0 0 k4 1 "[4, ]"
0 0 k5 1 "[5, ]"
0 0 k6 1 "[6, ]"
0 0 k7 1 "[7, ]"
0 0 k8 1 "[8, ]"
0 0 k9 1 "[9, ]"
0 0 k10 1 "[10, ]"
0 0 k11 1 "[11, ]"
0 0 k12 1 "[12, ]"
0 0 k13 1 "[13, ]"
0 0 k14 1 "[14, ]"

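This is the first snapshot stored in readable form. Each line follows the format produced by the `Debug` implementation shown at the top of this diff: field id, level, `k` followed by the decoded left bound, the group size, and the docids bitmap. A minimal standalone sketch that reproduces one such line, with plain types standing in for the real key and value structs and codecs:

```
// Illustrative: mirrors the `writeln!` format string from the Debug impl,
// with simple concrete types in place of the real key and value structs.
fn snapshot_line(field_id: u16, level: u8, bound: u32, size: u8, docids: &[u32]) -> String {
    let values: String = docids.iter().map(|d| format!("{d}, ")).collect();
    let values = format!("[{values}]");
    format!("{field_id:<2} {level:<2} k{bound:<8} {size:<4} {values:?}")
}
```

For example, `snapshot_line(0, 1, 4, 4, &[4, 5, 6, 7])` renders something like `0  1  k4        4    "[4, 5, 6, 7, ]"`. In the level-1 entries of the snapshots below, every full group covers exactly four level-0 keys, consistent with the group size of 4 passed to `FacetIndex::new(4, 8, 5)` in the benchmark above.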
@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
b5203f0df0036ebaa133dd77d63a00eb

@@ -0,0 +1,26 @@
---
source: milli/src/update/facet/incremental.rs
---
0 0 k0 1 "[0, ]"
0 0 k1 1 "[1, ]"
0 0 k2 1 "[2, ]"
0 0 k3 1 "[3, ]"
0 0 k4 1 "[4, ]"
0 0 k5 1 "[5, ]"
0 0 k6 1 "[6, ]"
0 0 k7 1 "[7, ]"
0 0 k8 1 "[8, ]"
0 0 k9 1 "[9, ]"
0 0 k10 1 "[10, ]"
0 0 k11 1 "[11, ]"
0 0 k12 1 "[12, ]"
0 0 k13 1 "[13, ]"
0 0 k14 1 "[14, ]"
0 0 k15 1 "[15, ]"
0 0 k16 1 "[16, ]"
0 1 k0 4 "[0, 1, 2, 3, ]"
0 1 k4 4 "[4, 5, 6, 7, ]"
0 1 k8 4 "[8, 9, 10, 11, ]"
0 1 k12 4 "[12, 13, 14, 15, ]"
0 1 k16 1 "[16, ]"

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
95497d8579740868ee0bfc655b0bf782

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
d565c2f7bbd9e13e12de40cfbbfba6bb

@@ -0,0 +1,54 @@
---
source: milli/src/update/facet/incremental.rs
---
0 0 k216 1 "[216, ]"
0 0 k217 1 "[217, ]"
0 0 k218 1 "[218, ]"
0 0 k219 1 "[219, ]"
0 0 k220 1 "[220, ]"
0 0 k221 1 "[221, ]"
0 0 k222 1 "[222, ]"
0 0 k223 1 "[223, ]"
0 0 k224 1 "[224, ]"
0 0 k225 1 "[225, ]"
0 0 k226 1 "[226, ]"
0 0 k227 1 "[227, ]"
0 0 k228 1 "[228, ]"
0 0 k229 1 "[229, ]"
0 0 k230 1 "[230, ]"
0 0 k231 1 "[231, ]"
0 0 k232 1 "[232, ]"
0 0 k233 1 "[233, ]"
0 0 k234 1 "[234, ]"
0 0 k235 1 "[235, ]"
0 0 k236 1 "[236, ]"
0 0 k237 1 "[237, ]"
0 0 k238 1 "[238, ]"
0 0 k239 1 "[239, ]"
0 0 k240 1 "[240, ]"
0 0 k241 1 "[241, ]"
0 0 k242 1 "[242, ]"
0 0 k243 1 "[243, ]"
0 0 k244 1 "[244, ]"
0 0 k245 1 "[245, ]"
0 0 k246 1 "[246, ]"
0 0 k247 1 "[247, ]"
0 0 k248 1 "[248, ]"
0 0 k249 1 "[249, ]"
0 0 k250 1 "[250, ]"
0 0 k251 1 "[251, ]"
0 0 k252 1 "[252, ]"
0 0 k253 1 "[253, ]"
0 0 k254 1 "[254, ]"
0 0 k255 1 "[255, ]"
0 1 k216 4 "[216, 217, 218, 219, ]"
0 1 k220 4 "[220, 221, 222, 223, ]"
0 1 k224 4 "[224, 225, 226, 227, ]"
0 1 k228 4 "[228, 229, 230, 231, ]"
0 1 k232 4 "[232, 233, 234, 235, ]"
0 1 k236 4 "[236, 237, 238, 239, ]"
0 1 k240 4 "[240, 241, 242, 243, ]"
0 1 k244 4 "[244, 245, 246, 247, ]"
0 1 k248 4 "[248, 249, 250, 251, ]"
0 1 k252 4 "[252, 253, 254, 255, ]"

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
7cb503827ba17e9670296cc9531a1380

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
b061f43e379e16f0617c05d3313d0078

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
81fc9489d6b163935b97433477dea63b

@@ -0,0 +1,4 @@
---
source: milli/src/update/facet/incremental.rs
---
b17b2c4ec87a778aae07854c96c08b48

@@ -0,0 +1,20 @@
---
source: milli/src/update/facet/incremental.rs
---
0 0 k0 1 "[3, 435, 583, 849, ]"
0 0 k1 1 "[35, 494, 693, 796, ]"
0 0 k2 1 "[76, 420, 526, 909, ]"
0 0 k3 1 "[133, 451, 653, 806, ]"
0 0 k4 1 "[131, 464, 656, 853, ]"
0 0 k5 1 "[61, 308, 701, 903, ]"
0 0 k6 1 "[144, 449, 674, 794, ]"
0 0 k7 1 "[182, 451, 735, 941, ]"
0 0 k8 1 "[6, 359, 679, 1003, ]"
0 0 k9 1 "[197, 418, 659, 904, ]"
0 0 k10 1 "[88, 297, 567, 800, ]"
0 0 k11 1 "[150, 309, 530, 946, ]"
0 0 k12 1 "[156, 466, 567, 892, ]"
0 0 k13 1 "[46, 425, 610, 807, ]"
0 0 k14 1 "[236, 433, 549, 891, ]"
0 0 k15 1 "[207, 472, 603, 974, ]"

Some files were not shown because too many files have changed in this diff.