diff --git a/milli/src/search/facet/facet_string.rs b/milli/src/search/facet/facet_string.rs
new file mode 100644
index 000000000..61fc32f8e
--- /dev/null
+++ b/milli/src/search/facet/facet_string.rs
@@ -0,0 +1,123 @@
+//! This module contains helper iterators for facet strings.
+//!
+//! The purpose is to help iterate over the quite complex system of facet strings. A simple
+//! description of the system would be that every facet string value is stored in an LMDB database
+//! and that every value is associated with the ids of the documents that contain this facet
+//! string value.
+//!
+//! In reality it is a little bit more complex, as we have to create aggregations of runs of facet
+//! string values; those aggregations help in choosing the right groups of facets to follow.
+//!
+//! ## A typical algorithm run
+//!
+//! If a group of aggregated facet values contains one of the document ids, we must continue
+//! iterating over the sub-groups.
+//!
+//! If this group is at the lowest level and contains at least one document id, we yield the
+//! associated facet document ids.
+//!
+//! If the group doesn't contain one of our document ids, we continue to the next group at this
+//! same level.
+//!
+//! ## The complexity comes from the strings
+//!
+//! This algorithm is exactly the one that we use for facet numbers. It is quite easy to create
+//! aggregated facet numbers: groups of facets are easy to define in the LMDB key, we just put the
+//! two number bounds, the left and the right bound of the group, both inclusive.
+//!
+//! It is easy to make sure that the groups are ordered: LMDB sorts its keys lexicographically and
+//! putting two big-endian encoded numbers one after the other gives us ordered groups. The values
+//! are simple unions of the document ids coming from the groups below.
+//!
+//! ### Example of what a facet number LMDB database contains
+//!
+//! | level | left-bound | right-bound | docs             |
+//! |-------|------------|-------------|------------------|
+//! |     0 |          0 |   _skipped_ | 1, 2             |
+//! |     0 |          1 |   _skipped_ | 6, 7             |
+//! |     0 |          3 |   _skipped_ | 4, 7             |
+//! |     0 |          5 |   _skipped_ | 2, 3, 4          |
+//! |     1 |          0 |           1 | 1, 2, 6, 7       |
+//! |     1 |          3 |           5 | 2, 3, 4, 7       |
+//! |     2 |          0 |           5 | 1, 2, 3, 4, 6, 7 |
+//!
+//! As you can see, the level 0 entries have two equal bounds, therefore we skip serializing the
+//! second bound. That's the base level, where you can directly fetch the document ids associated
+//! with an exact number.
+//!
+//! The next levels have two different bounds and the associated document ids are simply the
+//! union of all the document ids associated with the groups they aggregate from the level below.
+//!
+//! ## The complexity of defining groups of facet strings
+//!
+//! As explained above, defining groups of facet numbers is easy: LMDB stores the keys in
+//! lexicographical order, which means that whatever the key represents, the bytes are read in
+//! their raw form and a simple `strcmp`-style comparison defines the order in which keys are read.
+//!
+//! That's easy for types with a known size, like floats or integers: they are 64 bits long and
+//! appending one after the other in big-endian is consistent. LMDB will simply sort the keys by the
+//! first number, then by the second if the first number is equal on two keys.
+//!
+//! For strings it is a lot more complex as those types are unsized, which means that the size of
+//! facet strings is different for each facet value.
+//!
+//! ### Basic approach: padding the keys
+//!
+//! A first approach would be to simply define the maximum size of a facet string and pad the keys
+//! with zeroes. The big problem with this approach is that it:
+//! 1. reduces the maximum size of facet strings by half, as we need to put two keys one after the
+//!    other.
+//! 2. makes the keys of facet strings very big (approximately 250 bytes), which hurts LMDB
+//!    performance a lot.
+//!
+//! ### Better approach: number the facet groups
+//!
+//! A better approach would be to number the groups. This way we don't have the downsides of the
+//! previously described approach, but we need to be able to describe the groups by using a number.
+//!
+//! #### Example of facet strings with numbered groups
+//!
+//! | level | left-bound | right-bound | left-string | right-string | docs             |
+//! |-------|------------|-------------|-------------|--------------|------------------|
+//! |     0 |      alpha |   _skipped_ |   _skipped_ |    _skipped_ | 1, 2             |
+//! |     0 |       beta |   _skipped_ |   _skipped_ |    _skipped_ | 6, 7             |
+//! |     0 |      gamma |   _skipped_ |   _skipped_ |    _skipped_ | 4, 7             |
+//! |     0 |      omega |   _skipped_ |   _skipped_ |    _skipped_ | 2, 3, 4          |
+//! |     1 |          0 |           1 |       alpha |         beta | 1, 2, 6, 7       |
+//! |     1 |          2 |           3 |       gamma |        omega | 2, 3, 4, 7       |
+//! |     2 |          0 |           3 |   _skipped_ |    _skipped_ | 1, 2, 3, 4, 6, 7 |
+//!
+//! As you can see, the level 0 doesn't actually change much: we skip nearly everything, as we do
+//! not need to store the facet string value two times.
+//!
+//! In the value, not in the key, you can see that we added two new values:
+//! the left-string and the right-string, which define the original facet strings associated with
+//! the given group.
+//!
+//! We put those two strings inside of the value; this way we do not limit the maximum size of the
+//! facet string values, and the impact on performance is not significant as, IIRC, LMDB puts big
+//! values on another page. This helps in iterating over keys fast enough and only fetching the
+//! page with the values when required.
+//!
+//! The other little advantage of this solution is that there is no big overhead compared with
+//! the facet number levels: we only duplicate the facet strings once, for the level 1.
+//!
+//! #### A typical algorithm run
+//!
+//! Note that the algorithm is always moving from the highest level to the lowest one, one level
+//! at a time; this is why it is OK to only store the facet strings on the level 1.
+//!
+//! If a group of aggregated facet values (a group with numbers) contains one of the document ids,
+//! we must continue iterating over the sub-groups. To do so:
+//! - If we are at a level >= 2, we just do the same as with the facet numbers: get both bounds
+//!   and iterate over the facet groups defined by these numbers over the current level - 1.
+//! - If we are at level 1, we retrieve both keys, the left-string and right-string, from the
+//!   value and just do the same as with the facet numbers but with strings: iterate over the
+//!   current level - 1 with both keys.
+//!
+//! If this group is at the lowest level (level 0) and contains at least one document id, we yield
+//! the associated facet document ids.
+//!
+//! If the group doesn't contain one of our document ids, we continue to the next group at this
+//! same level.
+//!
diff --git a/milli/src/search/facet/mod.rs b/milli/src/search/facet/mod.rs
index e6ea92543..d92a8e4bd 100644
--- a/milli/src/search/facet/mod.rs
+++ b/milli/src/search/facet/mod.rs
@@ -5,5 +5,6 @@ pub(crate) use self::parser::Rule as ParserRule;
 
 mod facet_distribution;
 mod facet_number;
+mod facet_string;
 mod filter_condition;
 mod parser;
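
To make the numbered-group traversal from the `facet_string.rs` module documentation more concrete, here is a minimal in-memory sketch. It is not the code added by this change: `FacetLevels`, `Group`, `matching_facets` and the other names are hypothetical, a `BTreeMap` stands in for the LMDB database, and a `BTreeSet<u32>` stands in for the document ids bitmap. The data mirrors the facet strings example table, assuming group bounds are 0-based indexes into the level 0 entries.

```rust
use std::collections::{BTreeMap, BTreeSet};
use std::ops::Bound;

// Stand-in for the document ids bitmap (an assumption made for this sketch).
type DocIds = BTreeSet<u32>;

/// A numbered group stored at level >= 1 (hypothetical in-memory layout).
struct Group {
    left: u8,
    right: u8,
    /// Only filled at level 1: the facet strings bounding the group.
    strings: Option<(String, String)>,
    docs: DocIds,
}

struct FacetLevels {
    /// Level 0: facet string -> document ids, kept in lexicographic order.
    level0: BTreeMap<String, DocIds>,
    /// `groups[0]` holds level 1, `groups[1]` holds level 2, and so on.
    groups: Vec<Vec<Group>>,
}

impl FacetLevels {
    /// Returns every facet string that contains at least one candidate, with its document ids.
    fn matching_facets(&self, candidates: &DocIds) -> Vec<(String, DocIds)> {
        let mut out = Vec::new();
        if self.groups.is_empty() {
            self.visit_strings(Bound::Unbounded, Bound::Unbounded, candidates, &mut out);
        } else {
            self.visit_groups(self.groups.len(), 0, u8::MAX, candidates, &mut out);
        }
        out
    }

    fn visit_groups(
        &self,
        level: usize,
        left: u8,
        right: u8,
        candidates: &DocIds,
        out: &mut Vec<(String, DocIds)>,
    ) {
        for group in &self.groups[level - 1] {
            // Skip groups outside the parent's bounds or without any candidate.
            let inside = left <= group.left && group.right <= right;
            if !inside || group.docs.is_disjoint(candidates) {
                continue;
            }
            if level == 1 {
                // Level 1: switch from numbers to the strings stored in the value.
                let (l, r) = group.strings.as_ref().expect("level 1 groups carry strings");
                let (l, r) = (Bound::Included(l.as_str()), Bound::Included(r.as_str()));
                self.visit_strings(l, r, candidates, out);
            } else {
                // Level >= 2: descend with the numeric bounds, as for facet numbers.
                self.visit_groups(level - 1, group.left, group.right, candidates, out);
            }
        }
    }

    fn visit_strings(
        &self,
        left: Bound<&str>,
        right: Bound<&str>,
        candidates: &DocIds,
        out: &mut Vec<(String, DocIds)>,
    ) {
        // Level 0: yield the document ids of each facet string holding a candidate.
        for (facet, docs) in self.level0.range::<str, _>((left, right)) {
            if !docs.is_disjoint(candidates) {
                out.push((facet.clone(), docs.clone()));
            }
        }
    }
}

fn ids(values: &[u32]) -> DocIds {
    values.iter().copied().collect()
}

fn group(left: u8, right: u8, strings: Option<(&str, &str)>, docs: &[u32]) -> Group {
    let strings = strings.map(|(l, r)| (l.to_string(), r.to_string()));
    Group { left, right, strings, docs: ids(docs) }
}

/// Builds the data of the facet strings example table.
fn example_index() -> FacetLevels {
    let level0 = vec![
        ("alpha", ids(&[1, 2])),
        ("beta", ids(&[6, 7])),
        ("gamma", ids(&[4, 7])),
        ("omega", ids(&[2, 3, 4])),
    ];
    let level0 = level0.into_iter().map(|(k, v)| (k.to_string(), v)).collect();
    let level1 = vec![
        group(0, 1, Some(("alpha", "beta")), &[1, 2, 6, 7]),
        group(2, 3, Some(("gamma", "omega")), &[2, 3, 4, 7]),
    ];
    let level2 = vec![group(0, 3, None, &[1, 2, 3, 4, 6, 7])];
    FacetLevels { level0, groups: vec![level1, level2] }
}
```

Calling `example_index().matching_facets(&ids(&[3, 6]))` descends from level 2 into both level 1 groups, since each contains a candidate, and yields `beta` and `omega` with their document ids. With the single candidate `1`, the second level 1 group is skipped without reading any of its level 0 entries.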