733: Avoid a prefix-related worst-case scenario in the proximity criterion r=loiclec a=loiclec

# Pull Request

## Related issue
Somewhat fixes (until merged into meilisearch) https://github.com/meilisearch/meilisearch/issues/3118

## What does this PR do?
When a query ends with a word and a prefix, such as:
```
word pr
```
Then we first determine whether `pre` *could possibly* be in the proximity prefix database before querying it. There are then three possibilities:

1. `pr` is not in any prefix cache because it is not the prefix of many words. We don't query the proximity prefix database. Instead, we list all the word derivations of `pre` through the FST and query the regular proximity databases.

2. `pr` is in the prefix cache but cannot be found in the proximity prefix databases. **In this case, we partially disable the proximity ranking rule for the pair `word pre`.** This is done as follows:
   1. Only find the documents where `word` is in proximity to `pre` **exactly** (no derivations)
   2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8

3. `pr` is in the prefix cache and can be found in the proximity prefix databases. In this case we simply query the proximity prefix databases.

Note that if a prefix is longer than 2 bytes, then it cannot be in the proximity prefix databases. Also, proximities larger than 4 are not present in these databases either. Therefore, the impact on relevancy is:

1. For common prefixes of one or two letters: we no longer distinguish between proximities from 4 to 8
2. For common prefixes of more than two letters: we no longer distinguish between any proximities
3. For uncommon prefixes: nothing changes

Regarding (1), it means that these two documents would be considered equally relevant according to the proximity rule for the query `heard pr` (IF `pr` is the prefix of more than 200 words in the dataset):
```json
[
    { "text": "I heard there is a faster proximity criterion" },
    { "text": "I heard there is a faster but less relevant proximity criterion" }
]
```

Regarding (2), it means that two documents would be considered equally relevant according to the proximity rule for the query "faster pro":
```json
[
    { "text": "I heard there is a faster but less relevant proximity criterion" }
    { "text": "I heard there is a faster proximity criterion" },
]
```
But the following document would be considered more relevant than the two documents above:
```json
{ "text": "I heard there is a faster swimmer who is competing in the pro section of the competition " }
```

Note, however, that this change of behaviour only occurs when using the set-based version of the proximity criterion. In cases where there are fewer than 1000 candidate documents when the proximity criterion is called, this PR does not change anything. 

---

## Performance

I couldn't use the existing search benchmarks to measure the impact of the PR, but I did some manual tests with the `songs` benchmark dataset.   

```
1. 10x 'a': 
	- 640ms ⟹ 630ms                  = no significant difference
2. 10x 'b':
	- set-based: 4.47s ⟹ 7.42        = bad, ~2x regression
	- dynamic: 1s ⟹ 870 ms           = no significant difference
3. 'Someone I l':
	- set-based: 250ms ⟹ 12 ms       = very good, x20 speedup
	- dynamic: 21ms ⟹ 11 ms          = good, x2 speedup 
4. 'billie e':
	- set-based: 623ms ⟹ 2ms         = very good, x300 speedup 
	- dynamic: ~4ms ⟹ 4ms            = no difference
5. 'billie ei':
	- set-based: 57ms ⟹ 20ms         = good, ~2x speedup
	- dynamic: ~4ms ⟹ ~2ms.          = no significant difference
6. 'i am getting o' 
	- set-based: 300ms ⟹ 60ms        = very good, 5x speedup
	- dynamic: 30ms ⟹ 6ms            = very good, 5x speedup
7. 'prologue 1 a 1:
	- set-based: 3.36s ⟹ 120ms       = very good, 30x speedup
	- dynamic: 200ms ⟹ 30ms          = very good, 6x speedup
8. 'prologue 1 a 10':
	- set-based: 590ms ⟹ 18ms        = very good, 30x speedup 
	- dynamic: 82ms ⟹ 35ms           = good, ~2x speedup
```

Performance is often significantly better, but there is also one regression in the set-based implementation with the query `b b b b b b b b b b`.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
This commit is contained in:
bors[bot] 2023-01-04 09:00:50 +00:00 committed by GitHub
commit c3f4835e8e
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
4 changed files with 225 additions and 50 deletions

View File

@ -17,6 +17,7 @@ use super::query_tree::{Operation, PrimitiveQueryPart, Query, QueryKind};
use super::CriterionImplementationStrategy;
use crate::search::criteria::geo::Geo;
use crate::search::{word_derivations, Distinct, WordDerivationsCache};
use crate::update::{MAX_LENGTH_FOR_PREFIX_PROXIMITY_DB, MAX_PROXIMITY_FOR_PREFIX_PROXIMITY_DB};
use crate::{AscDesc as AscDescName, DocumentId, FieldId, Index, Member, Result};
mod asc_desc;
@ -653,14 +654,30 @@ fn query_pair_proximity_docids(
match (&left.kind, &right.kind) {
(QueryKind::Exact { word: left, .. }, QueryKind::Exact { word: right, .. }) => {
if prefix {
match word_prefix_pair_overall_proximity_docids(
ctx,
left.as_str(),
right.as_str(),
proximity,
)? {
Some(docids) => Ok(docids),
None => {
// There are three distinct cases which we need to distinguish regarding the prefix `right`:
//
// 1. `right` is not in any prefix cache because it is not the prefix of many words
// (and thus, it doesn't have many word derivations)
// 2. `right` is in the prefix cache but cannot be found in the "word prefix pair proximity" databases either
// because it is too long or because the given proximity is too high.
// 3. `right` is in the prefix cache and can be found in the "word prefix pair proximity" databases
//
// The three cases are handled as follows:
// 1. We manually retrieve all the word derivations of `right` and check the `word_pair_proximity`
// database for each of them.
// 2. It would be too expensive to apply the same strategy as (1), therefore, we "disable" the
// proximity ranking rule for the prefixes of the right word. This is done as follows:
// 1. Only find the documents where left is in proximity to the exact (ie non-prefix) right word
// 2. Otherwise, assume that their proximity in all the documents in which they coexist is >= 8
//
// 3. Query the prefix proximity databases.
match (
ctx.in_prefix_cache(right),
right.len() <= MAX_LENGTH_FOR_PREFIX_PROXIMITY_DB
&& proximity <= MAX_PROXIMITY_FOR_PREFIX_PROXIMITY_DB,
) {
// Case 1: not in prefix cache
(false, _) => {
let r_words = word_derivations(right, true, 0, ctx.words_fst(), wdcache)?;
all_word_pair_overall_proximity_docids(
ctx,
@ -669,40 +686,91 @@ fn query_pair_proximity_docids(
proximity,
)
}
// Case 2: in prefix cache but either the prefix length or the proximity makes it impossible to
// query the prefix proximity databases.
(true, false) => {
// To "save" the relevancy a little bit, we still find the documents where the
// exact (i.e. non-prefix) right word is in the given proximity to the left word.
Ok(word_pair_overall_proximity_docids(
ctx,
left.as_str(),
right.as_str(),
proximity,
)?
.unwrap_or_default())
}
// Case 3: in prefix cache, short enough, and proximity is low enough
(true, true) => Ok(word_prefix_pair_overall_proximity_docids(
ctx,
left.as_str(),
right.as_str(),
proximity,
)?
.unwrap_or_default()),
}
} else {
Ok(ctx
.word_pair_proximity_docids(left.as_str(), right.as_str(), proximity)?
.unwrap_or_default())
Ok(word_pair_overall_proximity_docids(
ctx,
left.as_str(),
right.as_str(),
proximity,
)?
.unwrap_or_default())
}
}
(QueryKind::Tolerant { typo, word: left }, QueryKind::Exact { word: right, .. }) => {
let l_words =
word_derivations(left, false, *typo, ctx.words_fst(), wdcache)?.to_owned();
if prefix {
let mut docids = RoaringBitmap::new();
for (left, _) in l_words {
let current_docids = match word_prefix_pair_overall_proximity_docids(
ctx,
left.as_str(),
right.as_str(),
proximity,
)? {
Some(docids) => Ok(docids),
None => {
let r_words =
word_derivations(right, true, 0, ctx.words_fst(), wdcache)?;
all_word_pair_overall_proximity_docids(
// The logic here is almost identical to the one in the previous match branch.
// The difference is that we fetch the docids for each derivation of the left word.
match (
ctx.in_prefix_cache(right),
right.len() <= MAX_LENGTH_FOR_PREFIX_PROXIMITY_DB
&& proximity <= MAX_PROXIMITY_FOR_PREFIX_PROXIMITY_DB,
) {
// Case 1: not in prefix cache
(false, _) => {
let mut docids = RoaringBitmap::new();
let r_words = word_derivations(right, true, 0, ctx.words_fst(), wdcache)?;
for (left, _) in l_words {
docids |= all_word_pair_overall_proximity_docids(
ctx,
&[(left, 0)],
r_words,
proximity,
)
)?;
}
}?;
docids |= current_docids;
Ok(docids)
}
// Case 2: in prefix cache but either the prefix length or the proximity makes it impossible to
// query the prefix proximity databases.
(true, false) => {
// To "save" the relevancy a little bit, we still find the documents where the
// exact (i.e. non-prefix) right word is in proximity to any derivation of the left word.
let mut candidates = RoaringBitmap::new();
for (left, _) in l_words {
candidates |= ctx
.word_pair_proximity_docids(&left, right, proximity)?
.unwrap_or_default();
}
Ok(candidates)
}
// Case 3: in prefix cache, short enough, and proximity is low enough
(true, true) => {
let mut docids = RoaringBitmap::new();
for (left, _) in l_words {
docids |= word_prefix_pair_overall_proximity_docids(
ctx,
left.as_str(),
right.as_str(),
proximity,
)?
.unwrap_or_default();
}
Ok(docids)
}
}
Ok(docids)
} else {
all_word_pair_overall_proximity_docids(ctx, &l_words, &[(right, 0)], proximity)
}

View File

@ -590,3 +590,123 @@ fn resolve_plane_sweep_candidates(
Ok(candidates)
}
#[cfg(test)]
mod tests {
use std::io::Cursor;
use big_s::S;
use crate::documents::{DocumentsBatchBuilder, DocumentsBatchReader};
use crate::index::tests::TempIndex;
use crate::{CriterionImplementationStrategy, SearchResult};
fn documents_with_enough_different_words_for_prefixes(prefixes: &[&str]) -> Vec<crate::Object> {
let mut documents = Vec::new();
for prefix in prefixes {
for i in 0..500 {
documents.push(
serde_json::json!({
"text": format!("{prefix}{i:x}"),
})
.as_object()
.unwrap()
.clone(),
)
}
}
documents
}
#[test]
fn test_proximity_criterion_prefix_handling() {
let mut index = TempIndex::new();
index.index_documents_config.autogenerate_docids = true;
index
.update_settings(|settings| {
settings.set_primary_key(S("id"));
settings.set_criteria(vec![
"words".to_owned(),
"typo".to_owned(),
"proximity".to_owned(),
]);
})
.unwrap();
let mut documents = DocumentsBatchBuilder::new(Vec::new());
for doc in [
// 0
serde_json::json!({ "text": "zero is exactly the amount of configuration I want" }),
// 1
serde_json::json!({ "text": "zero bad configuration" }),
// 2
serde_json::json!({ "text": "zero configuration" }),
// 3
serde_json::json!({ "text": "zero config" }),
// 4
serde_json::json!({ "text": "zero conf" }),
// 5
serde_json::json!({ "text": "zero bad conf" }),
] {
documents.append_json_object(doc.as_object().unwrap()).unwrap();
}
for doc in documents_with_enough_different_words_for_prefixes(&["conf"]) {
documents.append_json_object(&doc).unwrap();
}
let documents =
DocumentsBatchReader::from_reader(Cursor::new(documents.into_inner().unwrap()))
.unwrap();
index.add_documents(documents).unwrap();
let rtxn = index.read_txn().unwrap();
let SearchResult { matching_words: _, candidates: _, documents_ids } = index
.search(&rtxn)
.query("zero c")
.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased)
.execute()
.unwrap();
insta::assert_snapshot!(format!("{documents_ids:?}"), @"[2, 3, 4, 1, 5, 0]");
let SearchResult { matching_words: _, candidates: _, documents_ids } = index
.search(&rtxn)
.query("zero co")
.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased)
.execute()
.unwrap();
insta::assert_snapshot!(format!("{documents_ids:?}"), @"[2, 3, 4, 1, 5, 0]");
let SearchResult { matching_words: _, candidates: _, documents_ids } = index
.search(&rtxn)
.query("zero con")
.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased)
.execute()
.unwrap();
// Here searh results are degraded because `con` is in the prefix cache but it is too
// long to be stored in the prefix proximity databases, and we don't want to iterate over
// all of its word derivations
insta::assert_snapshot!(format!("{documents_ids:?}"), @"[0, 1, 2, 3, 4, 5]");
let SearchResult { matching_words: _, candidates: _, documents_ids } = index
.search(&rtxn)
.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased)
.query("zero conf")
.execute()
.unwrap();
// Here search results are degraded as well, but we can still rank correctly documents
// that contain `conf` exactly, and not as a prefix.
insta::assert_snapshot!(format!("{documents_ids:?}"), @"[4, 5, 0, 1, 2, 3]");
let SearchResult { matching_words: _, candidates: _, documents_ids } = index
.search(&rtxn)
.criterion_implementation_strategy(CriterionImplementationStrategy::OnlySetBased)
.query("zero config")
.execute()
.unwrap();
// `config` is not a common prefix, so the normal methods are used
insta::assert_snapshot!(format!("{documents_ids:?}"), @"[2, 3, 1, 0, 4, 5]");
}
}

View File

@ -7,7 +7,10 @@ pub use self::index_documents::{
DocumentAdditionResult, DocumentId, IndexDocuments, IndexDocumentsConfig, IndexDocumentsMethod,
};
pub use self::indexer_config::IndexerConfig;
pub use self::prefix_word_pairs::PrefixWordPairsProximityDocids;
pub use self::prefix_word_pairs::{
PrefixWordPairsProximityDocids, MAX_LENGTH_FOR_PREFIX_PROXIMITY_DB,
MAX_PROXIMITY_FOR_PREFIX_PROXIMITY_DB,
};
pub use self::settings::{Setting, Settings};
pub use self::update_step::UpdateIndexingStep;
pub use self::word_prefix_docids::WordPrefixDocids;

View File

@ -14,6 +14,9 @@ mod word_prefix;
pub use prefix_word::index_prefix_word_database;
pub use word_prefix::index_word_prefix_database;
pub const MAX_PROXIMITY_FOR_PREFIX_PROXIMITY_DB: u8 = 4;
pub const MAX_LENGTH_FOR_PREFIX_PROXIMITY_DB: usize = 2;
pub struct PrefixWordPairsProximityDocids<'t, 'u, 'i> {
wtxn: &'t mut heed::RwTxn<'i, 'u>,
index: &'i Index,
@ -32,31 +35,12 @@ impl<'t, 'u, 'i> PrefixWordPairsProximityDocids<'t, 'u, 'i> {
Self {
wtxn,
index,
max_proximity: 4,
max_prefix_length: 2,
max_proximity: MAX_PROXIMITY_FOR_PREFIX_PROXIMITY_DB,
max_prefix_length: MAX_LENGTH_FOR_PREFIX_PROXIMITY_DB,
chunk_compression_type,
chunk_compression_level,
}
}
/// Set the maximum proximity required to make a prefix be part of the words prefixes
/// database. If two words are too far from the threshold the associated documents will
/// not be part of the prefix database.
///
/// Default value is 4. This value must be lower or equal than 7 and will be clamped
/// to this bound otherwise.
pub fn max_proximity(&mut self, value: u8) -> &mut Self {
self.max_proximity = value.max(7);
self
}
/// Set the maximum length the prefix of a word pair is allowed to have to be part of the words
/// prefixes database. If the prefix length is higher than the threshold, the associated documents
/// will not be part of the prefix database.
///
/// Default value is 2.
pub fn max_prefix_length(&mut self, value: usize) -> &mut Self {
self.max_prefix_length = value;
self
}
#[logging_timer::time("WordPrefixPairProximityDocids::{}")]
pub fn execute<'a>(