Commit Graph

2165 Commits

Author SHA1 Message Date
meili-bors[bot]
ccf3ba3f32
Merge #4019
4019: Bringing back changes from `v1.3.2` onto `main` r=irevoire a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: irevoire <irevoire@users.noreply.github.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-08-28 12:14:11 +00:00
Clément Renault
8c0ebd1331
Update milli/src/search/new/bucket_sort.rs
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-08-23 16:40:39 +02:00
Kerollmops
5130e06b41
Temporarily disable an assert in the ranking rules 2023-08-23 16:11:54 +02:00
meili-bors[bot]
914b125c5f
Merge #3945
3945: Do not leak field information on error r=Kerollmops a=vivek-26

# Pull Request

## Related issue
Fixes #3865

## What does this PR do?
This PR ensures that `InvalidSortableAttribute`and `InvalidFacetSearchFacetName` errors do not leak field information i.e. fields which are not part of `displayedAttributes` in the settings are hidden from the error message.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Vivek Kumar <vivek.26@outlook.com>
2023-08-22 18:55:27 +00:00
Kerollmops
717b069907
Bump charabia to 0.8.3 2023-08-22 16:25:00 +02:00
Kerollmops
c53841e166
Accept the null JSON value as the value of _vectors 2023-08-14 16:03:55 +02:00
ManyTheFish
cab27c2ab4 upgrade indexmap = "2.0.0" 2023-08-10 18:09:02 +02:00
ManyTheFish
624fa9052f upgrade deserr = "0.6.0" 2023-08-10 18:09:02 +02:00
ManyTheFish
60c11dbdbd upgrade rstar - "0.11.0" 2023-08-10 18:09:02 +02:00
ManyTheFish
dacee40ebc upgrade memmap2 = "0.7.1" 2023-08-10 18:09:02 +02:00
ManyTheFish
cc2c19d4c3 upgrade itertools = "0.10.5" 2023-08-10 18:09:02 +02:00
meili-bors[bot]
e4e49e63d0
Merge #3993
3993: Bringing back changes from v1.3.1 to `main` r=irevoire a=curquiza



Co-authored-by: irevoire <irevoire@users.noreply.github.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-08-10 14:30:02 +00:00
ManyTheFish
5a7c1bde84 Fix clippy 2023-08-10 11:27:56 +02:00
ManyTheFish
6b2d671be7 Fix PR comments 2023-08-10 10:44:07 +02:00
Many the fish
43c13faeda
Update milli/src/update/index_documents/extract/extract_docid_word_positions.rs
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-08-10 10:05:03 +02:00
meili-bors[bot]
44c1900f36
Merge #3986
3986: Fix geo bounding box with strings r=ManyTheFish a=irevoire

# Pull Request

When sending a document with one geofield of type string (i.e.: `{ "_geo": { "lat": 12, "lng": "13" }}`), the geobounding box would exclude this document.

This PR fixes this issue by automatically parsing the string value in case we're working on a geofield.

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/3973

## What does this PR do?
- Automatically parse the facet value iif we're working on a geofield.
- Make insta works with snapshots in loops or closure executed multiple times. (you may need to update your cli if it panics after this PR: `cargo install cargo-insta`).
- Add one integration test in milli and in meilisearch to ensure it works forever.
- Add three snapshots for the dump that mysteriously disappeared I don't know how


Co-authored-by: Tamo <tamo@meilisearch.com>
2023-08-09 07:58:15 +00:00
ManyTheFish
8dc5acf998 Try fix 2023-08-08 16:52:36 +02:00
ManyTheFish
35758db9ec Truncate the the normalized long facets used in search for facet value 2023-08-08 16:38:30 +02:00
Tamo
4988199bb9 ensure the geoboundingbox works with strings and int geofields in milli and meilisearch 2023-08-08 16:29:25 +02:00
Tamo
9d061cec26 automatically parse the filterable attribute to float if it's a geo field 2023-08-08 16:28:07 +02:00
ManyTheFish
4a21fecf67 Merge branch 'main' into settings-customizing-tokenization 2023-08-08 16:08:16 +02:00
Vivek Kumar
dd57873f8e
hide fields not in the displayedAttributes list from errors 2023-08-05 16:03:10 +05:30
ManyTheFish
b45c36cd71 Merge branch 'main' into tmp-release-v1.3.0 2023-08-01 15:05:17 +02:00
meili-bors[bot]
151c31c18f
Merge #3963
3963: Fix the milli crate r=ManyTheFish a=irevoire

Milli was using the serde feature of either without enabling it first; thus, it wasn't working.

It was working in meilisearch, though, because `meilisearch-types` was using the feature which enables it globally for all the other crates.

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/3962

Co-authored-by: Tamo <tamo@meilisearch.com>
2023-07-31 09:32:08 +00:00
Tamo
a8ad0902d3 Fix the milli crate
Milli was using the serde feature of either without enabling it first, thus it wasn't working
2023-07-31 11:08:27 +02:00
ThatOneCalculator
ba919b6123
fix: ⬆️ up mimalloc 2023-07-28 20:35:47 -07:00
ManyTheFish
9d5e3457e5 Fix clippy 2023-07-27 14:21:19 +02:00
meili-bors[bot]
3a3414270d
Merge #3952
3952: Use the new safe `read-txn-no-tls` heed feature r=ManyTheFish a=Kerollmops

[We recently found out](https://github.com/meilisearch/heed/issues/191#issuecomment-1650280513) that the `read-sync-txn` heed feature was invalid and must be removed from this crate. We were declaring it in milli/meilisearch but, fortunately, not sharing the `RoTxn`s across threads 😮‍💨

[I recently introduced the `read-txn-no-tls` heed feature](https://github.com/meilisearch/heed/pull/194), which implements `RoTxn: Send` and allows multiple read transactions on a single thread (which we use).

This PR removes the `sync-read-txn` heed feature from the _Cargo.toml_ file. I will fix this in heed v0.20.0 and will fill a RustSec advisory in the meantime.

Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-07-26 16:40:58 +00:00
meili-bors[bot]
939b2fc6fd
Merge #3949
3949: Fix score details casing r=Kerollmops a=ManyTheFish

# Pull Request

Fixes #3941


Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-07-26 14:14:59 +00:00
Clément Renault
d8b47b689e
Use the new read-txn-no-tls heed feature 2023-07-26 15:45:15 +02:00
ManyTheFish
b0c1a9504a ensure the synonyms are updated when the tokenizer settings are changed 2023-07-26 09:33:42 +02:00
meili-bors[bot]
be72be7c0d
Merge #3942
3942: Normalize for the search the facets values r=ManyTheFish a=Kerollmops

This PR improves and fixes the search for facet values feature. Searching for _bre_ wasn't returning facet values like _brévent_ or _brô_.

The issue was related to the fact that facets are normalized but not in the same way as the `searchableAttributes` are. We decided to normalize them further and add another intermediate database where the key is the normalized facet value, and the value is a set of the non-normalized facets. We then use these non-normalized ones to get the correct counts by fetching the associated databases.

### What's missing in this PR?
 - [x] Apply the change to the whole set of `SearchForFacetValue::execute` conditions.
 - [x] Factorize the code that does an intermediate normalized value fetch in a function.
 - [x] Add or modify the search for facet value test.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2023-07-25 14:37:17 +00:00
ManyTheFish
88559a2d54 Fix score details casing 2023-07-25 15:49:33 +02:00
ManyTheFish
d57026cd96 Support synonyms sinergies 2023-07-25 15:01:42 +02:00
Kerollmops
29ab54b259
Replace the hnsw crate by the instant-distance one 2023-07-25 12:37:35 +02:00
ManyTheFish
d4ff59fcf5 Fix clippy 2023-07-24 18:42:26 +02:00
ManyTheFish
9c485f8563 Make the search and the indexing work 2023-07-24 18:35:20 +02:00
Kerollmops
691a536893
Implement the facet search with the normalized index 2023-07-24 17:56:17 +02:00
ManyTheFish
d8d12d5979 Be able to set and reset settings 2023-07-24 17:00:18 +02:00
Clément Renault
df528b41d8
Normalize for the search the facets values 2023-07-20 17:57:07 +02:00
ManyTheFish
0497f93494 Update Charabia to the last version 2023-07-19 15:19:32 +02:00
Kerollmops
eef95de30e
First iteration on exposing puffin profiling 2023-07-18 17:38:13 +02:00
Kerollmops
d383afc82b
Fix the geo sort when lat and lng are strings 2023-07-17 18:28:04 +02:00
meili-bors[bot]
7745cc9d3c
Merge #3921
3921: Deactivate camel case segmentation r=dureuill a=ManyTheFish

# Pull Request
This PR deactivates the camel case segmentation to retrieve the possibility to accept typos over camel-cased words

## Related issue
Fixes #3869
Fixes #3818

## What does this PR do?
- deactivates camelcase segmentation

related to #3919



Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-07-13 11:00:14 +00:00
ManyTheFish
c106906f8f deactivate camelCase segmentation 2023-07-13 12:06:27 +02:00
Louis Dureuil
4310928803
Fixes #3912 2023-07-12 10:08:56 +02:00
Louis Dureuil
74315b4ea8
Fixes #3911 2023-07-12 10:08:29 +02:00
Louis Dureuil
40fa59d64c
Sort by lexicographic order after normalization 2023-07-10 09:26:59 +02:00
Louis Dureuil
55cd7738b9
Update snapshots 2023-07-04 16:31:01 +02:00
Louis Dureuil
48409c9183
Add missing exactness.matchingWords, exactness.maxMatchingWords 2023-07-04 16:31:01 +02:00
Kerollmops
a442af6a7c
Update the features of the either dependency to compile milli successfully 2023-07-03 18:51:43 +02:00
Louis Dureuil
324d448236
Format let-else ❤️ 🎉 2023-07-03 10:20:28 +02:00
meili-bors[bot]
661d1f90dc
Merge #3866
3866: Update charabia v0.8.0 r=dureuill a=ManyTheFish

# Pull Request

Update Charabia:
- enhance Japanese segmentation
- enhance Latin Tokenization
  - words containing `_` are now properly segmented into several words
  - brackets `{([])}` are no more considered as context separators so word separated by brackets are now considered near together for the proximity ranking rule
- fixes #3815
- fixes #3778
- fixes [product#151](https://github.com/meilisearch/product/discussions/151)

> Important note: now the float numbers are segmented around the `.` so `3.22` is segmented as [`3`, `.`, `22`] but the middle dot isn't considered as a hard separator, which means that if we search `3.22` we find documents containing `3.22`

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-06-29 15:24:36 +00:00
ManyTheFish
6ec7541026 Update inta snapshots 2023-06-29 17:18:39 +02:00
ManyTheFish
a82c49ab08 Update test 2023-06-29 15:56:36 +02:00
ManyTheFish
84845de9ef Update Charabia 2023-06-29 15:56:32 +02:00
Clément Renault
7c157fc442
Document that the LevelEntry fields order is important 2023-06-29 14:33:32 +02:00
Clément Renault
0b97596c93
Replace unwraps with ? 2023-06-29 14:33:32 +02:00
Clément Renault
a0e0fce677
Simplify a Rust lifetime trick 2023-06-29 14:33:32 +02:00
Clément Renault
b951830461
Add more tests 2023-06-29 14:33:32 +02:00
Kerollmops
b132e859f7
Make clippy happy 2023-06-29 14:33:31 +02:00
Kerollmops
9917bf046a
Move the sortFacetValuesBy in the faceting settings 2023-06-29 14:33:31 +02:00
Kerollmops
d9fea0143f
Make Clippy happy 2023-06-29 14:33:31 +02:00
Kerollmops
a385642ec3
Replace the BTreeMap by an IndexMap to return values in order 2023-06-29 14:33:31 +02:00
Kerollmops
34b2e98fe9
Expose a sortFacetValuesBy parameter to the user 2023-06-29 14:33:00 +02:00
Kerollmops
80bbd4b6f3
Clean and make the facet order configurable internally 2023-06-29 14:31:17 +02:00
Kerollmops
f42bef2f66
Make the search to always return the facets ordered by count 2023-06-29 14:31:17 +02:00
Kerollmops
bd3c026406
First to-test version of the algorithm 2023-06-29 14:31:17 +02:00
Kerollmops
84f8938f33
Rename facet distribution to be explicit on the order to find them 2023-06-29 14:31:15 +02:00
Clément Renault
efbe7ce78b
Clean the facet string FSTs when we clear the documents 2023-06-28 15:36:32 +02:00
Kerollmops
26f0fa678d
Change the error message when a facet is not searchable 2023-06-28 15:06:09 +02:00
Kerollmops
60ddd53439
Return one of the original facet values when doing a facet search 2023-06-28 15:06:09 +02:00
Kerollmops
2bcd8d2983
Make sure the facet queries are normalized 2023-06-28 15:06:09 +02:00
Kerollmops
41760a9306
Introduce a new invalid_facet_search_facet_name error code 2023-06-28 15:06:07 +02:00
Kerollmops
e9a3029c30
Use the right field id to write the string facet values FST 2023-06-28 15:01:51 +02:00
Kerollmops
ed0ff47551
Return an empty list of results if attribute is set as filterable 2023-06-28 15:01:51 +02:00
Clément Renault
e1b8fb48ee
Use the minWordSizeForTypos index settings 2023-06-28 15:01:51 +02:00
Clément Renault
87e22e436a
Fix compilation issues 2023-06-28 15:01:51 +02:00
Clément Renault
0252cfe8b6
Simplify the placeholder search of the facet-search route 2023-06-28 15:01:50 +02:00
Clément Renault
f35ad96afa
Use the disableOnAttributes parameter on the facet-search route 2023-06-28 15:01:50 +02:00
Clément Renault
2ceb781c73
Use the disableOnWords parameter on the facet-search route 2023-06-28 15:01:50 +02:00
Clément Renault
7bd67543dd
Support the typoTolerant.enabled parameter 2023-06-28 15:01:50 +02:00
Clément Renault
8e86eb91bb
Log an error when a facet value is missing from the database 2023-06-28 15:01:50 +02:00
Clément Renault
55c17aa38b
Rename the SearchForFacetValues struct 2023-06-28 15:01:50 +02:00
Clément Renault
aadbe88048
Return an internal error when a field id is missing 2023-06-28 15:01:50 +02:00
Clément Renault
f36de2115f
Make clippy happy 2023-06-28 15:01:50 +02:00
Clément Renault
702041b7e1
Improve the returned errors from the facet-search route 2023-06-28 15:01:48 +02:00
Clément Renault
a05074e675
Fix the max number of facets to be returned to 100 2023-06-28 14:58:42 +02:00
Clément Renault
93f30e65a9
Return the correct response JSON object from the facet-search route 2023-06-28 14:58:42 +02:00
Clément Renault
e81809aae7
Make the search for facet work 2023-06-28 14:58:41 +02:00
Kerollmops
ce7e7f12c8
Introduce the facet search route 2023-06-28 14:58:41 +02:00
Kerollmops
addb21f110
Restrict the number of facet search results to 1000 2023-06-28 14:58:41 +02:00
Kerollmops
c34de05106
Introduce the SearchForFacetValue struct 2023-06-28 14:58:41 +02:00
Clément Renault
15a4c05379
Store the facet string values in multiple FSTs 2023-06-28 14:58:41 +02:00
meili-bors[bot]
d4f10800f2
Merge #3834
3834: Define searchable fields at runtime r=Kerollmops a=ManyTheFish

## Summary
This feature allows the end-user to search in one or multiple attributes using the search parameter `attributesToSearchOn`:

```json
{
  "q": "Captain Marvel",
  "attributesToSearchOn": ["title"]
}
```

This feature act like a filter, forcing Meilisearch to only return the documents containing the requested words in the attributes-to-search-on. Note that, with the matching strategy `last`, Meilisearch will only ensure that the first word is in the attributes-to-search-on, but, the retrieved documents will be ordered taking into account the word contained in the attributes-to-search-on. 

## Trying the prototype

A dedicated docker image has been released for this feature:

#### last prototype version:

```bash
docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-1
```

#### others prototype versions:

```bash
docker pull getmeili/meilisearch:prototype-define-searchable-fields-at-search-time-0
```

## Technical Detail

The attributes-to-search-on list is given to the search context, then, the search context uses the `fid_word_docids`database using only the allowed field ids instead of the global `word_docids` database. This is the same for the prefix databases.
The database cache is updated with the merged values, meaning that the union of the field-id-database values is only made if the requested key is missing from the cache.

### Relevancy limits

Almost all ranking rules behave as expected when ordering the documents.
Only `proximity` could miss-order documents if all the searched words are in the restricted attribute but a better proximity is found in an ignored attribute in a document that should be ranked lower. I put below a failing test showing it:
```rust
#[actix_rt::test]
async fn proximity_ranking_rule_order() {
    let server = Server::new().await;
    let index = index_with_documents(
        &server,
        &json!([
        {
            "title": "Captain super mega cool. A Marvel story",
            // Perfect distance between words in an ignored attribute
            "desc": "Captain Marvel",
            "id": "1",
        },
        {
            "title": "Captain America from Marvel",
            "desc": "a Shazam ersatz",
            "id": "2",
        }]),
    )
    .await;

    // Document 2 should appear before document 1.
    index
        .search(json!({"q": "Captain Marvel", "attributesToSearchOn": ["title"], "attributesToRetrieve": ["id"]}), |response, code| {
            assert_eq!(code, 200, "{}", response);
            assert_eq!(
                response["hits"],
                json!([
                    {"id": "2"},
                    {"id": "1"},
                ])
            );
        })
        .await;
}
```

Fixing this would force us to create a `fid_word_pair_proximity_docids` and a `fid_word_prefix_pair_proximity_docids` databases which may multiply the keys of `word_pair_proximity_docids` and `word_prefix_pair_proximity_docids` by the number of attributes in the searchable_attributes list. If we think we should fix this test, I'll suggest doing it in another PR.

## Related

Fixes #3772

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-06-28 08:19:23 +00:00
Clément Renault
30741d17fa
Change the TODO message 2023-06-27 12:32:43 +02:00
Clément Renault
ebad1f396f
Remove the useless euclidean distance implementation 2023-06-27 12:32:43 +02:00
Clément Renault
29d8268c94
Fix the vector query part by using the correct universe 2023-06-27 12:32:43 +02:00
Clément Renault
63bfe1cee2
Ignore when there are too many vectors 2023-06-27 12:32:43 +02:00
Kerollmops
7c2f5f77b8
Make clippy and fmt happy 2023-06-27 12:32:42 +02:00
Kerollmops
66b8cfd8c8
Introduce a way to store the HNSW on multiple LMDB entries 2023-06-27 12:32:42 +02:00
Kerollmops
ff3664431f
Make rustfmt happy 2023-06-27 12:32:42 +02:00
Kerollmops
531748c536
Return a user error when the _vectors type is invalid 2023-06-27 12:32:41 +02:00
Kerollmops
7aa1275337
Display the _semanticSimilarity even if the _vectors field is not displayed 2023-06-27 12:32:41 +02:00
Kerollmops
737aec1705
Expose an _semanticSimilarity as a dot product in the documents 2023-06-27 12:32:41 +02:00
Kerollmops
3e3c743392
Make Rustfmt happy 2023-06-27 12:32:41 +02:00
Kerollmops
5c5a4e075d
Make clippy happy 2023-06-27 12:32:41 +02:00
Kerollmops
ab9f2269aa
Normalize the vectors during indexation and search 2023-06-27 12:32:41 +02:00
Kerollmops
321ec5f3fa
Accept multiple vectors by documents using the _vectors field 2023-06-27 12:32:40 +02:00
Kerollmops
717d4fddd4
Remove the unused distance 2023-06-27 12:32:40 +02:00
Kerollmops
a7e0f0de89
Introduce a new error message for invalid vector dimensions 2023-06-27 12:32:40 +02:00
Kerollmops
3b560ef7d0
Make clippy happy 2023-06-27 12:32:40 +02:00
Kerollmops
2cf747cb89
Fix the tests 2023-06-27 12:32:40 +02:00
Kerollmops
3c31e1cdd1
Support more pages but in an ugly way 2023-06-27 12:32:39 +02:00
Kerollmops
23eaaf1001
Change the name of the distance module 2023-06-27 12:32:39 +02:00
Kerollmops
c2a402f3ae
Implement an ugly deletion of values in the HNSW 2023-06-27 12:32:39 +02:00
Kerollmops
436a10bef4
Replace the euclidean with a dot product 2023-06-27 12:32:39 +02:00
Kerollmops
8debf6fe81
Use a basic euclidean distance function 2023-06-27 12:32:39 +02:00
Kerollmops
c79e82c62a
Move back to the hnsw crate
This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.
2023-06-27 12:32:39 +02:00
Kerollmops
aca305bb77
Log more to make sure we insert vectors in the hgg data-structure 2023-06-27 12:32:38 +02:00
Kerollmops
5816008139
Introduce an optimized version of the euclidean distance function 2023-06-27 12:32:38 +02:00
Kerollmops
268a9ef416
Move to the hgg crate 2023-06-27 12:32:38 +02:00
Clément Renault
642b0f3a1b
Expose a new vector field on the search route 2023-06-27 12:32:38 +02:00
Clément Renault
4571e512d2
Store the vectors in an HNSW in LMDB 2023-06-27 12:32:38 +02:00
Clément Renault
7ac2f1489d
Extract the vectors from the documents 2023-06-27 12:32:37 +02:00
Clément Renault
34349faeae
Create a new _vector extractor 2023-06-27 12:32:37 +02:00
ManyTheFish
63ca25290b Take into account small Review requests 2023-06-26 14:56:19 +02:00
ManyTheFish
59f64a5256 Return an error when an attribute is not searchable 2023-06-26 14:56:19 +02:00
ManyTheFish
42709ea9a5 Fix clippy warnings 2023-06-26 14:55:57 +02:00
ManyTheFish
fb8fa07169 Restrict field ids in search context 2023-06-26 14:55:57 +02:00
ManyTheFish
0ccf1e2e40 Allow the search cache to store owned values 2023-06-26 14:55:57 +02:00
ManyTheFish
9680e1e41f Introduce a BytesDecodeOwned trait in heed_codecs 2023-06-26 14:55:14 +02:00
ManyTheFish
461b5118bd Add API search setting 2023-06-26 14:55:14 +02:00
Tamo
a3716c5678 add the new parameter to the search builder of milli 2023-06-26 14:55:14 +02:00
meili-bors[bot]
2d34005965
Merge #3821
3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #3771 

## What does this PR do?

### User standpoint

<details>
<summary>Request ranking score</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScore": true, 
  "limit": 10,
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>


<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScore": 0.6657594086021505
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScore": 0.6654905913978495
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Batman",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Begins",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Returns",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Forever",
      "_rankingScore": 0.11553030303030302
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 12,
  "limit": 10,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>



- If adding a `showRankingScore` parameter to the search query, then documents returned by a search now contain an additional field `_rankingScore` that is a float bigger than 0 and lower or equal to 1.0. This field represents the relevancy of the document, relatively to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents not matching the query at all). 
  - The `sort` and `geosort` ranking rules do not influence the `_rankingScore`.

<details>
<summary>Request detailed ranking scores</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScoreDetails": true, 
  "limit": 5, 
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>

<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8064516129032258,
          "score": 0.8064516129032258
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.7419354838709677,
          "score": 0.7419354838709677
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Angel and the Badman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 1,
          "maxMatchingWords": 4,
          "score": 0.25
        },
        "typo": {
          "order": 1,
          "typoCount": 0,
          "maxTypoCount": 1,
          "score": 1.0
        },
        "proximity": {
          "order": 2,
          "score": 1.0
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8181818181818182,
          "score": 0.8181818181818182
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.3333333333333333
        }
      }
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 9,
  "limit": 5,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>

- If adding a `showRankingScoreDetails` parameter to the search query, then the returned documents will now contain an additional `_rankingScoreDetails` field that is a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields:
  - `order`: a number indicating the order this rule was applied (0 is the first applied ranking rule)
  - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule.
  - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document.
- If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute **not** part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`.  

- Search queries that are part of a `multi-search` requests are modified in the same way and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists and providing a unified list of results between multiple queries is not in the scope of this PR (but is unblocked by this PR and can be done manually by using the scores of the various documents). 

### Implementation standpoint

- Fix difference in how the position of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check.
- Fix the id reported by the sort ranking rule (very minor)
- Change how the cost of removing words is computed. After this change the cost no longer works for any other ranking rule than `words`. Also made `words` have a cost of 0 such that the entire cost of `words` is given by the termRemovalStrategy. The new cost computation makes it so the score is computed in a way consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words.
- When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score.
- add virtual conditions to fid and position to always have the max cost: this ensures that the score is independent from the dataset
- the Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to the position 0.
- modified proximity ranking rule cost calculation so that the cost is 0 for documents that are perfectly matching the query
- Add a new `milli::score_details` module containing all the types that are involved in score computation.
- Make it so a bucket of result now contains a `ScoreDetails` and changed the ranking rules to produce their `ScoreDetails`.
- Expose the scores in the REST API.
- Add very light analytics for scoring.
- Update the search tests to add the expected scores.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-26 09:32:43 +00:00
meili-bors[bot]
040b5a5b6f
Merge #3842
3842: fix some typos r=dureuill a=cuishuang

# Pull Request

## Related issue
Fixes #<issue_number>

## What does this PR do?
- fix some typos

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: cui fliter <imcusg@gmail.com>
2023-06-22 18:01:10 +00:00
cui fliter
530a3e2df3 fix some typos
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-06-22 21:59:00 +08:00
Louis Dureuil
d26e9a96ec
Add score details to new search tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
49c8bc4de6
Fix tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
da833eb095
Expose the scores and detailed scores in the API 2023-06-22 12:39:14 +02:00
Louis Dureuil
701d44bd91
Store the scores for each bucket
Remove optimization where ranking rules are not executed on buckets of a single document
when the score needs to be computed
2023-06-22 12:39:14 +02:00
Louis Dureuil
c621a250a7
Score for graph based ranking rules
Count phrases in matchingWords and maxMatchingWords
2023-06-22 12:39:14 +02:00
Louis Dureuil
8939e85f60
Add rank_to_score for graph based ranking rules 2023-06-22 12:39:14 +02:00
Louis Dureuil
fa41d2489e
Score for sort 2023-06-22 12:39:14 +02:00
Louis Dureuil
59c5b992c2
Score for geosort 2023-06-22 12:39:14 +02:00
Louis Dureuil
2ea8194c18
Score for exact_attributes 2023-06-22 12:39:14 +02:00
Louis Dureuil
421df64602
RankingRuleOutput now contains a Score 2023-06-22 12:39:14 +02:00
Louis Dureuil
c0fca6f884
Add score_details 2023-06-22 12:39:14 +02:00
Louis Dureuil
f050634b1e
add virtual conditions to fid and position to always have the max cost 2023-06-20 10:07:18 +02:00
Louis Dureuil
becf1f066a
Change how the cost of removing words is computed 2023-06-20 09:45:43 +02:00
Louis Dureuil
701d299369
Remove out-of-date comment 2023-06-20 09:45:42 +02:00
Louis Dureuil
a20e4d447c
Position now takes into account the distance to the position of the word in the query
it used to be based on the distance to the position 0
2023-06-20 09:45:42 +02:00
Louis Dureuil
af57c3c577
Proximity costs 0 for documents that are perfectly matching 2023-06-20 09:45:42 +02:00
Louis Dureuil
0c40ef6911
Fix sort id 2023-06-20 09:45:42 +02:00
meili-bors[bot]
45636d315c
Merge #3670
3670: Fix addition deletion bug r=irevoire a=irevoire

The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enable the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed.

It fixes https://github.com/meilisearch/meilisearch/issues/3440.

### What was happening?

The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents.
Instead of doing a clear merge between the external document IDs of the DB and the one returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible.
The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on if the soft deletion was used or not. Here is a list of the changes introduced:
1. We now do a clear distinction between the `new_external_documents_ids` coming from the transform and only held on RAM and the `external_documents_ids` coming from the DB.
2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand
3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform.

### Other things introduced in this  PR

Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug.
It's not perfect, but it's easy to improve in the future.
It'll also run for as long as possible on every merge on the main branch.

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-19 09:09:30 +00:00
meili-bors[bot]
cb9d78fc7f
Merge #3835
3835: Add more documentation to graph-based ranking rule algorithms + comment cleanup r=Kerollmops a=loiclec

In addition to documenting the `cheapest_path.rs` file, this PR cleans up a few outdated comments as well as some TODOs. These TODOs have been moved to https://github.com/meilisearch/meilisearch/issues/3776



Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-15 15:30:24 +00:00
Louis Dureuil
e0c4682758
Fix tests 2023-06-14 13:30:52 +02:00
Louis Dureuil
d9b4b39922
Add trailing pipe to the snapshots so it doesn't end with trailing whitespace 2023-06-14 13:30:52 +02:00
Loïc Lecrenier
2da86b31a6 Remove comments and add documentation 2023-06-14 12:39:42 +02:00
Louis Dureuil
a2a3b8c973
Fix offset difference between query and indexing for hard separators 2023-06-08 12:07:12 +02:00
Louis Dureuil
9f37b61666
DB BREAKING: raise limit of word count from 10 to 30. 2023-06-08 12:07:12 +02:00
Louis Dureuil
c15c076da9
DB BREAKING: Count the number of words in field_id_word_count_docids 2023-06-08 12:07:11 +02:00
Loïc Lecrenier
8628a0c856 Remove docid_word_positions_db + fix deletion bug
That would happen when a word was deleted from all exact attributes
but not all regular attributes.
2023-06-07 10:52:50 +02:00
Clémentine U. - curqui
f3e2f79290
Merge branch 'main' into tmp-release-v1.2.0 2023-06-05 18:36:28 +02:00
Kerollmops
da04edff8c
Better use deserialize_unchecked_from to reduce the deserialization time 2023-05-30 14:58:30 +02:00
Tamo
23a5b45ebf
drop the old fuzz file 2023-05-29 14:02:37 +02:00
Tamo
6c6387d05e
move the fuzzer to its own crate 2023-05-29 12:27:39 +02:00
Louis Dureuil
1dfc4038ab
Add test that fails before PR and passes now 2023-05-29 11:58:26 +02:00
Louis Dureuil
73198179f1
Consistently use wrapping add to avoid overflow in debug when query starts with a separator 2023-05-29 11:54:12 +02:00
meili-bors[bot]
2e49d6aec1
Merge #3768
3768: Fix bugs in graph-based ranking rules + make `words` a graph-based ranking rule r=dureuill a=loiclec

This PR contains three changes:

## 1. Don't call the `words` ranking rule if the term matching strategy is `All`

This is because the purpose of `words` is only to remove nodes from the query graph. It would never do any useful work when the matching strategy was `All`. Remember that the universe was already computed before by computing all the docids corresponding to the "maximally reduced" query graph, which, in the case of `All`, is equal to the original graph.

## 2. The `words` ranking rule is replaced by a graph-based ranking rule. 

This is for three reasons:

1. **performance**: graph-based ranking rules benefit from a lot of optimisations by default, which ensures that they are never too slow. The previous implementation of `words` could call `compute_query_graph_docids` many times if some words had to be removed from the query, which would be quite expensive. I was especially worried about its performance in cases where it is placed right after the `sort` ranking rule. Furthermore, `compute_query_graph_docids` would clone a lot of bitmaps many times unnecessarily.

2. **consistency**: every other ranking rule (except `sort`) is graph-based. It makes sense to implement `words` like that as well. It will automatically benefit from all the features, optimisations, and bug fixes that all the other ranking rules get.

3. **surfacing bugs**: as the first ranking rule to be called (most of the time), I'd like `words` to behave the same as the other ranking rules so that we can quickly detect bugs in our graph algorithms. This actually already happened, which is why this PR also contains a bug fix.

## 3. Fix the `update_all_costs_before_nodes` function

It is a bit difficult to explain what was wrong, but I'll try. The bug happened when we had graphs like:
<img width="730" alt="Screenshot 2023-05-16 at 10 58 57" src="https://github.com/meilisearch/meilisearch/assets/6040237/40db1a68-d852-4e89-99d5-0d65757242a7">
and we gave the node `is` as argument.

Then, we'd walk backwards from the node breadth-first. We'd update the costs of:
1. `sun`
2. `thesun`
3. `start`
4. `the`

which is an incorrect order. The correct order is:

1. `sun`
2. `thesun`
3. `the`
4. `start`

That is, we can only update the cost of a node when all of its successors have either already been visited or were not affected by the update to the node passed as argument. To solve this bug, I factored out the graph-traversal logic into a `traverse_breadth_first_backward` function.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-05-23 13:28:08 +00:00
Louis Dureuil
51043f78f0
Remove trailing whitespace 2023-05-23 15:27:25 +02:00
Louis Dureuil
a490a11325
Add explanatory comment on the way we're recomputing costs 2023-05-23 15:24:24 +02:00
Tamo
002f42875f fix the fuzzer 2023-05-23 11:42:40 +02:00
Tamo
22213dc604
push the fuzzer 2023-05-23 09:14:26 +02:00
Tamo
602ad98cb8 improve the way we handle the fsts 2023-05-22 11:15:14 +02:00
Tamo
7f619ff0e4 get rids of the now unused soft_deletion_used parameter 2023-05-22 10:33:49 +02:00
Tamo
4391cba6ca
fix the addition + deletion bug 2023-05-17 18:28:57 +02:00
meili-bors[bot]
101f5a20d2
Merge #3757
3757: Adjust the cost of edges in the `position` ranking rule by bucketing positions more aggressively r=loiclec a=loiclec

This PR significantly improves the performance of the `position` ranking rule when:
1. a query contains many words
2. the `position` ranking rule needs to be called many times
3. the score of the documents according to `position` is high

These conditions greatly increase:
1. the number of edge traversals that are needed to find a valid path from the `start` node to the `end` node
2. the number of edges that need to be deleted from the graph, and therefore the number of times that we need to recompute all the possible costs from START to END

As a result, a majority of the search time is spent in `visit_condition`, `visit_node`, and `update_all_costs_before_node`. This is frustrating because it often happens when the "universe" given to the rule consists of only a handful of document ids.

By limiting the number of possible edges between two nodes from `20` to `10`, we:
1. reduce the number of possible costs from START to END
2. reduce the number of edges that will be deleted 
3. make it faster to update the costs after deleting an edge
4. reduce the number of buckets that need to be computed

In terms of relevancy, I don't think we lose or gain much. We still prefer terms that are in a lower positions, with decreasing precision as we go further. The previous choice of bucketing wasn't chosen in a principled way, and neither is this one. They both "feel" right to me.


Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
2023-05-17 11:43:59 +00:00
Loïc Lecrenier
ec8f685d84 Fix bug in cheapest path algorithm 2023-05-16 17:01:30 +02:00
Loïc Lecrenier
5758268866 Don't compute split_words for phrases 2023-05-16 17:01:18 +02:00
Loïc Lecrenier
3e19702de6 Update snapshot tests 2023-05-16 12:22:46 +02:00
meili-bors[bot]
1e762d151f
Merge #3755
3755: Re-add final dot r=curquiza a=ManyTheFish

I removed the final dot of the error message in my last PR, this one re-adds it.

related to https://github.com/meilisearch/meilisearch/pull/3749

> Oups 😬 

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-05-16 10:10:58 +00:00
Loïc Lecrenier
f6524a6858 Adjust costs of edges in position ranking rule
To ensure good performance
2023-05-16 11:28:56 +02:00
meili-bors[bot]
65ad8cce36
Merge #3741
3741: Add ngram support to the highlighter r=ManyTheFish a=loiclec

This PR fixes a bug introduced by the search refactor, where ngrams were not highlighted. 

The solution was to add the ngrams to the vector of `LocatedQueryTerm` that is given to the `MatchingWords` structure.

Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-16 09:03:31 +00:00
ManyTheFish
42650f82e8 Re-add final dot 2023-05-16 10:57:26 +02:00
Loïc Lecrenier
a37da36766 Implement words as a graph-based ranking rule and fix some bugs 2023-05-16 10:42:11 +02:00
Loïc Lecrenier
85d96d35a8 Highlight ngram matches as well 2023-05-16 10:39:36 +02:00
meili-bors[bot]
bf66e97b48
Merge #3749
3749: Fix back: sort error message r=ManyTheFish a=ManyTheFish

This PR reintroduces the error message modified in https://github.com/meilisearch/milli/pull/375.
However, this added double-quotes around `sort` in the message. I don't think another message contains double-quotes, so I have added a separate commit replacing the double-quotes with back-ticks, which seems more consistent with the other error messages, this last change can be reverted easily.

## Detailed changes
#### v1.2-rc0
```
The sort ranking rule must be specified in the ranking rules settings to use the sort parameter at search time.
```
#### [Reintroduce fix (previous and expected behavior)](23d1c86825)
```
You must specify where "sort" is listed in the rankingRules setting to use the sort parameter at search time
```
#### [Replace double-quotes with back-ticks (my suggestion)](4d691d071a)
```
You must specify where `sort` is listed in the rankingRules setting to use the sort parameter at search time
```

## Related

Fixes #3722

## Reviewers

- technical review: `@irevoire`
- to validate the replacement: `@macraig`

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-05-15 14:55:51 +00:00
Kerollmops
1a79fd0c3c
Use the new heed v0.12.6 2023-05-15 11:42:30 +02:00
Kerollmops
f759ec7fad
Expose a flag to enable the MDB_WRITEMAP flag 2023-05-15 11:38:43 +02:00
ManyTheFish
4d691d071a Change double-quotes by back-ticks in sort error message 2023-05-15 11:10:36 +02:00
ManyTheFish
23d1c86825 Re-introduce the sort error message fix 2023-05-15 11:07:23 +02:00
Kerollmops
c4a40e7110
Use the writemap flag to reduce the memory usage 2023-05-15 10:15:33 +02:00
Loïc Lecrenier
4d352a21ac Compute split words derivations of terms that don't accept typos 2023-05-10 13:31:19 +02:00
Loïc Lecrenier
3625389057 Highlight ngram matches as well 2023-05-08 15:35:41 +02:00
meili-bors[bot]
eace6df91b
Merge #3726
3726: Fix prefix highlighting r=loiclec a=ManyTheFish

The prefix queries were not properly highlighted, this PR now highlights only the start of a word when it matched with a prefix

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
2023-05-08 07:46:46 +00:00
Loïc Lecrenier
83ab8cf4e5 Remove dbg!(..) expression in highlighter tests 2023-05-08 09:45:23 +02:00
ManyTheFish
cd2573fcc3 Fix prefix highlighting 2023-05-04 16:53:50 +02:00
meili-bors[bot]
9f7981df28
Merge #3687
3687: Allow to disable specialized tokenizations (again) r=Kerollmops a=jirutka

In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai` feature flags to allow melisearch to be built without huge specialed tokenizations that took up 90% of the melisearch binary size. Unfortunately, due to some recent changes, this doesn't work anymore. The problem lies in excessive use of the `default` feature flag, which infects the dependency graph.

Instead of adding `default-features = false` here and there, it's easier and more future-proof to not declare `default` in `milli` and `meilisearch-types`. I've renamed it to `all-tokenizers`, which also makes it a bit clearer what it's about.


Co-authored-by: Jakub Jirutka <jakub@jirutka.cz>
2023-05-04 14:48:01 +00:00
Jakub Jirutka
e615fa5ec6 Fix unused_imports warning in milli when japanese is not enabled 2023-05-04 15:46:11 +02:00
Jakub Jirutka
13f1277637 Allow to disable specialized tokenizations (again)
In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai`
feature flags to allow melisearch to be built without huge specialed
tokenizations that took up 90% of the melisearch binary size.
Unfortunately, due to some recent changes, this doesn't work anymore.
The problem lies in excessive use of the `default` feature flag, which
infects the dependency graph.

Instead of adding `default-features = false` here and there, it's easier
and more future-proof to not declare `default` in `milli` and
`meilisearch-types`. I've renamed it to `all-tokenizers`, which also
makes it a bit clearer what it's about.
2023-05-04 15:45:40 +02:00
Louis Dureuil
a35d3fc708
Add Index::iter_documents 2023-05-04 15:31:54 +02:00
Louis Dureuil
732c52093d
Processing time without autobatching implementation 2023-05-03 17:41:48 +02:00
Louis Dureuil
f8f190cd40
Update exactness tests following charabia camelCase tokenization 2023-05-03 14:45:09 +02:00
Louis Dureuil
3a408e8287
Increase map size for tests following charabia camelCase tokenization 2023-05-03 14:44:48 +02:00
Louis Dureuil
d3e5b10e23
fix nb of dbs 2023-05-03 14:11:20 +02:00
Louis Dureuil
1aaf24ccbf
Cargo fmt 2023-05-03 12:21:58 +02:00
Louis Dureuil
90bc230820
Merge remote-tracking branch 'origin/main' into search-refactor
Conflicts | resolution
----------|-----------
Cargo.lock | added mimalloc
Cargo.toml |  took origin/main version
milli/src/search/criteria/exactness.rs | deleted after checking it was only clippy changes
milli/src/search/query_tree.rs | deleted after checking it was only clippy changes
2023-05-03 12:19:06 +02:00
Louis Dureuil
342c4ff85d
geosort: Remove rtree unwrap 2023-05-03 09:52:16 +02:00
Tamo
c85392ce40
make the descendent geosort fast 2023-05-03 09:13:12 +02:00
Tamo
8875d24a48
deserialize the rtree only when its needed, and keep it in memory once it has been deserialized 2023-05-03 09:13:12 +02:00
Tamo
c470b67fa2
revamp the test to use execute_iterative_and_rtree_returns_the_same 2023-05-03 09:13:12 +02:00
meili-bors[bot]
c0e081cd98
Merge #3702 #3710
3702: Update charabia v0.7.2 r=curquiza a=ManyTheFish

fixes #3701
fixes #3689
fixes #3285 

3710: Updated messages pointing to the docs website r=curquiza a=roy9495

# Pull Request

Fixes partially #3668

## What does this PR do?
- ...Any messages referencing this docs site https://docs.meilisearch.com has been changed to this docs site https://meilisearch.com/docs .
 Thanks.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: TATHAGATA ROY <98920199+roy9495@users.noreply.github.com>
2023-05-02 17:27:57 +00:00
Louis Dureuil
b60840ebff
Remove self.iterating from words 2023-05-02 18:54:23 +02:00
Louis Dureuil
fdc1763838
Use MultiOps for resolve_query_graph 2023-05-02 18:54:09 +02:00
Louis Dureuil
75819bc940
Remove too many arguments on resolve_maximally_reduced_query_graph 2023-05-02 18:53:40 +02:00
Louis Dureuil
7b8cc25625
rename located_query_terms_from_string -> located_query_terms_from_tokens 2023-05-02 18:53:01 +02:00
Loïc Lecrenier
aa63091752 Fix bug in exact_attribute 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
58735d6d8f Fix outdated relevancy test 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
1b514517f5 Fix bug in computation of query term at a position 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
11f814821d Minor cleanup 2023-05-02 10:48:32 +02:00
Loïc Lecrenier
30fb1153cc Speed up graph based ranking rule when a lot of different costs exist 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
3b2c8b9f25 Improve performance of position rr 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
2a7f9adf78 Build query graph more correctly from paths
Update snapshots
2023-05-02 09:59:42 +02:00
Loïc Lecrenier
608ceea440 Fix bug in position rr 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
79001b9c97 Improve performance of the cheapest path finder algorithm 2023-05-02 09:59:42 +02:00
Loïc Lecrenier
59b12fca87 Fix errors, clippy warnings, and add review comments 2023-04-29 11:48:11 +02:00
Loïc Lecrenier
48f5bb1693 Implements the geo-sort ranking rule 2023-04-29 11:02:16 +02:00
Loïc Lecrenier
93188b3c88 Fix indexing of word_prefix_fid_docids 2023-04-29 10:56:48 +02:00
Loïc Lecrenier
bc4efca611 Add more tests for the attribute ranking rule 2023-04-29 10:56:48 +02:00
bors[bot]
414b3fae89
Merge #3571
3571: Introduce two filters to select documents with `null` and empty fields r=irevoire a=Kerollmops

# Pull Request

## Related issue
This PR implements the `X IS NULL`, `X IS NOT NULL`, `X IS EMPTY`, `X IS NOT EMPTY` filters that [this comment](https://github.com/meilisearch/product/discussions/539#discussioncomment-5115884) is describing in a very detailed manner.

## What does this PR do?

### `IS NULL` and `IS NOT NULL`

This PR will be exposed as a prototype for now. Below is the copy/pasted version of a spec that defines this filter.

- `IS NULL` matches fields that `EXISTS` AND `= IS NULL`
- `IS NOT NULL` matches fields that `NOT EXISTS` OR `!= IS NULL`

1. `{"name": "A", "price": null}`
2. `{"name": "A", "price": 10}`
3. `{"name": "A"}`

`price IS NULL` would match 1
`price IS NOT NULL` or `NOT price IS NULL` would match 2,3
`price EXISTS` would match 1, 2
`price NOT EXISTS` or `NOT price EXISTS` would match 3

common query : `(price EXISTS) AND (price IS NOT NULL)` would match 2

### `IS EMPTY` and `IS NOT EMPTY`

- `IS EMPTY` matches Array `[]`, Object `{}`, or String `""` fields that `EXISTS` and are empty
- `IS NOT EMPTY` matches fields that `NOT EXISTS` OR are not empty.

1. `{"name": "A", "tags": null}`
2. `{"name": "A", "tags": [null]}`
3. `{"name": "A", "tags": []}`
4. `{"name": "A", "tags": ["hello","world"]}`
5. `{"name": "A", "tags": [""]}`
6. `{"name": "A"}`
7. `{"name": "A", "tags": {}}`
8. `{"name": "A", "tags": {"t1":"v1"}}`
9. `{"name": "A", "tags": {"t1":""}}`
10. `{"name": "A", "tags": ""}`

`tags IS EMPTY` would match 3,7,10
`tags IS NOT EMPTY` or `NOT tags IS EMPTY` would match 1,2,4,5,6,8,9
`tags IS NULL` would match 1
`tags IS NOT NULL` or `NOT tags IS NULL` would match 2,3,4,5,6,7,8,9,10
`tags EXISTS` would match 1,2,3,4,5,7,8,9,10
`tags NOT EXISTS` or `NOT tags EXISTS` would match 6

common query : `(tags EXISTS) AND (tags IS NOT NULL) AND (tags IS NOT EMPTY)` would match 2,4,5,8,9

## What should the reviewer do?

- Check that I tested the filters
- Check that I deleted the ids of the documents when deleting documents


Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2023-04-27 13:14:00 +00:00
Loïc Lecrenier
899baa0ea5 Update forgotten snapshot from previous commit 2023-04-27 13:43:04 +02:00
Loïc Lecrenier
374095d42c Add tests for stop words and fix a couple of bugs 2023-04-27 13:30:09 +02:00
Louis Dureuil
b41a6cbd7a
Check sort criteria also in placeholder search 2023-04-26 16:28:17 +02:00
Louis Dureuil
c8af572697
Add tests for exact words and exact attributes 2023-04-26 16:13:01 +02:00
ManyTheFish
249053e514 Update feature flags 2023-04-26 14:59:25 +02:00
ManyTheFish
ff2cf2a5ae Update charabia in milli 2023-04-26 14:56:54 +02:00
Loïc Lecrenier
b448aca49c Add more tests for exactness rr 2023-04-26 11:04:18 +02:00
Loïc Lecrenier
55bad07c16 Fix bug in exact_attribute rr implementation 2023-04-26 10:40:05 +02:00
Loïc Lecrenier
3421125a55 Prevent the exactness ranking rule from removing random words
Make it strictly follow the term matching strategy
2023-04-26 09:09:19 +02:00
Clément Renault
14293f6c8f
Make rustfmt happy 2023-04-25 16:55:39 +02:00
Loïc Lecrenier
d3a94e8b25 Fix bugs and add tests to exactness ranking rule 2023-04-25 16:49:08 +02:00
Clément Renault
cfd1b2cc97
Fix the clippy warnings 2023-04-25 16:40:32 +02:00
Kerollmops
a109802d45
Upgrade the incompatible versions of the dependencies 2023-04-24 17:50:57 +02:00
Kerollmops
47b66e49b8
Upgrade the compatible versions of the dependencies 2023-04-24 17:50:52 +02:00
Loïc Lecrenier
8f2e971879 Add tests for "exactness" rr, make correct universe computation 2023-04-24 16:57:34 +02:00
Loïc Lecrenier
d1fdbb63da Make all search tests pass, fix distinctAttribute bug 2023-04-24 12:12:08 +02:00
Loïc Lecrenier
a7a0891210 Update examples 2023-04-24 10:07:49 +02:00
Loïc Lecrenier
84d9c731f8 Fix bug in encoding of word_position_docids and word_fid_docids 2023-04-24 09:59:30 +02:00
Loïc Lecrenier
bd9aba4d77 Add "position" part of the attribute ranking rule 2023-04-13 10:46:09 +02:00
Loïc Lecrenier
8edad8291b Add logger to attribute rr, fix a bug 2023-04-13 10:25:00 +02:00
Kerollmops
d9cebff61c Add a simple test to check that attributes are ranking correctly 2023-04-13 08:27:09 +02:00
Loïc Lecrenier
30f7bd03f6 Fix compiler warning/errors caused by previous merge 2023-04-13 08:27:09 +02:00
Kerollmops
df0d9bb878 Introduce the attribute ranking rule in the list of ranking rules 2023-04-13 08:27:09 +02:00
Kerollmops
5230ddb3ea Resolve the attribute ranking rule conditions 2023-04-13 08:27:09 +02:00
Kerollmops
d6a7c28e4d Implement the attribute ranking rule edge computation 2023-04-13 08:27:09 +02:00
Kerollmops
e55efc419e Introduce a new cache for the words fids 2023-04-13 08:27:09 +02:00
Loïc Lecrenier
644e136aee Merge branch 'search-refactor-typo-attributes' into search-refactor 2023-04-13 08:26:56 +02:00
Louis Dureuil
38b7b31beb Decide to use prefix DB if the word is not an ngram 2023-04-12 16:45:38 +02:00
Louis Dureuil
7a01f20df7 Use word_prefix_docids, make get_word_prefix_docids private 2023-04-12 16:45:38 +02:00
Louis Dureuil
c20c38a7fa Add SearchContext::word_prefix_docids() method 2023-04-12 16:44:43 +02:00
Louis Dureuil
5ab46324c4 Everyone uses the SearchContext::word_docids instead of get_db_word_docids
make get_db_word_docids private
2023-04-12 16:44:43 +02:00
Louis Dureuil
325f17488a Add SearchContext::word_docids() method 2023-04-12 16:37:05 +02:00
Louis Dureuil
e7ff987c46 Update call sites 2023-04-12 16:36:38 +02:00
Louis Dureuil
244003e36f Refactor DB cache to return Roaring Bitmaps directly instead of byte slices 2023-04-12 16:35:48 +02:00
Loïc Lecrenier
1f813a6f3b Simplify implementation of the detailed (=visual) logger 2023-04-12 16:32:53 +02:00
Loïc Lecrenier
96183e804a Simplify the logger 2023-04-12 16:32:53 +02:00
Loïc Lecrenier
7ab48ed8c7 Matching words fixes 2023-04-12 16:21:43 +02:00
Loïc Lecrenier
e7bb8c940f Merge branch 'search-refactor-highlighter' into search-refactor-highlighter-merged 2023-04-11 12:22:34 +02:00
Loïc Lecrenier
8cb85294ef Remove unused import warning 2023-04-07 11:09:30 +02:00
Loïc Lecrenier
d0e9d65025 Fix distinct attribute bugs 2023-04-07 11:09:01 +02:00
Loïc Lecrenier
540a396e49 Fix indexing bug in words_prefix_position 2023-04-07 11:08:39 +02:00
Loïc Lecrenier
a81165f0d8 Merge remote-tracking branch 'origin/main' into search-refactor 2023-04-07 10:15:55 +02:00
Loïc Lecrenier
d6585eb10b Avoid splitting ngrams into their original component words 2023-04-07 10:13:49 +02:00
Loïc Lecrenier
f7d90ad19f Merge remote-tracking branch 'origin/search-refactor-tests-doc' into search-refactor 2023-04-07 10:13:18 +02:00
Louis Dureuil
31630c85d0 exactness graph rr: Add important TODO/FIXME after review 2023-04-06 17:50:39 +02:00
Louis Dureuil
ab09dc0167 exact_attributes: Add TODOs and additional check after review 2023-04-06 17:50:39 +02:00
Louis Dureuil
618c54915d exact_attribute: dedup nodes after sorting them 2023-04-06 17:50:39 +02:00
Loïc Lecrenier
130d2061bd Fix indexing of word_position_docid and fid 2023-04-06 17:50:39 +02:00
Louis Dureuil
66ddee4390 Fix word_position_docids indexing 2023-04-06 17:50:39 +02:00
Louis Dureuil
90a6c01495 Use correct codec in proximity 2023-04-06 17:50:39 +02:00
Louis Dureuil
e58426109a Fix panics and issues in exactness graph ranking rule 2023-04-06 17:50:39 +02:00
Louis Dureuil
f513cf930a Exact attribute with state 2023-04-06 17:50:39 +02:00
Louis Dureuil
8a13ed7e3f Add exactness ranking rules 2023-04-06 17:50:39 +02:00
Louis Dureuil
1b8e4d0301 Add ExactTerm and helper method 2023-04-06 17:50:39 +02:00
Louis Dureuil
996619b22a Increase position by 8 on hard separator when building query terms 2023-04-06 17:50:39 +02:00
Louis Dureuil
2c9822a337 Rename is_multiple_words to is_ngram and zero_typo to exact 2023-04-06 17:50:39 +02:00
Louis Dureuil
7276deee0a Add new db caches 2023-04-06 17:50:39 +02:00
ManyTheFish
f7e7f438f8 Patch prefix match 2023-04-06 17:22:31 +02:00
ManyTheFish
ba8dcc2d78 Fix clippy 2023-04-06 15:50:47 +02:00
Loïc Lecrenier
7ca91ebb71 Merge branch 'search-refactor-exactness' into search-refactor-tests-doc 2023-04-06 15:16:35 +02:00
ManyTheFish
47f6a3ad3d Take into account that a logger need the search context 2023-04-06 15:02:23 +02:00
bors[bot]
b4c01581cd
Merge #3641
3641: Bring back changes from `release v1.1.0` into `main` after v1.1.0 release r=curquiza a=curquiza

Replace https://github.com/meilisearch/meilisearch/pull/3637 since we don't want to pull commits from `main` into `release-v1.1.0` when fixing git conflicts

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: Charlotte Vermandel <charlottevermandel@gmail.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
2023-04-06 12:37:54 +00:00
ManyTheFish
ae17c62e24 Remove warnings 2023-04-06 14:07:18 +02:00
ManyTheFish
a1148c09c2 remove old matcher 2023-04-06 14:00:21 +02:00
ManyTheFish
9c5f64769a Integrate the new Highlighter in the search 2023-04-06 13:58:56 +02:00
ManyTheFish
ebe23b04c9 Make the matcher consume the search context 2023-04-06 12:28:28 +02:00
ManyTheFish
13b7c826c1 add new highlighter 2023-04-06 12:15:37 +02:00
Loïc Lecrenier
5440f43fd3
Fix indexing of word_position_docid and fid 2023-04-05 18:14:00 +02:00
Louis Dureuil
d9460a76f4
Fix word_position_docids indexing 2023-04-05 18:14:00 +02:00
Louis Dureuil
d1ddaa223d
Use correct codec in proximity 2023-04-05 18:14:00 +02:00
Louis Dureuil
f7ecea142e
Fix panics and issues in exactness graph ranking rule 2023-04-05 18:13:46 +02:00
Louis Dureuil
337e75b0e4
Exact attribute with state 2023-04-05 18:12:46 +02:00
Loïc Lecrenier
b5691802a3 Add new tests and fix construction of query graph from paths 2023-04-05 16:31:10 +02:00
Loïc Lecrenier
6e50f23896 Add more search tests 2023-04-05 13:33:23 +02:00
Tamo
597d57bf1d Merge branch 'main' into bring-back-changes-v1.1.0 2023-04-05 11:32:14 +02:00
Loïc Lecrenier
4c8a0179ba Add more search tests 2023-04-05 11:30:49 +02:00
Loïc Lecrenier
c69cbec64a Add more search tests 2023-04-05 11:20:04 +02:00
Loïc Lecrenier
ce328c329d Move bucket sort function to its own module and fix a bug 2023-04-04 18:03:08 +02:00
Loïc Lecrenier
959e4607bb Add more search tests 2023-04-04 18:02:46 +02:00
Louis Dureuil
4b4ffb8ec9
Add exactness ranking rules 2023-04-04 17:12:07 +02:00
Louis Dureuil
3951fe22ab
Add ExactTerm and helper method 2023-04-04 17:09:32 +02:00
Louis Dureuil
4d5bc9df4c
Increase position by 8 on hard separator when building query terms 2023-04-04 17:07:26 +02:00
Louis Dureuil
ec2f8e8040
Rename is_multiple_words to is_ngram and zero_typo to exact 2023-04-04 17:06:07 +02:00
Louis Dureuil
406b8bd248
Add new db caches 2023-04-04 17:04:46 +02:00
Loïc Lecrenier
62b9c6fbee Add search tests 2023-04-04 16:18:22 +02:00
Loïc Lecrenier
b439d36807 Split query_term module into multiple submodules 2023-04-04 15:38:30 +02:00
Loïc Lecrenier
faceb661e3 Add note that a part of the code needs fixing 2023-04-04 15:02:01 +02:00
Loïc Lecrenier
4129d657e2 Simplify query_term module a bit 2023-04-04 15:01:42 +02:00
Filip Bachul
1e6fe71a67 fix clippy warning 2023-04-03 20:18:26 +02:00
Filip Bachul
fddfb37f1f remove unnecessary FilterError:ReservedGeo and FilterError:ReservedGeo 2023-04-03 20:18:26 +02:00
Loïc Lecrenier
3f13608002 Fix computation of ngram derivations 2023-04-03 15:27:49 +02:00
Loïc Lecrenier
4708d9b016 Fix compiler warnings/errors 2023-04-03 10:09:27 +02:00
Clément Renault
0d2e7bcc13 Implement the previous way for the exhaustive distinct candidates 2023-04-03 10:08:10 +02:00
Loïc Lecrenier
55fbfb6124 Merge branch 'search-refactor-located-query-terms' into search-refactor 2023-04-03 10:04:36 +02:00
Loïc Lecrenier
58fe260c72 Allow removing all the terms from a query if it contains a phrase 2023-04-03 09:18:02 +02:00
Loïc Lecrenier
24e5f6f7a9 Don't remove phrases with "last" term matching strategy 2023-04-03 09:17:33 +02:00
Louis Dureuil
9b87c36200 Limit the number of derivations for a single word. 2023-03-31 09:19:18 +02:00
Filip Bachul
1861c69964 fmt 2023-03-30 23:37:26 +02:00
Filip Bachul
cb2b5eb38e handle _geoDistance(x,x) sort error 2023-03-30 23:21:23 +02:00
Filip Bachul
53aa0a1b54 handle _geo(x,x) sort error 2023-03-30 23:17:34 +02:00
Loïc Lecrenier
12b26cd54e Don't remove phrases from the query with term matching strategy Last 2023-03-30 14:54:08 +02:00
Loïc Lecrenier
061b1e6d7c Tiny refactor of query graph remove_nodes method 2023-03-30 14:49:25 +02:00
Loïc Lecrenier
0d6e8b5c31 Fix phrase search bug when the phrase has only one word 2023-03-30 14:48:12 +02:00
Loïc Lecrenier
d48cdc67a0 Fix term matching strategy bugs 2023-03-30 14:01:52 +02:00
Loïc Lecrenier
35c16ad047 Use new term matching strategy logic in words ranking rule 2023-03-30 13:15:43 +02:00
Loïc Lecrenier
2997d1f186 Use new term matching strategy logic in resolve_maximally_reduced_... 2023-03-30 13:12:51 +02:00
Loïc Lecrenier
2a5997fb20 Avoid expensive assert! in bucket sort function 2023-03-30 13:07:17 +02:00
Loïc Lecrenier
ee8a9e0bad Remove outdated sentence in documentation 2023-03-30 12:22:24 +02:00
Loïc Lecrenier
3b0737a092 Fix detailed logger 2023-03-30 12:20:44 +02:00
Loïc Lecrenier
fdd02105ac Graph-based ranking rule + term matching strategy support 2023-03-30 12:19:21 +02:00
Loïc Lecrenier
aa9592455c Refactor the paths_of_cost algorithm
Support conditions that require certain nodes to be skipped
2023-03-30 12:11:11 +02:00
Loïc Lecrenier
01e24dd630 Rewrite proximity ranking rule 2023-03-30 11:59:06 +02:00
Loïc Lecrenier
ae6bb1ce17 Update the ConditionDocidsCache after change to RankingRuleGraphTrait 2023-03-30 11:41:20 +02:00
Loïc Lecrenier
5fd28620cd Build ranking rule graph correctly after changes to trait definition 2023-03-30 11:32:55 +02:00
Loïc Lecrenier
728710d63a Update typo ranking rule to use new query term structure 2023-03-30 11:32:19 +02:00
Loïc Lecrenier
fa81381865 Update the trait requirements of ranking-rule graphs 2023-03-30 11:19:45 +02:00
Loïc Lecrenier
b96a682f16 Update resolve_graph module to work with lazy query terms 2023-03-30 11:10:38 +02:00
Loïc Lecrenier
d0f048c068 Simplify the API of the DatabaseCache 2023-03-30 11:08:17 +02:00
Loïc Lecrenier
223e82a10d Update QueryGraph to use new lazy query terms + build from paths 2023-03-30 11:06:02 +02:00