Commit Graph

8142 Commits

Author SHA1 Message Date
Louis Dureuil
864ad2a23c
Check that vector store feature is enabled 2023-06-27 12:32:42 +02:00
Kerollmops
66fb5c150c
Rename _semanticSimilarity into _semanticScore 2023-06-27 12:32:42 +02:00
Kerollmops
7c2f5f77b8
Make clippy and fmt happy 2023-06-27 12:32:42 +02:00
Kerollmops
66b8cfd8c8
Introduce a way to store the HNSW on multiple LMDB entries 2023-06-27 12:32:42 +02:00
Kerollmops
ff3664431f
Make rustfmt happy 2023-06-27 12:32:42 +02:00
Kerollmops
531748c536
Return a user error when the _vectors type is invalid 2023-06-27 12:32:41 +02:00
Kerollmops
7aa1275337
Display the _semanticSimilarity even if the _vectors field is not displayed 2023-06-27 12:32:41 +02:00
Kerollmops
737aec1705
Expose an _semanticSimilarity as a dot product in the documents 2023-06-27 12:32:41 +02:00
Kerollmops
3e3c743392
Make Rustfmt happy 2023-06-27 12:32:41 +02:00
Kerollmops
5c5a4e075d
Make clippy happy 2023-06-27 12:32:41 +02:00
Kerollmops
ab9f2269aa
Normalize the vectors during indexation and search 2023-06-27 12:32:41 +02:00
Kerollmops
321ec5f3fa
Accept multiple vectors by documents using the _vectors field 2023-06-27 12:32:40 +02:00
Kerollmops
1b2923f7c0
Return the vector in the output of the search routes 2023-06-27 12:32:40 +02:00
Kerollmops
717d4fddd4
Remove the unused distance 2023-06-27 12:32:40 +02:00
Kerollmops
a7e0f0de89
Introduce a new error message for invalid vector dimensions 2023-06-27 12:32:40 +02:00
Kerollmops
3b560ef7d0
Make clippy happy 2023-06-27 12:32:40 +02:00
Kerollmops
2cf747cb89
Fix the tests 2023-06-27 12:32:40 +02:00
Kerollmops
3c31e1cdd1
Support more pages but in an ugly way 2023-06-27 12:32:39 +02:00
Kerollmops
23eaaf1001
Change the name of the distance module 2023-06-27 12:32:39 +02:00
Kerollmops
c2a402f3ae
Implement an ugly deletion of values in the HNSW 2023-06-27 12:32:39 +02:00
Kerollmops
436a10bef4
Replace the euclidean with a dot product 2023-06-27 12:32:39 +02:00
Kerollmops
8debf6fe81
Use a basic euclidean distance function 2023-06-27 12:32:39 +02:00
Kerollmops
c79e82c62a
Move back to the hnsw crate
This reverts commit 7a4b6c065482f988b01298642f4c18775503f92f.
2023-06-27 12:32:39 +02:00
Kerollmops
aca305bb77
Log more to make sure we insert vectors in the hgg data-structure 2023-06-27 12:32:38 +02:00
Kerollmops
5816008139
Introduce an optimized version of the euclidean distance function 2023-06-27 12:32:38 +02:00
Kerollmops
268a9ef416
Move to the hgg crate 2023-06-27 12:32:38 +02:00
Clément Renault
642b0f3a1b
Expose a new vector field on the search route 2023-06-27 12:32:38 +02:00
Clément Renault
cad90e8cbc
Add a vector field to the search routes 2023-06-27 12:32:38 +02:00
Clément Renault
4571e512d2
Store the vectors in an HNSW in LMDB 2023-06-27 12:32:38 +02:00
Clément Renault
7ac2f1489d
Extract the vectors from the documents 2023-06-27 12:32:37 +02:00
Clément Renault
34349faeae
Create a new _vector extractor 2023-06-27 12:32:37 +02:00
meili-bors[bot]
f105df6599
Merge #3850
3850: Experimental features r=Kerollmops a=dureuill

# Pull Request

## Related issue

- Fixes https://github.com/meilisearch/meilisearch/issues/3857
- Related to https://github.com/meilisearch/meilisearch/issues/3771

## What does this PR do?

### Example

<details>
<summary>Using the feature to enable `scoreDetails`</summary>

```json
❯ curl \
  -X POST 'http://localhost:7700/indexes/index-word-count-10-count/search' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "q": "Batman", "limit": 1, "showRankingScoreDetails": true, "attributesToRetrieve": ["title"]}' | jsonxf

{
  "message": "Computing score details requires enabling the `score details` experimental feature. See https://github.com/meilisearch/product/discussions/674",
  "code": "feature_not_enabled",
  "type": "invalid_request",
  "link": "https://docs.meilisearch.com/errors#feature_not_enabled"
}
```

```json
❯ curl \
  -X PATCH 'http://localhost:7700/experimental-features/' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "scoreDetails": true
  }'
{"scoreDetails":true,"vectorSearch":false}
```

```json
❯ curl \
  -X POST 'http://localhost:7700/indexes/index-word-count-10-count/search' \
  -H 'Content-Type: application/json' \
  --data-binary '{ "q": "Batman", "limit": 1, "showRankingScoreDetails": true, "attributesToRetrieve": ["title"]}' | jsonxf
{
  "hits": [
    {
      "title": "Batman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 1,
          "maxMatchingWords": 1,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 0,
          "maxTypoCount": 1,
          "score": 1.0
        },
        "proximity": {
          "order": 2,
          "score": 1.0
        },
        "attribute": {
          "order": 3,
          "attribute_ranking_order_score": 1.0,
          "query_word_distance_score": 1.0,
          "score": 1.0
        },
        "exactness": {
          "order": 4,
          "matchType": "exactMatch",
          "score": 1.0
        }
      }
    }
  ],
  "query": "Batman",
  "processingTimeMs": 3,
  "limit": 1,
  "offset": 0,
  "estimatedTotalHits": 46
}
```


</details>

### User standpoint

- Add new `/experimental-features` routes (GET/POST/PATCH/DELETE) to switch some of the experimental features on or off; the setting persists across instance restarts
- Use these new routes to turn the following experimental features on or off:
  - vector store **TODO:** fill in issue
  - score details (related to https://github.com/meilisearch/meilisearch/issues/3771)
- Apply the same feature availability check and error message to the Prometheus metrics experimental feature
- Save the enabled features in dumps and restore them when importing a dump
- **TODO:** tests:
  - Test new security permissions (do they allow access with ALL, do they prevent access when missing)
  - Test dump behavior, in particular ability to import existing v6 dumps
  - Test basic behavior when calling the route

### Implementation standpoint

- New DB "experimental-features"
- Dumps now save the state of that new DB as an `experimental-features.json` file, which is loaded back when importing the dump. This doesn't change the dump version: the file is optional, and its absence will not cause importing the dump to fail
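
A minimal sketch of the "optional file" behavior described above, written in Python for illustration only (Meilisearch itself is written in Rust, and this is not its code). The function name `load_experimental_features` and the choice of `False` as the default for both flags are assumptions of this sketch; the two flag names come from the PATCH response shown in the example.

```python
import json
from pathlib import Path

# Hypothetical defaults: both flag names appear in the PATCH response above,
# but treating "off" as the default is an assumption of this sketch.
DEFAULT_FEATURES = {"scoreDetails": False, "vectorSearch": False}


def load_experimental_features(dump_dir: Path) -> dict:
    """Return the feature flags stored in a dump, or the defaults if the file is absent."""
    path = dump_dir / "experimental-features.json"
    if not path.exists():
        # Older dumps have no such file: fall back to defaults instead of failing.
        return dict(DEFAULT_FEATURES)
    with path.open(encoding="utf-8") as f:
        stored = json.load(f)
    # Unknown keys are ignored; missing keys keep their default value.
    return {name: bool(stored.get(name, default)) for name, default in DEFAULT_FEATURES.items()}
```

Because a missing file simply yields the defaults, a dump written before this change can presumably still be imported without bumping the dump version.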

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-26 15:13:43 +00:00
Louis Dureuil
13e9b4c2e5
Add dump support 2023-06-26 16:29:43 +02:00
Louis Dureuil
5a83cecb0f
fix tests 2023-06-26 16:29:43 +02:00
Louis Dureuil
cca6e47ec1
Errors when GETting metrics without the feature gate 2023-06-26 16:29:43 +02:00
Louis Dureuil
6196a53668
Gate score_details behind a runtime experimental feature flag 2023-06-26 16:29:43 +02:00
Louis Dureuil
bb6448dc2e
Compute instance features from CLI options 2023-06-26 16:29:43 +02:00
Louis Dureuil
eef9293630
New route to set some experimental features 2023-06-26 16:29:43 +02:00
Louis Dureuil
dac77dfd14
Add new permissions for experimental-features route 2023-06-26 16:29:43 +02:00
Louis Dureuil
072d81843f
Persistently save to DB the status of experimental features 2023-06-26 16:29:43 +02:00
Louis Dureuil
29ec02d4d4
Add meilisearch_types::features module 2023-06-26 16:09:03 +02:00
meili-bors[bot]
2d34005965
Merge #3821
3821: Add normalized and detailed scores to documents returned by a query r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #3771 

## What does this PR do?

### User standpoint

<details>
<summary>Request ranking score</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScore": true, 
  "limit": 10,
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>


<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScore": 0.947520325203252
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScore": 0.6657594086021505
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScore": 0.6654905913978495
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Angel and the Badman",
      "_rankingScore": 0.2196969696969697
    },
    {
      "title": "Batman",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Begins",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Returns",
      "_rankingScore": 0.11553030303030302
    },
    {
      "title": "Batman Forever",
      "_rankingScore": 0.11553030303030302
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 12,
  "limit": 10,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>



- When the `showRankingScore` parameter is added to the search query, documents returned by the search contain an additional `_rankingScore` field: a float greater than 0 and less than or equal to 1.0. This field represents the relevancy of the document relative to the search query and the settings of the index, with 1.0 meaning "perfect match" and 0 meaning "not matching the query" (Meilisearch should never return documents that do not match the query at all).
  - The `sort` and `geosort` ranking rules do not influence the `_rankingScore`.

<details>
<summary>Request detailed ranking scores</summary>

```
echo '{ 
  "q": "Badman dark knight returns",
  "showRankingScoreDetails": true, 
  "limit": 5, 
  "attributesToRetrieve": ["title"]
}' | mieli search -i index-word-count-10-count
```

</details>

<details>
<summary>Response</summary>

```json
{
  "hits": [
    {
      "title": "Batman: The Dark Knight Returns, Part 1",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman: The Dark Knight Returns, Part 2",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 4,
          "maxMatchingWords": 4,
          "score": 1.0
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 4,
          "score": 0.8
        },
        "proximity": {
          "order": 2,
          "score": 0.9545454545454546
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.926829268292683,
          "score": 0.926829268292683
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.26666666666666666
        }
      }
    },
    {
      "title": "Batman Unmasked: The Psychology of the Dark Knight",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8064516129032258,
          "score": 0.8064516129032258
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Legends of the Dark Knight: The History of Batman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 3,
          "maxMatchingWords": 4,
          "score": 0.75
        },
        "typo": {
          "order": 1,
          "typoCount": 1,
          "maxTypoCount": 3,
          "score": 0.75
        },
        "proximity": {
          "order": 2,
          "score": 0.6666666666666666
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.7419354838709677,
          "score": 0.7419354838709677
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.25
        }
      }
    },
    {
      "title": "Angel and the Badman",
      "_rankingScoreDetails": {
        "words": {
          "order": 0,
          "matchingWords": 1,
          "maxMatchingWords": 4,
          "score": 0.25
        },
        "typo": {
          "order": 1,
          "typoCount": 0,
          "maxTypoCount": 1,
          "score": 1.0
        },
        "proximity": {
          "order": 2,
          "score": 1.0
        },
        "attribute": {
          "order": 3,
          "attributes_ranking_order": 1.0,
          "attributes_query_word_order": 0.8181818181818182,
          "score": 0.8181818181818182
        },
        "exactness": {
          "order": 4,
          "matchType": "noExactMatch",
          "score": 0.3333333333333333
        }
      }
    }
  ],
  "query": "Badman dark knight returns",
  "processingTimeMs": 9,
  "limit": 5,
  "offset": 0,
  "estimatedTotalHits": 46
}
```

</details>

- When the `showRankingScoreDetails` parameter is added to the search query, the returned documents contain an additional `_rankingScoreDetails` field: a JSON object containing one field per ranking rule that was applied, whose value is a JSON object with the following fields:
  - `order`: a number indicating the order in which this rule was applied (0 is the first applied ranking rule)
  - `score` (except for `sort` and `geosort`): a float indicating how the document matched this particular rule.
  - other fields that are specific to the rule, indicating for example how many words matched for a document and how many typos were counted in a matching document.
- If the `displayableAttributes` list is defined in the settings of the index, any ranking rule using an attribute **not** part of that list will be marked as `<hidden-rule>` in the `_rankingScoreDetails`.  

- Search queries that are part of a `multi-search` request are modified in the same way, and each of the queries can take the `showRankingScore` and `showRankingScoreDetails` parameters independently. The results are still returned in separate lists; providing a unified list of results between multiple queries is not in the scope of this PR (but it is unblocked by this PR and can be done manually by using the scores of the various documents).
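
As a rough client-side illustration of the two points above, the following Python sketch reads the per-rule scores of a hit in application order and merges the hits of several responses by `_rankingScore`. The helper names `rule_scores` and `merge_by_ranking_score` are hypothetical; only the `hits`, `_rankingScore`, `_rankingScoreDetails`, `order` and `score` fields are taken from the responses shown in this PR.

```python
from typing import Any


def rule_scores(hit: dict[str, Any]) -> list[tuple[str, float]]:
    """List (rule name, score) pairs of one hit, in the order the rules were applied."""
    details = hit.get("_rankingScoreDetails", {})
    ordered = sorted(details.items(), key=lambda item: item[1]["order"])
    # Rules without a `score` field (`sort` and `geosort`) are skipped here.
    return [(name, rule["score"]) for name, rule in ordered if "score" in rule]


def merge_by_ranking_score(*responses: dict[str, Any]) -> list[dict[str, Any]]:
    """Merge the hits of several search responses into one list, best score first."""
    hits = [hit for response in responses for hit in response["hits"]]
    # sorted() is stable: hits with equal scores keep their original relative order.
    return sorted(hits, key=lambda hit: hit["_rankingScore"], reverse=True)
```

`merge_by_ranking_score` expects every query to have been sent with `showRankingScore` set, since it reads `_rankingScore` from each hit. Whether scores from different queries are directly comparable for a given use case is left to the caller.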

### Implementation standpoint

- Fix a difference in how the positions of terms were computed at indexing time and query time: this difference meant that a query containing a hard separator would fail the exactness check.
- Fix the id reported by the sort ranking rule (very minor)
- Change how the cost of removing words is computed. After this change, the cost no longer works for any ranking rule other than `words`. Also make `words` have a cost of 0, so that the entire cost of `words` is given by the termRemovalStrategy. With the new cost computation, the score is consistent with the number of words in the query. Additionally, the words that appear in phrases in the query are also counted as matching words.
- When any score computation is requested through `showRankingScore` or `showRankingScoreDetails`, remove the optimization where ranking rules are not executed on buckets of a single document: this is important to allow the computation of an accurate score.
- Add virtual conditions to fid and position so that they always have the max cost: this ensures that the score is independent of the dataset
- The Position ranking rule now takes into account the distance to the position of the word in the query instead of the distance to position 0.
- Modify the proximity ranking rule's cost calculation so that the cost is 0 for documents that perfectly match the query
- Add a new `milli::score_details` module containing all the types that are involved in score computation.
- Make a bucket of results contain a `ScoreDetails`, and change the ranking rules to produce their `ScoreDetails`.
- Expose the scores in the REST API.
- Add very light analytics for scoring.
- Update the search tests to add the expected scores.
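
The PR does not spell out the scoring formulas. As an inference from the sample responses above (not a description of the `milli` implementation), the `words` entries are consistent with a score equal to `matchingWords / maxMatchingWords`. The Python sketch below checks that assumption against the sample values; the function name `words_score` is hypothetical.

```python
def words_score(matching_words: int, max_matching_words: int) -> float:
    """Fraction of the query words matched by a document (assumed formula)."""
    if max_matching_words <= 0:
        raise ValueError("max_matching_words must be positive")
    return matching_words / max_matching_words


# (matchingWords, maxMatchingWords, score) triples copied from the sample response.
samples = [(4, 4, 1.0), (3, 4, 0.75), (1, 4, 0.25)]
for matching, maximum, expected in samples:
    assert words_score(matching, maximum) == expected
```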

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-06-26 09:32:43 +00:00
meili-bors[bot]
040b5a5b6f
Merge #3842
3842: fix some typos r=dureuill a=cuishuang

# Pull Request

## Related issue
Fixes #<issue_number>

## What does this PR do?
- fix some typos

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: cui fliter <imcusg@gmail.com>
2023-06-22 18:01:10 +00:00
cui fliter
530a3e2df3
fix some typos
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-06-22 21:59:00 +08:00
Louis Dureuil
11d32ad192
Add very light analytics for scoring 2023-06-22 12:39:14 +02:00
Louis Dureuil
d26e9a96ec
Add score details to new search tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
49c8bc4de6
Fix tests 2023-06-22 12:39:14 +02:00
Louis Dureuil
da833eb095
Expose the scores and detailed scores in the API 2023-06-22 12:39:14 +02:00
Louis Dureuil
701d44bd91
Store the scores for each bucket
Remove optimization where ranking rules are not executed on buckets of a single document
when the score needs to be computed
2023-06-22 12:39:14 +02:00
Louis Dureuil
c621a250a7
Score for graph based ranking rules
Count phrases in matchingWords and maxMatchingWords
2023-06-22 12:39:14 +02:00