Commit Graph

9382 Commits

Author SHA1 Message Date
meili-bors[bot] 7e1100bfa3
Merge #4633
4633: Allow to mark vectors as "userProvided" r=dureuill a=dureuill

# Pull Request

## Related issue
Fixes #4606 

## What does this PR do?

[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)

- Extends the shape of the special `_vectors` field in documents.
    - previously, the `_vectors` field had to be an object, with each field the name of a configured embedder, and each value either `null`, an embedding (array of numbers), or an array of embeddings.
    - In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
      1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
      2. `userProvided`: a boolean indicating if the vector was provided by the user.
    - The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
    ```json
    {
        "embeddings": embedder_or_array_of_embedders,
        "userProvided": true
    }
    ```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added in the vector DB, but not in the documents DB: that means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.

### Tests

This PR adds the following tests

- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: this tests that documents containing the same kind of instructions as in the dump indexes as expected


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-05-22 20:50:39 +00:00
Louis Dureuil 8a941c0241
Smaller review changes 2024-05-22 14:44:42 +02:00
Louis Dureuil 3412e7fbcf
"[]" is deserialized as 0 embedding rather than 1 embedding of dim 0 2024-05-22 12:25:21 +02:00
Louis Dureuil 16037e2169
Don't remove embedders that are not in the config from the document DB 2024-05-22 12:24:51 +02:00
Louis Dureuil 8f7c8ca7f0
Remove now unused error variant 2024-05-22 12:23:43 +02:00
Louis Dureuil eccbcf5130
Increase index-scheduler test timeouts 2024-05-21 14:59:08 +02:00
meili-bors[bot] abe29772db
Merge #4644
4644: Revert "Stream documents" and keep heed+arroy to the latest verion r=Kerollmops a=irevoire

Reverts meilisearch/meilisearch#4544

Fixes https://github.com/meilisearch/meilisearch/issues/4641

I didn’t realize that some http clients were not handling chunked http requests like you would expect (if you ask the body, it gives you the body), which made the previous PR breaking.

There is no way to provide a good fix to the issue we initially wanted to fix without breaking meilisearch and that’s not planned for now.

Co-authored-by: Tamo <irevoire@protonmail.ch>
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-05-21 10:21:47 +00:00
Tamo c9ac7f2e7e update heed to latest version 2024-05-20 15:19:00 +02:00
Tamo 7e251b43d4
Revert "Stream documents" 2024-05-20 15:09:45 +02:00
Louis Dureuil 9969f7a638
Add test on index-scheduler 2024-05-20 14:44:10 +02:00
Louis Dureuil b17cb56dee
Test array of vectors 2024-05-20 14:44:10 +02:00
Louis Dureuil afcd7b9f0c
Test hybrid search with hf embedder 2024-05-20 14:44:10 +02:00
Louis Dureuil 30cf972987
Add test with a dump 2024-05-20 10:36:18 +02:00
Louis Dureuil d05d49ffd8
Fix tests 2024-05-20 10:36:18 +02:00
Louis Dureuil 0462ebbe58
Don't write an empty _vectors field 2024-05-20 10:36:18 +02:00
Louis Dureuil 2f7a8a4efb
Don't write vectors that weren't autogenerated in document DB 2024-05-20 10:36:18 +02:00
Louis Dureuil 02714ef5ed
Add vectors from vector DB in dump 2024-05-20 10:36:18 +02:00
Louis Dureuil 52d9cb6e5a
Refactor vector indexing
- use the parsed_vectors module
- only parse `_vectors` once per document, instead of once per embedder per document
2024-05-20 10:36:17 +02:00
Louis Dureuil 261de888b7
Add function to get the embeddings of a document in an index 2024-05-20 10:36:17 +02:00
Louis Dureuil 98c811247e
Add parsed vectors module 2024-05-20 10:25:59 +02:00
meili-bors[bot] 59ecf1cea7
Merge #4544
4544: Stream documents r=curquiza a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4383


### Perf
2M hackernews:

main:
Time to retrieve: 7s
RAM consumption: 2+GiB

stream:
Time to retrieve: 4.7s
RAM consumption: Too small

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-05-17 14:49:08 +00:00
Tamo 273c6e8c5c uses the latest version of heed to get rid of unsafe code 2024-05-16 18:31:32 +02:00
Tamo 897d25780e update milli to latest version 2024-05-16 18:31:32 +02:00
Tamo c85d1752dd keep the same rtxn to compute the filters on the documents and to stream the documents later on 2024-05-16 18:31:32 +02:00
Tamo 8e6ffbfc6f stream documents 2024-05-16 18:31:32 +02:00
meili-bors[bot] 7c19c072fa
Merge #4631
4631: Split the field id map from the weight of each fields r=Kerollmops a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4484

## What does this PR do?
- Make the (internal) searchable fields database always contain the searchable fields (instead of None when the user-defined searchable fields were not defined)
- Introduce a new « fieldids_weights_map » that does the mapping between a fieldId and its Weight
- Ensure that when two searchable fields are swapped, the field ID map doesn't change anymore (and thus, doesn't re-index)
- Uses the weight instead of the order of the searchable fields in the attribute ranking rule at search time
- When no searchable attributes are defined, make all their weights equal to zero
- When a field is declared as searchable and contains nested fields, all its subfields share the same weight

## Impact on relevancy

### When no searchable attributes are declared

When no searchable attributes are declared, all the fields have the same importance instead of randomly giving more importance to the field we've encountered « the most early » in the life of the index.

This means that before this PR, send the following json:
```json
[
  { "id": 0, "name": "kefir", "color": "white" },
  { "id": 1, "name": "white", "last name": "spirit" }
]
```

Would make the field `name` more important than the field `color` or `last name`.
This means that searching for `white` would make the document `1` automatically higher ranked than the document `0`.

After this PR, all the fields have the same weight, and none are considered more important than others.

### When a nested field is made searchable

The second behavior change that happened with this PR is in the case you're sending this document, for example:

```json
{
  "id": 0,
  "name": "tamo",
  "doggo": {
    "name": "kefir",
    "surname": "le kef"
  },
  "catto": "gromez"
}
```

Previously, defining the searchable attributes as: `["tamo", "doggo", "catto"]` was actually defining the « real » searchable attributes in the engine as: `["tamo", "doggo", "catto", "doggo.name", "doggo.surname"]`, which means that `doggo.name` and `doggo.surname` were _NOT_ where the user expected them and had completely different weights than `doggo`.
In this PR all the weights have been unified, and the « real » searchable fields look like this:
```json
[ "tamo", "doggo", "doggo.name", "doggo.surname", "catto"]
   ^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^
Weight 0                 Weight 1                  Weight 2

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-05-16 09:59:24 +00:00
Tamo 673b6e1dc0 fix a flaky test 2024-05-16 11:28:14 +02:00
Tamo f2d0a59f1d when no searchable attributes are defined, makes all the weight equals to zero 2024-05-16 01:06:33 +02:00
Tamo c78a2fa4f5 rename method and variable around the attributes to search on feature 2024-05-15 18:04:42 +02:00
Tamo 5542f1d9f1 get back to what we were doingb efore in the DB cache and with the restricted field id 2024-05-15 18:00:39 +02:00
Tamo ad4d8502b3 stops storing the whole fieldids weights map when no searchable are defined 2024-05-15 17:16:10 +02:00
Tamo 7ec4e2a3fb apply all style review comments 2024-05-15 15:02:26 +02:00
Tamo 9fffb8e83d make clippy happy 2024-05-14 17:36:32 +02:00
Tamo caa6a7149a make the attribute ranking rule use the weights and fix the tests 2024-05-14 17:36:32 +02:00
Tamo a0082c4df9 add a failing test on the attribute ranking rule 2024-05-14 17:00:02 +02:00
Tamo b0afe0972e stop updating the fields ids map when fields are only swapped 2024-05-14 17:00:02 +02:00
Tamo 9ecde41853 add a test on the current behaviour 2024-05-14 17:00:02 +02:00
Tamo 685f452fb2 Fix the indexing of the searchable 2024-05-14 17:00:02 +02:00
Tamo 4e4a1ddff7 gate a test behind the required feature 2024-05-14 17:00:02 +02:00
Tamo c22460045c Stops returning an option in the internal searchable fields 2024-05-14 17:00:02 +02:00
meili-bors[bot] 76bb6d565c
Merge #4624
4624: Add "precommands" to benchmark r=dureuill a=dureuill

# Pull Request

## Related issue
Helps for https://github.com/meilisearch/meilisearch/issues/4493

## What does this PR do?
- Add support for precommands for cargo xtask bench
- update benchmark docs
- update workload files


Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-05-13 08:27:56 +00:00
Louis Dureuil 9d3ff11b21
Modify existing workload files to use precommands 2024-05-07 14:03:14 +02:00
Louis Dureuil 43763eb98a
Document precommands 2024-05-07 12:26:22 +02:00
Louis Dureuil 2a0ece814c
Add precommands to workloads 2024-05-07 12:23:36 +02:00
meili-bors[bot] 95fcd17373
Merge #4622
4622: Bump Rustls to non-vulnerable versions r=Kerollmops a=Kerollmops

This PR Fixes #4599 by bumping the Rustls dependency to v0.21.12 and [ureq to v2.9.7](https://github.com/algesten/ureq/blob/main/CHANGELOG.md#297) (which bump rustls to v0.22.4).

Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-07 09:47:30 +00:00
Clément Renault ac4bc143c4
Bump ureq to v2.9.7 2024-05-07 10:39:38 +02:00
Clément Renault f33a1282f8
Bump Rustls to v0.21.12 2024-05-07 10:31:39 +02:00
meili-bors[bot] 4d5971f343
Merge #4621
4621: Bring back changes from v1.8.0 into main r=curquiza a=curquiza



Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-06 13:46:39 +00:00
meili-bors[bot] ecb5c506b3
Merge #4619
4619: Use http path pattern instead of full path in metrics r=irevoire a=gh2k

# Pull Request

## Related issue

Fixes #3983 

## What does this PR do?

- This records only the HTTP pattern in metrics instead of the full path

An alternative solution was proposed in #4145, but this doesn't really fix the root cause of the issue. The problem I'm experiencing at my end is that by using the full path, the number of labels is far too high to be useful. It is normal practice to use the path with variable placeholders, instead of the fully-expanded path.

The example given in the ticket was endpoints under `/tasks`, but this can also be a very significant problem under `/indexes/{index-uid}/documents`. e.g.:
<img width="1510" alt="Screenshot 2024-05-03 at 12 14 36" src="https://github.com/meilisearch/meilisearch/assets/6530014/1df2ec19-5f69-4164-90d2-f65c59f9b544">

This patch replaces the fully-expanded path with the matched pattern.

The linked PR also mentions paths under other routes, e.g. `/static`, but this feels like a separate concern and these can be stripped out at the Prometheus end by filters if they are unwanted. The most important thing is to make the paths usable so that we can still get stats on e.g. the number of document deletes we see.

## PR checklist

Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Simon Detheridge <s@sd.ai>
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-05-06 09:37:32 +00:00
Tamo 3698aef66b fix warning 2024-05-06 11:36:37 +02:00