4664: Update README.md r=curquiza a=tpayet
Add hybrid & semantic as a feature
Co-authored-by: Thomas Payet <thomas@meilisearch.com>
4663: Bring back release v1.8.1 into main r=ManyTheFish a=ManyTheFish
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: ManyTheFish <ManyTheFish@users.noreply.github.com>
Co-authored-by: Many the fish <many@meilisearch.com>
4657: Update version for the next release (v1.9.0) in Cargo.toml r=curquiza a=meili-bot
⚠️ This PR is automatically generated. Check the new version is the expected one and Cargo.lock has been updated before merging.
Co-authored-by: curquiza <curquiza@users.noreply.github.com>
4655: Remove `exportPuffinReport` experimental feature r=Kerollmops a=Kerollmops
This PR fixes #4605 by removing every trace of Puffin. Puffin is a great tool, but we use a better approach for measuring performance.
Co-authored-by: Clément Renault <clement@meilisearch.com>
4646: Reduce `Transform`'s disk usage r=Kerollmops a=Kerollmops
This PR implements what is described in #4485. It reduces the number of disk writes and disk usage.
Co-authored-by: Clément Renault <clement@meilisearch.com>
4633: Allow to mark vectors as "userProvided" r=Kerollmops a=dureuill
# Pull Request
## Related issue
Fixes #4606
## What does this PR do?
[See usage in PRD](https://meilisearch.notion.site/v1-9-AI-search-changes-e90d6803eca8417aa70a1ac5d0225697#deb96fb0595947bda7d4a371100326eb)
- Extends the shape of the special `_vectors` field in documents.
- Previously, the `_vectors` field had to be an object in which each key is the name of a configured embedder and each value is either `null`, an embedding (array of numbers), or an array of embeddings.
- In this PR, the value of an embedder in the `_vectors` field can additionally be an object. The object has two fields:
1. `embeddings`: `null`, an embedding (array of numbers), or an array of embeddings.
2. `userProvided`: a boolean indicating if the vector was provided by the user.
- The previous form `embedder_or_array_of_embedders` is semantically equivalent to:
```json
{
"embeddings": embedder_or_array_of_embedders,
"userProvided": true
}
```
- During the indexing step, the subfields and values of the `_vectors` field that have `userProvided` set to **false** are added to the vector DB, but not to the documents DB. This means that future modifications of the documents will trigger a regeneration of that particular vector using the document template.
- This allows **importing** embeddings as a one-shot process, while still retaining the ability to regenerate embeddings on document change.
- The dump process now uses this ability: it enriches the `_vectors` fields of documents with the embeddings that were autogenerated, marking them as not `userProvided`. This allows importing the vectors from a dump without regenerating them.
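
The rules above can be illustrated with a minimal Python sketch. It is not the Meilisearch implementation (which is written in Rust); the function names `normalize_embedder_value` and `split_for_storage` are hypothetical, and two behaviours are assumptions made only for this illustration: `null` embeddings are skipped when filling the vector store, and entries with `userProvided` set to true are kept in the document unchanged.

```python
from typing import Any, Dict, Tuple


def normalize_embedder_value(value: Any) -> Dict[str, Any]:
    """Return the explicit object form of one embedder entry in `_vectors`."""
    if isinstance(value, dict):
        # Object form introduced by this PR. Both keys are read as-is; how
        # a missing key is handled is outside the scope of this sketch.
        return {
            "embeddings": value["embeddings"],
            "userProvided": value["userProvided"],
        }
    # Legacy form: null, a single embedding, or an array of embeddings.
    # The PR description states this is equivalent to userProvided = true.
    return {"embeddings": value, "userProvided": True}


def split_for_storage(
    vectors: Dict[str, Any],
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split a `_vectors` object into (kept in document, sent to vector store).

    Hypothetical helper: entries marked as not user-provided go to the
    vector store only, so they can be regenerated when the document changes.
    """
    kept_in_document: Dict[str, Any] = {}
    vector_store: Dict[str, Any] = {}
    for embedder, value in vectors.items():
        entry = normalize_embedder_value(value)
        if entry["embeddings"] is not None:
            # Assumption: null embeddings contribute nothing to the store.
            vector_store[embedder] = entry["embeddings"]
        if entry["userProvided"]:
            # Assumption: user-provided entries stay in the document as written.
            kept_in_document[embedder] = value
    return kept_in_document, vector_store


example_vectors = {
    "default": [0.1, 0.2],  # legacy form, treated as user-provided
    "imported": {"embeddings": [[0.3, 0.4]], "userProvided": False},
}
kept, stored = split_for_storage(example_vectors)
print(kept)
print(stored)
```

Under these assumptions, the `imported` entry reaches the vector store only, which matches the one-shot import described above, while the `default` entry stays in the document.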
### Tests
This PR adds the following tests:
- Long-needed hybrid search tests of a simple hf embedder
- Dump test that imports vectors. Due to the difficulty of actually importing a dump in tests, we just read the dump and check it contains the expected content.
- Tests in the index-scheduler: these check that documents containing the same kind of instructions as in the dump are indexed as expected.
Co-authored-by: Louis Dureuil <louis@meilisearch.com>