Commit Graph

559 Commits

Author SHA1 Message Date
Clément Renault
4571e512d2
Store the vectors in an HNSW in LMDB 2023-06-27 12:32:38 +02:00
Clément Renault
7ac2f1489d
Extract the vectors from the documents 2023-06-27 12:32:37 +02:00
Clément Renault
34349faeae
Create a new _vector extractor 2023-06-27 12:32:37 +02:00
meili-bors[bot]
040b5a5b6f
Merge #3842
3842: fix some typos r=dureuill a=cuishuang

# Pull Request

## Related issue
Fixes #<issue_number>

## What does this PR do?
- fix some typos

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: cui fliter <imcusg@gmail.com>
2023-06-22 18:01:10 +00:00
cui fliter
530a3e2df3 fix some typos
Signed-off-by: cui fliter <imcusg@gmail.com>
2023-06-22 21:59:00 +08:00
meili-bors[bot]
45636d315c
Merge #3670
3670: Fix addition deletion bug r=irevoire a=irevoire

The first commit of this PR is a revert of https://github.com/meilisearch/meilisearch/pull/3667. It re-enable the auto-batching of addition and deletion of tasks. No new changes have been introduced outside of `milli`. So all the changes you see on the autobatcher have actually already been reviewed.

It fixes https://github.com/meilisearch/meilisearch/issues/3440.

### What was happening?

The issue was that the `external_documents_ids` generated in the `transform` were used in a very strange way that wasn’t compatible with the deletion of documents.
Instead of doing a clear merge between the external document IDs of the DB and the one returned by the transform + writing it on disk, we were doing some weird tricks with the soft-deleted to avoid writing the fst on disk as much as possible.
The new algorithm may be a bit slower but is way more straightforward and doesn’t change depending on if the soft deletion was used or not. Here is a list of the changes introduced:
1. We now do a clear distinction between the `new_external_documents_ids` coming from the transform and only held on RAM and the `external_documents_ids` coming from the DB.
2. The `new_external_documents_ids` (coming out of the transform) are now represented as an `fst`. We don't need to struggle with the hard, soft distinction + the soft_deleted => That's easier to understand
3. When indexing documents, we merge the `external_documents_ids` coming from the DB and the `new_external_documents_ids` coming from the transform.

### Other things introduced in this  PR

Since we constantly have to write small, very specialized fuzzers for this kind of bug, we decided to push the one used to reproduce this bug.
It's not perfect, but it's easy to improve in the future.
It'll also run for as long as possible on every merge on the main branch.

Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Loïc Lecrenier <loic.lecrenier@icloud.com>
2023-06-19 09:09:30 +00:00
Louis Dureuil
9f37b61666
DB BREAKING: raise limit of word count from 10 to 30. 2023-06-08 12:07:12 +02:00
Louis Dureuil
c15c076da9
DB BREAKING: Count the number of words in field_id_word_count_docids 2023-06-08 12:07:11 +02:00
Loïc Lecrenier
8628a0c856 Remove docid_word_positions_db + fix deletion bug
That would happen when a word was deleted from all exact attributes
but not all regular attributes.
2023-06-07 10:52:50 +02:00
Tamo
602ad98cb8 improve the way we handle the fsts 2023-05-22 11:15:14 +02:00
Tamo
7f619ff0e4 get rids of the now unused soft_deletion_used parameter 2023-05-22 10:33:49 +02:00
Tamo
4391cba6ca
fix the addition + deletion bug 2023-05-17 18:28:57 +02:00
Kerollmops
c4a40e7110
Use the writemap flag to reduce the memory usage 2023-05-15 10:15:33 +02:00
Jakub Jirutka
13f1277637 Allow to disable specialized tokenizations (again)
In PR #2773, I added the `chinese`, `hebrew`, `japanese` and `thai`
feature flags to allow melisearch to be built without huge specialed
tokenizations that took up 90% of the melisearch binary size.
Unfortunately, due to some recent changes, this doesn't work anymore.
The problem lies in excessive use of the `default` feature flag, which
infects the dependency graph.

Instead of adding `default-features = false` here and there, it's easier
and more future-proof to not declare `default` in `milli` and
`meilisearch-types`. I've renamed it to `all-tokenizers`, which also
makes it a bit clearer what it's about.
2023-05-04 15:45:40 +02:00
Louis Dureuil
90bc230820
Merge remote-tracking branch 'origin/main' into search-refactor
Conflicts | resolution
----------|-----------
Cargo.lock | added mimalloc
Cargo.toml |  took origin/main version
milli/src/search/criteria/exactness.rs | deleted after checking it was only clippy changes
milli/src/search/query_tree.rs | deleted after checking it was only clippy changes
2023-05-03 12:19:06 +02:00
Loïc Lecrenier
93188b3c88 Fix indexing of word_prefix_fid_docids 2023-04-29 10:56:48 +02:00
bors[bot]
414b3fae89
Merge #3571
3571: Introduce two filters to select documents with `null` and empty fields r=irevoire a=Kerollmops

# Pull Request

## Related issue
This PR implements the `X IS NULL`, `X IS NOT NULL`, `X IS EMPTY`, `X IS NOT EMPTY` filters that [this comment](https://github.com/meilisearch/product/discussions/539#discussioncomment-5115884) is describing in a very detailed manner.

## What does this PR do?

### `IS NULL` and `IS NOT NULL`

This PR will be exposed as a prototype for now. Below is the copy/pasted version of a spec that defines this filter.

- `IS NULL` matches fields that `EXISTS` AND `= IS NULL`
- `IS NOT NULL` matches fields that `NOT EXISTS` OR `!= IS NULL`

1. `{"name": "A", "price": null}`
2. `{"name": "A", "price": 10}`
3. `{"name": "A"}`

`price IS NULL` would match 1
`price IS NOT NULL` or `NOT price IS NULL` would match 2,3
`price EXISTS` would match 1, 2
`price NOT EXISTS` or `NOT price EXISTS` would match 3

common query : `(price EXISTS) AND (price IS NOT NULL)` would match 2

### `IS EMPTY` and `IS NOT EMPTY`

- `IS EMPTY` matches Array `[]`, Object `{}`, or String `""` fields that `EXISTS` and are empty
- `IS NOT EMPTY` matches fields that `NOT EXISTS` OR are not empty.

1. `{"name": "A", "tags": null}`
2. `{"name": "A", "tags": [null]}`
3. `{"name": "A", "tags": []}`
4. `{"name": "A", "tags": ["hello","world"]}`
5. `{"name": "A", "tags": [""]}`
6. `{"name": "A"}`
7. `{"name": "A", "tags": {}}`
8. `{"name": "A", "tags": {"t1":"v1"}}`
9. `{"name": "A", "tags": {"t1":""}}`
10. `{"name": "A", "tags": ""}`

`tags IS EMPTY` would match 3,7,10
`tags IS NOT EMPTY` or `NOT tags IS EMPTY` would match 1,2,4,5,6,8,9
`tags IS NULL` would match 1
`tags IS NOT NULL` or `NOT tags IS NULL` would match 2,3,4,5,6,7,8,9,10
`tags EXISTS` would match 1,2,3,4,5,7,8,9,10
`tags NOT EXISTS` or `NOT tags EXISTS` would match 6

common query : `(tags EXISTS) AND (tags IS NOT NULL) AND (tags IS NOT EMPTY)` would match 2,4,5,8,9

## What should the reviewer do?

- Check that I tested the filters
- Check that I deleted the ids of the documents when deleting documents


Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2023-04-27 13:14:00 +00:00
Clément Renault
cfd1b2cc97
Fix the clippy warnings 2023-04-25 16:40:32 +02:00
Loïc Lecrenier
d1fdbb63da Make all search tests pass, fix distinctAttribute bug 2023-04-24 12:12:08 +02:00
Loïc Lecrenier
84d9c731f8 Fix bug in encoding of word_position_docids and word_fid_docids 2023-04-24 09:59:30 +02:00
Loïc Lecrenier
8cb85294ef Remove unused import warning 2023-04-07 11:09:30 +02:00
Loïc Lecrenier
540a396e49 Fix indexing bug in words_prefix_position 2023-04-07 11:08:39 +02:00
Loïc Lecrenier
a81165f0d8 Merge remote-tracking branch 'origin/main' into search-refactor 2023-04-07 10:15:55 +02:00
Loïc Lecrenier
130d2061bd Fix indexing of word_position_docid and fid 2023-04-06 17:50:39 +02:00
Louis Dureuil
66ddee4390 Fix word_position_docids indexing 2023-04-06 17:50:39 +02:00
Louis Dureuil
e58426109a Fix panics and issues in exactness graph ranking rule 2023-04-06 17:50:39 +02:00
Louis Dureuil
996619b22a Increase position by 8 on hard separator when building query terms 2023-04-06 17:50:39 +02:00
Tamo
597d57bf1d Merge branch 'main' into bring-back-changes-v1.1.0 2023-04-05 11:32:14 +02:00
ManyTheFish
efea1e5837 Fix facet normalization 2023-03-29 12:02:24 +02:00
Gregory Conrad
e7994cdeb3 feat: check to see if the PK changed before erroring out
Previously, if the primary key was set and a Settings update contained
a primary key, an error would be returned.
However, this error is not needed if the new PK == the current PK.
This commit just checks to see if the PK actually changes
before raising an error.
2023-03-26 12:18:39 -04:00
Loïc Lecrenier
d18ebe4f3a Remove more warnings 2023-03-23 09:41:18 +01:00
Loïc Lecrenier
9b2653427d Split position DB into fid and relative position DB 2023-03-23 09:22:01 +01:00
Clément Renault
1a9c58a7ab
Fix a bug with the new flattening rules 2023-03-15 16:56:44 +01:00
Clément Renault
64571c8288
Improve the testing of the filters 2023-03-15 14:57:17 +01:00
Clément Renault
ea016d97af
Implementing an IS EMPTY filter 2023-03-15 14:12:34 +01:00
ManyTheFish
2f8eb4f54a last PR fixes 2023-03-09 15:34:36 +01:00
Clément Renault
df48ac8803
Add one more test for the NULL operator 2023-03-09 13:53:37 +01:00
Clément Renault
0ad53784e7
Create a new struct to reduce the type complexity 2023-03-09 13:21:21 +01:00
Clément Renault
e064c52544
Rename an internal facet deletion method 2023-03-09 13:08:02 +01:00
Clément Renault
e106b16148
Fix a typo in a variable
Co-authored-by: Louis Dureuil <louis@meilisearch.com>

aaa
2023-03-09 13:08:02 +01:00
ManyTheFish
5deea631ea fix clippy too many arguments 2023-03-09 11:19:13 +01:00
ManyTheFish
b4b859ec8c Fix typos 2023-03-09 10:58:35 +01:00
Clément Renault
7dc04747fd
Make clippy happy 2023-03-08 17:37:08 +01:00
Clément Renault
43ff236df8
Write the NULL facet values in the database 2023-03-08 16:49:53 +01:00
Clément Renault
19ab4d1a15
Classify the NULL fields values in the facet extractor 2023-03-08 16:49:31 +01:00
Clément Renault
9287858997
Introduce a new facet_id_is_null_docids database in the index 2023-03-08 16:14:00 +01:00
ManyTheFish
24c0775c67 Change indexing threshold 2023-03-08 12:36:04 +01:00
ManyTheFish
3092cf0448 Fix clippy errors 2023-03-08 10:53:42 +01:00
ManyTheFish
da48506f15 Rerun extraction when language detection might have failed 2023-03-07 18:35:26 +01:00
Louis Dureuil
5822764be9
Skip computing index budget in tests 2023-02-23 11:23:39 +01:00