Commit Graph

132 Commits

Author SHA1 Message Date
Loïc Lecrenier
3794962330 Use an unstable algorithm for grenad::Sorter when possible 2022-09-13 14:49:53 +02:00
Kerollmops
c83c3cd796
Add a test to make sure that long words are correctly skipped 2022-09-07 14:12:36 +02:00
ManyTheFish
5391e3842c replace optional_words by term_matching_strategy 2022-08-22 17:47:19 +02:00
ManyTheFish
9640976c79 Rename TermMatchingPolicies 2022-08-18 17:36:08 +02:00
Irevoire
e96b852107
bump heed 2022-08-17 17:05:50 +02:00
ManyTheFish
e9e2349ce6 Fix typo in comment 2022-08-17 15:09:48 +02:00
ManyTheFish
2668f841d1 Fix update indexing 2022-08-17 15:03:37 +02:00
ManyTheFish
7384650d85 Update test to showcase the bug 2022-08-17 15:03:08 +02:00
Loïc Lecrenier
58cb1c1bda Simplify unit tests in facet/filter.rs 2022-08-04 12:03:44 +02:00
Loïc Lecrenier
acff17fb88 Simplify indexing tests 2022-08-04 12:03:13 +02:00
bors[bot]
21284cf235
Merge #556
556: Add EXISTS filter r=loiclec a=loiclec

## What does this PR do?

Fixes issue [#2484](https://github.com/meilisearch/meilisearch/issues/2484) in the meilisearch repo.

It creates a `field EXISTS` filter which selects all documents containing the `field` key. 
For example, with the following documents:
```json
[{
	"id": 0,
	"colour": []
},
{
	"id": 1,
	"colour": ["blue", "green"]
},
{
	"id": 2,
	"colour": 145238
},
{
	"id": 3,
	"colour": null
},
{
	"id": 4,
	"colour": {
		"green": []
	}
},
{
	"id": 5,
	"colour": {}
},
{
	"id": 6
}]
```
Then the filter `colour EXISTS` selects the ids `[0, 1, 2, 3, 4, 5]`. The filter `colour NOT EXISTS` selects `[6]`.
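
The selection rule above can be modelled in a few lines of Python. This is only a sketch of the filter's semantics on top-level keys, not milli's implementation; the helper names `exists` and `not_exists` are hypothetical.

```python
import json

# The seven example documents from the PR description.
DOCUMENTS = json.loads("""
[
  {"id": 0, "colour": []},
  {"id": 1, "colour": ["blue", "green"]},
  {"id": 2, "colour": 145238},
  {"id": 3, "colour": null},
  {"id": 4, "colour": {"green": []}},
  {"id": 5, "colour": {}},
  {"id": 6}
]
""")


def exists(documents, field):
    """Ids of the documents that contain `field` as a key, whatever its value."""
    return [doc["id"] for doc in documents if field in doc]


def not_exists(documents, field):
    """Ids of the documents that do not contain `field` as a key."""
    return [doc["id"] for doc in documents if field not in doc]


print(exists(DOCUMENTS, "colour"))      # [0, 1, 2, 3, 4, 5]
print(not_exists(DOCUMENTS, "colour"))  # [6]
```

Note that `null`, `[]` and `{}` all count as "existing": only the presence of the key matters, not its value.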

## Details
There is a new database named `facet-id-exists-docids`. Its keys are field ids and its values are bitmaps of all the document ids where the corresponding field exists.
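
As a rough mental model of that database, assuming a plain Python `dict` in place of LMDB and a `set` in place of each bitmap (the function name `build_exists_docids` and the `field_ids` mapping are hypothetical, and only top-level keys are considered):

```python
from collections import defaultdict


def build_exists_docids(documents, field_ids):
    """Map each field id to the set of document ids whose document has that field."""
    exists_docids = defaultdict(set)
    for docid, document in documents.items():
        for field in document:
            exists_docids[field_ids[field]].add(docid)
    return dict(exists_docids)


# Hypothetical field-name -> field-id mapping and a few documents keyed by docid.
field_ids = {"id": 0, "colour": 1}
documents = {
    0: {"id": 0, "colour": []},
    3: {"id": 3, "colour": None},
    6: {"id": 6},
}

exists_docids = build_exists_docids(documents, field_ids)

# `colour EXISTS` is a single lookup; `colour NOT EXISTS` is a set difference.
colour_exists = exists_docids[field_ids["colour"]]
colour_not_exists = set(documents) - colour_exists
print(colour_exists, colour_not_exists)
```

With this layout, answering the filter never requires reading the documents themselves, only the per-field entry.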

To create this database, the indexing part of milli had to be adapted. The implementation there is basically copy/pasted from the code handling the `facet-id-f64-docids` database, with appropriate modifications in place.

There was an issue involving the flattening of documents during (re)indexing. Previously, the following JSON:
```json
{
    "id": 0,
    "colour": [],
    "size": {}
}
```
would be flattened to:
```json
{
    "id": 0
}
```
prior to being given to the extraction pipeline.

This transformation would lose the information needed to populate the `facet-id-exists-docids` database. Therefore, I have also changed the implementation of the `flatten-serde-json` crate. Now, as it traverses the JSON, it keeps track of which keys were encountered. Then, at the end, if a previously encountered key is not present in the flattened object, it adds that key to the object with an empty array as its value. For example:
```json
{
    "id": 0,
    "colour": {
        "green": [],
        "blue": 1
    },
    "size": {}
} 
```
becomes
```json
{
    "id": 0,
    "colour": [],
    "colour.green": [],
    "colour.blue": 1,
    "size": []
} 
```
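
The key-tracking idea can be sketched in Python as follows. This is a simplified model written for illustration, not the code of `flatten-serde-json`: the function name `flatten` and the `separator` parameter are hypothetical, and non-empty arrays are assumed to be kept as-is rather than flattened further.

```python
def flatten(document, separator="."):
    """Flatten nested objects, keeping every encountered key in the result."""
    flattened = {}
    seen = []  # every key path met during the traversal

    def visit(path, value):
        seen.append(path)
        if isinstance(value, dict):
            # Objects produce no entry themselves; only their children do.
            for key, child in value.items():
                visit(f"{path}{separator}{key}", child)
        elif isinstance(value, list) and not value:
            # Empty arrays produce no entry during the traversal.
            pass
        else:
            flattened[path] = value

    for key, value in document.items():
        visit(key, value)

    # Final pass: any key that was seen but produced no entry gets an empty array.
    for path in seen:
        flattened.setdefault(path, [])
    return flattened


result = flatten({"id": 0, "colour": {"green": [], "blue": 1}, "size": {}})
print(result)
```

The result holds the same five entries as the flattened document above (key order may differ), and `flatten({"id": 0, "colour": [], "size": {}})` now keeps `colour` and `size` instead of dropping them.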


Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-08-04 09:46:06 +00:00
ManyTheFish
d6f9a60a32 fix: Remove whitespace trimming during document id validation
fix #592
2022-08-03 11:38:40 +02:00
Loïc Lecrenier
07003704a8 Merge branch 'filter/field-exist' 2022-07-21 14:51:41 +02:00
Loïc Lecrenier
1eb1e73bb3 Add integration tests for the EXISTS filter 2022-07-19 10:07:33 +02:00
Loïc Lecrenier
c17d616250 Refactor index_documents_check_exists_database tests 2022-07-19 10:07:33 +02:00
Loïc Lecrenier
453d593ce8 Add a database containing the docids where each field exists 2022-07-19 10:07:33 +02:00
Kerollmops
192793ee38
Add some tests to check for the nested documents ids 2022-07-12 15:14:07 +02:00
Kerollmops
dc61105554
Fix the nested document id fetching function 2022-07-12 15:14:06 +02:00
Kerollmops
2eec290424
Check the validity of the latitude and longitude numbers 2022-07-12 15:14:06 +02:00
Kerollmops
0bbcc7b180
Expose the DocumentId struct to be sure to inject the generated ids 2022-07-12 15:14:06 +02:00
Kerollmops
c8ebf0de47
Rename the validate function as an enriching function 2022-07-12 15:14:06 +02:00
Kerollmops
6a0a0ae94f
Make the Transform read from an EnrichedDocumentsBatchReader 2022-07-12 14:55:52 +02:00
Kerollmops
8ebf5eed0d
Make the nested primary key work 2022-07-12 14:55:52 +02:00
Kerollmops
19eb3b4708
Make sure that we do not accept floats as document ids 2022-07-12 14:55:52 +02:00
Kerollmops
2ceeb51c37
Support the auto-generated ids when validating documents 2022-07-12 14:55:51 +02:00
Kerollmops
399eec5c01
Fix the indexation tests 2022-07-12 14:55:51 +02:00
Kerollmops
0146175fe6
Introduce the validate_documents_batch function 2022-07-12 14:55:51 +02:00
Kerollmops
e8297ad27e
Fix the tests for the new DocumentsBatchBuilder/Reader 2022-07-12 14:52:56 +02:00
Tamo
eaf28b0628
Apply review suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-07-05 15:30:33 +02:00
Tamo
3b309f654a
Speed up the document deletion
When a document deletion occurs, instead of deleting the document we mark it as deleted
in the new “soft deleted” bitmap. It is then removed from the search and from all the other
endpoints.
2022-07-05 15:30:33 +02:00
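
The soft-deletion approach described in commit 3b309f654a can be pictured with a small Python model. It is a sketch under stated assumptions, not milli's code: the class `SoftDeleteIndex` and its methods are hypothetical, and a `set` stands in for the “soft deleted” bitmap.

```python
class SoftDeleteIndex:
    """Toy index where deleting a document only marks it as deleted."""

    def __init__(self):
        self.documents = {}        # internal docid -> document
        self.soft_deleted = set()  # stand-in for the "soft deleted" bitmap

    def add(self, docid, document):
        self.documents[docid] = document

    def delete(self, docid):
        # Cheap: no stored data is touched, the docid is only flagged.
        if docid in self.documents:
            self.soft_deleted.add(docid)

    def search(self, predicate):
        # Flagged documents are filtered out of every result.
        return [
            docid
            for docid in sorted(self.documents)
            if docid not in self.soft_deleted and predicate(self.documents[docid])
        ]


index = SoftDeleteIndex()
for docid, title in enumerate(["alpha", "beta", "gamma"]):
    index.add(docid, {"title": title})
index.delete(1)
hits = index.search(lambda doc: True)
print(hits)  # [0, 2]
```

The deleted document is still stored (`1 in index.documents` holds), which is what makes the deletion itself fast; physically removing it is left to a later step that this sketch does not model.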
ad hoc
31776fdc3f
add failing test 2022-06-07 15:49:33 +02:00
bors[bot]
08c6d50cd1
Merge #531
531: fix the mixed dataset geosearch indexing bug r=Kerollmops a=irevoire

port #529 to main

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-16 16:06:36 +00:00
Tamo
0af399a6d7
fix the mixed dataset geosearch indexing bug 2022-05-16 17:37:45 +02:00
Tamo
f586028f9a
fix the searchable fields bug when a field is nested
Update milli/src/index.rs

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-05-16 17:24:36 +02:00
bors[bot]
65e6aa0de2
Merge #523
523: Improve geosearch error messages r=irevoire a=irevoire

Improve the geosearch error messages (#488).
And try to parse the string as specified in https://github.com/meilisearch/meilisearch/issues/2354

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-04 13:36:11 +00:00
Kerollmops
211c8763b9
Make sure that we do not generate too long keys 2022-05-03 10:03:15 +02:00
Kerollmops
7e47031bdc
Add a test for long keys in LMDB 2022-05-03 10:03:13 +02:00
Tamo
3cb1f6d0a1
improve geosearch error messages 2022-05-02 19:20:47 +02:00
Tamo
f19d2dc548
Only flatten the required fields
apply review comments

Co-authored-by: Kerollmops <kero@meilisearch.com>
2022-04-26 12:33:46 +02:00
Clément Renault
eb5830aa40
Add a test to make sure that long words are handled 2022-04-21 13:45:28 +02:00
Tamo
ee64f4a936
Use smartstring to store the external id in our hashmap
We need to store all the external ids (primary keys) in a hashmap
associated with their internal ids during.
The smartstring removes heap allocations / memory usage and should
improve the cache locality.
2022-04-13 21:22:07 +02:00
Irevoire
4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
ad hoc
e8f06f6c06
extract exact_word_prefix_docids 2022-04-04 20:54:03 +02:00
ad hoc
ba0bb29cd8
refactor WordPrefixDocids to take dbs instead of indexes 2022-04-04 20:54:02 +02:00
ad hoc
c4c6e35352
query exact_word_docids in resolve_query_tree 2022-04-04 20:54:02 +02:00
ad hoc
8d46a5b0b5
extract exact word docids 2022-04-04 20:54:02 +02:00
ad hoc
0a77be4ec0
introduce exact_word_docids db 2022-04-04 20:54:02 +02:00
Kerollmops
d5b8b5a2f8
Replace the ugly unwraps by clean if let Somes 2022-02-28 16:31:33 +01:00
Kerollmops
8d26f3040c
Remove a useless grenad file merging 2022-02-28 16:31:33 +01:00
bors[bot]
25123af3b8
Merge #436
436: Speed up the word prefix databases computation time r=Kerollmops a=Kerollmops

This PR depends on the fixes done in #431 and must be merged after it.

In this PR we will bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids`, and `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs far fewer CPU cycles, an era where this update structure can use the previously computed set of new word docids from the newly indexed batch of documents.

---

The `WordPrefixPairProximityDocids` is an update structure, which means that it is an object that we feed with some parameters and which modifies the LMDB database of an index when asked to. This structure specifically computes the list of word prefix pair proximities, which corresponds to a list of pairs of words associated with a proximity (the distance between both words) where the second word is not a word but a prefix, e.g. `s`, `se`, `a`. Each word prefix pair proximity is associated with the list of document ids that contain the word and the prefix at the given proximity.
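
To make the shape of this data concrete, here is a minimal Python sketch that derives word prefix pair proximities from word pair proximities. It is an illustration only: the input mapping, the function name and the use of `set` for docid lists are assumptions, and it recomputes everything from its inputs rather than updating incrementally.

```python
from collections import defaultdict


def word_prefix_pair_proximity_docids(word_pair_proximity_docids, prefixes):
    """Group (word1, word2, proximity) entries by the prefixes of word2."""
    result = defaultdict(set)
    for (word1, word2, proximity), docids in word_pair_proximity_docids.items():
        for prefix in prefixes:
            if word2.startswith(prefix):
                # Union the docids of every word2 that starts with this prefix.
                result[(word1, prefix, proximity)] |= docids
    return dict(result)


# Hypothetical input: (word1, word2, proximity) -> ids of the matching documents.
pairs = {
    ("the", "search", 1): {1, 2},
    ("the", "sea", 1): {3},
    ("the", "engine", 2): {1},
}

prefix_pairs = word_prefix_pair_proximity_docids(pairs, ["s", "se", "a"])
print(prefix_pairs)
```

Here `("the", "s", 1)` and `("the", "se", 1)` both map to documents 1, 2 and 3, while the prefix `a` matches no second word and produces no entry.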

The performance issue that this struct brings comes from the fact that it starts its job from the beginning: it clears the LMDB database before rewriting everything from scratch, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-16 15:41:14 +00:00