Loïc Lecrenier
306593144d
Refactor word prefix pair proximity indexation
2022-08-17 11:59:00 +02:00
Loïc Lecrenier
58cb1c1bda
Simplify unit tests in facet/filter.rs
2022-08-04 12:03:44 +02:00
Loïc Lecrenier
acff17fb88
Simplify indexing tests
2022-08-04 12:03:13 +02:00
bors[bot]
21284cf235
Merge #556
...
556: Add EXISTS filter r=loiclec a=loiclec
## What does this PR do?
Fixes issue [#2484 ](https://github.com/meilisearch/meilisearch/issues/2484 ) in the meilisearch repo.
It creates a `field EXISTS` filter which selects all documents containing the `field` key.
For example, with the following documents:
```json
[{
"id": 0,
"colour": []
},
{
"id": 1,
"colour": ["blue", "green"]
},
{
"id": 2,
"colour": 145238
},
{
"id": 3,
"colour": null
},
{
"id": 4,
"colour": {
"green": []
}
},
{
"id": 5,
"colour": {}
},
{
"id": 6
}]
```
Then the filter `colour EXISTS` selects the ids `[0, 1, 2, 3, 4, 5]`. The filter `colour NOT EXISTS` selects `[6]`.
## Details
There is a new database named `facet-id-exists-docids`. Its keys are field ids and its values are bitmaps of all the document ids where the corresponding field exists.
To create this database, the indexing part of milli had to be adapted. The implementation there is basically copy/pasted from the code handling the `facet-id-f64-docids` database, with appropriate modifications in place.
There was an issue involving the flattening of documents during (re)indexing. Previously, the following JSON:
```json
{
"id": 0,
"colour": [],
"size": {}
}
```
would be flattened to:
```json
{
"id": 0
}
```
prior to being given to the extraction pipeline.
This transformation would lose the information that is needed to populate the `facet-id-exists-docids` database. Therefore, I have also changed the implementation of the `flatten-serde-json` crate. Now, as it traverses the Json, it keeps track of which key was encountered. Then, at the end, if a previously encountered key is not present in the flattened object, it adds that key to the object with an empty array as value. For example:
```json
{
"id": 0,
"colour": {
"green": [],
"blue": 1
},
"size": {}
}
```
becomes
```json
{
"id": 0,
"colour": [],
"colour.green": [],
"colour.blue": 1,
"size": []
}
```
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-08-04 09:46:06 +00:00
bors[bot]
50f6524ff2
Merge #579
...
579: Stop reindexing already indexed documents r=ManyTheFish a=irevoire
```
% ./compare.sh indexing_stop-reindexing-unchanged-documents_cb5a1669.json indexing_main_eeba1960.json
group indexing_main_eeba1960 indexing_stop-reindexing-unchanged-documents_cb5a1669
----- ---------------------- -----------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable- 1.03 2.0±0.22ms ? ?/sec 1.00 1955.4±336.24µs ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable- 1.08 11.0±2.93ms ? ?/sec 1.00 10.2±4.04ms ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested- 1.00 15.1±3.89ms ? ?/sec 1.14 17.1±5.18ms ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable- 1.26 59.2±12.01ms ? ?/sec 1.00 47.1±8.52ms ? ?/sec
indexing/-wiki-delete-searchable- 1.08 316.6±31.53ms ? ?/sec 1.00 293.6±17.00ms ? ?/sec
indexing/Indexing geo_point 1.01 60.9±0.31s ? ?/sec 1.00 60.6±0.36s ? ?/sec
indexing/Indexing movies in three batches 1.04 20.0±0.30s ? ?/sec 1.00 19.2±0.25s ? ?/sec
indexing/Indexing movies with default settings 1.02 19.1±0.18s ? ?/sec 1.00 18.7±0.24s ? ?/sec
indexing/Indexing nested movies with default settings 1.02 26.2±0.29s ? ?/sec 1.00 25.9±0.22s ? ?/sec
indexing/Indexing nested movies without any facets 1.02 25.3±0.32s ? ?/sec 1.00 24.7±0.26s ? ?/sec
indexing/Indexing songs in three batches with default settings 1.00 66.7±0.41s ? ?/sec 1.01 67.1±0.86s ? ?/sec
indexing/Indexing songs with default settings 1.00 58.3±0.90s ? ?/sec 1.01 58.8±1.32s ? ?/sec
indexing/Indexing songs without any facets 1.00 54.5±1.43s ? ?/sec 1.01 55.2±1.29s ? ?/sec
indexing/Indexing songs without faceted numbers 1.00 57.9±1.20s ? ?/sec 1.01 58.4±0.93s ? ?/sec
indexing/Indexing wiki 1.00 1052.0±10.95s ? ?/sec 1.02 1069.4±20.38s ? ?/sec
indexing/Indexing wiki in three batches 1.00 1193.1±8.83s ? ?/sec 1.00 1189.5±9.40s ? ?/sec
indexing/Reindexing geo_point 3.22 67.5±0.73s ? ?/sec 1.00 21.0±0.16s ? ?/sec
indexing/Reindexing movies with default settings 3.75 19.4±0.28s ? ?/sec 1.00 5.2±0.05s ? ?/sec
indexing/Reindexing songs with default settings 8.90 61.4±0.91s ? ?/sec 1.00 6.9±0.07s ? ?/sec
indexing/Reindexing wiki 1.00 1748.2±35.68s ? ?/sec 1.00 1750.5±18.53s ? ?/sec
```
tldr: We do not lose any performance on the normal indexing benchmark, but we get between 3 and 8 times faster on the reindexing benchmarks 👍
Co-authored-by: Tamo <tamo@meilisearch.com>
2022-08-04 08:10:37 +00:00
ManyTheFish
d6f9a60a32
fix: Remove whitespace trimming during document id validation
...
fix #592
2022-08-03 11:38:40 +02:00
Tamo
7fc35c5586
remove the useless prints
2022-08-02 10:31:22 +02:00
Tamo
f156d7dd3b
Stop reindexing already indexed documents
2022-08-02 10:31:20 +02:00
Loïc Lecrenier
07003704a8
Merge branch 'filter/field-exist'
2022-07-21 14:51:41 +02:00
Loïc Lecrenier
1506683705
Avoid using too much memory when indexing facet-exists-docids
2022-07-19 14:42:35 +02:00
Loïc Lecrenier
aed8c69bcb
Refactor indexation of the "facet-id-exists-docids" database
...
The idea is to directly create a sorted and merged list of bitmaps
in the form of a BTreeMap<FieldId, RoaringBitmap> instead of creating
a grenad::Reader where the keys are field_id and the values are docids.
Then we send that BTreeMap to the thing that handles TypedChunks, which
inserts its content into the database.
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
1eb1e73bb3
Add integration tests for the EXISTS filter
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
80b962b4f4
Run cargo fmt
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
c17d616250
Refactor index_documents_check_exists_database tests
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
30bd4db0fc
Simplify indexing task for facet_exists_docids database
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
392472f4bb
Apply suggestions from code review
...
Co-authored-by: Tamo <tamo@meilisearch.com>
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
453d593ce8
Add a database containing the docids where each field exists
2022-07-19 10:07:33 +02:00
Loïc Lecrenier
fc9f3f31e7
Change DocumentsBatchReader to access cursor and index at same time
...
Otherwise it is not possible to iterate over all documents while
using the fields index at the same time.
2022-07-18 16:08:14 +02:00
Loïc Lecrenier
ab1571cdec
Simplify Transform::read_documents, enabled by enriched documents reader
2022-07-18 12:45:47 +02:00
Kerollmops
448114cc1c
Fix the benchmarks with the new indexation API
2022-07-12 15:22:09 +02:00
Kerollmops
25e768f31c
Fix another issue with the nested primary key selector
2022-07-12 15:14:07 +02:00
Kerollmops
192793ee38
Add some tests to check for the nested documents ids
2022-07-12 15:14:07 +02:00
Kerollmops
dc61105554
Fix the nested document id fetching function
2022-07-12 15:14:06 +02:00
Kerollmops
2eec290424
Check the validity of the latitute and longitude numbers
2022-07-12 15:14:06 +02:00
Kerollmops
5d149d631f
Remove tests for a function that no more exists
2022-07-12 15:14:06 +02:00
Kerollmops
0bbcc7b180
Expose the DocumentId
struct to be sure to inject the generated ids
2022-07-12 15:14:06 +02:00
Kerollmops
d1a4da9812
Generate a real UUIDv4 when ids are auto-generated
2022-07-12 15:14:06 +02:00
Kerollmops
c8ebf0de47
Rename the validate function as an enriching function
2022-07-12 15:14:06 +02:00
Kerollmops
905af2a2e9
Use the primary key and external id in the transform
2022-07-12 15:14:05 +02:00
Kerollmops
742543091e
Constify the default primary key name
2022-07-12 14:55:52 +02:00
Kerollmops
5f1bfb73ee
Extract the primary key name and make it accessible
2022-07-12 14:55:52 +02:00
Kerollmops
6a0a0ae94f
Make the Transform read from an EnrichedDocumentsBatchReader
2022-07-12 14:55:52 +02:00
Kerollmops
8ebf5eed0d
Make the nested primary key work
2022-07-12 14:55:52 +02:00
Kerollmops
19eb3b4708
Make sur that we do not accept floats as documents ids
2022-07-12 14:55:52 +02:00
Kerollmops
2ceeb51c37
Support the auto-generated ids when validating documents
2022-07-12 14:55:51 +02:00
Kerollmops
399eec5c01
Fix the indexation tests
2022-07-12 14:55:51 +02:00
Kerollmops
fcfc4caf8c
Move the Object type in the lib.rs file and use it everywhere
2022-07-12 14:55:51 +02:00
Kerollmops
0146175fe6
Introduce the validate_documents_batch function
2022-07-12 14:55:51 +02:00
Kerollmops
bdc4263883
Introduce the validate_documents_batch function
2022-07-12 14:55:51 +02:00
Kerollmops
e8297ad27e
Fix the tests for the new DocumentsBatchBuilder/Reader
2022-07-12 14:52:56 +02:00
bors[bot]
ebddfdb9a3
Merge #578
...
578: Bump uuid to 1.1.2 r=ManyTheFish a=Kerollmops
Just to [align the version with Meilisearch](https://github.com/meilisearch/meilisearch/pull/2584 ).
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-07-05 14:56:08 +00:00
Kerollmops
1bfdcfc84f
Bump uuid to 1.1.2
2022-07-05 16:23:36 +02:00
Tamo
eaf28b0628
Apply review suggestions
...
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-07-05 15:30:33 +02:00
Tamo
3b309f654a
Fasten the document deletion
...
When a document deletion occurs, instead of deleting the document we mark it as deleted
in the new “soft deleted” bitmap. It is then removed from the search, and all the other
endpoints.
2022-07-05 15:30:33 +02:00
Tamo
d0aaa7ff00
Fix wrong internal ids assignments
2022-06-07 15:49:33 +02:00
ad hoc
31776fdc3f
add failing test
2022-06-07 15:49:33 +02:00
ManyTheFish
86ac8568e6
Use Charabia in milli
2022-06-02 16:59:11 +02:00
bors[bot]
08c6d50cd1
Merge #531
...
531: fix the mixed dataset geosearch indexing bug r=Kerollmops a=irevoire
port #529 to main
Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-16 16:06:36 +00:00
Tamo
0af399a6d7
fix the mixed dataset geosearch indexing bug
2022-05-16 17:37:45 +02:00
Tamo
f586028f9a
fix the searchable fields bug when a field is nested
...
Update milli/src/index.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-05-16 17:24:36 +02:00