Commit Graph

1933 Commits

Author SHA1 Message Date
Louis Dureuil
8370fbc92b
Fix snaps 2023-10-30 11:40:20 +01:00
Louis Dureuil
85f42fbc03
Handle external to internal id mapping from TypedChunk::Documents 2023-10-30 11:40:20 +01:00
Louis Dureuil
c6b3c18c85
WIP: Comment out document deletion in other pipelines than update
TODO: fix calls to DELETE route
2023-10-30 11:40:20 +01:00
Louis Dureuil
bafeb892a7
Modify Index after changes to ExternalDocumentsIds 2023-10-30 11:40:20 +01:00
Louis Dureuil
8fb221dae3
Refactor ExternalDocumentsIds
- Remove soft deleted
- Add apply method that takes a list of operations to encapsulate modifications to the external -> internal mapping
2023-10-30 11:40:20 +01:00
Louis Dureuil
946c762d28
WIP: reset documents in TypedChunk::Documents 2023-10-30 11:40:20 +01:00
Louis Dureuil
cda6ca1ee6
Remove TypedChunk::NewDocumentIds 2023-10-30 11:40:18 +01:00
Louis Dureuil
696fcf4d18
Fix document insertion into LMDB 2023-10-30 11:39:31 +01:00
ManyTheFish
476e4d3dbe
Use value buffer instead of the initial value when writing the final result in the sorter 2023-10-30 11:39:31 +01:00
Clément Renault
576fa9c6da
Remove useless comment 2023-10-30 11:39:31 +01:00
Kerollmops
77dcbff6b2
Remove and Insert the DelAdd geo points 2023-10-30 11:39:31 +01:00
Kerollmops
544440c363
Ignore geo fields when the Del and Add content is the same 2023-10-30 11:39:31 +01:00
Clément Renault
a3dae4db9b
Extract the geo fields DelAdd and generate a new DelAdd obkv with it 2023-10-30 11:39:31 +01:00
ManyTheFish
ba90a5ec0e
update extract fid word count docids 2023-10-30 11:39:31 +01:00
Louis Dureuil
b26dc9aabe
Explanatory code comment 2023-10-30 11:39:31 +01:00
Louis Dureuil
66abac9364
Use specialized KvReaderDelAdd type
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-10-30 11:39:31 +01:00
Louis Dureuil
59f88c14b3
Simplify facet update after removing Index::faceted_documents_ids 2023-10-30 11:39:29 +01:00
Louis Dureuil
14832cb324
Remove Index::faceted_documents_ids 2023-10-30 11:37:32 +01:00
Louis Dureuil
04ec293024
Facet Incremental update 2023-10-30 11:37:30 +01:00
Louis Dureuil
f67ff3a738
Facets Bulk update 2023-10-30 11:36:40 +01:00
Clément Renault
560e8f5613
Introduce the CboRoaringBitmapCodec merge_deladd_into and use it 2023-10-30 11:34:55 +01:00
Clément Renault
2d3f15f82c
Introduce a function to only serialize the Add side of a DelAdd obkv 2023-10-30 11:34:55 +01:00
Clément Renault
40186bf403
Rename FieldIdWordCountDocids correctly 2023-10-30 11:34:50 +01:00
ManyTheFish
87e3d27878
update extract word pair proximity to support deladd obkvs 2023-10-30 11:34:02 +01:00
ManyTheFish
6bcf8b4f8c
update extract word position docids 2023-10-30 11:34:02 +01:00
ManyTheFish
46aa75abdb
update extract word docids 2023-10-30 11:34:02 +01:00
ManyTheFish
2597bbd107
Make the script language docids map take a tuple of roaring bitmaps expressing the deletions and the additions 2023-10-30 11:34:00 +01:00
Clément Renault
e2bc054604
Update extract_facet_string_docids to support deladd obkvs 2023-10-30 11:32:36 +01:00
Clément Renault
fcd3a1434d
Update extract_facet_number_docids to support deladd obkvs 2023-10-30 11:31:04 +01:00
Clément Renault
a82dee21e0
Rename docid_fid into fid_docid 2023-10-30 11:31:02 +01:00
Clément Renault
bc45c1206d
Implement all the facet extraction paths and simplify them 2023-10-30 11:29:08 +01:00
Clément Renault
6ae4100f07
Generate the DelAdd for is_null, is_empty, and exists 2023-10-30 11:29:08 +01:00
Clément Renault
0c47defeee
Work on fid docid facet values rewrite 2023-10-30 11:29:06 +01:00
ManyTheFish
313b16bec2
Support diff indexing on extract_docid_word_positions 2023-10-30 11:24:19 +01:00
ManyTheFish
1dd97578a8
Make the transform struct return diff-based documents obkvs 2023-10-30 11:22:07 +01:00
ManyTheFish
f5ef69293b
deactivate prefix dbs 2023-10-30 11:22:07 +01:00
ManyTheFish
1c5705c164
clean PR warnings 2023-10-30 11:22:05 +01:00
ManyTheFish
66c2c82a18
Split wpp in several sorters 2023-10-30 11:15:02 +01:00
ManyTheFish
28a8d0ccda
Fix word pair proximity 2023-10-30 11:15:02 +01:00
ManyTheFish
96be85396d
Use a vecDeque in wpp database 2023-10-30 11:15:02 +01:00
ManyTheFish
df9e5c8651
Generalize usage of CboRoaringBitmap codec to ease the use 2023-10-30 11:15:02 +01:00
ManyTheFish
b541d48847
Add buffer to the obkv writer 2023-10-30 11:15:02 +01:00
ManyTheFish
8ccf32d1a0
Compute word_fid_docids before word_docids and exact_word_docids 2023-10-30 11:15:02 +01:00
ManyTheFish
db1ca21231
add puffin in sorter into reader function 2023-10-30 11:15:00 +01:00
ManyTheFish
11ea5acff9
Fix 2023-10-30 11:13:10 +01:00
ManyTheFish
8d77736a67
Fix fid_word_docids 2023-10-30 11:13:10 +01:00
ManyTheFish
748b333161
Add useful debug assert before key insertion in database 2023-10-30 11:13:10 +01:00
ManyTheFish
17b647dfe5
Wip 2023-10-30 11:13:08 +01:00
meili-bors[bot]
5e0485d8dd
Merge #4131
4131: Reduce proximity range from 7 to 3 r=Kerollmops a=ManyTheFish

## Summary
This PR aims to reduce the impact of the proximity databases on the indexing time and on the database size by reducing the maximum distance between two words to be indexed in the proximity database.

## Stats

### Impact on database size and indexing time
![Impact on datasets](https://github.com/meilisearch/meilisearch/assets/6482087/28ed3d96-bdde-41c1-bdac-e90c1b1dbb23)

### Impact on search relevancy

<details>

| dataset_name | host_name        | Relevancy rate (Precision) | completion_rate 25.00% | completion_rate 50.00% | completion_rate 75.00% | completion_rate 100.00% |
|--------------|------------------|------------------------------------|-----------------|-----------------|-----------------|-----------------|
| FBIS         | 1_4_0            | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| FBIS         | 1_4_0            | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| FBIS         | 1_4_0            | percentile-50 |           0.00% |           0.00% |           5.00% |           5.56% |
| FBIS         | 1_4_0            | percentile-75 |           0.00% |          12.50% |          35.00% |          45.00% |
| FBIS         | 1_4_0            | percentile-90 |          20.00% |          40.00% |                 |         100.00% |
| FBIS         | 1_4_0            | average       |           5.78% |          11.16% |          21.90% |          26.29% |
| FBIS         | reduce_proximity | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| FBIS         | reduce_proximity | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| FBIS         | reduce_proximity | percentile-50 |           0.00% |           0.00% |           5.00% |           5.56% |
| FBIS         | reduce_proximity | percentile-75 |           0.00% |          15.00% |          35.00% |          40.00% |
| FBIS         | reduce_proximity | percentile-90 |          20.00% |          40.00% |          85.00% |         100.00% |
| FBIS         | reduce_proximity | average       |           5.55% |          11.34% |          21.75% |          26.14% |
| FR94         | 1_4_0            | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| FR94         | 1_4_0            | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| FR94         | 1_4_0            | percentile-50 |           0.00% |           0.00% |           0.00% |           0.00% |
| FR94         | 1_4_0            | percentile-75 |           0.00% |           5.00% |          15.00% |          42.11% |
| FR94         | 1_4_0            | percentile-90 |          15.00% |          54.55% |         100.00% |         100.00% |
| FR94         | 1_4_0            | average       |           5.95% |          12.07% |          18.70% |          25.57% |
| FR94         | reduce_proximity | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| FR94         | reduce_proximity | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| FR94         | reduce_proximity | percentile-50 |           0.00% |           0.00% |           0.00% |           0.00% |
| FR94         | reduce_proximity | percentile-75 |           0.00% |           5.00% |          15.00% |          42.11% |
| FR94         | reduce_proximity | percentile-90 |          15.00% |          54.55% |         100.00% |         100.00% |
| FR94         | reduce_proximity | average       |           5.79% |          12.00% |          18.70% |          25.53% |
| FT           | 1_4_0            | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| FT           | 1_4_0            | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| FT           | 1_4_0            | percentile-50 |           0.00% |           0.00% |           5.00% |          10.00% |
| FT           | 1_4_0            | percentile-75 |           0.00% |          15.00% |          30.00% |          40.00% |
| FT           | 1_4_0            | percentile-90 |          20.00% |          50.00% |          65.00% |         100.00% |
| FT           | 1_4_0            | average       |           5.08% |          12.58% |          20.00% |          25.49% |
| FT           | reduce_proximity | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| FT           | reduce_proximity | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| FT           | reduce_proximity | percentile-50 |           0.00% |           0.00% |           5.00% |          10.00% |
| FT           | reduce_proximity | percentile-75 |           0.00% |          15.00% |          30.00% |          40.00% |
| FT           | reduce_proximity | percentile-90 |          10.00% |          45.00% |          60.00% |         100.00% |
| FT           | reduce_proximity | average       |           5.01% |          12.64% |          20.10% |          25.53% |
| LAT          | 1_4_0            | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| LAT          | 1_4_0            | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| LAT          | 1_4_0            | percentile-50 |           0.00% |           0.00% |           5.00% |           5.00% |
| LAT          | 1_4_0            | percentile-75 |           5.00% |          15.00% |          30.00% |          30.00% |
| LAT          | 1_4_0            | percentile-90 |          15.00% |          45.00% |          60.00% |          80.00% |
| LAT          | 1_4_0            | average       |           4.80% |          11.80% |          17.88% |          21.62% |
| LAT          | reduce_proximity | percentile-10 |           0.00% |           0.00% |           0.00% |           0.00% |
| LAT          | reduce_proximity | percentile-25 |           0.00% |           0.00% |           0.00% |           0.00% |
| LAT          | reduce_proximity | percentile-50 |           0.00% |           0.00% |           5.00% |           5.00% |
| LAT          | reduce_proximity | percentile-75 |           0.00% |          11.11% |          25.00% |          35.00% |
| LAT          | reduce_proximity | percentile-90 |          15.00% |          45.00% |          55.00% |          80.00% |
| LAT          | reduce_proximity | average       |           4.43% |          11.23% |          17.32% |          21.45% |

</details>

### Impact on Search time

| dataset_name | host_name        |      25.00% |      50.00% |      75.00% |     100.00% | Average     |
|--------------|------------------|------------:|------------:|------------:|------------:|-------------|
| FBIS         | 1_4_0            |        3.45 | 7.446666667 | 9.773489933 | 9.620300752 | 7.572614338 |
| FBIS         | reduce_proximity | 2.983333333 | 5.316666667 | 6.911073826 | 7.637218045 | 5.712072968 |
| FR94         | 1_4_0            | 2.236666667 |        4.45 | 5.523489933 | 4.560150376 | 4.192576744 |
| FR94         | reduce_proximity |        2.09 | 3.991666667 | 4.981543624 | 4.266917293 | 3.832531896 |
| FT           | 1_4_0            | 5.956666667 | 9.656666667 | 13.86912752 | 10.83270677 |  10.0787919 |
| FT           | reduce_proximity |        4.51 | 5.981666667 | 7.701342282 | 6.766917293 |  6.23998156 |
| LAT          | 1_4_0            | 5.856666667 | 9.233333333 | 12.98322148 | 10.78759398 | 9.715203865 |
| LAT          | reduce_proximity |        6.91 | 6.706666667 | 8.463087248 | 8.265037594 | 7.586197877 |

## Technical approach

- Ensure the MAX_DISTANCE constant is used everywhere needed
- Reduce the MAX_DISTANCE from 8 to 4
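
To illustrate why a smaller bound shrinks the proximity database, here is a minimal sketch, not the actual milli code. It assumes `MAX_DISTANCE` is an exclusive upper bound (which would reconcile "8 to 4" with the "7 to 3" in the title); the helper `word_pairs_within` is hypothetical and only enumerates the word pairs that would be candidates for indexing.

```rust
/// Exclusive upper bound on the distance between two indexed words.
/// Assumption: with a bound of 4, the largest stored proximity is 3.
const MAX_DISTANCE: u32 = 4;

/// Hypothetical helper: returns every (left, right, distance) triple whose
/// distance, counted in word positions, is strictly below `max_distance`.
fn word_pairs_within(words: &[&str], max_distance: u32) -> Vec<(String, String, u32)> {
    let mut pairs = Vec::new();
    for (i, left) in words.iter().enumerate() {
        // Only look forward: each unordered pair is emitted once.
        for (j, right) in words.iter().enumerate().skip(i + 1) {
            let distance = (j - i) as u32;
            if distance >= max_distance {
                // Later words are even farther away, so stop scanning.
                break;
            }
            pairs.push((left.to_string(), right.to_string(), distance));
        }
    }
    pairs
}

/// Convenience wrapper using the reduced bound.
fn word_pairs(words: &[&str]) -> Vec<(String, String, u32)> {
    word_pairs_within(words, MAX_DISTANCE)
}
```

On a nine-word sentence, `word_pairs_within(&words, 8)` yields 35 pairs while `word_pairs(&words)` yields 21, so the number of entries to write grows with the bound; the real extraction may differ in how it counts positions and orders pairs.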

## Related

TBD

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-10-18 14:56:08 +00:00
ManyTheFish
27eec21415
Fix tests 2023-10-18 16:03:22 +02:00