meili-bors[bot]
72d3fa4898
Merge #4203
4203: Extract external document docids from docs on deletion by filter r=Kerollmops a=dureuill
This fixes some of the performance regression observed on `diff-indexing` when doing delete-by-filter with a filter matching many documents.
To delete 19 768 771 documents (hackernews dataset, all documents matching `type = comment`), here are the observed times:
|branch (commit sha1sum)|time|speed-down factor (lower is better)|
|--|--|--|
|`main` (48865470d7)|1212.885536s (~20min)|x1.0 (baseline)|
|`diff-indexing` (523519fdbf)|5385.550543s (~90min)|x4.44|
|**`diff-indexing-extract-primary-key`** (f8289cd974)|2582.323324s (~43min)|x2.13|
So we're still suffering a speed-down of x2.13, but that's much better than x4.44.
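For reference, the speed-down factors can be rechecked from the raw timings. The snippet below is only a sketch of that arithmetic; the helper names `speed_down_factor` and `round2` are made up for illustration and are not part of the codebase.

```rust
/// Ratio of a branch's wall-clock time to the baseline's (lower is better).
/// Hypothetical helper, used only to recheck the table.
fn speed_down_factor(baseline_secs: f64, branch_secs: f64) -> f64 {
    branch_secs / baseline_secs
}

/// Rounds to two decimal places, the precision used in the table.
fn round2(x: f64) -> f64 {
    (x * 100.0).round() / 100.0
}
```

With `main` at 1212.885536s as the baseline, `round2(speed_down_factor(1212.885536, 5385.550543))` gives 4.44 and `round2(speed_down_factor(1212.885536, 2582.323324))` gives 2.13, matching the table.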
---
Changes:
- Refactor the logic of PrimaryKey extraction to a struct
- Add a trait to abstract the extraction of field id from a name between `DocumentBatch` and `FieldIdMap`.
- Add `Index::external_id_of` to get the external ids of a bitmap of internal ids.
- Use this new method to add new Transform and Batch methods to remove documents that are known to be from the DB.
- Modify delete-by-filter to use the new method
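To illustrate the idea behind `Index::external_id_of`, here is a minimal, self-contained sketch of mapping a set of internal ids back to external ids. It is not the actual implementation: `ToyIndex` is a hypothetical stand-in for the index, a `BTreeSet<u32>` replaces the bitmap of internal ids, an in-memory `BTreeMap` replaces the database lookup, and the signature is an assumption.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for the index: internal docid -> external id.
/// The real index reads this from the DB; here it is an in-memory map.
struct ToyIndex {
    external_ids: BTreeMap<u32, String>,
}

impl ToyIndex {
    /// Returns the external id of each internal id, in ascending docid order.
    /// Fails with the first internal id that is not known to the index.
    fn external_id_of(&self, internal_ids: &BTreeSet<u32>) -> Result<Vec<String>, u32> {
        internal_ids
            .iter()
            .map(|id| self.external_ids.get(id).cloned().ok_or(*id))
            .collect()
    }
}
```

Given a set of internal ids that are known to be in the DB, such as the ones matched by a filter, the external ids come from one lookup per id.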
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-13 13:02:10 +00:00
Louis Dureuil
772964125d
Factor removal of document from DB
2023-11-13 13:51:22 +01:00
Louis Dureuil
378deb0bef
Rename trait
2023-11-13 13:38:36 +01:00
Louis Dureuil
264b10ec20
Fixup documentation
2023-11-09 16:23:20 +01:00
Louis Dureuil
825257da76
Use more efficient method for deletion in benchmarks
2023-11-09 16:13:15 +01:00
Louis Dureuil
f8289cd974
Use it from delete-by-filter
2023-11-09 14:23:15 +01:00
Louis Dureuil
3053e01c05
Batch::remove_documents_from_db_no_batch
2023-11-09 14:23:02 +01:00
Louis Dureuil
b11c2afac0
Index::external_id_of
2023-11-09 14:22:43 +01:00
Louis Dureuil
9cef800b2a
Enrich uses the new type
2023-11-09 14:22:05 +01:00
Louis Dureuil
db2fb86b8b
Extract PrimaryKey logic to a type
2023-11-09 14:19:16 +01:00
Many the fish
523519fdbf
Merge pull request #4195 from meilisearch/diff-indexing-remove-from-batch
Remove `IndexOperation::DocumentDeletion`
2023-11-08 10:29:49 +01:00
Louis Dureuil
ef6fa10f7a
Remove IndexOperation::DocumentDeletion
2023-11-06 12:16:15 +01:00
Louis Dureuil
620fee35f9
Fix benches
2023-11-06 11:56:46 +01:00
Louis Dureuil
cbaa54cafd
Fix clippy issues
2023-11-06 11:19:31 +01:00
Louis Dureuil
1bccf2079e
Correctly mark non-tests as non-tests
2023-11-06 11:03:56 +01:00
ManyTheFish
1b2ea6cf19
REVERT ME: ignore prefix pair databases tests
2023-11-06 10:46:22 +01:00
Louis Dureuil
1ad1fcc8c8
Remove all warnings
2023-11-06 10:31:14 +01:00
ManyTheFish
87610a5f98
Don't try to delete a document that is not in the database
2023-11-02 16:49:03 +01:00
Many the fish
2544bc1416
Merge pull request #4160 from meilisearch/diff-indexing-vector-points
Diff Indexing for the vector points
2023-11-02 16:01:51 +01:00
Clément Renault
ff522c919d
Fix the vector extractions for the diff indexing
2023-11-02 15:58:08 +01:00
Many the fish
1c39459cf4
Merge pull request #4179 from meilisearch/diff-indexing-fix-nested-primary-key
Diff indexing fix nested primary key
2023-11-02 15:39:50 +01:00
ManyTheFish
bf0651f23c
Implement iter method on ExternalDocumentsIds
2023-11-02 15:38:00 +01:00
ManyTheFish
5b20e625f3
fix merge
2023-11-02 15:31:37 +01:00
ManyTheFish
bc51d6157a
Fix transform reindexing path
2023-11-02 15:26:20 +01:00
ManyTheFish
1b4ff991c0
update typed chunks
2023-11-02 15:26:20 +01:00
ManyTheFish
4b64c33aa2
update vector extractor
2023-11-02 15:26:20 +01:00
ManyTheFish
12323d610e
Change the original document sorter key from the internal docid to a concatenation of the internal and the external docid
2023-11-02 15:26:20 +01:00
Clément Renault
44e9033b3a
Merge pull request #4181 from meilisearch/diff-indexing-parallel-transform
Use rayon to sort entries in parallel
2023-11-02 15:16:10 +01:00
Clément Renault
4d864f0702
Always sort internal Sorter entries in parallel
2023-11-02 14:47:43 +01:00
Clément Renault
b10c060bf7
Cleanup TOML
2023-11-01 14:03:04 +01:00
Clément Renault
e507ef5932
Slow the logging down
2023-11-01 13:49:32 +01:00
Clément Renault
c71b1d33ae
Sort entries using rayon in the transform sorters
2023-11-01 11:07:16 +01:00
Clément Renault
0fc446c62f
Add more timing logs to the Transform
2023-11-01 11:07:16 +01:00
Louis Dureuil
0fb6acefc3
Add snapshots for facets
2023-10-31 17:11:08 +01:00
Louis Dureuil
b1d1355b69
remove tests on soft-deleted
2023-10-31 16:36:27 +01:00
Louis Dureuil
f19332466e
Extract field value as values instead of Option<Value>
2023-10-31 16:36:27 +01:00
Louis Dureuil
03ddb4f310
use deladd in facet update tests
2023-10-31 16:36:27 +01:00
Louis Dureuil
c855cc2721
Remove unused test
2023-10-31 16:36:27 +01:00
Louis Dureuil
da0503ef80
Fix document count
2023-10-31 16:36:27 +01:00
ManyTheFish
94206b0055
Update tests
2023-10-31 13:48:47 +01:00
Louis Dureuil
b40253bf18
update snapshots
2023-10-31 10:30:48 +01:00
Louis Dureuil
d8bf3f3fc2
Remove unused snapshots
2023-10-31 10:12:49 +01:00
Louis Dureuil
9d59e8011a
fix some tests
2023-10-31 10:08:36 +01:00
Louis Dureuil
dad78cbf8d
Bulk facet remove deletes keys from DB when value empty
2023-10-31 09:53:55 +01:00
Louis Dureuil
4e91707a06
Rename test
2023-10-31 09:41:17 +01:00
Louis Dureuil
de10f20732
Fix field distribution again
2023-10-30 17:47:22 +01:00
Louis Dureuil
be395c7944
Change order of arguments to tokenizer_builder
2023-10-30 16:26:29 +01:00
Louis Dureuil
9fedd8101a
Fix tests
2023-10-30 15:11:07 +01:00
Louis Dureuil
54d07a8da3
Update field distribution taking into account both deletions and additions
2023-10-30 14:47:51 +01:00
Louis Dureuil
58690dfb19
Fix tests compilation after changes to ExternalDocumentsIds API
2023-10-30 13:34:07 +01:00