4090: Diff indexing r=ManyTheFish a=ManyTheFish
This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB.
## Why focus on reducing the writings in LMDB?
The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)
see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)
Because the writing is mono-threaded, it represents a bottleneck in the indexing, reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing.
## Give Feedback
We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.
## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example:
- Updating the number of "likes" or "stars" of a song or a movie
- Updating the "stock count" or the "price" of a product
b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck.
c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation
- [x] Create a preprocessing function that creates the diff-based documents chuck (`OBKV<fid, OBKV<AddDel, value>>`)
- [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of `OBKV<fid, value>`.
- [x] extract_docid_word_positions
- [x] extract_geo_points
- [x] extract_vector_points
- [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chucks
- [x] extract_fid_word_count_docids
- [x] extract_word_pair_proximity_docids
- [x] extract_word_position_docids
- [x] extract_word_docids
- Adapt the facet extractors to the new diff-chucks
- [x] extract_facet_number_docids
- [x] extract_facet_string_docids
- [x] extract_fid_docid_facet_values
- [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️⚠️
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
- [x] remove `new_documents_ids` entirely and `replaced_documents_ids`
- [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
- [x] Remove deletion pipeline after autobatcher
- [x] remove autobatcher deletion pipeline
- [x] everything uses `IndexOperation::DocumentOperation`
- [x] repair deletion by internal id for filter by delete
- [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.
- [x] Remove soft-deleted documents
#### FIXME
- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization
### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one.
- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️⚠️
## Technical Concerns
- The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded)
- The part 2 implementation needs to change the databases which could have a significant impact on the search performances
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
The issue was that the operation « DocumentDeletionByFilter » was not
declared as an index operation. That means the indexes stats were not
reprocessed after the application of the operation.
3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill
# Pull Request
## Related issue
Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760
## What does this PR do?
### User standpoint
- Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error.
- Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred.
### Implementation standpoint
- Add `Batch::index_uid` to get the `index_uid` of a batch of task if there is one
- `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows to (re)open indexes with a size different from the base `IndexScheduler::index_size` field
- `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the behavior should be wrt the next tick.
- Add `IndexStatus::BeingResized` that contains a handle that a thread can use to await for the resize operation to complete and the index to be available again.
- Add `IndexMapper::resize_index` to increase the size of an index.
- In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task.
## Testing the PR
The following diff can be applied to this branch to make testing the PR easier:
<details>
```diff
diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs
index 553ab45a..022b2f00 100644
--- a/index-scheduler/src/index_mapper.rs
+++ b/index-scheduler/src/index_mapper.rs
`@@` -228,13 +228,15 `@@` impl IndexMapper {
drop(lock);
+ std:🧵:sleep_ms(2000);
+
let current_size = index.map_size()?;
let closing_event = index.prepare_for_closing();
- log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
+ log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
closing_event.wait();
- log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
+ log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
let index_path = self.base_path.join(uuid.to_string());
let index = self.create_or_open_index(&index_path, None, 2 * current_size)?;
`@@` -268,8 +270,10 `@@` impl IndexMapper {
match index {
Some(Available(index)) => break index,
Some(BeingResized(ref resize_operation)) => {
+ log::error!("waiting for resize end");
// Deadlock: no lock taken while doing this operation.
resize_operation.wait();
+ log::error!("trying our luck again!");
continue;
}
Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())),
diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs
index 11b17d05..242dc095 100644
--- a/index-scheduler/src/lib.rs
+++ b/index-scheduler/src/lib.rs
`@@` -908,6 +908,7 `@@` impl IndexScheduler {
///
/// Returns the number of processed tasks.
fn tick(&self) -> Result<TickOutcome> {
+ log::error!("ticking!");
#[cfg(test)]
{
*self.run_loop_iteration.write().unwrap() += 1;
diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs
index 050c825a..63f312f6 100644
--- a/meilisearch/src/main.rs
+++ b/meilisearch/src/main.rs
`@@` -25,7 +25,7 `@@` fn setup(opt: &Opt) -> anyhow::Result<()> {
#[actix_web::main]
async fn main() -> anyhow::Result<()> {
- let (opt, config_read_from) = Opt::try_build()?;
+ let (mut opt, config_read_from) = Opt::try_build()?;
setup(&opt)?;
`@@` -56,6 +56,8 `@@` We generated a secure master key for you (you can safely copy this token):
_ => (),
}
+ opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap();
+
let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?;
#[cfg(all(not(debug_assertions), feature = "analytics"))]
```
</details>
Mainly, these debug changes do the following:
- Set the default index size to 1MiB so that index resizes are initially frequent
- Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos)
- Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests)
## Open questions
- Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially in the order of 500GiBs. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB.
## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [ ] Have you made sure that the title is accurate and descriptive of the changes?
Thank you so much for contributing to Meilisearch!
3470: Autobatch addition and deletion r=irevoire a=irevoire
This PR adds the capability to meilisearch to batch document addition and deletion together.
Fix https://github.com/meilisearch/meilisearch/issues/3440
--------------
Things to check before merging;
- [x] What happens if we delete multiple time the same documents -> add a test
- [x] If a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
While updating the test suite I also noticed an issue with the indexed_documents value of failed task and had to update it.
I also named a bunch of snapshots that had no name sorry 😬
* Fix error code of the "duplicate index found" error
* Use the content of the ProcessingTasks in the tasks cancelation system
* Change the missing_filters error code into missing_task_filters
* WIP Introduce the invalid_task_uid error code
* Use more precise error codes/message for the task routes
+ Allow star operator in delete/cancel tasks
+ rename originalQuery to originalFilters
+ Display error/canceled_by in task view even when they are = null
+ Rename task filter fields by using their plural forms
+ Prepare an error code for canceledBy filter
+ Only return global tasks if the API key action `index.*` is there
* Add canceledBy task filter
* Update tests following task API changes
* Rename original_query to original_filters everywhere
* Update more insta-snap tests
* Make clippy happy
They're a happy clip now.
* Make rustfmt happy
>:-(
* Fix Index name parsing error message to fit the specification
* Bump milli version to 0.35.1
* Fix the new error messages
* fix the error messages and add tests
* rename the error codes for the sake of consistency
* refactor the way we send the cli informations + add the analytics for the config file and ssl usage
* Apply suggestions from code review
Co-authored-by: Clément Renault <clement@meilisearch.com>
* add a comment over the new infos structure
* reformat, sorry @kero
* Store analytics for the documents deletions
* Add analytics on all the settings
* Spawn threads with names
* Spawn rayon threads with names
* update the distinct attributes to the spec update
* update the analytics on the search route
* implements the analytics on the health and version routes
* Fix task details serialization
* Add the question mark to the task deletion query filter
* Add the question mark to the task cancelation query filter
* Fix tests
* add analytics on the task route
* Add all the missing fields of the new task query type
* Create a new analytics for the task deletion
* Create a new analytics for the task creation
* batch the tasks seen events
* Update the finite pagination analytics
* add the analytics of the swap-indexes route
* Stop removing the DB when failing to read it
* Rename originalFilters into originalFilters
* Rename matchedDocuments into providedIds
* Add `workflow_dispatch` to flaky.yml
* Bump grenad to 0.4.4
* Bump milli to version v0.37.0
* Don't multiply total memory returned by sysinfo anymore
sysinfo now returns bytes rather than KB
* Add a dispatch to the publish binaries workflow
* Fix publish release CI
* Don't use gold but the default linker
* Always display details for the indexDeletion task
* Fix the insta tests
* refactorize the whole test suite
1. Make a call to assert_internally_consistent automatically when snapshoting the scheduler. There is no point in snapshoting something broken and expect the dumb humans to notice.
2. Replace every possible call to assert_internally_consistent by a snapshot of the scheduler. It takes as many lines and ensure we never change something without noticing in any tests ever.
3. Name every snapshots: it's easier to debug when something goes wrong and easier to review in general.
4. Stop skipping breakpoints, it's too easy to miss something. Now you must explicitely show which path is the scheduler supposed to use.
5. Add a timeout on the channel.recv, it eases the process of writing tests, now when something file you get a failure instead of a deadlock.
* rebase on release-v0.30
* makes clippy happy
* update the snapshots after a rebase
* try to remove the flakyness of the failing test
* Add more analytics on the ranking rules positions
* Update the dump test to check for the dumpUid dumpCreation task details
* send the ranking rules as a string because amplitude is too dumb to process an array as a single value
* Display a null dumpUid until we computed the dump itself on disk
* Update tests
* Check if the master key is missing before returning an error
Co-authored-by: Loïc Lecrenier <loic.lecrenier@me.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
And make index_not_found error asynchronous, since we can't know
whether the index will exist by the time the index swap task is
processed.
Improve the index-swap test to verify that future tasks are not swapped
and to test the new error messages that were introduced.