2171: Update LICENSE with Meili SAS r=curquiza a=curquiza
Check with thomas, we must put the real name of the company
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
451: Update LICENSE with Meili SAS name r=Kerollmops a=curquiza
Check with thomas, we must put the real name of the company
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
450: Get rid of chrono in favor of time r=Kerollmops a=irevoire
We only use `chrono` as a wrapper around `time`, and since there has been an [open CVE on `chrono` for at least 3 months now](https://github.com/chronotope/chrono/pull/632) and the repo seems to be [struggling with maintenance](https://github.com/chronotope/chrono/pull/639), I think we should use `time` directly which is way more active and sufficient for our use case.
EDIT: Actually the CVE status has been known for more than 6 months: https://github.com/chronotope/chrono/issues/602
Co-authored-by: Irevoire <tamo@meilisearch.com>
2122: fix: docker image failed to boot on arm64 node r=curquiza a=Thearas
# Pull Request
## What does this PR do?
Fixes#2115.
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
Co-authored-by: Thearas <thearas850@gmail.com>
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2157: fix(auth): fix env being closed when dumping r=Kerollmops a=MarinPostma
When creating a dump, the auth store environment would be closed on drop, so subsequent dumps couldn't reopen the environment. I have added a flag in the environment to prevent the closing of the environment on drop when dumping.
Co-authored-by: ad hoc <postma.marin@protonmail.com>
442: fix phrase search r=curquiza a=MarinPostma
Run the exact match search on 7 words windows instead of only two. This makes false positive very very unlikely, and impossible on phrase query that are less than seven words.
Co-authored-by: ad hoc <postma.marin@protonmail.com>
2136: Refactoring CI regarding ARM binary publish r=curquiza a=curquiza
Fixes https://github.com/meilisearch/meilisearch/issues/1909
- Remove CI file to publish aarch64 binary and put the logic into `publish-binary.yml`
- Remove the job to publish armv8 binary
- Fix download-latest script accordingly
- Adapt dowload-latest with the specific case of the MacOS m1
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
Co-authored-by: meili-bot <74670311+meili-bot@users.noreply.github.com>
445: allow null values in csv r=Kerollmops a=MarinPostma
This pr allows null values in csv:
- if the field is of type string, then an empty field is considered null (`,,`), anything other is turned into a string (i.e `, ,` is a single whitespace string)
- if the field is of type number, when the trimmed field is empty, we consider the value null (i.e `,,`, `, ,` are both null numbers) otherwise we try to parse the number.
Co-authored-by: ad hoc <postma.marin@protonmail.com>
417: Change chunk size to 4MiB to fit more the end user usage r=Kerollmops a=ManyTheFish
Reverts meilisearch/milli#379
We made several indexing tests using different sizes of datasets (5 datasets from 9MiB to 100MiB) on several typologies of VMs (`XS: 1GiB RAM, 1 VCPU`, `S: 2GiB RAM, 2 VCPU`, `M: 4GiB RAM, 3 VCPU`, `L: 8GiB RAM, 4 VCPU`).
The result of these tests shows that the `4MiB` chunk size seems to be the best size compared to other chunk sizes (`2Mib`, `4MiB`, `8Mib`, `16Mib`, `32Mib`, `64Mib`, `128Mib`).
below is the average time per chunk size:
![Capture d’écran 2021-09-27 à 14 27 50](https://user-images.githubusercontent.com/6482087/134909368-ef0bc45e-68d5-49d1-aaf9-91113b7c410f.png)
<details>
<summary>Detailled data</summary>
<br>
![Capture d’écran 2021-09-27 à 14 39 48](https://user-images.githubusercontent.com/6482087/134909952-a36b1457-bbbd-4a6c-bbe5-519e4b926b5a.png)
</br>
</details>
Co-authored-by: Many <many@meilisearch.com>
444: Fix the parsing of ndjson requests to index more than the first line r=Kerollmops a=Kerollmops
This PR correctly uses the `BufRead` trait to read every line of the content instead of just the first one. This bug was only affecting the http-ui test crate.
Co-authored-by: Kerollmops <clement@meilisearch.com>
2005: auto batching r=MarinPostma a=MarinPostma
This pr implements auto batching. The basic functioning of this is that all updates that can be batched together are batched together while the previous batch is being processed.
For now, the only updates that can be batched together are the document addition updates (both update and replace), for a single index.
The batching is disabled by default for multiple reasons:
- We need more experimentation with the scheduling techniques
- Right now, if one task fails in a batch, the whole batch fails. We need more permissive error handling when processing document indexation.
There are four CLI options, for now, to interact with how the batch is scheduled:
- `enable-autobatching`: enable the autobatching feature.
- `debounce-duration-sec`: When an update is received, wait that number of seconds before batching and performing the updates. Defaults to 0s.
- `max-batch-size`: the maximum number of tasks per batch, defaults to unlimited.
- `max-documents-per-batch`: the maximum number of documents in a batch, defaults to unlimited. The batch will always contain a least 1 task, no matter the number of documents in that task.
# Implementation
The current implementation is made of 3 major components:
## TaskStore
The `TaskStore` contains all the tasks. When a task is pushed, it is directly registered to the task store.
## Scheduler
The scheduler is in charge of making the batches. At its core, there is a `TaskQueue` and a job queue. `Job`s are always processed first. They are *volatile* tasks, that is, they don't have a TaskId and are not persisted to disk. Snapshots and dumps are examples of Jobs.
If no `Job` is available for processing, then the scheduler attempts to make a `Task` batch from the `TaskQueue`. The first step is to gather new tasks from the `TaskStore` to populate the `TaskQueue`. When this is done, we can prepare our batch. The `TaskQueue` is itself a `BinaryHeap` of `Tasklist`. Each `index_uid` is associated with a `TaskList` that contains all the updates associated with that index uid. Each `TaskList` in the `TaskQueue` is ordered by the id of its first task.
When preparing a batch, the `TaskList` at the top of the `TaskQueue` is popped, and the tasks are popped from the list to make the next batch. If there are remaining tasks in the list, the list is inserted back in the `TaskQueue`.
## UpdateLoop
The `UpdateLoop` role is to perform batch sequentially. Each time updates are pushed to the update store, the scheduler is notified, and will in turn notify the update loop that work can be performed. When notified, the update loop waits some time to wait for more incoming update and then asks the scheduler for the next batch to perform and perform it. When it is done, the status of the task is put back into the store, and the next batch is processed.
Co-authored-by: mpostma <postma.marin@protonmail.com>