Commit Graph

7627 Commits

Author SHA1 Message Date
bors[bot]
267d14c28d
Merge #445
445: allow null values in csv r=Kerollmops a=MarinPostma

This PR allows null values in CSV:
- if the field is of type string, an empty field is considered null (`,,`); anything else is turned into a string (e.g. `, ,` is a single-whitespace string)
- if the field is of type number and the trimmed field is empty, the value is considered null (e.g. `,,` and `, ,` are both null numbers); otherwise we try to parse the number.
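
The two rules above can be sketched as follows. This is a minimal illustration, not the actual milli code: the `FieldType` and `Value` enums and the `parse_field` function are hypothetical names introduced here, and numbers are assumed to parse as `f64`.

```rust
/// Hypothetical field types, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FieldType {
    String,
    Number,
}

/// Hypothetical parsed value, for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    String(String),
    Number(f64),
}

/// Applies the null rules described above to one raw CSV field.
fn parse_field(raw: &str, ty: FieldType) -> Result<Value, std::num::ParseFloatError> {
    match ty {
        // String: only a strictly empty field is null; whitespace is kept as-is.
        FieldType::String if raw.is_empty() => Ok(Value::Null),
        FieldType::String => Ok(Value::String(raw.to_string())),
        // Number: the field is trimmed first; an empty result is null.
        FieldType::Number => {
            let trimmed = raw.trim();
            if trimmed.is_empty() {
                Ok(Value::Null)
            } else {
                // Assumption: numbers are parsed as f64 in this sketch.
                trimmed.parse::<f64>().map(Value::Number)
            }
        }
    }
}
```

With this sketch, `parse_field(" ", FieldType::String)` yields a one-space string, while `parse_field(" ", FieldType::Number)` yields `Value::Null`.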


Co-authored-by: ad hoc <postma.marin@protonmail.com>
2022-02-03 15:11:32 +00:00
ad hoc
bd2262ceea
allow null values in csv 2022-02-03 16:03:01 +01:00
ad hoc
13de251047
rewrite word pair distance gathering 2022-02-03 15:57:20 +01:00
Clémentine Urquizar
78cf8f1f9f
Fix typo 2022-02-02 19:32:20 +01:00
bors[bot]
fda4f229bb
Merge #417
417: Change chunk size to 4MiB to fit more the end user usage r=Kerollmops a=ManyTheFish

Reverts meilisearch/milli#379

We ran several indexing tests using datasets of different sizes (5 datasets from 9MiB to 100MiB) on several types of VMs (`XS: 1GiB RAM, 1 VCPU`, `S: 2GiB RAM, 2 VCPU`, `M: 4GiB RAM, 3 VCPU`, `L: 8GiB RAM, 4 VCPU`).
The results of these tests show that `4MiB` seems to be the best chunk size among those tested (`2MiB`, `4MiB`, `8MiB`, `16MiB`, `32MiB`, `64MiB`, `128MiB`).

Below is the average time per chunk size:

![Screenshot 2021-09-27 at 14 27 50](https://user-images.githubusercontent.com/6482087/134909368-ef0bc45e-68d5-49d1-aaf9-91113b7c410f.png)

<details>
<summary>Detailed data</summary>
<br>

![Screenshot 2021-09-27 at 14 39 48](https://user-images.githubusercontent.com/6482087/134909952-a36b1457-bbbd-4a6c-bbe5-519e4b926b5a.png)
</br>
</details> 


Co-authored-by: Many <many@meilisearch.com>
2022-02-02 18:30:59 +00:00
Clémentine Urquizar
1da7277817
Fix dowload-latest.sh according to the new name of the binary 2022-02-02 19:25:52 +01:00
Clémentine Urquizar
c71c95feb0
Refactor CIs to publish aaarch64 binary 2022-02-02 19:25:28 +01:00
bors[bot]
2468ebb76b
Merge #444
444: Fix the parsing of ndjson requests to index more than the first line r=Kerollmops a=Kerollmops

This PR correctly uses the `BufRead` trait to read every line of the content instead of just the first one. This bug was only affecting the http-ui test crate.
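
As a rough illustration of the difference, the sketch below iterates over every line of an NDJSON payload through `BufRead::lines` rather than stopping after the first one. The `read_ndjson_lines` function is a hypothetical helper, not the http-ui code, and it only collects the raw lines instead of deserializing them.

```rust
use std::io::{self, BufRead};

/// Collects every non-empty line of an NDJSON payload.
/// Hypothetical helper: each returned string is one raw JSON document.
fn read_ndjson_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut documents = Vec::new();
    // `lines()` yields each line without its trailing newline,
    // so the loop visits the whole payload, not just the first line.
    for line in reader.lines() {
        let line = line?;
        // Blank lines carry no document and are skipped in this sketch.
        if !line.trim().is_empty() {
            documents.push(line);
        }
    }
    Ok(documents)
}
```

Any `BufRead` implementor works as input, for example a byte slice (`payload.as_bytes()`) or a `BufReader` wrapping a file or request body.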

Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-02 17:59:44 +00:00
ManyTheFish
3bee31e6c7 bug(auth): Make API keys accept Null descriptions 2022-02-02 18:18:17 +01:00
Kerollmops
9142ba9dd4
Fix the parsing of ndjson requests to index more than the first line 2022-02-02 17:55:13 +01:00
Many
d59bcea749 Revert "Revert "Change chunk size to 4MiB to fit more the end user usage"" 2022-02-02 17:01:13 +01:00
mpostma
7541ab99cd
review changes 2022-02-02 12:59:01 +01:00
mpostma
d0aabde502
optimize 2 typos case 2022-02-02 12:56:09 +01:00
mpostma
55e6cb9c7b
typos on first letter counts as 2 2022-02-02 12:56:09 +01:00
mpostma
642c01d0dc
set max typos on ngram to 1 2022-02-02 12:56:08 +01:00
bors[bot]
9448ca58aa
Merge #2005
2005: auto batching r=MarinPostma a=MarinPostma

This PR implements auto-batching. The basic idea is that, while the previous batch is being processed, all incoming updates that can be batched together are grouped into the next batch.

For now, the only updates that can be batched together are the document addition updates (both update and replace), for a single index.

The batching is disabled by default for multiple reasons:
- We need more experimentation with the scheduling techniques
- Right now, if one task fails in a batch, the whole batch fails. We need more permissive error handling when processing document indexation.

There are four CLI options, for now, to interact with how the batch is scheduled:
- `enable-autobatching`: enable the autobatching feature.
- `debounce-duration-sec`: When an update is received, wait that number of seconds before batching and performing the updates. Defaults to 0s.
- `max-batch-size`: the maximum number of tasks per batch, defaults to unlimited.
- `max-documents-per-batch`: the maximum number of documents in a batch, defaults to unlimited. The batch will always contain at least 1 task, no matter the number of documents in that task.
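
To make the interaction between the two limits concrete, here is a minimal sketch of how a batch could be cut. It is an assumption-laden illustration, not the actual scheduler: the `Task` struct and `take_batch` function are hypothetical, `None` stands for "unlimited", and the document limit is assumed to be checked before a task is added.

```rust
use std::collections::VecDeque;

/// Hypothetical pending task, for illustration only.
struct Task {
    id: u64,
    document_count: usize,
}

/// Pops tasks from the front of `pending` until one of the limits is hit.
/// `None` means "unlimited". The first task is always taken, so a batch
/// contains at least one task whenever `pending` is not empty.
fn take_batch(
    pending: &mut VecDeque<Task>,
    max_batch_size: Option<usize>,
    max_documents_per_batch: Option<usize>,
) -> Vec<Task> {
    let mut batch = Vec::new();
    let mut documents = 0usize;
    while let Some(next) = pending.front() {
        let too_many_tasks = max_batch_size.map_or(false, |max| batch.len() >= max);
        let too_many_documents = max_documents_per_batch
            .map_or(false, |max| documents + next.document_count > max);
        // The limits only apply once the batch already holds a task.
        if !batch.is_empty() && (too_many_tasks || too_many_documents) {
            break;
        }
        documents += next.document_count;
        batch.push(pending.pop_front().unwrap());
    }
    batch
}
```

In this sketch, a single task holding more documents than `max-documents-per-batch` still forms a batch on its own, which matches the "at least 1 task" rule above.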

# Implementation

The current implementation is made of 3 major components:

## TaskStore
The `TaskStore` contains all the tasks. When a task is pushed, it is directly registered to the task store.

## Scheduler
The scheduler is in charge of making the batches. At its core, there is a `TaskQueue` and a job queue. `Job`s are always processed first. They are *volatile* tasks, that is, they don't have a TaskId and are not persisted to disk. Snapshots and dumps are examples of Jobs.

If no `Job` is available for processing, then the scheduler attempts to make a `Task` batch from the `TaskQueue`. The first step is to gather new tasks from the `TaskStore` to populate the `TaskQueue`. When this is done, we can prepare our batch. The `TaskQueue` is itself a `BinaryHeap` of `TaskList`. Each `index_uid` is associated with a `TaskList` that contains all the updates associated with that index uid. Each `TaskList` in the `TaskQueue` is ordered by the id of its first task.

When preparing a batch, the `TaskList` at the top of the `TaskQueue` is popped, and the tasks are popped from the list to make the next batch. If there are remaining tasks in the list, the list is inserted back in the `TaskQueue`.
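
The queue behaviour described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the real scheduler: the `TaskList` fields, the `next_batch` function, and its `max_tasks` parameter are hypothetical, tasks are reduced to their ids, and lists in the heap are assumed to be non-empty. Since `std::collections::BinaryHeap` is a max-heap, the ordering is reversed so that the list with the smallest first task id sits at the top.

```rust
use std::cmp::Ordering;
use std::collections::{BinaryHeap, VecDeque};

/// Hypothetical per-index list of pending task ids, oldest first.
#[derive(Debug)]
struct TaskList {
    index_uid: String,
    task_ids: VecDeque<u64>,
}

impl TaskList {
    fn first_id(&self) -> Option<u64> {
        self.task_ids.front().copied()
    }
}

impl Ord for TaskList {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed comparison: the smallest first id becomes the heap maximum.
        other.first_id().cmp(&self.first_id())
    }
}

impl PartialOrd for TaskList {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for TaskList {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for TaskList {}

/// Pops the top list, takes up to `max_tasks` ids (at least one) from it,
/// and pushes the list back if tasks remain.
fn next_batch(queue: &mut BinaryHeap<TaskList>, max_tasks: usize) -> Option<(String, Vec<u64>)> {
    let mut list = queue.pop()?;
    let take = max_tasks.max(1).min(list.task_ids.len());
    let batch: Vec<u64> = list.task_ids.drain(..take).collect();
    let index_uid = list.index_uid.clone();
    if !list.task_ids.is_empty() {
        // The list is re-inserted and re-ordered by its new first task id.
        queue.push(list);
    }
    Some((index_uid, batch))
}
```

Re-inserting the remaining list means an index with many pending tasks does not starve the others: once its first id is larger than another index's first id, that other index is served next.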

## UpdateLoop
The role of the `UpdateLoop` is to perform batches sequentially. Each time updates are pushed to the update store, the scheduler is notified, and it in turn notifies the update loop that work can be performed. When notified, the update loop waits for some time to let more updates arrive, then asks the scheduler for the next batch and performs it. When it is done, the status of the task is put back into the store, and the next batch is processed.


Co-authored-by: mpostma <postma.marin@protonmail.com>
2022-02-02 11:04:30 +00:00
ad hoc
d852dc0d2b
fix phrase search 2022-02-01 20:21:33 +01:00
mpostma
c9a236b0af
feat(lib): auto-batching 2022-02-01 18:06:20 +01:00
bors[bot]
622c15e825
Merge #2096
2096: feat(auth): Tenant token r=Kerollmops a=ManyTheFish

Make meilisearch support JWT authentication signed with meilisearch API keys
using HS256, HS384 or HS512 algorithms.

Related spec: [specifications#89](https://github.com/meilisearch/specifications/pull/89) [rendered](https://github.com/meilisearch/specifications/blob/scoped-api-keys/text/0089-tenant-tokens.md)
Fix #1991 


Co-authored-by: ManyTheFish <many@meilisearch.com>
2022-01-27 10:38:41 +00:00
Kerollmops
fb79c32430
Compute the new, common and, deleted prefix words fst once 2022-01-27 11:00:18 +01:00
bors[bot]
054598734a
Merge #2120
2120: Bring `stable` into `main` r=curquiza a=curquiza

I forgot to do it, tell me `@Kerollmops` or `@irevoire` if it's useful or not. I would say yes, otherwise I will have conflicts when I try to bring `main` into `stable` for the next release. Maybe I'm wrong

Co-authored-by: Irevoire <tamo@meilisearch.com>
Co-authored-by: mpostma <postma.marin@protonmail.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>
2022-01-27 09:35:21 +00:00
Clément Renault
51d1e64b23
Remove, now useless, the WriteMethod enum 2022-01-27 10:08:35 +01:00
Clément Renault
e9c02173cf
Rework the WordsPrefixPositionDocids update to compute a subset of the database 2022-01-27 10:08:35 +01:00
Clément Renault
dbba5fd461
Create a function to simplify the word prefix pair proximity docids compute 2022-01-27 10:08:35 +01:00
Clément Renault
e760e02737
Fix the computation of the newly added and common prefix pair proximity words 2022-01-27 10:08:35 +01:00
Clément Renault
d59e559317
Fix the computation of the newly added and common prefix words 2022-01-27 10:08:34 +01:00
Clément Renault
2ec8542105
Rework the WordPrefixDocids update to compute a subset of the database 2022-01-27 10:08:34 +01:00
Clément Renault
28692f65be
Rework the WordPrefixDocids update to compute a subset of the database 2022-01-27 10:08:34 +01:00
Clément Renault
5404bc02dd
Move the fst_stream_into_hashset method in the helper methods 2022-01-27 10:06:00 +01:00
Clément Renault
c90fa95f93
Only compute the word prefix pairs on the created word pair proximities 2022-01-27 10:06:00 +01:00
Clément Renault
822f67e9ad
Bring the newly created word pair proximity docids 2022-01-27 10:06:00 +01:00
Clément Renault
d28f18658e
Retrieve the previous version of the words prefixes FST 2022-01-27 10:05:59 +01:00
ManyTheFish
7ca647f0d0 feat(auth): Implement Tenant token
Make meilisearch support JWT authentication signed with meilisearch API keys
using HS256, HS384 or HS512 algorithms.

Related spec: https://github.com/meilisearch/specifications/pull/89
Fix #1991
2022-01-27 08:25:39 +01:00
bors[bot]
38d23546a5
Merge #431
431: Fix and improve word prefix pair proximity r=ManyTheFish a=Kerollmops

This PR first fixes the algorithm we used to select and compute the word prefix pair proximity database. The previous version was skipping nearly all of the prefixes. The issue is that this fix made the method take more time, while we were trying to reduce the time spent in it.

With `@ManyTheFish` we found out that we could skip some of the work we were doing by:
 - discarding the prefixes that are shorter than a specific threshold (default: 2).
 - discarding the word prefix pairs with a proximity bigger than a specific threshold (default: 4).
 - removing the unused threshold that specified a minimum amount of word docids to merge.
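
The first two filters can be pictured with the small sketch below. The names (`Thresholds`, `min_prefix_length`, `max_proximity`, `keep_pair`) are hypothetical and only the default values come from the description above; measuring the prefix length in characters rather than bytes is also an assumption.

```rust
/// Hypothetical thresholds; the default values are those listed above.
struct Thresholds {
    min_prefix_length: usize,
    max_proximity: u8,
}

impl Default for Thresholds {
    fn default() -> Self {
        Thresholds { min_prefix_length: 2, max_proximity: 4 }
    }
}

/// Returns `true` when a (prefix, proximity) pair should be kept,
/// that is, when neither of the two discard rules applies.
fn keep_pair(prefix: &str, proximity: u8, thresholds: &Thresholds) -> bool {
    // Assumption: the prefix length is counted in characters.
    let long_enough = prefix.chars().count() >= thresholds.min_prefix_length;
    let close_enough = proximity <= thresholds.max_proximity;
    long_enough && close_enough
}
```

With the defaults, the prefix `he` at proximity 4 is kept, while `h` at any proximity and `hel` at proximity 5 are discarded.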

We will take more time to do further optimization, such as no longer clearing and recomputing the database from scratch; instead, we will compute the subsets of keys to create, keep, and merge. This change is a little more complex than what this PR does.

I am keeping this PR as a draft because I want to test further whether the real gain is sufficient and whether the approach is valid. I advise reviewers to go commit by commit to see the changes bit by bit, as reviewing the whole PR at once can be hard.

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-01-27 07:04:56 +00:00
Clémentine Urquizar - curqui
aa50fcb1f0
Merge branch 'main' into stable 2022-01-26 20:17:41 +01:00
Clémentine Urquizar - curqui
b408de0761
Merge pull request #2117 from meilisearch/rebranding
Changes related to the rebranding
2022-01-26 19:58:54 +01:00
Tamo
72d9c5ee5c
fix(rebranding): Update the ascii art (#2118) 2022-01-26 18:53:07 +01:00
bors[bot]
c63f945093
Merge #441
441: Changes related to the rebranding r=curquiza a=meili-bot

_This PR is auto-generated._

 - [X] Change the name `MeiliSearch` to `Meilisearch` in README.
 - [x] ⚠️ Ensure the bot did not update parts you don’t want it to update, especially in the code examples in the Getting started.
 - [x] Please, ensure there is no other "MeiliSearch". For example, in the comments or in the test names.
 - [x] Put the new logo on the README if needed -> still using the milli logo so far


Co-authored-by: meili-bot <74670311+meili-bot@users.noreply.github.com>
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2022-01-26 17:07:37 +00:00
Clémentine Urquizar
2b7440d4b5
Fix some typo 2022-01-26 17:56:18 +01:00
Clémentine Urquizar - curqui
3bc6a18bcd
Update README.md
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-01-26 17:54:51 +01:00
Clémentine Urquizar - curqui
db56d6cb11
Update download-latest.sh 2022-01-26 17:54:22 +01:00
Clémentine Urquizar - curqui
a5759139bf
Update CONTRIBUTING.md
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-01-26 17:51:38 +01:00
Clémentine Urquizar
0f213f2202
Replace MeiliSearch by Meilisearch 2022-01-26 17:49:55 +01:00
Clémentine Urquizar
de808a391a
Replace meilisearch by Meilisearch 2022-01-26 17:48:22 +01:00
Clémentine Urquizar
8a959da120
Update MeiliSearch into Meilisearch everywhere 2022-01-26 17:43:16 +01:00
Clémentine Urquizar
0a78750465
Replace meilisearch by Meilisearch 2022-01-26 17:35:56 +01:00
Clémentine Urquizar
372f4fc924
Replace logo 2022-01-26 17:34:31 +01:00
meili-bot
0d282e3cc5 Update README.md 2022-01-26 16:33:16 +01:00
meili-bot
ae5b401e74 Update README.md 2022-01-26 16:31:04 +01:00
meili-bot
c562655be7 Update CONTRIBUTING.md 2022-01-26 16:31:03 +01:00