Commit Graph

524 Commits

Author SHA1 Message Date
Louis Dureuil
8a941c0241
Smaller review changes 2024-05-22 14:44:42 +02:00
Louis Dureuil
eccbcf5130
Increase index-scheduler test timeouts 2024-05-21 14:59:08 +02:00
Louis Dureuil
9969f7a638
Add test on index-scheduler 2024-05-20 14:44:10 +02:00
Louis Dureuil
02714ef5ed
Add vectors from vector DB in dump 2024-05-20 10:36:18 +02:00
Tamo
897d25780e update milli to latest version 2024-05-16 18:31:32 +02:00
Tamo
9fffb8e83d make clippy happy 2024-05-14 17:36:32 +02:00
meili-bors[bot]
4d5971f343
Merge #4621
4621: Bring back changes from v1.8.0 into main r=curquiza a=curquiza



Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-05-06 13:46:39 +00:00
ManyTheFish
7468c1cf8d Introduce WildcardSetting that are serialized as wildcards by default 2024-04-24 18:15:03 +02:00
writegr
ab43a8a949 chore: fix some typos in comments
Signed-off-by: writegr <wellweek@outlook.com>
2024-04-18 14:12:52 +08:00
ManyTheFish
a1ea224da9 Fix tests 2024-04-16 17:29:34 +02:00
ManyTheFish
5ab901dd30 Fix tests 2024-04-16 14:39:30 +02:00
ManyTheFish
bad46f88d6 Fix embedder test 2024-04-16 14:39:30 +02:00
meili-bors[bot]
56bf8503db
Merge #4537
4537: Expose distribution shift in settings r=ManyTheFish a=dureuill

See [usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#d652adc0890445658aaf36352dbc8802)

# Changes

- Distribution shift added to all embedders.
- Exposed in settings
- Changed the reindexing logic to not trigger a reindex operation when only the distribution shift or API key change

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-04-03 09:08:58 +00:00
redistay
182cb42953 chore: fix some typos in conments
Signed-off-by: redistay <wujunjing@outlook.com>
2024-04-02 19:37:55 +08:00
meili-bors[bot]
78668584cd
Merge #4533
4533: Hide api key in settings and task queue r=dureuill a=dureuill

# Pull Request

See [Usage page](https://meilisearch.notion.site/v1-8-AI-search-API-usage-135552d6e85a4a52bc7109be82aeca42#117f5ff7b19f4d95bb3ae0005f6c6633)

## Motivation

See [slack discussion (internal link)](https://meilisearch.slack.com/archives/C06GQP7FQ6P/p1709804022298749)


## Changes

- The value of the `apiKey` parameter is now hidden in the settings and the details of the task queue.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-03-28 16:02:53 +00:00
Bruno Casali
8f2606d79d
fixes typos 2024-03-27 14:26:47 -03:00
Louis Dureuil
92224f109a
Fix tests 2024-03-27 12:19:10 +01:00
Louis Dureuil
9a95ed619d
Add tests 2024-03-26 10:36:56 +01:00
Louis Dureuil
f82d056072
Hide secrets in settings and task queue 2024-03-26 10:36:24 +01:00
Tamo
f2f1367ec3
add a timeout to the webhook 2024-03-20 13:59:43 +01:00
Tamo
b130917933 add the content type in the webhook + improve the test 2024-03-05 11:22:29 +01:00
Louis Dureuil
452a343a2b
Fix imports 2024-02-28 18:09:40 +01:00
meili-bors[bot]
b005eb3289
Merge #4435
4435: Make update file deletion atomic r=Kerollmops a=irevoire

# Pull Request

## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4432
Fixes https://github.com/meilisearch/meilisearch/issues/4438 by adding the logs the user asked

## What does this PR do?
- Adds a bunch of logs to help debug this kind of issue in the future
- Delete the update files AFTER committing the update in the `index-scheduler` (thus, if a restart happens, we are able to re-process the batch successfully)
- Multi-thread the deletion of all update files.


Co-authored-by: Tamo <tamo@meilisearch.com>
2024-02-26 17:54:40 +00:00
Tamo
0562818c2a fix and remove the file-store hack of /dev/null 2024-02-26 13:59:41 +01:00
Tamo
36c27a18a1 implement the dry run ha parameter 2024-02-26 13:58:04 +01:00
Tamo
1eb1c043b5 disable the auto deletion of tasks when the ha mode is enabled 2024-02-26 13:58:04 +01:00
Tamo
eb25b07390 let you specify your task id 2024-02-26 13:56:31 +01:00
Tamo
066a7a3cde takes only one read transaction per thread 2024-02-26 10:43:04 +01:00
Tamo
91cdd502f8 When processing tasks, make the update file deletion atomic 2024-02-22 14:56:22 +01:00
Tamo
3b6544db6d Implement the experimental log mode cli flag 2024-02-13 18:09:15 +01:00
Louis Dureuil
ef994d84d0
Change error messages and fix tests 2024-02-08 15:04:06 +01:00
Tamo
f70a615ed9
update the github discussion links 2024-02-08 15:04:05 +01:00
Tamo
7ff722b72e
get rids of the log dependencies everywhere 2024-02-08 15:04:05 +01:00
Tamo
e23ec4886d
fix the tests and add tests on the experimental features 2024-02-08 15:04:03 +01:00
Tamo
7793ba67a4
hide the route logs behind a feature flag 2024-02-08 15:03:33 +01:00
Louis Dureuil
02e6c8a440
Add tracing to index-scheduler 2024-02-08 15:03:31 +01:00
Louis Dureuil
05edd85d75
Stabilize scoreDetails 2024-02-06 11:15:19 +01:00
meili-bors[bot]
1ccde9bf0b
Merge #4316
4316: Autobatch the task deletions r=curquiza a=irevoire

# Pull Request

## Related issue
Fix part of https://github.com/meilisearch/meilisearch-support/issues/69
Fix #4315 

## What does this PR do?
- Autobatch the task deletions

Co-authored-by: Tamo <tamo@meilisearch.com>
2024-01-15 17:54:50 +00:00
Tamo
b4d7d80ad9
autobatch the task deletions 2024-01-11 14:58:07 +01:00
Louis Dureuil
97bb1ff9e2
Move currently_updating_index to IndexMapper 2024-01-09 15:37:27 +01:00
meili-bors[bot]
658ec6e0a4
Merge #4279
4279: Check experimental feature on setting update query rather than in the task. r=ManyTheFish a=dureuill

Improve the UX by checking for the vector store feature and returning an error synchronously when sending a setting update, rather than in the indexing task.

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-12-22 11:36:12 +00:00
Louis Dureuil
ee54d3171e
Check experimental feature at query time 2023-12-21 15:26:12 +01:00
Clément Renault
fa2b96b9a5
Add an Authorization Header along with the webhook calls 2023-12-19 12:18:45 +01:00
Tamo
4fb25b8782
fix clippy 2023-12-19 10:35:51 +01:00
Tamo
c83a33017e
stream and chunk the data 2023-12-19 10:35:51 +01:00
Tamo
be72326c0a
gzip the tasks 2023-12-19 10:35:51 +01:00
Tamo
0b2fff27f2
update and fix the test 2023-12-19 10:35:51 +01:00
Tamo
3adbc2b942
return a task view instead of a task 2023-12-19 10:35:51 +01:00
Tamo
fbea721378
add a first working test with actixweb 2023-12-19 10:35:51 +01:00
Tamo
d78ad51082
Implement the webhook 2023-12-19 10:35:50 +01:00
Many the fish
9e1b458010
Merge branch 'main' into change-proximity-precision-settings 2023-12-18 09:08:47 +01:00
Louis Dureuil
e0cc775dc4
Various changes
- DistributionShift in Search object (to be set from model in embed?)
- Fix issue where embedder index wasn't computed at search time
- Accept as default embedder either the "default" one, or the only embedder when there is only one
2023-12-14 16:08:41 +01:00
Louis Dureuil
922a640188
WIP multi embedders
fixed template bugs
2023-12-14 16:08:41 +01:00
Louis Dureuil
abbe131084
Cosmetic change 2023-12-14 16:08:41 +01:00
Louis Dureuil
13c2c6c16b
Small commit to add hybrid search and autoembedding 2023-12-14 16:07:48 +01:00
ManyTheFish
35e1981488 Remove proximityPrecision form the experimental feature 2023-12-14 15:52:42 +01:00
Clément Renault
7e259cb0d2
Expose the --max-number-of-batched-tasks argument 2023-12-11 16:08:39 +01:00
ManyTheFish
1f4fc9c229 Make the feature experimental 2023-12-06 15:49:05 +01:00
Clément Renault
ec9b52d608
Rename copy_to_path to copy_to_file 2023-11-28 14:32:30 +01:00
Clément Renault
34c67ac389
Remove the possibility to fail fetching the env info 2023-11-28 14:31:23 +01:00
Clément Renault
0dbf1a16ff
Make clippy happy 2023-11-23 14:11:38 +01:00
Clément Renault
462b4c0080
Fix the tests 2023-11-23 12:07:35 +01:00
Clément Renault
0d4482625a
Make the changes to use heed v0.20-alpha.6 2023-11-23 11:43:58 +01:00
Clément Renault
7cb7e37ba8
Merge branch 'main' into tmp-release-v1.5.0 2023-11-21 16:30:46 +01:00
meili-bors[bot]
33b7c574ea
Merge #4090
4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing in LMDB.

## Why focus on reducing the writings in LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computing or the extraction of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Capture d’écran 2023-09-28 à 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it represents a bottleneck in the indexing, reducing the number of writes in LMDB will reduce the pressure on the main thread and should reduce the global time spent on the indexing.

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing only the filterable/sortable fields of documents, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent on writing in LMDB which should reduce the global indexing time for the highly multi-threaded machines by reducing the writing bottleneck.

c) Aims to reduce the average time spent to delete documents without having to keep the soft-deleted documents implementation

- [x] Create a preprocessing function that creates the diff-based documents chuck (`OBKV<fid, OBKV<AddDel, value>>`)
  - [x] and clearly separate the faceted fields and the searchable fields in two different chunks
- Change the parameters of the input extractor by taking an `OBKV<fid, OBKV<AddDel, value>>` instead of  `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chucks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chucks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` entirely and `replaced_documents_ids`
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for filter by delete
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification on any size of machine from the mono-threaded one to the highly multi-threaded one.

- [ ] Make the preprocessing function only send the fields that changed to the extractors
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️

## Technical Concerns
- The part 1 implementation could increase the indexing time for the smallest machines (with few threads) by increasing the extracting time (multi-threaded) more than the writing time (mono-threaded)
- The part 2 implementation needs to change the databases which could have a significant impact on the search performances
- The prefix databases are a bit special to process and may be a pain to adapt to the difference-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-11-21 09:44:38 +00:00
Tamo
5b57fbab08 makes the dump cancellable 2023-11-14 11:23:13 +01:00
Louis Dureuil
a2d6dc8571
Fix typo, remove caching for the change of index 2023-11-13 10:44:36 +01:00
Louis Dureuil
492fc086f0
cargo fmt 2023-11-12 21:53:11 +01:00
Louis Dureuil
a2d0c73b41
Save the currently updating index so that the search can access it at all times 2023-11-10 10:52:03 +01:00
Louis Dureuil
f8289cd974
Use it from delete-by-filter 2023-11-09 14:23:15 +01:00
Louis Dureuil
ef6fa10f7a
Remove IndexOperation::DocumentDeletion 2023-11-06 12:16:15 +01:00
Louis Dureuil
cbaa54cafd
Fix clippy issues 2023-11-06 11:19:31 +01:00
Clément Renault
e507ef5932
Slow the logging down 2023-11-01 13:49:32 +01:00
Clément Renault
13416ccbf7
Introduce a new meilitool to help the cloud team 2023-10-30 14:30:20 +01:00
Clément Renault
dfab6293c9
Use an LMDB database to store the external documents ids 2023-10-30 11:41:23 +01:00
Louis Dureuil
652ac3052d
use new iterator in batch 2023-10-30 11:41:22 +01:00
Louis Dureuil
c534a1b687
Stop using delete documents pipeline in batch runner 2023-10-30 11:41:22 +01:00
Louis Dureuil
cf8dad1ca0
index_scheduler.features() is no longer fallible 2023-10-23 10:38:56 +02:00
bwbonanno
dd619913da Use RwLock to never persist cli state to db 2023-10-19 12:45:57 -07:00
bwbonanno
d8c649b3cd Return recoverable error if we fail to retrieve metrics state 2023-10-18 08:28:24 -07:00
bwbonanno
12fc878640 Merge remote-tracking branch 'origin/main' into enable-metrics-http 2023-10-16 13:48:01 -07:00
bwbonanno
689ec7c7ad Make the experimental route /metrics activable via HTTP 2023-10-13 22:12:54 +00:00
Clément Renault
3655d4bdca
Move the puffin file export logic into the run function 2023-10-13 13:11:30 +02:00
Clément Renault
055ca3935b
Update index-scheduler/src/batch.rs
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-10-13 13:11:30 +02:00
Kerollmops
bf8fac6676
Fix the tests 2023-10-13 13:11:30 +02:00
Kerollmops
f2a9e1ebbb
Improve the debugging experience in the puffin reports 2023-10-13 13:11:30 +02:00
Kerollmops
513e61e9a3
Remove the experimental CLI flag 2023-10-13 13:11:29 +02:00
Kerollmops
90a626bf80
Use the runtime feature to enable puffin report exporting 2023-10-13 13:11:29 +02:00
Kerollmops
0d4acf2daa
Fix the metrics product URL 2023-10-13 13:11:29 +02:00
Kerollmops
58db8d85ec
Add the exportPuffinReports option to the runtime features route 2023-10-13 13:11:29 +02:00
Clément Renault
656dadabea
Expose an experimental flag to write the puffin reports to disk 2023-10-13 13:11:09 +02:00
Tamo
34fac115d5 fix clippy 2023-09-11 17:15:57 +02:00
Tamo
9258e5b5bf Fix the stats of the documents deletion by filter
The issue was that the operation « DocumentDeletionByFilter » was not
declared as an index operation. That means the indexes stats were not
reprocessed after the application of the operation.
2023-09-11 14:04:10 +02:00
meili-bors[bot]
e4e49e63d0
Merge #3993
3993: Bringing back changes from v1.3.1 to `main` r=irevoire a=curquiza



Co-authored-by: irevoire <irevoire@users.noreply.github.com>
Co-authored-by: meili-bors[bot] <89034592+meili-bors[bot]@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-08-10 14:30:02 +00:00
Tamo
fe819a9d80 fix the get stats method
It was not taking into account the processing tasks at all
2023-08-08 13:21:15 +02:00
ManyTheFish
b45c36cd71 Merge branch 'main' into tmp-release-v1.3.0 2023-08-01 15:05:17 +02:00
Kerollmops
eef95de30e
First iteration on exposing puffin profiling 2023-07-18 17:38:13 +02:00
Clément Renault
22762808ab
Fix the tests 2023-07-06 12:13:29 +02:00
Clément Renault
86b834c9e4
Display the total number of tasks in the tasks route 2023-07-06 10:05:18 +02:00
meili-bors[bot]
aae099e330
Merge #3851
3851: Expose lastUpdate and isIndexing in /stats endpoint r=dureuill a=gentcys

# Pull Request

## Related issue
Fixes #3843

## What does this PR do?
- expose lastUpdate in `/stats` endpoint
- expose isIndex in `stats` endpoint
- add a method `is_task_processing` in index-scheduler/src/lib.rs.

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Cong Chen <cong.chen@ocrlabs.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2023-07-03 13:41:04 +00:00