Commit Graph

372 Commits

Author SHA1 Message Date
Clément Renault
ea016d97af
Implementing an IS EMPTY filter 2023-03-15 14:12:34 +01:00
ManyTheFish
2f8eb4f54a last PR fixes 2023-03-09 15:34:36 +01:00
Clément Renault
df48ac8803
Add one more test for the NULL operator 2023-03-09 13:53:37 +01:00
Clément Renault
0ad53784e7
Create a new struct to reduce the type complexity 2023-03-09 13:21:21 +01:00
Clément Renault
e106b16148
Fix a typo in a variable
Co-authored-by: Louis Dureuil <louis@meilisearch.com>

aaa
2023-03-09 13:08:02 +01:00
ManyTheFish
5deea631ea fix clippy too many arguments 2023-03-09 11:19:13 +01:00
ManyTheFish
b4b859ec8c Fix typos 2023-03-09 10:58:35 +01:00
Clément Renault
7dc04747fd
Make clippy happy 2023-03-08 17:37:08 +01:00
Clément Renault
43ff236df8
Write the NULL facet values in the database 2023-03-08 16:49:53 +01:00
Clément Renault
19ab4d1a15
Classify the NULL fields values in the facet extractor 2023-03-08 16:49:31 +01:00
ManyTheFish
24c0775c67 Change indexing threshold 2023-03-08 12:36:04 +01:00
ManyTheFish
3092cf0448 Fix clippy errors 2023-03-08 10:53:42 +01:00
ManyTheFish
da48506f15 Rerun extraction when language detection might have failed 2023-03-07 18:35:26 +01:00
ManyTheFish
bbecab8948 fix clippy 2023-02-21 10:18:44 +01:00
ManyTheFish
8aa808d51b Merge branch 'main' into enhance-language-detection 2023-02-20 18:14:34 +01:00
bors[bot]
b08a49a16e
Merge #3319 #3470
3319: Transparently resize indexes on MaxDatabaseSizeReached errors r=Kerollmops a=dureuill

# Pull Request

## Related issue
Related to https://github.com/meilisearch/meilisearch/discussions/3280, depends on https://github.com/meilisearch/milli/pull/760

## What does this PR do?

### User standpoint

- Meilisearch no longer fails tasks that encounter the `milli::UserError(MaxDatabaseSizeReached)` error.
- Instead, these tasks are retried after increasing the maximum size allocated to the index where the failure occurred.

### Implementation standpoint

- Add `Batch::index_uid` to get the `index_uid` of a batch of tasks, if there is one.
- `IndexMapper::create_or_open_index` now takes an additional `size` argument that allows (re)opening indexes with a size different from the base `IndexScheduler::index_size` field.
- `IndexScheduler::tick` now returns a `Result<TickOutcome>` instead of a `Result<usize>`. This offers more explicit control over what the next tick should do.
- Add `IndexStatus::BeingResized`, which contains a handle that a thread can use to wait for the resize operation to complete and for the index to be available again.
- Add `IndexMapper::resize_index` to increase the size of an index.
- In `IndexScheduler::tick`, intercept task batches that failed due to `MaxDatabaseSizeReached` and resize the index that caused the error, then request a new tick that will eventually handle the still enqueued task.
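
The resize-and-retry flow described in the list above can be sketched as follows. This is a minimal, in-memory illustration and not the actual scheduler: `Scheduler`, `BatchError`, the `TickAgain`/`WaitForSignal` variants, and the byte-counting `process` method are all hypothetical names invented for this example. Only the idea (intercept the size error, grow the index, leave the task enqueued, request another tick) comes from the description.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the errors a batch can fail with.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum BatchError {
    MaxDatabaseSizeReached,
    Other(String),
}

/// Hypothetical outcome of one tick; the variant names are assumptions.
#[derive(Debug, PartialEq)]
enum TickOutcome {
    /// Progress was made or a resize happened: tick again right away.
    TickAgain,
    /// Nothing is enqueued: wait until a new task is registered.
    WaitForSignal,
}

struct Scheduler {
    /// Sizes (in bytes) of the enqueued tasks, oldest first.
    queue: VecDeque<u64>,
    /// Maximum size currently allocated to the index (assumed non-zero).
    index_size: u64,
    /// Bytes already written to the index.
    used: u64,
}

impl Scheduler {
    /// Pretend to apply a task; fails when the index would overflow.
    fn process(&mut self, task_size: u64) -> Result<(), BatchError> {
        if self.used + task_size > self.index_size {
            return Err(BatchError::MaxDatabaseSizeReached);
        }
        self.used += task_size;
        Ok(())
    }

    fn tick(&mut self) -> Result<TickOutcome, BatchError> {
        let task_size = match self.queue.front().copied() {
            Some(size) => size,
            None => return Ok(TickOutcome::WaitForSignal),
        };
        match self.process(task_size) {
            Ok(()) => {
                self.queue.pop_front();
                Ok(TickOutcome::TickAgain)
            }
            Err(BatchError::MaxDatabaseSizeReached) => {
                // The task is not failed: it stays enqueued, the index
                // grows, and the next tick retries the same task.
                self.index_size *= 2;
                Ok(TickOutcome::TickAgain)
            }
            // Any other error is still reported to the caller.
            Err(other) => Err(other),
        }
    }
}
```

Calling `tick` in a loop until it returns `WaitForSignal` drains the queue, doubling `index_size` as many times as needed along the way.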

## Testing the PR

The following diff can be applied to this branch to make testing the PR easier:

<details>


```diff
diff --git a/index-scheduler/src/index_mapper.rs b/index-scheduler/src/index_mapper.rs
index 553ab45a..022b2f00 100644
--- a/index-scheduler/src/index_mapper.rs
+++ b/index-scheduler/src/index_mapper.rs
@@ -228,13 +228,15 @@ impl IndexMapper {
 
         drop(lock);
 
+        std::thread::sleep_ms(2000);
+
         let current_size = index.map_size()?;
         let closing_event = index.prepare_for_closing();
-        log::info!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
+        log::error!("Resizing index {} from {} to {} bytes", name, current_size, current_size * 2);
 
         closing_event.wait();
 
-        log::info!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
+        log::error!("Resized index {} from {} to {} bytes", name, current_size, current_size * 2);
 
         let index_path = self.base_path.join(uuid.to_string());
         let index = self.create_or_open_index(&index_path, None, 2 * current_size)?;
@@ -268,8 +270,10 @@ impl IndexMapper {
             match index {
                 Some(Available(index)) => break index,
                 Some(BeingResized(ref resize_operation)) => {
+                    log::error!("waiting for resize end");
                     // Deadlock: no lock taken while doing this operation.
                     resize_operation.wait();
+                    log::error!("trying our luck again!");
                     continue;
                 }
                 Some(BeingDeleted) => return Err(Error::IndexNotFound(name.to_string())),
diff --git a/index-scheduler/src/lib.rs b/index-scheduler/src/lib.rs
index 11b17d05..242dc095 100644
--- a/index-scheduler/src/lib.rs
+++ b/index-scheduler/src/lib.rs
@@ -908,6 +908,7 @@ impl IndexScheduler {
     ///
     /// Returns the number of processed tasks.
     fn tick(&self) -> Result<TickOutcome> {
+        log::error!("ticking!");
         #[cfg(test)]
         {
             *self.run_loop_iteration.write().unwrap() += 1;
diff --git a/meilisearch/src/main.rs b/meilisearch/src/main.rs
index 050c825a..63f312f6 100644
--- a/meilisearch/src/main.rs
+++ b/meilisearch/src/main.rs
@@ -25,7 +25,7 @@ fn setup(opt: &Opt) -> anyhow::Result<()> {
 
 #[actix_web::main]
 async fn main() -> anyhow::Result<()> {
-    let (opt, config_read_from) = Opt::try_build()?;
+    let (mut opt, config_read_from) = Opt::try_build()?;
 
     setup(&opt)?;
 
@@ -56,6 +56,8 @@ We generated a secure master key for you (you can safely copy this token):
         _ => (),
     }
 
+    opt.max_index_size = byte_unit::Byte::from_str("1MB").unwrap();
+
     let (index_scheduler, auth_controller) = setup_meilisearch(&opt)?;
 
     #[cfg(all(not(debug_assertions), feature = "analytics"))]
```
</details>

Mainly, these debug changes do the following:

- Set the default index size to 1MiB so that index resizes are initially frequent
- Turn some logs from info to error so that they can be displayed with `--log-level ERROR` (hiding the other infos)
- Add a long sleep between the beginning and the end of the resize so that we can observe the `BeingResized` index status (otherwise it would never come up in my tests)

## Open questions

- Is the growth factor of x2 the correct solution? For a `Vec` in memory it makes sense, but here we're manipulating quantities that are potentially on the order of 500GiB. For bigger indexes it may make more sense to add at most e.g. 100GiB on each resize operation, avoiding big steps like 500GiB -> 1TiB.
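
One possible shape for the capped growth raised in this question, as a sketch only: the function name `next_index_size` and the `max_step` parameter are hypothetical, and nothing here reflects a decision taken in the PR. The index doubles while it is small and grows by a fixed step once doubling would add more than `max_step` bytes.

```rust
/// One gibibyte, in bytes.
const GIB: u64 = 1024 * 1024 * 1024;

/// Hypothetical growth policy: double the size, but never add more than
/// `max_step` bytes in a single resize.
fn next_index_size(current: u64, max_step: u64) -> u64 {
    // Doubling adds `current` bytes; cap that increment at `max_step`.
    let step = current.min(max_step);
    // Saturate instead of overflowing for absurdly large inputs.
    current.saturating_add(step)
}
```

With `max_step = 100 * GIB`, a 1GiB index still doubles to 2GiB, while a 500GiB index grows to 600GiB instead of jumping to 1TiB.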

## PR checklist
Please check if your PR fulfills the following requirements:
- [ ] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [ ] Have you read the contributing guidelines?
- [ ] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


3470: Autobatch addition and deletion r=irevoire a=irevoire

This PR adds the capability to meilisearch to batch document addition and deletion together.

Fix https://github.com/meilisearch/meilisearch/issues/3440

--------------

Things to check before merging:

- [x] What happens if we delete the same documents multiple times? -> add a test
- [x] What if a documentDeletion gets batched with a documentAddition but the index doesn't exist yet? It should not work

Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2023-02-20 15:00:19 +00:00
Tamo
18796d6e6a Consider null as a valid geo object 2023-02-20 13:45:51 +01:00
Tamo
895ab2906c apply review suggestions 2023-02-16 18:42:47 +01:00
Tamo
74dcfe9676
Fix a bug when you update a document that was already present in the db, deleted and then inserted again in the same transform 2023-02-14 19:09:40 +01:00
Tamo
1b1703a609
make a small optimization to merge obkvs a little bit faster 2023-02-14 18:32:41 +01:00
Tamo
fb5e4957a6
fix and test the early exit in case a grenad ends with a deletion 2023-02-14 18:23:57 +01:00
Tamo
8de3c9f737
Update milli/src/update/index_documents/transform.rs
Co-authored-by: Clément Renault <clement@meilisearch.com>
2023-02-14 17:57:14 +01:00
Tamo
43a19d0709
document the operation enum + the grenads 2023-02-14 17:55:26 +01:00
Tamo
746b31c1ce
makes clippy happy 2023-02-09 12:23:01 +01:00
Tamo
93db755d57
add a test to ensure we handle correctly a deletion of multiple time the same document 2023-02-08 21:03:34 +01:00
Tamo
93f130a400
fix all warnings 2023-02-08 20:57:35 +01:00
Tamo
421a9cf05e
provide a new method on the transform to remove documents 2023-02-08 16:06:09 +01:00
Tamo
8f64fba1ce
rewrite the current transform to handle a new byte specifying the kind of operation it's merging 2023-02-08 12:53:38 +01:00
Kerollmops
fbec48f56e
Merge remote-tracking branch 'milli/main' into bring-v1-changes 2023-02-06 16:48:10 +01:00
f3r10
7681be5367 Format code 2023-01-31 11:28:05 +01:00
f3r10
50bc156257 Fix tests 2023-01-31 11:28:05 +01:00
f3r10
d8207356f4 Skip script,language insertion if language is undetected 2023-01-31 11:28:05 +01:00
f3r10
fd60a39f1c Format code 2023-01-31 11:28:05 +01:00
f3r10
a27f329e3a Add tests for checking that detected script and language associated with document(s) were stored during indexing 2023-01-31 11:28:05 +01:00
f3r10
d97fb6117e Extract and index data 2023-01-31 11:28:05 +01:00
Louis Dureuil
cbf029f64c
clippy: --fix 2023-01-31 11:12:59 +01:00
Louis Dureuil
89675e5f15
clippy: Replace seek 0 by rewind 2023-01-31 09:32:40 +01:00
Tamo
de3c4f1986 throw an error on unknown fields specified in the _geo field 2023-01-24 12:23:24 +01:00
Philipp Ahlner
f5ca421227
Superfluous test removed 2023-01-19 15:39:21 +01:00
Philipp Ahlner
a2cd7214f0
Fixes error message when lat/lng are unparseable 2023-01-19 10:10:26 +01:00
ManyTheFish
d1fc42b53a Use compatibility decomposition normalizer in facets 2023-01-18 15:02:13 +01:00
Clément Renault
1b78231e18
Make clippy happy 2023-01-17 18:25:54 +01:00
bors[bot]
6a10e85707
Merge #736
736: Update charabia r=curquiza a=ManyTheFish

Update Charabia to the last version.

> We are now Romanizing Chinese characters into Pinyin.
> Note that we keep the accents because they are in fact never typed directly by the end user. Moreover, changing an accent leads to a different Chinese character, and I don't have sufficient knowledge to forecast the impact of removing accents in this context.

Co-authored-by: ManyTheFish <many@meilisearch.com>
2023-01-03 15:44:41 +00:00
Louis Dureuil
4b166bea2b
Add primary_key_inference test 2022-12-21 15:13:38 +01:00
Louis Dureuil
5943100754
Fix existing tests 2022-12-21 15:13:38 +01:00
Louis Dureuil
b24def3281
Add logging when inference took place.
Displays log message in the form:
```
[2022-12-21T09:19:42Z INFO  milli::update::index_documents::enrich] Primary key was not specified in index. Inferred to 'id'
```
2022-12-21 15:13:38 +01:00
Louis Dureuil
402dcd6b2f
Simplify primary key inference 2022-12-21 15:13:38 +01:00
Louis Dureuil
13c95d25aa
Remove uses of UserError::MissingPrimaryKey not related to inference 2022-12-21 15:13:36 +01:00
Louis Dureuil
fc7618d49b
Add DeletionStrategy 2022-12-19 09:49:58 +01:00
ManyTheFish
7f88c4ff2f Fix #1714 test 2022-12-15 18:22:28 +01:00
Loïc Lecrenier
be3b00350c Apply review suggestions: naming and documentation 2022-12-13 10:15:22 +01:00
Loïc Lecrenier
e3ee553dcc Remove soft deleted ids from ExternalDocumentIds during document import
If the document import replaces a document using hard deletion
2022-12-12 14:16:09 +01:00
Loïc Lecrenier
303d740245 Prepare fix within facet range search
By creating snapshots and updating the format of the existing
snapshots. The next commit will apply the fix, which will show
its effects cleanly on the old and new snapshot tests
2022-12-07 14:38:10 +01:00
Loïc Lecrenier
67d8cec209 Fix bug in handling of soft deleted documents when updating settings 2022-12-06 15:09:19 +01:00
Loïc Lecrenier
f2cf981641 Add more tests and allow disabling of soft-deletion outside of tests
Also allow disabling soft-deletion in the IndexDocumentsConfig
2022-12-05 10:51:01 +01:00
Loïc Lecrenier
8d0ace2d64 Avoid creating a MatchingWord for words that exceed the length limit 2022-11-28 10:20:13 +01:00
Loïc Lecrenier
ac3baafbe8 Truncate facet values that are too long before indexing them 2022-11-17 11:29:42 +01:00
Loïc Lecrenier
990a861241 Add test for indexing a document with a long facet value 2022-11-17 11:29:42 +01:00
Loïc Lecrenier
d95d02cb8a Fix Facet Indexing bugs
1. Handle keys with variable length correctly

This fixes https://github.com/meilisearch/meilisearch/issues/3042 and
is easily reproducible with the updated fuzz tests, which now generate
keys with variable lengths.

2. Prevent adding facets to the database if their encoded value does
not satisfy `valid_lmdb_key`.

This fixes an indexing failure when a document had a filterable
attribute containing a value whose length is higher than ~500 bytes.
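
A rough sketch of the kind of check described in point 2, under stated assumptions: the 500-byte limit is taken from the "~500 bytes" figure above and is not the exact constant used by the code, and `is_valid_key` and `truncate_to_char_boundary` are hypothetical helpers, not the real `valid_lmdb_key`. The second helper shows one way a too-long string value could be shortened without splitting a UTF-8 character.

```rust
/// Assumed key-length limit, based on the "~500 bytes" figure above.
const MAX_KEY_LEN: usize = 500;

/// Hypothetical validity check: a key must be non-empty and short enough.
fn is_valid_key(key: &[u8]) -> bool {
    !key.is_empty() && key.len() <= MAX_KEY_LEN
}

/// Shorten `value` to at most `max_len` bytes without cutting a UTF-8
/// character in half.
fn truncate_to_char_boundary(value: &str, max_len: usize) -> &str {
    if value.len() <= max_len {
        return value;
    }
    let mut end = max_len;
    // Index 0 is always a char boundary, so this loop terminates.
    while !value.is_char_boundary(end) {
        end -= 1;
    }
    &value[..end]
}
```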
2022-11-17 11:29:42 +01:00
Kerollmops
37b3c5c323
Fix transform to use all_documents and ignore soft_deleted documents 2022-11-08 14:23:16 +01:00
unvalley
70465aa5ce Execute cargo fmt 2022-11-04 08:59:58 +09:00
unvalley
3009981d31 Fix clippy errors
Add clippy job

Add clippy job to CI
2022-11-04 08:58:14 +09:00
unvalley
c7322f704c Fix cargo clippy errors
Dont apply clippy for tests for now

Fix clippy warnings of filter-parser package

parent 8352febd646ec4bcf56a44161e5c4dce0e55111f
author unvalley <38400669+unvalley@users.noreply.github.com> 1666325847 +0900
committer unvalley <kirohi.code@gmail.com> 1666791316 +0900

Update .github/workflows/rust.yml

Co-authored-by: Clémentine Urquizar - curqui <clementine@meilisearch.com>

Allow clippy lint too_many_arguments

Allow clippy lint needless_collect

Allow clippy lint too_many_arguments and type_complexity

Fix for clippy warnings comparison_chains

Fix for clippy warnings vec_init_then_push

Allow clippy lint should_implement_trait

Allow clippy lint drop_non_drop

Fix lifetime clippy warnings in filter-parser

Execute cargo fmt

Fix clippy remaining warnings

Fix clippy remaining warnings again and allow lint on each place
2022-10-27 01:04:23 +09:00
unvalley
811f156031 Execute cargo clippy --fix 2022-10-27 01:00:00 +09:00
Loïc Lecrenier
54c0cf93fe Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 15:13:34 +02:00
bors[bot]
365f44c39b
Merge #668
668: Fix many Clippy errors part 2 r=ManyTheFish a=ehiggs

This brings us a step closer to enforcing clippy on each build.

# Pull Request

## Related issue
This does not fix any issue outright, but it is a second round of fixes for clippy after https://github.com/meilisearch/milli/pull/665. This should contribute to fixing https://github.com/meilisearch/milli/pull/659.

## What does this PR do?

Satisfies many issues for clippy. The complaints are mostly:

* Passing reference where a variable is already a reference.
* Using clone where a struct already implements `Copy`
* Using `ok_or_else` with a closure that merely returns a value rather than calling a function (hence we use `ok_or`)
* Unambiguous lifetimes don't need names, so we can just use `'_`
* Using `return` when it is not needed as we are on the last expression of a function.
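
For illustration, here is what the fixed form of each complaint in the list above can look like. The types `DocId` and `Ids` and the functions are made up for this example and do not come from the PR; the lint names in the comments are the Clippy lints that, to the best of my knowledge, correspond to each item.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct DocId(u32);

fn count(ids: &[DocId]) -> usize {
    ids.len()
}

fn first_id(ids: &[DocId]) -> Result<DocId, &'static str> {
    // `DocId` is `Copy`, so `.copied()` is enough; no `.clone()` needed
    // (clippy::clone_on_copy).
    // The error is a plain value, so `ok_or` replaces
    // `ok_or_else(|| "no document id")` (clippy::unnecessary_lazy_evaluations).
    // This is the last expression of the function, so no `return` keyword
    // (clippy::needless_return).
    ids.first().copied().ok_or("no document id")
}

struct Ids<'a>(&'a [DocId]);

// The lifetime is unambiguous, so `'_` replaces `impl<'a> Ids<'a>`
// (clippy::needless_lifetimes).
impl Ids<'_> {
    fn len(&self) -> usize {
        // `self.0` is already a reference; passing `&self.0` would add a
        // needless extra borrow (clippy::needless_borrow).
        count(self.0)
    }
}
```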

## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!


Co-authored-by: Ewan Higgs <ewan.higgs@gmail.com>
2022-10-26 12:16:24 +00:00
Loïc Lecrenier
2741756248 Merge remote-tracking branch 'origin/main' into facet-levels-refactor 2022-10-26 14:03:23 +02:00
Loïc Lecrenier
b7f2428961 Fix formatting and warning after rebasing from main 2022-10-26 13:49:33 +02:00
Loïc Lecrenier
a034a1e628 Move StrRefCodec and ByteSliceRefCodec to their own files 2022-10-26 13:47:46 +02:00
Loïc Lecrenier
51961e1064 Polish some details 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
b1ab09196c Remove outdated TODOs 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
27454e9828 Document and refine facet indexing algorithms 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
bee3c23b45 Add comparison benchmark between bulk and incremental facet indexing 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
9026867d17 Give same interface to bulk and incremental facet indexing types
+ cargo fmt, oops, sorry for the bad history :(
2022-10-26 13:47:04 +02:00
Loïc Lecrenier
485a72306d Refactor facet-related codecs 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
9b55e582cd Add FacetsUpdate type that wraps incremental and bulk indexing methods 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
afdf87f6f7 Fix bugs in asc/desc criterion and facet indexing 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
a7201ece04 cargo fmt 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
61252248fb Fix some facet indexing bugs 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
85824ee203 Try to make facet indexing incremental 2022-10-26 13:47:04 +02:00
Loïc Lecrenier
e8a156d682 Reorganise facets database indexing code 2022-10-26 13:46:46 +02:00
Loïc Lecrenier
bd2c0e1ab6 Remove unused code 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
39a4a0a362 Reintroduce filter range search and facet extractors 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
7913d6365c Update Facets indexing to be compatible with new database structure 2022-10-26 13:46:14 +02:00
Loïc Lecrenier
c3f49f766d Prepare refactor of facets database
2022-10-26 13:46:14 +02:00
bors[bot]
c8f16530d5
Merge #616
616: Introduce an indexation abortion function when indexing documents r=Kerollmops a=Kerollmops



Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-10-26 11:41:18 +00:00
Ewan Higgs
6b2fe94192 Fixes for clippy bringing us down to 18 remaining issues.
This brings us a step closer to enforcing clippy on each build.
2022-10-25 20:49:02 +02:00
Loïc Lecrenier
d76d0cb1bf Merge branch 'main' into word-pair-proximity-docids-refactor 2022-10-24 15:23:00 +02:00
Loïc Lecrenier
a983129613 Apply suggestions from code review 2022-10-20 09:49:37 +02:00
Loïc Lecrenier
a7de4f5b85 Don't add swapped word pairs to the word_pair_proximity_docids db 2022-10-18 10:37:34 +02:00
Loïc Lecrenier
264a04922d Add prefix_word_pair_proximity database
Similar to the word_prefix_pair_proximity one but instead the keys are:
(proximity, prefix, word2)
2022-10-18 10:37:34 +02:00
Loïc Lecrenier
bdeb47305e Change encoding of word_pair_proximity DB to (proximity, word1, word2)
Same for word_prefix_pair_proximity
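
As a sketch of what a `(proximity, word1, word2)` key could look like in bytes: the layout below (one proximity byte, then `word1`, a `0x00` separator, then `word2`) is an assumption made for illustration, and `encode_key` and `decode_key` are hypothetical helpers, not the codec from this commit. With the proximity as the first byte, keys that share a proximity sort next to each other under byte-wise ordering.

```rust
/// Hypothetical key layout: [proximity][word1 bytes][0x00][word2 bytes].
/// Assumes `word1` contains no NUL byte.
fn encode_key(proximity: u8, word1: &str, word2: &str) -> Vec<u8> {
    let mut key = Vec::with_capacity(1 + word1.len() + 1 + word2.len());
    key.push(proximity);
    key.extend_from_slice(word1.as_bytes());
    key.push(0);
    key.extend_from_slice(word2.as_bytes());
    key
}

/// Inverse of `encode_key`; returns `None` on malformed input.
fn decode_key(key: &[u8]) -> Option<(u8, &str, &str)> {
    let (&proximity, rest) = key.split_first()?;
    // The first NUL byte separates the two words.
    let sep = rest.iter().position(|&b| b == 0)?;
    let word1 = std::str::from_utf8(&rest[..sep]).ok()?;
    let word2 = std::str::from_utf8(&rest[sep + 1..]).ok()?;
    Some((proximity, word1, word2))
}
```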
2022-10-18 10:37:34 +02:00
Kerollmops
6603437cb1
Introduce an indexation abortion function when indexing documents 2022-10-17 17:28:03 +02:00
Ewan Higgs
beb987d3d1 Fixing piles of clippy errors.
Most of these are calling clone when the struct supports Copy.

Many are using & and &mut on `self` when the function they are called
from already has an immutable or mutable borrow so this isn't needed.

I tried to stay away from actual changes or places where I'd have to
name fresh variables.
2022-10-13 22:02:54 +02:00
msvaljek
762e320c35
Add proximity calculation for the same word 2022-10-07 12:59:12 +02:00
vishalsodani
00c02d00f3 Add missing logging timer to extractors 2022-09-30 22:17:06 +05:30
bors[bot]
15d478cf4d
Merge #635
635: Use an unstable algorithm for `grenad::Sorter` when possible r=Kerollmops a=loiclec

# Pull Request
## What does this PR do?

Use an unstable algorithm to sort the internal vector used by `grenad::Sorter` whenever possible to speed up indexing.

In practice, every time the merge function creates a `RoaringBitmap`, we use an unstable sort. For every other merge function, such as `keep_first`, `keep_last`, etc., a stable sort is used.
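
The distinction can be illustrated with a small standalone sketch. It does not use `grenad` or `RoaringBitmap`: a `BTreeSet<u32>` stands in for the bitmap, and `merge_union` and `merge_keep_first` are hypothetical functions written for this example. When the merge unions all values of a key, the relative order of equal keys does not affect the result, so an unstable sort is safe; a keep-first merge depends on insertion order and therefore needs a stable sort.

```rust
use std::collections::BTreeSet;

type Entry = (String, u32);

/// Union merge: the result does not depend on the order of equal keys,
/// so the cheaper unstable sort can be used.
fn merge_union(mut entries: Vec<Entry>) -> Vec<(String, BTreeSet<u32>)> {
    entries.sort_unstable_by(|a, b| a.0.cmp(&b.0));
    let mut out: Vec<(String, BTreeSet<u32>)> = Vec::new();
    for (key, id) in entries {
        let same_key = out.last().map_or(false, |(last, _)| *last == key);
        if same_key {
            // `out` is non-empty here because `same_key` is true.
            out.last_mut().unwrap().1.insert(id);
        } else {
            out.push((key, BTreeSet::from([id])));
        }
    }
    out
}

/// Keep-first merge: the winner is the entry inserted first, so equal
/// keys must keep their original order, which requires a stable sort.
fn merge_keep_first(mut entries: Vec<Entry>) -> Vec<Entry> {
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    // `dedup_by` removes the later element of each run of equal keys.
    entries.dedup_by(|later, earlier| later.0 == earlier.0);
    entries
}
```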


Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
2022-09-14 12:00:52 +00:00
Loïc Lecrenier
3794962330 Use an unstable algorithm for grenad::Sorter when possible 2022-09-13 14:49:53 +02:00
Kerollmops
d4d7c9d577
We avoid skipping errors in the indexing pipeline 2022-09-13 14:03:00 +02:00
Kerollmops
fe3973a51c
Make sure that long words are correctly skipped 2022-09-07 15:03:32 +02:00