426: Fix search highlight for non-unicode chars r=ManyTheFish a=Samyak2
# Pull Request
## What does this PR do?
Fixes https://github.com/meilisearch/MeiliSearch/issues/1480
<!-- Please link the issue you're trying to fix with this PR, if none then please create an issue first. -->
## PR checklist
Please check if your PR fulfills the following requirements:
- [x] Does this PR fix an existing issue?
- [x] Have you read the contributing guidelines?
- [x] Have you made sure that the title is accurate and descriptive of the changes?
## Changes
The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme clusters to highlight.
In essence, the `matching_bytes` function now returns the number of matching grapheme clusters instead of bytes.
Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme clusters from tokens
- `<mark>` tag is put around only the matched part
- before this change, the entire word was highlighted even if only a part of it matched
## Questions
Since `matching_bytes` does not return number of bytes but grapheme clusters, should it be renamed to something like `matching_chars` or `matching_graphemes`? Will this break the API?
Thank you very much `@ManyTheFish` for helping 😄
Co-authored-by: Samyak S Sarnayak <samyak201@gmail.com>
- Stop lowercasing the field when looking in the field id map
- When a field id does not exist it means there is currently zero
documents containing this field thus we returns an empty RoaringBitmap
instead of throwing an internal error
The `matching_bytes` function takes a `&Token` now and:
- gets the number of bytes to highlight (unchanged).
- uses `Token.num_graphemes_from_bytes` to get the number of grapheme
clusters to highlight.
In essence, the `matching_bytes` function returns the number of matching
grapheme clusters instead of bytes. Should this function be renamed
then?
Added proper highlighting in the HTTP UI:
- requires dependency on `unicode-segmentation` to extract grapheme
clusters from tokens
- `<mark>` tag is put around only the matched part
- before this change, the entire word was highlighted even if only a
part of it matched
returned metaimprove document addition returned metaimprove document
addition returned metaimprove document addition returned metaimprove
document addition returned metaimprove document addition returned
metaimprove document addition returned meta
407: Update version for the next release (v0.20.0) r=curquiza a=curquiza
Breaking because of #405 and #406
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
402: Optimize document transform r=MarinPostma a=MarinPostma
This pr optimizes the transform of documents additions in the obkv format. Instead on accepting any serializable objects, we instead treat json and CSV specifically:
- For json, we build a serde `Visitor`, that transform the json straight into obkv without intermediate representation.
- For csv, we directly write the lines in the obkv, applying other optimization as well.
Co-authored-by: marin postma <postma.marin@protonmail.com>
390: Add helper methods on the settings r=Kerollmops a=irevoire
This would be a good addition to look at the content of a setting without consuming it.
It’s useful for analytics.
Co-authored-by: Irevoire <tamo@meilisearch.com>
384: Replace memmap with memmap2 r=Kerollmops a=palfrey
[memmap is unmaintained](https://rustsec.org/advisories/RUSTSEC-2020-0077.html) and needs replacing. memmap2 is a drop-in replacement fork that's well maintained. Note that the version numbers got reset on fork, hence the lower values.
Co-authored-by: Tom Parker-Shemilt <palfrey@tevp.net>
388: fix primary key inference r=MarinPostma a=MarinPostma
The primary key is was infered from a hashtable index of the field. For this reason the order in which the fields were interated upon was not deterministic, and the primary key was chosed ffrom the first field containing "id".
This fix sorts the the index by field_id when infering the primary key.
Co-authored-by: mpostma <postma.marin@protonmail.com>
Instead of using an arbitrary limit we encode the absolute position in a u32
using one strong u16 for the field id and a weak u16 for the relative position in the attribute.
386: fix obkv document r=curquiza a=MarinPostma
When serializing a document, the serializer resolved the field_id of the current field and immediately added it to the obkv document under construction. The issue with that is that obkv expects the fields to be inserted in order, and when a document with out of order fields was added, obkv failed to insert the field.
The current fix first resolves each field_id, and adds all the fields to a temporary `BTreeMap`, until `end` is called on the map serializer, where all the fields are added to the obkv at once, and in order.
Co-authored-by: mpostma <postma.marin@protonmail.com>
Latitude are not supposed to go beyound 90 degrees or below -90.
The same goes for longitude with 180 or -180.
This was badly implemented in the filters, and was not implemented for the AscDesc rules.
379: Revert "Change chunk size to 4MiB to fit more the end user usage" r=curquiza a=ManyTheFish
Reverts meilisearch/milli#370
Co-authored-by: Many <legendre.maxime.isn@gmail.com>
376: Stop casting integer docids to string r=Kerollmops a=irevoire
When a docid is an integer, we stop casting it to a string, and thus we don't add `"` around it.
Co-authored-by: Tamo <tamo@meilisearch.com>
373: Improve error message for bad sort syntax with geosearch r=Kerollmops a=irevoire
`@Kerollmops` This should be the last PR for the geosearch and error handling, sorry for doing it in so many steps 😬
Co-authored-by: Tamo <tamo@meilisearch.com>
372: Fix Meilisearch 1714 r=Kerollmops a=ManyTheFish
The bug comes from the typo tolerance, to know how many typos are accepted we were counting bytes instead of characters in a word.
On Chinese Script characters, we were allowing 2 typos on 3 characters words.
We are now counting the number of char instead of counting bytes to assign the typo tolerance.
Related to [Meilisearch#1714](https://github.com/meilisearch/MeiliSearch/issues/1714)
Co-authored-by: many <maxime@meilisearch.com>
360: Update version for the next release (v0.14.0) r=Kerollmops a=curquiza
Release containing the geosearch, cf #322
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
322: Geosearch r=ManyTheFish a=irevoire
This PR introduces [basic geo-search functionalities](https://github.com/meilisearch/specifications/pull/59), it makes the engine able to index, filter and, sort by geo-point. We decided to use [the rstar library](https://docs.rs/rstar) and to save the points in [an RTree](https://docs.rs/rstar/0.9.1/rstar/struct.RTree.html) that we de/serialize in the index database [by using serde](https://serde.rs/) with [bincode](https://docs.rs/bincode). This is not an efficient way to query this tree as it will consume a lot of CPU and memory when a search is made, but at least it is an easy first way to do so.
### What we will have to do on the indexing part:
- [x] Index the `_geo` fields from the documents.
- [x] Create a new module with an extractor in the `extract` module that takes the `obkv_documents` and retrieves the latitude and longitude coordinates, outputting them in a `grenad::Reader` for further process.
- [x] Call the extractor in the `extract::extract_documents_data` function and send the result to the `TypedChunk` module.
- [x] Get the `grenad::Reader` in the `typed_chunk::write_typed_chunk_into_index` function and store all the points in the `rtree`
- [x] Delete the documents from the `RTree` when deleting documents from the database. All this can be done in the `delete_documents.rs` file by getting the data structure and removing the points from it, inserting it back after the modification.
- [x] Clearing the `RTree` entirely when we clear the documents from the database, everything happens in the `clear_documents.rs` file.
- [x] save a Roaring bitmap of all documents containing the `_geo` field
### What we will have to do on the query part:
- [x] Filter the documents at a certain distance around a point, this is done by [collecting the documents from the searched point](https://docs.rs/rstar/0.9.1/rstar/struct.RTree.html#method.nearest_neighbor_iter) while they are in range.
- [x] We must introduce new `geoLowerThan` and `geoGreaterThan` variants to the `Operator` filter enum.
- [x] Implement the `negative` method on both variants where the `geoGreaterThan` variant is implemented by executing the `geoLowerThan` and removing the results found from the whole list of geo faceted documents.
- [x] Add the `_geoRadius` function in the pest parser.
- [x] Introduce a `_geo` ascending ranking function that takes a point in parameter, ~~this function must keep the iterator on the `RTree` and make it peekable~~ This was not possible for now, we had to collect the whole iterator. Only the documents that are part of the candidates must be sent too!
- [x] This ascending ranking rule will only be active if the search is set up with the `_geoPoint` parameter that indicates the center point of the ascending ranking rule.
-----------
- On Meilisearch part: We must introduce a new concept, returning the documents with a new `_geoDistance` field when it passed by the `_geo` ranking rule, this has never been done before. We could maybe just do it afterward when the documents have been retrieved from the database, computing the distance from the `_geoPoint` and all of the documents to be returned.
Co-authored-by: Irevoire <tamo@meilisearch.com>
Co-authored-by: cvermand <33010418+bidoubiwa@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
342: Let the caller decide what kind of error they want to returns when parsing `AscDesc` r=Kerollmops a=irevoire
This is one possible fix for #339
We would then need to patch these lines https://github.com/meilisearch/MeiliSearch/blob/main/meilisearch-http/src/index/search.rs#L110-L114 to return the error we want.
Another solution would be to add a parameter to the `from_str` to specify which context we are in.
Co-authored-by: Tamo <tamo@meilisearch.com>
344: Move the sort ranking rule before the exactness ranking rule r=ManyTheFish a=Kerollmops
This PR moves the sort ranking rule at the 5th position by default, right before the exactness one.
Co-authored-by: Kerollmops <clement@meilisearch.com>
341: Throw a query time error when a sort parameter is used but the sort ranking rule is missing r=Kerollmops a=Kerollmops
This PR makes the engine throw an error for when the ranking rules don't contain the `sort` rule, the `sortable_fields` are correctly set but the user tries to use the `sort` query parameter. Doing so will have no effect on the returned documents so we preferred returning an error to help debug this.
That's breaking on the MeiliSearch side as we added a new variant to the `UserError` enum.
Co-authored-by: Kerollmops <clement@meilisearch.com>
308: Implement a better parallel indexer r=Kerollmops a=ManyTheFish
Rewrite the indexer:
- enhance memory consumption control
- optimize parallelism using rayon and crossbeam channel
- factorize the different parts and make new DB implementation easier
- optimize and fix prefix databases
Co-authored-by: many <maxime@meilisearch.com>
309: Sort at query time r=Kerollmops a=Kerollmops
This PR:
- Makes the `Asc/Desc` criteria work with strings too, it first returns documents ordered by numbers then by strings, and finally the documents that can't be ordered. Note that it is lexicographically ordered and not ordered by character, which means that it doesn't know about wide and short characters i.e. `a`, `丹`, `▲`.
- Changes the syntax for the `Asc/Desc` criterion by now using a colon to separate the name and the order i.e. `title:asc`, `price:desc`.
- Add the `Sort` criterion at the third position in the ranking rules by default.
- Add the `sort_criteria` method to the `Search` builder struct to let the users define the `Asc/Desc` sortable attributes they want to use at query time. Note that we need to check that the fields are registered in the sortable attributes before performing the search.
- Introduce a new `InvalidSortableAttribute` user error that is raised when the sort criteria declared at query time are not part of the sortable attributes.
- `@ManyTheFish` introduced integration tests for the dynamic Sort criterion.
Fixes#305.
Co-authored-by: Kerollmops <clement@meilisearch.com>
Co-authored-by: many <maxime@meilisearch.com>
307: Update version for the next release (v0.10.0) r=Kerollmops a=curquiza
Replaces https://github.com/meilisearch/milli/pull/304
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
302: Update milli to v0.9.0 r=curquiza a=curquiza
Updating the minor and not patch since #300 seems to be breaking: it involves a re-indexation to get the fix, so it involves an additional step from the users, not only downloading the latest version.
Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
300: Fix prefix level position docids database r=curquiza a=ManyTheFish
The prefix search was inverted when we generated the DB.
Instead of searching if word had a prefix in prefix fst,
we were searching if the word was a prefix of a prefix contained in the prefix fst.
The indexer, now, iterate over prefix contained in the fst
and search them by prefix in the word-level-position-docids database,
aggregating matches in a sorter.
Fix#299
Co-authored-by: many <maxime@meilisearch.com>
The prefix search was inverted when we generated the DB.
Instead of searching if word had a prefix in prefix fst,
we were searching if the word was a prefix of a prefix contained in the prefix fst.
The indexer, now, iterate over prefix contained in the fst
and search them by prefix in the word-level-position-docids database,
aggregating matches in a sorter.
Fix#299
291: Fix a bug about zero bytes in the inputs r=irevoire a=Kerollmops
Ok, good news, after a little session of debugging with `@irevoire` we found out that the bug seems to be related to zeroes in the input update. The engine wasn't designed to accept those. The chosen solution is to update the tokenizer to remove those zeroes. We are waiting on https://github.com/meilisearch/tokenizer/pull/52 to be merged and a new version to be released.
It is not an undefined behavior, I repeat: it is a "normal" bug 🎉👏
----
This PR tries to fix a bug where we use LMDB in the wrong way, leading to panic due to an undefined behavior on the Rust side. I thought [we fixed it in a previous PR](https://github.com/meilisearch/milli/pull/264) but we found out that _a similar_ bug was still present. `@bb` found a way to trigger this bug and helped us find the origin of it.
As I don't have a minimal reproducible example of this bug I bet on the unsafe `put_current` calls when we index new documents as the bug was trigger after a big indexation on a clean database, thus not triggering a deletion update. I only replaced the unsafe `put_current` with two safe calls to `get`/`put`.
I hope it helps and fixes the bug, only `@bb` can help us check that. I am not even sure how I can create a custom Docker image and expose it for testing purposes.
<details>
<summary>The backtrace leading us to a panic in grenad.</summary>
```
meilisearch_1 | thread 'tokio-runtime-worker' panicked at 'assertion failed: key > &last_key', /root/.cargo/git/checkouts/grenad-e2cb77f65d31bb02/3adcb26/src/block_builder.rs:38:17
meilisearch_1 | stack backtrace:
meilisearch_1 | 0: rust_begin_unwind
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:493:5
meilisearch_1 | 1: core::panicking::panic_fmt
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:92:14
meilisearch_1 | 2: core::panicking::panic
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/core/src/panicking.rs:50:5
meilisearch_1 | 3: grenad::block_builder::BlockBuilder::insert
meilisearch_1 | at ./root/.cargo/git/checkouts/grenad-e2cb77f65d31bb02/3adcb26/src/block_builder.rs:38:17
meilisearch_1 | 4: grenad::writer::Writer<W>::insert
meilisearch_1 | at ./root/.cargo/git/checkouts/grenad-e2cb77f65d31bb02/3adcb26/src/writer.rs:92:12
meilisearch_1 | 5: milli::update::words_level_positions::write_level_entry
meilisearch_1 | at ./root/.cargo/git/checkouts/milli-00376cd5db949a15/007fec2/milli/src/update/words_level_positions.rs:262:5
meilisearch_1 | 6: milli::update::words_level_positions::compute_positions_levels
meilisearch_1 | at ./root/.cargo/git/checkouts/milli-00376cd5db949a15/007fec2/milli/src/update/words_level_positions.rs:211:13
meilisearch_1 | 7: milli::update::words_level_positions::WordsLevelPositions::execute
meilisearch_1 | at ./root/.cargo/git/checkouts/milli-00376cd5db949a15/007fec2/milli/src/update/words_level_positions.rs:65:23
meilisearch_1 | 8: milli::update::index_documents::IndexDocuments::execute_raw
meilisearch_1 | at ./root/.cargo/git/checkouts/milli-00376cd5db949a15/007fec2/milli/src/update/index_documents/mod.rs:831:9
meilisearch_1 | 9: milli::update::index_documents::IndexDocuments::execute
meilisearch_1 | at ./root/.cargo/git/checkouts/milli-00376cd5db949a15/007fec2/milli/src/update/index_documents/mod.rs:372:9
meilisearch_1 | 10: meilisearch_http::index::updates::<impl meilisearch_http::index::Index>::update_documents_txn
meilisearch_1 | at ./meilisearch/meilisearch-http/src/index/updates.rs:225:30
meilisearch_1 | 11: meilisearch_http::index::updates::<impl meilisearch_http::index::Index>::update_documents
meilisearch_1 | at ./meilisearch/meilisearch-http/src/index/updates.rs:183:22
meilisearch_1 | 12: meilisearch_http::index::update_handler::UpdateHandler::handle_update
meilisearch_1 | at ./meilisearch/meilisearch-http/src/index/update_handler.rs:75:18
meilisearch_1 | 13: meilisearch_http::index_controller::index_actor::actor::IndexActor<S>::handle_update::{{closure}}::{{closure}}
meilisearch_1 | at ./meilisearch/meilisearch-http/src/index_controller/index_actor/actor.rs:174:35
meilisearch_1 | 14: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/blocking/task.rs:42:21
meilisearch_1 | 15: tokio::runtime::task::core::CoreStage<T>::poll::{{closure}}
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/core.rs:243:17
meilisearch_1 | 16: tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/loom/std/unsafe_cell.rs:14:9
meilisearch_1 | 17: tokio::runtime::task::core::CoreStage<T>::poll
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/core.rs:233:13
meilisearch_1 | 18: tokio::runtime::task::harness::poll_future::{{closure}}
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/harness.rs:427:23
meilisearch_1 | 19: <std::panic::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panic.rs:344:9
meilisearch_1 | 20: std::panicking::try::do_call
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:379:40
meilisearch_1 | 21: std::panicking::try
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panicking.rs:343:19
meilisearch_1 | 22: std::panic::catch_unwind
meilisearch_1 | at ./rustc/53cb7b09b00cbea8754ffb78e7e3cb521cb8af4b/library/std/src/panic.rs:431:14
meilisearch_1 | 23: tokio::runtime::task::harness::poll_future
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/harness.rs:414:19
meilisearch_1 | 24: tokio::runtime::task::harness::Harness<T,S>::poll_inner
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/harness.rs:89:9
meilisearch_1 | 25: tokio::runtime::task::harness::Harness<T,S>::poll
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/harness.rs:59:15
meilisearch_1 | 26: tokio::runtime::task::raw::RawTask::poll
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/raw.rs:66:18
meilisearch_1 | 27: tokio::runtime::task::Notified<S>::run
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/task/mod.rs:171:9
meilisearch_1 | 28: tokio::runtime::blocking::pool::Inner::run
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/blocking/pool.rs:265:17
meilisearch_1 | 29: tokio::runtime::blocking::pool::Spawner::spawn_thread::{{closure}}
meilisearch_1 | at ./root/.cargo/registry/src/github.com-1ecc6299db9ec823/tokio-1.7.1/src/runtime/blocking/pool.rs:245:17
meilisearch_1 | note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
```
</details>
Co-authored-by: Kerollmops <clement@meilisearch.com>
269: Fix bug when inserting previously deleted documents r=Kerollmops a=Kerollmops
This PR fixes#268.
The issue was in the `ExternalDocumentsIds` implementation in the specific case that an external document id was in the soft map marked as deleted.
The bug was due to a wrong assumption on my side about how the FST unions were returning the `IndexedValue`s, I thought the values returned in an array were in the same order as the FSTs given to the `OpBuilder` but in fact, [the `IndexedValue`'s `index` field was here to indicate from which FST the values were coming from](https://docs.rs/fst/0.4.7/fst/map/struct.IndexedValue.html).
271: Remove the roaring operation functions warnings r=Kerollmops a=Kerollmops
In this PR we are just replacing the usages of the roaring operations function by the new operators. This removes a lot of warnings.
Co-authored-by: Kerollmops <clement@meilisearch.com>
It is undefined behavior to keep a reference to the database while
modifying it, we were keeping references in the database and also
feeding the heed put_current methods with keys referenced inside
the database itself.
https://github.com/Kerollmops/heed/pull/108
245: Warn for when a key is too large for LMDB r=Kerollmops a=Kerollmops
Closes#191, and resolves#140.
Co-authored-by: Kerollmops <clement@meilisearch.com>