Commit Graph

105 Commits

Author SHA1 Message Date
Loïc Lecrenier 453d593ce8 Add a database containing the docids where each field exists 2022-07-19 10:07:33 +02:00
Tamo eaf28b0628
Apply review suggestions
Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-07-05 15:30:33 +02:00
Tamo 3b309f654a
Fasten the document deletion
When a document deletion occurs, instead of deleting the document we mark it as deleted
in the new “soft deleted” bitmap. It is then removed from the search, and all the other
endpoints.
2022-07-05 15:30:33 +02:00
ad hoc 31776fdc3f
add failing test 2022-06-07 15:49:33 +02:00
bors[bot] 08c6d50cd1
Merge #531
531: fix the mixed dataset geosearch indexing bug r=Kerollmops a=irevoire

port #529 to main

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-16 16:06:36 +00:00
Tamo 0af399a6d7
fix the mixed dataset geosearch indexing bug 2022-05-16 17:37:45 +02:00
Tamo f586028f9a
fix the searchable fields bug when a field is nested
Update milli/src/index.rs

Co-authored-by: Clément Renault <clement@meilisearch.com>
2022-05-16 17:24:36 +02:00
bors[bot] 65e6aa0de2
Merge #523
523: Improve geosearch error messages r=irevoire a=irevoire

Improve the geosearch error messages (#488).
And try to parse the string as specified in https://github.com/meilisearch/meilisearch/issues/2354

Co-authored-by: Tamo <tamo@meilisearch.com>
2022-05-04 13:36:11 +00:00
Kerollmops 211c8763b9
Make sure that we do not generate too long keys 2022-05-03 10:03:15 +02:00
Kerollmops 7e47031bdc
Add a test for long keys in LMDB 2022-05-03 10:03:13 +02:00
Tamo 3cb1f6d0a1
improve geosearch error messages 2022-05-02 19:20:47 +02:00
Tamo f19d2dc548
Only flatten the required fields
apply review comments

Co-authored-by: Kerollmops <kero@meilisearch.com>
2022-04-26 12:33:46 +02:00
Clément Renault eb5830aa40
Add a test to make sure that long words are handled 2022-04-21 13:45:28 +02:00
Tamo ee64f4a936
Use smartstring to store the external id in our hashmap
We need to store all the external id (primary key) in a hashmap
associated to their internal id during.
The smartstring remove heap allocation / memory usage and should
improve the cache locality.
2022-04-13 21:22:07 +02:00
Irevoire 4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
ad hoc e8f06f6c06
extract exact_word_prefix_docids 2022-04-04 20:54:03 +02:00
ad hoc ba0bb29cd8
refactor WordPrefixDocids to take dbs instead of indexes 2022-04-04 20:54:02 +02:00
ad hoc c4c6e35352
query exact_word_docids in resolve_query_tree 2022-04-04 20:54:02 +02:00
ad hoc 8d46a5b0b5
extract exact word docids 2022-04-04 20:54:02 +02:00
ad hoc 0a77be4ec0
introduce exact_word_docids db 2022-04-04 20:54:02 +02:00
Kerollmops d5b8b5a2f8
Replace the ugly unwraps by clean if let Somes 2022-02-28 16:31:33 +01:00
Kerollmops 8d26f3040c
Remove a useless grenad file merging 2022-02-28 16:31:33 +01:00
bors[bot] 25123af3b8
Merge #436
436: Speed up the word prefix databases computation time r=Kerollmops a=Kerollmops

This PR depends on the fixes done in #431 and must be merged after it.

In this PR we will bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids` and, `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs much fewer CPU cycles, an era where this update structure can use the, previously computed, set of new word docids from the newly indexed batch of documents.

---

The `WordPrefixPairProximityDocids` is an update structure, which means that it is an object that we feed with some parameters and which modifies the LMDB database of an index when asked for. This structure specifically computes the list of word prefix pair proximities, which correspond to a list of pairs of words associated with a proximity (the distance between both words) where the second word is not a word but a prefix e.g. `s`, `se`, `a`. This word prefix pair proximity is associated with the list of documents ids which contains the pair of words and prefix at the given proximity.

The origin of the performances issue that this struct brings is related to the fact that it starts its job from the beginning, it clears the LMDB database before rewriting everything from scratch, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things.

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-16 15:41:14 +00:00
Clément Renault ff8d7a810d
Change the behavior of the as_cloneable_grenad by taking a ref 2022-02-16 15:40:08 +01:00
Many d59bcea749 Revert "Revert "Change chunk size to 4MiB to fit more the end user usage"" 2022-02-02 17:01:13 +01:00
Kerollmops fb79c32430
Compute the new, common and, deleted prefix words fst once 2022-01-27 11:00:18 +01:00
Clément Renault 51d1e64b23
Remove, now useless, the WriteMethod enum 2022-01-27 10:08:35 +01:00
Clément Renault e9c02173cf
Rework the WordsPrefixPositionDocids update to compute a subset of the database 2022-01-27 10:08:35 +01:00
Clément Renault d59e559317
Fix the computation of the newly added and common prefix words 2022-01-27 10:08:34 +01:00
Clément Renault 28692f65be
Rework the WordPrefixDocids update to compute a subset of the database 2022-01-27 10:08:34 +01:00
Clément Renault 5404bc02dd
Move the fst_stream_into_hashset method in the helper methods 2022-01-27 10:06:00 +01:00
Clément Renault 822f67e9ad
Bring the newly created word pair proximity docids 2022-01-27 10:06:00 +01:00
Clément Renault d28f18658e
Retrieve the previous version of the words prefixes FST 2022-01-27 10:05:59 +01:00
Marin Postma 0c84a40298 document batch support
reusable transform

rework update api

add indexer config

fix tests

review changes

Co-authored-by: Clément Renault <clement@meilisearch.com>

fmt
2022-01-19 12:40:20 +01:00
Marin Postma 6eb47ab792 remove update_id in UpdateBuilder 2021-11-16 13:07:04 +01:00
Marin Postma 09b4281cff improve document addition returned metaimprove document addition
returned metaimprove document addition returned metaimprove document
addition returned metaimprove document addition returned metaimprove
document addition returned metaimprove document addition returned
metaimprove document addition returned meta
2021-11-10 14:08:36 +01:00
marin postma baddd80069
implement review suggestions 2021-10-25 18:29:12 +02:00
marin postma 2e62925a6e
fix tests 2021-10-25 10:26:42 +02:00
marin postma 8d70b01714
optimize document deserialization 2021-10-25 10:26:42 +02:00
many c5a6075484
Make max_position_per_attributes changable 2021-10-12 10:10:50 +02:00
many 360c5ff3df
Remove limit of 1000 position per attribute
Instead of using an arbitrary limit we encode the absolute position in a u32
using one strong u16 for the field id and a weak u16 for the relative position in the attribute.
2021-10-12 10:10:50 +02:00
many 3296bb243c
Simplify word level position DB into a word position DB 2021-10-05 12:15:02 +02:00
Many 26b5dad042
Revert "Change chunk size to 4MiB to fit more the end user usage" 2021-09-29 15:08:39 +02:00
Tamo f65153ad64
stop casting integer docids to string 2021-09-28 18:35:54 +02:00
many 1988416295
Add failing test related to Meilisearch#1714 2021-09-28 12:05:11 +02:00
many b188063869
Change chunk size to 4MiB to fit more the end user usage 2021-09-27 14:26:21 +02:00
many 551df0cb77
Add test checking the bug reported in meilisearch issue 1716 2021-09-23 15:55:39 +02:00
mpostma aa6c5df0bc Implement documents format
document reader transform

remove update format

support document sequences

fix document transform

clean transform

improve error handling

add documents! macro

fix transform bug

fix tests

remove csv dependency

Add comments on the transform process

replace search cli

fmt

review edits

fix http ui

fix clippy warnings

Revert "fix clippy warnings"

This reverts commit a1ce3cd96e603633dbf43e9e0b12b2453c9c5620.

fix review comments

remove smallvec in transform loop

review edits
2021-09-21 16:58:33 +02:00
bors[bot] 31c8de1cca
Merge #322
322: Geosearch r=ManyTheFish a=irevoire

This PR introduces [basic geo-search functionalities](https://github.com/meilisearch/specifications/pull/59), it makes the engine able to index, filter and, sort by geo-point. We decided to use [the rstar library](https://docs.rs/rstar) and to save the points in [an RTree](https://docs.rs/rstar/0.9.1/rstar/struct.RTree.html) that we de/serialize in the index database [by using serde](https://serde.rs/) with [bincode](https://docs.rs/bincode). This is not an efficient way to query this tree as it will consume a lot of CPU and memory when a search is made, but at least it is an easy first way to do so.

### What we will have to do on the indexing part:
 - [x] Index the `_geo` fields from the documents.
   - [x] Create a new module with an extractor in the `extract` module that takes the `obkv_documents` and retrieves the latitude and longitude coordinates, outputting them in a `grenad::Reader` for further process.
   - [x] Call the extractor in the `extract::extract_documents_data` function and send the result to the `TypedChunk` module.
   - [x] Get the `grenad::Reader` in the `typed_chunk::write_typed_chunk_into_index` function and store all the points in the `rtree`
- [x] Delete the documents from the `RTree` when deleting documents from the database. All this can be done in the `delete_documents.rs` file by getting the data structure and removing the points from it, inserting it back after the modification.
- [x] Clearing the `RTree` entirely when we clear the documents from the database, everything happens in the `clear_documents.rs` file.
- [x] save a Roaring bitmap of all documents containing the `_geo` field

### What we will have to do on the query part:
- [x] Filter the documents at a certain distance around a point, this is done by [collecting the documents from the searched point](https://docs.rs/rstar/0.9.1/rstar/struct.RTree.html#method.nearest_neighbor_iter) while they are in range.
  - [x] We must introduce new `geoLowerThan` and `geoGreaterThan` variants to the `Operator` filter enum.
  - [x] Implement the `negative` method on both variants where the `geoGreaterThan` variant is implemented by executing the `geoLowerThan` and removing the results found from the whole list of geo faceted documents.
  - [x] Add the `_geoRadius` function in the pest parser.
- [x] Introduce a `_geo` ascending ranking function that takes a point in parameter, ~~this function must keep the iterator on the `RTree` and make it peekable~~ This was not possible for now, we had to collect the whole iterator. Only the documents that are part of the candidates must be sent too!
  - [x] This ascending ranking rule will only be active if the search is set up with the `_geoPoint` parameter that indicates the center point of the ascending ranking rule.

-----------

- On Meilisearch part: We must introduce a new concept, returning the documents with a new `_geoDistance` field when it passed by the `_geo` ranking rule, this has never been done before. We could maybe just do it afterward when the documents have been retrieved from the database, computing the distance from the `_geoPoint` and all of the documents to be returned.

Co-authored-by: Irevoire <tamo@meilisearch.com>
Co-authored-by: cvermand <33010418+bidoubiwa@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2021-09-20 19:04:57 +00:00
many 26deeb45a3
Add lacking parameter to word level position builder 2021-09-09 17:49:04 +02:00