Commit Graph

121 Commits

Author SHA1 Message Date
Tamo bab898ce86
move the flatten-serde-json crate inside of milli 2022-04-07 18:20:44 +02:00
Irevoire 4f3ce6d9cd
nested fields 2022-04-07 16:58:46 +02:00
Clémentine Urquizar 9eec44dd98
Update version (v0.25.0) 2022-04-05 12:06:42 +02:00
Clémentine Urquizar ddf78a735b
Update version (v0.24.1) 2022-03-24 16:39:45 +01:00
Irevoire 86dd88698d
bump tokenizer 2022-03-23 14:25:58 +01:00
Irevoire 5dc464b9a7
rollback meilisearch-tokenizer version 2022-03-21 17:29:10 +01:00
Kerollmops 08a06b49f0
Bump version to 0.23.1 2022-03-15 15:50:28 +01:00
Kerollmops 63682c2c9a
Upgrade the dependencies 2022-03-15 11:17:44 +01:00
Kerollmops 288a879411
Remove three useless dependencies 2022-03-15 11:17:44 +01:00
Clémentine Urquizar d9ed9de2b0
Update heed link in cargo toml 2022-03-01 19:45:29 +01:00
bors[bot] 25123af3b8
Merge #436
436: Speed up the word prefix databases computation time r=Kerollmops a=Kerollmops

This PR depends on the fixes done in #431 and must be merged after it.

In this PR we will bring the `WordPrefixPairProximityDocids`, `WordPrefixDocids`, and `WordPrefixPositionDocids` update structures to a new era, a better era, where computing the word prefix pair proximities costs far fewer CPU cycles, an era where these update structures can reuse the previously computed set of new word docids from the newly indexed batch of documents.

---

The `WordPrefixPairProximityDocids` is an update structure, which means that it is an object that we feed with some parameters and that modifies the LMDB database of an index when asked to. This structure specifically computes the list of word prefix pair proximities, which corresponds to a list of pairs associated with a proximity (the distance between the two words) where the second element is not a full word but a prefix, e.g. `s`, `se`, `a`. Each word prefix pair proximity is associated with the list of document ids that contain the pair of word and prefix at the given proximity.

The origin of the performance issue that this struct brings is that it starts its job from the beginning every time: it clears the LMDB database before rewriting everything from scratch, using the other LMDB databases to achieve that. I hope you understand that this is absolutely not an optimized way of doing things.
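For context, the key-to-docids layout this structure maintains can be sketched with an in-memory map. This is a hypothetical, std-only illustration of the concept only; milli actually stores this in LMDB with roaring bitmaps, and the names and types here are not milli's real ones:

```rust
use std::collections::{BTreeMap, BTreeSet};

// Illustrative types: the database conceptually maps a
// (word, prefix, proximity) key to the set of document ids that
// contain `word` and some word starting with `prefix` at that distance.
type DocId = u32;
type WordPrefixPairProximityDocids = BTreeMap<(String, String, u8), BTreeSet<DocId>>;

/// Record that document `doc` contains `word` followed, at `proximity`,
/// by some word starting with `prefix`.
fn insert_pair(
    db: &mut WordPrefixPairProximityDocids,
    word: &str,
    prefix: &str,
    proximity: u8,
    doc: DocId,
) {
    db.entry((word.to_string(), prefix.to_string(), proximity))
        .or_default()
        .insert(doc);
}
```

The PR's optimization amounts to feeding only the entries derived from the newly indexed documents into this mapping, instead of clearing it and recomputing every entry from the other databases.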

Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Kerollmops <clement@meilisearch.com>
2022-02-16 15:41:14 +00:00
Clément Renault f367cc2e75
Finally bump grenad to v0.4.1 2022-02-16 15:28:48 +01:00
Irevoire 0defeb268c
bump milli 2022-02-16 13:27:41 +01:00
Irevoire 48542ac8fd
get rid of chrono in favor of time 2022-02-15 11:41:55 +01:00
Clémentine Urquizar d03b3ceb58
Update version for the next release (v0.22.1) 2022-02-07 18:39:29 +01:00
Tamo 367f403693
bump milli 2022-01-17 16:41:34 +01:00
Samyak S Sarnayak c10f58b7bd
Update tokenizer to v0.2.7 2022-01-17 13:02:00 +05:30
many 1b3923b5ce
Update all packages to 0.21.0 2021-11-29 12:17:59 +01:00
many 64ef5869d7
Update tokenizer v0.2.6 2021-11-18 16:56:05 +01:00
Tamo f28600031d
Rename the filter_parser crate into filter-parser
Co-authored-by: Clément Renault <clement@meilisearch.com>
2021-11-09 16:41:10 +01:00
Tamo 6831c23449
merge with main 2021-11-06 16:34:30 +01:00
Tamo a58bc5bebb
update milli with the new parser_filter 2021-11-04 15:02:36 +01:00
many 743ed9f57f
Bump milli version 2021-11-04 14:04:21 +01:00
many 702589104d
Update version for the next release (v0.20.1) 2021-11-03 14:20:01 +01:00
Clémentine Urquizar 056ff13c4d
Update version for the next release (v0.20.0) 2021-10-28 14:52:57 +02:00
bors[bot] d7943fe225
Merge #402
402: Optimize document transform r=MarinPostma a=MarinPostma

This PR optimizes the transform of document additions into the obkv format. Instead of accepting any serializable object, we treat JSON and CSV specifically:
- For JSON, we build a serde `Visitor` that transforms the JSON straight into obkv without an intermediate representation.
- For CSV, we directly write the lines into the obkv, applying other optimizations as well.
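The CSV case can be sketched as follows. This is a hypothetical, std-only illustration of the idea, not milli's actual code or the real obkv API: an obkv-style buffer stores `(field_id, value)` pairs sorted by field id, and because the CSV header assigns each column a stable field id once, every line can be written straight into that layout with no intermediate document representation:

```rust
// Map one CSV line into an obkv-like sorted (field_id, value) layout.
// Field ids are simply the column positions from the header.
fn csv_line_to_obkv_like(header: &[&str], line: &[&str]) -> Vec<(u16, String)> {
    let mut entries: Vec<(u16, String)> = header
        .iter()
        .zip(line)
        .enumerate()
        .map(|(field_id, (_name, value))| (field_id as u16, value.to_string()))
        .collect();
    entries.sort_by_key(|(id, _)| *id); // obkv requires keys in ascending order
    entries
}
```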

Co-authored-by: marin postma <postma.marin@protonmail.com>
2021-10-26 09:55:28 +00:00
bors[bot] 15c29cdd9b
Merge #401
401: Update version for the next release (v0.19.0) r=curquiza a=curquiza



Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2021-10-25 12:49:53 +00:00
Clémentine Urquizar 208903ddde
Revert "Replacing pest with nom " 2021-10-25 11:58:00 +02:00
Clémentine Urquizar 679fe18b17
Update version for the next release (v0.19.0) 2021-10-25 11:52:17 +02:00
marin postma 0f86d6b28f
implement csv serialization 2021-10-25 10:26:42 +02:00
Tamo efb2f8b325
convert the errors 2021-10-22 16:38:35 +02:00
Tamo c27870e765
integrate a first version without any error handling 2021-10-22 14:33:18 +02:00
Tamo 01dedde1c9
update some names and move some parser out of the lib.rs 2021-10-22 01:59:38 +02:00
Clémentine Urquizar f8fe9316c0
Update version for the next release (v0.18.1) 2021-10-21 11:56:14 +02:00
Clémentine Urquizar 2209acbfe2
Update version for the next release (v0.18.2) 2021-10-18 13:45:48 +02:00
bors[bot] 59cc59e93e
Merge #358
358: Replacing pest with nom  r=Kerollmops a=CNLHC



Co-authored-by: 刘瀚骋 <cn_lhc@qq.com>
2021-10-16 20:44:38 +00:00
刘瀚骋 7666e4f34a follow the suggestions 2021-10-14 21:37:59 +08:00
bors[bot] c7db4176f3
Merge #384
384: Replace memmap with memmap2 r=Kerollmops a=palfrey

[memmap is unmaintained](https://rustsec.org/advisories/RUSTSEC-2020-0077.html) and needs replacing. memmap2 is a drop-in replacement fork that's well maintained. Note that the version numbers got reset on fork, hence the lower values.

Co-authored-by: Tom Parker-Shemilt <palfrey@tevp.net>
2021-10-13 13:47:23 +00:00
刘瀚骋 f7796edc7e remove everything about pest 2021-10-12 13:30:40 +08:00
刘瀚骋 8748df2ca4 draft without error handling 2021-10-12 13:30:40 +08:00
Clémentine Urquizar dd56e82dba
Update version for the next release (v0.17.2) 2021-10-11 15:20:35 +02:00
Tom Parker-Shemilt 2dfe24f067 memmap -> memmap2 2021-10-10 22:47:12 +01:00
Clémentine Urquizar 05d8a33a28
Update version for the next release (v0.17.1) 2021-10-02 16:21:31 +02:00
Clémentine Urquizar 0e8665bf18
Update version for the next release (v0.17.0) 2021-09-28 19:38:12 +02:00
Clémentine Urquizar 1eacab2169
Update version for the next release (v0.15.1) 2021-09-22 17:18:54 +02:00
Clémentine Urquizar f8ecbc28e2
Update version for the next release (v0.15.0) 2021-09-21 18:09:14 +02:00
mpostma aa6c5df0bc Implement documents format
document reader transform

remove update format

support document sequences

fix document transform

clean transform

improve error handling

add documents! macro

fix transform bug

fix tests

remove csv dependency

Add comments on the transform process

replace search cli

fmt

review edits

fix http ui

fix clippy warnings

Revert "fix clippy warnings"

This reverts commit a1ce3cd96e603633dbf43e9e0b12b2453c9c5620.

fix review comments

remove smallvec in transform loop

review edits
2021-09-21 16:58:33 +02:00
bors[bot] 94764e5c7c
Merge #360
360: Update version for the next release (v0.14.0) r=Kerollmops a=curquiza

Release containing the geosearch, cf #322 

Co-authored-by: Clémentine Urquizar <clementine@meilisearch.com>
2021-09-21 08:43:27 +00:00
bors[bot] 31c8de1cca
Merge #322
322: Geosearch r=ManyTheFish a=irevoire

This PR introduces [basic geo-search functionalities](https://github.com/meilisearch/specifications/pull/59): it makes the engine able to index, filter, and sort by geo-point. We decided to use [the rstar library](https://docs.rs/rstar) and to save the points in [an RTree](https://docs.rs/rstar/0.9.1/rstar/struct.RTree.html) that we de/serialize in the index database [by using serde](https://serde.rs/) with [bincode](https://docs.rs/bincode). This is not an efficient way to query the tree, as it consumes a lot of CPU and memory whenever a search is made, but at least it is an easy first way to do so.

### What we will have to do on the indexing part:
- [x] Index the `_geo` fields from the documents.
   - [x] Create a new module with an extractor in the `extract` module that takes the `obkv_documents` and retrieves the latitude and longitude coordinates, outputting them in a `grenad::Reader` for further process.
   - [x] Call the extractor in the `extract::extract_documents_data` function and send the result to the `TypedChunk` module.
   - [x] Get the `grenad::Reader` in the `typed_chunk::write_typed_chunk_into_index` function and store all the points in the `rtree`
- [x] Delete the documents from the `RTree` when deleting documents from the database. All this can be done in the `delete_documents.rs` file by getting the data structure and removing the points from it, inserting it back after the modification.
- [x] Clearing the `RTree` entirely when we clear the documents from the database, everything happens in the `clear_documents.rs` file.
- [x] save a Roaring bitmap of all documents containing the `_geo` field

### What we will have to do on the query part:
- [x] Filter the documents at a certain distance around a point; this is done by [collecting the documents from the searched point](https://docs.rs/rstar/0.9.1/rstar/struct.RTree.html#method.nearest_neighbor_iter) while they are in range.
  - [x] We must introduce new `geoLowerThan` and `geoGreaterThan` variants to the `Operator` filter enum.
  - [x] Implement the `negative` method on both variants where the `geoGreaterThan` variant is implemented by executing the `geoLowerThan` and removing the results found from the whole list of geo faceted documents.
  - [x] Add the `_geoRadius` function in the pest parser.
- [x] Introduce a `_geo` ascending ranking function that takes a point as a parameter, ~~this function must keep the iterator on the `RTree` and make it peekable~~ This was not possible for now, we had to collect the whole iterator. Only the documents that are part of the candidates must be sent too!
  - [x] This ascending ranking rule will only be active if the search is set up with the `_geoPoint` parameter that indicates the center point of the ascending ranking rule.
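The `_geoRadius` filtering step above can be sketched with a plain scan over points using the haversine great-circle distance. This is a hypothetical, std-only sketch of the idea; milli actually walks the rstar `RTree` via `nearest_neighbor_iter` instead of scanning every point, and the function names here are illustrative:

```rust
// Great-circle distance in meters between two (lat, lon) points in degrees.
fn haversine_m(a: (f64, f64), b: (f64, f64)) -> f64 {
    const EARTH_RADIUS_M: f64 = 6_371_000.0;
    let (lat1, lon1) = (a.0.to_radians(), a.1.to_radians());
    let (lat2, lon2) = (b.0.to_radians(), b.1.to_radians());
    let dlat = lat2 - lat1;
    let dlon = lon2 - lon1;
    let h = (dlat / 2.0).sin().powi(2)
        + lat1.cos() * lat2.cos() * (dlon / 2.0).sin().powi(2);
    2.0 * EARTH_RADIUS_M * h.sqrt().asin()
}

// Keep only the documents whose point lies within `radius_m` of `center`.
fn geo_radius(points: &[(u32, (f64, f64))], center: (f64, f64), radius_m: f64) -> Vec<u32> {
    points
        .iter()
        .filter(|(_, p)| haversine_m(center, *p) <= radius_m)
        .map(|(id, _)| *id)
        .collect()
}
```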

-----------

- On the Meilisearch side: we must introduce a new concept, returning the documents with a new `_geoDistance` field when they pass through the `_geo` ranking rule; this has never been done before. We could maybe just do it afterward, once the documents have been retrieved from the database, by computing the distance from the `_geoPoint` to each of the documents to be returned.

Co-authored-by: Irevoire <tamo@meilisearch.com>
Co-authored-by: cvermand <33010418+bidoubiwa@users.noreply.github.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
2021-09-20 19:04:57 +00:00
Clémentine Urquizar 3f1453f470
Update version for the next release (v0.14.0) 2021-09-20 18:12:23 +02:00