8dff08d772
400: Rewrite the filter parser and add a lot of tests r=irevoire a=irevoire

This PR is a complete rewrite of #358, which was reverted in #403.

You can already try this PR in Meilisearch here https://github.com/meilisearch/MeiliSearch/pull/1880.

Since writing a parser is quite complicated, I moved all the logic to another workspace called `filter_parser`. In this workspace, we don't know anything about milli, the filterable fields / field ID or anything. As you can see in its `Cargo.toml`, it has only two dependencies, both entirely focused on the parsing part:

```
nom = "7.0.0"
nom_locate = "4.0.0"
```

But introducing this new workspace made some changes necessary on the “AST”. Now the parser only returns `Tokens` (a simple `&str` with a bit of context). Everything is interpreted when we execute the filter later in milli.

This crate provides a new error type for all filter-related errors.

---

## Errors

Currently, we have multiple kinds of errors. Sometimes we generate errors that look like this (for `name = truc`):

```
Attribute `name` is not filterable. Available filterable attributes are: ``.
```

While sometimes pest generated errors that look like this:

```
Invalid syntax for the filter parameter: ` --> 1:7
 |
1 | name =
 | ^---
 |
 = expected word`.
```

Which most people saw like this (for `name =`):

```
Invalid syntax for the filter parameter: ` --> 1:7\n |\n1 | name =\n | ^---\n |\n = expected word`.
```

---

With this PR, the error format is unified between all errors. All errors follow this more straightforward format:

```
The error message.
[from char]:[to char] filter
```

This should be much easier for a human to read when embedded in the JSON. It should also allow us to parse the errors easily and provide highlighting or something similar in a frontend playground.

Here is an example of the two previous errors with the new format.

For `name = truc`:

```
Attribute `name` is not filterable. Available filterable attributes are: ``.
1:4 name = truc
```

Or in one line:

```
Attribute `name` is not filterable. Available filterable attributes are: ``.\n1:4 name = truc
```

And for `name =`:

```
Was expecting a value but instead got nothing.
7:7 name =
```

Or in one line:

```
Was expecting a value but instead got nothing.\n7:7 name =
```

Also, since we now have control over the parser, we can generate more explicit error messages, so a lot of new errors have been created. I tried to be as helpful as possible for the user; here is a little overview of the new error messages you can get when misusing a filter:

```
Expression `"truc` is missing the following closing delimiter: `"`.
8:13 name = "truc
```

```
The `_geoRadius` filter is an operation and can't be used as a value.
8:30 name = _geoRadius(12, 13, 14)
```

etc.

## Tests

A lot of tests have been written in the `filter_parser` crate. I think there is a unit test for every part of the syntax. But since we can never be sure we covered all the cases, I also fuzzed the new parser A LOT (for ±8 hours on 20 threads). The code to fuzz the parser is included in the workspace, so if one day we need to change something in the syntax, we'll be able to re-use it by simply running:

```
cargo fuzz run --release parse
```

## Milli

I renamed the type and module `filter_condition.rs` / `FilterCondition` to `filter.rs` / `Filter`.

Co-authored-by: Tamo <tamo@meilisearch.com>
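The unified error format described above (a message line, followed by `[from char]:[to char] filter`) is meant to be easy to parse, for example to drive highlighting in a frontend. The sketch below shows one way a consumer might split such a string. `FilterError` and `parse_filter_error` are hypothetical names invented for this illustration; they are not part of `filter_parser` or milli, and the sketch assumes the message itself fits on a single line.

```rust
/// Hypothetical holder for the pieces of a unified filter error.
/// This type is an illustration only, not an API of milli.
#[derive(Debug, PartialEq)]
struct FilterError<'a> {
    message: &'a str,
    from: usize,
    to: usize,
    filter: &'a str,
}

/// Splits `message\nfrom:to filter` into its parts.
/// Returns `None` if the input does not match that shape.
fn parse_filter_error(raw: &str) -> Option<FilterError<'_>> {
    // Assumption: the message is a single line, so the first newline
    // separates it from the location line.
    let (message, location) = raw.split_once('\n')?;
    // The span and the filter are separated by the first space.
    let (span, filter) = location.split_once(' ')?;
    let (from, to) = span.split_once(':')?;
    Some(FilterError {
        message,
        from: from.parse().ok()?,
        to: to.parse().ok()?,
        filter,
    })
}

fn main() {
    let raw = "Was expecting a value but instead got nothing.\n7:7 name =";
    match parse_filter_error(raw) {
        Some(err) => println!(
            "{} (chars {} to {}) in `{}`",
            err.message, err.from, err.to, err.filter
        ),
        None => println!("not a unified filter error"),
    }
}
```

Because the location line always starts with the two numbers, a frontend only needs `from` and `to` to underline the offending part of the filter.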
a concurrent indexer combined with fast and relevant search algorithms
Introduction
This repository contains the core engine used in MeiliSearch.
It contains a library that can manage one and only one index. MeiliSearch manages the multi-index itself. Milli does not persist pending updates in a store: that is the job of a layer above it, which is why it can only process one update at a time.
This repository contains crates to quickly debug the engine:
- There are benchmarks located in the `benchmarks` crate.
- The `http-ui` crate is a simple HTTP dashboard to test the features for real!
- The `infos` crate is used to dump the internal data structure and ensure correctness.
- The `search` crate is a simple command-line tool that helps run flamegraph on top of it.
- The `helpers` crate is only used to modify the database in place, sometimes.
Compile and run the HTTP debug server
You can specify the number of threads to use to index documents and many other settings too.
cd http-ui
cargo run --release -- --db my-database.mdb -vvv --indexing-jobs 8
Index your documents
It can index a massive number of documents in little time. I have already managed to index:
- 115m songs (song and artist name) in ~48 min, taking 81 GiB on disk.
- 12m cities (name, timezone and country ID) in ~4 min, taking 6 GiB on disk.
These measurements were taken on a MacBook Pro with the M1 processor.
You can feed the engine with your CSV (comma-separated, yes) data like this:
printf "id,name,age\n1,hello,32\n2,kiki,24\n" | http POST 127.0.0.1:9700/documents content-type:text/csv
Don't forget to specify the `id` of the documents. Also, note that it supports JSON and JSON streaming: you can send them to the engine by using the `content-type:application/json` and `content-type:application/x-ndjson` headers respectively.
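To make the difference between the two JSON content types concrete, here is a minimal sketch that serializes the same two documents as the CSV example both ways. It assumes that the `application/json` body is an array of objects and that the `application/x-ndjson` body holds one object per line; the `Doc` type and the helper functions are hypothetical and exist only for this illustration. The hand-written formatting does no string escaping, so a real program would use a JSON library instead.

```rust
/// Hypothetical document type used only for this illustration.
struct Doc {
    id: u32,
    name: &'static str,
    age: u32,
}

/// Formats one document as a JSON object.
/// No escaping is performed, so `name` must not contain quotes or backslashes.
fn to_json_object(d: &Doc) -> String {
    format!(r#"{{"id":{},"name":"{}","age":{}}}"#, d.id, d.name, d.age)
}

/// Assumed body shape for `content-type:application/json`: an array of objects.
fn to_json_array(docs: &[Doc]) -> String {
    let objects: Vec<String> = docs.iter().map(to_json_object).collect();
    format!("[{}]", objects.join(","))
}

/// Body shape for `content-type:application/x-ndjson`: one object per line.
fn to_ndjson(docs: &[Doc]) -> String {
    docs.iter().map(|d| to_json_object(d) + "\n").collect()
}

fn main() {
    let docs = [
        Doc { id: 1, name: "hello", age: 32 },
        Doc { id: 2, name: "kiki", age: 24 },
    ];
    println!("JSON:\n{}\n", to_json_array(&docs));
    println!("NDJSON:\n{}", to_ndjson(&docs));
}
```

The NDJSON form can be produced and consumed one line at a time, which is what makes it suitable for streaming.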
Querying the engine via the website
You can query the engine by going to the HTML page itself.
Contributing
You can set up a `git-hook` to stop you from making a commit too fast. It'll stop you if:
- Any of the workspaces does not build
- Your code is not well-formatted
These two things are also checked in the CI, so ignoring the hook won't help you merge your code.
But if you need to, you can still add `--no-verify` when creating your commit to ignore the hook.
To enable the hook, run the following command from the root of the project:
cp script/pre-commit .git/hooks/pre-commit