Merge pull request #148 from meilisearch/split-fst-docindexes

Split fst doc-indexes

Commit 349f0f7068

README.md (49 lines changed)
@@ -10,19 +10,19 @@ A _full-text search database_ using a key-value store internally.

## Features
- Provides [6 default ranking criteria](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/criterion/mod.rs#L94-L105) used to [bucket sort](https://en.wikipedia.org/wiki/Bucket_sort) documents
- Accepts [custom criteria](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/criterion/mod.rs#L24-L31) and can apply them in any custom order
- Supports [ranged queries](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/query_builder.rs#L165), useful for paginating results
- Can [distinct](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/query_builder.rs#L96) and [filter](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/query_builder.rs#L85) returned documents based on context-defined rules
- Can store complete documents or only [user schema specified fields](https://github.com/meilisearch/MeiliDB/blob/20b5a6a06e4b897313e83e24fe1e1e47c660bfe8/examples/schema-example.toml)
- The [default tokenizer](https://github.com/meilisearch/MeiliDB/blob/a960c325f30f38be6a63634b3bd621daf82912a8/src/tokenizer/mod.rs) can index Latin and kanji based languages
- Returns [the matching text areas](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/mod.rs#L15-L18), useful to highlight matched words in results
- Accepts query-time search configuration like the [searchable fields](https://github.com/meilisearch/MeiliDB/blob/e0b759839d552f02e3dd0064948f4d8022415ed7/src/rank/query_builder.rs#L107)
- Provides [6 default ranking criteria](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/criterion/mod.rs#L95-L101) used to [bucket sort](https://en.wikipedia.org/wiki/Bucket_sort) documents
- Accepts [custom criteria](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/criterion/mod.rs#L22-L29) and can apply them in any custom order
- Supports [ranged queries](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/query_builder.rs#L146), useful for paginating results
- Can [distinct](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/query_builder.rs#L68) and [filter](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/query_builder.rs#L57) returned documents based on context-defined rules
- Can store complete documents or only [user schema specified fields](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/examples/movies/schema-movies.toml)
- The [default tokenizer](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-tokenizer/src/lib.rs#L99) can index Latin and kanji based languages
- Returns [the matching text areas](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/lib.rs#L117-L120), useful to highlight matched words in results
- Accepts query-time search configuration like the [searchable fields](https://github.com/meilisearch/MeiliDB/blob/3d85cbf0cfa3a3103cf1e151a75a443719cdd5d7/meilidb-core/src/query_builder.rs#L79)
- Supports run-time indexing (incremental indexing)
It uses [RocksDB](https://github.com/facebook/rocksdb) as the internal key-value store. The key-value store allows us to handle updates and queries with small memory and CPU overheads. The whole ranking system is [data oriented](https://github.com/meilisearch/MeiliDB/issues/82) and provides great performance.

It uses [sled](https://github.com/spacejam/sled) as the internal key-value store. The key-value store allows us to handle updates and queries with small memory and CPU overheads. The whole ranking system is [data oriented](https://github.com/meilisearch/MeiliDB/issues/82) and provides great performance.

You can [read the deep dive](deep-dive.md) if you want more information on the engine; it describes the whole process of generating updates and handling queries. You can also take a look at the [typos and ranking rules](typos-ranking-rules.md) if you want to know the default rules used to sort the documents.
@@ -59,15 +59,34 @@ We have seen much better performances when [using jemalloc as the global allocat

## Usage and examples

MeiliDB runs with an index like most search engines.
So to test the library you can create one by indexing a simple CSV file.
You can test a little part of MeiliDB by using this command; it creates an index named _movies_ and initializes it with two great Tarantino movies.
```bash
cargo run --release --example create-database -- test.mdb examples/movies/movies.csv --schema examples/movies/schema-movies.toml

cargo run --release

curl -XPOST 'http://127.0.0.1:8000/movies' \
     -d '
identifier = "id"

[attributes.id]
stored = true

[attributes.title]
stored = true
indexed = true
'

curl -H 'Content-Type: application/json' \
     -XPUT 'http://127.0.0.1:8000/movies' \
     -d '{ "id": 123, "title": "Inglorious Bastards" }'

curl -H 'Content-Type: application/json' \
     -XPUT 'http://127.0.0.1:8000/movies' \
     -d '{ "id": 456, "title": "Django Unchained" }'
```
Once the command is executed, the index should be in the `test.mdb` folder. You are now able to run the `query-database` example and play with MeiliDB.

Once the database is initialized you can query it by using the following command:

```bash
cargo run --release --example query-database -- test.mdb -n 10 id title overview release_date
```

```bash
curl -XGET 'http://127.0.0.1:8000/movies/search?q=inglo'
```
deep-dive.md (91 lines changed)

@@ -1,28 +1,22 @@
# A deep dive in MeiliDB

On the 9 of December 2018.

MeiliDB is a full text search engine based on a finite state transducer named [fst](https://github.com/BurntSushi/fst) and a key-value store named [RocksDB](https://github.com/facebook/rocksdb). The goal of a search engine is to store data and to respond to queries as accurately and as quickly as possible. To achieve this it must save the data as an [inverted index](https://en.wikipedia.org/wiki/Inverted_index).

On the 15 of May 2019.

MeiliDB is a full text search engine based on a finite state transducer named [fst](https://github.com/BurntSushi/fst) and a key-value store named [sled](https://github.com/spacejam/sled). The goal of a search engine is to store data and to respond to queries as accurately and as quickly as possible. To achieve this it must save the matching words in an [inverted index](https://en.wikipedia.org/wiki/Inverted_index).
<!-- MarkdownTOC autolink="true" -->

- [Where is the data stored?](#where-is-the-data-stored)
- [What does the key-value store contain?](#what-does-the-key-value-store-contain)
- [The blob type](#the-blob-type)
- [The inverted word index](#the-inverted-word-index)
- [A finite state transducer](#a-finite-state-transducer)
- [Document indexes](#document-indexes)
- [Document ids](#document-ids)
- [The schema](#the-schema)
- [Document attributes](#document-attributes)
- [How is an update handled?](#how-is-an-update-handled)
- [The merge operation is CPU consuming](#the-merge-operation-is-cpu-consuming)
- [How is a request processed?](#how-is-a-request-processed)
- [Query lexemes](#query-lexemes)
- [Automatons and query index](#automatons-and-query-index)
- [Sort by criteria](#sort-by-criteria)
- [Retrieve original documents](#retrieve-original-documents)

<!-- /MarkdownTOC -->
@@ -30,21 +24,17 @@ MeiliDB is a full text search engine based on a finite state transducer named [fs

MeiliDB is entirely backed by a key-value store like any good database (e.g. Postgres, MySQL). This brings great flexibility in the way documents can be stored and updates handled over time.

[RocksDB brings some](https://rocksdb.org/blog/2015/02/27/write-batch-with-index.html) of the [A.C.I.D. properties](https://en.wikipedia.org/wiki/ACID_(computer_science)) to help us be sure the saved data is consistent; for example we use SST files and the key-value store's ability to load them in one go to manage updates.

Note that SST files have the same restriction as the fst: keys must be added in order at creation.

[sled will bring some](https://github.com/spacejam/sled/tree/434533332a3f485e6d2e467023be0a0b55d3a1af#plans) of the [A.C.I.D. properties](https://en.wikipedia.org/wiki/ACID_(computer_science)) to help us be sure the saved data is consistent.
## What does the key-value store contain?

It contains the blob, the schema and the documents' stored attributes.
It contains the inverted word index, the schema and the documents' fields.

### The blob type
### The inverted word index
[The Blob type](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/blob/mod.rs#L16-L19) is a data structure that indicates if an update is a positive or a negative one. In the case where the update is considered positive, the blob will contain [an fst map and the associated document indexes](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/blob/positive/blob.rs#L15-L18). In the other case it will only contain [all the document ids](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/blob/negative/blob.rs#L12-L14) that must be considered removed.

The Blob type [is stored under the "*data-index*" entry](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/update/positive/update.rs#L497-L499) and marked as [a merge operation](https://github.com/facebook/rocksdb/wiki/Merge-Operator-Implementation) in the key-value store.

[The inverted word index](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-data/src/database/words_index.rs) is a sled Tree dedicated to storing and giving access to all the documents that contain a specific word. The information stored under a word is simply a big ordered array of where in each document the word has been found. In other words, a big list of [`DocIndex`](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-core/src/lib.rs#L35-L51) values.
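To make this more concrete, here is a minimal sketch of such a lookup, assuming the current sled API; the tree name, the `DocIndex` fields and the 14 byte encoding are illustrative only, not the exact MeiliDB layout.

```rust
use sled::Db;

/// Illustrative only: where a word was found, i.e. which document,
/// which attribute, and where inside that attribute.
#[derive(Debug)]
struct DocIndex {
    document_id: u64,
    attribute: u16,
    word_index: u32,
}

fn doc_indexes_for(db: &Db, word: &str) -> sled::Result<Vec<DocIndex>> {
    // One sled Tree is dedicated to the inverted word index.
    let words = db.open_tree("words-index")?;
    let mut indexes = Vec::new();
    if let Some(bytes) = words.get(word)? {
        // The value is a big ordered array of fixed-size entries;
        // here we pretend each entry is 14 bytes (8 + 2 + 4).
        for entry in bytes.chunks_exact(14) {
            indexes.push(DocIndex {
                document_id: u64::from_be_bytes(entry[0..8].try_into().unwrap()),
                attribute: u16::from_be_bytes(entry[8..10].try_into().unwrap()),
                word_index: u32::from_be_bytes(entry[10..14].try_into().unwrap()),
            });
        }
    }
    Ok(indexes)
}
```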
#### A finite state transducer

@@ -52,89 +42,54 @@ _...also abbreviated fst_
This is the first entry point of the engine; you can read more about how it works in the beautiful blog post by @BurntSushi, [Index 1,600,000,000 Keys with Automata and Rust](https://blog.burntsushi.net/transducers/).

To make it short, it is a powerful way to store all the words that are present in the indexed documents. You construct it by giving it all the words you want to index, each associated with a value that, for the moment, can only be a `u64`. When you want to search in it you can provide any automaton you want; in MeiliDB [a custom levenshtein automaton](https://github.com/tantivy-search/levenshtein-automata/) is used.

Note that the number under each word is auto-incremented: each new word gets a number greater than the previous one.

Another powerful feature of `fst` is that it can nearly avoid using RAM and be streamed to disk, for example; the problem is that the keys must always be added in lexicographic order, so you must sort them beforehand. For the moment MeiliDB uses a [BTreeMap](https://github.com/Kerollmops/raptor-rs/blob/8abdb0a228e2808fe1814a6a0641a4b72d158579/src/metadata/doc_indexes.rs#L107-L112).

To make it short, it is a powerful way to store all the words that are present in the indexed documents. You construct it by giving it all the words you want to index. When you want to search in it you can provide any automaton you want; in MeiliDB [a custom levenshtein automaton](https://github.com/tantivy-search/levenshtein-automata/) is used.
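As a small sketch of that idea, here is the `fst` crate's documented build-and-search round trip, using its optional `levenshtein` feature for brevity (MeiliDB itself plugs in the tantivy levenshtein-automata crate):

```rust
use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Keys must be inserted in lexicographic order.
    let set = Set::from_iter(vec!["fondant", "foo", "food"])?;

    // Match every word at an edit distance of at most 1 from "foo".
    let lev = Levenshtein::new("foo", 1)?;
    let keys = set.search(lev).into_stream().into_strs()?;

    assert_eq!(keys, vec!["foo", "food"]);
    Ok(())
}
```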
#### Document indexes

As has been specified, the `fst` can only store a number corresponding to a word, a `u64`, but the goal of the search engine is to retrieve a match in a document when a query is made. You want it to return some sort of position in an attribute in a document, information about where the given word matched.

The `fst` will only return the words that match the search automaton, but the goal of the search engine is to retrieve all matches in all the documents when a query is made. You want it to return some sort of position in an attribute in a document, information about where the given word matched.

To make it possible, a custom data structure has been developed; the document indexes are composed of two arrays, the ranges array and all the docindexes corresponding to a given range, where each range identifies the word number. The [DocIndexes](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/data/doc_indexes.rs#L23) type is designed to be streamed when constructed, consuming a minimum amount of RAM like the fst. Another advantage is that the slices are accessible in `O(1)` when you know the number associated with the word.

#### Document ids

This is a simple ordered list of all the document ids which must be considered deleted. It is used with [the sdset library](https://docs.rs/sdset/0.3.0/sdset/duo/struct.DifferenceByKey.html), the docindexes and the `DifferenceByKey` operation builder when merging blobs.

When a blob represents a negative update it only contains this simple slice of deleted document ids.

To make it possible we retrieve all of the `DocIndex` values corresponding to all the matching words in the fst; we use the [`WordsIndex`](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-data/src/database/words_index.rs#L11-L21) Tree to get the `DocIndexes` corresponding to the words.
### The schema

The schema is a data structure that represents which document attributes should be stored and which should be indexed. It is stored under the "_data-schema_" entry and given to MeiliDB only at creation.

The schema is a data structure that represents which document attributes should be stored and which should be indexed. It is stored under the [`MainIndex`](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-data/src/database/main_index.rs#L12) Tree and given to MeiliDB only at the creation of an index.

Each document attribute is associated with a unique 32 bit number named `SchemaAttr`.

Each document attribute is associated with a unique 16 bit number named [`SchemaAttr`](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-data/src/schema.rs#L186).

In the future this schema type could be given along with updates and could probably differ from the original; the database could be able to handle this document structure and reindex it.

In the future, this schema type could be given along with updates; the database could be able to handle a new schema and reindex the database according to the new one.
### Document attributes

When the engine handles a query, the result that the requester wants is a document; not only the [match](https://github.com/Kerollmops/MeiliDB/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/lib.rs#L51-L79) associated with it, the fields of the original document must be returned too.

When the engine handles a query, the result that the requester wants is a document; not only the [`Matches`](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-core/src/lib.rs#L62-L88) associated with it, the fields of the original document must be returned too.

So MeiliDB again uses the power of the underlying key-value store and saves the document attributes marked as _STORE_. The key is prefixed by "_doc_" followed by the 64 bit document id in bytes and the schema attribute number in bytes corresponding to the document attribute stored.

So MeiliDB again uses the power of the underlying key-value store and saves the document attributes marked as _STORE_ in the schema. The dedicated Tree for this information is the [`DocumentsIndex`](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-data/src/database/documents_index.rs#L11).
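As an illustration of the kind of key layout described above, here is a sketch that packs the 64 bit document id and the 16 bit `SchemaAttr` together, big-endian so that lexicographic key order groups the fields of a document; the exact encoding used by the `DocumentsIndex` Tree may differ.

```rust
/// Illustrative only: the key under which one stored field
/// of one document could live in the key-value store.
fn document_field_key(document_id: u64, attr: u16) -> [u8; 10] {
    let mut key = [0u8; 10];
    key[..8].copy_from_slice(&document_id.to_be_bytes());
    key[8..].copy_from_slice(&attr.to_be_bytes());
    key
}

fn main() {
    // Attribute 1 of document 123.
    assert_eq!(
        document_field_key(123, 1),
        [0, 0, 0, 0, 0, 0, 0, 123, 0, 1],
    );
}
```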
When a document field is saved in the key-value store its value is binary encoded using the [bincode](https://docs.rs/bincode/) library, so a document must be serializable using serde.

## How is an update handled?

First of all an update in MeiliDB is nothing more than [a RocksDB SST file](https://github.com/facebook/rocksdb/wiki/Creating-and-Ingesting-SST-files). It contains the blob and all the document attributes binary encoded as described above. Note that the blob is stored under the "_data-index_" key marked as [a merge operation](https://github.com/facebook/rocksdb/wiki/Merge-Operator-Implementation).

### The merge operation is CPU consuming

When [the database ingests an update](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/mod.rs#L108-L145) it gives the SST file to the underlying RocksDB; once it has been ingested there is a "_data-index_" entry available. We can request it, but the key-value store will call a function first: a merge operation is performed.

This merge operation is done on multiple blobs, as you will have understood, and will compute a [PositiveBlob](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/blob/positive/blob.rs#L15); this type contains the fst and document indexes structures allowing us to search for documents. These two data structures can be considered as the inverted index.

The computation time of this merge is important: RocksDB doesn't keep the previous merged result, it will call our merge operation each time until it decides to do a compaction. So [we must force this compaction earlier](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/mod.rs#L129-L131) when we receive an update, to reduce this cost.

This way when we request the "_data-index_" value it will give us the previously merged positive blob without any other merge overhead.

When a document field is saved in the key-value store its value is binary encoded using [message pack](https://github.com/3Hren/msgpack-rust), so a document must be serializable using serde.
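A minimal sketch of that round trip, assuming the rmp-serde crate as the serde implementation of MessagePack:

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, PartialEq, Serialize, Deserialize)]
struct Movie {
    id: u64,
    title: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let movie = Movie { id: 123, title: "Inglorious Bastards".into() };

    // Encode the document field to MessagePack bytes before
    // writing them into the key-value store.
    let bytes = rmp_serde::to_vec(&movie)?;

    // Decode them back when the document must be returned.
    let decoded: Movie = rmp_serde::from_slice(&bytes)?;
    assert_eq!(movie, decoded);
    Ok(())
}
```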
## How is a request processed?

Now that we have our "_data-index_" we are able to return results based on a query. In the MeiliDB universe a query is a string.

Now that we have our inverted index we are able to return results based on a query. In the MeiliDB universe a query is a simple string containing words.

### Query lexemes

The first step to be able to call the underlying structures is to split the query into words; for that we use a [custom tokenizer](https://github.com/Kerollmops/MeiliDB/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/tokenizer/mod.rs) that is not finished for the moment, [there is an open issue](https://github.com/Kerollmops/MeiliDB/issues/3). Note that a tokenizer is specialized for a human language; this is the hard part.

The first step to be able to call the underlying structures is to split the query into words; for that we use a [custom tokenizer](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-tokenizer/src/lib.rs#L82-L84). Note that a tokenizer is specialized for a human language; this is the hard part.
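To give an idea of what this step does, here is a deliberately naive stand-in; the real meilidb-tokenizer does much more (per-language rules, word positions, and so on):

```rust
// Lowercase the query and split it on anything that is not alphanumeric.
fn tokenize(query: &str) -> Vec<String> {
    query
        .split(|c: char| !c.is_alphanumeric())
        .filter(|word| !word.is_empty())
        .map(str::to_lowercase)
        .collect()
}

fn main() {
    assert_eq!(tokenize("Hello, world!"), vec!["hello", "world"]);
}
```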
### Automatons and query index

So to query the fst we need an automaton; in MeiliDB we use a [levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton). This automaton is constructed using a string and a maximum distance. Following [Algolia's blog post](https://blog.algolia.com/inside-the-algolia-engine-part-3-query-processing/#algolia%e2%80%99s-way-of-searching-for-alternatives) we [created the DFAs](https://github.com/Kerollmops/MeiliDB/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/automaton.rs#L62-L75) with different settings.

So to query the fst we need an automaton; in MeiliDB we use a [levenshtein automaton](https://en.wikipedia.org/wiki/Levenshtein_automaton). This automaton is constructed using a string and a maximum distance. Following [Algolia's blog post](https://blog.algolia.com/inside-the-algolia-engine-part-3-query-processing/#algolia%e2%80%99s-way-of-searching-for-alternatives) we [created the DFAs](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-core/src/automaton.rs#L59-L78) with different settings.
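A sketch of the "different settings" idea: allow more typos on longer words, in the spirit of the Algolia post. The builder is the real levenshtein-automata crate API, but the length thresholds below are made up for the example.

```rust
use levenshtein_automata::{Distance, LevenshteinAutomatonBuilder, DFA};

fn build_dfa(query_word: &str) -> DFA {
    // Hypothetical thresholds: short words tolerate no typo,
    // medium words one, long words two.
    let typos: u8 = match query_word.chars().count() {
        0..=4 => 0,
        5..=8 => 1,
        _ => 2,
    };
    // `true` makes a transposition count as a single typo.
    LevenshteinAutomatonBuilder::new(typos, true).build_dfa(query_word)
}

fn main() {
    let dfa = build_dfa("kitchen");
    // One substitution away from "kitchen" still matches.
    assert!(matches!(dfa.eval("kitchan"), Distance::Exact(1)));
}
```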
Thanks to the power of the fst library [it is possible to union multiple automatons](https://docs.rs/fst/0.3.2/fst/map/struct.OpBuilder.html#method.union) on the same fst map; this allows us to know which [automaton returned a word, according to its index](https://github.com/Kerollmops/MeiliDB/blob/fc2cdf92596fc002ce278e3aa8718640ac44724d/src/metadata/ops.rs#L111). The `Stream` is able to return all the numbers associated with the words. We use these numbers to find the whole list of `DocIndexes` associated and do the union set operation.

Thanks to the power of the fst library [it is possible to union multiple automatons](https://docs.rs/fst/0.3.2/fst/map/struct.OpBuilder.html#method.union) on the same fst set. The `Stream` is able to return all the matching words. We use these words to find the whole list of `DocIndexes` associated.
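A simplified sketch of that union: MeiliDB unions the automaton streams directly on the fst, but collecting each automaton's matches into a `BTreeSet` shows the same idea (again using the `fst` crate's `levenshtein` feature).

```rust
use std::collections::BTreeSet;

use fst::automaton::Levenshtein;
use fst::{IntoStreamer, Set};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let set = Set::from_iter(vec!["django", "jango", "unchained"])?;

    let mut matching_words = BTreeSet::new();
    for word in ["django", "unchaind"] {
        // One automaton per query word; every word of the set
        // matching it joins the union.
        let lev = Levenshtein::new(word, 1)?;
        matching_words.extend(set.search(lev).into_stream().into_strs()?);
    }

    let expected: BTreeSet<String> = ["django", "jango", "unchained"]
        .into_iter()
        .map(String::from)
        .collect();
    assert_eq!(matching_words, expected);
    Ok(())
}
```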
With all this information it is possible [to reconstruct a list of all the DocIndexes associated](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/rank/query_builder.rs#L62-L99) with the words queried.

With all this information it is possible [to reconstruct a list of all the `DocIndexes` associated](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-core/src/query_builder.rs#L103-L130) with the words queried.
### Sort by criteria

Now that we are able to get a big list of [DocIndexes](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/lib.rs#L21-L36) it is not enough to sort them by criteria; we need more information, like the levenshtein distance or the fact that a query word matches exactly the word stored in the fst. So [we stuff it a little bit](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/rank/query_builder.rs#L86-L93), and aggregate all these [Matches](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/lib.rs#L47-L74) for each document. This way it will be easy to sort a simple vector of documents using a bunch of functions.

With this big list of documents and associated matches [we are able to sort only the part of the slice that we want](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/rank/query_builder.rs#L108-L119) using bucket sorting. [Each criterion](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/rank/criterion/mod.rs#L75-L87) is evaluated on each subslice without copying, thanks to [GroupByMut](https://github.com/Kerollmops/group-by/blob/cab857bae01463dbd0edb99b0e0d7f3624e6c6f5/src/lib.rs#L180-L185) which, I hope, [will soon be merged](https://github.com/rust-lang/rfcs/pull/2477).

Note that it is possible to customize the criteria used with the `QueryBuilder::with_criteria` constructor; this way you can implement some custom ranking based on the document attributes using the appropriate structure and the `retrieve_document` method.

### Retrieve original documents

The [DatabaseView](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/database_view.rs#L18-L24) structure that you must have created to be able to query the database has [two functions](https://github.com/Kerollmops/MeiliDB/blob/550dc1e99224e386516877450320f694947332d4/src/database/database_view.rs#L60-L76) that allow you to retrieve a full (or partial) document according to the schema you specified at creation time (i.e. the _STORED_ attributes).

As you can see, these functions force the created type `T` to implement [the serde Deserialize trait](https://docs.rs/serde/1.0.81/serde/trait.Deserialize.html); MeiliDB will use the `bincode::deserialize` function for each attribute to construct your type and return it to you.

With this big list of documents and associated matches [we are able to sort only the part of the slice that we want](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-core/src/query_builder.rs#L160-L188) using bucket sorting. [Each criterion](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-core/src/criterion/mod.rs#L95-L101) is evaluated on each subslice without copying, thanks to [GroupByMut](https://docs.rs/slice-group-by/0.2.4/slice_group_by/) which, I hope, [will soon be merged](https://github.com/rust-lang/rfcs/pull/2477).
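Here is a sketch of this bucket sort using the published slice-group-by crate; the `Document` type and the two criteria are simplified stand-ins for the real ones.

```rust
use slice_group_by::GroupByMut;

#[derive(Debug)]
struct Document {
    typos: u8,
    exact_matches: usize,
}

fn main() {
    let mut docs = vec![
        Document { typos: 1, exact_matches: 2 },
        Document { typos: 0, exact_matches: 1 },
        Document { typos: 0, exact_matches: 3 },
    ];

    // First criterion: fewer typos first.
    docs.sort_by_key(|doc| doc.typos);

    // Second criterion, evaluated in place on each subslice of
    // documents that the first criterion left equal.
    for bucket in docs.linear_group_by_mut(|a, b| a.typos == b.typos) {
        bucket.sort_by(|a, b| b.exact_matches.cmp(&a.exact_matches));
    }

    assert_eq!(docs[0].exact_matches, 3);
}
```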
Note that it is possible to customize the criteria used with the `QueryBuilder::with_criteria` constructor; this way you can implement some custom ranking based on the document attributes using the appropriate structure and the [`document` method](https://github.com/meilisearch/MeiliDB/blob/3db823de002243004612e36a19b4578d800dab97/meilidb-data/src/database/index.rs#L86).

At this point, MeiliDB's work is over 🎉
@@ -1,122 +0,0 @@
id,title,description,image
711158459,Sony PlayStation 4 (PS4) (Latest Model)- 500 GB Jet Black Console,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs2.ebaystatic.com/d/l225/m/mzvzEUIknaQclZ801YCY1ew.jpg
711158460,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs3.ebaystatic.com/d/l225/m/mJNDmSyIS3vUasKIJEBy4Cw.jpg
711158461,Sony PlayStation 4 PS4 500 GB Jet Black Console,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs4.ebaystatic.com/d/l225/m/m10NZXArmiIkpkTDDkAUVvA.jpg
711158462,Sony - PlayStation 4 500GB The Last of Us Remastered Bundle - Black,,http://thumbs2.ebaystatic.com/d/l225/m/mZZXTmAE8WZDH1l_E_PPAkg.jpg
711158463,Sony PlayStation 4 (PS4) (Latest Model)- 500 GB Jet Black Console,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs3.ebaystatic.com/d/l225/m/mzvzEUIknaQclZ801YCY1ew.jpg
711158464,Sony PlayStation 4 (PS4) (Latest Model)- 500 GB Jet Black Console,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs4.ebaystatic.com/d/l225/m/mzvzEUIknaQclZ801YCY1ew.jpg
711158465,BRAND NEW Sony PlayStation 4 BUNDLE 500gb,,http://thumbs4.ebaystatic.com/d/l225/m/m9TQTiWcWig7SeQh9algLZg.jpg
711158466,"Sony PlayStation 4 500GB, Dualshock Wireless Control, HDMI Gaming Console Refurb","The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs4.ebaystatic.com/d/l225/m/mTZYG5N6xWfBi4Ok03HmpMw.jpg
711158467,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console w/ 2 Controllers,,http://thumbs2.ebaystatic.com/d/l225/m/mX5Qphrygqeoi7tAH5eku2A.jpg
711158468,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console *NEW*,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs2.ebaystatic.com/d/l225/m/mGjN4IrJ0O8kKD_TYMWgGgQ.jpg
711158469,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console..wth Mortal Kombat X,,http://thumbs2.ebaystatic.com/d/l225/m/mrpqSNXwlnUVKnEscE4348w.jpg
711158470,Genuine SONY PS4 Playstation 4 500GB Gaming Console - Black,,http://thumbs4.ebaystatic.com/d/l225/m/myrPBFCpb4H5rHI8NyiS2zA.jpg
711158471,[Sony] Playstation 4 PS4 Video Game Console Black - Latest Model,,http://thumbs4.ebaystatic.com/d/l225/m/mce0c7mCuv3xpjllJXx093w.jpg
711158472,Sony PlayStation 4 (Latest Model) 500 GB Jet Black Console,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs2.ebaystatic.com/d/l225/m/miVSA1xPO5fCNdYzEMc8rSQ.jpg
711158473,Sony PlayStation 4 - 500 GB Jet Black Console - WITH LAST OF US REMASTERED,,http://thumbs2.ebaystatic.com/d/l225/m/mLjnOxv2GWkrkCtgsDGhJ6A.jpg
711158474,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console,,http://thumbs3.ebaystatic.com/d/l225/m/mjMittBaXmm_n4AMpETBXhQ.jpg
711158475,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console,,http://thumbs2.ebaystatic.com/d/l225/m/m1n1qrJ7-VGbe7xQvGdeD6Q.jpg
711158476,"Sony PlayStation 4 - 500 GB Jet Black Console (3 controllers,3 games included)",,http://thumbs3.ebaystatic.com/d/l225/m/mIoGIj9FZG7HoEVkPlnyizA.jpg
711158477,Sony PlayStation 4 500GB Console with 2 Controllers,"The PlayStation 4 system opens the door to an incredible journey through immersive new gaming worlds and a deeply connected gaming community. Step into living, breathing worlds where you are hero of your epic journey. Explore gritty urban environments, vast galactic landscapes, and fantastic historical settings brought to life on an epic scale, without limits. With an astounding launch lineup and over 180 games in development the PS4 system offers more top-tier blockbusters and inventive indie hits than any other next-gen console. The PS4 system is developer inspired, gamer focused. The PS4 system learns how you play and intuitively curates the content you use most often. Fire it up, and your PS4 system points the way to new, amazing experiences you can jump into alone or with friends. Create your own legend using a sophisticated, intuitive network built for gamers. Broadcast your gameplay live and direct to the world, complete with your commentary. Or immortalize your most epic moments and share at the press of a button. Access the best in music, movies, sports and television. PS4 system doesn t require a membership fee to access your digital entertainment subscriptions. You get the full spectrum of entertainment that matters to you on the PS4 system. PlayStation 4: The Best Place to Play The PlayStation 4 system provides dynamic, connected gaming, powerful graphics and speed, intelligent personalization, deeply integrated social capabilities, and innovative second-screen features. Combining unparalleled content, immersive gaming experiences, all of your favorite digital entertainment apps, and PlayStation exclusives, the PS4 system focuses on the gamers.Gamer Focused, Developer InspiredThe PS4 system focuses on the gamer, ensuring that the very best games and the most immersive experiences are possible on the platform.<br>Read more about the PS4 on ebay guides.</br>",http://thumbs2.ebaystatic.com/d/l225/m/m4fuJ5Ibrj450-TZ83FAkIQ.jpg
711158478,Sony - PlayStation 4 500GB The Last of Us Remastered Bundle - Black,,http://thumbs3.ebaystatic.com/d/l225/m/mzXSIw8Hlnff8IjXJQrXJSw.jpg
711158479,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console,,http://thumbs2.ebaystatic.com/d/l225/m/m-9S63CgFoUijY3ZTyNs3KA.jpg
711158480,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console,,http://thumbs1.ebaystatic.com/d/l225/m/mdF9Bisg9wXjv_R9Y_13MWw.jpg
711158481,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console*,,http://thumbs1.ebaystatic.com/d/l225/m/m4_OQHMmIOCa8uEkBepRR5A.jpg
711158482,Sony PlayStation 4 (Latest Model)- 500 GB Jet Black Console,,http://thumbs2.ebaystatic.com/d/l225/m/mZ0nR8iz-QAfLssJZMp3L5Q.jpg
711158483,[Sony] Playstation 4 PS4 1105A Video Game Console 500GB White - Latest Model,,http://thumbs4.ebaystatic.com/d/l225/m/m8iTz5cLQLNjD9D3O2jT3IQ.jpg
711158484,NEW! Clinique Repairwear Laser Focus Wrinkle Correcting Eye Cream 5ml,,http://thumbs2.ebaystatic.com/d/l225/m/mrraWCpvP5YKk5rYgotVDLg.jpg
711158485,Obagi Elastiderm Eye Treatment Cream 0.5 oz / 15g Authentic NiB Sealed [5],,http://thumbs1.ebaystatic.com/d/l225/m/mJ4ekz6_bDT5G7wYtjM-qRg.jpg
711158486,Lancome Renergie Eye Anti-Wrinkle & Firming Eye Cream 0.5oz New,,http://thumbs2.ebaystatic.com/d/l225/m/mxwwyDQraZ-TEtr_Y6qRi7Q.jpg
711158487,OZ Naturals - The BEST Eye Gel - Eye Cream For Dark Circles Puffiness and,,http://thumbs2.ebaystatic.com/d/l225/m/mk2Z-hX5sT4kUxfG6g_KFpg.jpg
711158488,Elastiderm Eye Cream (0.5oz/15g),,http://thumbs3.ebaystatic.com/d/l225/m/mHxb5WUc5MtGzCT2UXgY_hg.jpg
711158489,new CLINIQUE Repairwear Laser Focus Wrinkle Correcting Eye Cream 0.17 oz/ 5 ml,,http://thumbs1.ebaystatic.com/d/l225/m/mQSX2wfrSeGy3uA8Q4SbOKw.jpg
711158490,NIB Full Size Dermalogica Multivitamin Power Firm Eye Cream,,http://thumbs4.ebaystatic.com/d/l225/m/m2hxo12e5NjXgGiKIaCvTLA.jpg
711158491,24K Gold Collagen Anti-Dark Circles Anti-Aging Bio Essence Repairing Eye Cream,,http://thumbs4.ebaystatic.com/d/l225/m/mt96efUK5cPAe60B9aGmgMA.jpg
711158492,Clinique Repairwear Laser Focus Wrinkle Correcting Eye Cream Full Size .5oz 15mL,,http://thumbs3.ebaystatic.com/d/l225/m/mZyV3wKejCMx9RrnC8X-eMw.jpg
711158493,NEW! Clinique Repairwear Laser Focus Wrinkle Correcting Eye Cream 5ml,,http://thumbs4.ebaystatic.com/d/l225/m/m9hX_z_DFnbNCTh0VFv3KcQ.jpg
711158494,3 Clinique Repairwear Laser Focus Wrinkle Correcting Eye Cream .17 oz/5 ml Each,,http://thumbs1.ebaystatic.com/d/l225/m/mYiHsrGffCg_qgkTbUWZU1A.jpg
711158495,Lancome High Resolution Eye Cream .95 Oz Refill-3X .25 Oz Plus .20 Oz Lot,,http://thumbs1.ebaystatic.com/d/l225/m/mFuQxKoEKQ6wtk2bGxfKwow.jpg
711158496,NEW! Clinique Repairwear Laser Focus Wrinkle Correcting Eye Cream 5ml,,http://thumbs4.ebaystatic.com/d/l225/m/mLBRCDiELUnYos-vFmIcc7A.jpg
711158497,Neutrogena Rapid Wrinkle Repair Eye Cream -0.5 Oz. -New-,,http://thumbs4.ebaystatic.com/d/l225/m/mE1RWpCOxkCGuuiJBX6HiBQ.jpg
711158498,20g Snail Repair Eye Cream Natural Anti-Dark Circles Puffiness Aging Wrinkles,,http://thumbs4.ebaystatic.com/d/l225/m/mh4gBNzINDwds_r778sJRjg.jpg
711158499,Vichy-Neovadiol GF Eye & Lip Contour Cream 0.5 Fl. Oz,,http://thumbs4.ebaystatic.com/d/l225/m/m_6f0ofCm7PTzuithYuZx3w.jpg
711158500,Obagi Elastiderm Eye Cream 0.5 oz. New In Box. 100% Authentic! New Packaging!,,http://thumbs2.ebaystatic.com/d/l225/m/ma0PK-ASBXUiHERR19MyImA.jpg
711158501,NEW! Clinique Repairwear Laser Focus Wrinkle Correcting Eye Cream .17oz / 5ml,,http://thumbs3.ebaystatic.com/d/l225/m/m72NaXYlcXcEeqQFKWvsdZA.jpg
711158502,Kiehl's CREAMY EYE TREATMENT cream with AVOCADO 0.5 oz FULL SIZE,,http://thumbs3.ebaystatic.com/d/l225/m/mOI407HnILb_tf-RgdvfYyA.jpg
711158503,Clinique repairwear laser focus wrinkle correcting eye cream .5 oz 15ml,,http://thumbs4.ebaystatic.com/d/l225/m/mQwNVst3bYG6QXouubmLaJg.jpg
711158504,Caudalie Premier Cru The Eye Cream La Creme New Anti Aging Eye Treatment,,http://thumbs1.ebaystatic.com/d/l225/m/mM4hPTAWXeOjovNk9s_Cqag.jpg
711158505,Jeunesse Instantly Ageless -- New Box Of 50 Sachets -- Eye - Face Wrinkle Cream,,http://thumbs2.ebaystatic.com/d/l225/m/m5EfWbi6ZYs4JpYcsl0Ubaw.jpg
711158506,VELOUR SKIN EYE CREAM .5 FL OZ 15ML NEW NIP ANTI-AGING WRINKLE CREAM,,http://thumbs1.ebaystatic.com/d/l225/m/m2uEf6q1yASH8FkWqYdOv1w.jpg
711158507,Shiseido White Lucent Anti-Dark Circles/Puffiness Eye Cream 15ml/.53oz Full Size,,http://thumbs1.ebaystatic.com/d/l225/m/m_CtzoqU2Vgv4GKx8ONS6qw.jpg
711158508,Murad Resurgence Renewing Eye Cream Anti-Aging .25 oz NEW Dark Circles Wrinkle,,http://thumbs1.ebaystatic.com/d/l225/m/mhWJC10iowgUDGm4KMQKNMg.jpg
711158509,D-Link DIR-615 300Mbps Wireless-N Router 4-Port w/Firewall,,http://thumbs3.ebaystatic.com/d/l225/m/mdSBH9ROXRn3TBb8OFDT6jA.jpg
711158510,Triton MOF001 2 1/4hp dual mode precision Router. New!! *3 day auction*,,http://thumbs1.ebaystatic.com/d/l225/m/mozWd2SBskbDBlWAKsMlVew.jpg
711158511,Porter-Cable 3-1/4 HP Five-Speed Router 7518 - Power Tools Routers,,http://thumbs2.ebaystatic.com/d/l225/m/mpZDTXpiyesDrZh_FLMyqXQ.jpg
711158512,Linksys EA6900 AC1900 Wi-Fi Wireless Router Dual Band with Gigabit &USB 3.0 Port,,http://thumbs4.ebaystatic.com/d/l225/m/m3OfBSnHBDhhs_Ve-DSBKQw.jpg
711158513,Linksys EA6500 1300 Mbps 4-Port Gigabit Wireless AC Router,,http://thumbs1.ebaystatic.com/d/l225/m/m7cfymJPc7CLADoTiEYFzwA.jpg
711158514,Makita RT0700CX3 1-1/4 Horsepower Compact Router Kit / Trimmer NEW,,http://thumbs2.ebaystatic.com/d/l225/m/mr-F3rCxDYsLcj8hnmaRN4A.jpg
711158515,NETGEAR R6250 AC1600 Smart WiFi Dual Band Gigabit Router 802.11ac 300 1300 Mbps,,http://thumbs4.ebaystatic.com/d/l225/m/mc8Ic8Cq2lPqPnjNGAQBBCQ.jpg
711158516,NETGEAR Nighthawk AC1900 Dual Band Wi-Fi Gigabit Router (R7000) BRAND NEW SEALED,,http://thumbs3.ebaystatic.com/d/l225/m/mdL34EQi0l-Kg-DlvF6wpqA.jpg
711158517,Netgear WNDR3400 N600 Wireless Dual Band Router (WNDR3400-100),,http://thumbs4.ebaystatic.com/d/l225/m/mKr4cNk6utJXSdVYXzwrScQ.jpg
711158518,Netgear N600 300 Mbps 4-Port 10/100 Wireless N Router (WNDR3400),,http://thumbs2.ebaystatic.com/d/l225/m/mUPdyhbW9pzEm1VbqX0YudA.jpg
711158519,NETGEAR N600 WNDR3400 Wireless Dual Band Router F/S,,http://thumbs1.ebaystatic.com/d/l225/m/my55jF5kHnG9ipzFycnjooA.jpg
711158520,Netgear NIGHTHAWK AC1900 1300 Mbps 4-Port Gigabit Wireless AC Router (R7000),,http://thumbs3.ebaystatic.com/d/l225/m/mrPLRTnWx_JXLNIp5pCBnzQ.jpg
711158521,Netgear N900 450 Mbps 4-Port Gigabit Wireless N Router (WNDR4500),,http://thumbs2.ebaystatic.com/d/l225/m/mXBL01faHlHm7Ukh188t3yQ.jpg
711158522,Netgear R6300V2 AC1750 1300 Mbps 4-Port Gigabit Wireless AC Router,,http://thumbs1.ebaystatic.com/d/l225/m/mTdnFB9Z71efYJ9I5-k186w.jpg
711158523,Makita RT0701C 1-1/4 HP Compact Router With FACTORY WARRANTY!!!,,http://thumbs2.ebaystatic.com/d/l225/m/m7AA4k3MzYFJcTlBrT3DwhA.jpg
711158524,"CISCO LINKSYS EA4500 DUAL-BAND N9000 WIRELESS ROUTER, 802.11N, UP TO 450 MBPs",,http://thumbs4.ebaystatic.com/d/l225/m/mwfVIXD3dZYt_qpHyprd7hg.jpg
711158525,Netgear N300 v.3 300 Mbps 5-Port 10/100 Wireless N Router (WNR2000),,http://thumbs4.ebaystatic.com/d/l225/m/mopRjvnZwbsVH9euqGov5kw.jpg
711158526,Netgear Nighthawk R7000 2330 Mbps 4-Port Gigabit Wireless N Router...,,http://thumbs4.ebaystatic.com/d/l225/m/mns82UY4FfqYXPgqrpJ9Bzw.jpg
711158527,Netgear N900 450 Mbps 4-Port Gigabit Wireless N Router R4500 ~ FreE ShiPPinG ~,,http://thumbs1.ebaystatic.com/d/l225/m/m_o0mSRmySgJUuqHYDIQiuA.jpg
711158528,D-Link Wireless Router Model DIR-625,,http://thumbs2.ebaystatic.com/d/l225/m/mYPXwZMlDUjOQ3Sm3EtU37Q.jpg
711158529,D-Link DIR-657 300 Mbps 4-Port Gigabit Wireless N Router Hd Media Router 1000,"Stream multiple media content - videos, music and more to multiple devices all at the same time without lag or skipping. The HD Fuel technology in the DIR-657 lets you watch Netflix and Vudu , play your Wii or Xbox 360 online or make Skype calls all without worrying about the skipping or latency you might experience with standard routers. It does so by automatically giving extra bandwidth for video, gaming and VoIP calls using HD Fuel QoS technology. The D-Link HD Media Router 1000(DIR-657) also comes equipped with 4 Gigabit ports to provide speeds up to 10x faster than standard 10/100 ports. What s more, it uses 802.11n technology with multiple intelligent antennas to maximize the speed and range of your wireless signal to significantly outperform 802.11g devices.",http://thumbs1.ebaystatic.com/d/l225/m/m0xyPdWrdVKe7By4QFouVeA.jpg
711158530,D-Link DIR-860L AC1200 4-Port Cloud Router Gigabit Wireless 802.11 AC,,http://thumbs3.ebaystatic.com/d/l225/m/mk4KNj6oLm7863qCS-TqmbQ.jpg
711158531,D-Link DIR-862L Wireless AC1600 Dual Band Gigabit Router,,http://thumbs2.ebaystatic.com/d/l225/m/m6Arw8kaZ4EUbyKjHtJZLkA.jpg
711158532,LINKSYS AC1600 DUAL BAND SMART WI-FI ROUTER EA6400 BRAND NEW,,http://thumbs3.ebaystatic.com/d/l225/m/mdK7igTS7_TDD7ajfVqj-_w.jpg
711158533,Netgear AC1900 1300 Mbps 4-Port Gigabit Wireless AC Router (R7000),,http://thumbs4.ebaystatic.com/d/l225/m/mdL34EQi0l-Kg-DlvF6wpqA.jpg
711158534,Panasonic ES-LA63 Cordless Rechargeable Men's Electric Shaver,,http://thumbs3.ebaystatic.com/d/l225/m/mzKKlCxbADObevcgoNjbXRg.jpg
|
||||
711158535,Panasonic ARC 5 Best Mens Shaver,,http://thumbs4.ebaystatic.com/d/l225/m/mt34Y-u0okj-SqQm8Ng_rbQ.jpg
|
||||
711158536,Panasonic Es8092 Wet Dry Electric Razor Shaver Cordless,,http://thumbs3.ebaystatic.com/d/l225/m/mlIxTz1LsVjXiZz2CzDquJw.jpg
|
||||
711158537,Panasonic ARC4 ES-RF31-s Rechargeable Electric Shaver Wet/dry 4 Nanotech Blade,"Made for folks who need a great shave, the Panasonic electric shaver is convenient and consistent. Featuring an ergonomic design, this Panasonic ES-RF31-S is ideal for keeping a stubble-free face, so you can retain wonderfully smooth skin. With the precision blades included on the Panasonic electric shaver, you can get smooth shaves with every use. As this men's electric shaver features a gentle shaving mechanism, you can help avoid burning sensations on tender skin. Make sure you consistently get multiple perfect shaves without depleting the power with the exceptional shave time typical of this Panasonic ES-RF31-S.",http://thumbs1.ebaystatic.com/d/l225/m/mi4QM99Jq4oma5WLAL0K7Wg.jpg
|
||||
711158538,"Panasonic ES3831K Single Blade Travel Shaver, Black New","Strong and trustworthy, the Panasonic electric shaver is built for folks who are worried about a wonderful shave every day. This Panasonic ES3833S is just right for taming your beard, with an easy-to-maneuver design, so you can retain wonderfully soft skin. Spend as much time as you need getting a complete shave by making use of the outstanding shave time typical of the Panasonic electric shaver. Moreover, this men's electric shaver includes precision foil blades, so you can get wonderful shaves over a prolonged period. With the gentle shaving mechanism on this Panasonic ES3833S, you can help avoid burning sensations on tender skin.",http://thumbs3.ebaystatic.com/d/l225/m/mfqMoj4xDlBFXp1ZznxCGbQ.jpg
|
||||
711158539,Panasonic ES8103S Arc3 Electric Shaver Wet/Dry with Nanotech Blades for Men,,http://thumbs1.ebaystatic.com/d/l225/m/myaZLqzt3I7O-3xXxsJ_4fQ.jpg
|
||||
711158540,Panasonic ES8103S Arc3 Electric Shaver Wet/Dry with Nanotech Blades,,http://thumbs1.ebaystatic.com/d/l225/m/mcrO4BkjBkM78XHm-aClRGg.jpg
|
||||
711158543,Panasonic ES3831K Single Blade Wet & Dry Travel Shaver - New & Sealed,,http://thumbs4.ebaystatic.com/d/l225/m/mqWDU2mHsFWAuGosMIGcIMg.jpg
|
||||
711158544,Panasonic ES8103S Arc 3 E W/O POUCH & MANUAL Men's Wet/Dry Rechargeable Shaver,,http://thumbs2.ebaystatic.com/d/l225/m/mZXgTj-fQfcgAlzOGQYkqFw.jpg
|
||||
711158545,PANASONIC ES3831K Pro-Curve Battery Operated Travel Wet/Dry Shaver,,http://thumbs1.ebaystatic.com/d/l225/m/m8McQMCfgdp50trM_YJ88cw.jpg
|
||||
711158546,PANASONIC ARC3 ES-LT33-S WET DRY WASHABLE RECHARGEABLE MEN'S ELECTRIC SHAVER NIB,,http://thumbs1.ebaystatic.com/d/l225/m/m9yUif5xyhGfh7Ag-_fcLdA.jpg
|
||||
711158547,Panasonic ES-LV81-k Arc 5 Wet & Dry Rechargeable Men's Foil Shaver New,,http://thumbs1.ebaystatic.com/d/l225/m/mEfZHzDoKrH4DBfU8e_K93A.jpg
|
||||
711158548,"NEW Panasonic ES-RF31-S 4 Blade Men's Electric Razor Wet/Dry, Factory Sealed",,http://thumbs2.ebaystatic.com/d/l225/m/mfhMhMoDkrGtqWW_IyqVGuQ.jpg
|
||||
711158549,Panasonic ES8243A E Arc4 Men's Electric Shaver Wet/Dry,,http://thumbs4.ebaystatic.com/d/l225/m/mcxFUwt3FrGEEPzT7cfQn7w.jpg
711158550,Panasonic ES-3833 Wet/Dry Men Shaver Razor Battery Operate Compact Travel ES3833,,http://thumbs2.ebaystatic.com/d/l225/m/mAqa9pHisKsLSk5nqMg4JJQ.jpg
711158551,Panasonic Pro-Curve ES3831K Shaver - Dry/Wet Technology - Stainless Steel Foil,,http://thumbs3.ebaystatic.com/d/l225/m/mGqD8eGIwseT5nsM53W3uRQ.jpg
711158552,Panasonic Wet and Dry Shaver - ES-RW30s ES-RW30-S,"The Panasonic electric shaver is well-suited to shielding particularly sensitive skin and providing a smooth shave. It's both trustworthy and transportable. Because this Panasonic ES-RW30-S has a gentle shaving mechanism, you can avoid irritation and raw feeling skin in particularly tender areas. The Panasonic electric shaver is ideal for ridding yourself of stubble, with its special design, so you can sustain wonderfully supple skin. The exceptional shave time featured on this men's electric shaver helps you to make sure you consistently receive many complete shaves without depleting the power. Plus, this Panasonic ES-RW30-S features precision blades, so you can enjoy smooth shaves for months on end.",http://thumbs1.ebaystatic.com/d/l225/m/mvPElpjXmgo0NhP-P5F8LlQ.jpg
711158553,Panasonic ES-LF51-A Arc4 Electric Shaver Wet/Dry with Flexible Pivoting Head,,http://thumbs3.ebaystatic.com/d/l225/m/mC_zAQrMQKPLHdENU7N3UjQ.jpg
711158554,Panasonic ES8103S Arc3 Men's Electric Shaver Wet/Dry with Nanotech Blades,,http://thumbs3.ebaystatic.com/d/l225/m/moBByNwPn93-g-oBBceS2kw.jpg
711158555,panasonic ARC3 shaver es8103s,,http://thumbs1.ebaystatic.com/d/l225/m/mJlAp6t6OMIOaYgKnyelIMg.jpg
711158556,Panasonic ES-534 Men's Electric Shaver New ES534 Battery Operated Compact Travel,,http://thumbs3.ebaystatic.com/d/l225/m/mDr2kpZLVSdy1KTPVYK2YUg.jpg
711158557,Panasonic Portable Shaving Machine Cclippers Washable Single Blade Shaver+Brush,,http://thumbs3.ebaystatic.com/d/l225/m/mJdzJPoOALps0Lv4WtW2b0A.jpg
711158559,Baratza Solis Maestro Conical Burr Coffee Bean Grinder Works Great Nice Cond,,http://thumbs4.ebaystatic.com/d/l225/m/mdjbD7YFR6JRq-pkeajhK7w.jpg
711158560,Proctor Silex Fresh Grind Electric Coffee Bean Grinder White,,http://thumbs4.ebaystatic.com/d/l225/m/mtXoRn5Ytmqz0GLHYmBUxpA.jpg
711158561,Cuisinart 8-oz. Supreme Grind Automatic Burr Coffee Grinder,,http://thumbs4.ebaystatic.com/d/l225/m/my_9cXPvwwRVFqo6MXWfpag.jpg
19
examples/kaggle/schema-kaggle.toml
Normal file
@ -0,0 +1,19 @@
# This schema has been generated ...
# The order in which the attributes are declared is important,
# it specifies the attribute xxx...

identifier = "id"

[attributes.id]
stored = true

[attributes.title]
stored = true
indexed = true

[attributes.description]
stored = true
indexed = true

[attributes.image]
stored = true
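For context, a schema like the one above is what the examples hand to meilidb-data when an index is created from the kaggle CSV. A minimal, hedged sketch of loading it; the `Schema::from_toml` constructor is an assumption here (the diff below only shows the binary `read_from_bin`/`write_to_bin` pair):

use std::fs::File;

use meilidb_data::Schema; // assumed re-export; the diff uses crate::Schema internally

fn load_schema() -> Result<Schema, Box<dyn std::error::Error>> {
    // Attribute order matters (see the comment in the TOML above), which is
    // why the crates enable the "preserve_order" feature of their parsers.
    let file = File::open("examples/kaggle/schema-kaggle.toml")?;
    let schema = Schema::from_toml(file)?; // hypothetical reader API
    Ok(schema)
}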
@ -14,6 +14,7 @@ rayon = "1.0.3"
sdset = "0.3.1"
serde = { version = "1.0.88", features = ["derive"] }
slice-group-by = "0.2.4"
zerocopy = "0.2.2"

[dependencies.fst]
git = "https://github.com/Kerollmops/fst.git"
@ -9,4 +9,8 @@ impl Criterion for DocumentId {
    fn evaluate(&self, lhs: &RawDocument, rhs: &RawDocument) -> Ordering {
        lhs.id.cmp(&rhs.id)
    }

    fn name(&self) -> &'static str {
        "DocumentId"
    }
}
@ -36,4 +36,30 @@ impl Criterion for Exact {

        lhs.cmp(&rhs).reverse()
    }

    fn name(&self) -> &'static str {
        "Exact"
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // typing: "soulier"
    //
    // doc0: "Soulier bleu"
    // doc1: "souliereres rouge"
    #[test]
    fn easy_case() {
        let query_index0 = &[0];
        let is_exact0 = &[true];

        let query_index1 = &[0];
        let is_exact1 = &[false];

        let doc0 = number_exact_matches(query_index0, is_exact0);
        let doc1 = number_exact_matches(query_index1, is_exact1);
        assert_eq!(doc0.cmp(&doc1).reverse(), Ordering::Less);
    }
}
@ -22,6 +22,9 @@ pub use self::{
pub trait Criterion: Send + Sync {
    fn evaluate(&self, lhs: &RawDocument, rhs: &RawDocument) -> Ordering;

    #[inline]
    fn name(&self) -> &'static str;

    #[inline]
    fn eq(&self, lhs: &RawDocument, rhs: &RawDocument) -> bool {
        self.evaluate(lhs, rhs) == Ordering::Equal
@ -33,6 +36,10 @@ impl<'a, T: Criterion + ?Sized + Send + Sync> Criterion for &'a T {
        (**self).evaluate(lhs, rhs)
    }

    fn name(&self) -> &'static str {
        (**self).name()
    }

    fn eq(&self, lhs: &RawDocument, rhs: &RawDocument) -> bool {
        (**self).eq(lhs, rhs)
    }
@ -43,6 +50,10 @@ impl<T: Criterion + ?Sized> Criterion for Box<T> {
        (**self).evaluate(lhs, rhs)
    }

    fn name(&self) -> &'static str {
        (**self).name()
    }

    fn eq(&self, lhs: &RawDocument, rhs: &RawDocument) -> bool {
        (**self).eq(lhs, rhs)
    }
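Because `name` has no default body, every custom criterion must now provide it as well. A minimal sketch of a hypothetical external criterion; it assumes `RawDocument` and its public `id` field are reachable from the crate root, as the built-in `DocumentId` criterion above suggests:

use std::cmp::Ordering;

use meilidb_core::criterion::Criterion;
use meilidb_core::RawDocument;

// Hypothetical custom criterion: order documents by descending identifier.
struct ReverseDocumentId;

impl Criterion for ReverseDocumentId {
    fn evaluate(&self, lhs: &RawDocument, rhs: &RawDocument) -> Ordering {
        lhs.id.cmp(&rhs.id).reverse()
    }

    // The new mandatory method; this name is what shows up in the
    // per-criterion timing logs added to the query builder further down.
    fn name(&self) -> &'static str {
        "ReverseDocumentId"
    }
}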
@ -24,4 +24,8 @@ impl Criterion for NumberOfWords {

        lhs.cmp(&rhs).reverse()
    }

    fn name(&self) -> &'static str {
        "NumberOfWords"
    }
}
@ -53,6 +53,10 @@ impl Criterion for SumOfTypos {

        lhs.cmp(&rhs).reverse()
    }

    fn name(&self) -> &'static str {
        "SumOfTypos"
    }
}

#[cfg(test)]
@ -35,4 +35,30 @@ impl Criterion for SumOfWordsAttribute {

        lhs.cmp(&rhs)
    }

    fn name(&self) -> &'static str {
        "SumOfWordsAttribute"
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // typing: "soulier"
    //
    // doc0: { 0. "Soulier bleu", 1. "bla bla bla" }
    // doc1: { 0. "Botte rouge", 1. "Soulier en cuir" }
    #[test]
    fn title_vs_description() {
        let query_index0 = &[0];
        let attribute0 = &[0];

        let query_index1 = &[0];
        let attribute1 = &[1];

        let doc0 = sum_matches_attributes(query_index0, attribute0);
        let doc1 = sum_matches_attributes(query_index1, attribute1);
        assert_eq!(doc0.cmp(&doc1), Ordering::Less);
    }
}
@ -35,4 +35,30 @@ impl Criterion for SumOfWordsPosition {

        lhs.cmp(&rhs)
    }

    fn name(&self) -> &'static str {
        "SumOfWordsPosition"
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    // typing: "soulier"
    //
    // doc0: "Soulier bleu"
    // doc1: "Botte rouge et soulier noir"
    #[test]
    fn easy_case() {
        let query_index0 = &[0];
        let word_index0 = &[0];

        let query_index1 = &[0];
        let word_index1 = &[3];

        let doc0 = sum_matches_attribute_index(query_index0, word_index0);
        let doc1 = sum_matches_attribute_index(query_index1, word_index1);
        assert_eq!(doc0.cmp(&doc1), Ordering::Less);
    }
}
@ -98,6 +98,10 @@ impl Criterion for WordsProximity {

        lhs.cmp(&rhs)
    }

    fn name(&self) -> &'static str {
        "WordsProximity"
    }
}

#[cfg(test)]
@ -1,61 +0,0 @@
use std::slice::from_raw_parts;
use std::mem::size_of;
use std::error::Error;

use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
use sdset::Set;

use crate::shared_data_cursor::{SharedDataCursor, FromSharedDataCursor};
use crate::write_to_bytes::WriteToBytes;
use crate::data::SharedData;
use crate::DocumentId;

use super::into_u8_slice;

#[derive(Default, Clone)]
pub struct DocIds(SharedData);

impl DocIds {
    pub fn new(ids: &Set<DocumentId>) -> DocIds {
        let bytes = unsafe { into_u8_slice(ids.as_slice()) };
        let data = SharedData::from_bytes(bytes.to_vec());
        DocIds(data)
    }

    pub fn is_empty(&self) -> bool {
        self.0.is_empty()
    }

    pub fn as_bytes(&self) -> &[u8] {
        &self.0
    }
}

impl AsRef<Set<DocumentId>> for DocIds {
    fn as_ref(&self) -> &Set<DocumentId> {
        let slice = &self.0;
        let ptr = slice.as_ptr() as *const DocumentId;
        let len = slice.len() / size_of::<DocumentId>();
        let slice = unsafe { from_raw_parts(ptr, len) };
        Set::new_unchecked(slice)
    }
}

impl FromSharedDataCursor for DocIds {
    type Error = Box<Error>;

    fn from_shared_data_cursor(cursor: &mut SharedDataCursor) -> Result<DocIds, Self::Error> {
        let len = cursor.read_u64::<LittleEndian>()? as usize;
        let data = cursor.extract(len);

        Ok(DocIds(data))
    }
}

impl WriteToBytes for DocIds {
    fn write_to_bytes(&self, bytes: &mut Vec<u8>) {
        let len = self.0.len() as u64;
        bytes.write_u64::<LittleEndian>(len).unwrap();
        bytes.extend_from_slice(&self.0);
    }
}
@ -1,231 +0,0 @@
use std::io::{self, Write};
use std::slice::from_raw_parts;
use std::mem::size_of;
use std::ops::Index;

use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
use sdset::Set;

use crate::shared_data_cursor::{SharedDataCursor, FromSharedDataCursor};
use crate::write_to_bytes::WriteToBytes;
use crate::data::SharedData;
use crate::DocIndex;

use super::into_u8_slice;

#[derive(Debug)]
#[repr(C)]
struct Range {
    start: u64,
    end: u64,
}

#[derive(Clone, Default)]
pub struct DocIndexes {
    ranges: SharedData,
    indexes: SharedData,
}

impl DocIndexes {
    pub fn get(&self, index: usize) -> Option<&Set<DocIndex>> {
        self.ranges().get(index).map(|Range { start, end }| {
            let start = *start as usize;
            let end = *end as usize;
            let slice = &self.indexes()[start..end];
            Set::new_unchecked(slice)
        })
    }

    fn ranges(&self) -> &[Range] {
        let slice = &self.ranges;
        let ptr = slice.as_ptr() as *const Range;
        let len = slice.len() / size_of::<Range>();
        unsafe { from_raw_parts(ptr, len) }
    }

    fn indexes(&self) -> &[DocIndex] {
        let slice = &self.indexes;
        let ptr = slice.as_ptr() as *const DocIndex;
        let len = slice.len() / size_of::<DocIndex>();
        unsafe { from_raw_parts(ptr, len) }
    }
}

impl Index<usize> for DocIndexes {
    type Output = [DocIndex];

    fn index(&self, index: usize) -> &Self::Output {
        match self.get(index) {
            Some(indexes) => indexes,
            None => panic!("index {} out of range for a maximum of {} ranges", index, self.ranges().len()),
        }
    }
}

impl FromSharedDataCursor for DocIndexes {
    type Error = io::Error;

    fn from_shared_data_cursor(cursor: &mut SharedDataCursor) -> Result<DocIndexes, Self::Error> {
        let len = cursor.read_u64::<LittleEndian>()? as usize;
        let ranges = cursor.extract(len);

        let len = cursor.read_u64::<LittleEndian>()? as usize;
        let indexes = cursor.extract(len);

        Ok(DocIndexes { ranges, indexes })
    }
}

impl WriteToBytes for DocIndexes {
    fn write_to_bytes(&self, bytes: &mut Vec<u8>) {
        let ranges_len = self.ranges.len() as u64;
        let _ = bytes.write_u64::<LittleEndian>(ranges_len);
        bytes.extend_from_slice(&self.ranges);

        let indexes_len = self.indexes.len() as u64;
        let _ = bytes.write_u64::<LittleEndian>(indexes_len);
        bytes.extend_from_slice(&self.indexes);
    }
}

pub struct DocIndexesBuilder<W> {
    ranges: Vec<Range>,
    indexes: Vec<DocIndex>,
    wtr: W,
}

impl DocIndexesBuilder<Vec<u8>> {
    pub fn memory() -> Self {
        DocIndexesBuilder {
            ranges: Vec::new(),
            indexes: Vec::new(),
            wtr: Vec::new(),
        }
    }
}

impl<W: Write> DocIndexesBuilder<W> {
    pub fn new(wtr: W) -> Self {
        DocIndexesBuilder {
            ranges: Vec::new(),
            indexes: Vec::new(),
            wtr: wtr,
        }
    }

    pub fn insert(&mut self, indexes: &Set<DocIndex>) {
        let len = indexes.len() as u64;
        let start = self.ranges.last().map(|r| r.end).unwrap_or(0);
        let range = Range { start, end: start + len };
        self.ranges.push(range);

        self.indexes.extend_from_slice(indexes);
    }

    pub fn finish(self) -> io::Result<()> {
        self.into_inner().map(drop)
    }

    pub fn into_inner(mut self) -> io::Result<W> {
        let ranges = unsafe { into_u8_slice(&self.ranges) };
        let len = ranges.len() as u64;
        self.wtr.write_u64::<LittleEndian>(len)?;
        self.wtr.write_all(ranges)?;

        let indexes = unsafe { into_u8_slice(&self.indexes) };
        let len = indexes.len() as u64;
        self.wtr.write_u64::<LittleEndian>(len)?;
        self.wtr.write_all(indexes)?;

        Ok(self.wtr)
    }
}

#[cfg(test)]
mod tests {
    use std::error::Error;
    use crate::DocumentId;
    use super::*;

    #[test]
    fn builder_serialize_deserialize() -> Result<(), Box<Error>> {
        let a = DocIndex {
            document_id: DocumentId(0),
            attribute: 3,
            word_index: 11,
            char_index: 30,
            char_length: 4,
        };
        let b = DocIndex {
            document_id: DocumentId(1),
            attribute: 4,
            word_index: 21,
            char_index: 35,
            char_length: 6,
        };
        let c = DocIndex {
            document_id: DocumentId(2),
            attribute: 8,
            word_index: 2,
            char_index: 89,
            char_length: 6,
        };

        let mut builder = DocIndexesBuilder::memory();

        builder.insert(Set::new(&[a])?);
        builder.insert(Set::new(&[a, b, c])?);
        builder.insert(Set::new(&[a, c])?);

        let bytes = builder.into_inner()?;
        let docs = DocIndexes::from_bytes(bytes)?;

        assert_eq!(docs.get(0), Some(Set::new(&[a])?));
        assert_eq!(docs.get(1), Some(Set::new(&[a, b, c])?));
        assert_eq!(docs.get(2), Some(Set::new(&[a, c])?));
        assert_eq!(docs.get(3), None);

        Ok(())
    }

    #[test]
    fn serialize_deserialize() -> Result<(), Box<Error>> {
        let a = DocIndex {
            document_id: DocumentId(0),
            attribute: 3,
            word_index: 11,
            char_index: 30,
            char_length: 4,
        };
        let b = DocIndex {
            document_id: DocumentId(1),
            attribute: 4,
            word_index: 21,
            char_index: 35,
            char_length: 6,
        };
        let c = DocIndex {
            document_id: DocumentId(2),
            attribute: 8,
            word_index: 2,
            char_index: 89,
            char_length: 6,
        };

        let mut builder = DocIndexesBuilder::memory();

        builder.insert(Set::new(&[a])?);
        builder.insert(Set::new(&[a, b, c])?);
        builder.insert(Set::new(&[a, c])?);

        let builder_bytes = builder.into_inner()?;
        let docs = DocIndexes::from_bytes(builder_bytes.clone())?;

        let mut bytes = Vec::new();
        docs.write_to_bytes(&mut bytes);

        assert_eq!(builder_bytes, bytes);

        Ok(())
    }
}
@ -1,16 +0,0 @@
mod doc_ids;
mod doc_indexes;
mod shared_data;

use std::slice::from_raw_parts;
use std::mem::size_of;

pub use self::doc_ids::DocIds;
pub use self::doc_indexes::{DocIndexes, DocIndexesBuilder};
pub use self::shared_data::SharedData;

unsafe fn into_u8_slice<T: Sized>(slice: &[T]) -> &[u8] {
    let ptr = slice.as_ptr() as *const u8;
    let len = slice.len() * size_of::<T>();
    from_raw_parts(ptr, len)
}
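The `into_u8_slice` helper above is the load-bearing trick of this whole data module: a typed slice is reinterpreted as its raw bytes, which is only sound for `Sized`, padding-free, `#[repr(C)]`-style element types like the `DocumentId` and `DocIndex` shown later. A standalone sketch of the same pattern:

use std::mem::size_of;
use std::slice::from_raw_parts;

unsafe fn into_u8_slice<T: Sized>(slice: &[T]) -> &[u8] {
    let ptr = slice.as_ptr() as *const u8;
    let len = slice.len() * size_of::<T>();
    from_raw_parts(ptr, len)
}

fn main() {
    let values: [u32; 2] = [1, 2];
    // Two u32 values reinterpreted as eight bytes, without copying.
    let bytes = unsafe { into_u8_slice(&values) };
    assert_eq!(bytes.len(), 2 * size_of::<u32>());
}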
@ -1,58 +0,0 @@
use std::sync::Arc;
use std::ops::Deref;

#[derive(Clone)]
pub struct SharedData {
    pub bytes: Arc<[u8]>,
    pub offset: usize,
    pub len: usize,
}

impl SharedData {
    pub fn from_bytes(vec: Vec<u8>) -> SharedData {
        let len = vec.len();
        let bytes = Arc::from(vec);
        SharedData::new(bytes, 0, len)
    }

    pub fn new(bytes: Arc<[u8]>, offset: usize, len: usize) -> SharedData {
        SharedData { bytes, offset, len }
    }

    pub fn as_slice(&self) -> &[u8] {
        &self.bytes[self.offset..self.offset + self.len]
    }

    pub fn range(&self, offset: usize, len: usize) -> SharedData {
        assert!(offset + len <= self.len);
        SharedData {
            bytes: self.bytes.clone(),
            offset: self.offset + offset,
            len: len,
        }
    }
}

impl Default for SharedData {
    fn default() -> SharedData {
        SharedData {
            bytes: Arc::from(Vec::new()),
            offset: 0,
            len: 0,
        }
    }
}

impl Deref for SharedData {
    type Target = [u8];

    fn deref(&self) -> &Self::Target {
        self.as_slice()
    }
}

impl AsRef<[u8]> for SharedData {
    fn as_ref(&self) -> &[u8] {
        self.as_slice()
    }
}
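A short usage sketch, assuming the `SharedData` definition above is in scope: sub-slices obtained through `range` share the same `Arc<[u8]>` allocation instead of copying the bytes.

fn shared_data_demo() {
    let data = SharedData::from_bytes(vec![1, 2, 3, 4, 5]);

    // `range` clones the Arc, not the bytes: same allocation, new window.
    let tail = data.range(2, 3);
    assert_eq!(tail.as_slice(), &[3, 4, 5]);

    // Deref<Target = [u8]> lets it be used wherever a byte slice is expected.
    assert_eq!(data.len(), 5);
}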
@ -1,175 +0,0 @@
use std::error::Error;

use byteorder::{LittleEndian, ReadBytesExt, WriteBytesExt};
use fst::{map, Map, IntoStreamer, Streamer};
use fst::raw::Fst;
use sdset::duo::{Union, DifferenceByKey};
use sdset::{Set, SetOperation};

use crate::shared_data_cursor::{SharedDataCursor, FromSharedDataCursor};
use crate::write_to_bytes::WriteToBytes;
use crate::data::{DocIndexes, DocIndexesBuilder};
use crate::{DocumentId, DocIndex};

#[derive(Default)]
pub struct Index {
    pub map: Map,
    pub indexes: DocIndexes,
}

impl Index {
    pub fn remove_documents(&self, documents: &Set<DocumentId>) -> Index {
        let mut buffer = Vec::new();
        let mut builder = IndexBuilder::new();
        let mut stream = self.into_stream();

        while let Some((key, indexes)) = stream.next() {
            buffer.clear();

            let op = DifferenceByKey::new(indexes, documents, |x| x.document_id, |x| *x);
            op.extend_vec(&mut buffer);

            if !buffer.is_empty() {
                let indexes = Set::new_unchecked(&buffer);
                builder.insert(key, indexes).unwrap();
            }
        }

        builder.build()
    }

    pub fn union(&self, other: &Index) -> Index {
        let mut builder = IndexBuilder::new();
        let mut stream = map::OpBuilder::new().add(&self.map).add(&other.map).union();

        let mut buffer = Vec::new();
        while let Some((key, ivalues)) = stream.next() {
            buffer.clear();
            match ivalues {
                [a, b] => {
                    let indexes = if a.index == 0 { &self.indexes } else { &other.indexes };
                    let indexes = &indexes[a.value as usize];
                    let a = Set::new_unchecked(indexes);

                    let indexes = if b.index == 0 { &self.indexes } else { &other.indexes };
                    let indexes = &indexes[b.value as usize];
                    let b = Set::new_unchecked(indexes);

                    let op = Union::new(a, b);
                    op.extend_vec(&mut buffer);
                },
                [x] => {
                    let indexes = if x.index == 0 { &self.indexes } else { &other.indexes };
                    let indexes = &indexes[x.value as usize];
                    buffer.extend_from_slice(indexes)
                },
                _ => continue,
            }

            if !buffer.is_empty() {
                let indexes = Set::new_unchecked(&buffer);
                builder.insert(key, indexes).unwrap();
            }
        }

        builder.build()
    }
}

impl FromSharedDataCursor for Index {
    type Error = Box<Error>;

    fn from_shared_data_cursor(cursor: &mut SharedDataCursor) -> Result<Index, Self::Error> {
        let len = cursor.read_u64::<LittleEndian>()? as usize;
        let data = cursor.extract(len);

        let fst = Fst::from_shared_bytes(data.bytes, data.offset, data.len)?;
        let map = Map::from(fst);

        let indexes = DocIndexes::from_shared_data_cursor(cursor)?;

        Ok(Index { map, indexes })
    }
}

impl WriteToBytes for Index {
    fn write_to_bytes(&self, bytes: &mut Vec<u8>) {
        let slice = self.map.as_fst().as_bytes();
        let len = slice.len() as u64;
        let _ = bytes.write_u64::<LittleEndian>(len);
        bytes.extend_from_slice(slice);

        self.indexes.write_to_bytes(bytes);
    }
}

impl<'m, 'a> IntoStreamer<'a> for &'m Index {
    type Item = (&'a [u8], &'a Set<DocIndex>);
    type Into = Stream<'m>;

    fn into_stream(self) -> Self::Into {
        Stream {
            map_stream: self.map.into_stream(),
            indexes: &self.indexes,
        }
    }
}

pub struct Stream<'m> {
    map_stream: map::Stream<'m>,
    indexes: &'m DocIndexes,
}

impl<'m, 'a> Streamer<'a> for Stream<'m> {
    type Item = (&'a [u8], &'a Set<DocIndex>);

    fn next(&'a mut self) -> Option<Self::Item> {
        match self.map_stream.next() {
            Some((input, index)) => {
                let indexes = &self.indexes[index as usize];
                let indexes = Set::new_unchecked(indexes);
                Some((input, indexes))
            },
            None => None,
        }
    }
}

pub struct IndexBuilder {
    map: fst::MapBuilder<Vec<u8>>,
    indexes: DocIndexesBuilder<Vec<u8>>,
    value: u64,
}

impl IndexBuilder {
    pub fn new() -> Self {
        IndexBuilder {
            map: fst::MapBuilder::memory(),
            indexes: DocIndexesBuilder::memory(),
            value: 0,
        }
    }

    /// If a key is inserted that is less than or equal to any previous key added,
    /// then an error is returned. Similarly, if there was a problem writing
    /// to the underlying writer, an error is returned.
    // FIXME what if one write doesn't work but the other does?
    pub fn insert<K>(&mut self, key: K, indexes: &Set<DocIndex>) -> fst::Result<()>
    where K: AsRef<[u8]>,
    {
        self.map.insert(key, self.value)?;
        self.indexes.insert(indexes);
        self.value += 1;
        Ok(())
    }

    pub fn build(self) -> Index {
        let map = self.map.into_inner().unwrap();
        let indexes = self.indexes.into_inner().unwrap();

        let map = Map::from_bytes(map).unwrap();
        let indexes = DocIndexes::from_bytes(indexes).unwrap();

        Index { map, indexes }
    }
}
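A minimal sketch of how the removed `IndexBuilder` was driven: as its doc comment says, keys must arrive in strictly increasing byte order, mirroring `fst::MapBuilder`. Here `a` and `b` stand for any `DocIndex` values with `a < b` in `DocIndex` order:

fn build_small_index(a: DocIndex, b: DocIndex) -> Index {
    let mut builder = IndexBuilder::new();

    // "blue" sorts before "red"; out-of-order keys would make `insert` error.
    builder.insert("blue", Set::new(&[a]).unwrap()).unwrap();
    builder.insert("red", Set::new(&[a, b]).unwrap()).unwrap();

    builder.build()
}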
@ -1,28 +1,28 @@
pub mod criterion;
pub mod data;
mod index;
mod automaton;
mod query_builder;
mod distinct_map;
mod query_builder;
mod store;
pub mod criterion;

pub mod shared_data_cursor;
pub mod write_to_bytes;

use std::fmt;
use std::sync::Arc;
use serde::{Serialize, Deserialize};

use slice_group_by::GroupBy;
use rayon::slice::ParallelSliceMut;
use serde::{Serialize, Deserialize};
use slice_group_by::GroupBy;
use zerocopy::{AsBytes, FromBytes};

pub use self::index::{Index, IndexBuilder};
pub use self::query_builder::{QueryBuilder, DistinctQueryBuilder};
pub use self::store::Store;

/// Represents an internally generated document unique identifier.
///
/// It is used to tell the database which document you want to deserialize.
/// Helpful for custom ranking.
#[derive(Serialize, Deserialize)]
#[derive(Debug, Copy, Clone, Eq, PartialEq, PartialOrd, Ord, Hash)]
#[derive(Serialize, Deserialize)]
#[derive(AsBytes, FromBytes)]
#[repr(C)]
pub struct DocumentId(pub u64);

/// This structure represents the position of a word
@ -31,6 +31,7 @@ pub struct DocumentId(pub u64);
/// This is stored in the map, generated at index time,
/// extracted and interpreted at search time.
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
#[derive(AsBytes, FromBytes)]
#[repr(C)]
pub struct DocIndex {
    /// The document identifier where the word was found.
@ -210,6 +211,21 @@ impl RawDocument {
    }
}

impl fmt::Debug for RawDocument {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.debug_struct("RawDocument")
            .field("id", &self.id)
            .field("query_index", &self.query_index())
            .field("distance", &self.distance())
            .field("attribute", &self.attribute())
            .field("word_index", &self.word_index())
            .field("is_exact", &self.is_exact())
            .field("char_index", &self.char_index())
            .field("char_length", &self.char_length())
            .finish()
    }
}

pub fn raw_documents_from_matches(mut matches: Vec<(DocumentId, Match)>) -> Vec<RawDocument> {
    let mut docs_ranges = Vec::<(DocumentId, Range)>::new();
    let mut matches2 = Matches::with_capacity(matches.len());
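The `AsBytes`/`FromBytes` derives together with `#[repr(C)]` are what let `DocumentId` and `DocIndex` be written to and read back from byte buffers without a serializer; a hedged sketch of what that enables:

use zerocopy::AsBytes;

// View a DocIndex as its raw on-disk bytes, with no copy and no serializer.
fn doc_index_bytes(di: &DocIndex) -> &[u8] {
    di.as_bytes()
}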
@ -1,5 +1,5 @@
use std::hash::Hash;
use std::ops::{Range, Deref};
use std::ops::Range;
use std::rc::Rc;
use std::time::Instant;
use std::{cmp, mem};
@ -14,8 +14,8 @@ use log::info;
use crate::automaton::{self, DfaExt, AutomatonExt};
use crate::distinct_map::{DistinctMap, BufferedDistinctMap};
use crate::criterion::Criteria;
use crate::{raw_documents_from_matches, RawDocument, Document};
use crate::{Index, Match, DocumentId};
use crate::raw_documents_from_matches;
use crate::{Match, DocumentId, Store, RawDocument, Document};

fn generate_automatons(query: &str) -> Vec<DfaExt> {
    let has_end_whitespace = query.chars().last().map_or(false, char::is_whitespace);
@ -35,37 +35,37 @@ fn generate_automatons(query: &str) -> Vec<DfaExt> {
    automatons
}

pub struct QueryBuilder<'c, I, FI = fn(DocumentId) -> bool> {
    index: I,
pub struct QueryBuilder<'c, S, FI = fn(DocumentId) -> bool> {
    store: S,
    criteria: Criteria<'c>,
    searchable_attrs: Option<HashSet<u16>>,
    filter: Option<FI>,
}

impl<'c, I> QueryBuilder<'c, I, fn(DocumentId) -> bool> {
    pub fn new(index: I) -> Self {
        QueryBuilder::with_criteria(index, Criteria::default())
impl<'c, S> QueryBuilder<'c, S, fn(DocumentId) -> bool> {
    pub fn new(store: S) -> Self {
        QueryBuilder::with_criteria(store, Criteria::default())
    }

    pub fn with_criteria(index: I, criteria: Criteria<'c>) -> Self {
        QueryBuilder { index, criteria, searchable_attrs: None, filter: None }
    pub fn with_criteria(store: S, criteria: Criteria<'c>) -> Self {
        QueryBuilder { store, criteria, searchable_attrs: None, filter: None }
    }
}

impl<'c, I, FI> QueryBuilder<'c, I, FI>
impl<'c, S, FI> QueryBuilder<'c, S, FI>
{
    pub fn with_filter<F>(self, function: F) -> QueryBuilder<'c, I, F>
    pub fn with_filter<F>(self, function: F) -> QueryBuilder<'c, S, F>
    where F: Fn(DocumentId) -> bool,
    {
        QueryBuilder {
            index: self.index,
            store: self.store,
            criteria: self.criteria,
            searchable_attrs: self.searchable_attrs,
            filter: Some(function)
        }
    }

    pub fn with_distinct<F, K>(self, function: F, size: usize) -> DistinctQueryBuilder<'c, I, FI, F>
    pub fn with_distinct<F, K>(self, function: F, size: usize) -> DistinctQueryBuilder<'c, S, FI, F>
    where F: Fn(DocumentId) -> Option<K>,
          K: Hash + Eq,
    {
@ -82,16 +82,17 @@ impl<'c, I, FI> QueryBuilder<'c, I, FI>
    }
}

impl<'c, I, FI> QueryBuilder<'c, I, FI>
where I: Deref<Target=Index>,
impl<'c, S, FI> QueryBuilder<'c, S, FI>
where S: Store,
{
    fn query_all(&self, query: &str) -> Vec<RawDocument> {
    fn query_all(&self, query: &str) -> Result<Vec<RawDocument>, S::Error> {
        let automatons = generate_automatons(query);
        let words = self.store.words()?.as_fst();

        let mut stream = {
            let mut op_builder = fst::map::OpBuilder::new();
            let mut op_builder = fst::raw::OpBuilder::new();
            for automaton in &automatons {
                let stream = self.index.map.search(automaton);
                let stream = words.search(automaton);
                op_builder.push(stream);
            }
            op_builder.r#union()
@ -105,10 +106,13 @@ where I: Deref<Target=Index>,
            let distance = automaton.eval(input).to_u8();
            let is_exact = distance == 0 && input.len() == automaton.query_len();

            let doc_indexes = &self.index.indexes;
            let doc_indexes = &doc_indexes[iv.value as usize];
            let doc_indexes = self.store.word_indexes(input)?;
            let doc_indexes = match doc_indexes {
                Some(doc_indexes) => doc_indexes,
                None => continue,
            };

            for di in doc_indexes {
            for di in doc_indexes.as_slice() {
                if self.searchable_attrs.as_ref().map_or(true, |r| r.contains(&di.attribute)) {
                    let match_ = Match {
                        query_index: iv.index as u32,
@ -131,15 +135,15 @@ where I: Deref<Target=Index>,
        info!("{} total documents to classify", raw_documents.len());
        info!("{} total matches to classify", total_matches);

        raw_documents
        Ok(raw_documents)
    }
}

impl<'c, I, FI> QueryBuilder<'c, I, FI>
where I: Deref<Target=Index>,
impl<'c, S, FI> QueryBuilder<'c, S, FI>
where S: Store,
      FI: Fn(DocumentId) -> bool,
{
    pub fn query(self, query: &str, range: Range<usize>) -> Vec<Document> {
    pub fn query(self, query: &str, range: Range<usize>) -> Result<Vec<Document>, S::Error> {
        // We delegate the filter work to the distinct query builder,
        // specifying a distinct rule that has no effect.
        if self.filter.is_some() {
@ -148,18 +152,16 @@ where I: Deref<Target=Index>,
        }

        let start = Instant::now();
        let mut documents = self.query_all(query);
        let mut documents = self.query_all(query)?;
        info!("query_all took {:.2?}", start.elapsed());

        let mut groups = vec![documents.as_mut_slice()];

        'criteria: for (ci, criterion) in self.criteria.as_ref().iter().enumerate() {
        'criteria: for criterion in self.criteria.as_ref() {
            let tmp_groups = mem::replace(&mut groups, Vec::new());
            let mut documents_seen = 0;

            for group in tmp_groups {
                info!("criterion {}, documents group of size {}", ci, group.len());

                // if this group does not overlap with the requested range,
                // push it without sorting and splitting it
                if documents_seen + group.len() < range.start {
@ -170,9 +172,11 @@ where I: Deref<Target=Index>,

                let start = Instant::now();
                group.par_sort_unstable_by(|a, b| criterion.evaluate(a, b));
                info!("criterion {} sort took {:.2?}", ci, start.elapsed());
                info!("criterion {} sort took {:.2?}", criterion.name(), start.elapsed());

                for group in group.binary_group_by_mut(|a, b| criterion.eq(a, b)) {
                    info!("criterion {} produced a group of size {}", criterion.name(), group.len());

                    documents_seen += group.len();
                    groups.push(group);

@ -185,7 +189,7 @@ where I: Deref<Target=Index>,

        let offset = cmp::min(documents.len(), range.start);
        let iter = documents.into_iter().skip(offset).take(range.len());
        iter.map(|d| Document::from_raw(&d)).collect()
        Ok(iter.map(|d| Document::from_raw(&d)).collect())
    }
}

@ -212,15 +216,15 @@ impl<'c, I, FI, FD> DistinctQueryBuilder<'c, I, FI, FD>
    }
}

impl<'c, I, FI, FD, K> DistinctQueryBuilder<'c, I, FI, FD>
where I: Deref<Target=Index>,
impl<'c, S, FI, FD, K> DistinctQueryBuilder<'c, S, FI, FD>
where S: Store,
      FI: Fn(DocumentId) -> bool,
      FD: Fn(DocumentId) -> Option<K>,
      K: Hash + Eq,
{
    pub fn query(self, query: &str, range: Range<usize>) -> Vec<Document> {
    pub fn query(self, query: &str, range: Range<usize>) -> Result<Vec<Document>, S::Error> {
        let start = Instant::now();
        let mut documents = self.inner.query_all(query);
        let mut documents = self.inner.query_all(query)?;
        info!("query_all took {:.2?}", start.elapsed());

        let mut groups = vec![documents.as_mut_slice()];
@ -233,14 +237,12 @@ where I: Deref<Target=Index>,
        let mut distinct_map = DistinctMap::new(self.size);
        let mut distinct_raw_offset = 0;

        'criteria: for (ci, criterion) in self.inner.criteria.as_ref().iter().enumerate() {
        'criteria: for criterion in self.inner.criteria.as_ref() {
            let tmp_groups = mem::replace(&mut groups, Vec::new());
            let mut buf_distinct = BufferedDistinctMap::new(&mut distinct_map);
            let mut documents_seen = 0;

            for group in tmp_groups {
                info!("criterion {}, documents group of size {}", ci, group.len());

                // if this group does not overlap with the requested range,
                // push it without sorting and splitting it
                if documents_seen + group.len() < distinct_raw_offset {
@ -251,7 +253,7 @@ where I: Deref<Target=Index>,

                let start = Instant::now();
                group.par_sort_unstable_by(|a, b| criterion.evaluate(a, b));
                info!("criterion {} sort took {:.2?}", ci, start.elapsed());
                info!("criterion {} sort took {:.2?}", criterion.name(), start.elapsed());

                for group in group.binary_group_by_mut(|a, b| criterion.eq(a, b)) {
                    // we must compute the real distinguished len of this sub-group
@ -278,6 +280,8 @@ where I: Deref<Target=Index>,
                        if buf_distinct.len() >= range.end { break }
                    }

                    info!("criterion {} produced a group of size {}", criterion.name(), group.len());

                    documents_seen += group.len();
                    groups.push(group);

@ -318,6 +322,6 @@ where I: Deref<Target=Index>,
            }
        }

        out_documents
        Ok(out_documents)
    }
}
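Taken together, these hunks change the calling convention: searches are now generic over any `Store` backend and surface storage errors through `Result` instead of panicking. A minimal sketch of the new call shape (the re-exports of `Document`, `QueryBuilder`, and `Store` at the crate root are shown in the lib.rs hunk above):

use meilidb_core::{Document, QueryBuilder, Store};

// A hedged sketch: `store` can be any Store implementation (see store.rs below).
fn search<S: Store>(store: S, query: &str) -> Result<Vec<Document>, S::Error> {
    QueryBuilder::new(store).query(query, 0..20)
}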
@ -1,56 +0,0 @@
use std::io::{self, Read, Cursor, BufRead};
use std::sync::Arc;
use crate::data::SharedData;

pub struct SharedDataCursor(Cursor<SharedData>);

impl SharedDataCursor {
    pub fn from_bytes(bytes: Vec<u8>) -> SharedDataCursor {
        let len = bytes.len();
        let bytes = Arc::from(bytes);

        SharedDataCursor::from_shared_bytes(bytes, 0, len)
    }

    pub fn from_shared_bytes(bytes: Arc<[u8]>, offset: usize, len: usize) -> SharedDataCursor {
        let data = SharedData::new(bytes, offset, len);
        let cursor = Cursor::new(data);

        SharedDataCursor(cursor)
    }

    pub fn extract(&mut self, amt: usize) -> SharedData {
        let offset = self.0.position() as usize;
        let extracted = self.0.get_ref().range(offset, amt);
        self.0.consume(amt);

        extracted
    }
}

impl Read for SharedDataCursor {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.0.read(buf)
    }
}

impl BufRead for SharedDataCursor {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        self.0.fill_buf()
    }

    fn consume(&mut self, amt: usize) {
        self.0.consume(amt)
    }
}

pub trait FromSharedDataCursor: Sized {
    type Error;

    fn from_shared_data_cursor(cursor: &mut SharedDataCursor) -> Result<Self, Self::Error>;

    fn from_bytes(bytes: Vec<u8>) -> Result<Self, Self::Error> {
        let mut cursor = SharedDataCursor::from_bytes(bytes);
        Self::from_shared_data_cursor(&mut cursor)
    }
}
23
meilidb-core/src/store.rs
Normal file
@ -0,0 +1,23 @@
use std::error::Error;
use fst::Set;
use sdset::SetBuf;
use crate::DocIndex;

pub trait Store {
    type Error: Error;

    fn words(&self) -> Result<&Set, Self::Error>;
    fn word_indexes(&self, word: &[u8]) -> Result<Option<SetBuf<DocIndex>>, Self::Error>;
}

impl<T> Store for &'_ T where T: Store {
    type Error = T::Error;

    fn words(&self) -> Result<&Set, Self::Error> {
        (*self).words()
    }

    fn word_indexes(&self, word: &[u8]) -> Result<Option<SetBuf<DocIndex>>, Self::Error> {
        (*self).word_indexes(word)
    }
}
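This trait is the heart of the split: a backend only has to expose the fst set of all indexed words plus, for each word, its sorted `DocIndex` entries. A sketch of a hypothetical in-memory implementation; the field types and the choice of `io::Error` are assumptions, not part of this diff:

use std::collections::HashMap;
use std::io;

use fst::Set;
use sdset::SetBuf;
use meilidb_core::{DocIndex, Store};

struct RamStore {
    // The fst set of every indexed word.
    words: Set,
    // For each word, the sorted DocIndex entries where it appears.
    indexes: HashMap<Vec<u8>, SetBuf<DocIndex>>,
}

impl Store for RamStore {
    type Error = io::Error;

    fn words(&self) -> Result<&Set, Self::Error> {
        Ok(&self.words)
    }

    fn word_indexes(&self, word: &[u8]) -> Result<Option<SetBuf<DocIndex>>, Self::Error> {
        Ok(self.indexes.get(word).cloned())
    }
}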
@ -1,9 +0,0 @@
pub trait WriteToBytes {
    fn write_to_bytes(&self, bytes: &mut Vec<u8>);

    fn into_bytes(&self) -> Vec<u8> {
        let mut bytes = Vec::new();
        self.write_to_bytes(&mut bytes);
        bytes
    }
}
@ -7,19 +7,26 @@ edition = "2018"
[dependencies]
arc-swap = "0.3.11"
bincode = "1.1.2"
byteorder = "1.3.1"
deunicode = "1.0.0"
hashbrown = { version = "0.2.2", features = ["serde"] }
linked-hash-map = { version = "0.5.2", features = ["serde_impl"] }
meilidb-core = { path = "../meilidb-core", version = "0.1.0" }
meilidb-tokenizer = { path = "../meilidb-tokenizer", version = "0.1.0" }
ordered-float = { version = "1.0.2", features = ["serde"] }
sdset = "0.3.1"
serde = { version = "1.0.90", features = ["derive"] }
serde = { version = "1.0.91", features = ["derive"] }
serde_json = { version = "1.0.39", features = ["preserve_order"] }
sled = "0.23.0"
toml = { version = "0.5.0", features = ["preserve_order"] }
deunicode = "1.0.0"
zerocopy = "0.2.2"

[dependencies.rmp-serde]
git = "https://github.com/3Hren/msgpack-rust.git"
rev = "40b3d48"

[dependencies.fst]
git = "https://github.com/Kerollmops/fst.git"
branch = "arc-byte-slice"

[dev-dependencies]
tempfile = "3.0.7"
@ -1,464 +0,0 @@
use std::collections::HashSet;
use std::io::{self, Cursor, BufRead};
use std::iter::FromIterator;
use std::path::Path;
use std::sync::Arc;
use std::{error, fmt};

use arc_swap::{ArcSwap, Lease};
use byteorder::{ReadBytesExt, BigEndian};
use hashbrown::HashMap;
use meilidb_core::criterion::Criteria;
use meilidb_core::QueryBuilder;
use meilidb_core::shared_data_cursor::{FromSharedDataCursor, SharedDataCursor};
use meilidb_core::write_to_bytes::WriteToBytes;
use meilidb_core::{DocumentId, Index as WordIndex};
use rmp_serde::decode::{Error as RmpError};
use sdset::SetBuf;
use serde::de;
use sled::IVec;

use crate::{Schema, SchemaAttr, RankedMap};
use crate::serde::{extract_document_id, Serializer, Deserializer, SerializerError};
use crate::indexer::Indexer;

#[derive(Debug)]
pub enum Error {
    SchemaDiffer,
    SchemaMissing,
    WordIndexMissing,
    MissingDocumentId,
    SledError(sled::Error),
    BincodeError(bincode::Error),
    SerializerError(SerializerError),
}

impl From<sled::Error> for Error {
    fn from(error: sled::Error) -> Error {
        Error::SledError(error)
    }
}

impl From<bincode::Error> for Error {
    fn from(error: bincode::Error) -> Error {
        Error::BincodeError(error)
    }
}

impl From<SerializerError> for Error {
    fn from(error: SerializerError) -> Error {
        Error::SerializerError(error)
    }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        use self::Error::*;
        match self {
            SchemaDiffer => write!(f, "schemas differ"),
            SchemaMissing => write!(f, "this index does not have a schema"),
            WordIndexMissing => write!(f, "this index does not have a word index"),
            MissingDocumentId => write!(f, "document id is missing"),
            SledError(e) => write!(f, "sled error; {}", e),
            BincodeError(e) => write!(f, "bincode error; {}", e),
            SerializerError(e) => write!(f, "serializer error; {}", e),
        }
    }
}

impl error::Error for Error { }

fn index_name(name: &str) -> Vec<u8> {
    format!("index-{}", name).into_bytes()
}

fn document_key(id: DocumentId, attr: SchemaAttr) -> Vec<u8> {
    let DocumentId(document_id) = id;
    let SchemaAttr(schema_attr) = attr;

    let mut bytes = Vec::new();
    bytes.extend_from_slice(b"document-");
    bytes.extend_from_slice(&document_id.to_be_bytes()[..]);
    bytes.extend_from_slice(&schema_attr.to_be_bytes()[..]);
    bytes
}

trait CursorExt {
    fn consume_if_eq(&mut self, needle: &[u8]) -> bool;
}

impl<T: AsRef<[u8]>> CursorExt for Cursor<T> {
    fn consume_if_eq(&mut self, needle: &[u8]) -> bool {
        let position = self.position() as usize;
        let slice = self.get_ref().as_ref();

        if slice[position..].starts_with(needle) {
            self.consume(needle.len());
            true
        } else {
            false
        }
    }
}

fn extract_document_key(key: Vec<u8>) -> io::Result<(DocumentId, SchemaAttr)> {
    let mut key = Cursor::new(key);

    if !key.consume_if_eq(b"document-") {
        return Err(io::Error::from(io::ErrorKind::InvalidData))
    }

    let document_id = key.read_u64::<BigEndian>().map(DocumentId)?;
    let schema_attr = key.read_u16::<BigEndian>().map(SchemaAttr)?;

    Ok((document_id, schema_attr))
}

#[derive(Clone)]
pub struct Database {
    opened: Arc<ArcSwap<HashMap<String, RawIndex>>>,
    inner: sled::Db,
}

impl Database {
    pub fn start_default<P: AsRef<Path>>(path: P) -> Result<Database, Error> {
        let inner = sled::Db::start_default(path)?;
        let opened = Arc::new(ArcSwap::new(Arc::new(HashMap::new())));
        Ok(Database { opened, inner })
    }

    pub fn open_index(&self, name: &str) -> Result<Option<Index>, Error> {
        // check if the index was already opened
        if let Some(raw_index) = self.opened.lease().get(name) {
            return Ok(Some(Index(raw_index.clone())))
        }

        let raw_name = index_name(name);
        if self.inner.tree_names().into_iter().any(|tn| tn == raw_name) {
            let tree = self.inner.open_tree(raw_name)?;
            let raw_index = RawIndex::from_raw(tree)?;

            self.opened.rcu(|opened| {
                let mut opened = HashMap::clone(opened);
                opened.insert(name.to_string(), raw_index.clone());
                opened
            });

            return Ok(Some(Index(raw_index)))
        }

        Ok(None)
    }

    pub fn create_index(&self, name: String, schema: Schema) -> Result<Index, Error> {
        match self.open_index(&name)? {
            Some(index) => {
                if index.schema() != &schema {
                    return Err(Error::SchemaDiffer);
                }

                Ok(index)
            },
            None => {
                let raw_name = index_name(&name);
                let tree = self.inner.open_tree(raw_name)?;
                let raw_index = RawIndex::new_from_raw(tree, schema)?;

                self.opened.rcu(|opened| {
                    let mut opened = HashMap::clone(opened);
                    opened.insert(name.clone(), raw_index.clone());
                    opened
                });

                Ok(Index(raw_index))
            },
        }
    }
}

#[derive(Clone)]
pub struct RawIndex {
    schema: Schema,
    word_index: Arc<ArcSwap<WordIndex>>,
    ranked_map: Arc<ArcSwap<RankedMap>>,
    inner: Arc<sled::Tree>,
}

impl RawIndex {
    fn from_raw(inner: Arc<sled::Tree>) -> Result<RawIndex, Error> {
        let schema = {
            let bytes = inner.get("schema")?;
            let bytes = bytes.ok_or(Error::SchemaMissing)?;
            Schema::read_from_bin(bytes.as_ref())?
        };

        let bytes = inner.get("word-index")?;
        let bytes = bytes.ok_or(Error::WordIndexMissing)?;
        let word_index = {
            let len = bytes.len();
            let bytes: Arc<[u8]> = Into::into(bytes);
            let mut cursor = SharedDataCursor::from_shared_bytes(bytes, 0, len);

            // TODO must handle this error
            let word_index = WordIndex::from_shared_data_cursor(&mut cursor).unwrap();

            Arc::new(ArcSwap::new(Arc::new(word_index)))
        };

        let ranked_map = {
            let map = match inner.get("ranked-map")? {
                Some(bytes) => bincode::deserialize(bytes.as_ref())?,
                None => RankedMap::default(),
            };

            Arc::new(ArcSwap::new(Arc::new(map)))
        };

        Ok(RawIndex { schema, word_index, ranked_map, inner })
    }

    fn new_from_raw(inner: Arc<sled::Tree>, schema: Schema) -> Result<RawIndex, Error> {
        let mut schema_bytes = Vec::new();
        schema.write_to_bin(&mut schema_bytes)?;
        inner.set("schema", schema_bytes)?;

        let word_index = WordIndex::default();
        inner.set("word-index", word_index.into_bytes())?;
        let word_index = Arc::new(ArcSwap::new(Arc::new(word_index)));

        let ranked_map = Arc::new(ArcSwap::new(Arc::new(RankedMap::default())));

        Ok(RawIndex { schema, word_index, ranked_map, inner })
    }

    pub fn schema(&self) -> &Schema {
        &self.schema
    }

    pub fn word_index(&self) -> Lease<Arc<WordIndex>> {
        self.word_index.lease()
    }

    pub fn ranked_map(&self) -> Lease<Arc<RankedMap>> {
        self.ranked_map.lease()
    }

    pub fn update_word_index(&self, word_index: Arc<WordIndex>) -> sled::Result<()> {
        let data = word_index.into_bytes();
        self.inner.set("word-index", data).map(drop)?;
        self.word_index.store(word_index);

        Ok(())
    }

    pub fn update_ranked_map(&self, ranked_map: Arc<RankedMap>) -> sled::Result<()> {
        let data = bincode::serialize(ranked_map.as_ref()).unwrap();
        self.inner.set("ranked-map", data).map(drop)?;
        self.ranked_map.store(ranked_map);

        Ok(())
    }

    pub fn set_document_attribute<V>(
        &self,
        id: DocumentId,
        attr: SchemaAttr,
        value: V,
    ) -> Result<Option<IVec>, sled::Error>
    where IVec: From<V>,
    {
        let key = document_key(id, attr);
        Ok(self.inner.set(key, value)?)
    }

    pub fn get_document_attribute(
        &self,
        id: DocumentId,
        attr: SchemaAttr
    ) -> Result<Option<IVec>, sled::Error>
    {
        let key = document_key(id, attr);
        Ok(self.inner.get(key)?)
    }

    pub fn get_document_fields(&self, id: DocumentId) -> DocumentFieldsIter {
        let start = document_key(id, SchemaAttr::min());
        let end = document_key(id, SchemaAttr::max());
        DocumentFieldsIter(self.inner.range(start..=end))
    }

    pub fn del_document_attribute(
        &self,
        id: DocumentId,
        attr: SchemaAttr
    ) -> Result<Option<IVec>, sled::Error>
    {
        let key = document_key(id, attr);
        Ok(self.inner.del(key)?)
    }
}

pub struct DocumentFieldsIter<'a>(sled::Iter<'a>);

impl<'a> Iterator for DocumentFieldsIter<'a> {
    type Item = Result<(DocumentId, SchemaAttr, IVec), Error>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.0.next() {
            Some(Ok((key, value))) => {
                let (id, attr) = extract_document_key(key).unwrap();
                Some(Ok((id, attr, value)))
            },
            Some(Err(e)) => Some(Err(Error::SledError(e))),
            None => None,
        }
    }
}

#[derive(Clone)]
pub struct Index(RawIndex);

impl Index {
    pub fn query_builder(&self) -> QueryBuilder<Lease<Arc<WordIndex>>> {
        let word_index = self.word_index();
        QueryBuilder::new(word_index)
    }

    pub fn query_builder_with_criteria<'c>(
        &self,
        criteria: Criteria<'c>,
    ) -> QueryBuilder<'c, Lease<Arc<WordIndex>>>
    {
        let word_index = self.word_index();
        QueryBuilder::with_criteria(word_index, criteria)
    }

    pub fn schema(&self) -> &Schema {
        self.0.schema()
    }

    pub fn word_index(&self) -> Lease<Arc<WordIndex>> {
        self.0.word_index()
    }

    pub fn ranked_map(&self) -> Lease<Arc<RankedMap>> {
        self.0.ranked_map()
    }

    pub fn documents_addition(&self) -> DocumentsAddition {
        let index = self.0.clone();
        let ranked_map = self.0.ranked_map().clone();
        DocumentsAddition::from_raw(index, ranked_map)
    }

    pub fn documents_deletion(&self) -> DocumentsDeletion {
        let index = self.0.clone();
        DocumentsDeletion::from_raw(index)
    }

    pub fn document<T>(
        &self,
        fields: Option<&HashSet<&str>>,
        id: DocumentId,
    ) -> Result<Option<T>, RmpError>
    where T: de::DeserializeOwned,
    {
        let fields = match fields {
            Some(fields) => {
                let iter = fields.iter().filter_map(|n| self.0.schema().attribute(n));
                Some(HashSet::from_iter(iter))
            },
            None => None,
        };

        let mut deserializer = Deserializer {
            document_id: id,
            raw_index: &self.0,
            fields: fields.as_ref(),
        };

        // TODO: currently we return an error if all document fields are missing,
        // returning None would have been better
        T::deserialize(&mut deserializer).map(Some)
    }
}

pub struct DocumentsAddition {
    inner: RawIndex,
    indexer: Indexer,
    ranked_map: RankedMap,
}

impl DocumentsAddition {
    pub fn from_raw(inner: RawIndex, ranked_map: RankedMap) -> DocumentsAddition {
        DocumentsAddition { inner, indexer: Indexer::new(), ranked_map }
    }

    pub fn update_document<D>(&mut self, document: D) -> Result<(), Error>
    where D: serde::Serialize,
    {
        let schema = self.inner.schema();
        let identifier = schema.identifier_name();

        let document_id = match extract_document_id(identifier, &document)? {
            Some(id) => id,
            None => return Err(Error::MissingDocumentId),
        };

        let serializer = Serializer {
            schema,
            index: &self.inner,
            indexer: &mut self.indexer,
            ranked_map: &mut self.ranked_map,
            document_id,
        };

        document.serialize(serializer)?;

        Ok(())
    }

    pub fn finalize(self) -> sled::Result<()> {
        let delta_index = self.indexer.build();

        let index = self.inner.word_index();
        let new_index = index.r#union(&delta_index);

        let new_index = Arc::from(new_index);
        self.inner.update_word_index(new_index)?;

        Ok(())
    }
}

pub struct DocumentsDeletion {
    inner: RawIndex,
    documents: Vec<DocumentId>,
}

impl DocumentsDeletion {
    pub fn from_raw(inner: RawIndex) -> DocumentsDeletion {
        DocumentsDeletion {
            inner,
            documents: Vec::new(),
        }
    }

    pub fn delete_document(&mut self, id: DocumentId) {
        self.documents.push(id);
    }

    pub fn finalize(mut self) -> Result<(), Error> {
        self.documents.sort_unstable();
        self.documents.dedup();

        let idset = SetBuf::new_unchecked(self.documents);
        let index = self.inner.word_index();

        let new_index = index.remove_documents(&idset);
        let new_index = Arc::from(new_index);

        self.inner.update_word_index(new_index)?;

        Ok(())
    }
}
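For reference, a hedged sketch of how this now-removed sled-backed API was driven end to end; the index name is arbitrary and the document value is an elided placeholder:

fn add_documents(database: &Database, schema: Schema) -> Result<(), Error> {
    let index = database.create_index("products".to_string(), schema)?;

    let mut addition = index.documents_addition();
    // `addition.update_document(document)?;` for any serde::Serialize value
    // that contains the schema's identifier field.
    addition.finalize()?;

    Ok(())
}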
13
meilidb-data/src/database/custom_settings.rs
Normal file
@ -0,0 +1,13 @@
|
||||
use std::sync::Arc;
|
||||
use std::ops::Deref;
|
||||
|
||||
#[derive(Clone)]
|
||||
pub struct CustomSettings(pub Arc<sled::Tree>);
|
||||
|
||||
impl Deref for CustomSettings {
|
||||
type Target = sled::Tree;
|
||||
|
||||
fn deref(&self) -> &sled::Tree {
|
||||
&self.0
|
||||
}
|
||||
}
|
meilidb-data/src/database/docs_words_index.rs (new file, 33 lines)
@@ -0,0 +1,33 @@
use std::sync::Arc;
use meilidb_core::DocumentId;
use super::Error;

#[derive(Clone)]
pub struct DocsWordsIndex(pub Arc<sled::Tree>);

impl DocsWordsIndex {
    pub fn doc_words(&self, id: DocumentId) -> Result<Option<fst::Set>, Error> {
        let key = id.0.to_be_bytes();
        match self.0.get(key)? {
            Some(bytes) => {
                let len = bytes.len();
                let value = bytes.into();
                let fst = fst::raw::Fst::from_shared_bytes(value, 0, len)?;
                Ok(Some(fst::Set::from(fst)))
            },
            None => Ok(None)
        }
    }

    pub fn set_doc_words(&self, id: DocumentId, words: &fst::Set) -> Result<(), Error> {
        let key = id.0.to_be_bytes();
        self.0.set(key, words.as_fst().as_bytes())?;
        Ok(())
    }

    pub fn del_doc_words(&self, id: DocumentId) -> Result<(), Error> {
        let key = id.0.to_be_bytes();
        self.0.del(key)?;
        Ok(())
    }
}
meilidb-data/src/database/documents_addition.rs (new file, 131 lines)
@@ -0,0 +1,131 @@
use std::collections::HashSet;
use std::sync::Arc;

use meilidb_core::DocumentId;
use fst::{SetBuilder, set::OpBuilder};
use sdset::{SetOperation, duo::Union};

use crate::indexer::Indexer;
use crate::serde::{extract_document_id, Serializer, RamDocumentStore};
use crate::RankedMap;

use super::{Error, Index, InnerIndex, DocumentsDeletion};

pub struct DocumentsAddition<'a> {
    inner: &'a Index,
    document_ids: HashSet<DocumentId>,
    document_store: RamDocumentStore,
    indexer: Indexer,
    ranked_map: RankedMap,
}

impl<'a> DocumentsAddition<'a> {
    pub fn new(inner: &'a Index, ranked_map: RankedMap) -> DocumentsAddition<'a> {
        DocumentsAddition {
            inner,
            document_ids: HashSet::new(),
            document_store: RamDocumentStore::new(),
            indexer: Indexer::new(),
            ranked_map,
        }
    }

    pub fn update_document<D>(&mut self, document: D) -> Result<(), Error>
    where D: serde::Serialize,
    {
        let schema = &self.inner.lease_inner().schema;
        let identifier = schema.identifier_name();

        let document_id = match extract_document_id(identifier, &document)? {
            Some(id) => id,
            None => return Err(Error::MissingDocumentId),
        };

        // 1. store the document id for future deletion
        self.document_ids.insert(document_id);

        // 2. index the document fields in ram stores
        let serializer = Serializer {
            schema,
            document_store: &mut self.document_store,
            indexer: &mut self.indexer,
            ranked_map: &mut self.ranked_map,
            document_id,
        };

        document.serialize(serializer)?;

        Ok(())
    }

    pub fn finalize(self) -> Result<(), Error> {
        let lease_inner = self.inner.lease_inner();
        let main = &lease_inner.raw.main;
        let words = &lease_inner.raw.words;
        let docs_words = &lease_inner.raw.docs_words;
        let documents = &lease_inner.raw.documents;

        // 1. remove the previous documents match indexes
        let mut documents_deletion = DocumentsDeletion::new(self.inner);
        documents_deletion.extend(self.document_ids);
        documents_deletion.finalize()?;

        // 2. insert new document attributes in the database
        for ((id, attr), value) in self.document_store.into_inner() {
            documents.set_document_field(id, attr, value)?;
        }

        let indexed = self.indexer.build();
        let mut delta_words_builder = SetBuilder::memory();

        for (word, delta_set) in indexed.words_doc_indexes {
            delta_words_builder.insert(&word).unwrap();

            let set = match words.doc_indexes(&word)? {
                Some(set) => Union::new(&set, &delta_set).into_set_buf(),
                None => delta_set,
            };

            words.set_doc_indexes(&word, &set)?;
        }

        for (id, words) in indexed.docs_words {
            docs_words.set_doc_words(id, &words)?;
        }

        let delta_words = delta_words_builder
            .into_inner()
            .and_then(fst::Set::from_bytes)
            .unwrap();

        let words = match main.words_set()? {
            Some(words) => {
                let op = OpBuilder::new()
                    .add(words.stream())
                    .add(delta_words.stream())
                    .r#union();

                let mut words_builder = SetBuilder::memory();
                words_builder.extend_stream(op).unwrap();
                words_builder
                    .into_inner()
                    .and_then(fst::Set::from_bytes)
                    .unwrap()
            },
            None => delta_words,
        };

        main.set_words_set(&words)?;
        main.set_ranked_map(&self.ranked_map)?;

        // update the "consistent" view of the Index
        let ranked_map = self.ranked_map;
        let schema = lease_inner.schema.clone();
        let raw = lease_inner.raw.clone();

        let inner = InnerIndex { words, schema, ranked_map, raw };
        self.inner.0.store(Arc::new(inner));

        Ok(())
    }
}
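Editor's note: `finalize` above merges the delta words into the main words set by streaming both fst sets through a union and rebuilding a single set. A minimal, self-contained sketch of that merge with the fst crate follows (it assumes the 0.3-era fst API that this diff calls; the sketch is not part of the commit):

use fst::{Set, SetBuilder, set::OpBuilder};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // the words already in the index, and the words of the new documents
    let existing = Set::from_iter(vec!["hello", "world"])?;
    let delta = Set::from_iter(vec!["coucou", "hello"])?;

    // stream both sorted sets through a union: each word appears once
    let op = OpBuilder::new()
        .add(existing.stream())
        .add(delta.stream())
        .union();

    // rebuild a single fst set from the merged stream
    let mut builder = SetBuilder::memory();
    builder.extend_stream(op)?;
    let merged = Set::from_bytes(builder.into_inner()?)?;

    assert_eq!(merged.len(), 3); // {"coucou", "hello", "world"}
    Ok(())
}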
meilidb-data/src/database/documents_deletion.rs (new file, 127 lines)
@@ -0,0 +1,127 @@
use std::collections::{HashMap, BTreeSet};
use std::sync::Arc;

use sdset::{SetBuf, SetOperation, duo::DifferenceByKey};
use fst::{SetBuilder, Streamer};
use meilidb_core::DocumentId;
use crate::serde::extract_document_id;

use super::{Index, Error, InnerIndex};

pub struct DocumentsDeletion<'a> {
    inner: &'a Index,
    documents: Vec<DocumentId>,
}

impl<'a> DocumentsDeletion<'a> {
    pub fn new(inner: &'a Index) -> DocumentsDeletion {
        DocumentsDeletion { inner, documents: Vec::new() }
    }

    fn delete_document_by_id(&mut self, id: DocumentId) {
        self.documents.push(id);
    }

    pub fn delete_document<D>(&mut self, document: D) -> Result<(), Error>
    where D: serde::Serialize,
    {
        let schema = &self.inner.lease_inner().schema;
        let identifier = schema.identifier_name();

        let document_id = match extract_document_id(identifier, &document)? {
            Some(id) => id,
            None => return Err(Error::MissingDocumentId),
        };

        self.delete_document_by_id(document_id);

        Ok(())
    }

    pub fn finalize(mut self) -> Result<(), Error> {
        let lease_inner = self.inner.lease_inner();
        let main = &lease_inner.raw.main;
        let docs_words = &lease_inner.raw.docs_words;
        let words = &lease_inner.raw.words;
        let documents = &lease_inner.raw.documents;

        let idset = {
            self.documents.sort_unstable();
            self.documents.dedup();
            SetBuf::new_unchecked(self.documents)
        };

        let mut words_document_ids = HashMap::new();
        for id in idset.into_vec() {
            if let Some(words) = docs_words.doc_words(id)? {
                let mut stream = words.stream();
                while let Some(word) = stream.next() {
                    let word = word.to_vec();
                    words_document_ids.entry(word).or_insert_with(Vec::new).push(id);
                }
            }
        }

        let mut removed_words = BTreeSet::new();
        for (word, mut document_ids) in words_document_ids {
            document_ids.sort_unstable();
            document_ids.dedup();
            let document_ids = SetBuf::new_unchecked(document_ids);

            if let Some(doc_indexes) = words.doc_indexes(&word)? {
                let op = DifferenceByKey::new(&doc_indexes, &document_ids, |d| d.document_id, |id| *id);
                let doc_indexes = op.into_set_buf();

                if !doc_indexes.is_empty() {
                    words.set_doc_indexes(&word, &doc_indexes)?;
                } else {
                    words.del_doc_indexes(&word)?;
                    removed_words.insert(word);
                }
            }

            for id in document_ids.into_vec() {
                documents.del_all_document_fields(id)?;
                docs_words.del_doc_words(id)?;
            }
        }

        let removed_words = fst::Set::from_iter(removed_words).unwrap();
        let words = match main.words_set()? {
            Some(words_set) => {
                let op = fst::set::OpBuilder::new()
                    .add(words_set.stream())
                    .add(removed_words.stream())
                    .difference();

                let mut words_builder = SetBuilder::memory();
                words_builder.extend_stream(op).unwrap();
                words_builder
                    .into_inner()
                    .and_then(fst::Set::from_bytes)
                    .unwrap()
            },
            None => fst::Set::default(),
        };

        main.set_words_set(&words)?;

        // TODO must update the ranked_map too!

        // update the "consistent" view of the Index
        let ranked_map = lease_inner.ranked_map.clone();
        let schema = lease_inner.schema.clone();
        let raw = lease_inner.raw.clone();

        let inner = InnerIndex { words, schema, ranked_map, raw };
        self.inner.0.store(Arc::new(inner));

        Ok(())
    }
}

impl<'a> Extend<DocumentId> for DocumentsDeletion<'a> {
    fn extend<T: IntoIterator<Item=DocumentId>>(&mut self, iter: T) {
        self.documents.extend(iter)
    }
}
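Editor's note: `DifferenceByKey` above keeps every `DocIndex` whose `document_id` is not in the deleted-ids set. A dependency-free sketch of that filtering, with made-up ids (not part of the commit):

fn main() {
    // (document_id, position) pairs standing in for DocIndex values,
    // sorted by document_id as a words-index entry would be
    let doc_indexes = vec![(1u64, 0u32), (2, 3), (2, 7), (3, 1)];
    let deleted_ids = vec![2u64]; // sorted set of documents to remove

    // difference-by-key: drop indexes whose document_id was deleted
    let remaining: Vec<_> = doc_indexes
        .into_iter()
        .filter(|(id, _)| deleted_ids.binary_search(id).is_err())
        .collect();

    assert_eq!(remaining, vec![(1, 0), (3, 1)]);
}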
meilidb-data/src/database/documents_index.rs (new file, 71 lines)
@@ -0,0 +1,71 @@
use std::sync::Arc;
use std::convert::TryInto;

use meilidb_core::DocumentId;
use sled::IVec;

use crate::document_attr_key::DocumentAttrKey;
use crate::schema::SchemaAttr;

#[derive(Clone)]
pub struct DocumentsIndex(pub Arc<sled::Tree>);

impl DocumentsIndex {
    pub fn document_field(&self, id: DocumentId, attr: SchemaAttr) -> sled::Result<Option<IVec>> {
        let key = DocumentAttrKey::new(id, attr).to_be_bytes();
        self.0.get(key)
    }

    pub fn set_document_field(&self, id: DocumentId, attr: SchemaAttr, value: Vec<u8>) -> sled::Result<()> {
        let key = DocumentAttrKey::new(id, attr).to_be_bytes();
        self.0.set(key, value)?;
        Ok(())
    }

    pub fn del_document_field(&self, id: DocumentId, attr: SchemaAttr) -> sled::Result<()> {
        let key = DocumentAttrKey::new(id, attr).to_be_bytes();
        self.0.del(key)?;
        Ok(())
    }

    pub fn del_all_document_fields(&self, id: DocumentId) -> sled::Result<()> {
        let start = DocumentAttrKey::new(id, SchemaAttr::min()).to_be_bytes();
        let end = DocumentAttrKey::new(id, SchemaAttr::max()).to_be_bytes();
        let document_attrs = self.0.range(start..=end).keys();

        for key in document_attrs {
            self.0.del(key?)?;
        }

        Ok(())
    }

    pub fn document_fields(&self, id: DocumentId) -> DocumentFieldsIter {
        let start = DocumentAttrKey::new(id, SchemaAttr::min());
        let start = start.to_be_bytes();

        let end = DocumentAttrKey::new(id, SchemaAttr::max());
        let end = end.to_be_bytes();

        DocumentFieldsIter(self.0.range(start..=end))
    }
}

pub struct DocumentFieldsIter<'a>(sled::Iter<'a>);

impl<'a> Iterator for DocumentFieldsIter<'a> {
    type Item = sled::Result<(SchemaAttr, IVec)>;

    fn next(&mut self) -> Option<Self::Item> {
        match self.0.next() {
            Some(Ok((key, value))) => {
                let slice: &[u8] = key.as_ref();
                let array = slice.try_into().unwrap();
                let key = DocumentAttrKey::from_be_bytes(array);
                Some(Ok((key.attribute, value)))
            },
            Some(Err(e)) => Some(Err(e)),
            None => None,
        }
    }
}
meilidb-data/src/database/error.rs (new file, 57 lines)
@@ -0,0 +1,57 @@
use std::{error, fmt};
use crate::serde::SerializerError;

#[derive(Debug)]
pub enum Error {
    SchemaDiffer,
    SchemaMissing,
    WordIndexMissing,
    MissingDocumentId,
    SledError(sled::Error),
    FstError(fst::Error),
    BincodeError(bincode::Error),
    SerializerError(SerializerError),
}

impl From<sled::Error> for Error {
    fn from(error: sled::Error) -> Error {
        Error::SledError(error)
    }
}

impl From<fst::Error> for Error {
    fn from(error: fst::Error) -> Error {
        Error::FstError(error)
    }
}

impl From<bincode::Error> for Error {
    fn from(error: bincode::Error) -> Error {
        Error::BincodeError(error)
    }
}

impl From<SerializerError> for Error {
    fn from(error: SerializerError) -> Error {
        Error::SerializerError(error)
    }
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        use self::Error::*;
        match self {
            SchemaDiffer => write!(f, "schemas differ"),
            SchemaMissing => write!(f, "this index does not have a schema"),
            WordIndexMissing => write!(f, "this index does not have a word index"),
            MissingDocumentId => write!(f, "document id is missing"),
            SledError(e) => write!(f, "sled error; {}", e),
            FstError(e) => write!(f, "fst error; {}", e),
            BincodeError(e) => write!(f, "bincode error; {}", e),
            SerializerError(e) => write!(f, "serializer error; {}", e),
        }
    }
}

impl error::Error for Error { }
meilidb-data/src/database/index.rs (new file, 126 lines)
@@ -0,0 +1,126 @@
use sdset::SetBuf;
use std::collections::HashSet;
use std::sync::Arc;

use arc_swap::{ArcSwap, Lease};
use meilidb_core::criterion::Criteria;
use meilidb_core::{DocIndex, Store, DocumentId, QueryBuilder};
use rmp_serde::decode::Error as RmpError;
use serde::de;

use crate::ranked_map::RankedMap;
use crate::schema::Schema;
use crate::serde::Deserializer;

use super::{Error, CustomSettings};
use super::{RawIndex, DocumentsAddition, DocumentsDeletion};

#[derive(Clone)]
pub struct Index(pub ArcSwap<InnerIndex>);

pub struct InnerIndex {
    pub words: fst::Set,
    pub schema: Schema,
    pub ranked_map: RankedMap,
    pub raw: RawIndex, // TODO this will be a snapshot in the future
}

impl Index {
    pub fn from_raw(raw: RawIndex) -> Result<Index, Error> {
        let words = match raw.main.words_set()? {
            Some(words) => words,
            None => fst::Set::default(),
        };

        let schema = match raw.main.schema()? {
            Some(schema) => schema,
            None => return Err(Error::SchemaMissing),
        };

        let ranked_map = match raw.main.ranked_map()? {
            Some(map) => map,
            None => RankedMap::default(),
        };

        let inner = InnerIndex { words, schema, ranked_map, raw };
        let index = Index(ArcSwap::new(Arc::new(inner)));

        Ok(index)
    }

    pub fn query_builder(&self) -> QueryBuilder<IndexLease> {
        let lease = IndexLease(self.0.lease());
        QueryBuilder::new(lease)
    }

    pub fn query_builder_with_criteria<'c>(
        &self,
        criteria: Criteria<'c>,
    ) -> QueryBuilder<'c, IndexLease>
    {
        let lease = IndexLease(self.0.lease());
        QueryBuilder::with_criteria(lease, criteria)
    }

    pub fn lease_inner(&self) -> Lease<Arc<InnerIndex>> {
        self.0.lease()
    }

    pub fn schema(&self) -> Schema {
        self.0.lease().schema.clone()
    }

    pub fn custom_settings(&self) -> CustomSettings {
        self.0.lease().raw.custom.clone()
    }

    pub fn documents_addition(&self) -> DocumentsAddition {
        let ranked_map = self.0.lease().ranked_map.clone();
        DocumentsAddition::new(self, ranked_map)
    }

    pub fn documents_deletion(&self) -> DocumentsDeletion {
        DocumentsDeletion::new(self)
    }

    pub fn document<T>(
        &self,
        fields: Option<&HashSet<&str>>,
        id: DocumentId,
    ) -> Result<Option<T>, RmpError>
    where T: de::DeserializeOwned,
    {
        let schema = &self.lease_inner().schema;
        let fields = fields
            .map(|fields| {
                fields
                    .into_iter()
                    .filter_map(|name| schema.attribute(name))
                    .collect()
            });

        let mut deserializer = Deserializer {
            document_id: id,
            index: &self,
            fields: fields.as_ref(),
        };

        // TODO: currently we return an error if all document fields are missing,
        // returning None would have been better
        T::deserialize(&mut deserializer).map(Some)
    }
}

pub struct IndexLease(Lease<Arc<InnerIndex>>);

impl Store for IndexLease {
    type Error = Error;

    fn words(&self) -> Result<&fst::Set, Self::Error> {
        Ok(&self.0.words)
    }

    fn word_indexes(&self, word: &[u8]) -> Result<Option<SetBuf<DocIndex>>, Self::Error> {
        Ok(self.0.raw.words.doc_indexes(word)?)
    }
}
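Editor's note: `Index` wraps its state in an `ArcSwap` so that searches read one consistent snapshot while updates swap in a whole new `InnerIndex`. A standalone sketch of that pattern follows (it assumes the 0.x arc-swap `lease`/`store` API the diff uses; not part of the commit):

use std::sync::Arc;
use arc_swap::ArcSwap;

fn main() {
    let state = ArcSwap::new(Arc::new(vec!["hello"]));

    // a reader takes a cheap lease on the current snapshot
    let snapshot = state.lease();
    assert_eq!(snapshot.len(), 1);

    // a writer builds a new value and swaps it in atomically;
    // the old lease keeps pointing at the previous snapshot
    state.store(Arc::new(vec!["hello", "world"]));
    assert_eq!(snapshot.len(), 1);
    assert_eq!(state.lease().len(), 2);
}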
meilidb-data/src/database/main_index.rs (new file, 62 lines)
@@ -0,0 +1,62 @@
use std::sync::Arc;

use crate::ranked_map::RankedMap;
use crate::schema::Schema;

use super::Error;

#[derive(Clone)]
pub struct MainIndex(pub Arc<sled::Tree>);

impl MainIndex {
    pub fn schema(&self) -> Result<Option<Schema>, Error> {
        match self.0.get("schema")? {
            Some(bytes) => {
                let schema = Schema::read_from_bin(bytes.as_ref())?;
                Ok(Some(schema))
            },
            None => Ok(None),
        }
    }

    pub fn set_schema(&self, schema: &Schema) -> Result<(), Error> {
        let mut bytes = Vec::new();
        schema.write_to_bin(&mut bytes)?;
        self.0.set("schema", bytes)?;
        Ok(())
    }

    pub fn words_set(&self) -> Result<Option<fst::Set>, Error> {
        match self.0.get("words")? {
            Some(bytes) => {
                let len = bytes.len();
                let value = bytes.into();
                let fst = fst::raw::Fst::from_shared_bytes(value, 0, len)?;
                Ok(Some(fst::Set::from(fst)))
            },
            None => Ok(None),
        }
    }

    pub fn set_words_set(&self, value: &fst::Set) -> Result<(), Error> {
        self.0.set("words", value.as_fst().as_bytes())?;
        Ok(())
    }

    pub fn ranked_map(&self) -> Result<Option<RankedMap>, Error> {
        match self.0.get("ranked-map")? {
            Some(bytes) => {
                let ranked_map = RankedMap::read_from_bin(bytes.as_ref())?;
                Ok(Some(ranked_map))
            },
            None => Ok(None),
        }
    }

    pub fn set_ranked_map(&self, value: &RankedMap) -> Result<(), Error> {
        let mut bytes = Vec::new();
        value.write_to_bin(&mut bytes)?;
        // must use the same key as the one read in ranked_map() above
        self.0.set("ranked-map", bytes)?;
        Ok(())
    }
}
meilidb-data/src/database/mod.rs (new file, 175 lines)
@@ -0,0 +1,175 @@
use std::collections::hash_map::Entry;
use std::collections::{HashSet, HashMap};
use std::path::Path;
use std::sync::{Arc, RwLock};

use crate::Schema;

mod custom_settings;
mod docs_words_index;
mod documents_addition;
mod documents_deletion;
mod documents_index;
mod error;
mod index;
mod main_index;
mod raw_index;
mod words_index;

pub use self::error::Error;
pub use self::index::Index;
pub use self::custom_settings::CustomSettings;

use self::docs_words_index::DocsWordsIndex;
use self::documents_addition::DocumentsAddition;
use self::documents_deletion::DocumentsDeletion;
use self::documents_index::DocumentsIndex;
use self::index::InnerIndex;
use self::main_index::MainIndex;
use self::raw_index::RawIndex;
use self::words_index::WordsIndex;

pub struct Database {
    cache: RwLock<HashMap<String, Arc<Index>>>,
    inner: sled::Db,
}

impl Database {
    pub fn start_default<P: AsRef<Path>>(path: P) -> Result<Database, Error> {
        let cache = RwLock::new(HashMap::new());
        let inner = sled::Db::start_default(path)?;
        Ok(Database { cache, inner })
    }

    pub fn indexes(&self) -> Result<Option<HashSet<String>>, Error> {
        let bytes = match self.inner.get("indexes")? {
            Some(bytes) => bytes,
            None => return Ok(None),
        };

        let indexes = bincode::deserialize(&bytes)?;
        Ok(Some(indexes))
    }

    fn set_indexes(&self, value: &HashSet<String>) -> Result<(), Error> {
        let bytes = bincode::serialize(value)?;
        self.inner.set("indexes", bytes)?;
        Ok(())
    }

    pub fn open_index(&self, name: &str) -> Result<Option<Arc<Index>>, Error> {
        {
            let cache = self.cache.read().unwrap();
            if let Some(index) = cache.get(name).cloned() {
                return Ok(Some(index))
            }
        }

        let mut cache = self.cache.write().unwrap();
        let index = match cache.entry(name.to_string()) {
            Entry::Occupied(occupied) => {
                occupied.get().clone()
            },
            Entry::Vacant(vacant) => {
                if !self.indexes()?.map_or(false, |x| x.contains(name)) {
                    return Ok(None)
                }

                let main = {
                    let tree = self.inner.open_tree(name)?;
                    MainIndex(tree)
                };

                let words = {
                    let tree_name = format!("{}-words", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    WordsIndex(tree)
                };

                let docs_words = {
                    let tree_name = format!("{}-docs-words", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    DocsWordsIndex(tree)
                };

                let documents = {
                    let tree_name = format!("{}-documents", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    DocumentsIndex(tree)
                };

                let custom = {
                    let tree_name = format!("{}-custom", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    CustomSettings(tree)
                };

                let raw_index = RawIndex { main, words, docs_words, documents, custom };
                let index = Index::from_raw(raw_index)?;

                vacant.insert(Arc::new(index)).clone()
            },
        };

        Ok(Some(index))
    }

    pub fn create_index(&self, name: &str, schema: Schema) -> Result<Arc<Index>, Error> {
        let mut cache = self.cache.write().unwrap();

        let index = match cache.entry(name.to_string()) {
            Entry::Occupied(occupied) => {
                occupied.get().clone()
            },
            Entry::Vacant(vacant) => {
                let main = {
                    let tree = self.inner.open_tree(name)?;
                    MainIndex(tree)
                };

                if let Some(prev_schema) = main.schema()? {
                    if prev_schema != schema {
                        return Err(Error::SchemaDiffer)
                    }
                }

                main.set_schema(&schema)?;

                let words = {
                    let tree_name = format!("{}-words", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    WordsIndex(tree)
                };

                let docs_words = {
                    let tree_name = format!("{}-docs-words", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    DocsWordsIndex(tree)
                };

                let documents = {
                    let tree_name = format!("{}-documents", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    DocumentsIndex(tree)
                };

                let custom = {
                    let tree_name = format!("{}-custom", name);
                    let tree = self.inner.open_tree(tree_name)?;
                    CustomSettings(tree)
                };

                let mut indexes = self.indexes()?.unwrap_or_else(HashSet::new);
                indexes.insert(name.to_string());
                self.set_indexes(&indexes)?;

                let raw_index = RawIndex { main, words, docs_words, documents, custom };
                let index = Index::from_raw(raw_index)?;

                vacant.insert(Arc::new(index)).clone()
            },
        };

        Ok(index)
    }
}
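Editor's note: end to end, the new `Database` type above is used as follows; a minimal sketch mirroring the `meilidb-data/tests/updates.rs` test added later in this diff (the path and attribute names here are made up):

use meilidb_data::Database;
use meilidb_data::schema::{SchemaBuilder, STORED, INDEXED};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut builder = SchemaBuilder::with_identifier("objectId");
    builder.new_attribute("objectId", STORED | INDEXED);
    builder.new_attribute("title", STORED | INDEXED);
    let schema = builder.build();

    // opens (or creates) the sled database; the per-index trees
    // ("{name}", "{name}-words", ...) are opened on index creation
    let database = Database::start_default("target/example-db")?;
    let _index = database.create_index("example", schema)?;

    // a created index is cached and can be reopened by name
    assert!(database.open_index("example")?.is_some());
    Ok(())
}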
meilidb-data/src/database/raw_index.rs (new file, 10 lines)
@@ -0,0 +1,10 @@
use super::{MainIndex, WordsIndex, DocsWordsIndex, DocumentsIndex, CustomSettings};

#[derive(Clone)]
pub struct RawIndex {
    pub main: MainIndex,
    pub words: WordsIndex,
    pub docs_words: DocsWordsIndex,
    pub documents: DocumentsIndex,
    pub custom: CustomSettings,
}
meilidb-data/src/database/words_index.rs (new file, 32 lines)
@@ -0,0 +1,32 @@
use std::sync::Arc;

use meilidb_core::DocIndex;
use sdset::{Set, SetBuf};
use zerocopy::{LayoutVerified, AsBytes};

#[derive(Clone)]
pub struct WordsIndex(pub Arc<sled::Tree>);

impl WordsIndex {
    pub fn doc_indexes(&self, word: &[u8]) -> sled::Result<Option<SetBuf<DocIndex>>> {
        match self.0.get(word)? {
            Some(bytes) => {
                let layout = LayoutVerified::new_slice(bytes.as_ref()).expect("invalid layout");
                let slice = layout.into_slice();
                let setbuf = SetBuf::new_unchecked(slice.to_vec());
                Ok(Some(setbuf))
            },
            None => Ok(None),
        }
    }

    pub fn set_doc_indexes(&self, word: &[u8], set: &Set<DocIndex>) -> sled::Result<()> {
        self.0.set(word, set.as_bytes())?;
        Ok(())
    }

    pub fn del_doc_indexes(&self, word: &[u8]) -> sled::Result<()> {
        self.0.del(word)?;
        Ok(())
    }
}
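Editor's note: `WordsIndex` stores a `&[DocIndex]` as its raw bytes and reinterprets them on read through zerocopy's `LayoutVerified`, avoiding per-element decoding. A dependency-free sketch of the same round-trip idea with plain `u32`s (not part of the commit):

use std::convert::TryInto;

fn main() {
    let values: Vec<u32> = vec![1, 2, 3];

    // write side: view the slice as raw bytes (what `set.as_bytes()` does)
    let bytes: Vec<u8> = values.iter().flat_map(|v| v.to_ne_bytes()).collect();

    // read side: reinterpret the bytes as values again
    // (zerocopy does this without copying, after checking size and alignment)
    let back: Vec<u32> = bytes
        .chunks_exact(4)
        .map(|chunk| u32::from_ne_bytes(chunk.try_into().unwrap()))
        .collect();

    assert_eq!(values, back);
}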
meilidb-data/src/document_attr_key.rs (new file, 69 lines)
@@ -0,0 +1,69 @@
use meilidb_core::DocumentId;
use crate::schema::SchemaAttr;

#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct DocumentAttrKey {
    pub document_id: DocumentId,
    pub attribute: SchemaAttr,
}

impl DocumentAttrKey {
    pub fn new(document_id: DocumentId, attribute: SchemaAttr) -> DocumentAttrKey {
        DocumentAttrKey { document_id, attribute }
    }

    pub fn to_be_bytes(self) -> [u8; 10] {
        let mut output = [0u8; 10];

        let document_id = self.document_id.0.to_be_bytes();
        let attribute = self.attribute.0.to_be_bytes();

        unsafe {
            use std::{mem::size_of, ptr::copy_nonoverlapping};

            let output = output.as_mut_ptr();
            copy_nonoverlapping(document_id.as_ptr(), output, size_of::<u64>());

            let output = output.add(size_of::<u64>());
            copy_nonoverlapping(attribute.as_ptr(), output, size_of::<u16>());
        }

        output
    }

    pub fn from_be_bytes(bytes: [u8; 10]) -> DocumentAttrKey {
        let document_id;
        let attribute;

        unsafe {
            use std::ptr::read_unaligned;

            let pointer = bytes.as_ptr() as *const _;
            let document_id_bytes = read_unaligned(pointer);
            document_id = u64::from_be_bytes(document_id_bytes);

            let pointer = pointer.add(1) as *const _;
            let attribute_bytes = read_unaligned(pointer);
            attribute = u16::from_be_bytes(attribute_bytes);
        }

        DocumentAttrKey {
            document_id: DocumentId(document_id),
            attribute: SchemaAttr(attribute),
        }
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn to_from_be_bytes() {
        let document_id = DocumentId(67578308);
        let schema_attr = SchemaAttr(3456);
        let x = DocumentAttrKey::new(document_id, schema_attr);

        assert_eq!(x, DocumentAttrKey::from_be_bytes(x.to_be_bytes()));
    }
}
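Editor's note: `DocumentAttrKey` serializes to big-endian bytes so that byte-wise key order in sled matches `(document_id, attribute)` order; that property is what makes the `start..=end` range scans in `DocumentsIndex` correct. A dependency-free sketch of the property (not part of the commit):

fn main() {
    // same layout as DocumentAttrKey::to_be_bytes: 8 id bytes, then 2 attr bytes
    let key = |document_id: u64, attribute: u16| -> [u8; 10] {
        let mut out = [0u8; 10];
        out[..8].copy_from_slice(&document_id.to_be_bytes());
        out[8..].copy_from_slice(&attribute.to_be_bytes());
        out
    };

    // all attributes of one document sort contiguously...
    assert!(key(1, 0) < key(1, 42));
    assert!(key(1, 42) < key(1, u16::max_value()));
    // ...and strictly before any key of the next document
    assert!(key(1, u16::max_value()) < key(2, 0));
}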
@@ -1,45 +0,0 @@ (file deleted)
use std::error::Error;

use byteorder::{ReadBytesExt, WriteBytesExt};

use meilidb_core::{Index as WordIndex};
use meilidb_core::data::DocIds;
use meilidb_core::write_to_bytes::WriteToBytes;
use meilidb_core::shared_data_cursor::{SharedDataCursor, FromSharedDataCursor};

enum NewIndexEvent<'a> {
    RemovedDocuments(&'a DocIds),
    UpdatedDocuments(&'a WordIndex),
}

impl<'a> WriteToBytes for NewIndexEvent<'a> {
    fn write_to_bytes(&self, bytes: &mut Vec<u8>) {
        match self {
            NewIndexEvent::RemovedDocuments(doc_ids) => {
                let _ = bytes.write_u8(0);
                doc_ids.write_to_bytes(bytes);
            },
            NewIndexEvent::UpdatedDocuments(index) => {
                let _ = bytes.write_u8(1);
                index.write_to_bytes(bytes);
            }
        }
    }
}

enum IndexEvent {
    RemovedDocuments(DocIds),
    UpdatedDocuments(WordIndex),
}

impl FromSharedDataCursor for IndexEvent {
    type Error = Box<Error>;

    fn from_shared_data_cursor(cursor: &mut SharedDataCursor) -> Result<Self, Self::Error> {
        match cursor.read_u8()? {
            0 => DocIds::from_shared_data_cursor(cursor).map(IndexEvent::RemovedDocuments),
            1 => WordIndex::from_shared_data_cursor(cursor).map(IndexEvent::UpdatedDocuments),
            _ => Err("invalid index event type".into()),
        }
    }
}
@@ -1,11 +1,10 @@
-use std::collections::BTreeMap;
+use std::collections::{BTreeMap, HashMap};
 use std::convert::TryFrom;
 
 use deunicode::deunicode_with_tofu;
 use meilidb_core::{DocumentId, DocIndex};
-use meilidb_core::{Index as WordIndex, IndexBuilder as WordIndexBuilder};
 use meilidb_tokenizer::{is_cjk, Tokenizer, SeqTokenizer, Token};
-use sdset::Set;
+use sdset::SetBuf;
 
 use crate::SchemaAttr;
 
@@ -13,27 +12,39 @@ type Word = Vec<u8>; // TODO make it be a SmallVec
 
 pub struct Indexer {
     word_limit: usize, // the maximum number of indexed words
-    indexed: BTreeMap<Word, Vec<DocIndex>>,
+    words_doc_indexes: BTreeMap<Word, Vec<DocIndex>>,
+    docs_words: HashMap<DocumentId, Vec<Word>>,
 }
 
+pub struct Indexed {
+    pub words_doc_indexes: BTreeMap<Word, SetBuf<DocIndex>>,
+    pub docs_words: HashMap<DocumentId, fst::Set>,
+}
 
 impl Indexer {
     pub fn new() -> Indexer {
-        Indexer {
-            word_limit: 1000,
-            indexed: BTreeMap::new(),
-        }
+        Indexer::with_word_limit(1000)
     }
 
     pub fn with_word_limit(limit: usize) -> Indexer {
         Indexer {
            word_limit: limit,
-            indexed: BTreeMap::new(),
+            words_doc_indexes: BTreeMap::new(),
+            docs_words: HashMap::new(),
         }
     }
 
     pub fn index_text(&mut self, id: DocumentId, attr: SchemaAttr, text: &str) {
         for token in Tokenizer::new(text) {
-            let must_continue = index_token(token, id, attr, self.word_limit, &mut self.indexed);
+            let must_continue = index_token(
+                token,
+                id,
+                attr,
+                self.word_limit,
+                &mut self.words_doc_indexes,
+                &mut self.docs_words,
+            );
 
             if !must_continue { break }
         }
     }
@@ -43,23 +54,38 @@ impl Indexer {
     {
         let iter = iter.into_iter();
         for token in SeqTokenizer::new(iter) {
-            let must_continue = index_token(token, id, attr, self.word_limit, &mut self.indexed);
+            let must_continue = index_token(
+                token,
+                id,
+                attr,
+                self.word_limit,
+                &mut self.words_doc_indexes,
+                &mut self.docs_words,
+            );
 
             if !must_continue { break }
         }
     }
 
-    pub fn build(self) -> WordIndex {
-        let mut builder = WordIndexBuilder::new();
+    pub fn build(self) -> Indexed {
+        let words_doc_indexes = self.words_doc_indexes
+            .into_iter()
+            .map(|(word, mut indexes)| {
+                indexes.sort_unstable();
+                indexes.dedup();
+                (word, SetBuf::new_unchecked(indexes))
+            }).collect();
 
-        for (key, mut indexes) in self.indexed {
-            indexes.sort_unstable();
-            indexes.dedup();
+        let docs_words = self.docs_words
+            .into_iter()
+            .map(|(id, mut words)| {
+                words.sort_unstable();
+                words.dedup();
+                (id, fst::Set::from_iter(words).unwrap())
+            })
+            .collect();
 
-            let indexes = Set::new_unchecked(&indexes);
-            builder.insert(key, indexes).unwrap();
-        }
 
-        builder.build()
+        Indexed { words_doc_indexes, docs_words }
     }
 }
 
@@ -68,7 +94,8 @@ fn index_token(
     id: DocumentId,
     attr: SchemaAttr,
     word_limit: usize,
-    indexed: &mut BTreeMap<Word, Vec<DocIndex>>,
+    words_doc_indexes: &mut BTreeMap<Word, Vec<DocIndex>>,
+    docs_words: &mut HashMap<DocumentId, Vec<Word>>,
 ) -> bool
 {
     if token.word_index >= word_limit { return false }
@@ -78,7 +105,8 @@ fn index_token(
     match token_to_docindex(id, attr, token) {
         Some(docindex) => {
             let word = Vec::from(token.word);
-            indexed.entry(word).or_insert_with(Vec::new).push(docindex);
+            words_doc_indexes.entry(word.clone()).or_insert_with(Vec::new).push(docindex);
+            docs_words.entry(id).or_insert_with(Vec::new).push(word);
         },
         None => return false,
     }
@@ -90,7 +118,8 @@ fn index_token(
     match token_to_docindex(id, attr, token) {
         Some(docindex) => {
             let word = Vec::from(token.word);
-            indexed.entry(word).or_insert_with(Vec::new).push(docindex);
+            words_doc_indexes.entry(word.clone()).or_insert_with(Vec::new).push(docindex);
+            docs_words.entry(id).or_insert_with(Vec::new).push(word);
         },
         None => return false,
     }
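Editor's note: `build` above may only call `SetBuf::new_unchecked` because it sorts and deduplicates each `Vec` first; `new_unchecked` skips the check that the slice is a strictly ordered set. A tiny sketch of that normalization step (not part of the commit):

fn main() {
    // positions collected in document order may repeat and be unsorted
    let mut positions = vec![3u32, 1, 3, 2];

    positions.sort_unstable();
    positions.dedup();

    // now strictly increasing, which is the invariant new_unchecked assumes
    assert_eq!(positions, vec![1, 2, 3]);
}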
@@ -1,12 +1,13 @@
 mod database;
-mod index_event;
+mod document_attr_key;
 mod indexer;
 mod number;
 mod ranked_map;
 mod serde;
 pub mod schema;
 
-pub use self::database::{Database, Index};
+pub use sled;
+pub use self::database::{Database, Index, CustomSettings};
 pub use self::number::Number;
 pub use self::ranked_map::RankedMap;
 pub use self::schema::{Schema, SchemaAttr};
@@ -1,5 +1,27 @@
+use std::io::{Read, Write};
+
 use hashbrown::HashMap;
 use meilidb_core::DocumentId;
 
 use crate::{SchemaAttr, Number};
 
-pub type RankedMap = HashMap<(DocumentId, SchemaAttr), Number>;
+#[derive(Debug, Default, Clone, PartialEq, Eq)]
+pub struct RankedMap(HashMap<(DocumentId, SchemaAttr), Number>);
+
+impl RankedMap {
+    pub fn insert(&mut self, document: DocumentId, attribute: SchemaAttr, number: Number) {
+        self.0.insert((document, attribute), number);
+    }
+
+    pub fn get(&self, document: DocumentId, attribute: SchemaAttr) -> Option<Number> {
+        self.0.get(&(document, attribute)).cloned()
+    }
+
+    pub fn read_from_bin<R: Read>(reader: R) -> bincode::Result<RankedMap> {
+        bincode::deserialize_from(reader).map(RankedMap)
+    }
+
+    pub fn write_to_bin<W: Write>(&self, writer: W) -> bincode::Result<()> {
+        bincode::serialize_into(writer, &self.0)
+    }
+}
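Editor's note: turning the old type alias into a newtype gives `RankedMap` its own small API. A short usage sketch with hypothetical values (whether `Number` supports direct equality comparison is not shown in this diff, so the check below only tests presence):

use meilidb_core::DocumentId;
use meilidb_data::{Number, RankedMap, SchemaAttr};

fn main() {
    let mut ranked_map = RankedMap::default();
    let (document, attribute) = (DocumentId(7), SchemaAttr::new(1));

    // store the ranked value of one document attribute
    ranked_map.insert(document, attribute, Number::Unsigned(42));

    // and read it back by the same (document, attribute) pair
    assert!(ranked_map.get(document, attribute).is_some());
}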
@@ -134,12 +134,12 @@ impl Schema {
         Ok(())
     }
 
-    pub(crate) fn read_from_bin<R: Read>(reader: R) -> bincode::Result<Schema> {
+    pub fn read_from_bin<R: Read>(reader: R) -> bincode::Result<Schema> {
         let builder: SchemaBuilder = bincode::deserialize_from(reader)?;
         Ok(builder.build())
     }
 
-    pub(crate) fn write_to_bin<W: Write>(&self, writer: W) -> bincode::Result<()> {
+    pub fn write_to_bin<W: Write>(&self, writer: W) -> bincode::Result<()> {
         let identifier = self.inner.identifier.clone();
         let attributes = self.attributes_ordered();
         let builder = SchemaBuilder { identifier, attributes };
@@ -186,12 +186,16 @@ impl Schema {
 pub struct SchemaAttr(pub u16);
 
 impl SchemaAttr {
-    pub fn new(value: u16) -> SchemaAttr {
+    pub const fn new(value: u16) -> SchemaAttr {
         SchemaAttr(value)
     }
 
-    pub fn min() -> SchemaAttr {
-        SchemaAttr(0)
+    pub const fn min() -> SchemaAttr {
+        SchemaAttr(u16::min_value())
     }
 
+    pub const fn max() -> SchemaAttr {
+        SchemaAttr(u16::max_value())
+    }
+
     pub fn next(self) -> Option<SchemaAttr> {
@@ -201,10 +205,6 @@ impl SchemaAttr {
     pub fn prev(self) -> Option<SchemaAttr> {
         self.0.checked_sub(1).map(SchemaAttr)
     }
-
-    pub fn max() -> SchemaAttr {
-        SchemaAttr(u16::MAX)
-    }
 }
 
 impl fmt::Display for SchemaAttr {
@@ -24,7 +24,7 @@ impl ser::Serializer for ConvertToNumber {
         Ok(Number::Unsigned(u64::from(value)))
     }
 
-    fn serialize_char(self, value: char) -> Result<Self::Ok, Self::Error> {
+    fn serialize_char(self, _value: char) -> Result<Self::Ok, Self::Error> {
         Err(SerializerError::UnrankableType { type_name: "char" })
     }
 
@@ -16,7 +16,7 @@ impl ser::Serializer for ConvertToString {
     type SerializeStruct = ser::Impossible<Self::Ok, Self::Error>;
     type SerializeStructVariant = ser::Impossible<Self::Ok, Self::Error>;
 
-    fn serialize_bool(self, value: bool) -> Result<Self::Ok, Self::Error> {
+    fn serialize_bool(self, _value: bool) -> Result<Self::Ok, Self::Error> {
         Err(SerializerError::UnserializableType { type_name: "boolean" })
     }
 
@@ -6,12 +6,12 @@ use rmp_serde::decode::{Deserializer as RmpDeserializer, ReadReader};
 use rmp_serde::decode::{Error as RmpError};
 use serde::{de, forward_to_deserialize_any};
 
-use crate::database::RawIndex;
+use crate::database::Index;
 use crate::SchemaAttr;
 
 pub struct Deserializer<'a> {
     pub document_id: DocumentId,
-    pub raw_index: &'a RawIndex,
+    pub index: &'a Index,
     pub fields: Option<&'a HashSet<SchemaAttr>>,
 }
 
@@ -26,15 +26,18 @@ impl<'de, 'a, 'b> de::Deserializer<'de> for &'b mut Deserializer<'a>
     }
 
     forward_to_deserialize_any! {
-        bool u8 u16 u32 u64 i8 i16 i32 i64 f32 f64 char str string unit seq
-        bytes byte_buf unit_struct tuple_struct
-        identifier tuple ignored_any option newtype_struct enum struct
+        bool i8 i16 i32 i64 i128 u8 u16 u32 u64 u128 f32 f64 char str string
+        bytes byte_buf option unit unit_struct newtype_struct seq tuple
+        tuple_struct struct enum identifier ignored_any
     }
 
     fn deserialize_map<V>(self, visitor: V) -> Result<V::Value, Self::Error>
     where V: de::Visitor<'de>
     {
-        let document_attributes = self.raw_index.get_document_fields(self.document_id);
+        let schema = &self.index.lease_inner().schema;
+        let documents = &self.index.lease_inner().raw.documents;
+
+        let document_attributes = documents.document_fields(self.document_id);
         let document_attributes = document_attributes.filter_map(|result| {
             match result {
                 Ok(value) => Some(value),
@@ -45,9 +48,10 @@ impl<'de, 'a, 'b> de::Deserializer<'de> for &'b mut Deserializer<'a>
             },
         }
     });
-    let iter = document_attributes.filter_map(|(_, attr, value)| {
+
+    let iter = document_attributes.filter_map(|(attr, value)| {
         if self.fields.map_or(true, |f| f.contains(&attr)) {
-            let attribute_name = self.raw_index.schema().attribute_name(attr);
+            let attribute_name = schema.attribute_name(attr);
             Some((attribute_name, Value::new(value)))
         } else {
             None
@@ -56,11 +56,11 @@ impl<'a> ser::Serializer for ExtractDocumentId<'a> {
         f64 => serialize_f64,
     }
 
-    fn serialize_str(self, value: &str) -> Result<Self::Ok, Self::Error> {
+    fn serialize_str(self, _value: &str) -> Result<Self::Ok, Self::Error> {
         Err(SerializerError::UnserializableType { type_name: "str" })
     }
 
-    fn serialize_bytes(self, _v: &[u8]) -> Result<Self::Ok, Self::Error> {
+    fn serialize_bytes(self, _value: &[u8]) -> Result<Self::Ok, Self::Error> {
         Err(SerializerError::UnserializableType { type_name: "&[u8]" })
     }
 
@@ -2,7 +2,6 @@ use meilidb_core::DocumentId;
 use serde::ser;
 use serde::Serialize;
 
-use crate::database::RawIndex;
 use crate::indexer::Indexer as RawIndexer;
 use crate::schema::SchemaAttr;
 use super::{SerializerError, ConvertToString};
@@ -24,7 +23,7 @@ impl<'a> ser::Serializer for Indexer<'a> {
     type SerializeStruct = StructSerializer<'a>;
     type SerializeStructVariant = ser::Impossible<Self::Ok, Self::Error>;
 
-    fn serialize_bool(self, value: bool) -> Result<Self::Ok, Self::Error> {
+    fn serialize_bool(self, _value: bool) -> Result<Self::Ok, Self::Error> {
         Err(SerializerError::UnindexableType { type_name: "boolean" })
     }
 
@@ -22,10 +22,15 @@ pub use self::convert_to_number::ConvertToNumber;
 pub use self::indexer::Indexer;
 pub use self::serializer::Serializer;
 
+use std::collections::BTreeMap;
 use std::{fmt, error::Error};
 
+use meilidb_core::DocumentId;
 use rmp_serde::encode::Error as RmpError;
 use serde::ser;
 
 use crate::number::ParseNumberError;
+use crate::schema::SchemaAttr;
 
 #[derive(Debug)]
 pub enum SerializerError {
@@ -95,3 +100,19 @@ impl From<ParseNumberError> for SerializerError {
         SerializerError::ParseNumberError(error)
     }
 }
+
+pub struct RamDocumentStore(BTreeMap<(DocumentId, SchemaAttr), Vec<u8>>);
+
+impl RamDocumentStore {
+    pub fn new() -> RamDocumentStore {
+        RamDocumentStore(BTreeMap::new())
+    }
+
+    pub fn set_document_field(&mut self, id: DocumentId, attr: SchemaAttr, value: Vec<u8>) {
+        self.0.insert((id, attr), value);
+    }
+
+    pub fn into_inner(self) -> BTreeMap<(DocumentId, SchemaAttr), Vec<u8>> {
+        self.0
+    }
+}
@@ -1,15 +1,14 @@
 use meilidb_core::DocumentId;
 use serde::ser;
 
-use crate::database::RawIndex;
-use crate::ranked_map::RankedMap;
 use crate::indexer::Indexer as RawIndexer;
-use crate::schema::{Schema, SchemaAttr};
-use super::{SerializerError, ConvertToString, ConvertToNumber, Indexer};
+use crate::ranked_map::RankedMap;
+use crate::schema::Schema;
+use super::{RamDocumentStore, SerializerError, ConvertToString, ConvertToNumber, Indexer};
 
 pub struct Serializer<'a> {
     pub schema: &'a Schema,
-    pub index: &'a RawIndex,
+    pub document_store: &'a mut RamDocumentStore,
     pub indexer: &'a mut RawIndexer,
     pub ranked_map: &'a mut RankedMap,
     pub document_id: DocumentId,
@@ -134,7 +133,7 @@ impl<'a> ser::Serializer for Serializer<'a> {
         Ok(MapSerializer {
             schema: self.schema,
             document_id: self.document_id,
-            index: self.index,
+            document_store: self.document_store,
             indexer: self.indexer,
             ranked_map: self.ranked_map,
             current_key_name: None,
@@ -150,7 +149,7 @@ impl<'a> ser::Serializer for Serializer<'a> {
         Ok(StructSerializer {
             schema: self.schema,
             document_id: self.document_id,
-            index: self.index,
+            document_store: self.document_store,
             indexer: self.indexer,
             ranked_map: self.ranked_map,
         })
@@ -171,7 +170,7 @@ impl<'a> ser::Serializer for Serializer<'a> {
 pub struct MapSerializer<'a> {
     schema: &'a Schema,
     document_id: DocumentId,
-    index: &'a RawIndex,
+    document_store: &'a mut RamDocumentStore,
     indexer: &'a mut RawIndexer,
     ranked_map: &'a mut RankedMap,
     current_key_name: Option<String>,
@@ -208,7 +207,7 @@ impl<'a> ser::SerializeMap for MapSerializer<'a> {
         serialize_value(
             self.schema,
             self.document_id,
-            self.index,
+            self.document_store,
             self.indexer,
             self.ranked_map,
             &key,
@@ -224,7 +223,7 @@ impl<'a> ser::SerializeMap for MapSerializer<'a> {
 pub struct StructSerializer<'a> {
     schema: &'a Schema,
     document_id: DocumentId,
-    index: &'a RawIndex,
+    document_store: &'a mut RamDocumentStore,
     indexer: &'a mut RawIndexer,
     ranked_map: &'a mut RankedMap,
 }
@@ -243,7 +242,7 @@ impl<'a> ser::SerializeStruct for StructSerializer<'a> {
         serialize_value(
             self.schema,
             self.document_id,
-            self.index,
+            self.document_store,
             self.indexer,
             self.ranked_map,
             key,
@@ -259,7 +258,7 @@ impl<'a> ser::SerializeStruct for StructSerializer<'a> {
 fn serialize_value<T: ?Sized>(
     schema: &Schema,
     document_id: DocumentId,
-    index: &RawIndex,
+    document_store: &mut RamDocumentStore,
     indexer: &mut RawIndexer,
     ranked_map: &mut RankedMap,
     key: &str,
@@ -272,7 +271,7 @@ where T: ser::Serialize,
 
     if props.is_stored() {
         let value = rmp_serde::to_vec_named(value)?;
-        index.set_document_attribute(document_id, attr, value)?;
+        document_store.set_document_field(document_id, attr, value);
     }
 
     if props.is_indexed() {
@@ -285,9 +284,8 @@ where T: ser::Serialize,
     }
 
     if props.is_ranked() {
-        let key = (document_id, attr);
         let number = value.serialize(ConvertToNumber)?;
-        ranked_map.insert(key, number);
+        ranked_map.insert(document_id, attr, number);
     }
 }
meilidb-data/tests/updates.rs (new file, 67 lines)
@@ -0,0 +1,67 @@
use serde_json::json;
use meilidb_data::{Database, Schema};
use meilidb_data::schema::{SchemaBuilder, STORED, INDEXED};

fn simple_schema() -> Schema {
    let mut builder = SchemaBuilder::with_identifier("objectId");
    builder.new_attribute("objectId", STORED | INDEXED);
    builder.new_attribute("title", STORED | INDEXED);
    builder.build()
}

#[test]
fn insert_delete_document() {
    let tmp_dir = tempfile::tempdir().unwrap();
    let database = Database::start_default(&tmp_dir).unwrap();

    let schema = simple_schema();
    let index = database.create_index("hello", schema).unwrap();

    let doc1 = json!({ "objectId": 123, "title": "hello" });

    let mut addition = index.documents_addition();
    addition.update_document(&doc1).unwrap();
    addition.finalize().unwrap();

    let docs = index.query_builder().query("hello", 0..10).unwrap();
    assert_eq!(docs.len(), 1);
    assert_eq!(index.document(None, docs[0].id).unwrap().as_ref(), Some(&doc1));

    let mut deletion = index.documents_deletion();
    deletion.delete_document(&doc1).unwrap();
    deletion.finalize().unwrap();

    let docs = index.query_builder().query("hello", 0..10).unwrap();
    assert_eq!(docs.len(), 0);
}

#[test]
fn replace_document() {
    let tmp_dir = tempfile::tempdir().unwrap();
    let database = Database::start_default(&tmp_dir).unwrap();

    let schema = simple_schema();
    let index = database.create_index("hello", schema).unwrap();

    let doc1 = json!({ "objectId": 123, "title": "hello" });
    let doc2 = json!({ "objectId": 123, "title": "coucou" });

    let mut addition = index.documents_addition();
    addition.update_document(&doc1).unwrap();
    addition.finalize().unwrap();

    let docs = index.query_builder().query("hello", 0..10).unwrap();
    assert_eq!(docs.len(), 1);
    assert_eq!(index.document(None, docs[0].id).unwrap().as_ref(), Some(&doc1));

    let mut addition = index.documents_addition();
    addition.update_document(&doc2).unwrap();
    addition.finalize().unwrap();

    let docs = index.query_builder().query("hello", 0..10).unwrap();
    assert_eq!(docs.len(), 0);

    let docs = index.query_builder().query("coucou", 0..10).unwrap();
    assert_eq!(docs.len(), 1);
    assert_eq!(index.document(None, docs[0].id).unwrap().as_ref(), Some(&doc2));
}
@@ -5,23 +5,19 @@ version = "0.3.1"
 authors = ["Kerollmops <renault.cle@gmail.com>"]
 
 [dependencies]
-meilidb-core = { path = "../meilidb-core", version = "0.1.0" }
 meilidb-data = { path = "../meilidb-data", version = "0.1.0" }
-meilidb-tokenizer = { path = "../meilidb-tokenizer", version = "0.1.0" }
-
-[features]
-default = []
-i128 = ["meilidb-core/i128"]
-nightly = ["meilidb-core/nightly"]
+serde = { version = "1.0.91", features = ["derive"] }
+serde_json = "1.0.39"
+tempfile = "3.0.7"
+tide = "0.2.0"
 
 [dev-dependencies]
+meilidb-core = { path = "../meilidb-core", version = "0.1.0" }
 csv = "1.0.7"
 env_logger = "0.6.1"
 jemallocator = "0.1.9"
 quickcheck = "0.8.2"
 rand = "0.6.5"
 rand_xorshift = "0.1.1"
 serde = { version = "1.0.90", features = ["derive"] }
 structopt = "0.2.15"
 tempfile = "3.0.7"
 termcolor = "1.0.4"
@@ -52,7 +52,7 @@ fn index(
 {
     let database = Database::start_default(database_path)?;
 
-    let index = database.create_index("default".to_string(), schema.clone())?;
+    let index = database.create_index("default", schema.clone())?;
 
     let mut rdr = csv::Reader::from_path(csv_data_path)?;
     let mut raw_record = csv::StringRecord::new();
@@ -161,7 +161,7 @@ fn main() -> Result<(), Box<Error>> {
     let start_total = Instant::now();
 
     let builder = index.query_builder();
-    let documents = builder.query(query, 0..opt.number_results);
+    let documents = builder.query(query, 0..opt.number_results)?;
 
     let mut retrieve_duration = Duration::default();
@@ -1,26 +0,0 @@ (file deleted)
use std::io::{self, BufReader, BufRead};
use std::collections::HashSet;
use std::path::Path;
use std::fs::File;

#[derive(Debug)]
pub struct CommonWords(HashSet<String>);

impl CommonWords {
    pub fn from_file<P>(path: P) -> io::Result<Self>
    where P: AsRef<Path>
    {
        let file = File::open(path)?;
        let file = BufReader::new(file);
        let mut set = HashSet::new();
        for line in file.lines().filter_map(|l| l.ok()) {
            let word = line.trim().to_owned();
            set.insert(word);
        }
        Ok(CommonWords(set))
    }

    pub fn contains(&self, word: &str) -> bool {
        self.0.contains(word)
    }
}
@@ -1,7 +0,0 @@ (file deleted)
#![cfg_attr(feature = "nightly", feature(test))]

mod common_words;
mod sort_by_attr;

pub use self::sort_by_attr::SortByAttr;
pub use self::common_words::CommonWords;
meilidb/src/main.rs (new file, 74 lines)
@@ -0,0 +1,74 @@
#![feature(async_await)]

use std::collections::HashMap;

use serde::{Deserialize, Serialize};
use tide::querystring::ExtractQuery;
use tide::http::status::StatusCode;
use tide::{error::ResultExt, response, App, Context, EndpointResult};
use serde_json::Value;
use meilidb_data::{Database, Schema};

#[derive(Debug, Serialize, Deserialize, Clone)]
struct SearchQuery {
    q: String,
}

async fn create_index(mut cx: Context<Database>) -> EndpointResult<()> {
    let index: String = cx.param("index").client_err()?;
    let schema = cx.body_bytes().await.client_err()?;
    let schema = Schema::from_toml(schema.as_slice()).unwrap();

    let database = cx.app_data();
    database.create_index(&index, schema).unwrap();

    Ok(())
}

async fn update_documents(mut cx: Context<Database>) -> EndpointResult<()> {
    let index: String = cx.param("index").client_err()?;
    let document: HashMap<String, Value> = cx.body_json().await.client_err()?;

    let database = cx.app_data();
    let index = match database.open_index(&index).unwrap() {
        Some(index) => index,
        None => Err(StatusCode::NOT_FOUND)?,
    };

    let mut addition = index.documents_addition();
    addition.update_document(document).unwrap();
    addition.finalize().unwrap();

    Ok(())
}

async fn search_index(cx: Context<Database>) -> EndpointResult {
    let index: String = cx.param("index").client_err()?;
    let query: SearchQuery = cx.url_query()?;

    let database = cx.app_data();

    let index = match database.open_index(&index).unwrap() {
        Some(index) => index,
        None => Err(StatusCode::NOT_FOUND)?,
    };

    let documents_ids = index.query_builder().query(&query.q, 0..100).unwrap();
    let documents: Vec<Value> = documents_ids
        .into_iter()
        .filter_map(|x| index.document(None, x.id).unwrap())
        .collect();

    Ok(response::json(documents))
}

fn main() -> std::io::Result<()> {
    let tmp_dir = tempfile::tempdir().unwrap();
    let database = Database::start_default(&tmp_dir).unwrap();
    let mut app = App::new(database);

    app.at("/:index").post(create_index).put(update_documents);
    app.at("/:index/search").get(search_index);

    app.serve("127.0.0.1:8000")
}
@@ -1,121 +0,0 @@ (file deleted)
use std::cmp::Ordering;
use std::error::Error;
use std::fmt;

use meilidb_core::criterion::Criterion;
use meilidb_core::RawDocument;
use meilidb_data::{Schema, SchemaAttr, RankedMap};

/// A helper struct that permits sorting documents by
/// some of their stored attributes.
///
/// # Note
///
/// If a document cannot be deserialized it will be considered [`None`][].
///
/// Deserialized documents are compared like `Some(doc0).cmp(&Some(doc1))`,
/// so you must check the [`Ord`] of `Option` implementation.
///
/// [`None`]: https://doc.rust-lang.org/std/option/enum.Option.html#variant.None
/// [`Ord`]: https://doc.rust-lang.org/std/option/enum.Option.html#impl-Ord
///
/// # Example
///
/// ```ignore
/// use serde_derive::Deserialize;
/// use meilidb::rank::criterion::*;
///
/// let custom_ranking = SortByAttr::lower_is_better(&ranked_map, &schema, "published_at")?;
///
/// let builder = CriteriaBuilder::with_capacity(8)
///        .add(SumOfTypos)
///        .add(NumberOfWords)
///        .add(WordsProximity)
///        .add(SumOfWordsAttribute)
///        .add(SumOfWordsPosition)
///        .add(Exact)
///        .add(custom_ranking)
///        .add(DocumentId);
///
/// let criterion = builder.build();
///
/// ```
pub struct SortByAttr<'a> {
    ranked_map: &'a RankedMap,
    attr: SchemaAttr,
    reversed: bool,
}

impl<'a> SortByAttr<'a> {
    pub fn lower_is_better(
        ranked_map: &'a RankedMap,
        schema: &Schema,
        attr_name: &str,
    ) -> Result<SortByAttr<'a>, SortByAttrError>
    {
        SortByAttr::new(ranked_map, schema, attr_name, false)
    }

    pub fn higher_is_better(
        ranked_map: &'a RankedMap,
        schema: &Schema,
        attr_name: &str,
    ) -> Result<SortByAttr<'a>, SortByAttrError>
    {
        SortByAttr::new(ranked_map, schema, attr_name, true)
    }

    fn new(
        ranked_map: &'a RankedMap,
        schema: &Schema,
        attr_name: &str,
        reversed: bool,
    ) -> Result<SortByAttr<'a>, SortByAttrError>
    {
        let attr = match schema.attribute(attr_name) {
            Some(attr) => attr,
            None => return Err(SortByAttrError::AttributeNotFound),
        };

        if !schema.props(attr).is_ranked() {
            return Err(SortByAttrError::AttributeNotRegisteredForRanking);
        }

        Ok(SortByAttr { ranked_map, attr, reversed })
    }
}

impl<'a> Criterion for SortByAttr<'a> {
    fn evaluate(&self, lhs: &RawDocument, rhs: &RawDocument) -> Ordering {
        let lhs = self.ranked_map.get(&(lhs.id, self.attr));
        let rhs = self.ranked_map.get(&(rhs.id, self.attr));

        match (lhs, rhs) {
            (Some(lhs), Some(rhs)) => {
                let order = lhs.cmp(&rhs);
                if self.reversed { order.reverse() } else { order }
            },
            (None, Some(_)) => Ordering::Greater,
            (Some(_), None) => Ordering::Less,
            (None, None) => Ordering::Equal,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum SortByAttrError {
    AttributeNotFound,
    AttributeNotRegisteredForRanking,
}

impl fmt::Display for SortByAttrError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        use SortByAttrError::*;
        match self {
            AttributeNotFound => f.write_str("attribute not found in the schema"),
            AttributeNotRegisteredForRanking => f.write_str("attribute not registered for ranking"),
        }
    }
}

impl Error for SortByAttrError { }