# mega-mini-indexer

A prototype of concurrent indexing that only contains postings ids.

## Introduction

This engine is a prototype, do not use it in production.
This is one of the most advanced search engines I have worked on.
It currently only supports the proximity criterion.

### Compile all the binaries

```bash
cargo build --release --bins
```
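
The rest of this README uses two of the binaries produced by this build, `indexer` and `serve`; a quick sanity check that they landed in `target/release/` (assuming these are the only binaries you need, the repository may define others):

```bash
# List the two binaries used in the Indexing and Querying sections below.
ls -l target/release/indexer target/release/serve
```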

## Indexing

It can index a large number of documents quickly; I have already managed to index:
- 109m songs (song and artist name) in 21min, taking 29GB on disk.
- 12m cities (name, timezone and country ID) in 3min13s, taking 3.3GB on disk.

All of that on a $39/month machine with 4 cores.

### Index your documents

You first need to split your CSV yourself, as the engine is currently not able to do it.
The bigger the split size, the faster the engine will index your documents, but the higher the RAM usage will be.

Here we use [the awesome xsv tool](https://github.com/BurntSushi/xsv) to split our big dataset.

```bash
cat my-data.csv | xsv split -s 2000000 my-data-split/
```
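
The split directory should now contain one CSV file per chunk of 2,000,000 records; a quick look at it, assuming xsv's usual naming scheme where each chunk is named after the index of its first record:

```bash
# The resulting chunks are what you pass to the indexer in the next step.
ls my-data-split/
# 0.csv  2000000.csv  4000000.csv  ...
```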

Once your data is ready you can feed the engine with it; it will spawn one thread per CSV part, up to one thread per CPU core.

```bash
./target/release/indexer --db my-data.mmdb ../my-data-split/*
```
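
Since the indexer spawns one thread per CSV part, capped at the number of cores, making at least as many parts as you have cores keeps every core busy; a minimal sizing sketch, where the record count, header line, and core count are illustrative assumptions:

```bash
# Hypothetical sizing: aim for one part per core, keeping parts as large as
# possible (larger parts index faster but also use more RAM, as noted above).
CORES=$(nproc)
RECORDS=$(($(wc -l < my-data.csv) - 1))          # minus the CSV header line
SPLIT_SIZE=$(( (RECORDS + CORES - 1) / CORES ))  # ceiling division
cat my-data.csv | xsv split -s "$SPLIT_SIZE" my-data-split/
```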

## Querying

The engine is designed to handle very frequent words just like any other word.
This is why you can search for "asia dubai" (the most common timezone) in the cities dataset in no time (59ms), even with 12m documents.

We haven't modified the algorithm to handle queries that are scattered over multiple attributes; this is an open issue (#4).

### Exposing a website to request the database

Once you've indexed the dataset you will be able to access it with your browser.

```bash
./target/release/serve -l 0.0.0.0:8700 --db my-data.mmdb
```
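
The server listens on the address passed with `-l`; a minimal way to reach it from the same machine (any browser pointed at the port works, `xdg-open`/`open` are just shortcuts):

```bash
# 0.0.0.0 binds every interface; from the local machine, use the loopback address.
xdg-open http://127.0.0.1:8700      # Linux
# open http://127.0.0.1:8700        # macOS
```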

## Gaps

There are many ways to make the engine search for too long and consume too much CPU.
This can, for example, be achieved by querying the engine for "the best of the do" on the songs and subreddits datasets.

There are plenty of ways to improve the algorithms, and there are and will be new issues explaining potential improvements.