# mega-mini-indexer
A prototype of concurrent indexing; the index only contains postings ids.

## Introduction

This engine is a prototype; do not use it in production.
This is one of the most advanced search engines I have worked on.
It currently only supports the proximity criterion.

### Compile all the binaries

```bash
cargo build --release --bins
```

## Indexing

It can index large numbers of documents quickly. So far I have indexed:
- 109m songs (song and artist name) in 21min, taking 29GB on disk.
- 12m cities (name, timezone and country ID) in 3min13s, taking 3.3GB on disk.

All of that on a $39/month machine with 4 cores.

### Index your documents

You first need to split your CSV yourself; the engine is currently not able to split it for you.
The bigger the split size, the faster the engine will index your documents, but the higher the RAM usage will be.

Here we use [the awesome xsv tool](https://github.com/BurntSushi/xsv) to split our big dataset.

```bash
cat my-data.csv | xsv split -s 2000000 my-data-split/
```

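If xsv is not available, a rough fallback using only standard tools could look like the sketch below. The `split_csv` function and the `part-` file prefix are hypothetical names, not part of the engine. The approach assumes that no field contains an embedded newline, a case xsv handles and this sketch does not. It repeats the header row in every part, as xsv does, on the assumption that the indexer expects each part to be a complete CSV file.

```bash
# Hypothetical fallback: split a CSV by rows, repeating the header in each part.
# Assumes no field contains an embedded newline.
split_csv() {
    local input=$1 rows=$2 outdir=$3
    local header chunk
    mkdir -p "$outdir"
    header=$(head -n 1 "$input")
    # Split everything after the header into chunks of at most $rows lines.
    tail -n +2 "$input" | split -l "$rows" - "$outdir/part-"
    # Prepend the header to every chunk and give it a .csv extension.
    for chunk in "$outdir"/part-*; do
        [ -e "$chunk" ] || continue
        { printf '%s\n' "$header"; cat "$chunk"; } > "$chunk.csv"
        rm "$chunk"
    done
}

# Usage (same split size as the xsv command above):
# split_csv my-data.csv 2000000 my-data-split
```
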
Once your data is ready, you can feed it to the engine, which spawns one thread per CSV part, up to one thread per core.

```bash
./target/release/indexer --db my-data.mmdb ../my-data-split/*
```

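Since the indexer spawns one thread per CSV part, capped at the number of cores, the expected parallelism can be estimated before launching it. This is a minimal sketch: `estimate_threads` is a hypothetical helper, not part of the engine, and the `my-data-split` path is assumed to be the directory produced by the split step.

```bash
# Hypothetical helper: one thread per CSV part, capped at the number of cores.
estimate_threads() {
    local parts=$1 cores=$2
    if [ "$parts" -lt "$cores" ]; then
        echo "$parts"
    else
        echo "$cores"
    fi
}

# Count the CSV parts (assumed to live in my-data-split) and the online cores.
parts=$(( $(find my-data-split -name '*.csv' 2>/dev/null | wc -l) ))
cores=$(getconf _NPROCESSORS_ONLN)
echo "expected indexing threads: $(estimate_threads "$parts" "$cores")"
```

If the estimate is lower than the core count, a smaller split size would produce more parts and keep every core busy, at the cost of slower indexing per part as described above.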
## Querying

The engine is designed to handle very frequent words as efficiently as words of any other frequency.
This is why you can search for "asia dubai" (the most common timezone) in the cities dataset in no time (59ms), even with 12m documents.

We haven't modified the algorithm to handle queries that are scattered over multiple attributes; this is an open issue (#4).

### Exposing a website to query the database

Once you've indexed the dataset, you will be able to access it with your browser.

```bash
./target/release/serve -l 0.0.0.0:8700 --db my-data.mmdb
```

## Gaps

There are many ways to make the engine search for too long and consume too much CPU.
This can, for example, be achieved by querying the engine for "the best of the do" on the songs and subreddits datasets.

There are plenty of ways to improve the algorithms, and issues explaining potential improvements are being opened and will continue to be.