Benchmarks
==========

## TOC

- [Run the benchmarks](#run-the-benchmarks)
- [Comparison between benchmarks](#comparison-between-benchmarks)
- [Datasets](#datasets)

## Run the benchmarks

### On our private server

The Meili team self-hosts its own GitHub runner to run benchmarks on our dedicated bare metal server.

To trigger the benchmark workflow:
- Go to the `Actions` tab of this repository.
- Select the `Benchmarks` workflow on the left.
- Click on `Run workflow` in the blue banner.
- Select the branch on which you want to run the benchmarks and select the dataset you want (default: `songs`).
- Finally, click on `Run workflow`.

This GitHub workflow runs the benchmarks and pushes the `critcmp` report to a DigitalOcean Space (S3-compatible object storage).

The name of the uploaded file is displayed in the workflow.

_[More about critcmp](https://github.com/BurntSushi/critcmp)._

💡 To compare the just-uploaded benchmark with another one, check out the [next section](#comparison-between-benchmarks).

### On your machine

To run all the benchmarks (~5h):

```bash
cargo bench
```

To run only the `search_songs` (~1h), `search_wiki` (~3h), `search_geo` (~20m) or `indexing` (~2h) benchmark:

```bash
cargo bench --bench <benchmark name>
```

By default, the datasets are downloaded and uncompressed automatically into the target directory.<br>
If you don't want to download the datasets again every time you change something in the code, you can specify a custom directory with the environment variable `MILLI_BENCH_DATASETS_PATH`:

```bash
mkdir ~/datasets
MILLI_BENCH_DATASETS_PATH=~/datasets cargo bench --bench search_songs # the four datasets are downloaded
touch build.rs
MILLI_BENCH_DATASETS_PATH=~/datasets cargo bench --bench search_songs # the code is compiled again but the datasets are not downloaded
```
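
If you run benchmarks often, the steps above can be wrapped in a small shell helper. The sketch below is only an illustration: `datasets_dir` and `bench_cached` are hypothetical names that are not part of this repository, and the `~/datasets` fallback is an assumption taken from the example above.

```bash
# Hypothetical helper: print the dataset directory to use.
# Keeps MILLI_BENCH_DATASETS_PATH when it is already set,
# otherwise falls back to ~/datasets (an assumed default).
datasets_dir() {
  printf '%s\n' "${MILLI_BENCH_DATASETS_PATH:-$HOME/datasets}"
}

# Hypothetical wrapper: run one benchmark against that directory,
# creating it first if it does not exist yet.
bench_cached() {
  local dir
  dir="$(datasets_dir)"
  mkdir -p "$dir"
  MILLI_BENCH_DATASETS_PATH="$dir" cargo bench --bench "$1"
}

# Example (not executed here):
#   bench_cached search_songs
```

With this in place, the datasets should only be downloaded on the first run, as long as the directory is kept between runs.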

## Comparison between benchmarks

The benchmark reports we push are generated with `critcmp`, so we also use `critcmp` to display the results of a single benchmark or to compare the results of several benchmarks.

We provide a script to download and display the comparison report.

Requirements:
- `grep`
- `curl`
- [`critcmp`](https://github.com/BurntSushi/critcmp)

List the available files in the DO Space:

```bash
./benchmarks/scripts/list.sh
```
```bash
songs_main_09a4321.json
songs_geosearch_24ec456.json
search_songs_main_cb45a10b.json
```

Run the comparison script:

```bash
# we get the result of ONE benchmark, this gives you an idea of how much time an operation took
./benchmarks/scripts/compare.sh songs_geosearch_24ec456.json
# we compare two benchmarks
./benchmarks/scripts/compare.sh songs_main_09a4321.json songs_geosearch_24ec456.json
# we compare three benchmarks
./benchmarks/scripts/compare.sh songs_main_09a4321.json songs_geosearch_24ec456.json search_songs_main_cb45a10b.json
```
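
When the DO Space holds many reports, it can help to narrow the list before picking files to compare. The sketch below assumes the list script prints one file name per line, as in the sample output above. `filter_reports` is a hypothetical helper, not a script shipped in this repository.

```bash
# Hypothetical helper: keep only the report names that start with the
# given prefix followed by an underscore, and that end in .json.
# Reads file names on stdin, one per line.
filter_reports() {
  local prefix="$1"
  grep -E "^${prefix}_.*\.json$"
}

# Demo with the sample names from this README.
printf '%s\n' \
  songs_main_09a4321.json \
  songs_geosearch_24ec456.json \
  search_songs_main_cb45a10b.json | filter_reports songs
# prints:
#   songs_main_09a4321.json
#   songs_geosearch_24ec456.json
```

In practice you would pipe the output of the list script into the helper instead of the `printf` call.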

## Datasets

The benchmarks use the following datasets:
- `smol-songs`
- `smol-wiki`
- `movies`
- `smol-all-countries`

### Songs

`smol-songs` is a subset of the [`songs.csv` dataset](https://milli-benchmarks.fra1.digitaloceanspaces.com/datasets/songs.csv.gz).

It was generated with this command:

```bash
xsv sample --seed 42 1000000 songs.csv -o smol-songs.csv
```

_[Download the generated `smol-songs` dataset](https://milli-benchmarks.fra1.digitaloceanspaces.com/datasets/smol-songs.csv.gz)._

### Wiki

`smol-wiki` is a subset of the [`wiki-articles.csv` dataset](https://milli-benchmarks.fra1.digitaloceanspaces.com/datasets/wiki-articles.csv.gz).

It was generated with the following command:

```bash
xsv sample --seed 42 500000 wiki-articles.csv -o smol-wiki-articles.csv
```

_[Download the `smol-wiki` dataset](https://milli-benchmarks.fra1.digitaloceanspaces.com/datasets/smol-wiki-articles.csv.gz)._

### Movies

`movies` is a really small dataset we use as our example in the [getting started guide](https://www.meilisearch.com/docs/learn/getting_started/quick_start).

_[Download the `movies` dataset](https://www.meilisearch.com/movies.json)._


### All Countries

`smol-all-countries` is a subset of the [`all-countries.csv` dataset](https://milli-benchmarks.fra1.digitaloceanspaces.com/datasets/all-countries.csv.gz).
It has been converted to JSON Lines and then edited to match our format for the `_geo` field.

It was generated with the following command:
```bash
bat all-countries.csv.gz | gunzip | xsv sample --seed 42 1000000 | csv2json-lite | sd '"latitude":"(.*?)","longitude":"(.*?)"' '"_geo": { "lat": $1, "lng": $2 }' | sd '\[|\]|,$' '' | gzip > smol-all-countries.jsonl.gz
```
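
To see what the two `sd` steps in that pipeline do, here is a minimal sketch that reproduces them with `sed -E` on a single made-up record. It is a stand-in rather than the command that produced the dataset: `to_geo` and `strip_array_syntax` are hypothetical names, the sample record is invented, and `[^"]*` replaces the non-greedy `.*?`, which `sed` does not support.

```bash
# Stand-in for the first `sd` step: fold the quoted latitude/longitude
# string fields into a single `_geo` object with unquoted numbers.
to_geo() {
  sed -E 's/"latitude":"([^"]*)","longitude":"([^"]*)"/"_geo": { "lat": \1, "lng": \2 }/'
}

# Stand-in for the second `sd` step: drop the JSON array brackets and the
# trailing comma, so that each line holds one standalone object.
strip_array_syntax() {
  sed -E 's/\[|\]|,$//g'
}

# Invented sample record, shaped like one line of a JSON array.
echo '[{"name":"Paris","latitude":"48.85","longitude":"2.35"},' \
  | to_geo | strip_array_syntax
# prints: {"name":"Paris","_geo": { "lat": 48.85, "lng": 2.35 }}
```

Like the original pattern, the second step removes every `[` and `]` on the line, including any that appear inside field values.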

_[Download the `smol-all-countries` dataset](https://milli-benchmarks.fra1.digitaloceanspaces.com/datasets/smol-all-countries.jsonl.gz)._