meili-bors[bot] 33b7c574ea
Merge #4090
4090: Diff indexing r=ManyTheFish a=ManyTheFish

This pull request aims to reduce the indexing time by computing a difference between the data added to the index and the data removed from the index before writing to LMDB.

## Why focus on reducing the writes in LMDB?

The indexing in Meilisearch is split into 3 main phases:
1) The computation, i.e. the extraction, of the data (Multi-threaded)
2) The writing of the data in LMDB (Mono-threaded)
3) The processing of the prefix databases (Mono-threaded)

see below:
![Screenshot 2023-09-28 at 20 01 45](https://github.com/meilisearch/meilisearch/assets/6482087/51513162-7c39-4244-978b-2c6b60c43a56)


Because the writing is mono-threaded, it represents a bottleneck in the indexing; reducing the number of writes in LMDB relieves the pressure on the main thread and should reduce the global time spent on the indexing.
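As a minimal illustration of the idea (hypothetical helper name, not the actual milli code), merging the additions and deletions before touching LMDB means each posting-list key is written, or removed, at most once, instead of once by the addition pipeline and once by the deletion pipeline:

```rust
use roaring::RoaringBitmap;

/// Hypothetical merge of one posting-list entry: the additions and deletions
/// computed during extraction are applied together, so each key is written
/// (or removed) exactly once instead of once per pipeline.
fn merge_posting(
    current: Option<RoaringBitmap>, // bitmap currently stored in LMDB, if any
    added: &RoaringBitmap,          // docids gaining this key in the update
    deleted: &RoaringBitmap,        // docids losing this key in the update
) -> Option<RoaringBitmap> {
    let mut bitmap = current.unwrap_or_default();
    bitmap -= deleted; // drop the removed documents
    bitmap |= added;   // insert the new ones
    // An empty bitmap means the key can simply be deleted from LMDB.
    (!bitmap.is_empty()).then_some(bitmap)
}

fn main() {
    let current: RoaringBitmap = (0..4).collect();
    let added: RoaringBitmap = [4u32, 5].into_iter().collect();
    let deleted: RoaringBitmap = [1u32, 2].into_iter().collect();
    let merged = merge_posting(Some(current), &added, &deleted).unwrap();
    assert_eq!(merged.iter().collect::<Vec<u32>>(), vec![0, 3, 4, 5]);
}
```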

## Give Feedback

We created [a dedicated discussion](https://github.com/meilisearch/meilisearch/discussions/4196) for users to try this new feature and to give feedback on bugs or performance issues.

## Technical approach
### Part 1: merge the addition and the deletion process
This part:
a) Aims to reduce the time spent on indexing when only the filterable/sortable fields of documents are updated, for example:
  - Updating the number of "likes" or "stars" of a song or a movie
  - Updating the "stock count" or the "price" of a product

b) Aims to reduce the time spent writing to LMDB, which should reduce the global indexing time on highly multi-threaded machines by easing the writing bottleneck.

c) Aims to reduce the average time spent deleting documents, without having to keep the soft-deleted-documents implementation

- [x] Create a preprocessing function that creates the diff-based document chunks (`OBKV<fid, OBKV<AddDel, value>>`); see the sketch after this list
  - [x] and clearly separate the faceted fields and the searchable fields into two different chunks
- Change the parameters of the input extractors to take an `OBKV<fid, OBKV<AddDel, value>>` instead of an `OBKV<fid, value>`.
  - [x] extract_docid_word_positions
  - [x] extract_geo_points
  - [x] extract_vector_points
  - [x] extract_fid_docid_facet_values
- Adapt the searchable extractors to the new diff-chunks
  - [x] extract_fid_word_count_docids
  - [x] extract_word_pair_proximity_docids
  - [x] extract_word_position_docids
  - [x] extract_word_docids
- Adapt the facet extractors to the new diff-chunks
  - [x] extract_facet_number_docids
  - [x] extract_facet_string_docids
  - [x] extract_fid_docid_facet_values
  - [x] FacetsUpdate
- [x] Adapt the prefix database extractors ⚠️ ⚠️ 
- [x] Make the LMDB writer remove the document_ids to delete at the same time the new document_ids are added
- [x] Remove document deletion pipeline
  - [x] remove `new_documents_ids` and `replaced_documents_ids` entirely
  - [x] reuse extracted external id from transform instead of re-extracting in `TypedChunks::Documents`
  - [x] Remove deletion pipeline after autobatcher
  - [x] remove autobatcher deletion pipeline
    - [x] everything uses `IndexOperation::DocumentOperation`
    - [x] repair deletion by internal id for filter by delete
    - [x] Improve the deletion via internal ids by avoiding iterating over the whole set of external document ids.  
- [x] Remove soft-deleted documents
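
To make the diff-based chunk format above more concrete, here is a minimal sketch of the `OBKV<fid, OBKV<AddDel, value>>` shape, using plain `BTreeMap`s instead of the real `obkv` encoding, and hypothetical helper names:

```rust
use std::collections::BTreeMap;

/// Deletion/addition side of a field value, mirroring the inner
/// `OBKV<AddDel, value>` level of the diff-based chunks.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum AddDel {
    Deletion,
    Addition,
}

/// Conceptual stand-in for `OBKV<fid, OBKV<AddDel, value>>`:
/// field id -> (deletion/addition side -> serialized value).
type DiffDocument = BTreeMap<u16, BTreeMap<AddDel, Vec<u8>>>;

/// Hypothetical preprocessing step building the diff between the old and new
/// versions of a document: removed fields only get a Deletion side, new fields
/// only an Addition side, and fields present in both versions get both sides.
fn diff_document(old: &BTreeMap<u16, Vec<u8>>, new: &BTreeMap<u16, Vec<u8>>) -> DiffDocument {
    let mut diff = DiffDocument::new();
    for (&fid, value) in old {
        diff.entry(fid).or_default().insert(AddDel::Deletion, value.clone());
    }
    for (&fid, value) in new {
        diff.entry(fid).or_default().insert(AddDel::Addition, value.clone());
    }
    diff
}

fn main() {
    let old = BTreeMap::from([(0u16, b"Carol".to_vec()), (1, b"3 stars".to_vec())]);
    let new = BTreeMap::from([(0u16, b"Carol".to_vec()), (1, b"4 stars".to_vec())]);
    let diff = diff_document(&old, &new);
    // Field 1 carries both a Deletion ("3 stars") and an Addition ("4 stars").
    assert_eq!(diff[&1][&AddDel::Deletion], b"3 stars".to_vec());
}
```

With this shape, each extractor can read the Deletion side to compute what to remove and the Addition side to compute what to insert, instead of being run once by the deletion pipeline and once by the addition pipeline.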

#### FIXME

- [x] field distribution is not correctly updated after deletion
- [x] missing documents in the tests of tokenizer_customization

### Part 2: Only compute the documents field by field
This part aims to reduce the global indexing time for any kind of partial document modification, on machines of any size, from mono-threaded ones to highly multi-threaded ones.

- [ ] Make the preprocessing function only send the fields that changed to the extractors (see the sketch after this list)
- [ ] remove the `word_docids` and `exact_word_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] replace the `word_pair_proximity_docids` database with a `word_pair_proximity_fid_docids` database and adapt the search (⚠️ could impact the search performances)
- [ ] Adapt the prefix database extractors ⚠️ ⚠️
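
As a rough, self-contained sketch of the first item above (hypothetical names, with a simple tuple standing in for the nested `OBKV<AddDel, value>` level of the earlier sketch), the preprocessing function would drop every field whose deletion and addition sides carry the same bytes, so the extractors never see unchanged fields:

```rust
use std::collections::BTreeMap;

/// (deletion side, addition side) of a field value.
type DelAdd = (Option<Vec<u8>>, Option<Vec<u8>>);

/// Hypothetical filter keeping only the fields whose value actually changed
/// between the old and new versions of a document; the extractors then never
/// have to reprocess the untouched fields.
fn changed_fields_only(diff: BTreeMap<u16, DelAdd>) -> BTreeMap<u16, DelAdd> {
    diff.into_iter().filter(|(_fid, (del, add))| del != add).collect()
}

fn main() {
    let diff = BTreeMap::from([
        (0u16, (Some(b"Carol".to_vec()), Some(b"Carol".to_vec()))),     // unchanged
        (1u16, (Some(b"3 stars".to_vec()), Some(b"4 stars".to_vec()))), // changed
    ]);
    let changed = changed_fields_only(diff);
    assert!(changed.contains_key(&1) && !changed.contains_key(&0));
}
```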

## Technical Concerns
- The part 1 implementation could increase the indexing time on the smallest machines (with few threads) by increasing the extraction time (multi-threaded) more than it reduces the writing time (mono-threaded)
- The part 2 implementation needs to change the databases, which could have a significant impact on search performance
- The prefix databases are a bit special to process and may be painful to adapt to diff-based indexing

Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>

Website | Roadmap | Meilisearch Cloud | Blog | Documentation | FAQ | Discord


A lightning-fast search engine that fits effortlessly into your apps, websites, and workflow 🔍

Meilisearch helps you shape a delightful search experience in a snap, offering features that work out-of-the-box to speed up your workflow.

Demo screenshots: a light-themed and a dark-themed application for finding movies screening near the user.

🔥 Try it! 🔥

Features

  • Search-as-you-type: find search results in less than 50 milliseconds
  • Typo tolerance: get relevant matches even when queries contain typos and misspellings
  • Filtering and faceted search: enhance your users' search experience with custom filters and build a faceted search interface in a few lines of code
  • Sorting: sort results based on price, date, or pretty much anything else your users need
  • Synonym support: configure synonyms to include more relevant content in your search results
  • Geosearch: filter and sort documents based on geographic data
  • Extensive language support: search datasets in any language, with optimized support for Chinese, Japanese, Hebrew, and languages using the Latin alphabet
  • Security management: control which users can access what data with API keys that allow fine-grained permissions handling
  • Multi-Tenancy: personalize search results for any number of application tenants
  • Highly Customizable: customize Meilisearch to your specific needs or use our out-of-the-box and hassle-free presets
  • RESTful API: integrate Meilisearch in your technical stack with our plugins and SDKs
  • Easy to install, deploy, and maintain

📖 Documentation

You can consult Meilisearch's documentation at https://www.meilisearch.com/docs.

🚀 Getting started

For basic instructions on how to set up Meilisearch, add documents to an index, and search for documents, take a look at our Quick Start guide.
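
As a rough sketch of those basic steps (hypothetical data, assuming a local instance started with no master key, and talking to the HTTP API directly through the `reqwest` and `serde_json` crates rather than an official SDK):

```rust
use serde_json::json;

// Requires reqwest with the `blocking` and `json` features enabled.
fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::blocking::Client::new();
    let base = "http://localhost:7700";

    // Add a couple of documents to the `movies` index (created on the fly).
    let documents = json!([
        { "id": 1, "title": "Carol", "genres": ["Romance", "Drama"] },
        { "id": 2, "title": "Wonder Woman", "genres": ["Action", "Adventure"] },
    ]);
    let task: serde_json::Value = client
        .post(format!("{base}/indexes/movies/documents"))
        .json(&documents)
        .send()?
        .json()?;
    println!("enqueued indexing task: {task}");

    // Search the index. Indexing is asynchronous, so a real program should
    // wait for the task above to finish before searching.
    let results: serde_json::Value = client
        .post(format!("{base}/indexes/movies/search"))
        .json(&json!({ "q": "caorl" })) // typo tolerance still finds "Carol"
        .send()?
        .json()?;
    println!("{results}");
    Ok(())
}
```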

You may also want to check out Meilisearch 101 for an introduction to some of Meilisearch's most popular features.

Supercharge your Meilisearch experience

Say goodbye to server deployment and manual updates with Meilisearch Cloud. No credit card required.

🧰 SDKs & integration tools

Install one of our SDKs in your project for seamless integration between Meilisearch and your favorite language or framework!

Take a look at the complete Meilisearch integration list.

Logos belonging to different languages and frameworks supported by Meilisearch, including React, Ruby on Rails, Go, Rust, and PHP

⚙️ Advanced usage

Experienced users will want to keep our API Reference close at hand.

We also offer a wide range of dedicated guides to all Meilisearch features, such as filtering, sorting, geosearch, API keys, and tenant tokens.

Finally, for more in-depth information, refer to our articles explaining fundamental Meilisearch concepts such as documents and indexes.

📊 Telemetry

Meilisearch collects anonymized data from users to help us improve our product. You can deactivate this whenever you want.

To request deletion of collected data, please write to us at privacy@meilisearch.com. Don't forget to include your Instance UID in the message, as this helps us quickly find and delete your data.

If you want to know more about the kind of data we collect and what we use it for, check the telemetry section of our documentation.

📫 Get in touch!

Meilisearch is a search engine created by Meili, a software development company based in France and with team members all over the world. Want to know more about us? Check out our blog!

🗞 Subscribe to our newsletter if you don't want to miss any updates! We promise we won't clutter your mailbox: we only send one edition every two months.

💌 Want to make a suggestion or give feedback? Here are some of the channels where you can reach us:

Thank you for your support!

👩‍💻 Contributing

Meilisearch is, and will always be, open-source! If you want to contribute to the project, please take a look at our contribution guidelines.

📦 Versioning

Meilisearch releases and their associated binaries are available on this GitHub page.

The binaries are versioned following SemVer conventions. To know more, read our versioning policy.

Unlike the binaries, crates in this repository are not currently available on crates.io and do not follow SemVer conventions.
