ManyTheFish
3f7a500f3b
Build prefix fst
2024-09-25 14:36:06 +02:00
ManyTheFish
974272f2e9
Merge branch 'main' into indexer-edition-2024
2024-09-25 07:41:16 +02:00
Clément Renault
7ad037841f
Move the tracing info to eprintln
2024-09-24 18:21:58 +02:00
Clément Renault
e0c7067355
Expose an IndexedParallelIterator to the index function
2024-09-24 17:24:59 +02:00
ManyTheFish
6e87332410
Change the way the FST is built
2024-09-24 16:28:31 +02:00
Clément Renault
2d1caf27df
Use eprintln to log
2024-09-24 15:59:50 +02:00
Clément Renault
7f148c127c
Measure the SmallVec efficiency
2024-09-24 15:32:15 +02:00
Clément Renault
4ce5d3d66d
Do not check before pushing in bitmaps
2024-09-24 09:43:16 +02:00
Clément Renault
42b093687d
Introduce the new PushOptimizedBitmap
2024-09-23 16:38:21 +02:00
Clément Renault
f00664247d
Add more stats about the channel message sent
2024-09-23 15:13:52 +02:00
Clément Renault
193d7f5d34
Add the mutualized charabia normalization
2024-09-23 14:24:25 +02:00
Clément Renault
013acb3d93
Measure merger writer channel contention
2024-09-23 11:07:59 +02:00
meili-bors[bot]
462a2329f1
Merge #4941
...
4941: Implement the binary quantization in meilisearch r=irevoire a=irevoire
# Pull Request
## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4873
## What does this PR do?
- Add a setting for binary quantization
- Once enabled, binary quantization (bq) cannot be disabled
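
A minimal sketch of the idea behind the two points above, not the code from this PR: binary quantization keeps one bit per vector dimension, and the setting only moves from disabled to enabled. The function names, the "positive component sets the bit" rule, and the error text are assumptions made for illustration.

```rust
/// Quantize each dimension to a single bit: 1 when the component is
/// strictly positive, 0 otherwise (assumed rule). Bits are packed
/// starting from the least significant bit of each byte.
fn binary_quantize(vector: &[f32]) -> Vec<u8> {
    let mut out = vec![0u8; (vector.len() + 7) / 8];
    for (i, &x) in vector.iter().enumerate() {
        if x > 0.0 {
            out[i / 8] |= 1 << (i % 8);
        }
    }
    out
}

/// Hamming distance between two quantized vectors of the same length:
/// the number of bit positions where they differ.
fn hamming_distance(a: &[u8], b: &[u8]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Hypothetical settings check: enabling is allowed, disabling an
/// already quantized embedder is rejected.
fn check_bq_transition(currently_enabled: bool, requested: bool) -> Result<bool, String> {
    if currently_enabled && !requested {
        Err("binary quantization cannot be disabled once enabled".to_string())
    } else {
        Ok(requested)
    }
}
```

Quantizing discards the magnitude of each component, which would be one reason for the one-way switch: the full-precision values cannot be rebuilt from the stored bits.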
TODO:
- [ ] Missing a bunch of tests
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-09-19 15:50:24 +00:00
Tamo
f6483cf15d
apply review comment
2024-09-19 16:47:06 +02:00
meili-bors[bot]
bd34ed01d9
Merge #4945
...
4945: Add swedish in default pipelines r=dureuill a=ManyTheFish
# Summary
## Fix Swedish support
In Swedish, the characters `å`/`ä`/`ö` are distinct letters, not variants of `a` or `o`, and should not be normalized to the same character.
Because the Swedish specialized pipeline was not activated by default, these characters were normalized even with the following settings:
```json
{
"localizedAttributes": [ { "locales": ["swe"], "attributePatterns": ["*"] } ]
}
```
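
The effect described above can be illustrated with a toy normalizer. This is only a sketch under stated assumptions: it is not Charabia's implementation, the `normalize` function and its small character table are hypothetical, and input is assumed to use precomposed characters.

```rust
/// Toy normalizer: lowercases the text and folds a few accented Latin
/// letters to ASCII, unless the locale is Swedish ("swe"), in which
/// case `å`, `ä` and `ö` are kept as distinct letters.
fn normalize(text: &str, locale: Option<&str>) -> String {
    let swedish = locale == Some("swe");
    text.chars()
        .flat_map(char::to_lowercase)
        .map(|c| match c {
            // Swedish pipeline: these are letters of their own.
            'å' | 'ä' | 'ö' if swedish => c,
            // Default pipeline: fold to the base letter.
            'å' | 'ä' | 'á' | 'à' => 'a',
            'ö' | 'ó' | 'ò' => 'o',
            other => other,
        })
        .collect()
}
```

With no locale, `Räksmörgås` and `Raksmorgas` collapse to the same token; with `Some("swe")` they stay distinct, which is the behavior the fix restores.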
## Update Charabia adding German support
German segmentation will now be activated using the setting:
```json
{
"localizedAttributes": [ { "locales": ["deu"], "attributePatterns": ["*"] } ]
}
```
# TODO
- [x] Activate Swedish Pipeline
- [x] Add a test to avoid future regressions
- [x] Update Charabia
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-19 14:42:03 +00:00
Tamo
74199f328d
Make clippy happy
2024-09-19 16:27:34 +02:00
Tamo
1113c42de0
fix broken comments
2024-09-19 16:18:36 +02:00
ManyTheFish
7d6768e4c4
Add german tokenization pipeline
2024-09-19 16:09:01 +02:00
ManyTheFish
f77661ec44
Update Charabia v0.9.1
2024-09-19 16:08:59 +02:00
Tamo
b8fd85a46d
Get rid of a useless collect before iterating over the readers
2024-09-19 15:57:38 +02:00
Tamo
fd43c6c404
Improve the error message explaining you can't un-bq an embedder
2024-09-19 15:51:29 +02:00
Tamo
2564ec1496
Update milli/src/index.rs
...
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 15:41:44 +02:00
Tamo
b6b73fe41c
Update milli/src/update/settings.rs
...
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 15:41:14 +02:00
Tamo
6dde41cc46
stop using a local version of arroy and instead point to the git repo with the rev
2024-09-19 15:25:38 +02:00
Tamo
163f8023a1
remove debug println
2024-09-19 12:13:25 +02:00
Tamo
633537ccd7
fix updating documents without updating the settings
2024-09-19 12:00:58 +02:00
Tamo
3f6301dbc9
fix the missing embedder name in the error message when trying to disable the binary quantization
2024-09-19 12:00:58 +02:00
Tamo
2b6952eda1
rename the ArroyReader to an ArroyWrapper since it can read and write
2024-09-19 12:00:58 +02:00
Tamo
79f29eed3c
fix the tests and the arroy_readers method
2024-09-19 12:00:58 +02:00
Tamo
cc45e264ca
implement the binary quantization in meilisearch
2024-09-19 12:00:56 +02:00
meili-bors[bot]
5f474a640d
Merge #4938
...
4938: Remove default embedder r=ManyTheFish a=dureuill
# Pull Request
## Related issue
Fixes #4738
## What does this PR do?
[See public usage](https://meilisearch.notion.site/v1-11-AI-search-changes-0e37727193884a70999f254fa953ce6e#1044b06b651f80edb9d4ef6dc367bad0)
- Remove `hybrid.embedder` boolean from analytics because embedder is now mandatory and so the boolean would always be `true`
- Rework search kind so that a search without query but with vector is a vector search regardless of (non-zero) semantic ratio
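
The reworked rule can be sketched as a small decision function. The enum, its variant names, and the `search_kind` signature are hypothetical and do not mirror the actual milli types; the sketch also assumes an embedder is always specified, since it is now mandatory.

```rust
#[derive(Debug, PartialEq)]
enum SearchKind {
    Keyword,
    Semantic,
    Hybrid { semantic_ratio: f32 },
}

/// Decide which kind of search to run (illustrative only).
fn search_kind(query: Option<&str>, has_vector: bool, semantic_ratio: f32) -> SearchKind {
    // A ratio of zero always means a pure keyword search.
    if semantic_ratio == 0.0 {
        return SearchKind::Keyword;
    }
    // No query but a vector: vector search, whatever the non-zero ratio.
    if query.is_none() && has_vector {
        return SearchKind::Semantic;
    }
    if semantic_ratio == 1.0 {
        SearchKind::Semantic
    } else {
        SearchKind::Hybrid { semantic_ratio }
    }
}
```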
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 09:17:14 +00:00
ManyTheFish
bbaee3dbc6
Add Swedish pipeline in all-tokenization feature
2024-09-19 08:34:51 +02:00
meili-bors[bot]
ff523a2357
Merge #4939
...
4939: Introduce the `STARTS WITH` filter operator r=irevoire a=Kerollmops
This PR fixes #4872 by introducing the `STARTS WITH` filter operator and gating it under the _contains filter_ experimental feature along with the `CONTAINS` one. I also updated [the experimental feature discussion page](https://github.com/orgs/meilisearch/discussions/763).
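
A rough sketch of the gating described above, assuming a filter such as `hotel STARTS WITH four`. The `Condition` enum and `evaluate` function are hypothetical, and the sketch ignores the normalization a real filter engine would apply to values.

```rust
/// The two operators gated behind the same experimental feature.
enum Condition<'a> {
    Contains(&'a str),
    StartsWith(&'a str),
}

/// Evaluate a condition against a field value, refusing to run when
/// the experimental feature is disabled.
fn evaluate(
    condition: &Condition<'_>,
    value: &str,
    contains_filter_enabled: bool,
) -> Result<bool, String> {
    if !contains_filter_enabled {
        return Err("the contains filter experimental feature is not enabled".to_string());
    }
    Ok(match condition {
        Condition::Contains(needle) => value.contains(needle),
        Condition::StartsWith(prefix) => value.starts_with(prefix),
    })
}
```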
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-09-18 10:19:48 +00:00
Clément Renault
9f1fb4b425
Introduce the STARTS WITH filter operator gated under an experimental feature
2024-09-17 16:44:11 +02:00
Louis Dureuil
3c5e363554
Remove default embedders
2024-09-17 16:30:43 +02:00
Clément Renault
f4ab1f168e
Prefer using Rc<str> over String when cloning a lot
2024-09-16 15:41:29 +02:00
ManyTheFish
1a0e962299
Replace hashmap by vectors in wpp
2024-09-16 15:01:20 +02:00
ManyTheFish
f13e076b8a
Use hashmap instead of Btree in wpp extractor
2024-09-16 14:40:40 +02:00
ManyTheFish
7ba49b849e
Extract and write facet databases
2024-09-16 09:35:16 +02:00
Clément Renault
f7652186e1
WIP geo fields
2024-09-12 18:01:02 +02:00
Louis Dureuil
23e14138bb
facet distribution: implement Display for OrderBy
2024-09-12 17:43:50 +02:00
Louis Dureuil
e44325683a
Facet distribution: fix issue where truncated facet distribution would have a wrong order
2024-09-12 17:43:49 +02:00
Clément Renault
b2f4e67c9a
Do not store useless updates
2024-09-12 15:38:31 +02:00
Clément Renault
ff5d3b59f5
Move the document id extraction to the primary key code
2024-09-12 12:01:42 +02:00
ManyTheFish
aa69308e45
Use a BufWriter to build word FSTs
2024-09-12 11:48:00 +02:00
ManyTheFish
eb9a20ff0b
Fix fid_word_docids extraction
2024-09-12 11:08:18 +02:00
Clément Renault
3e9198ebaa
Support guessing primary key again
2024-09-11 17:25:40 +02:00
Clément Renault
2a0ad0982f
Fix the document counter
2024-09-11 15:59:36 +02:00
ManyTheFish
2b317c681b
Build mergers in parallel
2024-09-11 11:49:26 +02:00
ManyTheFish
39b5990f64
Mutualize tokenization
2024-09-11 10:22:38 +02:00
Clément Renault
8287c2644f
Support CSV again
2024-09-10 21:10:28 +01:00
Clément Renault
c1c44a0b81
Impl serialize on TopLevelMap
2024-09-10 19:32:03 +01:00
Clément Renault
04596f3616
Move the TopLevelMap into a dedicated module
2024-09-10 18:01:17 +01:00
Clément Renault
24cb5839ad
Move the document changes sorting logic to a new trait
2024-09-10 17:37:52 +01:00
ManyTheFish
f69688e8f7
Fix several warnings in extractors and remove unreachable macros
2024-09-09 14:52:50 +02:00
Louis Dureuil
f18e9cb7b3
Change openai default model
2024-09-09 13:09:35 +02:00
Clément Renault
8fd0afaaaa
Make sure we iterate over the payload documents in order
2024-09-06 08:09:08 +02:00
Clément Renault
72c6a21a30
Use raw JSON to read the payloads
2024-09-05 20:08:23 +02:00
Clément Renault
8412be4a7d
Cleanup CowStr and TopLevelMap struct
2024-09-05 18:32:55 +02:00
Louis Dureuil
10f09c531f
add some commented code to read from json with raw values
2024-09-05 18:22:16 +02:00
ManyTheFish
8fd99b111b
Add tracing timers logs
2024-09-05 18:00:22 +02:00
Clément Renault
f6b3d1f9a5
Increase some channel sizes
2024-09-05 15:12:07 +02:00
Clément Renault
73ce67862d
Use the word pair proximity and fid word count docids extractors
...
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-05 10:56:22 +02:00
Clément Renault
0fc02f7351
Move the facet extraction to dedicated modules
2024-09-05 10:32:27 +02:00
ManyTheFish
34f11e3380
Implement word count and word pair proximity extractors
2024-09-05 10:30:39 +02:00
Clément Renault
27308eaab1
Import the facet extractors
2024-09-04 17:58:15 +02:00
Clément Renault
b33ec9ba3f
Introduce the FieldIdFacetIsNullDocidsExtractor
2024-09-04 17:50:08 +02:00
Clément Renault
9c0a1cd9fd
Introduce the FieldIdFacetExistsDocidsExtractor
2024-09-04 17:48:49 +02:00
Clément Renault
0b061f1e70
Introduce the FieldIdFacetIsEmptyDocidsExtractor
2024-09-04 17:40:24 +02:00
Clément Renault
19d937ab21
Introduce the facet extractors
2024-09-04 17:03:54 +02:00
Clément Renault
1d59c19cd2
Send the WordsFst by using an Mmap
2024-09-04 14:30:09 +02:00
Clément Renault
98e48371c3
Factorize some stuff
2024-09-04 12:17:13 +02:00
Clément Renault
6d74fb0229
Introduce the WordFidWordDocids database
2024-09-04 11:40:55 +02:00
ManyTheFish
1eb75a1040
remove milli/src/update/new/extract/tokenize_document.rs
2024-09-04 11:40:26 +02:00
Clément Renault
3b82d8b5b9
Fix the cache to serialize entries correctly
2024-09-04 10:55:36 +02:00
ManyTheFish
781a186f75
remove milli/src/update/new/extract/extract_word_docids.rs
2024-09-04 10:28:31 +02:00
ManyTheFish
6a399556b5
Implement more searchable extractors
2024-09-04 10:20:18 +02:00
Clément Renault
27b4cab857
Extract and write the documents and words fst in the database
2024-09-04 09:59:19 +02:00
Clément Renault
52d32b4ee9
Move the channel sender in the closure to stop the merger thread
2024-09-03 16:08:33 +02:00
ManyTheFish
da61408e52
Remove unimplemented from document changes
2024-09-03 15:14:16 +02:00
ManyTheFish
fe69385bd7
Fix tokenizer test
2024-09-03 14:24:37 +02:00
Louis Dureuil
ed19b7c3c3
Only reindex if the size increased
2024-09-03 12:07:59 +02:00
Louis Dureuil
1ac008926b
Add maxBytes parameter
2024-09-03 12:07:15 +02:00
Louis Dureuil
c49d892c82
Changes to prompt
2024-09-03 12:07:10 +02:00
Louis Dureuil
de962a26f3
New error type when maxBytes is null
2024-09-03 12:01:04 +02:00
Clément Renault
c1557734dc
Use the GlobalFieldsIdsMap everywhere and write it to disk
...
Co-authored-by: Dureuill <louis@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-03 12:01:01 +02:00
ManyTheFish
c50d3edc4a
Integrate the first searchable extractor
2024-09-03 11:02:39 +02:00
Clément Renault
5369bf4a62
Change some lifetimes
2024-09-02 19:51:22 +02:00
Clément Renault
bcb1aa3d22
Find a temporary solution to run a parallel iterator over a HashMap
...
Spoiler: Do not use a HashMap but drain it into a Vec
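
A dependency-free sketch of that workaround: drain the map into a `Vec`, then split the contiguous entries across workers. The commit itself targets a parallel iterator; this example swaps in `std::thread::scope` to stay self-contained, and the function name and value type are hypothetical.

```rust
use std::collections::HashMap;
use std::thread;

/// Sum the lengths of all values, processing chunks of entries on
/// separate threads.
fn parallel_sum_lengths(mut map: HashMap<String, Vec<u32>>, workers: usize) -> usize {
    // Draining into a Vec gives a contiguous slice that is easy to split.
    let entries: Vec<(String, Vec<u32>)> = map.drain().collect();
    let workers = workers.max(1);
    // Round up so every entry lands in a chunk; never use a chunk size of 0.
    let chunk_size = ((entries.len() + workers - 1) / workers).max(1);

    thread::scope(|scope| {
        let handles: Vec<_> = entries
            .chunks(chunk_size)
            .map(|chunk| {
                scope.spawn(move || chunk.iter().map(|(_, v)| v.len()).sum::<usize>())
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().unwrap())
            .sum::<usize>()
    })
}
```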
2024-09-02 19:39:48 +02:00
Clément Renault
9b7858fb90
Expose the new indexer
2024-09-02 15:21:59 +02:00
Clément Renault
ab01679a8f
Remove the useless option from the document changes
2024-09-02 15:21:00 +02:00
Clément Renault
521775f788
I push for Many
2024-09-02 15:10:21 +02:00
Clément Renault
72e7b7846e
Renaming the indexers
2024-09-02 14:42:27 +02:00
Clément Renault
6526ce1208
Fix the merging of documents
2024-09-02 14:41:20 +02:00
Louis Dureuil
21296190a3
Reindex embedders
2024-09-02 13:00:53 +02:00
Louis Dureuil
4464d319af
Change default template to use the new facility
2024-09-02 11:30:59 +02:00
Louis Dureuil
580ea2f450
Pass the fields <-> ids map with metadata to render
2024-09-02 11:30:10 +02:00
Louis Dureuil
915cf4bae5
Add field.is_searchable property to fields
2024-09-02 11:28:53 +02:00
Clément Renault
e639ec79d1
Move the indexers into their own modules
2024-09-02 10:42:19 +02:00
Clément Renault
bb885a5810
Fix the merge for roaring bitmap
2024-09-01 23:20:19 +02:00