ManyTheFish
bb7a503e5d
Compute prefix databases
...
We are now computing the prefix FST and a prefix delta in the Merger thread,
after all the databases are written, the main thread will recompute the prefix databases based on the prefix delta without needing any grenad temporary file anymore
2024-10-01 09:57:06 +02:00
Louis Dureuil
64589278ac
Appease *some* of clippy warnings
2024-09-30 16:08:29 +02:00
ManyTheFish
8df6daf308
Remove fid_wordcount_docids.rs
2024-09-30 11:52:31 +02:00
ManyTheFish
5b552caf42
Fix position in insertions
2024-09-30 11:46:32 +02:00
ManyTheFish
2b51a63418
Remove dead code
2024-09-30 11:42:36 +02:00
Louis Dureuil
3d8024fb2b
write the weighted fields ids map
2024-09-30 11:35:03 +02:00
Louis Dureuil
4b0da0ff24
Fix inversion of field_id and position
2024-09-30 11:34:50 +02:00
Louis Dureuil
079f2b5de0
Format error messages consistently
2024-09-30 11:34:31 +02:00
ManyTheFish
960060ebdf
Fix fst builder when their is no previous FST
2024-09-25 16:53:00 +02:00
Clément Renault
3d244451df
Reduce the lru key size from 8 to 12 bytes
2024-09-25 16:14:13 +02:00
Clément Renault
5f53935c8a
Fix a bug in the Lru
2024-09-25 16:09:34 +02:00
Clément Renault
29a7623c3f
Fxi some logs
2024-09-25 15:57:50 +02:00
Clément Renault
e97041f7d0
Replace the Lru free list by a simple increment
2024-09-25 15:55:52 +02:00
Clément Renault
52d7f3ed1c
Reduce the lru key size from 20 to 8 bytes
2024-09-25 15:37:13 +02:00
Clément Renault
86d5e6d9ff
Use the new Lru
2024-09-25 14:54:56 +02:00
Clément Renault
759b9b1546
Introduce a new custom Lru
2024-09-25 14:49:12 +02:00
ManyTheFish
3f7a500f3b
Build prefix fst
2024-09-25 14:36:06 +02:00
ManyTheFish
974272f2e9
Merge branch 'main' into indexer-edition-2024
2024-09-25 07:41:16 +02:00
Clément Renault
7ad037841f
Move the tracing info to eprintln
2024-09-24 18:21:58 +02:00
Clément Renault
e0c7067355
Expose an IndexedParallelIterator to the index function
2024-09-24 17:24:59 +02:00
ManyTheFish
6e87332410
Change the way the FST is built
2024-09-24 16:28:31 +02:00
Clément Renault
2d1caf27df
Use eprintln to log
2024-09-24 15:59:50 +02:00
Clément Renault
7f148c127c
Measure the SmallVec efficacity
2024-09-24 15:32:15 +02:00
Clément Renault
4ce5d3d66d
Do not check before pushing in bitmaps
2024-09-24 09:43:16 +02:00
Clément Renault
42b093687d
Introduce the new PushOptimizedBitmap
2024-09-23 16:38:21 +02:00
Clément Renault
f00664247d
Add more stats about the channel message sent
2024-09-23 15:13:52 +02:00
Clément Renault
193d7f5d34
Add the mutualized charabia normalization
2024-09-23 14:24:25 +02:00
Clément Renault
013acb3d93
Measure merger writer channel contention
2024-09-23 11:07:59 +02:00
meili-bors[bot]
462a2329f1
Merge #4941
...
4941: Implement the binary quantization in meilisearch r=irevoire a=irevoire
# Pull Request
## Related issue
Fixes https://github.com/meilisearch/meilisearch/issues/4873
## What does this PR do?
- Add a settings for the binary quantization
- Once enabled, the bq cannot be disabled
TODO:
- [ ] Missing a bunch of tests
Co-authored-by: Tamo <tamo@meilisearch.com>
2024-09-19 15:50:24 +00:00
Tamo
f6483cf15d
apply review comment
2024-09-19 16:47:06 +02:00
meili-bors[bot]
bd34ed01d9
Merge #4945
...
4945: Add swedish in default pipelines r=dureuill a=ManyTheFish
# Summary
## Fix Swedish support
In Swedish the characters `å`/`ä`/`ö` are completely different than `a` or `o` and should not be normalized as the same character.
because the Swedish specialized pipeline was not activated by default, these characters were normalized even with the settings:
```json
{
"localizedAttributes": [ { "locales": ["swe"], "attributePatterns": ["*"] } ]
}
```
## Update Charabia adding German support
German segmentation will now be activated using the setting:
```json
{
"localizedAttributes": [ { "locales": ["deu"], "attributePatterns": ["*"] } ]
}
```
# TODO
- [x] Activate Swedish Pipeline
- [x] Add a test to avoid future regressions
- [x] Update Charabia
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-19 14:42:03 +00:00
Tamo
74199f328d
Make clippy happy
2024-09-19 16:27:34 +02:00
Tamo
1113c42de0
fix broken comments
2024-09-19 16:18:36 +02:00
ManyTheFish
7d6768e4c4
Add german tokenization pipeline
2024-09-19 16:09:01 +02:00
ManyTheFish
f77661ec44
Update Charabia v0.9.1
2024-09-19 16:08:59 +02:00
Tamo
b8fd85a46d
Get rids of useless collect before an iteration on the readers
2024-09-19 15:57:38 +02:00
Tamo
fd43c6c404
Improve the error message explaining you can't un-bq an embedder
2024-09-19 15:51:29 +02:00
Tamo
2564ec1496
Update milli/src/index.rs
...
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 15:41:44 +02:00
Tamo
b6b73fe41c
Update milli/src/update/settings.rs
...
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 15:41:14 +02:00
Tamo
6dde41cc46
stop using a local version of arroy and instead point to the git repo with the rev
2024-09-19 15:25:38 +02:00
Tamo
163f8023a1
remove debug println
2024-09-19 12:13:25 +02:00
Tamo
633537ccd7
fix updating documents without updating the settings
2024-09-19 12:00:58 +02:00
Tamo
3f6301dbc9
fix the missing embedder name in the error message when trying to disable the binary quantization
2024-09-19 12:00:58 +02:00
Tamo
2b6952eda1
rename the ArroyReader to an ArroyWrapper since it can read and write
2024-09-19 12:00:58 +02:00
Tamo
79f29eed3c
fix the tests and the arroy_readers method
2024-09-19 12:00:58 +02:00
Tamo
cc45e264ca
implement the binary quantization in meilisearch
2024-09-19 12:00:56 +02:00
meili-bors[bot]
5f474a640d
Merge #4938
...
4938: Remove default embedder r=ManyTheFish a=dureuill
# Pull Request
## Related issue
Fixes #4738
## What does this PR do?
[See public usage](https://meilisearch.notion.site/v1-11-AI-search-changes-0e37727193884a70999f254fa953ce6e#1044b06b651f80edb9d4ef6dc367bad0 )
- Remove `hybrid.embedder` boolean from analytics because embedder is now mandatory and so the boolean would always be `true`
- Rework search kind so that a search without query but with vector is a vector search regardless of (non-zero) semantic ratio
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
2024-09-19 09:17:14 +00:00
ManyTheFish
bbaee3dbc6
Add Swedish pipeline in all-tokenization feature
2024-09-19 08:34:51 +02:00
meili-bors[bot]
ff523a2357
Merge #4939
...
4939: Introduce the `STARTS WITH` filter operator r=irevoire a=Kerollmops
This PR fixes #4872 by introducing the `STARTS WITH` filter operator and gating it under the _contains filter_ experimental feature along with the `CONTAINS` one. I also updated [the experimental feature discussion page](https://github.com/orgs/meilisearch/discussions/763 ).
Co-authored-by: Clément Renault <clement@meilisearch.com>
2024-09-18 10:19:48 +00:00
Clément Renault
9f1fb4b425
Introduce the STARTS WITH filter operator gated under an experimental feature
2024-09-17 16:44:11 +02:00
Louis Dureuil
3c5e363554
Remove default embedders
2024-09-17 16:30:43 +02:00
Clément Renault
f4ab1f168e
Prefer using Rc<str> than String when cloning a lot
2024-09-16 15:41:29 +02:00
ManyTheFish
1a0e962299
Replace hashmap by vectors in wpp
2024-09-16 15:01:20 +02:00
ManyTheFish
f13e076b8a
Use hashmap instead of Btree in wpp extractor
2024-09-16 14:40:40 +02:00
ManyTheFish
7ba49b849e
Extract and write facet databases
2024-09-16 09:35:16 +02:00
Clément Renault
f7652186e1
WIP geo fields
2024-09-12 18:01:02 +02:00
Louis Dureuil
23e14138bb
facet distribution: implement Display for OrderBy
2024-09-12 17:43:50 +02:00
Louis Dureuil
e44325683a
Facet distribution: fix issue where truncated facet distribution would have a wrong order
2024-09-12 17:43:49 +02:00
Clément Renault
b2f4e67c9a
Do not store useless updates
2024-09-12 15:38:31 +02:00
Clément Renault
ff5d3b59f5
Move the document id extraction to the primary key code
2024-09-12 12:01:42 +02:00
ManyTheFish
aa69308e45
Use a bufWriter to build word FSTs
2024-09-12 11:48:00 +02:00
ManyTheFish
eb9a20ff0b
Fix fid_word_docids extraction
2024-09-12 11:08:18 +02:00
Clément Renault
3e9198ebaa
Support guessing primary key again
2024-09-11 17:25:40 +02:00
Clément Renault
2a0ad0982f
Fix the document counter
2024-09-11 15:59:36 +02:00
ManyTheFish
2b317c681b
Build mergers in parallel
2024-09-11 11:49:26 +02:00
ManyTheFish
39b5990f64
Mutualize tokenization
2024-09-11 10:22:38 +02:00
Clément Renault
8287c2644f
Support CSV again
2024-09-10 21:10:28 +01:00
Clément Renault
c1c44a0b81
Impl serialize on TopLevelMap
2024-09-10 19:32:03 +01:00
Clément Renault
04596f3616
Move the TopLevelMap into a dedicated module
2024-09-10 18:01:17 +01:00
Clément Renault
24cb5839ad
Move the document changes sorting logic to a new trait
2024-09-10 17:37:52 +01:00
ManyTheFish
f69688e8f7
Fix several warnings in extractors and remove unreachable macros
2024-09-09 14:52:50 +02:00
Louis Dureuil
f18e9cb7b3
Change openai default model
2024-09-09 13:09:35 +02:00
Clément Renault
8fd0afaaaa
Make sure we iterate over the payload documents in order
2024-09-06 08:09:08 +02:00
Clément Renault
72c6a21a30
Use raw JSON to read the payloads
2024-09-05 20:08:23 +02:00
Clément Renault
8412be4a7d
Cleanup CowStr and TopLevelMap struct
2024-09-05 18:32:55 +02:00
Louis Dureuil
10f09c531f
add some commented code to read from json with raw values
2024-09-05 18:22:16 +02:00
ManyTheFish
8fd99b111b
Add tracing timers logs
2024-09-05 18:00:22 +02:00
Clément Renault
f6b3d1f9a5
Increase some channel sizes
2024-09-05 15:12:07 +02:00
Clément Renault
73ce67862d
Use the word pair proximity and fid word count docids extractors
...
Co-authored-by: ManyTheFish <many@meilisearch.com>
2024-09-05 10:56:22 +02:00
Clément Renault
0fc02f7351
Move the facet extraction to dedicated modules
2024-09-05 10:32:27 +02:00
ManyTheFish
34f11e3380
Implement word count and word pair proximity extractors
2024-09-05 10:30:39 +02:00
Clément Renault
27308eaab1
Import the facet extractors
2024-09-04 17:58:15 +02:00
Clément Renault
b33ec9ba3f
Introduce the FieldIdFacetIsNullDocidsExtractor
2024-09-04 17:50:08 +02:00
Clément Renault
9c0a1cd9fd
Introduce the FieldIdFacetExistsDocidsExtractor
2024-09-04 17:48:49 +02:00
Clément Renault
0b061f1e70
Introduce the FieldIdFacetIsEmptyDocidsExtractor
2024-09-04 17:40:24 +02:00
Clément Renault
19d937ab21
Introduce the facet extractors
2024-09-04 17:03:54 +02:00
Clément Renault
1d59c19cd2
Send the WordsFst by using an Mmap
2024-09-04 14:30:09 +02:00
Clément Renault
98e48371c3
Factorize some stuff
2024-09-04 12:17:13 +02:00
Clément Renault
6d74fb0229
Introduce the WordFidWordDocids database
2024-09-04 11:40:55 +02:00
ManyTheFish
1eb75a1040
remove milli/src/update/new/extract/tokenize_document.rs
2024-09-04 11:40:26 +02:00
Clément Renault
3b82d8b5b9
Fix the cache to serialize entries correctly
2024-09-04 10:55:36 +02:00
ManyTheFish
781a186f75
remove milli/src/update/new/extract/extract_word_docids.rs
2024-09-04 10:28:31 +02:00
ManyTheFish
6a399556b5
Implement more searchable extractor
2024-09-04 10:20:18 +02:00
Clément Renault
27b4cab857
Extract and write the documents and words fst in the database
2024-09-04 09:59:19 +02:00
Clément Renault
52d32b4ee9
Move the channel sender in the closure to stop the merger thread
2024-09-03 16:08:33 +02:00
ManyTheFish
da61408e52
Remove unimplemented from document changes
2024-09-03 15:14:16 +02:00
ManyTheFish
fe69385bd7
Fix tokenizer test
2024-09-03 14:24:37 +02:00
Louis Dureuil
ed19b7c3c3
Only reindex if the size increased
2024-09-03 12:07:59 +02:00
Louis Dureuil
1ac008926b
Add maxBytes parameter
2024-09-03 12:07:15 +02:00
Louis Dureuil
c49d892c82
Changes to prompt
2024-09-03 12:07:10 +02:00