Louis Dureuil
ab09dc0167
exact_attributes: Add TODOs and additional check after review
2023-04-06 17:50:39 +02:00
Louis Dureuil
618c54915d
exact_attribute: dedup nodes after sorting them
2023-04-06 17:50:39 +02:00
Loïc Lecrenier
130d2061bd
Fix indexing of word_position_docid and fid
2023-04-06 17:50:39 +02:00
Louis Dureuil
66ddee4390
Fix word_position_docids indexing
2023-04-06 17:50:39 +02:00
Louis Dureuil
90a6c01495
Use correct codec in proximity
2023-04-06 17:50:39 +02:00
Louis Dureuil
e58426109a
Fix panics and issues in exactness graph ranking rule
2023-04-06 17:50:39 +02:00
Louis Dureuil
f513cf930a
Exact attribute with state
2023-04-06 17:50:39 +02:00
Louis Dureuil
8a13ed7e3f
Add exactness ranking rules
2023-04-06 17:50:39 +02:00
Louis Dureuil
1b8e4d0301
Add ExactTerm and helper method
2023-04-06 17:50:39 +02:00
Louis Dureuil
996619b22a
Increase position by 8 on hard separator when building query terms
2023-04-06 17:50:39 +02:00
Louis Dureuil
2c9822a337
Rename is_multiple_words
to is_ngram
and zero_typo
to exact
2023-04-06 17:50:39 +02:00
Louis Dureuil
7276deee0a
Add new db caches
2023-04-06 17:50:39 +02:00
ManyTheFish
f7e7f438f8
Patch prefix match
2023-04-06 17:22:31 +02:00
ManyTheFish
ba8dcc2d78
Fix clippy
2023-04-06 15:50:47 +02:00
Loïc Lecrenier
7ca91ebb71
Merge branch 'search-refactor-exactness' into search-refactor-tests-doc
2023-04-06 15:16:35 +02:00
ManyTheFish
47f6a3ad3d
Take into account that a logger need the search context
2023-04-06 15:02:23 +02:00
bors[bot]
b4c01581cd
Merge #3641
...
3641: Bring back changes from `release v1.1.0` into `main` after v1.1.0 release r=curquiza a=curquiza
Replace https://github.com/meilisearch/meilisearch/pull/3637 since we don't want to pull commits from `main` into `release-v1.1.0` when fixing git conflicts
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: bors[bot] <26634292+bors[bot]@users.noreply.github.com>
Co-authored-by: Charlotte Vermandel <charlottevermandel@gmail.com>
Co-authored-by: Tamo <tamo@meilisearch.com>
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: Clément Renault <clement@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
2023-04-06 12:37:54 +00:00
ManyTheFish
ae17c62e24
Remove warnings
2023-04-06 14:07:18 +02:00
ManyTheFish
a1148c09c2
remove old matcher
2023-04-06 14:00:21 +02:00
ManyTheFish
9c5f64769a
Integrate the new Highlighter in the search
2023-04-06 13:58:56 +02:00
ManyTheFish
ebe23b04c9
Make the matcher consume the search context
2023-04-06 12:28:28 +02:00
ManyTheFish
13b7c826c1
add new highlighter
2023-04-06 12:15:37 +02:00
Loïc Lecrenier
5440f43fd3
Fix indexing of word_position_docid and fid
2023-04-05 18:14:00 +02:00
Louis Dureuil
d9460a76f4
Fix word_position_docids indexing
2023-04-05 18:14:00 +02:00
Louis Dureuil
d1ddaa223d
Use correct codec in proximity
2023-04-05 18:14:00 +02:00
Louis Dureuil
f7ecea142e
Fix panics and issues in exactness graph ranking rule
2023-04-05 18:13:46 +02:00
Louis Dureuil
337e75b0e4
Exact attribute with state
2023-04-05 18:12:46 +02:00
Loïc Lecrenier
b5691802a3
Add new tests and fix construction of query graph from paths
2023-04-05 16:31:10 +02:00
Loïc Lecrenier
6e50f23896
Add more search tests
2023-04-05 13:33:23 +02:00
Tamo
597d57bf1d
Merge branch 'main' into bring-back-changes-v1.1.0
2023-04-05 11:32:14 +02:00
Loïc Lecrenier
4c8a0179ba
Add more search tests
2023-04-05 11:30:49 +02:00
Loïc Lecrenier
c69cbec64a
Add more search tests
2023-04-05 11:20:04 +02:00
Loïc Lecrenier
ce328c329d
Move bucket sort function to its own module and fix a bug
2023-04-04 18:03:08 +02:00
Loïc Lecrenier
959e4607bb
Add more search tests
2023-04-04 18:02:46 +02:00
Louis Dureuil
4b4ffb8ec9
Add exactness ranking rules
2023-04-04 17:12:07 +02:00
Louis Dureuil
3951fe22ab
Add ExactTerm and helper method
2023-04-04 17:09:32 +02:00
Louis Dureuil
4d5bc9df4c
Increase position by 8 on hard separator when building query terms
2023-04-04 17:07:26 +02:00
Louis Dureuil
ec2f8e8040
Rename is_multiple_words
to is_ngram
and zero_typo
to exact
2023-04-04 17:06:07 +02:00
Louis Dureuil
406b8bd248
Add new db caches
2023-04-04 17:04:46 +02:00
Loïc Lecrenier
62b9c6fbee
Add search tests
2023-04-04 16:18:22 +02:00
Loïc Lecrenier
b439d36807
Split query_term module into multiple submodules
2023-04-04 15:38:30 +02:00
Loïc Lecrenier
faceb661e3
Add note that a part of the code needs fixing
2023-04-04 15:02:01 +02:00
Loïc Lecrenier
4129d657e2
Simplify query_term module a bit
2023-04-04 15:01:42 +02:00
Filip Bachul
1e6fe71a67
fix clippy warning
2023-04-03 20:18:26 +02:00
Filip Bachul
fddfb37f1f
remove unnecessary FilterError:ReservedGeo and FilterError:ReservedGeo
2023-04-03 20:18:26 +02:00
Loïc Lecrenier
3f13608002
Fix computation of ngram derivations
2023-04-03 15:27:49 +02:00
Loïc Lecrenier
4708d9b016
Fix compiler warnings/errors
2023-04-03 10:09:27 +02:00
Clément Renault
0d2e7bcc13
Implement the previous way for the exhaustive distinct candidates
2023-04-03 10:08:10 +02:00
Loïc Lecrenier
55fbfb6124
Merge branch 'search-refactor-located-query-terms' into search-refactor
2023-04-03 10:04:36 +02:00
Loïc Lecrenier
58fe260c72
Allow removing all the terms from a query if it contains a phrase
2023-04-03 09:18:02 +02:00
Loïc Lecrenier
24e5f6f7a9
Don't remove phrases with "last" term matching strategy
2023-04-03 09:17:33 +02:00
Louis Dureuil
9b87c36200
Limit the number of derivations for a single word.
2023-03-31 09:19:18 +02:00
Filip Bachul
1861c69964
fmt
2023-03-30 23:37:26 +02:00
Filip Bachul
cb2b5eb38e
handle _geoDistance(x,x) sort error
2023-03-30 23:21:23 +02:00
Filip Bachul
53aa0a1b54
handle _geo(x,x) sort error
2023-03-30 23:17:34 +02:00
Loïc Lecrenier
12b26cd54e
Don't remove phrases from the query with term matching strategy Last
2023-03-30 14:54:08 +02:00
Loïc Lecrenier
061b1e6d7c
Tiny refactor of query graph remove_nodes method
2023-03-30 14:49:25 +02:00
Loïc Lecrenier
0d6e8b5c31
Fix phrase search bug when the phrase has only one word
2023-03-30 14:48:12 +02:00
Loïc Lecrenier
d48cdc67a0
Fix term matching strategy bugs
2023-03-30 14:01:52 +02:00
Loïc Lecrenier
35c16ad047
Use new term matching strategy logic in words ranking rule
2023-03-30 13:15:43 +02:00
Loïc Lecrenier
2997d1f186
Use new term matching strategy logic in resolve_maximally_reduced_...
2023-03-30 13:12:51 +02:00
Loïc Lecrenier
2a5997fb20
Avoid expensive assert! in bucket sort function
2023-03-30 13:07:17 +02:00
Loïc Lecrenier
ee8a9e0bad
Remove outdated sentence in documentation
2023-03-30 12:22:24 +02:00
Loïc Lecrenier
3b0737a092
Fix detailed logger
2023-03-30 12:20:44 +02:00
Loïc Lecrenier
fdd02105ac
Graph-based ranking rule + term matching strategy support
2023-03-30 12:19:21 +02:00
Loïc Lecrenier
aa9592455c
Refactor the paths_of_cost algorithm
...
Support conditions that require certain nodes to be skipped
2023-03-30 12:11:11 +02:00
Loïc Lecrenier
01e24dd630
Rewrite proximity ranking rule
2023-03-30 11:59:06 +02:00
Loïc Lecrenier
ae6bb1ce17
Update the ConditionDocidsCache after change to RankingRuleGraphTrait
2023-03-30 11:41:20 +02:00
Loïc Lecrenier
5fd28620cd
Build ranking rule graph correctly after changes to trait definition
2023-03-30 11:32:55 +02:00
Loïc Lecrenier
728710d63a
Update typo ranking rule to use new query term structure
2023-03-30 11:32:19 +02:00
Loïc Lecrenier
fa81381865
Update the trait requirements of ranking-rule graphs
2023-03-30 11:19:45 +02:00
Loïc Lecrenier
b96a682f16
Update resolve_graph module to work with lazy query terms
2023-03-30 11:10:38 +02:00
Loïc Lecrenier
d0f048c068
Simplify the API of the DatabaseCache
2023-03-30 11:08:17 +02:00
Loïc Lecrenier
223e82a10d
Update QueryGraph to use new lazy query terms + build from paths
2023-03-30 11:06:02 +02:00
Loïc Lecrenier
9507ff5e31
Update query term structure to allow for laziness
2023-03-30 11:06:02 +02:00
Louis Dureuil
c2b025946a
located_query_terms_from_string
: use u16 for positions, hard limit number of iterated tokens.
...
- Refactor phrase logic to reduce number of possible states
2023-03-30 11:04:14 +02:00
Loïc Lecrenier
3a818c5e87
Add more functionality to interners
2023-03-30 09:56:23 +02:00
Louis Dureuil
d74134ce3a
Check sort criteria
2023-03-29 15:21:54 +02:00
Louis Dureuil
5ac129bfa1
Mark geosearch as currently unimplemented for sort rule
2023-03-29 15:20:42 +02:00
ManyTheFish
efea1e5837
Fix facet normalization
2023-03-29 12:02:24 +02:00
Louis Dureuil
abb4522f76
Small comment on ignored rules for placeholder search
2023-03-29 09:11:06 +02:00
Louis Dureuil
ef084ef042
SmallBitmap: Consistently panic on incoherent universe lengths
2023-03-29 08:45:38 +02:00
Louis Dureuil
3524bd1257
SmallBitmap: Add documentation
2023-03-29 08:44:11 +02:00
Tamo
a50b058557
update the geoBoundingBox feature
...
Now instead of using the (top_left, bottom_right) corners of the bounding box it s using the (top_right, bottom_left) corners.
2023-03-28 18:26:18 +02:00
Louis Dureuil
d4f6216966
Resolve rule time sort criteria
2023-03-28 16:42:02 +02:00
Louis Dureuil
77acafe534
Resolve search time sort criteria for placeholder search
2023-03-28 16:41:03 +02:00
Louis Dureuil
53afda3237
Update search usage in example
2023-03-28 16:35:46 +02:00
Louis Dureuil
abb19d368d
Initialize query time ranking rule for query search
2023-03-28 12:40:52 +02:00
Louis Dureuil
b4a52a622e
BoxRankingRule
2023-03-28 12:39:42 +02:00
Louis Dureuil
8d7d8cdc2f
Clean-up index example
2023-03-27 18:34:10 +02:00
Louis Dureuil
626a93b348
Search example: panic when missing the index path
2023-03-27 18:18:01 +02:00
Louis Dureuil
af65fe201a
Clean-up search example
2023-03-27 17:49:43 +02:00
Louis Dureuil
9b83b1deb0
Expose SearchLogger trait
2023-03-27 17:49:18 +02:00
Louis Dureuil
e9eb271499
Remove empty attribute_rule mod
2023-03-27 11:08:03 +02:00
Louis Dureuil
3281a88d08
SmallBitmap: don't expose internal items
2023-03-27 11:04:43 +02:00
Louis Dureuil
5a644054ab
Removed unused search impl
2023-03-27 11:04:27 +02:00
Louis Dureuil
16fefd364e
Add TODO notes
2023-03-27 11:04:04 +02:00
Gregory Conrad
e7994cdeb3
feat: check to see if the PK changed before erroring out
...
Previously, if the primary key was set and a Settings update contained
a primary key, an error would be returned.
However, this error is not needed if the new PK == the current PK.
This commit just checks to see if the PK actually changes
before raising an error.
2023-03-26 12:18:39 -04:00
Loïc Lecrenier
00bad8c716
Add comments suggesting performance improvements
2023-03-23 10:18:24 +01:00
Loïc Lecrenier
862714a18b
Remove criterion_implementation_strategy param of Search
2023-03-23 09:44:12 +01:00
Loïc Lecrenier
d18ebe4f3a
Remove more warnings
2023-03-23 09:41:18 +01:00
Loïc Lecrenier
7169d85115
Remove old query_tree code and make clippy happy
2023-03-23 09:39:16 +01:00
Loïc Lecrenier
f5f5f03ec0
Remove old criteria code
2023-03-23 09:35:53 +01:00
Loïc Lecrenier
9b2653427d
Split position DB into fid and relative position DB
2023-03-23 09:22:01 +01:00
Loïc Lecrenier
56b7209f26
Make clippy happy
2023-03-23 09:16:17 +01:00
Loïc Lecrenier
9b1f439a91
WIP
2023-03-23 09:12:35 +01:00
Loïc Lecrenier
01c7d2de8f
Add example targets to the milli crate
2023-03-22 14:50:41 +01:00
Loïc Lecrenier
a86aeba411
WIP
2023-03-22 14:43:08 +01:00
Loïc Lecrenier
384fdc2df4
Fix two bugs in proximity ranking rule
2023-03-21 11:43:25 +01:00
Loïc Lecrenier
83e5b4ed0d
Compute edges of proximity graph lazily
2023-03-21 10:44:40 +01:00
Loïc Lecrenier
272cd7ebbd
Small cleanup
2023-03-20 13:39:19 +01:00
Loïc Lecrenier
c63c7377e6
Switch order of MappedInterner generic params
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
5b50e49522
cargo fmt
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
65474c8de5
Update new sort ranking rule after rebasing
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
fbb1ba3de0
Cargo fmt
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
a59ca28e2c
Add forgotten file
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
825f742000
Simplify graph-based ranking rule impl
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
dd491320e5
Simplify graph-based ranking rule impl
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
c6ff97a220
Rewrite the dead-ends cache to detect more dead-ends
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
49240c367a
Fix bug in cost of typo conditions
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
1e6e624078
Fix bug in SmallBitmap
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
8b4e07e1a3
WIP
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
2853009987
Renaming Edge -> Condition
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
aa59c3bc2c
Replace EdgeCondition with an Option<..> + other code cleanup
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
7b1d8f4c6d
Make PathSet strongly typed
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
a49ddec9df
Prune the query graph after executing a ranking rule
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
05fe856e6e
Merge forward and backward proximity conditions in proximity graph
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
c0cdaf9f53
Fix bug in the proximity ranking rule for queries with ngrams
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
e9cf58d584
Refactor of the Interner
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
31628c5cd4
Merge Phrase and WordDerivations into one structure
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
3004e281d7
Support ngram typos + splitwords and splitwords+synonyms in proximity
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
14e8d0aaa2
Rename lifetime
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
1c58cf8426
Intern ranking rule graph edge conditions as well
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
5155fd2bf1
Reorganise initialisation of ranking rules + rename PathsMap -> PathSet
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
9ec9c204d3
Small code cleanup
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
78b9304d52
Implement distinct attribute
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
0465ba4a05
Intern more values
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
2099991dd1
Continue documenting and cleaning up the code
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
c232cdabf5
Add documentation
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
4e266211bf
Small code reorganisation
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
57fa689131
Cargo fmt
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
10626dddfc
Add a few more optimisations to new search algorithms
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
9051065c22
Apply a few optimisations for graph-based ranking rules
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
e8c76cf7bf
Intern all strings and phrases in the search logic
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
3f1729a17f
Update new search test
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
cab2b6bcda
Fix: computation of initial universe, code organisation
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
c4979a2fda
Fix code visibility issue + unimplemented detail in proximity rule
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
23931f8a4f
Fix small bug in visual logger of search algo
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
aa414565bb
Fix proximity graph edge builder to include all proximities
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
1db152046e
WIP on split words and synonyms support
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
c27ea2677f
Rewrite cheapest path algorithm and empty path cache
...
It is now much simpler and has much better performance.
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
caa1e1b923
Add typo ranking rule to new search impl
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
71f18e4379
Add sort ranking rule to new search impl
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
600e3dd1c5
Remove warnings
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
362eb0de86
Add support for filters
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
998d46ac10
Add support for search offset and limit
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
6c85c0d95e
Fix more bugs + visual empty path cache logging
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
0e1fbbf7c6
Fix bugs in query graph's "remove word" and "cheapest paths" algos
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
6806640ef0
Fix d2 description of paths map
2023-03-20 09:41:56 +01:00
Loïc Lecrenier
173e37584c
Improve the visual/detailed search logger
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
6ba4d5e987
Add a search logger
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
dd12d44134
Support swapped word pairs in new proximity ranking rule impl
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
c8e251bf24
Remove noise in codebase
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
a938fbde4a
Use a cache when resolving the query graph
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
dcf3f1d18a
Remove EdgeIndex and NodeIndex types, prefer u32 instead
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
66d0c63694
Add some documentation and use bitmaps instead of hashmaps when possible
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
132191360b
Introduce the sort ranking rule working with the new search structures
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
345c99d5bd
Introduce the words ranking rule working with the new search structures
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
89d696c1e3
Introduce the proximity ranking rule as a graph-based ranking rule
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
c645853529
Introduce a generic graph-based ranking rule
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
a70ab8b072
Introduce a function to find the K shortest paths in a graph
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
48aae76b15
Introduce a function to find the docids of a set of paths in a graph
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
23bf572dea
Introduce cache structures used with ranking rule graphs
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
864f6410ed
Introduce a structure to represent a set of graph paths efficiently
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
c9bf6bb2fa
Introduce a structure to implement ranking rules with graph algorithms
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
46249ea901
Implement a function to find a QueryGraph's docids
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
ce0d1e0e13
Introduce a common way to manage the coordination between ranking rules
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
5065d8b0c1
Introduce a DatabaseCache to memorize the addresses of LMDB values
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
a83007c013
Introduce structure to represent search queries as graphs
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
79e0a6dd4e
Introduce a new search module, eventually meant to replace the old one
...
The code here does not compile, because I am merely splitting one giant
commit into smaller ones where each commit explains a single file.
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
2d88089129
Remove unused term matching strategies
2023-03-20 09:41:55 +01:00
Loïc Lecrenier
6c659dc12f
Use MiMalloc in milli tests
2023-03-20 09:41:37 +01:00
Clément Renault
cf34d1c95f
Fix a test that forget to match a Null value
2023-03-15 17:17:19 +01:00
Clément Renault
1a9c58a7ab
Fix a bug with the new flattening rules
2023-03-15 16:56:44 +01:00
Clément Renault
64571c8288
Improve the testing of the filters
2023-03-15 14:57:17 +01:00
Clément Renault
ea016d97af
Implementing an IS EMPTY filter
2023-03-15 14:12:34 +01:00
Clément Renault
fa2ea4a379
Update the test to accept the new IS syntax
2023-03-14 10:31:27 +01:00
Tamo
0f33a65468
makes kero happy
2023-03-13 16:51:11 +01:00
bors[bot]
fb1260ee88
Merge #3568 #3569
...
3568: CI: Fix `publish-aarch64` job that still uses ubuntu-18.04 r=Kerollmops a=curquiza
Fixes #3563
Main change
- add the usage of the `ubuntu-18.04` container instead of the native `ubuntu-18.04` of GitHub actions: I had to install docker in the container.
Small additional changes
- remove useless `fail-fast` and unused/irrelevant matrix inputs (`build`, `linker`, `os`, `use-cross`...)
- Remove useless step in job
Proof of work with this CI triggered on this current branch: https://github.com/meilisearch/meilisearch/actions/runs/4366233882
3569: Enhance Japanese language detection r=dureuill a=ManyTheFish
# Pull Request
This PR is a prototype and can be tested by downloading [the dedicated docker image](https://hub.docker.com/layers/getmeili/meilisearch/prototype-better-language-detection-0/images/sha256-a12847de00e21a71ab797879fd09777dadcb0881f65b5f810e7d1ed434d116ef?context=explore ):
```bash
$ docker pull getmeili/meilisearch:prototype-better-language-detection-0
```
## Context
Some Languages are harder to detect than others, this miss-detection leads to bad tokenization making some words or even documents completely unsearchable. Japanese is the main Language affected and can be detected as Chinese which has a completely different way of tokenization.
A [first iteration has been implemented for v1.1.0](https://github.com/meilisearch/meilisearch/pull/3347 ) but is an insufficient enhancement to make Japanese work. This first implementation was detecting the Language during the indexing to avoid bad detections during the search.
Unfortunately, some documents (shorter ones) can be wrongly detected as Chinese running bad tokenization for these documents and making possible the detection of Chinese during the search because it has been detected during the indexing.
For instance, a Japanese document `{"id": 1, "name": "東京スカパラダイスオーケストラ"}` is detected as Japanese during indexing, during the search the query `東京` will be detected as Japanese because only Japanese documents have been detected during indexing despite the fact that v1.0.2 would detect it as Chinese.
However if in the dataset there is at least one document containing a field with only Kanjis like:
_A document with only 1 field containing only Kanjis:_
```json
{
"id":4,
"name": "東京特許許可局"
}
```
_A document with 1 field containing only Kanjis and 1 field containing several Japanese characters:_
```json
{
"id":105,
"name": "東京特許許可局",
"desc": "日経平均株価は26日 に約8カ月ぶりに2万4000円の心理的な節目を上回った。株高を支える材料のひとつは、自民党総裁選で3選を決めた安倍晋三首相の経済政策への期待だ。恩恵が見込まれるとされる人材サービスや建設株の一角が買われている。ただ思惑が先行して資金が集まっている面 は否めない。実際に政策効果を取り込む企業はどこか、なお未知数だ。"
}
```
Then, in both cases, the field `name` will be detected as Chinese during indexing allowing the search to detect Chinese in queries. Therefore, the query `東京` will be detected as Chinese and only the two last documents will be retrieved by Meilisearch.
## Technical Approach
The current PR partially fixes these issues by:
1) Adding a check over potential miss-detections and rerunning the extraction of the document forcing the tokenization over the main Languages detected in it.
> 1) run a first extraction allowing the tokenizer to detect any Language in any Script
> 2) generate a distribution of tokens by Script and Languages (`script_language`)
> 3) if for a Script we have a token distribution of one of the Language that is under the threshold, then we rerun the extraction forbidding the tokenizer to detect the marginal Languages
> 4) the tokenizer will fall back on the other available Languages to tokenize the text. For example, if the Chinese were marginally detected compared to the Japanese on the CJ script, then the second extraction will force Japanese tokenization for CJ text in the document. however, the text on another script like Latin will not be impacted by this restriction.
2) Adding a filtering threshold during the search over Languages that have been marginally detected in documents
## Limits
This PR introduces 2 arbitrary thresholds:
1) during the indexing, a Language is considered miss-detected if the number of detected tokens of this Language is under 10% of the tokens detected in the same Script (Japanese and Chinese are 2 different Languages sharing the "same" script "CJK").
2) during the search, a Language is considered marginal if less than 5% of documents are detected as this Language.
This PR only partially fixes these issues:
- ✅ the query `東京` now find Japanese documents if less than 5% of documents are detected as Chinese.
- ✅ the document with the id `105` containing the Japanese field `desc` but the miss-detected field `name` is now completely detected and tokenized as Japanese and is found with the query `東京`.
- ❌ the document with the id `4` no longer breaks the search Language detection but continues to be detected as a Chinese document and can't be found during the search.
## Related issue
Fixes #3565
## Possible future enhancements
- Change or contribute to the Library used to detect the Language
- the related issue on Whatlang: https://github.com/greyblake/whatlang-rs/issues/122
Co-authored-by: curquiza <clementine@meilisearch.com>
Co-authored-by: ManyTheFish <many@meilisearch.com>
Co-authored-by: Many the fish <many@meilisearch.com>
2023-03-09 15:34:35 +00:00
ManyTheFish
2f8eb4f54a
last PR fixes
2023-03-09 15:34:36 +01:00
Clément Renault
175e8a8495
Fix a diacritic issue
2023-03-09 14:57:47 +01:00
Clément Renault
df48ac8803
Add one more test for the NULL operator
2023-03-09 13:53:37 +01:00
Clément Renault
ff86073288
Add a snapshot for the NULL facet database
2023-03-09 13:32:27 +01:00
Clément Renault
0ad53784e7
Create a new struct to reduce the type complexity
2023-03-09 13:21:21 +01:00
Clément Renault
e064c52544
Rename an internal facet deletion method
2023-03-09 13:08:02 +01:00
Clément Renault
e106b16148
Fix a typo in a variable
...
Co-authored-by: Louis Dureuil <louis@meilisearch.com>
aaa
2023-03-09 13:08:02 +01:00
Tamo
eddefb0e0f
refactor the error type of the milli::document thing
...
silence a warning
2023-03-09 13:03:14 +01:00
ManyTheFish
5deea631ea
fix clippy too many arguments
2023-03-09 11:19:13 +01:00
Tamo
c5f22be6e1
add boolean support for csv documents
2023-03-09 11:12:49 +01:00
ManyTheFish
b4b859ec8c
Fix typos
2023-03-09 10:58:35 +01:00