387 Commits

Author SHA1 Message Date
Kerollmops
4eda149ffa
Rename the BoRoaringBitmap codec 2020-10-02 16:46:06 +02:00
Clément Renault
ac84db2506
Move the words pairs proximities average into the stats infos subcommand 2020-10-02 16:46:06 +02:00
Kerollmops
30755e31e7
Introduce the words pairs proximities stats info subcommand 2020-10-02 16:46:06 +02:00
Clément Renault
bc35c9a598
Introduce the size_of_database infos subcommand 2020-10-02 16:46:05 +02:00
Kerollmops
c6b883289c
Remove the unused fetch_keywords function 2020-09-30 15:41:23 +02:00
Kerollmops
58237bd67f
Introduce the average-number-of-document-by-word-pair-proximity infos subcommand 2020-09-29 18:32:48 +02:00
Kerollmops
991be8950e
Rename the subcommand into average-number-of-positions-by-word-by-doc 2020-09-29 18:15:44 +02:00
Kerollmops
54370e228a
Search for documents with longer proximities until we find enough 2020-09-29 17:37:14 +02:00
Kerollmops
f277ea134f
Simplify some search function by reducing the number of parameters 2020-09-29 16:08:58 +02:00
Kerollmops
68f4af7d2e
Improve the display of the number of processed documents 2020-09-29 16:08:58 +02:00
Kerollmops
59a127d022
Improve the indexing process
We now store the words pairs proximity in a cache and only compute the
shortest proximity between pairs of words in a document.
2020-09-29 15:09:18 +02:00
Kerollmops
6ddb3e722c
Depth-first search cache the docids unions 2020-09-28 16:55:21 +02:00
Kerollmops
a3821a0b33
Introduce the depth_first_search path resolution function 2020-09-28 16:34:12 +02:00
Clément Renault
d8354f6f02
Fix the word_docids capacity limit detection 2020-09-27 11:52:05 +02:00
Clément Renault
25b2853b70
Move the words pairs proximities compute into the write document function 2020-09-23 15:02:40 +02:00
Clément Renault
ed05999f63
Replace the arc cache by a simple linked hash map 2020-09-23 14:50:52 +02:00
Clément Renault
4d22d80281
Display only the key on heed error 2020-09-23 14:13:51 +02:00
Clément Renault
5178b3d59d
Make the search system be aware of query words typos 2020-09-23 12:01:39 +02:00
Clément Renault
b597a92487
Add a default max-memory value to the indexer 2020-09-23 12:00:36 +02:00
Clément Renault
1f6e00878d
Use the words pair proximities in the search algorithm 2020-09-22 18:47:55 +02:00
Clément Renault
31224a8425
Index the word pair proximities for both orders of the pair 2020-09-22 14:49:22 +02:00
Clément Renault
a58ae5eb2a
Introduce the word-pair-proximities-docids infos subcommand 2020-09-22 14:04:34 +02:00
Clément Renault
d6fa9c0414
Index the intra documents word pair proximities 2020-09-22 14:04:33 +02:00
Clément Renault
7b67ae6972
Introduce the StrStrU8 heed codec 2020-09-22 12:44:17 +02:00
Clément Renault
e34437b2d7
Move the proximity function to a module 2020-09-22 10:54:59 +02:00
Clément Renault
15208c7d3d
Simplify the indexer record loop 2020-09-22 10:33:30 +02:00
Clément Renault
e5adfaade0
Replace the token filter by a filter mapper 2020-09-22 10:24:31 +02:00
Clément Renault
d21c80b865
Apply the chunk compression parameters on all the MTBL writers 2020-09-21 18:30:54 +02:00
Clément Renault
944df52e2a
Simplify the indexer main loop 2020-09-21 14:59:48 +02:00
Kerollmops
3ded98e5fa
Bump the roaring version that fixes a deserialization bug 2020-09-10 22:37:51 +02:00
Kerollmops
d5e5baa20f
Bump the oxidized-mtbl dependency 2020-09-10 13:29:12 +02:00
Kerollmops
aed0704404
Remove the temporary optimisation 2020-09-08 14:48:33 +02:00
Kerollmops
072382fa61
Sort the word docids to make intersections much faster 2020-09-07 22:38:49 +02:00
Kerollmops
ad11c5fb3f
Introduce the words-docids command for the infos binary 2020-09-07 22:36:35 +02:00
Kerollmops
5664c37539
Introduce a heed codec that reduces the size of small amounts of serialized integers 2020-09-07 20:06:23 +02:00
Kerollmops
3e2250423c
Introduce the average-number-of-positions infos subcommand 2020-09-07 15:26:42 +02:00
Kerollmops
ea605b499c
Introduce two new infos subcommands 2020-09-07 14:56:48 +02:00
Clément Renault
bb1ab428db
Use another function to define the proximity 2020-09-06 17:55:07 +02:00
Clément Renault
dec460ce52
Fix the infos binary and add commands 2020-09-06 17:14:20 +02:00
Clément Renault
daa3673c1c
Invert the word docid positions key order 2020-09-06 10:30:53 +02:00
Clément Renault
c2405bcae2
Prefer using the word_docids db to create the words-fst 2020-09-06 10:23:56 +02:00
Kerollmops
4ca9472e02
Fix the minimum proximity len 2020-09-06 10:19:34 +02:00
Clément Renault
1c504471d3
Introduce the plane-sweep algorithm 2020-09-05 18:25:27 +02:00
Clément Renault
dc88a86259
Store the word positions under the documents 2020-09-05 18:03:06 +02:00
Kerollmops
580ed1119a
Make the engine to return csv string records as documents and headers 2020-08-31 19:02:00 +02:00
Clément Renault
bad0663138
Come back to the old tokenizer 2020-08-31 13:34:38 +02:00
Clément Renault
4afc4d0751
Use the groups of four positions to speed up disjunctions tests 2020-08-30 16:25:11 +02:00
Clément Renault
605f75b56f
Add the words grouped by four positions in the infos binary 2020-08-29 18:23:33 +02:00
Clément Renault
ad5cafbfed
Introduce a database to store docids in groups of four positions 2020-08-29 17:42:55 +02:00
Clément Renault
3db517548d
Move the documents back into the LMDB database 2020-08-29 15:14:04 +02:00
Clément Renault
816db7a0aa
Improve the RoaringBitmap codec to reserve enough vector space 2020-08-29 11:21:30 +02:00
Clément Renault
3fe497e129
Improve the Mtbl heed codec to only encode MTBL databases 2020-08-29 11:20:39 +02:00
Clément Renault
21aafd603c
Make sure the first document is associated to the document id 0 2020-08-29 10:56:40 +02:00
Clément Renault
0a44ff86ab
Put the documents MTBL back into LMDB
We make sure to write the documents into a file before
memory mapping it and putting it into LMDB, this way we avoid
moving it to RAM
2020-08-28 15:43:24 +02:00
Clément Renault
d784d87880
Remove the prefix LMDB databases 2020-08-28 14:41:43 +02:00
Clément Renault
7cde312f14
Introduce the StrBEU32Codec heed codec 2020-08-28 14:16:37 +02:00
Clément Renault
34db376ae5
Rename the RoaringBitmapCodec module 2020-08-28 13:31:16 +02:00
Kerollmops
38ddc71b83
Simplify the search algorithm 2020-08-26 15:16:41 +02:00
Kerollmops
ba2eb0d7ad
Take the words-fst into account when retrieving the biggest values 2020-08-26 14:36:22 +02:00
Clément Renault
32da07ccee
Introduce the word-positions-doc-ids and words-positions infos commands 2020-08-23 10:52:47 +02:00
Clément Renault
d19f394630
Make the indexer support gzipped CSV as input 2020-08-21 18:10:24 +02:00
Clément Renault
ff479c865d
Replace pipe by ringtail to improve stdin read performances 2020-08-21 17:45:52 +02:00
Clément Renault
ada30c2789
Introducing more arguments to specify the different compression algorithms 2020-08-21 16:41:26 +02:00
Clément Renault
02335ee72d
Introduce the biggest-value-sizes command on the infos binary 2020-08-21 14:44:42 +02:00
Clément Renault
1e3e756c19
Introduce the words-frequencies command on the infos binary 2020-08-21 14:44:42 +02:00
Kerollmops
6a230fe803
Move the contains_documents logic to a function 2020-08-21 14:44:42 +02:00
Kerollmops
e55a569629
Compress much more the documents database 2020-08-21 14:44:42 +02:00
Kerollmops
962bad3cea
Introduce an infos binary to fetch stats 2020-08-17 19:41:49 +02:00
Clément Renault
8806fcd545
Introduce a better query and document lexer 2020-08-16 14:36:54 +02:00
Clément Renault
1e358e3ae8
Introduce the AstarBagIter that iterates through best paths 2020-08-15 16:24:06 +02:00
Clément Renault
7dc594ba4d
Introduce the Search builder struct 2020-08-13 14:27:51 +02:00
Clément Renault
bfb46cbfbe
Introduce the Criterion enum 2020-08-12 10:43:02 +02:00
Clément Renault
6d04a285dc
Retrieve and display the distances of the words found 2020-08-11 15:18:02 +02:00
Clément Renault
1bd37d213a
Lowercase quoted words 2020-08-10 14:49:09 +02:00
Clément Renault
883a8109c8
Show both database and documents database sizes 2020-08-10 14:37:18 +02:00
Clément Renault
a4e0f3f724
Remove the useless TransitiveArc from the serve binary 2020-08-10 14:06:27 +02:00
Clément Renault
edc06a97d6
Remove the useless stats binary 2020-08-10 13:55:02 +02:00
Clément Renault
ae77fe5a69
Introduce an option to specify the maximum database size 2020-08-10 13:53:53 +02:00
Clément Renault
394844062f
Move the documents MTBL database inside the Index 2020-08-10 13:47:19 +02:00
Clément Renault
ecd2b2f217
Make the final merge done in parallel 2020-08-07 15:44:04 +02:00
Clément Renault
91282c8b6a
Move the documents into another file 2020-08-07 13:11:31 +02:00
Clément Renault
fae694a102
Put the documents into an MTBL database 2020-08-07 12:14:40 +02:00
Clément Renault
405a71d3a4
Accept csv from stdin 2020-08-06 13:38:21 +02:00
Clément Renault
d3b1096510
Compute the word attribute postings lists on each threads 2020-08-06 11:50:27 +02:00
Clément Renault
8d734941af
Clean up some lines 2020-08-06 10:20:26 +02:00
Clément Renault
6508d497ce
Replace the regex highlighting by a simple algorithm 2020-08-05 13:52:27 +02:00
Clément Renault
4873abe145
Introduce option flags to toggle the indexing engine 2020-08-05 12:10:41 +02:00
Clément Renault
bd4b18541c
Introduce a new indexer which uses an MTBL sorter 2020-08-04 15:44:37 +02:00
Kerollmops
ee305c9284
Replace the title by the milli logo 2020-07-15 23:55:28 +02:00
Kerollmops
9ade00e27b
Highlight all the matching words 2020-07-14 11:53:21 +02:00
Kerollmops
085c376655
Use the regex crate to highlight "hello" 2020-07-14 11:28:40 +02:00
Kerollmops
aa92311d4e
Add a dark theme to the dashboard 2020-07-13 23:51:41 +02:00
Kerollmops
3d144e62c4
Search for best proximities in multiple attributes 2020-07-13 19:06:56 +02:00
Kerollmops
576dd011a1
Compute the candidates but not by attribute 2020-07-13 18:16:05 +02:00
Kerollmops
6b14b20369
Introduce a method to retrieve the number of attributes of the documents 2020-07-13 17:50:16 +02:00
Kerollmops
92c2b1dd2d
Refine the help message of the binaries 2020-07-12 11:06:45 +02:00
Kerollmops
f757df5dfd
Introduce the stderr logger to the project 2020-07-12 11:04:35 +02:00
Kerollmops
12358476da
Use the log crate instead of stderr 2020-07-12 10:55:09 +02:00
Kerollmops
2c62eeea3c
Rename the project milli 2020-07-12 00:16:41 +02:00
Kerollmops
d31da26a51
Avoid cloning RoaringBitmaps when unnecessary 2020-07-11 23:51:32 +02:00
Kerollmops
b8a1fc0126
Clean up the CSS style custom bulma rules 2020-07-11 14:51:59 +02:00
Kerollmops
f6eae91c7d
Pretty print the new dashboard numbers 2020-07-11 14:17:37 +02:00
Kerollmops
d44428fa90
Display more informations on the dashboard 2020-07-11 11:51:56 +02:00
Kerollmops
11c7fef80a
Implement a memory dumper
It moves the in memory HashMaps used when indexing to a disk based MTBL file
2020-07-07 16:48:49 +02:00
Kerollmops
b12bfcb03b
Reduce the deepness of the word position document ids
This helps reduce the number of allocations.
2020-07-07 12:30:05 +02:00
Kerollmops
7178b6c2c4
First basic version using MTBL again 2020-07-07 11:32:33 +02:00
Kerollmops
adb1038b26
Add a jobs parameter to set the number of threads the indexer uses 2020-07-06 12:17:17 +02:00
Kerollmops
ec1023e790
Intersect document ids by inverse popularity of the words
This reduces the time of the worst request we had from 56s to 3s ("the best of the do").
2020-07-05 19:33:51 +02:00
Kerollmops
cd7e64b2b3
Allow users to set the arc cache size when indexing 2020-07-04 18:12:41 +02:00
Kerollmops
ac8353a64f
Merge pre-computed word attribute documents ids 2020-07-04 17:02:27 +02:00
Kerollmops
fea7cac206
Display the time it took to compute the word attribute documents ids 2020-07-04 15:18:38 +02:00
Kerollmops
46ced5c828
Introduce the RwIter append heed API 2020-07-04 12:34:10 +02:00
Kerollmops
7e7440c431
Finalize the LMDB indexing design 2020-07-01 22:45:43 +02:00
Kerollmops
2ae3f40971
Make the indexer ignore certain words
This is a preparation for making the indexing fully parallel by making the
indexer only be aware of certain words for each threads to avoid postings lists
conflicts for each words
2020-07-01 17:49:46 +02:00
Kerollmops
a3ac2623d5
Introduce multiple functions to clean up the code 2020-07-01 17:24:55 +02:00
Kerollmops
ac5cc7ddad
Introduce an Iterator yielding owned entries for the LruCache 2020-07-01 17:21:52 +02:00
Kerollmops
014a25697d
Use only one ARC cache based on the words 2020-07-01 12:03:18 +02:00
Kerollmops
fc4013a43f
Fix the ARC cache 2020-07-01 10:35:07 +02:00
Kerollmops
2fcae719ad
Use another LRU impl which uses hashbrown 2020-06-29 22:26:06 +02:00
Kerollmops
f98b615bf3
Replace the LRU by an Arc cache 2020-06-29 20:48:57 +02:00
Kerollmops
07abebfc46
Introduce a (too big) LRU cache 2020-06-29 18:15:03 +02:00
Kerollmops
5f0088594b
Index by writing directly into LMDB 2020-06-29 13:54:47 +02:00
Kerollmops
63cbeca64e
Skip all derived words when too short 2020-06-28 12:13:12 +02:00
Kerollmops
736f0f7560
Use the proximity instead of the attributes when searching for <= 7 proximities 2020-06-28 12:13:12 +02:00
Kerollmops
fe3be8f18a
Replace the HashMap by a Vec for attributes documents ids 2020-06-28 12:13:12 +02:00
Kerollmops
6a2834f2b0
Add a jobs parameter to set the number of threads the indexer uses 2020-06-28 12:13:10 +02:00
Kerollmops
7e16afbdce
Ignore documents which are not part of the candidates when exploring with A* 2020-06-24 15:06:45 +02:00
Kerollmops
1c7a9a4132
Remove the found documents from the candidates list 2020-06-24 15:00:26 +02:00
Kerollmops
50169b9798
Compute the full list of ids we are willing to find by attribute 2020-06-24 14:48:04 +02:00
Kerollmops
374ec6773f
Introduce a database to store all docids for a word and attribute 2020-06-22 19:24:20 +02:00
Kerollmops
a044cb6cc8
Clean up the warnings for prefix postings 2020-06-22 18:10:31 +02:00
Kerollmops
ba3e805981
Document the Index types and the internal LMDB databases 2020-06-22 18:09:22 +02:00
Kerollmops
2f0e1afd16
Introduce the roaring bitmap heed codec 2020-06-22 17:56:07 +02:00
Kerollmops
8148210860
Use the cache when retrieving the documents at the end 2020-06-21 12:25:19 +02:00
Kerollmops
1628a31efa
Cache the unions of the derived words positions 2020-06-20 15:38:10 +02:00
Kerollmops
115e0142d9
Add a feature flags to enable the export of stats 2020-06-20 13:25:42 +02:00
Kerollmops
beb49b24f6
Skip looking at connections for proximity 0 2020-06-20 13:19:03 +02:00
Kerollmops
c84012d655
Accept queries from standard input when not given as argument 2020-06-20 12:01:15 +02:00
Kerollmops
55a8941922
Optimize things 2020-06-19 17:48:17 +02:00
Kerollmops
a3ca80d20d
Ignore every proximities bigger or equal to 8 2020-06-18 15:42:46 +02:00
Kerollmops
3577de04b8
Reduce the number of KV lookups to the successful ones only 2020-06-16 12:58:29 +02:00
Kerollmops
e974e6b3c9
Acquire search intersections metrics 2020-06-16 12:10:23 +02:00
Kerollmops
8db16ff306
Add a cache to the contains_documents success function 2020-06-14 13:39:39 +02:00
Kerollmops
a8cda248b4
Introduce a customized A* algorithm.
This custom algo lazily compute the intersections between words, to avoid too much set operations and database reads
2020-06-14 12:51:57 +02:00
Kerollmops
69285b22d3
Check that an edges combination contains results 2020-06-13 11:16:02 +02:00
Kerollmops
b9cc6c10af
Introduce a function to ignore useless paths 2020-06-13 00:17:43 +02:00
Kerollmops
d02c5cb023
Fix node skipping by computing the accumulated proximity 2020-06-12 14:08:46 +02:00
Kerollmops
37a48489da
Reworked the best proximity algo a little bit 2020-06-12 12:53:08 +02:00
Kerollmops
302866ad73
Make the algo don't work with an astar 2020-06-11 17:43:06 +02:00
Kerollmops
0a83a86e65
Fix multiple bugs 2020-06-11 11:55:03 +02:00
Kerollmops
4e86ecf807
Retrieve the words before the intersect loops 2020-06-10 22:05:01 +02:00
Kerollmops
6ca3579cc0
Add more time debug measurements 2020-06-10 21:35:01 +02:00
Kerollmops
66a4b26811
Introduce a proximity based documents retriever 2020-06-10 16:54:28 +02:00
Kerollmops
78f27c0465
squash-me: Remove debugs 2020-06-10 16:29:46 +02:00
Kerollmops
3ad883d7c7
squash-me: Make the dijkstra work even with different attributes 2020-06-10 16:27:02 +02:00
Kerollmops
fecd8ca54a
squash-me: It works! we must remove the debug after having added more tests 2020-06-10 14:20:35 +02:00
Kerollmops
13977d9338
squash-me 2020-06-09 23:06:59 +02:00
Kerollmops
5d5b827f1a
Squash-me 2020-06-09 17:32:25 +02:00
Kerollmops
2a6d6a7f69
Introduce a first draft of the best_proximity algorithm 2020-06-09 10:11:43 +02:00
Kerollmops
dfdaceb410
Introduce a first basic working positions-based engine 2020-06-05 20:13:19 +02:00
Kerollmops
f51a63e4ef
Store documents ids under attribute ids 2020-06-05 16:32:14 +02:00
Kerollmops
ce86a43779
Make the query tokenizer a real Iterator 2020-06-05 09:49:28 +02:00
Kerollmops
f55f4cb02a
Not fetch the cached prefix postings when prefix is disabled 2020-06-04 21:22:45 +02:00
Kerollmops
eefc6d7c44
Add support for quoted query phrases 2020-06-04 20:25:51 +02:00
Kerollmops
1f7035f18f
Just do a little clean-up 2020-06-04 19:13:28 +02:00
Kerollmops
71dc6a3828
Disable prefix search when query is ended by a whitespace 2020-06-04 18:37:20 +02:00
Kerollmops
5d1c625b74
Change the page index texts 2020-06-04 18:20:57 +02:00
Kerollmops
c42d3c19e2
Merge the whole list of generated MTBL in one go 2020-06-04 17:38:43 +02:00
Kerollmops
3a23dc242e
More efficiently merge MTBLs, more than two at a time 2020-06-04 16:17:24 +02:00
Kerollmops
1df1f88fe1
Directly write to LMDB without intermediate final MTBL 2020-06-01 21:30:39 +02:00
Kerollmops
2174042994
Merge only 3 MTBL at the same time 2020-06-01 19:49:58 +02:00
Kerollmops
5cc81a0179
Merge many MTBL into one at the same time 2020-06-01 18:39:58 +02:00
Kerollmops
6a047519f6
Do a merge two by two 2020-06-01 18:27:26 +02:00
Kerollmops
5404776f7a
Add a little bit more debug 2020-06-01 17:52:43 +02:00
Kerollmops
dff68a339a
Use OnceCell to cache levenshtein builders 2020-05-31 19:27:11 +02:00
Kerollmops
dde3e01a59
Introduce prefix postings ids for better perfs 2020-05-31 18:20:49 +02:00
Kerollmops
a26553c90a
Reintroduce a simple HTTP server 2020-05-31 17:48:13 +02:00
Kerollmops
2a10b2275e
Support prefix typo tolerant search 2020-05-31 17:18:13 +02:00
Kerollmops
ba9527abc0
Support typos with a levenshtein automata 2020-05-31 17:01:11 +02:00
Kerollmops
6c726df9b9
Support multiple space separated words 2020-05-31 16:09:34 +02:00
Kerollmops
24587148fd
Introduce MTBL parallel merging before LMDB writing 2020-05-31 14:22:57 +02:00
Kerollmops
6762c2d08f
Clean up a little bit 2020-05-31 14:22:57 +02:00
Kerollmops
3a998cf39c
Far better usage of rayon to fold indexed data 2020-05-31 14:22:57 +02:00
Kerollmops
1237306ca8
Introduce a thread that write to heed 2020-05-31 14:22:57 +02:00
Kerollmops
3668627e03
Use zerocopy without bitpacking as a first step 2020-05-31 14:22:07 +02:00
Kerollmops
a81f201fad
Introduce the use of RocksDB instead of sled (RAM) 2020-05-31 14:22:06 +02:00
Kerollmops
91ba938953
Initial commit 2020-05-31 14:22:06 +02:00