Clément Renault
1f6e00878d
Use the word pair proximities in the search algorithm
2020-09-22 18:47:55 +02:00
Clément Renault
31224a8425
Index the word pair proximities for both orders of the pair
2020-09-22 14:49:22 +02:00
Clément Renault
a58ae5eb2a
Introduce the word-pair-proximities-docids infos subcommand
2020-09-22 14:04:34 +02:00
Clément Renault
d6fa9c0414
Index the intra documents word pair proximities
2020-09-22 14:04:33 +02:00
Clément Renault
7b67ae6972
Introduce the StrStrU8 heed codec
2020-09-22 12:44:17 +02:00
Clément Renault
e34437b2d7
Move the proximity function to a module
2020-09-22 10:54:59 +02:00
Clément Renault
15208c7d3d
Simplify the indexer record loop
2020-09-22 10:33:30 +02:00
Clément Renault
e5adfaade0
Replace the token filter by a filter mapper
2020-09-22 10:24:31 +02:00
Clément Renault
d21c80b865
Apply the chunk compression parameters on all the MTBL writers
2020-09-21 18:30:54 +02:00
Clément Renault
944df52e2a
Simplify the indexer main loop
2020-09-21 14:59:48 +02:00
Kerollmops
3ded98e5fa
Bump the roaring version that fixes a deserialization bug
2020-09-10 22:37:51 +02:00
Kerollmops
d5e5baa20f
Bump the oxidized-mtbl dependency
2020-09-10 13:29:12 +02:00
Kerollmops
aed0704404
Remove the temporary optimisation
2020-09-08 14:48:33 +02:00
Kerollmops
072382fa61
Sort the word docids to make intersections much faster
2020-09-07 22:38:49 +02:00
Kerollmops
ad11c5fb3f
Introduce the words-docids command for the infos binary
2020-09-07 22:36:35 +02:00
Kerollmops
5664c37539
Introduce a heed codec that reduces the size of small amounts of serialized integers
2020-09-07 20:06:23 +02:00
Kerollmops
3e2250423c
Introduce the average-number-of-positions infos subcommand
2020-09-07 15:26:42 +02:00
Kerollmops
ea605b499c
Introduce two new infos subcommands
2020-09-07 14:56:48 +02:00
Clément Renault
bb1ab428db
Use another function to define the proximity
2020-09-06 17:55:07 +02:00
Clément Renault
dec460ce52
Fix the infos binary and add commands
2020-09-06 17:14:20 +02:00
Clément Renault
daa3673c1c
Invert the word docid positions key order
2020-09-06 10:30:53 +02:00
Clément Renault
c2405bcae2
Prefer using the word_docids db to create the words-fst
2020-09-06 10:23:56 +02:00
Kerollmops
4ca9472e02
Fix the minimum proximity len
2020-09-06 10:19:34 +02:00
Clément Renault
1c504471d3
Introduce the plane-sweep algorithm
2020-09-05 18:25:27 +02:00
Clément Renault
dc88a86259
Store the word positions under the documents
2020-09-05 18:03:06 +02:00
Kerollmops
580ed1119a
Make the engine return CSV string records as documents and headers
2020-08-31 19:02:00 +02:00
Clément Renault
bad0663138
Come back to the old tokenizer
2020-08-31 13:34:38 +02:00
Clément Renault
4afc4d0751
Use the groups of four positions to speed up disjunction tests
2020-08-30 16:25:11 +02:00
Clément Renault
605f75b56f
Add the words grouped by four positions in the infos binary
2020-08-29 18:23:33 +02:00
Clément Renault
ad5cafbfed
Introduce a database to store docids in groups of four positions
2020-08-29 17:42:55 +02:00
Clément Renault
3db517548d
Move the documents back into the LMDB database
2020-08-29 15:14:04 +02:00
Clément Renault
816db7a0aa
Improve the RoaringBitmap codec to reserve enough vector space
2020-08-29 11:21:30 +02:00
Clément Renault
3fe497e129
Improve the Mtbl heed codec to only encode MTBL databases
2020-08-29 11:20:39 +02:00
Clément Renault
21aafd603c
Make sure the first document is associated to the document id 0
2020-08-29 10:56:40 +02:00
Clément Renault
0a44ff86ab
Put the documents MTBL back into LMDB
...
We make sure to write the documents into a file before
memory mapping it and putting it into LMDB; this way we avoid
moving it to RAM.
2020-08-28 15:43:24 +02:00
Clément Renault
d784d87880
Remove the prefix LMDB databases
2020-08-28 14:41:43 +02:00
Clément Renault
7cde312f14
Introduce the StrBEU32Codec heed codec
2020-08-28 14:16:37 +02:00
Clément Renault
34db376ae5
Rename the RoaringBitmapCodec module
2020-08-28 13:31:16 +02:00
Kerollmops
38ddc71b83
Simplify the search algorithm
2020-08-26 15:16:41 +02:00
Kerollmops
ba2eb0d7ad
Take the words-fst into account when retrieving the biggest values
2020-08-26 14:36:22 +02:00
Clément Renault
32da07ccee
Introduce the word-positions-doc-ids and words-positions infos commands
2020-08-23 10:52:47 +02:00
Clément Renault
d19f394630
Make the indexer support gzipped CSV as input
2020-08-21 18:10:24 +02:00
Clément Renault
ff479c865d
Replace pipe with ringtail to improve stdin read performance
2020-08-21 17:45:52 +02:00
Clément Renault
ada30c2789
Introduce more arguments to specify the different compression algorithms
2020-08-21 16:41:26 +02:00
Clément Renault
02335ee72d
Introduce the biggest-value-sizes command on the infos binary
2020-08-21 14:44:42 +02:00
Clément Renault
1e3e756c19
Introduce the words-frequencies command on the infos binary
2020-08-21 14:44:42 +02:00
Kerollmops
6a230fe803
Move the contains_documents logic to a function
2020-08-21 14:44:42 +02:00
Kerollmops
e55a569629
Compress the documents database much more
2020-08-21 14:44:42 +02:00
Kerollmops
962bad3cea
Introduce an infos binary to fetch stats
2020-08-17 19:41:49 +02:00
Clément Renault
8806fcd545
Introduce a better query and document lexer
2020-08-16 14:36:54 +02:00
Clément Renault
1e358e3ae8
Introduce the AstarBagIter that iterates through best paths
2020-08-15 16:24:06 +02:00
Clément Renault
7dc594ba4d
Introduce the Search builder struct
2020-08-13 14:27:51 +02:00
Clément Renault
bfb46cbfbe
Introduce the Criterion enum
2020-08-12 10:43:02 +02:00
Clément Renault
6d04a285dc
Retrieve and display the distances of the words found
2020-08-11 15:18:02 +02:00
Clément Renault
1bd37d213a
Lowercase quoted words
2020-08-10 14:49:09 +02:00
Clément Renault
883a8109c8
Show both database and documents database sizes
2020-08-10 14:37:18 +02:00
Clément Renault
a4e0f3f724
Remove the useless TransitiveArc from the serve binary
2020-08-10 14:06:27 +02:00
Clément Renault
edc06a97d6
Remove the useless stats binary
2020-08-10 13:55:02 +02:00
Clément Renault
ae77fe5a69
Introduce an option to specify the maximum database size
2020-08-10 13:53:53 +02:00
Clément Renault
394844062f
Move the documents MTBL database inside the Index
2020-08-10 13:47:19 +02:00
Clément Renault
ecd2b2f217
Make the final merge done in parallel
2020-08-07 15:44:04 +02:00
Clément Renault
91282c8b6a
Move the documents into another file
2020-08-07 13:11:31 +02:00
Clément Renault
fae694a102
Put the documents into an MTBL database
2020-08-07 12:14:40 +02:00
Clément Renault
405a71d3a4
Accept csv from stdin
2020-08-06 13:38:21 +02:00
Clément Renault
d3b1096510
Compute the word attribute postings lists on each thread
2020-08-06 11:50:27 +02:00
Clément Renault
8d734941af
Clean up some lines
2020-08-06 10:20:26 +02:00
Clément Renault
6508d497ce
Replace the regex highlighting by a simple algorithm
2020-08-05 13:52:27 +02:00
Clément Renault
4873abe145
Introduce option flags to toggle the indexing engine
2020-08-05 12:10:41 +02:00
Clément Renault
bd4b18541c
Introduce a new indexer which uses an MTBL sorter
2020-08-04 15:44:37 +02:00
Kerollmops
ee305c9284
Replace the title by the milli logo
2020-07-15 23:55:28 +02:00
Kerollmops
9ade00e27b
Highlight all the matching words
2020-07-14 11:53:21 +02:00
Kerollmops
085c376655
Use the regex crate to highlight "hello"
2020-07-14 11:28:40 +02:00
Kerollmops
aa92311d4e
Add a dark theme to the dashboard
2020-07-13 23:51:41 +02:00
Kerollmops
3d144e62c4
Search for best proximities in multiple attributes
2020-07-13 19:06:56 +02:00
Kerollmops
576dd011a1
Compute the candidates but not by attribute
2020-07-13 18:16:05 +02:00
Kerollmops
6b14b20369
Introduce a method to retrieve the number of attributes of the documents
2020-07-13 17:50:16 +02:00
Kerollmops
92c2b1dd2d
Refine the help message of the binaries
2020-07-12 11:06:45 +02:00
Kerollmops
f757df5dfd
Introduce the stderr logger to the project
2020-07-12 11:04:35 +02:00
Kerollmops
12358476da
Use the log crate instead of stderr
2020-07-12 10:55:09 +02:00
Kerollmops
2c62eeea3c
Rename the project milli
2020-07-12 00:16:41 +02:00
Kerollmops
d31da26a51
Avoid cloning RoaringBitmaps when unnecessary
2020-07-11 23:51:32 +02:00
Kerollmops
b8a1fc0126
Clean up the CSS style custom bulma rules
2020-07-11 14:51:59 +02:00
Kerollmops
f6eae91c7d
Pretty print the new dashboard numbers
2020-07-11 14:17:37 +02:00
Kerollmops
d44428fa90
Display more information on the dashboard
2020-07-11 11:51:56 +02:00
Kerollmops
11c7fef80a
Implement a memory dumper
...
It moves the in-memory HashMaps used when indexing to a disk-based MTBL file.
2020-07-07 16:48:49 +02:00
Kerollmops
b12bfcb03b
Reduce the depth of the word position document ids
...
This helps reduce the number of allocations.
2020-07-07 12:30:05 +02:00
Kerollmops
7178b6c2c4
First basic version using MTBL again
2020-07-07 11:32:33 +02:00
Kerollmops
adb1038b26
Add a jobs parameter to set the number of threads the indexer uses
2020-07-06 12:17:17 +02:00
Kerollmops
ec1023e790
Intersect document ids by inverse popularity of the words
...
This reduces our worst request ("the best of the do") from 56s to 3s.
2020-07-05 19:33:51 +02:00
Kerollmops
cd7e64b2b3
Allow users to set the arc cache size when indexing
2020-07-04 18:12:41 +02:00
Kerollmops
ac8353a64f
Merge pre-computed word attribute documents ids
2020-07-04 17:02:27 +02:00
Kerollmops
fea7cac206
Display the time it took to compute the word attribute documents ids
2020-07-04 15:18:38 +02:00
Kerollmops
46ced5c828
Introduce the RwIter append heed API
2020-07-04 12:34:10 +02:00
Kerollmops
7e7440c431
Finalize the LMDB indexing design
2020-07-01 22:45:43 +02:00
Kerollmops
2ae3f40971
Make the indexer ignore certain words
...
This prepares for making the indexing fully parallel: the indexer
on each thread is only aware of certain words, which avoids postings list
conflicts for each word.
2020-07-01 17:49:46 +02:00
Kerollmops
a3ac2623d5
Introduce multiple functions to clean up the code
2020-07-01 17:24:55 +02:00
Kerollmops
ac5cc7ddad
Introduce an Iterator yielding owned entries for the LruCache
2020-07-01 17:21:52 +02:00
Kerollmops
014a25697d
Use only one ARC cache based on the words
2020-07-01 12:03:18 +02:00
Kerollmops
fc4013a43f
Fix the ARC cache
2020-07-01 10:35:07 +02:00
Kerollmops
2fcae719ad
Use another LRU impl which uses hashbrown
2020-06-29 22:26:06 +02:00