Kerollmops
b12bfcb03b
Reduce the depth of the word position document ids
This helps reduce the number of allocations.
2020-07-07 12:30:05 +02:00
Kerollmops
7178b6c2c4
First basic version using MTBL again
2020-07-07 11:32:33 +02:00
Kerollmops
adb1038b26
Add a jobs parameter to set the number of threads the indexer uses
2020-07-06 12:17:17 +02:00
Kerollmops
ec1023e790
Intersect document ids by inverse popularity of the words
This reduces our worst request ("the best of the do") from 56s to 3s.
2020-07-05 19:33:51 +02:00
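A minimal sketch of the idea behind intersecting by inverse popularity, assuming each word maps to a set of document ids. The helper name `intersect_by_rarity` and the use of `BTreeSet<u32>` are assumptions for illustration, not the commit's actual code. Starting from the rarest word keeps the running intersection small and lets the loop stop as soon as it is empty.

```rust
use std::collections::BTreeSet;

/// Intersects several document-id sets, visiting the smallest
/// (least popular) set first. Hypothetical helper, not the real API.
fn intersect_by_rarity(mut postings: Vec<BTreeSet<u32>>) -> BTreeSet<u32> {
    // Rarest words first: the running intersection can never be
    // larger than the smallest set.
    postings.sort_by_key(|set| set.len());

    let mut iter = postings.into_iter();
    let mut acc = match iter.next() {
        Some(first) => first,
        None => return BTreeSet::new(),
    };

    for set in iter {
        // Keep only the ids that are also present in the next set.
        acc.retain(|id| set.contains(id));
        // Nothing can match any more, so skip the remaining, larger sets.
        if acc.is_empty() {
            break;
        }
    }
    acc
}
```

With a very common word such as "the" and a rare one such as "best", the rare set bounds the work done for every later word.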
Kerollmops
cd7e64b2b3
Allow users to set the arc cache size when indexing
2020-07-04 18:12:41 +02:00
Kerollmops
ac8353a64f
Merge pre-computed word attribute documents ids
2020-07-04 17:02:27 +02:00
Kerollmops
fea7cac206
Display the time it took to compute the word attribute documents ids
2020-07-04 15:18:38 +02:00
Kerollmops
46ced5c828
Introduce the RwIter append heed API
2020-07-04 12:34:10 +02:00
Kerollmops
7e7440c431
Finalize the LMDB indexing design
2020-07-01 22:45:43 +02:00
Kerollmops
2ae3f40971
Make the indexer ignore certain words
This prepares for making the indexing fully parallel: each thread's
indexer is only aware of certain words, which avoids postings list
conflicts on any given word.
2020-07-01 17:49:46 +02:00
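One possible way to make each thread responsible for a disjoint subset of words is to hash the word and take it modulo the number of threads. The commit does not say how words are assigned, so the function `is_word_owned` and the hashing scheme below are assumptions.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Returns true when worker `thread_index` (out of `num_threads`) owns `word`.
/// Hypothetical helper; `num_threads` must be greater than zero.
fn is_word_owned(word: &str, thread_index: usize, num_threads: usize) -> bool {
    // `DefaultHasher::new()` uses fixed keys, so every thread computes
    // the same hash for the same word.
    let mut hasher = DefaultHasher::new();
    word.hash(&mut hasher);
    (hasher.finish() % num_threads as u64) as usize == thread_index
}
```

Each word then has exactly one owner, so no two threads ever write the same postings list.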
Kerollmops
a3ac2623d5
Introduce multiple functions to clean up the code
2020-07-01 17:24:55 +02:00
Kerollmops
ac5cc7ddad
Introduce an Iterator yielding owned entries for the LruCache
2020-07-01 17:21:52 +02:00
Kerollmops
014a25697d
Use only one ARC cache based on the words
2020-07-01 12:03:18 +02:00
Kerollmops
fc4013a43f
Fix the ARC cache
2020-07-01 10:35:07 +02:00
Kerollmops
2fcae719ad
Use another LRU impl which uses hashbrown
2020-06-29 22:26:06 +02:00
Kerollmops
f98b615bf3
Replace the LRU by an Arc cache
2020-06-29 20:48:57 +02:00
Kerollmops
07abebfc46
Introduce a (too big) LRU cache
2020-06-29 18:15:03 +02:00
Kerollmops
5f0088594b
Index by writing directly into LMDB
2020-06-29 13:54:47 +02:00
Kerollmops
63cbeca64e
Skip all derived words when too short
2020-06-28 12:13:12 +02:00
Kerollmops
736f0f7560
Use the proximity instead of the attributes when searching for <= 7 proximities
2020-06-28 12:13:12 +02:00
Kerollmops
fe3be8f18a
Replace the HashMap by a Vec for attributes documents ids
2020-06-28 12:13:12 +02:00
Kerollmops
6a2834f2b0
Add a jobs parameter to set the number of threads the indexer uses
2020-06-28 12:13:10 +02:00
Kerollmops
7e16afbdce
Ignore documents which are not part of the candidates when exploring with A*
2020-06-24 15:06:45 +02:00
Kerollmops
1c7a9a4132
Remove the found documents from the candidates list
2020-06-24 15:00:26 +02:00
Kerollmops
50169b9798
Compute the full list of ids we are willing to find by attribute
2020-06-24 14:48:04 +02:00
Kerollmops
374ec6773f
Introduce a database to store all docids for a word and attribute
2020-06-22 19:24:20 +02:00
Kerollmops
a044cb6cc8
Clean up the warnings for prefix postings
2020-06-22 18:10:31 +02:00
Kerollmops
ba3e805981
Document the Index types and the internal LMDB databases
2020-06-22 18:09:22 +02:00
Kerollmops
2f0e1afd16
Introduce the roaring bitmap heed codec
2020-06-22 17:56:07 +02:00
Kerollmops
8148210860
Use the cache when retrieving the documents at the end
2020-06-21 12:25:19 +02:00
Kerollmops
1628a31efa
Cache the unions of the derived words positions
2020-06-20 15:38:10 +02:00
Kerollmops
115e0142d9
Add a feature flag to enable the export of stats
2020-06-20 13:25:42 +02:00
Kerollmops
beb49b24f6
Skip looking at connections for proximity 0
2020-06-20 13:19:03 +02:00
Kerollmops
c84012d655
Accept queries from standard input when not given as argument
2020-06-20 12:01:15 +02:00
Kerollmops
55a8941922
Optimize things
2020-06-19 17:48:17 +02:00
Kerollmops
a3ca80d20d
Ignore every proximity greater than or equal to 8
2020-06-18 15:42:46 +02:00
Kerollmops
3577de04b8
Reduce the number of KV lookups to the successful ones only
2020-06-16 12:58:29 +02:00
Kerollmops
e974e6b3c9
Acquire search intersections metrics
2020-06-16 12:10:23 +02:00
Kerollmops
8db16ff306
Add a cache to the contains_documents success function
2020-06-14 13:39:39 +02:00
Kerollmops
a8cda248b4
Introduce a customized A* algorithm.
This custom algo lazily computes the intersections between words, to avoid too many set operations and database reads
2020-06-14 12:51:57 +02:00
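The lazy part can be sketched on its own, without the A* search: an intersection between two words is computed only the first time it is asked for, then reused. The type `LazyIntersections`, its fields, and the in-memory `BTreeSet<u32>` postings are assumptions standing in for the real database reads.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical cache that computes word-pair intersections on demand.
struct LazyIntersections {
    postings: Vec<BTreeSet<u32>>,
    cache: HashMap<(usize, usize), BTreeSet<u32>>,
    /// Number of intersections actually computed so far.
    computed: usize,
}

impl LazyIntersections {
    fn new(postings: Vec<BTreeSet<u32>>) -> Self {
        LazyIntersections { postings, cache: HashMap::new(), computed: 0 }
    }

    /// Returns the documents shared by words `a` and `b`,
    /// computing the set on first use only.
    fn get(&mut self, a: usize, b: usize) -> &BTreeSet<u32> {
        // Intersection is symmetric, so normalize the cache key.
        let key = if a <= b { (a, b) } else { (b, a) };
        if !self.cache.contains_key(&key) {
            let set: BTreeSet<u32> = self.postings[key.0]
                .intersection(&self.postings[key.1])
                .copied()
                .collect();
            self.computed += 1;
            self.cache.insert(key, set);
        }
        &self.cache[&key]
    }
}
```

A search that only explores a few word pairs pays for those pairs alone, and pairs it never visits cost nothing.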
Kerollmops
69285b22d3
Check that an edges combination contains results
2020-06-13 11:16:02 +02:00
Kerollmops
b9cc6c10af
Introduce a function to ignore useless paths
2020-06-13 00:17:43 +02:00
Kerollmops
d02c5cb023
Fix node skipping by computing the accumulated proximity
2020-06-12 14:08:46 +02:00
Kerollmops
37a48489da
Rework the best proximity algo a little bit
2020-06-12 12:53:08 +02:00
Kerollmops
302866ad73
Make the algo don't work with an astar
2020-06-11 17:43:06 +02:00
Kerollmops
0a83a86e65
Fix multiple bugs
2020-06-11 11:55:03 +02:00
Kerollmops
4e86ecf807
Retrieve the words before the intersect loops
2020-06-10 22:05:01 +02:00
Kerollmops
6ca3579cc0
Add more time debug measurements
2020-06-10 21:35:01 +02:00
Kerollmops
66a4b26811
Introduce a proximity based documents retriever
2020-06-10 16:54:28 +02:00
Kerollmops
78f27c0465
squash-me: Remove debugs
2020-06-10 16:29:46 +02:00
Kerollmops
3ad883d7c7
squash-me: Make the dijkstra work even with different attributes
2020-06-10 16:27:02 +02:00
Kerollmops
fecd8ca54a
squash-me: It works! we must remove the debug after having added more tests
2020-06-10 14:20:35 +02:00
Kerollmops
13977d9338
squash-me
2020-06-09 23:06:59 +02:00
Kerollmops
5d5b827f1a
Squash-me
2020-06-09 17:32:25 +02:00
Kerollmops
2a6d6a7f69
Introduce a first draft of the best_proximity algorithm
2020-06-09 10:11:43 +02:00
Kerollmops
dfdaceb410
Introduce a first basic working positions-based engine
2020-06-05 20:13:19 +02:00
Kerollmops
f51a63e4ef
Store documents ids under attribute ids
2020-06-05 16:32:14 +02:00
Kerollmops
ce86a43779
Make the query tokenizer a real Iterator
2020-06-05 09:49:28 +02:00
Kerollmops
f55f4cb02a
Do not fetch the cached prefix postings when prefix search is disabled
2020-06-04 21:22:45 +02:00
Kerollmops
eefc6d7c44
Add support for quoted query phrases
2020-06-04 20:25:51 +02:00
Kerollmops
1f7035f18f
Just do a little clean-up
2020-06-04 19:13:28 +02:00
Kerollmops
71dc6a3828
Disable prefix search when the query ends with whitespace
2020-06-04 18:37:20 +02:00
Kerollmops
5d1c625b74
Change the page index texts
2020-06-04 18:20:57 +02:00
Kerollmops
c42d3c19e2
Merge the whole list of generated MTBL in one go
2020-06-04 17:38:43 +02:00
Kerollmops
3a23dc242e
More efficiently merge MTBLs, more than two at a time
2020-06-04 16:17:24 +02:00
Kerollmops
1df1f88fe1
Directly write to LMDB without intermediate final MTBL
2020-06-01 21:30:39 +02:00
Kerollmops
2174042994
Merge only 3 MTBL at the same time
2020-06-01 19:49:58 +02:00
Kerollmops
5cc81a0179
Merge many MTBL into one at the same time
2020-06-01 18:39:58 +02:00
Kerollmops
6a047519f6
Do a merge two by two
2020-06-01 18:27:26 +02:00
Kerollmops
5404776f7a
Add a little bit more debug
2020-06-01 17:52:43 +02:00
Kerollmops
dff68a339a
Use OnceCell to cache levenshtein builders
2020-05-31 19:27:11 +02:00
Kerollmops
dde3e01a59
Introduce prefix postings ids for better perfs
2020-05-31 18:20:49 +02:00
Kerollmops
a26553c90a
Reintroduce a simple HTTP server
2020-05-31 17:48:13 +02:00
Kerollmops
2a10b2275e
Support prefix typo tolerant search
2020-05-31 17:18:13 +02:00
Kerollmops
ba9527abc0
Support typos with a levenshtein automata
2020-05-31 17:01:11 +02:00
Kerollmops
6c726df9b9
Support multiple space-separated words
2020-05-31 16:09:34 +02:00
Kerollmops
24587148fd
Introduce MTBL parallel merging before LMDB writing
2020-05-31 14:22:57 +02:00
Kerollmops
6762c2d08f
Clean up a little bit
2020-05-31 14:22:57 +02:00
Kerollmops
3a998cf39c
Far better usage of rayon to fold indexed data
2020-05-31 14:22:57 +02:00
Kerollmops
1237306ca8
Introduce a thread that writes to heed
2020-05-31 14:22:57 +02:00
Kerollmops
3668627e03
Use zerocopy without bitpacking as a first step
2020-05-31 14:22:07 +02:00
Kerollmops
a81f201fad
Introduce the use of RocksDB instead of sled (RAM)
2020-05-31 14:22:06 +02:00
Kerollmops
91ba938953
Initial commit
2020-05-31 14:22:06 +02:00