Commit Graph

499 Commits

Author SHA1 Message Date
Kerollmops 68f4af7d2e
Improve the display of the number of processed documents 2020-09-29 16:08:58 +02:00
Clément Renault ed05999f63
Replace the arc cache by a simple linked hash map 2020-09-23 14:50:52 +02:00
Clément Renault d6fa9c0414
Index the intra documents word pair proximities 2020-09-22 14:04:33 +02:00
Kerollmops 3ded98e5fa
Bump the roaring version that fixes a deserialization bug 2020-09-10 22:37:51 +02:00
Kerollmops d5e5baa20f
Bump the oxidized-mtbl dependency 2020-09-10 13:29:12 +02:00
Kerollmops 0fb086f241
Use the crates.io roaring library 2020-09-08 15:16:04 +02:00
Clément Renault bb1ab428db
Use another function to define the proximity 2020-09-06 17:55:07 +02:00
Clément Renault f928b91e9d
Specify the exact rev for the near-proximity dep 2020-09-06 17:21:38 +02:00
Clément Renault 1c504471d3
Introduce the plane-sweep algorithm 2020-09-05 18:25:27 +02:00
Clément Renault dc88a86259
Store the word positions under the documents 2020-09-05 18:03:06 +02:00
Kerollmops 580ed1119a
Make the engine return CSV string records as documents and headers 2020-08-31 19:02:00 +02:00
Clément Renault bad0663138
Come back to the old tokenizer 2020-08-31 13:34:38 +02:00
Clément Renault 3fe497e129
Improve the Mtbl heed codec to only encode MTBL databases 2020-08-29 11:20:39 +02:00
Clément Renault d19f394630
Make the indexer support gzipped CSV as input 2020-08-21 18:10:24 +02:00
Clément Renault ff479c865d
Replace pipe with ringtail to improve stdin read performance 2020-08-21 17:45:52 +02:00
Clément Renault 8806fcd545
Introduce a better query and document lexer 2020-08-16 14:36:54 +02:00
Clément Renault 1e358e3ae8
Introduce the AstarBagIter that iterates through best paths 2020-08-15 16:24:06 +02:00
Clément Renault fae694a102
Put the documents into an MTBL database 2020-08-07 12:14:40 +02:00
Clément Renault 405a71d3a4
Accept csv from stdin 2020-08-06 13:38:21 +02:00
Clément Renault 6508d497ce
Replace the regex highlighting by a simple algorithm 2020-08-05 13:52:27 +02:00
Clément Renault bd4b18541c
Introduce a new indexer which uses an MTBL sorter 2020-08-04 15:44:37 +02:00
Kerollmops 085c376655
Use the regex crate to highlight "hello" 2020-07-14 11:28:40 +02:00
Kerollmops 12358476da
Use the log crate instead of stderr 2020-07-12 10:55:09 +02:00
Kerollmops 2c62eeea3c
Rename the project milli 2020-07-12 00:16:41 +02:00
Kerollmops f6eae91c7d
Pretty print the new dashboard numbers 2020-07-11 14:17:37 +02:00
Kerollmops 11c7fef80a
Implement a memory dumper
It moves the in-memory HashMaps used when indexing to a disk-based MTBL file
2020-07-07 16:48:49 +02:00
Kerollmops 7178b6c2c4
First basic version using MTBL again 2020-07-07 11:32:33 +02:00
Kerollmops 2a3b03138b
Use heed 0.8.1 with the RwIter append method 2020-07-05 19:50:28 +02:00
Kerollmops 46ced5c828
Introduce the RwIter append heed API 2020-07-04 12:34:10 +02:00
Kerollmops 2ae3f40971
Make the indexer ignore certain words
This is a preparation for making the indexing fully parallel by making the
indexer only be aware of certain words for each thread to avoid postings list
conflicts for each word
2020-07-01 17:49:46 +02:00
Kerollmops f98b615bf3
Replace the LRU by an Arc cache 2020-06-29 20:48:57 +02:00
Kerollmops 07abebfc46
Introduce a (too big) LRU cache 2020-06-29 18:15:03 +02:00
Kerollmops 5f0088594b
Index by writing directly into LMDB 2020-06-29 13:54:47 +02:00
Kerollmops d6705d5529
Introduce the criterion dependency to bench the engine 2020-06-19 18:32:25 +02:00
Kerollmops 55a8941922
Optimize things 2020-06-19 17:48:17 +02:00
Kerollmops a8cda248b4
Introduce a customized A* algorithm.
This custom algo lazily computes the intersections between words, to avoid too many set operations and database reads
2020-06-14 12:51:57 +02:00
Kerollmops 0a83a86e65
Fix multiple bugs 2020-06-11 11:55:03 +02:00
Kerollmops 13977d9338
squash-me 2020-06-09 23:06:59 +02:00
Kerollmops dfdaceb410
Introduce a first basic working positions-based engine 2020-06-05 20:13:19 +02:00
Kerollmops 3a23dc242e
More efficiently merge MTBLs, more than two at a time 2020-06-04 16:17:24 +02:00
Kerollmops dff68a339a
Use OnceCell to cache levenshtein builders 2020-05-31 19:27:11 +02:00
Kerollmops a26553c90a
Reintroduce a simple HTTP server 2020-05-31 17:48:13 +02:00
Kerollmops ba9527abc0
Support typos with a levenshtein automata 2020-05-31 17:01:11 +02:00
Kerollmops 6c726df9b9
Support multiple space-separated words 2020-05-31 16:09:34 +02:00
Kerollmops 24587148fd
Introduce MTBL parallel merging before LMDB writing 2020-05-31 14:22:57 +02:00
Kerollmops 3a998cf39c
Far better usage of rayon to fold indexed data 2020-05-31 14:22:57 +02:00
Kerollmops 1237306ca8
Introduce a thread that writes to heed 2020-05-31 14:22:57 +02:00
Kerollmops a81f201fad
Introduce the use of RocksDB instead of sled (RAM) 2020-05-31 14:22:06 +02:00
Kerollmops 91ba938953
Initial commit 2020-05-31 14:22:06 +02:00