Viktor Lofgren
0e4dd3d76d
(minor) Remove accidentally committed debug printf
2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb
(log) Prevent tests from trying to log to file
...
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9
(minor) Tidy code
2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c
(test) Add query parsing to IntegrationTest
2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181
(search-query) refac: Move query factory
2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57
(search-query) Fix end-inclusion bug in QWordGraphIterator
2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521
(search-query) Tidy up QueryGRPCService and IndexClient
2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480
(query) Tidy up code
2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94
(*) Trim the stopwords list
...
Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0
(index) Implement working optional TermCoherences
2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771
(index) Correct TermCoherence requirements
2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8
(gamma) Fix readCount() behavior in EGC
2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f
(index) Partial re-implementation of position constraints
2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616
(doc) Add readme.md for coded-sequence library
...
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
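For readers unfamiliar with the coding the readme commit above refers to, here is a minimal sketch of Elias gamma coding applied to a sorted list of positions. It is a general illustration and not the coded-sequence library's API: the function names, the use of a bit string instead of a packed buffer, and the choice to code gaps between positions are all assumptions made for this example.

```python
def gamma_encode(n: int) -> str:
    """Elias gamma code of a positive integer, as a string of '0'/'1'."""
    if n < 1:
        raise ValueError("Elias gamma can only encode integers >= 1")
    binary = bin(n)[2:]                       # binary digits, no leading zeros
    return "0" * (len(binary) - 1) + binary   # unary length prefix, then the value


def gamma_decode_all(bits: str) -> list[int]:
    """Decode a concatenation of Elias gamma codes."""
    values = []
    i = 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":                 # count the unary length prefix
            zeros += 1
            i += 1
        values.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return values


def encode_positions(positions: list[int]) -> str:
    """Hypothetical helper: gamma-code the gaps of a strictly increasing list (first value >= 1)."""
    previous = 0
    chunks = []
    for position in positions:
        chunks.append(gamma_encode(position - previous))
        previous = position
    return "".join(chunks)


def decode_positions(bits: str) -> list[int]:
    """Hypothetical helper: inverse of encode_positions, rebuilding positions from gaps."""
    total = 0
    positions = []
    for gap in gamma_decode_all(bits):
        total += gap
        positions.append(total)
    return positions
```

Small values get short codes (1 is a single bit), which is why gap coding pairs well with position lists whose entries tend to be close together.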
Viktor Lofgren
40bca93884
(gamma) Minor clean-up
2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443
(journal) Fixing journal encoding
...
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721
(gamma) Correctly decode zero-length sequences
2024-06-24 13:11:41 +02:00
Viktor Lofgren
23759a7243
(loader) Correctly clamp document size
2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b
(loader) Correctly load the positions column in the keyword projection
2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
...
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d
(index) Integrate positions file properly
2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f
(index) Fix non-compiling tests
2024-06-06 16:35:09 +02:00
Viktor Lofgren
4a8afa6b9f
(index, WIP) Position data partially integrated with forward and reverse indexes.
...
There's no graceful way of doing this in small commits; pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor Lofgren
9b922af075
(converter) Amend existing modifications to use gamma coded positions lists
...
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c
(gamma) Implement a small library for Elias gamma coding an integer sequence
2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9
(keywords) Add position information to keywords
2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68
(converter) Add position information to serialized document data
...
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
206a7ce6c1
Merge remote-tracking branch 'origin/master'
2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7
(qword) Fix tests that broke due to stopword removal
2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b
Update ROADMAP.md
2024-05-24 13:57:50 +02:00
Viktor Lofgren
6985ab762a
(query) Improve handling of stopwords in queries
2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b
(search) Update the no result text to request bug reports.
2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f
(query) Bugfix stopword issue
...
Add a new rule that creates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
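The rule described in the commit above can be illustrated with a simplified sketch. It assumes a flat list of query words and a hypothetical two-word stopword set; the actual change operates on the project's own query structures, which are not shown here, so the names below are illustrative only.

```python
from itertools import product

# Assumed stopword set, for illustration only
STOPWORDS = {"a", "the"}


def query_variants(words: list[str]) -> list[list[str]]:
    """Return the original query plus alternatives that omit stopwords."""
    # Each stopword may be kept or dropped; every other word is always kept.
    options = [[(w,), ()] if w in STOPWORDS else [(w,)] for w in words]
    variants = []
    for choice in product(*options):
        variant = [w for part in choice for w in part]
        # Skip the empty query and any duplicates
        if variant and variant not in variants:
            variants.append(variant)
    return variants
```

With this sketch, a query such as `["the", "cat"]` yields both `["the", "cat"]` and `["cat"]`, so a match no longer depends on the stopword being present in the index.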
Viktor Lofgren
f83f777fff
(converter) Experimental support for searching by URL
...
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab
(registry) Fix broken test
2024-05-23 14:15:01 +02:00
Viktor Lofgren
59ec70eb73
(*) Clean up code related to crawl parquet inspection
2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b
(control) Improve pagination for crawl data inspector
2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee
(control) Improve pagination for crawl data inspector
2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4
(control) Add filter functionality for crawl data inspector
2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c
(control) Clean up UX for crawl data inspector
2024-05-21 18:27:24 +02:00
Viktor Lofgren
24bf29d369
(*) Upgrade opennlp and deprecate the monkey-patched version of the code, as it's no longer needed
2024-05-20 18:03:21 +02:00
Viktor Lofgren
17dc00d05f
(control) Partial implementation of inspection utility for crawl data
...
Uses duckdb and range queries to read the parquet files directly from the index partitions.
UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
4fcd4a8197
(index) Refactor to reduce the level of indirection
2024-05-19 12:40:33 +02:00
Viktor Lofgren
daf2a8df54
(btree) Roll back optimization of queryDataWithIndex
...
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
The reason retain() is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.
2024-05-19 11:29:28 +02:00
Viktor Lofgren
88997a1c4f
(btree) Clean up code
2024-05-18 18:38:46 +02:00
Viktor Lofgren
d12c77305c
(btree) Clean up code
2024-05-18 18:03:17 +02:00
Viktor Lofgren
ab4e2b222e
(array) Fix broken benchmarks
2024-05-18 13:41:24 +02:00
Viktor Lofgren
b867eadbef
(big-string) Remove the unused bigstring library
2024-05-18 13:40:03 +02:00