Viktor Lofgren
f090f0101b
(index-construction) Gather up preindex writes
...
Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
9881cac2da
(index-reader) Correctly handle negative offset values
...
When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.
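A minimal sketch of the guard described above, not the actual reader code. The class name `WordLookupSketch`, the sorted `wordIds` table, and the "count followed by doc ids" layout of `docsFile` are all assumptions made for illustration; only the name `wordOffset` comes from the commit message.

```java
import java.util.Arrays;

final class WordLookupSketch {
    private final long[] wordIds;  // sorted term ids (hypothetical layout)
    private final long[] offsets;  // offsets[i] is the start of wordIds[i]'s entry

    WordLookupSketch(long[] wordIds, long[] offsets) {
        this.wordIds = wordIds;
        this.offsets = offsets;
    }

    /** Returns the offset of the word's entry, or a negative value if the word is absent. */
    long wordOffset(long wordId) {
        int idx = Arrays.binarySearch(wordIds, wordId);
        return idx < 0 ? -1 : offsets[idx];
    }

    /** Reads the doc ids for a word; aborts early with an empty result on a negative offset. */
    long[] documents(long wordId, long[] docsFile) {
        long offset = wordOffset(wordId);
        if (offset < 0) {
            return new long[0]; // word not in the index: do not touch docsFile at all
        }
        int start = Math.toIntExact(offset);
        int count = Math.toIntExact(docsFile[start]); // assumed: entry begins with its size
        return Arrays.copyOfRange(docsFile, start + 1, start + 1 + count);
    }
}
```

Without the early return, a negative offset would be used as an array position and either throw or read unrelated data.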
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449
(index-reverse) Added compression to priority index
...
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
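One possible reading of the schema, sketched as the choice of 2-bit tag for each document relative to the previous one. This is an interpretation, not the committed encoder: the `DocId` record, its field names, and the rule that the first entry gets the `11b` header form are assumptions, and the sketch stops at picking the tag without writing any gamma or delta codes.

```java
final class PrioTagSketch {
    /** Hypothetical decomposition of a document id into the parts the schema mentions. */
    record DocId(int rank, int domainId, int docOrd) {}

    /** Picks the 2-bit tag for cur, given the previously written id (null for the first entry). */
    static int tagFor(DocId prev, DocId cur) {
        if (prev == null) {
            return 0b11; // assumed: size header followed by one raw doc id
        }
        if (cur.rank() != prev.rank()) {
            return 0b10; // rank changed: new rank, then domainid and docord raw
        }
        if (cur.domainId() != prev.domainId()) {
            return 0b01; // same rank, new domain: domainid difference plus docord
        }
        return 0b00;     // same rank and domain: only the docord difference
    }
}
```

Under this reading, long runs of documents from the same domain cost only the two tag bits plus a short docord difference each, which would explain why the file compresses so well.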
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d
(coded-sequence) Correct implementation of Elias gamma
...
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
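For reference, a minimal sketch of the textbook Elias gamma code: a positive integer n is written as floor(log2 n) zero bits followed by the binary form of n. It works on strings of '0' and '1' for readability and is not the project's EliasGammaCodec; the class name `EliasGammaSketch` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class EliasGammaSketch {
    /** Encodes a positive integer as a string of '0'/'1' characters. */
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias gamma only encodes positive integers");
        }
        String binary = Integer.toBinaryString(n);       // always starts with '1'
        return "0".repeat(binary.length() - 1) + binary; // zero prefix gives the length
    }

    /** Decodes a concatenation of gamma codes; assumes the input is well formed. */
    static int[] decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') { // count the zero prefix
                zeros++;
                pos++;
            }
            // the next zeros + 1 bits are the value itself, leading '1' included
            values.add(Integer.parseInt(bits.substring(pos, pos + zeros + 1), 2));
            pos += zeros + 1;
        }
        return values.stream().mapToInt(Integer::intValue).toArray();
    }
}
```

Zero cannot be represented, so callers that need it typically encode value + 1.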
2024-07-10 14:28:28 +02:00
Viktor Lofgren
0d29e2a39d
(index-reverse) Entry Sources reset() their LongQueryBuffer
...
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad results.
2024-07-09 01:39:40 +02:00
Viktor Lofgren
d90bd340bb
(index-reverse) Removing btree indexes from prio documents file
...
The btree index adds overhead and disk space usage, and serves no function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
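A rough sketch of what a size-prefixed block could look like once the btree is gone. It assumes the size is an entry count stored as one 64-bit value ahead of the doc ids; the commit does not say whether the real format counts entries or bytes, and `SizePrefixedSketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;

final class SizePrefixedSketch {
    /** Copies the doc ids into a new buffer, prepending their count. */
    static ByteBuffer write(long[] docIds) {
        ByteBuffer buf = ByteBuffer.allocate(Long.BYTES * (docIds.length + 1));
        buf.putLong(docIds.length); // assumed: size is the number of entries
        for (long id : docIds) {
            buf.putLong(id);
        }
        buf.flip(); // make the written data readable from position 0
        return buf;
    }

    /** Reads the count, then exactly that many doc ids. */
    static long[] read(ByteBuffer buf) {
        int count = Math.toIntExact(buf.getLong());
        long[] docIds = new long[count];
        for (int i = 0; i < count; i++) {
            docIds[i] = buf.getLong();
        }
        return docIds;
    }
}
```

The reader needs nothing but the starting offset of a block, since the prefix tells it where the block ends.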
2024-07-08 17:20:17 +02:00
Viktor Lofgren
21afe94096
(index-reverse) Don't use 128 bit merge function for prio index
2024-07-07 21:36:10 +02:00
Viktor Lofgren
fa36689597
(index-reverse) Simplify priority index
...
* Do not emit a documents file
* Do not interlace metadata or offsets with doc ids
2024-07-06 18:04:08 +02:00
Viktor Lofgren
85c99ae808
(index-reverse) Split index construction into separate packages for full and priority index
2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce
(minor) Fix non-compiling test due to previous refactor
2024-07-06 15:11:43 +02:00
Viktor Lofgren
d023e399d2
(index) Remove unnecessary allocations in journal reader
...
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.
Replacing with a fixed pointer alias that can be repositioned to the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step and moved to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00
Viktor Lofgren
0e4dd3d76d
(minor) Remove accidentally committed debug printf
2024-06-27 13:40:53 +02:00
Viktor Lofgren
975b8ae2e9
(minor) Tidy code
2024-06-27 13:15:31 +02:00
Viktor Lofgren
3faa5bf521
(search-query) Tidy up QueryGRPCService and IndexClient
2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480
(query) Tidy up code
2024-06-26 13:40:06 +02:00
Viktor Lofgren
95b9af92a0
(index) Implement working optional TermCoherences
2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771
(index) Correct TermCoherence requirements
2024-06-25 22:18:10 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f
(index) Partial re-implementation of position constraints
2024-06-24 15:55:54 +02:00
Viktor Lofgren
b798f28443
(journal) Fixing journal encoding
...
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
23759a7243
(loader) Correctly clamp document size
2024-06-10 18:29:14 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
...
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d
(index) Integrate positions file properly
2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f
(index) Fix non-compiling tests
2024-06-06 16:35:09 +02:00
Viktor Lofgren
4a8afa6b9f
(index, WIP) Position data partially integrated with forward and reverse indexes.
...
There's no graceful way of doing this in small commits; pushing now to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava-dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
4fcd4a8197
(index) Refactor to reduce the level of indirection
2024-05-19 12:40:33 +02:00
Viktor Lofgren
19163fa883
(array) Clean up the Array library
...
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
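To illustrate what an sz=2 specialization might look like, here is a sketch of insertion sort over a plain `long[]` holding (key, value) pairs, with the stride of 2 hard-coded rather than passed as a parameter. It is not the LongArray implementation; `InsertionSort2Sketch` and the pair layout are assumptions.

```java
final class InsertionSort2Sketch {
    /** Sorts pairs stored as [key0, val0, key1, val1, ...] by key, in place. */
    static void sortPairs(long[] a) {
        if (a.length % 2 != 0) {
            throw new IllegalArgumentException("array length must be a multiple of 2");
        }
        for (int i = 2; i < a.length; i += 2) {
            long key = a[i];
            long val = a[i + 1];
            int j = i - 2;
            // shift larger pairs one slot (two longs) to the right
            while (j >= 0 && a[j] > key) {
                a[j + 2] = a[j];
                a[j + 3] = a[j + 1];
                j -= 2;
            }
            a[j + 2] = key;
            a[j + 3] = val;
        }
    }
}
```

With the stride fixed at compile time, each pair move is two plain assignments instead of an inner loop over a variable record size, which is the kind of shape the JIT tends to handle well.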
2024-05-18 13:23:06 +02:00
Viktor
2d49071e96
Merge branch 'master' into run-outside-docker
2024-04-25 18:53:26 +02:00
Viktor Lofgren
e4b34b6ee6
(index) Correctly detect the presence of an all-virtual path through the query
2024-04-25 14:01:46 +02:00
Viktor Lofgren
32fe864a33
(build) Java 22 and its consequences has been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f46733a47a
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15
(ranking) Set regularMask correctly
2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528
(ranking) Cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f623b37577
(ranking) Suppress NaN:s in ranking output
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f4a2fea451
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a748fc5448
(index, bugfix) Pass url quality to query service
2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44b33798f3
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
2f0b648fad
(index) Add jaccard index term to boost results based on term overlap
2024-04-24 14:44:39 +02:00
Viktor Lofgren
de0e56f027
(index) Remove position overlap check, coherences will do the work instead
2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13
(index) Omit absent terms from coherence checks
2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4
(index) Improve recall for small queries
...
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9
(index) Remove dead code
...
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026
(index) Experimental performance regression fix
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1bb88968c5
(test) Fix broken test
2024-04-24 14:44:39 +02:00