Commit Graph

48 Commits

Author SHA1 Message Date
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
2b8ab97ec1 (bit-writer) Do not clear buffer when creating a bit writer 2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12 (sequence) Return Integer.MAX_VALUE for empty position lists.
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and address edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
Viktor Lofgren
edb42836da (vcs) Fix shared state issues with VarintCodedSequence's iterators.
Also cleans up the code a bit.
2024-09-21 16:09:15 +02:00
Viktor Lofgren
1ff88ff0bc (vcs) Stopgap fix for quoted queries with the same term appearinc multiple times
There are reentrance issues with VarintCodedSequence, this hides the symptom but these need to be corrected properly.
2024-09-21 14:07:59 +02:00
Viktor Lofgren
35f49bbb60 (coded-sequence) Add equals and hashCode to VCS 2024-09-10 10:33:56 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
f8684118f3 (slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
0b31c4cfbb (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
Viktor Lofgren
ae87e41cec (index) Fix rare BitReader.takeWhileZero bug
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer.  This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.

The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.

Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
179a6002c2 (coded-sequence) Add a callback for re-filling underlying buffer 2024-07-12 23:50:28 +02:00
Viktor Lofgren
31881874a9 (coded-sequence) Correct indicator of next-value
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamam code.  This is incorrect in this case, as we're able to provide a negative offset.  Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449 (index-reverse) Added compression to priority index
The priority index documents file can be trivially compressed to a large degree.

Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d (coded-sequence) Correct implementation of Elias gamma
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
2024-07-10 14:28:28 +02:00
Viktor Lofgren
ecfe17521a (coded-sequence) Correct implementation of Elias gamma
The implementation was incorrectly using 1 bit more than it should.  The change also adds a put method for Elias delta; and cleans up the interface a bit.
2024-07-09 17:28:21 +02:00
Viktor Lofgren
b805f6daa8 (gamma) Fix readCount() behavior in EGC 2024-06-25 22:17:54 +02:00
Viktor Lofgren
9d00243d7f (index) Partial re-implementation of position constraints 2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616 (doc) Add readme.md for coded-sequence library
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
Viktor Lofgren
40bca93884 (gamma) Minor clean-up 2024-06-24 13:56:43 +02:00
Viktor Lofgren
fff2ce5721 (gamma) Correctly decode zero-length sequences 2024-06-24 13:11:41 +02:00
Viktor Lofgren
36160988e2 (index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.

The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
4a8afa6b9f (index, WIP) Position data partially integrated with forward and reverse indexes.
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor Lofgren
0112ae725c (gamma) Implement a small library for Elias gamma coding an integer sequence 2024-05-30 14:19:13 +02:00