MarginaliaSearch/code/libraries/coded-sequence
Viktor Lofgren aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
..
java/nu/marginalia/sequence (wip) Extract and encode spans data 2024-07-27 11:44:13 +02:00
test/nu/marginalia/sequence (index) Fix rare BitReader.takeWhileZero bug 2024-07-16 11:03:56 +02:00
build.gradle (gamma) Implement a small library for Elias gamma coding an integer sequence 2024-05-30 14:19:13 +02:00
readme.md (doc) Add readme.md for coded-sequence library 2024-06-24 14:28:51 +02:00

The coded-sequence library offers tools for encoding sequences of integers with a variable-length encoding.

The Elias Gamma code is supported: https://en.wikipedia.org/wiki/Elias_gamma_coding

The GammaCodedSequence class stores a sequence of ascending non-negative integers in a byte buffer. The encoding also stores the length of the sequence (as a gamma-coded value), which is used in decoding.

Sequences are encoded with the GammaCodedSequence.of()-method, and require a temporary buffer to work in.

// allocate a temporary buffer to work in, this is reused
// for all operations and will not hold the final result
ByteBuffer workArea = ByteBuffer.allocate(1024);

// create a new GammaCodedSequence with the given values
var gcs = GammaCodedSequence.of(workArea, 1, 3, 4, 7, 10);

The GammaCodedSequence class provides methods to query the sequence, iterate over the values, and access the underlying binary representation.

// query the sequence 
int valueCount = gcs.valueCount();
int bufferSize = gcs.bufferSize();

// iterate over the values
IntIterator iter = gcs.iterator();
IntList values = gcs.values();

// access the underlying data (e.g. for writing)
byte[] bytes = gcs.bytes();
ByteBuffer buffer = gcs.buffer();

The GammaCodedSequence class also provides methods to decode a sequence from a byte buffer or byte array.

// decode the data
var decodedGcs1 = new GammaCodedSequence(buffer);
var decodedGcs2 = new GammaCodedSequence(buffer, start, end);
var decodedGcs3 = new GammaCodedSequence(bytes);