MarginaliaSearch/code/libraries/coded-sequence

The coded-sequence library offers tools for encoding sequences of integers with a variable-length encoding.

The Elias Gamma code is supported: https://en.wikipedia.org/wiki/Elias_gamma_coding
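For orientation, here is a minimal sketch of the Elias gamma code itself, written against strings of '0' and '1' characters for readability. The class and method names are hypothetical, and this is not the library's implementation, which packs bits into a byte buffer. A positive integer whose binary form has N bits is written as N-1 zeros followed by those N bits.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical names, bit strings instead of packed bytes.
class EliasGammaSketch {

    /** Returns the Elias gamma code of a positive integer as a string of '0'/'1'. */
    static String encode(int value) {
        if (value < 1) {
            throw new IllegalArgumentException("the gamma code is only defined for values >= 1");
        }
        String binary = Integer.toBinaryString(value);
        // N-1 zeros announce the length, then the N-bit binary form follows
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Decodes a concatenation of gamma codes back into the original values. */
    static List<Integer> decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') {
                zeros++;
                pos++;
            }
            // the next zeros+1 bits are the value in plain binary
            values.add(Integer.parseInt(bits.substring(pos, pos + zeros + 1), 2));
            pos += zeros + 1;
        }
        return values;
    }

    public static void main(String[] args) {
        StringBuilder bits = new StringBuilder();
        for (int v : new int[] {1, 3, 4, 7, 10}) {
            bits.append(encode(v));
        }
        System.out.println(bits);
        System.out.println(decodeAll(bits.toString()));
    }
}
```

Small values get short codes: 1 is encoded as `1`, 3 as `011`, and 10 as `0001010`. Because each code announces its own length, codes can be concatenated without separators.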

The GammaCodedSequence class stores a sequence of ascending non-negative integers in a byte buffer. The encoding also stores the length of the sequence (as a gamma-coded value), which is used in decoding.
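To illustrate why ascending input suits this kind of code, the sketch below writes a gamma-coded length prefix followed by the gamma-coded gaps between consecutive values. This is an assumption about how such a scheme can be built, not a description of GammaCodedSequence's actual bit layout: the class name, the `+1` offset on the count, the starting value of `-1`, and the requirement that values be strictly ascending are all choices made for this example so that every number handed to the gamma coder is at least 1.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical layout, not the library's on-disk format.
class GapCodingSketch {

    /** Elias gamma code of a positive integer as a string of '0'/'1'. */
    static String gamma(int value) {
        if (value < 1) {
            throw new IllegalArgumentException("the gamma code is only defined for values >= 1");
        }
        String binary = Integer.toBinaryString(value);
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Encodes the count (offset by one), then the gap from each value to its predecessor. */
    static String encode(int... ascending) {
        StringBuilder bits = new StringBuilder(gamma(ascending.length + 1));
        int previous = -1; // makes the first gap value + 1, so a leading 0 is encodable
        for (int v : ascending) {
            if (v <= previous) {
                throw new IllegalArgumentException("values must be non-negative and strictly ascending");
            }
            bits.append(gamma(v - previous));
            previous = v;
        }
        return bits.toString();
    }

    /** Reads the count first, then rebuilds the values by summing the gaps. */
    static List<Integer> decode(String bits) {
        int[] pos = {0};
        int count = readGamma(bits, pos) - 1;
        List<Integer> values = new ArrayList<>(count);
        int previous = -1;
        for (int i = 0; i < count; i++) {
            previous += readGamma(bits, pos);
            values.add(previous);
        }
        return values;
    }

    /** Reads one gamma code starting at pos[0] and advances pos[0] past it. */
    static int readGamma(String bits, int[] pos) {
        int zeros = 0;
        while (bits.charAt(pos[0]) == '0') {
            zeros++;
            pos[0]++;
        }
        int value = Integer.parseInt(bits.substring(pos[0], pos[0] + zeros + 1), 2);
        pos[0] += zeros + 1;
        return value;
    }

    public static void main(String[] args) {
        String bits = encode(1, 3, 4, 7, 10);
        System.out.println(bits.length() + " bits: " + bits);
        System.out.println(decode(bits));
    }
}
```

Storing the count up front lets the decoder stop at the right place even when the final byte of a packed buffer contains padding bits, which is one plausible reason for a length prefix of the kind described above.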

Sequences are encoded with the GammaCodedSequence.of() method, which requires a temporary buffer to work in.

// allocate a temporary buffer to work in; it can be reused
// across operations and does not hold the final result
ByteBuffer workArea = ByteBuffer.allocate(1024);

// create a new GammaCodedSequence with the given values
var gcs = GammaCodedSequence.of(workArea, 1, 3, 4, 7, 10);

The GammaCodedSequence class provides methods to query the sequence, iterate over the values, and access the underlying binary representation.

// query the sequence
int valueCount = gcs.valueCount();
int bufferSize = gcs.bufferSize();

// iterate over the values
IntIterator iter = gcs.iterator();
IntList values = gcs.values();

// access the underlying data (e.g. for writing)
byte[] bytes = gcs.bytes();
ByteBuffer buffer = gcs.buffer();

A GammaCodedSequence can also be constructed directly from a byte buffer or byte array in order to decode previously encoded data.

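// decode the data (start and end are not defined in this snippet;
// presumably they bound the encoded data within the buffer)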
var decodedGcs1 = new GammaCodedSequence(buffer);
var decodedGcs2 = new GammaCodedSequence(buffer, start, end);
var decodedGcs3 = new GammaCodedSequence(bytes);