mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-24 13:19:02 +00:00
The coded-sequence library offers tools for encoding sequences of integers with a variable-length encoding.
The Elias Gamma code is supported: https://en.wikipedia.org/wiki/Elias_gamma_coding
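As a rough illustration of the code itself (not of this library's implementation), the sketch below gamma-codes a single positive integer into a string of '0' and '1' characters. The class name EliasGammaSketch and both of its methods are hypothetical and exist only for this example. It follows the textbook definition: write the binary form of n, preceded by one zero for every bit after the first.

```java
final class EliasGammaSketch {
    /** Encodes n >= 1 as a string of '0'/'1' characters (textbook Elias gamma). */
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias gamma only encodes integers >= 1");
        }
        String binary = Integer.toBinaryString(n);
        // one leading zero per binary digit after the first, then the digits
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Decodes a single gamma-coded value from the start of the bit string. */
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') {
            zeros++;
        }
        // the value occupies the next zeros + 1 bits
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 3, 4, 10}) {
            System.out.println(n + " -> " + encode(n));
        }
    }
}
```

Small values get short codes (1 takes a single bit, 10 takes seven), which is why the code suits sequences whose entries, or the gaps between them, are mostly small.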
The GammaCodedSequence class stores a sequence of ascending
non-negative integers in a byte buffer. The encoding also
stores the length of the sequence (as a gamma-coded value),
which is used in decoding.
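One plausible way to lay out such a sequence is sketched below. It is an assumption made for illustration, not the library's actual bit layout. The hypothetical AscendingSequenceSketch class assumes strictly ascending values and writes the element count plus one (so that an empty sequence is representable), then the gap between each value and its predecessor, with the predecessor of the first value taken to be -1. Every coded number is then at least 1, which the gamma code requires.

```java
import java.util.ArrayList;
import java.util.List;

final class AscendingSequenceSketch {
    /** Appends the textbook Elias gamma code of n >= 1 as '0'/'1' characters. */
    static void appendGamma(StringBuilder out, int n) {
        String binary = Integer.toBinaryString(n);
        out.append("0".repeat(binary.length() - 1)).append(binary);
    }

    /** Reads one gamma-coded value starting at pos[0] and advances pos[0] past it. */
    static int readGamma(String bits, int[] pos) {
        int zeros = 0;
        while (bits.charAt(pos[0]) == '0') {
            zeros++;
            pos[0]++;
        }
        int value = Integer.parseInt(bits.substring(pos[0], pos[0] + zeros + 1), 2);
        pos[0] += zeros + 1;
        return value;
    }

    /** Encodes strictly ascending, non-negative values (assumed layout, see lead-in). */
    static String encode(int... values) {
        StringBuilder out = new StringBuilder();
        appendGamma(out, values.length + 1); // count + 1, so zero elements is codable
        int previous = -1;
        for (int v : values) {
            if (v <= previous) {
                throw new IllegalArgumentException("values must be non-negative and strictly ascending");
            }
            // gap to the previous value; overflow near Integer.MAX_VALUE is ignored in this sketch
            appendGamma(out, v - previous);
            previous = v;
        }
        return out.toString();
    }

    /** Decodes a bit string produced by encode(). */
    static List<Integer> decode(String bits) {
        int[] pos = {0};
        int count = readGamma(bits, pos) - 1;
        List<Integer> values = new ArrayList<>();
        int previous = -1;
        for (int i = 0; i < count; i++) {
            previous += readGamma(bits, pos);
            values.add(previous);
        }
        return values;
    }

    public static void main(String[] args) {
        String bits = encode(1, 3, 4, 7, 10);
        System.out.println(bits.length() + " bits: " + bits);
        System.out.println(decode(bits));
    }
}
```

Because the count comes first, a decoder knows how many gaps to read without needing a terminator, which matches the role the stored length plays in the description above.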
Sequences are encoded with the GammaCodedSequence.of() method,
which requires a temporary buffer to work in.
// allocate a temporary buffer to work in; it can be reused
// across operations and will not hold the final result
ByteBuffer workArea = ByteBuffer.allocate(1024);
// create a new GammaCodedSequence with the given values
var gcs = GammaCodedSequence.of(workArea, 1, 3, 4, 7, 10);
The GammaCodedSequence class provides methods to query the
sequence, iterate over the values, and access the underlying
binary representation.
// query the sequence
int valueCount = gcs.valueCount();
int bufferSize = gcs.bufferSize();
// iterate over the values
IntIterator iter = gcs.iterator();
IntList values = gcs.values();
// access the underlying data (e.g. for writing)
byte[] bytes = gcs.bytes();
ByteBuffer buffer = gcs.buffer();
The GammaCodedSequence class also provides methods to decode
a sequence from a byte buffer or byte array.
// decode the data
var decodedGcs1 = new GammaCodedSequence(buffer);
var decodedGcs2 = new GammaCodedSequence(buffer, start, end);
var decodedGcs3 = new GammaCodedSequence(bytes);