Reverse Index

The reverse index contains a mapping from word to document id.

There are two tiers of this index.

  • A priority index which only indexes terms that are flagged with priority flags [1].
  • A full index that indexes all terms.

The full index also provides access to term-level metadata, while the priority index is a binary index that only records which documents contain a specific word.

[1] See WordFlags in common/model and KeywordMetadata in features-convert/keyword-extraction.
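To make the difference between the two tiers concrete, here is a minimal in-memory sketch. It is not the project's implementation: the class `TwoTierReverseIndexSketch`, its method names, the use of `String` terms, and the single `long` of metadata per term occurrence are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical two-tier reverse index; names and types are illustrative only. */
class TwoTierReverseIndexSketch {
    // Priority tier: term -> sorted document ids (presence only, no metadata).
    private final Map<String, TreeSet<Long>> priority = new HashMap<>();
    // Full tier: term -> (document id -> term-level metadata).
    private final Map<String, TreeMap<Long, Long>> full = new HashMap<>();

    void add(String term, long docId, long metadata, boolean hasPriorityFlag) {
        // Every term goes into the full tier together with its metadata.
        full.computeIfAbsent(term, t -> new TreeMap<>()).put(docId, metadata);
        // Only flagged terms are also recorded in the priority tier.
        if (hasPriorityFlag) {
            priority.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Document ids from the priority tier, ascending. */
    List<Long> priorityDocs(String term) {
        return new ArrayList<>(priority.getOrDefault(term, new TreeSet<>()));
    }

    /** Document ids from the full tier, ascending. */
    List<Long> fullDocs(String term) {
        TreeMap<Long, Long> postings = full.get(term);
        return postings == null ? new ArrayList<>() : new ArrayList<>(postings.keySet());
    }

    /** Term-level metadata, which only the full tier can answer. */
    OptionalLong metadata(String term, long docId) {
        TreeMap<Long, Long> postings = full.get(term);
        if (postings == null) return OptionalLong.empty();
        Long value = postings.get(docId);
        return value == null ? OptionalLong.empty() : OptionalLong.of(value);
    }
}
```

A query for a term can consult `priorityDocs` first for a small, presence-only answer and fall back to `fullDocs` and `metadata` when it needs every document or the per-term details.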

Construction

The reverse index is constructed by first building a series of preindexes. Each preindex consists of a Segment and a Documents object. The Segment records which word identifiers are present and how many entries each has, and the Documents object records which documents each word can be found in.
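The split between the segment and the documents can be sketched as three parallel arrays. This is a simplified illustration under assumptions, not the actual on-disk format: the class `PreindexSketch`, its field names, and the `{wordId, docId}` input records are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical preindex: a segment (word ids + counts) and a documents array. */
class PreindexSketch {
    final long[] wordIds;   // segment: distinct word ids, ascending
    final int[] counts;     // segment: number of document entries per word id
    final long[] documents; // documents: doc ids grouped by word, in wordIds order

    /** Builds the preindex from records of the form {wordId, docId}. */
    PreindexSketch(long[][] records) {
        long[][] sorted = records.clone();
        // Sort by word id, then by document id, so each word's documents are contiguous.
        Arrays.sort(sorted, Comparator.comparingLong((long[] r) -> r[0])
                                      .thenComparingLong(r -> r[1]));

        int distinct = 0;
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0 || sorted[i][0] != sorted[i - 1][0]) distinct++;
        }

        wordIds = new long[distinct];
        counts = new int[distinct];
        documents = new long[sorted.length];

        int w = -1;
        for (int i = 0; i < sorted.length; i++) {
            if (i == 0 || sorted[i][0] != sorted[i - 1][0]) {
                wordIds[++w] = sorted[i][0]; // first record of a new word
            }
            counts[w]++;
            documents[i] = sorted[i][1];
        }
    }

    /** Returns the document ids recorded for a word, or an empty array if absent. */
    long[] documentsFor(long wordId) {
        int idx = Arrays.binarySearch(wordIds, wordId);
        if (idx < 0) return new long[0];
        int offset = 0;
        for (int i = 0; i < idx; i++) offset += counts[i]; // skip earlier words' entries
        return Arrays.copyOfRange(documents, offset, offset + counts[idx]);
    }
}
```

Summing the counts on every lookup is linear in the number of words; the sketch does that only to stay close to the segment-plus-documents description. A real lookup structure would store precomputed offsets or an index instead.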

Memory layout illustrations

The full set of index data would typically not fit in RAM, so the index journal is paged: each preindex is constructed small enough to fit in memory, and the preindexes are then merged. Merging sorted arrays is a very fast operation that does not require additional RAM.
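The merge step relies on the standard two-pointer merge of sorted sequences. The sketch below shows it on plain `long` arrays; the class and method names are hypothetical, and the actual merge operates on preindex files rather than heap arrays.

```java
/** Hypothetical illustration of merging two ascending sequences in one pass. */
class SortedMergeSketch {
    /** Merges two ascending arrays into one ascending array, keeping duplicates. */
    static long[] merge(long[] a, long[] b) {
        long[] out = new long[a.length + b.length];
        int i = 0, j = 0, k = 0;

        // Repeatedly take the smaller head element; each input is read exactly once.
        while (i < a.length && j < b.length) {
            out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
        }
        // One input is exhausted; copy whatever remains of the other.
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return out;
    }
}
```

Because both inputs are read front to back and the output is written front to back, the same loop works when the inputs and output are files or memory-mapped regions. In that setting only the two read cursors and the write cursor need to live in memory, which is why the merge does not need RAM proportional to the data size.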

Illustration of successively merged preindex files

Once merged into one large preindex, indexes are added to the preindex data to form a finalized reverse index.

Illustration of the data layout of the finalized index

Central Classes

See Also