MarginaliaSearch/code/processes/index-constructor-process
Viktor Lofgren aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00

The index construction process is responsible for creating the indexes used by the search engine.

There are three types of indexes:

  • The forward index, which maps documents to words.
  • The full reverse index, which maps words to documents and includes all words.
  • The priority reverse index, which maps words to documents but includes only the most "important" words (such as those appearing in the title, or with especially high TF-IDF scores).
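
To make the distinction between the three index types concrete, here is a minimal in-memory sketch. It is not how the search engine stores its indexes; the names `ToyIndexes` and `Doc` are hypothetical, and "important" is approximated as "appears in the title" only, leaving out TF-IDF scoring entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ToyIndexes {
    /** A toy document (hypothetical shape): an id, its title words and its body words. */
    record Doc(int id, List<String> title, List<String> body) {}

    /** Forward index: document id -> every word in that document. */
    static Map<Integer, List<String>> forward(List<Doc> docs) {
        Map<Integer, List<String>> index = new TreeMap<>();
        for (Doc doc : docs) {
            List<String> words = new ArrayList<>(doc.title());
            words.addAll(doc.body());
            index.put(doc.id(), words);
        }
        return index;
    }

    /** Full reverse index: word -> ids of all documents containing that word. */
    static Map<String, List<Integer>> fullReverse(List<Doc> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        forward(docs).forEach((id, words) -> {
            for (String word : words) {
                List<Integer> ids = index.computeIfAbsent(word, w -> new ArrayList<>());
                if (!ids.contains(id)) ids.add(id); // record each document once per word
            }
        });
        return index;
    }

    /** Priority reverse index: word -> ids, restricted here to title words (an assumption). */
    static Map<String, List<Integer>> priorityReverse(List<Doc> docs) {
        Map<String, List<Integer>> index = new TreeMap<>();
        for (Doc doc : docs) {
            for (String word : doc.title()) {
                List<Integer> ids = index.computeIfAbsent(word, w -> new ArrayList<>());
                if (!ids.contains(doc.id())) ids.add(doc.id());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(
                new Doc(1, List.of("search", "engine"), List.of("index", "words")),
                new Doc(2, List.of("index"), List.of("search", "documents")));
        System.out.println("forward:  " + forward(docs));
        System.out.println("full:     " + fullReverse(docs));
        System.out.println("priority: " + priorityReverse(docs));
    }
}
```

In this toy data the word "search" appears in both documents, so the full reverse index lists both ids for it, while the priority reverse index lists only document 1, where it is a title word.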

This is a very lightweight module that delegates the actual work to the modules that implement each index. Their respective readme files contain more information about the indexes themselves and how they are constructed.

The process is glued together within IndexConstructorMain, which is the only class of interest in this module.