MarginaliaSearch/code/processes/converting-process/model/readme.md
Viktor Lofgren aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00

The processed-data package contains models and logic for
reading and writing parquet files with the output from the
[converting-process](../../processes/converting-process).

Main models:

* [DocumentRecord](java/nu/marginalia/model/processed/DocumentRecord.java)
  * [DocumentRecordKeywordsProjection](java/nu/marginalia/model/processed/DocumentRecordKeywordsProjection.java)
  * [DocumentRecordMetadataProjection](java/nu/marginalia/model/processed/DocumentRecordMetadataProjection.java)
* [DomainLinkRecord](java/nu/marginalia/model/processed/DomainLinkRecord.java)
* [DomainRecord](java/nu/marginalia/model/processed/DomainRecord.java)

Since parquet is a column-based format, some of the readable models are projections
that only read parts of the input file.
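
The idea behind a projection can be illustrated with a minimal sketch. It does not use parquet-floor or the actual record classes: `ColumnStore`, `KeywordsProjection`, `readKeywords`, and the column names `url`, `title`, and `keywords` are all hypothetical stand-ins, chosen only to show that a reader built on column-wise storage can materialize a narrow model without touching the other columns.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy illustration of column projection; NOT the parquet-floor API. */
final class ColumnProjectionSketch {

    /** Hypothetical in-memory columnar store: one list of values per column. */
    static final class ColumnStore {
        private final Map<String, List<String>> columns = new LinkedHashMap<>();
        private final Set<String> columnsRead = new HashSet<>();

        /** Appends one row; every row is assumed to carry the same columns. */
        void addRow(Map<String, String> row) {
            row.forEach((name, value) ->
                    columns.computeIfAbsent(name, k -> new ArrayList<>()).add(value));
        }

        /** Returns a single column and records that it was accessed. */
        List<String> readColumn(String name) {
            columnsRead.add(name);
            return columns.get(name);
        }

        Set<String> columnsRead() {
            return columnsRead;
        }
    }

    /** Hypothetical projection holding only the fields a keyword consumer needs. */
    record KeywordsProjection(String url, String keywords) {}

    /** Builds projections from two columns; the "title" column is never read. */
    static List<KeywordsProjection> readKeywords(ColumnStore store) {
        List<String> urls = store.readColumn("url");
        List<String> keywords = store.readColumn("keywords");

        List<KeywordsProjection> result = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            result.add(new KeywordsProjection(urls.get(i), keywords.get(i)));
        }
        return result;
    }

    /** Small sample data set with three columns per row. */
    static ColumnStore sampleStore() {
        ColumnStore store = new ColumnStore();
        store.addRow(Map.of("url", "https://example.com/a", "title", "A", "keywords", "foo,bar"));
        store.addRow(Map.of("url", "https://example.com/b", "title", "B", "keywords", "baz"));
        return store;
    }
}
```

Calling `ColumnProjectionSketch.readKeywords(ColumnProjectionSketch.sampleStore())` yields one `KeywordsProjection` per row, and `columnsRead()` on the store afterwards contains only `url` and `keywords`. In a real columnar file the skipped columns also do not need to be decoded from disk, which is the motivation for having separate projection models.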

## See Also

[third-party/parquet-floor](../../../third-party/parquet-floor)