mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-24 13:19:02 +00:00
Viktor Lofgren
Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a Katamari Damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
18 lines
842 B
Markdown
The processed-data package contains models and logic for
reading and writing parquet files with the output from the
[converting-process](../../processes/converting-process).

Main models:

* [DocumentRecord](java/nu/marginalia/model/processed/DocumentRecord.java)
* * [DocumentRecordKeywordsProjection](java/nu/marginalia/model/processed/DocumentRecordKeywordsProjection.java)
* * [DocumentRecordMetadataProjection](java/nu/marginalia/model/processed/DocumentRecordMetadataProjection.java)
* [DomainLinkRecord](java/nu/marginalia/model/processed/DomainLinkRecord.java)
* [DomainRecord](java/nu/marginalia/model/processed/DomainRecord.java)

Since parquet is a column-based format, some of the readable models are projections
that read only a subset of the columns in the input file, which avoids decoding
data the caller does not need.

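The idea behind a projection can be illustrated with a minimal, self-contained sketch. It does not use the actual parquet reader or the real record classes: `ColumnStore`, `UrlTitleProjection`, and the column names `url`, `title`, and `keywords` are hypothetical stand-ins chosen for illustration. The point is only that a projection asks the column store for the columns it needs and never touches the rest.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ColumnProjectionSketch {

    /** Hypothetical in-memory column store: column name -> one value per row. */
    static final class ColumnStore {
        private final Map<String, List<String>> columns;
        /** Records which columns were actually read, to make the effect visible. */
        final Set<String> columnsRead = new HashSet<>();

        ColumnStore(Map<String, List<String>> columns) {
            this.columns = columns;
        }

        List<String> readColumn(String name) {
            columnsRead.add(name);
            return columns.get(name);
        }
    }

    /** Hypothetical projection that carries only two of the stored columns. */
    static final class UrlTitleProjection {
        final String url;
        final String title;

        UrlTitleProjection(String url, String title) {
            this.url = url;
            this.title = title;
        }
    }

    /** Builds projection rows by reading only the "url" and "title" columns. */
    static List<UrlTitleProjection> readProjection(ColumnStore store) {
        List<String> urls = store.readColumn("url");
        List<String> titles = store.readColumn("title");

        List<UrlTitleProjection> rows = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            rows.add(new UrlTitleProjection(urls.get(i), titles.get(i)));
        }
        return rows;
    }

    /** Sample data with three columns and two rows (made up for this sketch). */
    static ColumnStore sampleStore() {
        return new ColumnStore(Map.of(
                "url", List.of("https://www.example.com/a", "https://www.example.com/b"),
                "title", List.of("A", "B"),
                "keywords", List.of("foo,bar", "baz")));
    }

    public static void main(String[] args) {
        ColumnStore store = sampleStore();
        for (UrlTitleProjection row : readProjection(store)) {
            System.out.println(row.url + " " + row.title);
        }
        // The "keywords" column was never requested by the projection.
        System.out.println("keywords read: " + store.columnsRead.contains("keywords"));
    }
}
```

In a row-oriented format the reader would have to step over the `keywords` data for every row to reach the next record; with a column layout the unused column can be skipped entirely, which is what makes separate projection models worthwhile.
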
## See Also

[third-party/parquet-floor](../../../third-party/parquet-floor)