Commit Graph

5 Commits

Author SHA1 Message Date
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1 Switch to new Slop format for crawl data storage and processing.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
162fc25ebc (minor) Fix accidental commit errors 2024-09-23 18:03:09 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00