
Crawling Models
Contains crawl data models shared by the crawling-process and converting-process.
To ensure backward compatibility with older versions of the data, the serialization is abstracted away from the model classes.
The new way of serializing the data is to use parquet files.
The old way was to use zstd-compressed JSON. It is still supported for now, but the parquet format is preferred: it is not only more compact, but also significantly faster to read and much more portable. The JSON support will be removed in the future.
Central Classes
Serialization
The serialization classes automatically negotiate the serialization format based on the file extension.
Data is accessed through a SerializableCrawlDataStream, which is a somewhat enhanced Iterator that can be used to read data.
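To make this concrete, the sketch below iterates over a crawl data file. It is a minimal illustration only: the CrawlDataStreams.openDataStream factory is hypothetical, and the hasNext()/next() methods are assumed from the Iterator-like description above.

```java
import java.nio.file.Path;

// Minimal sketch of consuming crawl data through a SerializableCrawlDataStream.
// CrawlDataStreams.openDataStream is a hypothetical factory standing in for
// whatever negotiates the on-disk format (parquet vs. zstd-JSON) from the
// file extension.
class ReadCrawlDataSketch {
    public static void main(String[] args) throws Exception {
        Path file = Path.of("crawl-data.parquet"); // works the same for legacy zstd JSON

        try (var stream = CrawlDataStreams.openDataStream(file)) { // hypothetical factory
            // The same Iterator-style loop works regardless of the on-disk format
            while (stream.hasNext()) {
                var data = stream.next();
                System.out.println(data);
            }
        }
    }
}
```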
Parquet Serialization
The parquet serialization is done using the CrawledDocumentParquetRecordFileReader and CrawledDocumentParquetRecordFileWriter classes, which read and write parquet files respectively.
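As a rough illustration, a write-then-read round trip might look like the sketch below. The write() method and stream() accessor are assumptions based on the class names, not the verified API.

```java
import java.nio.file.Path;
import java.util.List;

// Hedged sketch of a parquet round trip. The constructor and the
// write()/stream() signatures are assumed for illustration.
class ParquetRoundTripSketch {
    void writeAndRead(Path file, List<CrawledDocumentParquetRecord> records) throws Exception {
        // Write all records to a single parquet file
        try (var writer = new CrawledDocumentParquetRecordFileWriter(file)) {
            for (var record : records) {
                writer.write(record);
            }
        }

        // Read them back; stream() is a hypothetical static accessor
        try (var recordStream = CrawledDocumentParquetRecordFileReader.stream(file)) {
            recordStream.forEach(System.out::println);
        }
    }
}
```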
The model classes are serialized to parquet using the CrawledDocumentParquetRecord class.
The record has the following fields:
- domain - The domain of the document
- url - The URL of the document
- ip - The IP address of the document
- cookies - Whether the document has cookies
- httpStatus - The HTTP status code of the document
- timestamp - The timestamp of the document
- contentType - The content type of the document
- body - The body of the document
- etagHeader - The ETag header of the document
- lastModifiedHeader - The Last-Modified header of the document
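For orientation, the schema corresponds roughly to the following Java record. The field types are inferred from the field names above and may differ from the actual class, which could just as well be a mutable POJO.

```java
import java.time.Instant;

// Illustrative shape of the parquet record. Field types are assumptions
// inferred from the field names above, not the actual declarations.
record CrawledDocumentRecordSketch(
        String domain,             // the domain of the document
        String url,                // the URL of the document
        String ip,                 // the IP address of the document
        boolean cookies,           // whether the document has cookies
        int httpStatus,            // the HTTP status code of the document
        Instant timestamp,         // the timestamp of the document
        String contentType,        // the content type of the document
        byte[] body,               // the body of the document
        String etagHeader,         // the ETag header of the document
        String lastModifiedHeader  // the Last-Modified header of the document
) { }
```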
The easiest way to interact with parquet files is to use DuckDB, which lets you run SQL queries on parquet files (and almost anything else).
e.g.
```sql
select httpStatus, count(*) as cnt
from 'my-file.parquet'
group by httpStatus;
```

```
┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘
```