The parquet serialization is done using the [CrawledDocumentParquetRecordFileReader](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileReader.java)
and [CrawledDocumentParquetRecordFileWriter](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileWriter.java) classes,
which read and write parquet files respectively.

The model classes are serialized to parquet using the [CrawledDocumentParquetRecord](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecord.java) class.

The record has the following fields:

* `domain` - The domain of the document
* `url` - The URL of the document
* `ip` - The IP address of the document
* `cookies` - Whether the document has cookies
* `httpStatus` - The HTTP status code of the document
* `timestamp` - The timestamp of the document
* `contentType` - The content type of the document
* `body` - The body of the document
* `etagHeader` - The ETag header of the document
* `lastModifiedHeader` - The Last-Modified header of the document

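
To make the shape of the field list above concrete, here is a minimal sketch of a plain Java record with the same field names. `CrawledDocumentFields` and `isSuccess()` are hypothetical names used only for illustration, and every Java type below is an assumption; consult `CrawledDocumentParquetRecord` itself for the real definitions.

```java
import java.time.Instant;

/**
 * Hypothetical stand-in that mirrors the documented field names.
 * This is NOT the real CrawledDocumentParquetRecord; all types are assumptions.
 */
record CrawledDocumentFields(
        String domain,
        String url,
        String ip,
        boolean cookies,
        int httpStatus,
        Instant timestamp,
        String contentType,
        byte[] body,
        String etagHeader,
        String lastModifiedHeader) {

    /** True when the HTTP status code is in the 2xx range. */
    boolean isSuccess() {
        return httpStatus >= 200 && httpStatus < 300;
    }
}
```
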
The easiest way to interact with parquet files is to use [DuckDB](https://duckdb.org/),
which lets you run SQL queries on parquet files (and almost anything else).
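
As a rough illustration, the sketch below builds a query that counts documents per HTTP status code and runs it through DuckDB's JDBC driver. It assumes that the DuckDB JDBC driver is on the classpath, that the parquet column is named `httpStatus` like the field listed above, and that `my-file.parquet` is a placeholder for a real crawl data file. `ParquetStatusHistogram` and `statusHistogramQuery` are hypothetical names and not part of this module.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical helper for illustration; not part of the crawling module. */
final class ParquetStatusHistogram {

    /** Builds a DuckDB query counting documents per HTTP status in one parquet file. */
    static String statusHistogramQuery(String parquetPath) {
        // Double any single quotes so the path is a valid SQL string literal.
        String quoted = parquetPath.replace("'", "''");
        // Assumption: the parquet column name matches the httpStatus field.
        return "SELECT httpStatus, COUNT(*) AS cnt "
             + "FROM read_parquet('" + quoted + "') "
             + "GROUP BY httpStatus ORDER BY cnt DESC";
    }

    public static void main(String[] args) throws SQLException {
        // Placeholder path; pass a real file as the first argument.
        String path = args.length > 0 ? args[0] : "my-file.parquet";

        // "jdbc:duckdb:" opens an in-memory DuckDB database. This only works
        // when the DuckDB JDBC driver is available on the classpath.
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(statusHistogramQuery(path))) {
            while (rs.next()) {
                // Column 1 is httpStatus, column 2 is the document count.
                System.out.println(rs.getInt(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

The SQL string produced by `statusHistogramQuery` can also be pasted directly into the DuckDB command-line shell, which is usually the quicker route for ad-hoc inspection.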