# Crawling Models

Contains crawl data models shared by the [crawling-process](../../) and
[converting-process](../../../processes/converting-process/).

To ensure backward compatibility with older versions of the data, serialization is
abstracted away from the model classes.

Crawl data is now serialized as parquet files.

The older format, zstd-compressed JSON, is still supported *for now*, but parquet
is preferred: it is more succinct, significantly faster to read, and much more
portable.  JSON support will be removed in the future.

## Central Classes

* [CrawledDocument](java/nu/marginalia/model/crawldata/CrawledDocument.java)
* [CrawledDomain](java/nu/marginalia/model/crawldata/CrawledDomain.java)

### Serialization

The reader class listed below automatically selects the serialization format based on the
file extension.

Data is accessed through a [SerializableCrawlDataStream](java/nu/marginalia/io/crawldata/SerializableCrawlDataStream.java),
which is a somewhat enhanced Iterator that can be used to read data. 

* [CrawledDomainReader](java/nu/marginalia/io/crawldata/CrawledDomainReader.java)
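
As a rough illustration of selecting a format from the file extension, here is a minimal, self-contained sketch. It is not the actual `CrawledDomainReader` logic: the `CrawlDataFormatSketch` class, the `Format` enum, and the `detect` method are hypothetical names, and the `.zstd` suffix for the legacy JSON files is an assumption.

```java
import java.util.Locale;
import java.util.Optional;

public class CrawlDataFormatSketch {
    /** Hypothetical format tags, not part of the real code base. */
    enum Format { PARQUET, LEGACY_JSON }

    /**
     * Picks a format from the file name alone.
     * The ".zstd" suffix for zstd-compressed JSON is an assumption.
     */
    static Optional<Format> detect(String fileName) {
        String name = fileName.toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet")) return Optional.of(Format.PARQUET);
        if (name.endsWith(".zstd")) return Optional.of(Format.LEGACY_JSON);
        return Optional.empty(); // unknown extension: the caller decides what to do
    }

    public static void main(String[] args) {
        System.out.println(detect("example.parquet")); // prints Optional[PARQUET]
        System.out.println(detect("example.zstd"));    // prints Optional[LEGACY_JSON]
        System.out.println(detect("example.txt"));     // prints Optional.empty
    }
}
```

Returning an `Optional` rather than throwing keeps the choice of how to handle an unrecognized file with the caller.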

### Parquet Serialization

The parquet serialization is done using the [CrawledDocumentParquetRecordFileReader](java/nu/marginalia/parquet/crawldata/CrawledDocumentParquetRecordFileReader.java)
and [CrawledDocumentParquetRecordFileWriter](java/nu/marginalia/parquet/crawldata/CrawledDocumentParquetRecordFileWriter.java) classes,
which read and write parquet files respectively.

The model classes are serialized to parquet using the [CrawledDocumentParquetRecord](java/nu/marginalia/parquet/crawldata/CrawledDocumentParquetRecord.java) class.

The record has the following fields:

* `domain` - The domain of the document
* `url` - The URL of the document
* `ip` - The IP address of the document
* `cookies` - Whether the document has cookies
* `httpStatus` - The HTTP status code of the document
* `timestamp` - The timestamp of the document
* `contentType` - The content type of the document
* `body` - The body of the document
* `etagHeader` - The ETag header of the document
* `lastModifiedHeader` - The Last-Modified header of the document
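
To make the shape of the record concrete, the fields above could be mirrored in a plain Java record along these lines. This is only a sketch: the `DocumentRecord` name and all Java types are assumptions (only `httpStatus` is known to be a 32-bit integer, from the DuckDB output in this readme), and the real `CrawledDocumentParquetRecord` class may differ.

```java
import java.time.Instant;

public class CrawledDocumentRecordSketch {
    /**
     * Hypothetical mirror of the parquet record's fields.
     * Field names follow the list above; the types are assumed.
     */
    record DocumentRecord(
            String domain,
            String url,
            String ip,
            boolean cookies,
            int httpStatus,
            Instant timestamp,
            String contentType,
            byte[] body,
            String etagHeader,
            String lastModifiedHeader) {
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only
        var doc = new DocumentRecord(
                "www.example.com",
                "https://www.example.com/",
                "127.0.0.1",
                false,
                200,
                Instant.EPOCH,
                "text/html",
                "<html></html>".getBytes(),
                null,
                null);

        System.out.println(doc.url() + " -> " + doc.httpStatus());
    }
}
```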

The easiest way to interact with parquet files is to use [DuckDB](https://duckdb.org/),
which lets you run SQL queries on parquet files (and almost anything else).

For example:
```sql
select httpStatus, count(*) as cnt
       from 'my-file.parquet'
       group by httpStatus;
┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘
```