mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-24 05:18:58 +00:00
(process-models) Improve documentation
parent 300b1a1b84
commit 652d151373
code/process-models/crawl-spec/readme.md | 16 (new file)
@@ -0,0 +1,16 @@
# Crawl Spec

A crawl spec is a list of domains to be crawled. It is a parquet file with the following columns:

- `domain`: The domain to be crawled
- `crawlDepth`: The depth to which the domain should be crawled
- `urls`: A list of known URLs to be crawled

Crawl specs are used to define the scope of a crawl in the absence of known domains.

The [CrawlSpecRecord](src/main/java/nu/marginalia/model/crawlspec/CrawlSpecRecord.java) class is
used to represent a record in the crawl spec.

The [CrawlSpecRecordParquetFileReader](src/main/java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileReader.java)
and [CrawlSpecRecordParquetFileWriter](src/main/java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileWriter.java)
classes are used to read and write the crawl spec parquet files.
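
As a rough sketch of how these pieces might fit together (the writer's constructor and
`write(...)` method here are assumptions for illustration, not the verified API):

```java
// Hypothetical sketch: build a one-domain crawl spec and write it to parquet.
// The writer constructor and write() method are assumptions; consult the
// linked classes for the actual API.
import nu.marginalia.io.crawlspec.CrawlSpecRecordParquetFileWriter;
import nu.marginalia.model.crawlspec.CrawlSpecRecord;

import java.nio.file.Path;
import java.util.List;

public class WriteCrawlSpecExample {
    public static void main(String[] args) throws Exception {
        try (var writer = new CrawlSpecRecordParquetFileWriter(Path.of("crawl-spec.parquet"))) {
            // One record per domain: the domain, its crawl depth, and known URLs
            writer.write(new CrawlSpecRecord("www.marginalia.nu", 100,
                    List.of("https://www.marginalia.nu/")));
        }
    }
}
```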
@@ -1,13 +1,69 @@
# Crawling Models

Contains crawl data models shared by the [crawling-process](../../processes/crawling-process/) and
[converting-process](../../processes/converting-process/).

To ensure backward compatibility with older versions of the data, the serialization is
abstracted away from the model classes.

The new way of serializing the data is to use parquet files.

The old way was to use zstd-compressed JSON. It is still supported *for now*, but the new
format is preferred as it's not only more succinct, but also significantly faster to read
and much more portable. The JSON support will be removed in the future.

## Central Classes

* [CrawledDocument](src/main/java/nu/marginalia/crawling/model/CrawledDocument.java)
* [CrawledDomain](src/main/java/nu/marginalia/crawling/model/CrawledDomain.java)

### Serialization

These serialization classes automatically negotiate the serialization format based on the
file extension.

Data is accessed through a [SerializableCrawlDataStream](src/main/java/nu/marginalia/crawling/io/SerializableCrawlDataStream.java),
which is a somewhat enhanced Iterator that can be used to read data; see the sketch after
the class list.

* [CrawledDomainReader](src/main/java/nu/marginalia/crawling/io/CrawledDomainReader.java)
* [CrawledDomainWriter](src/main/java/nu/marginalia/crawling/io/CrawledDomainWriter.java)
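
A minimal sketch of consuming crawl data through this interface follows; the static
`createDataStream` entry point and the stream's iteration methods are assumptions for
illustration, not the verified API:

```java
// Hypothetical sketch: iterate over a crawl data file. The reader picks the
// serialization format (parquet or zstd-compressed JSON) from the file
// extension; createDataStream and the stream methods are assumptions.
import nu.marginalia.crawling.io.CrawledDomainReader;
import nu.marginalia.crawling.io.SerializableCrawlDataStream;
import nu.marginalia.crawling.model.CrawledDocument;

import java.nio.file.Path;

public class ReadCrawlDataExample {
    public static void main(String[] args) throws Exception {
        try (SerializableCrawlDataStream stream =
                     CrawledDomainReader.createDataStream(Path.of("marginalia.nu.parquet"))) {
            while (stream.hasNext()) {
                // The stream yields domain and document records; pick out the documents
                if (stream.next() instanceof CrawledDocument doc) {
                    System.out.println(doc.httpStatus + " " + doc.url);
                }
            }
        }
    }
}
```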

### Parquet Serialization

The parquet serialization is done using the [CrawledDocumentParquetRecordFileReader](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileReader.java)
and [CrawledDocumentParquetRecordFileWriter](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecordFileWriter.java) classes,
which read and write parquet files respectively.
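
As a sketch, streaming records straight out of a parquet file might look like this; the
static `stream(Path)` method and the public field access are assumptions for illustration:

```java
// Hypothetical sketch: read raw parquet records and print status + URL.
// stream(Path) and the public fields are assumptions, not verified API.
import nu.marginalia.crawling.parquet.CrawledDocumentParquetRecord;
import nu.marginalia.crawling.parquet.CrawledDocumentParquetRecordFileReader;

import java.nio.file.Path;

public class DumpRecordsExample {
    public static void main(String[] args) throws Exception {
        try (var records = CrawledDocumentParquetRecordFileReader.stream(Path.of("my-file.parquet"))) {
            records.forEach(rec -> System.out.println(rec.httpStatus + " " + rec.url));
        }
    }
}
```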

The model classes are serialized to parquet using the
[CrawledDocumentParquetRecord](src/main/java/nu/marginalia/crawling/parquet/CrawledDocumentParquetRecord.java) class.

The record has the following fields (a shape sketch follows the list):

* `domain` - The domain of the document
* `url` - The URL of the document
* `ip` - The IP address the document was fetched from
* `cookies` - Whether the document has cookies
* `httpStatus` - The HTTP status code of the document
* `timestamp` - The timestamp of the document
* `contentType` - The content type of the document
* `body` - The body of the document
* `etagHeader` - The ETag header of the document
* `lastModifiedHeader` - The Last-Modified header of the document
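
For orientation, the record's shape roughly corresponds to the sketch below; the exact
field types (and whether the real class is a plain class, a builder, or a Java record)
are assumptions inferred from the field list above:

```java
// Hypothetical shape of the parquet record, inferred from the field list
// above; the real class may use different types or representations.
import java.time.Instant;

public record CrawledDocumentParquetRecordSketch(
        String domain,             // the domain of the document
        String url,                // the URL of the document
        String ip,                 // the IP address the document was fetched from
        boolean cookies,           // whether the document has cookies
        int httpStatus,            // the HTTP status code
        Instant timestamp,         // when the document was fetched
        String contentType,        // the content type
        byte[] body,               // the document body
        String etagHeader,         // the ETag header, if any
        String lastModifiedHeader  // the Last-Modified header, if any
) {}
```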

The easiest way to interact with parquet files is to use [DuckDB](https://duckdb.org/),
which lets you run SQL queries on parquet files (and almost anything else).

e.g.
```sql
select httpStatus, count(*) as cnt
from 'my-file.parquet'
group by httpStatus;

┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘
```