# Crawl Spec

A crawl spec is a list of domains to be crawled, stored as a Parquet file with the following columns:

- `domain`: The domain to be crawled
- `crawlDepth`: The depth to which the domain should be crawled
- `urls`: A list of known URLs to be crawled
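
As a rough illustration of the row shape described above, the sketch below models one crawl spec entry as a plain Java record. `CrawlSpecRowSketch` and `CrawlSpecRow` are hypothetical names used only here, not the project's `CrawlSpecRecord` class; the field types (`String`, `int`, `List<String>`) and the treatment of a missing `urls` list as empty are assumptions, since the column list does not state them.

```java
import java.util.List;

// Hypothetical sketch of a single crawl spec row.
// This is NOT the project's CrawlSpecRecord; names and types are assumed.
public class CrawlSpecRowSketch {

    // One row: the domain, how deep to crawl it, and any URLs already known.
    record CrawlSpecRow(String domain, int crawlDepth, List<String> urls) {
        CrawlSpecRow {
            if (domain == null || domain.isBlank()) {
                throw new IllegalArgumentException("domain is required");
            }
            // Assumption: a missing URL list is treated as "no known URLs".
            urls = (urls == null) ? List.of() : List.copyOf(urls);
        }
    }

    public static void main(String[] args) {
        var seeded = new CrawlSpecRow("www.example.com", 100,
                List.of("https://www.example.com/"));
        var unseeded = new CrawlSpecRow("blog.example.com", 20, null);

        System.out.println(seeded);
        System.out.println(unseeded.urls().isEmpty());
    }
}
```

The record copies the URL list defensively so that a row cannot change after it has been built, which seems like a reasonable property for data that is written to a file and read back later.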

Crawl specs define the scope of a crawl when no domains are known in advance.

The [CrawlSpecRecord](java/nu/marginalia/model/crawlspec/CrawlSpecRecord.java) class
represents a single record in the crawl spec.

The [CrawlSpecRecordParquetFileReader](java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileReader.java)
and [CrawlSpecRecordParquetFileWriter](java/nu/marginalia/io/crawlspec/CrawlSpecRecordParquetFileWriter.java)
classes read and write crawl spec Parquet files.