mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00


Crawl Spec

A crawl spec is a list of domains to be crawled, stored as a parquet file.

Crawl specs are used to define the scope of a crawl in the absence of known domains.

The CrawlSpecRecord class is used to represent a record in the crawl spec.

The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes are used to read and write the crawl spec parquet files.
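
As a rough illustration of what a crawl spec looks like in memory, the sketch below models one row per domain and builds a spec from a list of seed domains. It is not the project's code: `SpecRow` is a hypothetical stand-in for CrawlSpecRecord, and its field names (`domain`, `crawlDepth`, `urls`) as well as the `specForDomains` helper are assumptions made for the example. Parquet reading and writing is left out on purpose, since the signatures of the reader and writer classes are not shown here.

```java
import java.util.List;

public class CrawlSpecSketch {

    // Hypothetical stand-in for CrawlSpecRecord; the field names are assumptions.
    record SpecRow(String domain, int crawlDepth, List<String> urls) {
        SpecRow {
            if (domain == null || domain.isBlank()) {
                throw new IllegalArgumentException("domain must not be blank");
            }
            if (crawlDepth < 0) {
                throw new IllegalArgumentException("crawlDepth must not be negative");
            }
            // Defensive copy so each row stays immutable
            urls = List.copyOf(urls);
        }
    }

    // Builds one row per domain, with no known URLs yet (hypothetical helper).
    static List<SpecRow> specForDomains(List<String> domains, int crawlDepth) {
        return domains.stream()
                .map(domain -> new SpecRow(domain, crawlDepth, List.of()))
                .toList();
    }

    public static void main(String[] args) {
        List<SpecRow> spec = specForDomains(List.of("www.marginalia.nu", "example.com"), 100);
        spec.forEach(System.out::println);
    }
}
```

Each row carries everything needed to bound the crawl of one domain, which matches the role of a crawl spec as the definition of a crawl's scope when no domains are known yet.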