Crawl Spec

A crawl spec is a list of domains to be crawled, stored as a Parquet file with the following columns:

  • domain: The domain to be crawled
  • crawlDepth: The depth to which the domain should be crawled
  • urls: A list of known URLs to be crawled
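
As a rough illustration, one row of this schema can be pictured as a small Java record. This is only a sketch: the name CrawlSpecRow, the validation rules, and the example domains are assumptions made here for illustration, and it is not the project's CrawlSpecRecord class. It also does not touch Parquet at all; it only shows the shape of the three columns.

```java
import java.util.List;

// Hypothetical stand-in for one row of a crawl spec (not the project's CrawlSpecRecord).
record CrawlSpecRow(String domain, int crawlDepth, List<String> urls) {
    CrawlSpecRow {
        // Assumed rule: a row without a domain is meaningless.
        if (domain == null || domain.isBlank()) {
            throw new IllegalArgumentException("domain must be non-empty");
        }
        // Assumed rule: a negative depth is rejected.
        if (crawlDepth < 0) {
            throw new IllegalArgumentException("crawlDepth must not be negative");
        }
        // Treat a missing URL list as empty, and copy it so the row is immutable.
        urls = (urls == null) ? List.of() : List.copyOf(urls);
    }
}

class CrawlSpecExample {
    public static void main(String[] args) {
        // A crawl spec is then simply a list of such rows.
        List<CrawlSpecRow> spec = List.of(
            new CrawlSpecRow("www.example.com", 100, List.of("https://www.example.com/")),
            new CrawlSpecRow("blog.example.org", 20, null));

        for (CrawlSpecRow row : spec) {
            System.out.println(row.domain()
                + " depth=" + row.crawlDepth()
                + " urls=" + row.urls().size());
        }
    }
}
```

Treating a missing urls value as an empty list is a design choice of this sketch, so that a domain can be listed without any known URLs.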

Crawl specs are used to define the scope of a crawl in the absence of known domains.

The CrawlSpecRecord class is used to represent a record in the crawl spec.

The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes are used to read and write the crawl spec parquet files.