mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Crawl Spec

A crawl spec is a list of domains to be crawled. It is a parquet file with the following columns:

Crawl specs are used to define the scope of a crawl in the absence of known domains.

The CrawlSpecRecord class is used to represent a record in the crawl spec.

The CrawlSpecRecordParquetFileReader and CrawlSpecRecordParquetFileWriter classes are used to read and write the crawl spec parquet files.
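
As a rough illustration of what one record in a crawl spec might carry, here is a minimal in-memory sketch. It is not the project's actual `CrawlSpecRecord` class. The type `SpecRow`, its fields (`domain`, `crawlDepth`, `urls`), and the helper `totalKnownUrls` are all hypothetical names chosen for this example, and no Parquet I/O is shown.

```java
import java.util.List;

// Hypothetical sketch only: SpecRow and its fields are assumptions made for
// illustration, not the real CrawlSpecRecord from the Marginalia code base.
public class CrawlSpecSketch {

    // One row of a crawl spec: a domain, how deep to crawl it (assumed field),
    // and any URLs already known for it (assumed field).
    public record SpecRow(String domain, int crawlDepth, List<String> urls) {
        public SpecRow {
            if (domain == null || domain.isBlank()) {
                throw new IllegalArgumentException("domain must not be blank");
            }
            if (crawlDepth < 0) {
                throw new IllegalArgumentException("crawlDepth must not be negative");
            }
            // Defensive, immutable copy; a missing list becomes an empty one.
            urls = (urls == null) ? List.of() : List.copyOf(urls);
        }
    }

    // Counts the known URLs across all rows of a spec.
    public static int totalKnownUrls(List<SpecRow> spec) {
        int total = 0;
        for (SpecRow row : spec) {
            total += row.urls().size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<SpecRow> spec = List.of(
                new SpecRow("www.example.com", 100, List.of("https://www.example.com/")),
                new SpecRow("blog.example.org", 50, List.of()));

        for (SpecRow row : spec) {
            System.out.println(row.domain() + " depth=" + row.crawlDepth()
                    + " knownUrls=" + row.urls().size());
        }
        System.out.println("total known urls: " + totalKnownUrls(spec));
    }
}
```

In the real code base, rows like these would be read and written through the reader and writer classes named above. Their method signatures are deliberately not reproduced here.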