MarginaliaSearch/code/processes/crawling-process
Viktor Lofgren d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
ft-content-type (perf) Code was still spending a lot of time resolving charsets 2024-08-01 11:58:59 +02:00
ft-crawl-blocklist (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
ft-link-parser (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
java/nu/marginalia/crawl (*) Remove the crawl spec abstraction 2024-10-03 13:41:17 +02:00
model (*) Remove the crawl spec abstraction 2024-10-03 13:41:17 +02:00
resources (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
test/nu/marginalia (*) Remove the crawl spec abstraction 2024-10-03 13:41:17 +02:00
build.gradle (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
readme.md (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00

Crawling Process

The crawling process downloads HTML documents and saves them into per-domain snapshots. The crawler seeks out HTML documents and ignores other document types, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler does not follow links to other domains within a single job.
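The "does not follow links to other domains" rule can be illustrated with a small filter. This is only a sketch: the class and method names are hypothetical, not taken from the crawler's code, and it assumes "same domain" means an exact, case-insensitive host match.

```java
import java.net.URI;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the actual crawler code base.
class SameDomainFilter {

    /** True if the link is an http(s) URL whose host equals the crawled domain (assumption: exact host match). */
    static boolean isSameDomain(String crawlDomain, URI link) {
        String scheme = link.getScheme();
        String host = link.getHost();
        boolean isWeb = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        return isWeb && host != null && host.equalsIgnoreCase(crawlDomain);
    }

    /** Resolves hrefs against the page URL and keeps only links that stay on the page's own domain. */
    static List<URI> internalLinks(URI base, List<String> hrefs) {
        return hrefs.stream()
                .map(href -> resolveOrNull(base, href))
                .filter(Objects::nonNull)
                .filter(link -> isSameDomain(base.getHost(), link))
                .collect(Collectors.toList());
    }

    private static URI resolveOrNull(URI base, String href) {
        try {
            return base.resolve(href.trim());
        } catch (IllegalArgumentException e) {
            return null; // malformed href, skip it
        }
    }
}
```

Relative links are resolved against the page URL first, so `/about` and `page2.html` are kept while `mailto:` links and links to other hosts are dropped.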

The crawler stores data from in-progress crawls in a WARC file. Once the crawl is complete, the WARC file is converted to a parquet file, which is then used by the converting process. No other process uses the intermediate WARC file, but it is kept so that the state of a crawl can be recovered after a crash or other failure.

If configured to do so, the crawler retains these WARC files after conversion. This is not the default behavior, as the WARC format is not very dense, and the parquet files are much more efficient. However, the WARC files are useful for debugging and integration with other tools.

Robots Rules

A significant part of the crawler deals with robots.txt and similar mechanisms, such as rate-limiting headers, especially when these are not served in a standard way (which is very common). RFC 9309 as well as Google's Robots.txt Specifications are good references.
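To give a feel for what robots.txt handling involves, here is a deliberately simplified sketch. The class is hypothetical and is not the crawler's implementation: it only reads `Disallow` rules from groups addressed to `User-agent: *` and does plain prefix matching, ignoring `Allow` rules, wildcards, agent-specific groups, and the many malformed variants a real crawler has to tolerate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified rule set: only "User-agent: *" groups and Disallow prefixes.
class SimpleRobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    static SimpleRobotsRules parse(String robotsTxt) {
        SimpleRobotsRules rules = new SimpleRobotsRules();
        boolean inWildcardGroup = false;
        boolean lastWasUserAgent = false;

        for (String rawLine : robotsTxt.split("\\R")) {
            // Drop trailing comments and surrounding whitespace
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();

            int colon = line.indexOf(':');
            if (colon < 0) continue; // blank or unparseable line

            String key = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();

            if (key.equals("user-agent")) {
                // Consecutive user-agent lines belong to the same group
                if (!lastWasUserAgent) inWildcardGroup = false;
                if (value.equals("*")) inWildcardGroup = true;
                lastWasUserAgent = true;
            } else {
                lastWasUserAgent = false;
                // An empty Disallow value means "nothing is disallowed"
                if (inWildcardGroup && key.equals("disallow") && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** True if no collected Disallow prefix matches the start of the path. */
    boolean isAllowed(String path) {
        return disallowed.stream().noneMatch(path::startsWith);
    }
}
```

Even this small version has to decide how to treat comments, empty values, and group boundaries, which hints at why non-standard files take up so much of the real code.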

Re-crawling

The crawler can use old crawl data to avoid re-downloading documents that have not changed. It does this by sending conditional requests that carry the HTTP If-Modified-Since and If-None-Match headers, built from the previous crawl, so the server can answer 304 Not Modified instead of sending the body again. If a large proportion of a domain's documents have not changed, the crawler switches to a mode where it only randomly samples a few documents from that domain, to avoid wasting time and resources on domains that have not changed.
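A conditional request of this kind might look roughly like the following. The sketch uses the JDK's `java.net.http` API purely for illustration, which is not necessarily the HTTP client the crawler uses, and the class and parameter names are hypothetical. The ETag and Last-Modified values are assumed to come from the stored response of the previous crawl.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, shown with the JDK HTTP client for illustration only.
class ConditionalRequest {

    /**
     * Builds a GET request that asks the server to send the body only if the
     * document changed since the previous crawl. Either validator may be null
     * if the old response did not include it.
     */
    static HttpRequest build(URI uri, String previousEtag, String previousLastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(uri).GET();
        if (previousEtag != null) {
            builder.header("If-None-Match", previousEtag);
        }
        if (previousLastModified != null) {
            builder.header("If-Modified-Since", previousLastModified);
        }
        return builder.build();
    }

    /** 304 Not Modified means the old copy is still current and can be reused. */
    static boolean isNotModified(int statusCode) {
        return statusCode == 304;
    }
}
```

When the server replies with 304, the response has no body, so the document from the old crawl data is carried over as-is.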

Sitemaps and RSS feeds

On top of organic links, the crawler can use sitemaps and RSS feeds to discover new documents.
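As an illustration of sitemap-based discovery, the sketch below pulls the `<loc>` URLs out of a sitemap document with the JDK's DOM parser. The class name is hypothetical and this is not the crawler's own parsing code. It assumes a well-formed sitemap that declares the standard sitemaps.org namespace, which real-world sitemaps do not always do.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; assumes a well-formed, namespaced sitemap.
class SitemapUrls {
    private static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    /** Returns the text of every <loc> element in the sitemap namespace. */
    static List<String> extract(String sitemapXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Refuse DOCTYPE declarations, since sitemaps are untrusted input
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);

            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));

            NodeList locs = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            List<String> urls = new ArrayList<>();
            for (int i = 0; i < locs.getLength(); i++) {
                String url = locs.item(i).getTextContent().trim();
                if (!url.isEmpty()) {
                    urls.add(url);
                }
            }
            return urls;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse sitemap", e);
        }
    }
}
```

The extracted URLs would still need the same per-domain and robots.txt checks as organically discovered links before being fetched.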

Central Classes