Crawling Process

The crawling process downloads HTML documents and saves them into per-domain snapshots. The crawler seeks out HTML documents and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler does not follow links to other domains within a single job.
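
To illustrate the HTML-only policy, here is a minimal sketch of a content-type gate. The class and method names are hypothetical and this is not the crawler's actual fetching code; it only shows the idea of keeping HTML and skipping everything else.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of an HTML-only fetch: HTML responses are kept,
// everything else (PDFs, images, ...) is skipped.
class HtmlOnlyFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    /** Returns the document body if it is HTML, or null if it should be ignored. */
    String fetchIfHtml(String url) throws Exception {
        var request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        var response = client.send(request, HttpResponse.BodyHandlers.ofString());

        String contentType = response.headers()
                .firstValue("Content-Type")
                .orElse("");

        if (contentType.startsWith("text/html")
                || contentType.startsWith("application/xhtml+xml")) {
            return response.body();
        }
        return null; // not HTML -- skip, as the crawler does for e.g. PDFs
    }
}
```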

Robots Rules

A significant part of the crawler is dealing with robots.txt and similar rate-limiting signals, especially when these are not served in a standard way (which is very common). RFC 9309 as well as Google's Robots.txt Specifications are good references.
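
As a hedged sketch of the robots.txt side of this, the example below fetches and parses a robots.txt file with the crawler-commons SimpleRobotRulesParser. Whether this project uses that parser for this step is an assumption, and the class name and fallback behaviour are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

// Sketch: fetch and parse robots.txt, falling back to "allow all" when the
// file is unreachable or unparseable (a very common real-world case).
class RobotsRulesFetcher {
    private final HttpClient client = HttpClient.newHttpClient();
    private final SimpleRobotRulesParser parser = new SimpleRobotRulesParser();

    BaseRobotRules fetchRules(String domain, String userAgent) {
        String robotsUrl = "https://" + domain + "/robots.txt";
        try {
            var request = HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build();
            var response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            return parser.parseContent(robotsUrl,
                    response.body(),
                    response.headers().firstValue("Content-Type").orElse("text/plain"),
                    userAgent);
        } catch (Exception e) {
            // Simplification: treat failures as allow-all; a real crawler
            // distinguishes e.g. a missing file from a server error.
            return new SimpleRobotRules(SimpleRobotRules.RobotRulesMode.ALLOW_ALL);
        }
    }
}
```

The returned rules object can then answer per-URL isAllowed() queries and expose any Crawl-delay directive via getCrawlDelay(), which is where rate limiting ties in.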

Re-crawling

The crawler can use old crawl data to avoid re-downloading documents that have not changed. This is done with conditional requests: the Last-Modified and ETag values recorded in the old crawl data are sent as HTTP If-Modified-Since and If-None-Match headers, and a 304 Not Modified response means the old document can be reused. If a large proportion of the documents have not changed, the crawler falls into a mode where it only randomly samples a few documents from each domain, to avoid wasting time and resources on domains that have not changed.
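
Below is a minimal sketch of such a conditional re-fetch. The class and method names are hypothetical; the etag and last-modified values are assumed to come from the previous crawl's data.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch of a conditional re-fetch using validators from a previous crawl.
class ConditionalFetcher {
    private final HttpClient client = HttpClient.newHttpClient();

    /** Returns null if the server reports the document as unchanged (304). */
    String refetch(String url, String etag, String lastModified) throws Exception {
        var builder = HttpRequest.newBuilder(URI.create(url)).GET();

        if (etag != null)         builder.header("If-None-Match", etag);
        if (lastModified != null) builder.header("If-Modified-Since", lastModified);

        var response = client.send(builder.build(), HttpResponse.BodyHandlers.ofString());

        if (response.statusCode() == 304) {
            return null; // unchanged -- reuse the old crawl data
        }
        return response.body();
    }
}
```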

Sitemaps and RSS Feeds

On top of organic links, the crawler can use sitemaps and RSS feeds to discover new documents.
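
For the sitemap case, the sketch below pulls candidate URLs out of a sitemap.xml by reading its &lt;loc&gt; elements with the standard XML parser. This is only an illustration with a hypothetical class name; it does not handle sitemap index files or RSS feeds, which a real crawler also needs to cover.

```java
import java.io.ByteArrayInputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

// Sketch: extract candidate URLs from a sitemap.xml by reading its <loc> elements.
class SitemapUrlExtractor {
    private final HttpClient client = HttpClient.newHttpClient();

    List<String> extractUrls(String sitemapUrl) throws Exception {
        var request = HttpRequest.newBuilder(URI.create(sitemapUrl)).GET().build();
        var response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

        var doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(response.body()));

        var locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>(locs.getLength());
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```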

Central Classes

See Also