Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git, synced 2025-02-23 13:09:00 +00:00
(docs) Improve architectural documentation for the crawler.

Parent: 5fb24bb27f
Commit: 166a391eae
# Crawling Process

The crawling process downloads HTML documents and saves them into per-domain snapshots. The crawler seeks out HTML
documents and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the
crawler does not follow links to other domains within a single job.
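The domain-by-domain scoping described above can be sketched as two simple predicates: one that keeps a link only if its host matches the domain being crawled, and one that keeps only HTML responses. This is an illustrative sketch; the class and method names (`CrawlScopeSketch`, `sameDomain`, `looksLikeHtml`) are hypothetical and not taken from the Marginalia codebase.

```java
import java.net.URI;

public class CrawlScopeSketch {
    /** A link is in scope only if its host matches the domain being crawled. */
    static boolean sameDomain(String domain, String link) {
        String host = URI.create(link).getHost();
        return host != null && host.equalsIgnoreCase(domain);
    }

    /** Crude content-type check: only HTML documents are kept; PDFs etc. are skipped. */
    static boolean looksLikeHtml(String contentType) {
        return contentType != null && contentType.toLowerCase().startsWith("text/html");
    }

    public static void main(String[] args) {
        System.out.println(sameDomain("example.com", "https://example.com/about")); // in scope
        System.out.println(sameDomain("example.com", "https://other.org/page"));    // out of scope
        System.out.println(looksLikeHtml("application/pdf"));                       // ignored
    }
}
```

A real crawler would additionally normalize hosts (e.g. `www.` prefixes) and sniff content rather than trusting the `Content-Type` header alone.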
## Robots Rules

A significant part of the crawler is dealing with `robots.txt` and similar rate-limiting headers, especially when these
are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well
as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
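To make the robots-rules idea concrete, here is a deliberately simplified, hypothetical matcher that collects `Disallow:` prefixes from the `User-agent: *` group and checks paths against them. A real RFC 9309 parser (not shown here) must also handle per-agent groups, `Allow:` lines, wildcards, and longest-match precedence.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsSketch {
    /** Collects Disallow path prefixes from the "User-agent: *" group. */
    static List<String> disallowedPrefixes(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inGlobalGroup = false;
        for (String line : robotsTxt.split("\n")) {
            String trimmed = line.trim();
            String lower = trimmed.toLowerCase();
            if (lower.startsWith("user-agent:")) {
                // Only rules in the wildcard group apply in this sketch
                inGlobalGroup = trimmed.substring("user-agent:".length()).trim().equals("*");
            } else if (inGlobalGroup && lower.startsWith("disallow:")) {
                String path = trimmed.substring("disallow:".length()).trim();
                if (!path.isEmpty()) prefixes.add(path);
            }
        }
        return prefixes;
    }

    /** A path is allowed unless some Disallow prefix matches it. */
    static boolean isAllowed(String path, List<String> disallowed) {
        return disallowed.stream().noneMatch(path::startsWith);
    }
}
```

Usage: fetch `/robots.txt` once per domain, parse it, and consult `isAllowed` before every document fetch.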
## Re-crawling

The crawler can use old crawl data to avoid re-downloading documents that have not changed. This is done by
comparing the old and new documents using the HTTP `If-Modified-Since` and `If-None-Match` headers. If a large
proportion of the documents have not changed, the crawler falls into a mode where it only randomly samples a few
documents from each domain, to avoid wasting time and resources on domains that have not changed.
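The conditional-fetch mechanism above can be sketched with the JDK's `java.net.http` API: the `ETag` and `Last-Modified` values stored from the previous crawl are sent back as `If-None-Match` and `If-Modified-Since`, and a `304 Not Modified` response means the stored copy can be reused. The class name and sample values below are illustrative, not from the Marginalia codebase.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class ConditionalFetchSketch {
    /** Builds a conditional GET from values stored during the previous crawl. */
    static HttpRequest conditionalGet(String url, String storedEtag, String storedLastModified) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("If-None-Match", storedEtag)          // matched against the server's ETag
                .header("If-Modified-Since", storedLastModified)
                .GET()
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = conditionalGet("https://example.com/page.html",
                "\"abc123\"", "Sat, 01 Feb 2025 00:00:00 GMT");
        // A 304 response to this request means the snapshot copy is still current.
        System.out.println(req.headers().map());
    }
}
```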
## Sitemaps and RSS Feeds

On top of organic links, the crawler can use sitemaps and RSS feeds to discover new documents.
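As a sketch of the sitemap side of discovery, the `<loc>` entries of a sitemap XML document can be extracted with the JDK's built-in XML parser. This hypothetical helper ignores realities a production crawler must handle: gzipped sitemaps, nested sitemap indexes, and malformed XML.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapSketch {
    /** Returns the URLs listed in the <loc> elements of a sitemap document. */
    static List<String> extractLocs(String sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(sitemapXml.getBytes(StandardCharsets.UTF_8)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }
}
```

Each extracted URL would then go through the same domain-scope and robots checks as organically discovered links.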
## Central Classes

* [CrawlerRetreiver](src/main/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java)
  visits known addresses from a domain and downloads each document.
* [HttpFetcher](src/main/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
  fetches URLs.
## See Also

* [features-convert](../../features-convert/)
* [features-crawl](../../features-crawl/)