(doc) Update the crawler's readmes, as they've grown stale.

Viktor Lofgren 2024-02-01 18:10:55 +01:00
parent d1e02569f4
commit d60c6b18d4
2 changed files with 14 additions and 15 deletions


@@ -4,11 +4,19 @@ The crawling process downloads HTML and saves them into per-domain snapshots. T
 and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler
 does not follow links to other domains within a single job.
 
+The crawler stores data from in-progress crawls in a WARC file. Once the crawl is complete, the WARC file is
+converted to a parquet file, which is then used by the [converting process](../converting-process/). The intermediate
+WARC file is not used by any other process, but it is kept so that the state of a crawl can be recovered after a crash
+or other failure.
+
+If so configured, these WARC files may be retained. This is not the default behavior, as the WARC format is not very
+dense and the parquet files are much more efficient. However, the WARC files are useful for debugging and for
+integration with other tools.
+
 ## Robots Rules
 
 A significant part of the crawler deals with `robots.txt` and similar rate-limiting headers, especially when these
-are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well
-as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
+are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
 
 ## Re-crawling
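To make the robots handling described in the hunk above more concrete, the sketch below fetches a `robots.txt` file with the JDK's `HttpClient` and parses it with the crawler-commons `SimpleRobotRulesParser`. This is only an illustration under stated assumptions: crawler-commons is a common choice for this job, the snippet is not taken from this repository, and the URL and bot name are made up.

```
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsCheckSketch {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://www.example.com/robots.txt";

        // Fetch robots.txt. A real crawler would also handle 404 (allow all),
        // 5xx (be conservative) and redirects, per RFC 9309.
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response = client.send(
                HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray());

        // Parse the rules for our (made-up) user agent token.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                robotsUrl, response.body(), "text/plain", "example-crawler");

        System.out.println("May fetch /secret/page.html: "
                + rules.isAllowed("https://www.example.com/secret/page.html"));
        System.out.println("Crawl delay value: " + rules.getCrawlDelay());
    }
}
```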
@@ -21,7 +29,6 @@ documents from each domain, to avoid wasting time and resources on domains that
 
 On top of organic links, the crawler can use sitemaps and rss-feeds to discover new documents.
 
 ## Central Classes
 
 * [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.

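The hunk above also mentions discovering documents via sitemaps and RSS feeds. Purely as an illustration (not code from this repository), the sketch below extracts the `<loc>` URLs from a plain urlset-style `sitemap.xml` using only the JDK's XML parser; a real implementation would also need to handle sitemap index files, gzipped sitemaps, and RSS/Atom feeds. The sitemap URL is a hypothetical example.

```
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SitemapSketch {
    /** Returns the <loc> entries of a plain urlset-style sitemap. */
    static List<String> extractUrls(InputStream sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(sitemapXml);

        List<String> urls = new ArrayList<>();
        NodeList locs = doc.getElementsByTagName("loc");
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sitemap location; typically discovered via robots.txt or /sitemap.xml.
        try (InputStream in = URI.create("https://www.example.com/sitemap.xml").toURL().openStream()) {
            extractUrls(in).forEach(System.out::println);
        }
    }
}
```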

@@ -2,10 +2,10 @@
 
 ## 1. Crawl Process
 
-The [crawling-process](crawling-process/) fetches website contents and saves them
-as compressed JSON models described in [crawling-model](../process-models/crawling-model/).
+The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, which are
+then converted into parquet models. Both formats are described in [crawling-model](../process-models/crawling-model/).
 
-The operation is specified by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
+The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
 
 ## 2. Converting Process
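To make the two-stage flow described in the hunk above more concrete, here is a minimal sketch of how a per-domain crawl job could be orchestrated: fetch into a WARC file, convert the finished WARC to a parquet file for the converting process, and optionally discard the WARC afterwards. Every type name here is made up for illustration; none of them are the project's actual classes.

```
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: these interfaces stand in for whatever the real crawler uses. */
interface WarcWritingFetcher {
    /** Crawls one domain, appending each fetched response to the given WARC file. */
    void crawlDomain(String domain, Path warcFile) throws Exception;
}

interface WarcToParquetConverter {
    /** Reads a finished WARC file and writes one parquet row per document. */
    void convert(Path warcFile, Path parquetFile) throws Exception;
}

class CrawlJobSketch {
    private final WarcWritingFetcher fetcher;
    private final WarcToParquetConverter converter;
    private final boolean keepWarcFiles; // off by default; parquet is much denser

    CrawlJobSketch(WarcWritingFetcher fetcher, WarcToParquetConverter converter, boolean keepWarcFiles) {
        this.fetcher = fetcher;
        this.converter = converter;
        this.keepWarcFiles = keepWarcFiles;
    }

    void run(String domain, Path workDir) throws Exception {
        Path warcFile = workDir.resolve(domain + ".warc.gz");
        Path parquetFile = workDir.resolve(domain + ".parquet");

        // The WARC file is written incrementally while crawling, so if the process
        // dies here, the partial WARC can be used to recover or resume the crawl.
        fetcher.crawlDomain(domain, warcFile);

        // Only the parquet output is consumed by the converting process.
        converter.convert(warcFile, parquetFile);

        if (!keepWarcFiles) {
            Files.deleteIfExists(warcFile);
        }
    }
}
```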
@@ -32,21 +32,13 @@ the data generated by the loader.
 Schematically the crawling and loading process looks like this:
 
 ```
-    //====================\\
-    ||  Compressed JSON:  ||       Specifications
-    ||  ID, Domain, Urls[]||       File
-    ||  ID, Domain, Urls[]||
-    ||  ID, Domain, Urls[]||
-    ||  ...               ||
-    \\====================//
-          |
     +-----------+
     |  CRAWLING |  Fetch each URL and
     |    STEP   |  output to file
     +-----------+
           |
     //========================\\
-    ||  Compressed JSON:      ||  Crawl
+    ||  Parquet:              ||  Crawl
     ||  Status, HTML[], ...   ||  Files
     ||  Status, HTML[], ...   ||
     ||  Status, HTML[], ...   ||