(doc) Update the crawler's readmes, as they've grown stale.

Viktor Lofgren 2024-02-01 18:10:55 +01:00
parent d1e02569f4
commit d60c6b18d4
2 changed files with 14 additions and 15 deletions


@@ -4,11 +4,19 @@ The crawling process downloads HTML and saves them into per-domain snapshots. T
 and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler
 does not follow links to other domains within a single job.
 
+The crawler stores data from in-progress crawls in a WARC file. Once the crawl is complete, the WARC file is
+converted to a parquet file, which is then used by the [converting process](../converting-process/). The intermediate
+WARC file is not used by any other process, but it is kept so that the state of a crawl can be recovered after a crash
+or other failure.
+
+If so configured, these WARC files may be retained. This is not the default behavior, as the WARC format is not very
+dense and the parquet files are much more efficient. However, the WARC files are useful for debugging and for
+integration with other tools.
+
 ## Robots Rules
 
 A significant part of the crawler deals with `robots.txt` and similar rate-limiting headers, especially when these
-are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well
-as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
+are not served in a standard way (which is very common). [RFC 9309](https://www.rfc-editor.org/rfc/rfc9309.html) as well as Google's [Robots.txt Specifications](https://developers.google.com/search/docs/advanced/robots/robots_txt) are good references.
 
 ## Re-crawling
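To make the robots handling described in the hunk above more concrete, the sketch below fetches a `robots.txt` file with the JDK's `HttpClient` and parses it with the crawler-commons `SimpleRobotRulesParser`. This is only an illustration under stated assumptions: crawler-commons is a common choice for this job, the snippet is not taken from this repository, and the URL and bot name are made up.

```
import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RobotsCheckSketch {
    public static void main(String[] args) throws Exception {
        String robotsUrl = "https://www.example.com/robots.txt";

        // Fetch robots.txt. A real crawler would also handle 404 (allow all),
        // 5xx (be conservative) and redirects, per RFC 9309.
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<byte[]> response = client.send(
                HttpRequest.newBuilder(URI.create(robotsUrl)).GET().build(),
                HttpResponse.BodyHandlers.ofByteArray());

        // Parse the rules for our (made-up) user agent token.
        SimpleRobotRulesParser parser = new SimpleRobotRulesParser();
        BaseRobotRules rules = parser.parseContent(
                robotsUrl, response.body(), "text/plain", "example-crawler");

        System.out.println("May fetch /secret/page.html: "
                + rules.isAllowed("https://www.example.com/secret/page.html"));
        System.out.println("Crawl delay value: " + rules.getCrawlDelay());
    }
}
```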
@@ -21,7 +29,6 @@ documents from each domain, to avoid wasting time and resources on domains that
 
 On top of organic links, the crawler can use sitemaps and rss-feeds to discover new documents.
 
 ## Central Classes
 
 * [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.

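The hunk above also mentions discovering documents via sitemaps and RSS feeds. Purely as an illustration (not code from this repository), the sketch below extracts the `<loc>` URLs from a plain urlset-style `sitemap.xml` using only the JDK's XML parser; a real implementation would also need to handle sitemap index files, gzipped sitemaps, and RSS/Atom feeds. The sitemap URL is a hypothetical example.

```
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

import javax.xml.parsers.DocumentBuilderFactory;
import java.io.InputStream;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class SitemapSketch {
    /** Returns the <loc> entries of a plain urlset-style sitemap. */
    static List<String> extractUrls(InputStream sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(sitemapXml);

        List<String> urls = new ArrayList<>();
        NodeList locs = doc.getElementsByTagName("loc");
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical sitemap location; typically discovered via robots.txt or /sitemap.xml.
        try (InputStream in = URI.create("https://www.example.com/sitemap.xml").toURL().openStream()) {
            extractUrls(in).forEach(System.out::println);
        }
    }
}
```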

@@ -2,10 +2,10 @@
 
 ## 1. Crawl Process
 
-The [crawling-process](crawling-process/) fetches website contents and saves them
-as compressed JSON models described in [crawling-model](../process-models/crawling-model/).
+The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, which are
+then converted into parquet models. Both formats are described in [crawling-model](../process-models/crawling-model/).
 
-The operation is specified by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
+The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
 
 ## 2. Converting Process
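To make the two-stage flow described in the hunk above more concrete, here is a minimal sketch of how a per-domain crawl job could be orchestrated: fetch into a WARC file, convert the finished WARC to a parquet file for the converting process, and optionally discard the WARC afterwards. Every type name here is made up for illustration; none of them are the project's actual classes.

```
import java.nio.file.Files;
import java.nio.file.Path;

/** Illustrative only: these interfaces stand in for whatever the real crawler uses. */
interface WarcWritingFetcher {
    /** Crawls one domain, appending each fetched response to the given WARC file. */
    void crawlDomain(String domain, Path warcFile) throws Exception;
}

interface WarcToParquetConverter {
    /** Reads a finished WARC file and writes one parquet row per document. */
    void convert(Path warcFile, Path parquetFile) throws Exception;
}

class CrawlJobSketch {
    private final WarcWritingFetcher fetcher;
    private final WarcToParquetConverter converter;
    private final boolean keepWarcFiles; // off by default; parquet is much denser

    CrawlJobSketch(WarcWritingFetcher fetcher, WarcToParquetConverter converter, boolean keepWarcFiles) {
        this.fetcher = fetcher;
        this.converter = converter;
        this.keepWarcFiles = keepWarcFiles;
    }

    void run(String domain, Path workDir) throws Exception {
        Path warcFile = workDir.resolve(domain + ".warc.gz");
        Path parquetFile = workDir.resolve(domain + ".parquet");

        // The WARC file is written incrementally while crawling, so if the process
        // dies here, the partial WARC can be used to recover or resume the crawl.
        fetcher.crawlDomain(domain, warcFile);

        // Only the parquet output is consumed by the converting process.
        converter.convert(warcFile, parquetFile);

        if (!keepWarcFiles) {
            Files.deleteIfExists(warcFile);
        }
    }
}
```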
@@ -32,21 +32,13 @@ the data generated by the loader.
 Schematically the crawling and loading process looks like this:
 
 ```
-    //====================\\
-    ||  Compressed JSON:  ||       Specifications
-    ||  ID, Domain, Urls[]||       File
-    ||  ID, Domain, Urls[]||
-    ||  ID, Domain, Urls[]||
-    ||  ...               ||
-    \\====================//
-          |
     +-----------+
     |  CRAWLING |  Fetch each URL and
     |    STEP   |  output to file
     +-----------+
           |
     //========================\\
-    ||  Compressed JSON:      ||  Crawl
+    ||  Parquet:              ||  Crawl
     ||  Status, HTML[], ...   ||  Files
     ||  Status, HTML[], ...   ||
     ||  Status, HTML[], ...   ||