![]() Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage. |
||
---|---|---|
.. | ||
src | ||
build.gradle | ||
readme.md |
Crawling Process
The crawling process downloads HTML and saves them into per-domain snapshots. The crawler seeks out HTML documents, and ignores other types of documents, such as PDFs. Crawling is done on a domain-by-domain basis, and the crawler does not follow links to other domains within a single job.
Robots Rules
A significant part of the crawler is dealing with robots.txt
and similar, rate limiting headers; especially when these
are not served in a standard way (which is very common). RFC9390 as well
as Google's Robots.txt Specifications are good references.
Re-crawling
The crawler can use old crawl data to avoid re-downloading documents that have not changed. This is done by
comparing the old and new documents using the HTTP If-Modified-Since
and If-None-Match
headers. If a large
proportion of the documents have not changed, the crawler falls into a mode where it only randomly samples a few
documents from each domain, to avoid wasting time and resources on domains that have not changed.
Sitemaps and rss-feeds
On top of organic links, the crawler can use sitemaps and rss-feeds to discover new documents.
Central Classes
- CrawlerMain orchestrates the crawling.
- CrawlerRetreiver visits known addresses from a domain and downloads each document.
- HttpFetcher fetches URLs.