2023-03-04 15:42:31 +00:00
# Crawling Process

The crawling process downloads HTML documents and saves them into per-domain snapshots.
## Central Classes

* [CrawlerMain](src/main/java/nu/marginalia/crawl/CrawlerMain.java) orchestrates the crawling.
* [CrawlerRetreiver](src/main/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java) visits known addresses from a domain and downloads each document.
* [HttpFetcher](src/main/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java) fetches a URL.
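
The division of labour between these three roles can be illustrated with a minimal, self-contained sketch. It is not the actual code of the linked classes: `CrawlSketch`, `Fetcher`, `crawlDomain` and `crawlAll` are hypothetical names invented for illustration, the fetcher returns canned strings instead of doing HTTP, and a "snapshot" is assumed to be a simple in-memory map of URL to document body.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: all names here are hypothetical, not the project's API.
final class CrawlSketch {

    /** Fetcher role: returns the body for a URL, or empty if nothing was fetched. */
    interface Fetcher {
        Optional<String> fetch(String url);
    }

    /** Retriever role: visits each known URL of one domain and keeps what was fetched. */
    static Map<String, String> crawlDomain(List<String> knownUrls, Fetcher fetcher) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (String url : knownUrls) {
            fetcher.fetch(url).ifPresent(body -> snapshot.put(url, body));
        }
        return snapshot;
    }

    /** Orchestrator role: produces one snapshot per domain. */
    static Map<String, Map<String, String>> crawlAll(
            Map<String, List<String>> urlsByDomain, Fetcher fetcher) {
        Map<String, Map<String, String>> snapshots = new LinkedHashMap<>();
        urlsByDomain.forEach((domain, urls) ->
                snapshots.put(domain, crawlDomain(urls, fetcher)));
        return snapshots;
    }

    public static void main(String[] args) {
        // Canned responses stand in for real HTTP requests (assumption).
        Map<String, String> canned = Map.of("https://example.com/", "<html>home</html>");
        Fetcher fetcher = url -> Optional.ofNullable(canned.get(url));

        Map<String, List<String>> known = Map.of(
                "example.com",
                List.of("https://example.com/", "https://example.com/missing"));

        System.out.println(crawlAll(known, fetcher));
    }
}
```

Keeping the fetcher behind an interface in this sketch means the per-domain loop can be exercised without network access, which is one plausible reason to separate fetching from retrieval.
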
## See Also

* [features-convert](../../features-convert/)