mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00


Crawling Process

The crawling process downloads HTML documents and saves them into per-domain snapshots.

Central Classes

CrawlerMain orchestrates the crawling.
CrawlerRetreiver visits the known addresses of a domain and downloads each document.
HttpFetcher fetches a URL.
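
The division of labour between these three classes can be illustrated with a minimal sketch. It is not the project's actual code: the names `CrawlSketch`, `Fetcher`, `Retriever`, and `crawlAll` are hypothetical stand-ins for HttpFetcher, CrawlerRetreiver, and CrawlerMain, and the snapshot is assumed here to be a plain in-memory list rather than whatever on-disk format the crawler really writes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CrawlSketch {

    /** Hypothetical stand-in for HttpFetcher: fetches a single URL. */
    interface Fetcher {
        String fetch(String url);
    }

    /** Hypothetical stand-in for CrawlerRetreiver: visits the known URLs of one domain. */
    static class Retriever {
        private final Fetcher fetcher;

        Retriever(Fetcher fetcher) {
            this.fetcher = fetcher;
        }

        /** Downloads each known URL and collects the documents into one snapshot. */
        List<String> crawlDomain(List<String> knownUrls) {
            List<String> snapshot = new ArrayList<>();
            for (String url : knownUrls) {
                snapshot.add(fetcher.fetch(url));
            }
            return snapshot;
        }
    }

    /** Hypothetical stand-in for CrawlerMain: produces one snapshot per domain. */
    static Map<String, List<String>> crawlAll(Map<String, List<String>> urlsByDomain,
                                              Fetcher fetcher) {
        Retriever retriever = new Retriever(fetcher);
        Map<String, List<String>> snapshots = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : urlsByDomain.entrySet()) {
            snapshots.put(entry.getKey(), retriever.crawlDomain(entry.getValue()));
        }
        return snapshots;
    }

    public static void main(String[] args) {
        // A fake fetcher so the sketch runs without network access.
        Fetcher fake = url -> "<html>" + url + "</html>";

        Map<String, List<String>> urlsByDomain = new LinkedHashMap<>();
        urlsByDomain.put("example.com",
                List.of("https://example.com/", "https://example.com/about"));

        System.out.println(crawlAll(urlsByDomain, fake));
    }
}
```

Passing the fetcher in as an interface keeps the per-domain logic testable without real HTTP traffic; whether the actual classes are wired this way is an assumption of the sketch.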