mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Crawling Process

The crawling process downloads HTML documents and saves them into per-domain snapshots.

Central Classes

CrawlerMain orchestrates the crawling.
CrawlerRetreiver visits known addresses from a domain and downloads each document.
HttpFetcher fetches a URL.
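
The division of labour between these three classes can be illustrated with a minimal sketch. It is not the project's actual code: `Fetcher`, `DomainRetriever` and `CrawlOrchestrator` are hypothetical stand-ins for the roles of HttpFetcher, CrawlerRetreiver and CrawlerMain, and the in-memory map used as a "snapshot" is an assumption made only to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the role of HttpFetcher: fetches a single URL.
interface Fetcher {
    String fetch(String url);
}

// Hypothetical stand-in for the role of CrawlerRetreiver:
// visits the known addresses of one domain and downloads each document.
class DomainRetriever {
    private final Fetcher fetcher;

    DomainRetriever(Fetcher fetcher) {
        this.fetcher = fetcher;
    }

    // Returns the snapshot for one domain: URL -> document body, in visit order.
    Map<String, String> crawl(List<String> knownUrls) {
        Map<String, String> snapshot = new LinkedHashMap<>();
        for (String url : knownUrls) {
            snapshot.put(url, fetcher.fetch(url));
        }
        return snapshot;
    }
}

// Hypothetical stand-in for the role of CrawlerMain:
// runs one retriever per domain and collects one snapshot per domain.
class CrawlOrchestrator {
    static Map<String, Map<String, String>> crawlAll(
            Map<String, List<String>> urlsByDomain, Fetcher fetcher) {
        Map<String, Map<String, String>> snapshots = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : urlsByDomain.entrySet()) {
            DomainRetriever retriever = new DomainRetriever(fetcher);
            snapshots.put(entry.getKey(), retriever.crawl(entry.getValue()));
        }
        return snapshots;
    }
}
```

Keeping the fetcher behind an interface in this sketch means the per-domain logic can be exercised with a stub instead of real network access; whether the real classes are wired this way is not stated here.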