MarginaliaSearch/code/processes/crawling-process/java/nu/marginalia/crawl/retreival
Viktor Lofgren 02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To ease the pressure on domains with lots of subdomains, such as substack, medium, and neocities, a per-domain mutex is added that limits crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
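The commit message above describes a per-domain mutex. A minimal sketch of that idea follows; it is not the contents of DomainLocks.java. The class name `PerDomainLocks`, the method names, and the "last two labels" rule for grouping subdomains are all assumptions made for illustration.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch of a per-domain mutex; not the actual DomainLocks.java. */
class PerDomainLocks {
    // One single-permit semaphore per grouping key, created lazily.
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** Handle that releases the lock when closed; usable in try-with-resources. */
    interface Handle extends AutoCloseable {
        @Override
        void close();
    }

    /** Blocks until no other thread holds the lock for this hostname's domain. */
    Handle lock(String hostname) throws InterruptedException {
        Semaphore sem = locks.computeIfAbsent(groupingKey(hostname), k -> new Semaphore(1));
        sem.acquire();
        return sem::release;
    }

    /**
     * Assumption: subdomains are grouped by their last two labels, so
     * "alice.substack.com" and "bob.substack.com" share one lock.
     * This naive rule mishandles suffixes such as "co.uk"; a real
     * implementation would need a public-suffix-aware lookup.
     */
    static String groupingKey(String hostname) {
        String host = hostname.toLowerCase(Locale.ROOT);
        String[] parts = host.split("\\.");
        if (parts.length <= 2) {
            return host;
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }
}
```

A crawler thread would wrap its work as `try (var handle = locks.lock("alice.substack.com")) { ... }`, so the permit is released even if the crawl throws. A semaphore is used here instead of a `ReentrantLock` so the permit could be released from a different thread than the one that acquired it; whether the real code needs that is unknown.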
fetcher (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
revisit (crawler) Adjust revisit logic 2024-07-16 15:12:38 +02:00
sitemap (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
Cookies.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
CrawlDataReference.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
CrawlDelayTimer.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
CrawledDocumentFactory.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
CrawlerRetreiver.java (crawler) Add crawl delays around probe call and deal with 429s properly during this phase 2024-07-16 15:33:24 +02:00
CrawlerWarcResynchronizer.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
DomainCrawlFrontier.java (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
DomainLocks.java (crawler) Add a per-domain mutex for crawling 2024-07-16 16:44:59 +02:00
DomainProber.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
LinkFilterSelector.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
RateLimitException.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00