MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Viktor Lofgren 89db69d360 (crawler) Correct feed URLs in domain state db	2024-12-26 15:18:31 +01:00
Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
revisit	(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris	2024-10-15 17:27:59 +02:00
sitemap	(crawler) Improved feed discovery, new domain state db per crawlset	2024-12-26 15:05:52 +01:00
CrawlDataReference.java	(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets	2024-12-11 17:01:52 +01:00
CrawlDelayTimer.java	(live-crawler) Crude first-try process for live crawling #WIP	2024-11-19 19:35:01 +01:00
CrawlerRetreiver.java	(crawler) Correct feed URLs in domain state db	2024-12-26 15:18:31 +01:00
CrawlerWarcResynchronizer.java	(crawler) Refactor	2024-09-23 17:51:07 +02:00
DomainCrawlFrontier.java	(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris	2024-10-15 17:27:59 +02:00
DomainProber.java	(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl	2024-09-24 15:08:22 +02:00