MarginaliaSearch/code/processes/crawling-process/java/nu/marginalia/crawl
Viktor Lofgren 89db69d360 (crawler) Correct feed URLs in domain state db
Discovered feed URLs were given a double slash after their domain name in the DB.  This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
..
fetcher (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:13:17 +01:00
logic (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
retreival (crawler) Correct feed URLs in domain state db 2024-12-26 15:18:31 +01:00
warc (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents 2024-11-21 16:00:09 +01:00
CrawlerMain.java (crawler) Improved feed discovery, new domain state db per crawlset 2024-12-26 15:05:52 +01:00
CrawlerModule.java (chore) Remove lombok 2024-11-11 21:14:38 +01:00
DomainStateDb.java (crawler) Improved feed discovery, new domain state db per crawlset 2024-12-26 15:05:52 +01:00