MarginaliaSearch/code/processes/crawling-process/java/nu/marginalia/crawl
Viktor Lofgren 4342e42722 (crawler) Fast detection and bail-out for crawler traps
Nepenthes has been doing the rounds on social media, so this adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. The out-of-the-box crawl limits also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
..
fetcher (crawler) Fast detection and bail-out for crawler traps 2025-01-17 13:02:57 +01:00
logic (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
retreival (crawler) Correct feed URLs in domain state db 2024-12-26 15:18:31 +01:00
warc (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants 2024-11-21 16:00:09 +01:00
CrawlerMain.java (crawler) Improved feed discovery, new domain state db per crawlset 2024-12-26 15:05:52 +01:00
CrawlerModule.java (chore) Remove lombok 2024-11-11 21:14:38 +01:00
DomainStateDb.java (crawler) Improved feed discovery, new domain state db per crawlset 2024-12-26 15:05:52 +01:00