Commit Graph

5 Commits

a557c7ae7f  Viktor Lofgren  2024-12-23 23:31:03 +01:00
(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler
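
As a rough illustration of what per-domain concurrency limiting might look like, here is a minimal sketch: a registry of per-domain semaphores that a fetch task must acquire before contacting a host. The class and method names (DomainLocks, lockDomain) and the permit count are assumptions made for this example, not the actual API of the main crawler.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch of a per-domain lock registry; the real DomainLocks
// class in the main crawler may differ in naming and behaviour.
class DomainLocks {
    // At most two concurrent connections per domain in this sketch.
    private static final int PERMITS_PER_DOMAIN = 2;

    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** Blocks until a fetch slot for the given domain is available. */
    public Semaphore lockDomain(String domain) throws InterruptedException {
        var sem = locks.computeIfAbsent(domain.toLowerCase(),
                k -> new Semaphore(PERMITS_PER_DOMAIN));
        sem.acquire();
        return sem;
    }
}

// Usage in a fetch loop (illustrative):
//   var lock = domainLocks.lockDomain("www.example.com");
//   try { fetch(url); } finally { lock.release(); }
```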

88caca60f9  Viktor Lofgren  2024-11-23 17:07:16 +01:00
(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list
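
A minimal sketch of the flagging idea, assuming an in-memory set stands in for the persistent bad-URLs table: URLs rejected by robots.txt are remembered, so the next daily run skips them without re-fetching robots.txt for a link list that would end up empty anyway.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Stand-in for the "bad URLs" bookkeeping described in the commit;
// the real implementation presumably persists this in a database table.
class RobotsFilterSketch {
    private final Set<String> badUrls = new HashSet<>();

    /** Returns the subset of candidate URLs that may be fetched. */
    List<String> filter(List<String> candidates, Predicate<String> robotsAllows) {
        List<String> fetchable = new ArrayList<>();
        for (String url : candidates) {
            if (badUrls.contains(url))
                continue;                 // previously flagged, skip without touching robots.txt
            if (!robotsAllows.test(url)) {
                badUrls.add(url);         // flag it so future runs skip it immediately
                continue;
            }
            fetchable.add(url);
        }
        return fetchable;
    }
}
```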

552b246099  Viktor Lofgren  2024-11-22 14:15:32 +01:00
(live-crawl) Improve error handling during robots.txt retrieval
Reduce log spam, and don't treat errors other than 404 as "all is permitted".
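
The retrieval policy the commit message describes could look roughly like the sketch below. The class, enum, and method names are assumptions; the point is that only a 404 is read as "everything is permitted", while other failures are handled conservatively and without a noisy stack trace per domain.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch of the described policy with assumed names; the real code
// presumably hands a 200 response body to a robots.txt rules parser.
class RobotsFetchSketch {
    enum Verdict { ALLOW_ALL, PARSE_RULES, DENY_ALL }

    private final HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    Verdict fetchRobotsTxt(String domain) {
        var request = HttpRequest.newBuilder(URI.create("https://" + domain + "/robots.txt"))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        try {
            var response = client.send(request, HttpResponse.BodyHandlers.ofString());
            return switch (response.statusCode()) {
                case 404 -> Verdict.ALLOW_ALL;   // missing robots.txt: everything is permitted
                case 200 -> Verdict.PARSE_RULES; // hand response.body() to the rules parser
                default  -> Verdict.DENY_ALL;    // 5xx etc.: assume nothing is permitted
            };
        }
        catch (Exception e) {
            // One terse log line here instead of a stack trace per failing domain
            // keeps the log spam down; treat the failure as "nothing permitted".
            return Verdict.DENY_ALL;
        }
    }
}
```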

52eb5bc84f  Viktor Lofgren  2024-11-22 00:55:46 +01:00
(live-crawler) Keep track of bad URLs
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are added, on a dice roll, to a bad URLs table that prevents further attempts at fetching them.
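
A minimal sketch of the "dice roll" bookkeeping, assuming an in-memory set and a made-up probability in place of the real table and tuning: only a fraction of fetch failures get a URL flagged, presumably so a single transient error is unlikely to blacklist a URL outright.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of the dice-roll behaviour described in the commit;
// the probability and the persistence mechanism are assumptions, not the
// values used by the actual live crawler.
class BadUrlTrackerSketch {
    private static final double BAD_URL_PROBABILITY = 1.0 / 6.0; // the "dice roll"

    private final Set<String> badUrls = new HashSet<>();

    /** Call when a URL fails to fetch correctly. */
    void onFetchFailure(String url) {
        // Only a fraction of failures are remembered, so repeated failures
        // are eventually flagged while one-off glitches usually are not.
        if (ThreadLocalRandom.current().nextDouble() < BAD_URL_PROBABILITY) {
            badUrls.add(url);
        }
    }

    /** Check before fetching; flagged URLs are not attempted again. */
    boolean shouldSkip(String url) {
        return badUrls.contains(url);
    }
}
```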

89d8af640d  Viktor Lofgren  2024-11-20 15:55:15 +01:00
(live-crawl) Rename the live crawler code module to be more consistent with the other processes