Commit Graph

4 Commits

Author SHA1 Message Date
Viktor Lofgren
0ca43f0c9c (live-crawler) Improve live crawler short-circuit logic
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! Waiting makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
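
The short-circuit described in this commit could look roughly like the sketch below. It is an illustration only, not the project's code: `crawl_domain`, `fetch_robots` and `fetch_url` are hypothetical names, and the robots.txt rules are modelled as a plain predicate on URLs.

```python
from typing import Callable, Iterable, List


def crawl_domain(
    urls_to_fetch: Iterable[str],
    fetch_robots: Callable[[], Callable[[str], bool]],
    fetch_url: Callable[[str], str],
) -> List[str]:
    """Fetch the pending URLs for one domain (hypothetical helper).

    fetch_robots performs the robots.txt request and returns an
    "is this URL allowed?" predicate; fetch_url downloads one document.
    """
    pending = list(urls_to_fetch)

    # Short-circuit: with nothing to fetch, return before making
    # the robots.txt request at all.
    if not pending:
        return []

    # Only now pay for the robots.txt round trip.
    is_allowed = fetch_robots()
    return [fetch_url(url) for url in pending if is_allowed(url)]
```

The point of the ordering is that the cheap local check (is there anything to fetch?) runs before the expensive remote one (robots.txt), so domains with no pending work cost no requests.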
Viktor Lofgren
927bc0b63c (live-crawler) Add Accept-Encoding: gzip to outbound requests
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.

The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
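
As a rough illustration of what this commit describes, the sketch below sends `Accept-Encoding: gzip` and decodes the body only when the server actually answers with `Content-Encoding: gzip`. It uses Python's standard library rather than the project's own HTTP client, and `build_request`, `decode_body` and `fetch` are hypothetical names.

```python
import gzip
import urllib.request
from typing import Optional


def build_request(url: str) -> urllib.request.Request:
    """Create a request that advertises gzip support to the server."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(raw: bytes, content_encoding: Optional[str]) -> bytes:
    """Decompress the body if the server said it is gzip-encoded.

    A server may ignore Accept-Encoding, so the response header,
    not the request header, decides whether to decompress.
    """
    if content_encoding and content_encoding.strip().lower() == "gzip":
        return gzip.decompress(raw)
    return raw


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Fetch a URL and return the decoded body (performs a network call)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Keeping the decoding step separate from the network call makes it easy to exercise with canned bytes, without contacting a server.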
Viktor Lofgren
52eb5bc84f (live-crawler) Keep track of bad URLs
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
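
One way to picture the "dice roll" approach is the sketch below. It is a minimal in-memory illustration under stated assumptions: the commit does not give the probability, the storage schema, or which failures count, so `BadUrlTable`, its `probability` parameter, the use of a set in place of a database table, and treating `OSError` as a fetch failure are all hypothetical.

```python
import random
from typing import Callable, Optional, Set


class BadUrlTable:
    """In-memory stand-in for a bad URLs table (hypothetical)."""

    def __init__(self, probability: float, rng: Optional[random.Random] = None):
        self.probability = probability  # assumed chance of blocking per failure
        self.rng = rng or random.Random()
        self.bad: Set[str] = set()

    def is_bad(self, url: str) -> bool:
        return url in self.bad

    def record_failure(self, url: str) -> None:
        # The dice roll: only some failures land the URL in the table,
        # so a single transient error usually does not block it.
        if self.rng.random() < self.probability:
            self.bad.add(url)


def try_fetch(
    url: str, table: BadUrlTable, fetch: Callable[[str], str]
) -> Optional[str]:
    """Fetch url unless it is known bad; record failures on a dice roll."""
    if table.is_bad(url):
        return None  # skip without making a request
    try:
        return fetch(url)
    except OSError:
        table.record_failure(url)
        return None
```

Under this scheme a URL that fails repeatedly is very likely to end up blocked eventually, while one that fails once has a good chance of being retried.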
Viktor Lofgren
89d8af640d (live-crawl) Rename the live crawler code module to be more consistent with the other processes
2024-11-20 15:55:15 +01:00