Commit Graph

7 Commits

Author SHA1 Message Date
Viktor Lofgren
88caca60f9 (live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list 2024-11-23 17:07:16 +01:00
Viktor Lofgren
552b246099 (live-crawl) Improve error handling for errors during robots.txt-retrieval
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
Viktor Lofgren
b941604135 (live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with. 2024-11-22 13:58:57 +01:00
Viktor Lofgren
52eb5bc84f (live-crawler) Keep track of bad URLs
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are on a dice roll added to a bad URLs table, that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
Viktor Lofgren
51e46ad2b0 (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.

While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform.  It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
f94911541a (live-crawl) Reduce the risk of id collisions with the main indexes
This is done by applying a large constant offset to the ordinals for the live crawled documents.  The chosen value still permits upto 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
2024-11-20 16:01:10 +01:00
Viktor Lofgren
89d8af640d (live-crawl) Rename the live crawler code module to be more consistent with the other processes 2024-11-20 15:55:15 +01:00