Commit Graph

63 Commits

Author SHA1 Message Date
Viktor Lofgren
70e2e41955 (crawler) Content type prober should not swallow exceptions 2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc (crawler) Modify crawl set growth to grow small domains faster than larger ones 2024-04-27 17:36:27 +02:00
Viktor Lofgren
7eb5e6aa66 (crawler) Abort recrawl if error count is too high 2024-04-24 21:46:40 +02:00
Viktor Lofgren
8b9629f2f6 (crawler) Remove unnecessary double-fetch of the root document 2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313 (crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber 2024-04-24 14:10:03 +02:00
Viktor Lofgren
dcf9d9caad (crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001 (crawler) Remove accidental log spam 2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f (crawler) Code quality 2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872 (crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe (crawler) Strip W/-prefix from the etag when supplied as If-None-Match 2024-04-22 14:31:05 +02:00
Viktor Lofgren
d816f048f5 (crawler) Ensure all appropriate headers are recorded on the request 2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036 (crawler/converter) Remove legacy junk from parquet migration 2024-04-22 12:34:28 +02:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
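
Illustration for a28c6d7cfe (Strip W/-prefix from the etag when supplied as If-None-Match): the sketch below shows one way such a normalization could look. It is not the crawler's actual code. The function names `strip_weak_prefix` and `conditional_headers` are hypothetical and exist only for this example. The assumption is that the ETag stored from an earlier response may carry the weak-validator marker `W/`, and that the recrawl request sends the bare quoted value instead.

```python
from typing import Dict, Optional


def strip_weak_prefix(etag: str) -> str:
    """Return the ETag without a leading weak-validator marker (W/).

    Hypothetical helper: the marker is case-sensitive, so only an
    upper-case "W/" at the very start is removed.
    """
    etag = etag.strip()
    if etag.startswith("W/"):
        return etag[2:]
    return etag


def conditional_headers(etag: Optional[str] = None,
                        last_modified: Optional[str] = None) -> Dict[str, str]:
    """Build conditional request headers for a recrawl (hypothetical helper).

    Only the validators that are actually known are included.
    """
    headers: Dict[str, str] = {}
    if etag:
        headers["If-None-Match"] = strip_weak_prefix(etag)
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers
```

A caller would pass whatever validators were recorded on the previous fetch, for example `conditional_headers(etag='W/"v1"')`, and merge the result into the outgoing request headers.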