Viktor Lofgren
4d29581ea4
(crawler) Introduce absolute upper limit to crawl depth growth
2024-07-16 14:40:45 +02:00
Viktor Lofgren
d86926be5f
(crawl) Add new functionality for re-crawling a single domain
2024-07-05 15:31:55 +02:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
Viktor Lofgren
70e2e41955
(crawler) Content type prober should not swallow exceptions
2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc
(crawler) Modify crawl set growth to grow small domains faster than larger ones
2024-04-27 17:36:27 +02:00
Viktor Lofgren
7eb5e6aa66
(crawler) Abort recrawl if error count is too high
2024-04-24 21:46:40 +02:00
Viktor Lofgren
8b9629f2f6
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:10:03 +02:00
Viktor Lofgren
dcf9d9caad
(crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001
(crawler) Remove accidental log spam
2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f
(crawler) Code quality
2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains.
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
2024-04-22 14:31:05 +02:00
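For illustration only, the weak-validator stripping described in a28c6d7cfe could look roughly like the sketch below. The class and method names (`ETagNormalizer`, `stripWeakPrefix`) are hypothetical and not taken from the repository; the only assumption is that the stored ETag may carry the case-sensitive `W/` marker that HTTP uses for weak validators.

```java
// Hypothetical helper, not the crawler's actual code.
public final class ETagNormalizer {
    private ETagNormalizer() {}

    /** Returns the ETag without a leading weak-validator marker ("W/"). */
    public static String stripWeakPrefix(String etag) {
        if (etag == null) {
            return null;
        }
        String trimmed = etag.trim();
        // The weak marker is exactly "W/" (case-sensitive) per the HTTP spec.
        if (trimmed.startsWith("W/")) {
            return trimmed.substring(2);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // Value that would be sent as If-None-Match for a weak ETag.
        System.out.println(stripWeakPrefix("W/\"abc123\""));
        // Strong ETags pass through unchanged.
        System.out.println(stripWeakPrefix("\"abc123\""));
    }
}
```

The quoted opaque tag is left intact, so the result can be placed directly in an `If-None-Match` header.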
Viktor Lofgren
d816f048f5
(crawler) Ensure all appropriate headers are recorded on the request
2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036
(crawler/converter) Remove legacy junk from parquet migration
2024-04-22 12:34:28 +02:00
Viktor Lofgren
1d34224416
(refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
2024-02-23 16:13:40 +01:00