Commit Graph

214 Commits

Author SHA1 Message Date
Viktor Lofgren
8862100f7e (crawler) Improve logging and error handling 2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de (crawler) Smarter parquet->slop crawl data migration 2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d Fix refactoring gore 2025-01-21 15:08:04 +01:00
Viktor Lofgren
4c74e280d3 (crawler) Fix urlencoding in sitemap fetcher 2025-01-21 13:33:35 +01:00
Viktor Lofgren
5b347e17ac (crawler) Automatically migrate to slop from parquet when crawling 2025-01-21 13:33:14 +01:00
Viktor Lofgren
55d6ab933f Merge branch 'master' into slop-crawl-data-spike 2025-01-21 13:32:58 +01:00
Viktor Lofgren
43b74e9706 (crawler) Fix exception handler and resource leak in WarcRecorder 2025-01-20 23:45:28 +01:00
Viktor Lofgren
579a115243 (crawler) Reduce log spam from error handling in new sitemap fetcher 2025-01-20 23:17:13 +01:00
Viktor Lofgren
4e939389b2 (crawler) New Jsoup based sitemap parser 2025-01-20 14:37:44 +01:00
Viktor Lofgren
e67a9bdb91 (crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead. 2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237 (crawler) Fast detection and bail-out for crawler traps
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722 (crawler) Fast detection and bail-out for crawler traps
Nephentes has been doing the rounds in social media, adding an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly.  Out of the box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
bae44497fe (crawler) Add a new system property crawler.maxFetchSize
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca (crawler) Do not remove W/-prefix on weak e-tags
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c (live-crawler) Improve live crawler short-circuit logic
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch!  This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
89db69d360 (crawler) Correct feed URLs in domain state db
Discovered feed URLs were given a double slash after their domain name in the DB.  This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004 (crawler) Improved feed discovery, new domain state db per crawlset
Feed discover is improved with by probing a few likely endpoints when no feed link tag is provided.  To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.

Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:13:17 +01:00
Viktor Lofgren
47e58a21c6 Refactor documentBody method and ContentType charset handling
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1 Switch to new Slop format for crawl data storage and processing.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
e65d75a0f9 (crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets 2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d (link-parser) Filter out URLs with binary file suffixes in LinkParser
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
14519294d2 Merge branch 'master' into live-search 2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0 (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.

While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform.  It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3 (model) Fix resource leak in partially read crawl data streams.
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
79ce4de2ab (model) Remove deprecated fields from CrawledDocument and CrawledDomain 2024-11-20 15:27:05 +01:00
Viktor Lofgren
a91ab4c203 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but an dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
dbb8bcdd8e (crawler) Use a better hashInt implementation in CrawlDataReference
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8 (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70 (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556 (crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains 2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b (link-parser) Make mailing list blocking optional 2024-10-15 13:48:32 +02:00
Viktor Lofgren
fe800b3af7 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:49:39 +02:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae (crawler, EXPERIMENT) Disable content type probing and use Accept header instead
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
4565bfe359 (crawler) Make the crawler report crawling progress correctly when stopped and resumed. 2024-09-26 18:30:29 +02:00
Viktor Lofgren
40512511af (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor Lofgren
162fc25ebc (minor) Fix accidental commit errors 2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c (crawler) Refactor
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong
2024-09-23 17:51:07 +02:00
Viktor Lofgren
9c292a4f62 (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
a3b0189934 Fix build errors after merge 2024-09-08 10:22:32 +02:00