MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	8862100f7e	(crawler) Improve logging and error handling	2025-01-21 21:44:21 +01:00
Viktor Lofgren	274941f6de	(crawler) Smarter parquet->slop crawl data migration	2025-01-21 21:26:12 +01:00
Viktor Lofgren	abec83582d	Fix refactoring gore	2025-01-21 15:08:04 +01:00
Viktor Lofgren	4c74e280d3	(crawler) Fix urlencoding in sitemap fetcher	2025-01-21 13:33:35 +01:00
Viktor Lofgren	5b347e17ac	(crawler) Automatically migrate to slop from parquet when crawling	2025-01-21 13:33:14 +01:00
Viktor Lofgren	55d6ab933f	Merge branch 'master' into slop-crawl-data-spike	2025-01-21 13:32:58 +01:00
Viktor Lofgren	43b74e9706	(crawler) Fix exception handler and resource leak in WarcRecorder	2025-01-20 23:45:28 +01:00
Viktor Lofgren	579a115243	(crawler) Reduce log spam from error handling in new sitemap fetcher	2025-01-20 23:17:13 +01:00
Viktor Lofgren	4e939389b2	(crawler) New Jsoup based sitemap parser	2025-01-20 14:37:44 +01:00
Viktor Lofgren	e67a9bdb91	(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.	2025-01-19 15:07:11 +01:00
Viktor Lofgren	567e4e1237	(crawler) Fast detection and bail-out for crawler traps Improve logging and exclude robots.txt from this logic.	2025-01-18 15:28:54 +01:00
Viktor Lofgren	4342e42722	(crawler) Fast detection and bail-out for crawler traps Nepenthes has been doing the rounds in social media, adding an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. Out of the box crawl limits will also deal with this type of attack, but this fix is faster.	2025-01-17 13:02:57 +01:00
Viktor Lofgren	bae44497fe	(crawler) Add a new system property crawler.maxFetchSize This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.	2024-12-30 15:10:11 +01:00
Viktor Lofgren	0d59202aca	(crawler) Do not remove W/-prefix on weak e-tags The server expects to get them back prefixed, as we received them.	2024-12-27 20:56:42 +01:00
Viktor Lofgren	0ca43f0c9c	(live-crawler) Improve live crawler short-circuit logic We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.	2024-12-27 20:54:42 +01:00
Viktor Lofgren	89db69d360	(crawler) Correct feed URLs in domain state db Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.	2024-12-26 15:18:31 +01:00
Viktor Lofgren	895cee7004	(crawler) Improved feed discovery, new domain state db per crawlset Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered. Solves issue #135	2024-12-26 15:05:52 +01:00
Viktor Lofgren	4bb71b8439	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:26:23 +01:00
Viktor Lofgren	e4a41f7dd1	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:13:17 +01:00
Viktor Lofgren	47e58a21c6	Refactor documentBody method and ContentType charset handling Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.	2024-12-17 17:11:37 +01:00
Viktor Lofgren	3714104976	Add loader for slop data in converter. Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.	2024-12-17 15:40:24 +01:00
Viktor Lofgren	f6f036b9b1	Switch to new Slop format for crawl data storage and processing. Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.	2024-12-15 19:34:03 +01:00
Viktor Lofgren	b510b7feb8	Spike for storing crawl data in slop instead of parquet This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds. On-disk size is virtually identical.	2024-12-15 15:49:47 +01:00
Viktor Lofgren	e65d75a0f9	(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets	2024-12-11 17:01:52 +01:00
Viktor Lofgren	3b99cffb3d	(link-parser) Filter out URLs with binary file suffixes in LinkParser Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.	2024-12-11 16:42:47 +01:00
Viktor Lofgren	14519294d2	Merge branch 'master' into live-search	2024-11-21 16:00:20 +01:00
Viktor Lofgren	51e46ad2b0	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.	2024-11-21 16:00:09 +01:00
Viktor Lofgren	665c8831a3	(model) Fix resource leak in partially read crawl data streams. Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.	2024-11-20 19:29:13 +01:00
Viktor Lofgren	79ce4de2ab	(model) Remove deprecated fields from CrawledDocument and CrawledDomain	2024-11-20 15:27:05 +01:00
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	a5b4951f23	(chore) Remove use of deprecated STR.-style string templates	2024-11-11 18:02:28 +01:00
Viktor Lofgren	dbb8bcdd8e	(crawler) Use a better hashInt implementation in CrawlDataReference Guava's hash functions are slow as hell.	2024-10-15 18:25:55 +02:00
Viktor Lofgren	7305afa0f8	(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris	2024-10-15 17:27:59 +02:00
Viktor Lofgren	481f999b70	(crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full. Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.	2024-10-15 14:22:40 +02:00
Viktor Lofgren	4b16022556	(crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains	2024-10-15 14:21:59 +02:00
Viktor Lofgren	89dd201a7b	(link-parser) Make mailing list blocking optional	2024-10-15 13:48:32 +02:00
Viktor Lofgren	fe800b3af7	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 19:04:49 +02:00
Viktor Lofgren	2a1077ff43	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:57:27 +02:00
Viktor Lofgren	01a16ff388	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:55:59 +02:00
Viktor Lofgren	eb60ddb729	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:49:39 +02:00
Viktor Lofgren	d84a2c183f	(*) Remove the crawl spec abstraction The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.	2024-10-03 13:41:17 +02:00
Viktor Lofgren	ecb5eedeae	(crawler, EXPERIMENT) Disable content type probing and use Accept header instead There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.	2024-09-30 14:53:01 +02:00
Viktor Lofgren	4565bfe359	(crawler) Make the crawler report crawling progress correctly when stopped and resumed.	2024-09-26 18:30:29 +02:00
Viktor Lofgren	40512511af	(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl This code is still a bit too complex, but it's slowly getting better.	2024-09-24 15:08:22 +02:00
Viktor Lofgren	162fc25ebc	(minor) Fix accidental commit errors	2024-09-23 18:03:09 +02:00
Viktor Lofgren	e9854f194c	(crawler) Refactor * Restructure the code to make a bit more sense * Store full headers in crawl data * Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong	2024-09-23 17:51:07 +02:00
Viktor Lofgren	9c292a4f62	(doc) Fix outdated links in documentation	2024-09-22 13:56:17 +02:00
Viktor Lofgren	8047e77757	(doc) Correct dead links and stale information in the docs	2024-09-13 11:01:05 +02:00
Viktor Lofgren	a3b0189934	Fix build errors after merge	2024-09-08 10:22:32 +02:00


214 Commits