MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	55d6ab933f	Merge branch 'master' into slop-crawl-data-spike	2025-01-21 13:32:58 +01:00
Viktor Lofgren	e67a9bdb91	(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.	2025-01-19 15:07:11 +01:00
Viktor Lofgren	bae44497fe	(crawler) Add a new system property crawler.maxFetchSize. This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.	2024-12-30 15:10:11 +01:00
Viktor Lofgren	0ca43f0c9c	(live-crawler) Improve live crawler short-circuit logic. We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.	2024-12-27 20:54:42 +01:00
Viktor Lofgren	927bc0b63c	(live-crawler) Add Accept-Encoding: gzip to outbound requests. This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data. The change addresses issue #136, save for making the fetcher's requests conditional.	2024-12-27 03:59:34 +01:00
Viktor Lofgren	a557c7ae7f	(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler	2024-12-23 23:31:03 +01:00
Viktor Lofgren	3714104976	Add loader for slop data in converter. Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.	2024-12-17 15:40:24 +01:00
Viktor Lofgren	c5d657ef98	(live-crawler) Flag live crawled documents with a special keyword	2024-12-10 13:42:10 +01:00
Viktor Lofgren	88caca60f9	(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list	2024-11-23 17:07:16 +01:00
Viktor Lofgren	552b246099	(live-crawl) Improve error handling for errors during robots.txt retrieval. Reduce log-spam and don't treat errors other than 404 as "all is permitted".	2024-11-22 14:15:32 +01:00
Viktor Lofgren	b941604135	(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.	2024-11-22 13:58:57 +01:00
Viktor Lofgren	52eb5bc84f	(live-crawler) Keep track of bad URLs. To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.	2024-11-22 00:55:46 +01:00
Viktor Lofgren	51e46ad2b0	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants. Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them. Some effort was also put toward making the process boot sequence a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.	2024-11-21 16:00:09 +01:00
Viktor Lofgren	f94911541a	(live-crawl) Reduce the risk of id collisions with the main indexes. This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.	2024-11-20 16:01:10 +01:00
Viktor Lofgren	89d8af640d	(live-crawl) Rename the live crawler code module to be more consistent with the other processes	2024-11-20 15:55:15 +01:00

15 Commits