MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	2ea34767d8	(crawler) Use the response URL when resolving relative links The crawler was incorrectly using the request URL as the base URL when resolving relative links. This caused problems when encountering redirects. For example if we fetch /log, redirecting to /log/ and find links to foo/, and bar/; these would resolve to /foo and /bar, and not /log/foo and /log/bar.	2025-01-31 12:40:13 +01:00
Viktor Lofgren	c8b0a32c0f	(crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams	2025-01-26 15:40:17 +01:00
Viktor Lofgren	98a340a0d1	(crawler) Add favicon data to domain state db in its own table	2025-01-22 11:41:20 +01:00
Viktor Lofgren	55d6ab933f	Merge branch 'master' into slop-crawl-data-spike	2025-01-21 13:32:58 +01:00
Viktor Lofgren	4e939389b2	(crawler) New Jsoup based sitemap parser	2025-01-20 14:37:44 +01:00
Viktor Lofgren	89db69d360	(crawler) Correct feed URLs in domain state db Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.	2024-12-26 15:18:31 +01:00
Viktor Lofgren	895cee7004	(crawler) Improved feed discovery, new domain state db per crawlset Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered. Solves issue #135	2024-12-26 15:05:52 +01:00
Viktor Lofgren	3714104976	Add loader for slop data in converter. Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.	2024-12-17 15:40:24 +01:00
Viktor Lofgren	e65d75a0f9	(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets	2024-12-11 17:01:52 +01:00
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	dbb8bcdd8e	(crawler) Use a better hashInt implementation in CrawlDataReference Guava's hash functions are slow as hell.	2024-10-15 18:25:55 +02:00
Viktor Lofgren	7305afa0f8	(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris	2024-10-15 17:27:59 +02:00
Viktor Lofgren	481f999b70	(crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full. Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.	2024-10-15 14:22:40 +02:00
Viktor Lofgren	eb60ddb729	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:49:39 +02:00
Viktor Lofgren	d84a2c183f	(*) Remove the crawl spec abstraction The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.	2024-10-03 13:41:17 +02:00
Viktor Lofgren	ecb5eedeae	(crawler, EXPERIMENT) Disable content type probing and use Accept header instead There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.	2024-09-30 14:53:01 +02:00
Viktor Lofgren	40512511af	(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl This code is still a bit too complex, but it's slowly getting better.	2024-09-24 15:08:22 +02:00
Viktor Lofgren	e9854f194c	(crawler) Refactor * Restructure the code to make a bit more sense * Store full headers in crawl data * Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong	2024-09-23 17:51:07 +02:00
Viktor Lofgren	a3b0189934	Fix build errors after merge	2024-09-08 10:22:32 +02:00
Viktor Lofgren	8f367d96f8	Merge branch 'master' into term-positions # Conflicts: # code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java # code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java # code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java # code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java # code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java # code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java	2024-09-08 10:14:43 +02:00
Viktor Lofgren	8d0f9652c7	(crawler) Correct RSS-sitemap behavior	2024-08-31 11:38:34 +02:00
Viktor Lofgren	5353805cc6	(crawler) Correct RSS-sitemap behavior	2024-08-31 11:37:09 +02:00
Viktor Lofgren	5407da5650	(crawler) Grab favicons as part of root sniff	2024-08-31 11:32:56 +02:00
Viktor Lofgren	285e657f68	Merge branch 'master' into term-positions # Conflicts: # code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java	2024-07-31 10:44:01 +02:00
Viktor Lofgren	ec600b967d	(crawler) Adjust domain locking Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.	2024-07-27 11:54:46 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	accc598967	(crawler) Add 1 second pause after probing domain to reduce request pressure	2024-07-16 16:55:07 +02:00
Viktor Lofgren	02c4a2d4ba	(crawler) Add a per-domain mutex for crawling To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.	2024-07-16 16:44:59 +02:00
Viktor Lofgren	6665e447aa	(crawler) Add crawl delays around probe call and deal with 429:s properly during this phase	2024-07-16 15:33:24 +02:00
Viktor Lofgren	f4d79c203d	(crawler) Adjust revisit logic The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed. Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.	2024-07-16 15:12:38 +02:00
Viktor Lofgren	4d29581ea4	(crawler) Introduce absolute upper limit to crawl depth growth	2024-07-16 14:40:45 +02:00
Viktor Lofgren	0ffbbaf4b9	(crawler) Update WARC builder to use SHA-256 for digests	2024-06-12 09:14:12 +02:00
Viktor Lofgren	6839415a0b	(crawler) Fetch TLS instead of SSL context	2024-06-12 09:07:54 +02:00
Viktor Lofgren	b4eac2516e	(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results	2024-06-02 16:30:34 +02:00
Viktor Lofgren	70e2e41955	(crawler) Content type prober should not swallow exceptions	2024-04-27 18:27:23 +02:00
Viktor Lofgren	7eb5e6aa66	(crawler) Abort recrawl if error count is too high	2024-04-24 21:46:40 +02:00
Viktor Lofgren	8b9629f2f6	(crawler) Remove unnecessary double-fetch of the root document	2024-04-24 14:38:59 +02:00
Viktor Lofgren	f6db16b313	(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber	2024-04-24 14:10:03 +02:00
Viktor Lofgren	dcf9d9caad	(crawler) Emulate if-modified-since for domains that don't support the header This will help reduce the strain on some server software, in particular Discourse.	2024-04-22 17:26:31 +02:00
Viktor Lofgren	7a69b76001	(crawler) Remove accidental log spam	2024-04-22 15:51:37 +02:00
Viktor Lofgren	ac07ef822f	(crawler) Code quality	2024-04-22 15:37:35 +02:00
Viktor Lofgren	e7d4bcd872	(crawler) Use the probe-result to reduce the likelihood of crawling both http and https This should drastically reduce the number of fetched documents on many domains	2024-04-22 15:36:43 +02:00
Viktor Lofgren	a28c6d7cfe	(crawler) Strip W/-prefix from the etag when supplied as If-None-Match	2024-04-22 14:31:05 +02:00
Viktor Lofgren	d816f048f5	(crawler) Ensure all appropriate headers are recorded on the request	2024-04-22 14:14:24 +02:00
Viktor Lofgren	b09ddd0036	(crawler/converter) Remove legacy junk from parquet migration	2024-04-22 12:34:28 +02:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
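
The relative-link issue described in commit 2ea34767d8 can be illustrated with a minimal sketch. This is not the crawler's actual code: the class name, the `resolve` helper, and the `www.example.com` host are assumptions made for illustration, and the JDK's `java.net.URI` stands in for whatever URL type the crawler uses. It only shows why the base URL matters when a request to /log is redirected to /log/.

```java
import java.net.URI;

// Hypothetical illustration; not taken from the MarginaliaSearch code base.
class RelativeLinkResolution {

    // Resolve a link found in a document against the given base URL.
    static URI resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href);
    }

    public static void main(String[] args) {
        // URL that was requested, and the URL the server redirected to.
        String requestUrl = "https://www.example.com/log";
        String responseUrl = "https://www.example.com/log/";

        // Using the request URL as the base drops the "log" segment.
        System.out.println(resolve(requestUrl, "foo/"));
        // prints "https://www.example.com/foo/"

        // Using the response URL as the base keeps the link under /log/.
        System.out.println(resolve(responseUrl, "foo/"));
        // prints "https://www.example.com/log/foo/"
    }
}
```

The difference comes from standard relative reference resolution: the last path segment of the base is discarded before the relative path is appended, so a base without the trailing slash loses "log".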

47 Commits