MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	c8b0a32c0f	(crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams	2025-01-26 15:40:17 +01:00
Viktor Lofgren	74a1f100f4	(converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream	2025-01-26 14:46:50 +01:00
Viktor Lofgren	98a340a0d1	(crawler) Add favicon data to domain state db in its own table	2025-01-22 11:41:20 +01:00
Viktor Lofgren	55d6ab933f	Merge branch 'master' into slop-crawl-data-spike	2025-01-21 13:32:58 +01:00
Viktor Lofgren	4e939389b2	(crawler) New Jsoup based sitemap parser	2025-01-20 14:37:44 +01:00
Viktor Lofgren	e67a9bdb91	(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.	2025-01-19 15:07:11 +01:00
Viktor Lofgren	895cee7004	(crawler) Improved feed discovery, new domain state db per crawlset. Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered. Solves issue #135	2024-12-26 15:05:52 +01:00
Viktor Lofgren	4bb71b8439	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:26:23 +01:00
Viktor Lofgren	3714104976	Add loader for slop data in converter. Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.	2024-12-17 15:40:24 +01:00
Viktor Lofgren	e65d75a0f9	(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets	2024-12-11 17:01:52 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok. There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	7305afa0f8	(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris	2024-10-15 17:27:59 +02:00
Viktor Lofgren	fe800b3af7	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 19:04:49 +02:00
Viktor Lofgren	2a1077ff43	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:57:27 +02:00
Viktor Lofgren	eb60ddb729	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:49:39 +02:00
Viktor Lofgren	d84a2c183f	(*) Remove the crawl spec abstraction. The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.	2024-10-03 13:41:17 +02:00
Viktor Lofgren	40512511af	(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl. This code is still a bit too complex, but it's slowly getting better.	2024-09-24 15:08:22 +02:00
Viktor Lofgren	e9854f194c	(crawler) Refactor: restructure the code to make a bit more sense; store full headers in crawl data; fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong	2024-09-23 17:51:07 +02:00
Viktor Lofgren	a3b0189934	Fix build errors after merge	2024-09-08 10:22:32 +02:00
Viktor Lofgren	8f367d96f8	Merge branch 'master' into term-positions # Conflicts: # code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java # code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java # code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java # code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java # code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java # code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java	2024-09-08 10:14:43 +02:00
Viktor Lofgren	5407da5650	(crawler) Grab favicons as part of root sniff	2024-08-31 11:32:56 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data. Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	b09ddd0036	(crawler/converter) Remove legacy junk from parquet migration	2024-04-22 12:34:28 +02:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00

24 Commits