MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	88caca60f9	(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list	2024-11-23 17:07:16 +01:00
Viktor Lofgren	923ebbac81	(feeds) Add logic to handle URI fragments in feed items Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.	2024-11-23 16:38:56 +01:00
Viktor	df298df852	Merge pull request #125 from MarginaliaSearch/live-search Add near real-time crawling from RSS feeds to supplement the slower batch based crawls	2024-11-22 16:38:37 +00:00
Viktor Lofgren	552b246099	(live-crawl) Improve error handling for errors during robots.txt-retrieval Reduce log-spam and don't treat errors other than 404 as "all is permitted".	2024-11-22 14:15:32 +01:00
Viktor Lofgren	80e6d0069c	(live-crawl-actor) Clear index journal before starting live crawl This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.	2024-11-22 14:04:57 +01:00
Viktor Lofgren	b941604135	(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.	2024-11-22 13:58:57 +01:00
Viktor Lofgren	52eb5bc84f	(live-crawler) Keep track of bad URLs To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are on a dice roll added to a bad URLs table, that prevents further attempts at fetching them.	2024-11-22 00:55:46 +01:00
Viktor Lofgren	4d23fe6261	(feeds) Simplify RSS User-Agent header Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.	2024-11-21 16:43:56 +01:00
Viktor Lofgren	14519294d2	Merge branch 'master' into live-search	2024-11-21 16:00:20 +01:00
Viktor Lofgren	51e46ad2b0	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.	2024-11-21 16:00:09 +01:00
Viktor Lofgren	665c8831a3	(model) Fix resource leak in partially read crawl data streams. Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.	2024-11-20 19:29:13 +01:00
Viktor Lofgren	47dfbacb00	(conf) Introduce a new concept of node profiles Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.	2024-11-20 18:15:22 +01:00
Viktor Lofgren	f94911541a	(live-crawl) Reduce the risk of id collisions with the main indexes This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.	2024-11-20 16:01:10 +01:00
Viktor Lofgren	89d8af640d	(live-crawl) Rename the live crawler code module to be more consistent with the other processes	2024-11-20 15:55:15 +01:00
Viktor Lofgren	6e4252cf4c	(live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing. Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.	2024-11-20 15:36:25 +01:00
Viktor Lofgren	79ce4de2ab	(model) Remove deprecated fields from CrawledDocument and CrawledDomain	2024-11-20 15:27:05 +01:00
Viktor Lofgren	d6575dfee4	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 21:00:18 +01:00
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	6a3079a167	(search) Fix missing getter for proto	2024-11-18 21:05:22 +01:00
Viktor Lofgren	c728a1e2f2	(rss) Add endpoint for extracting URLs changed within a timespan.	2024-11-18 14:59:32 +01:00
Viktor Lofgren	d874d76a09	(rss) Add an endpoint that can be used for identifying when RSS data has changed	2024-11-18 14:22:17 +01:00
Viktor Lofgren	70bc8831f5	(test) Fix excludeTags	2024-11-17 20:07:49 +01:00
Viktor Lofgren	41c11be075	(status) Clean up the status page a bit	2024-11-17 20:00:44 +01:00
Viktor Lofgren	163ce19846	(test) Tag status service endpoint tests as flaky These tests have outside dependencies that inherently makes them unreliable and unsuitable for CI.	2024-11-17 19:48:01 +01:00
Viktor Lofgren	9eb16cb667	(test) Remove tests from fast suite Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI. Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.	2024-11-17 19:45:59 +01:00
Viktor Lofgren	af40fa327b	(status-service) Correct measurement pruning to use correct sqlite datetimes, as to not delete the database	2024-11-17 18:35:34 +01:00
Viktor Lofgren	cf6d28e71e	(status-service) Enable auto-commit	2024-11-17 18:25:15 +01:00
Viktor Lofgren	3791ea1e18	(service) Add a new application service for external liveness monitoring The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.	2024-11-17 18:01:08 +01:00
Viktor	34258b92d1	Merge pull request #124 from MarginaliaSearch/jdk-23+delombok Friendship with lombok over, now JDK 23 is my best friend	2024-11-16 14:00:49 +00:00
Viktor Lofgren	e5db3f11e1	(chore) Clean up some of the uglier delomboking artifacts	2024-11-15 13:57:20 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	a5b4951f23	(chore) Remove use of deprecated STR.-style string templates	2024-11-11 18:02:28 +01:00
Viktor Lofgren	8b8bf0748f	(feature-extraction) Add new DocumentHeaders class encapsulating Html headers. Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.	2024-11-11 13:26:15 +01:00
Viktor	5cc71ae586	Merge pull request #123 from MarginaliaSearch/vlofgren-patch-1 Update ROADMAP.md	2024-11-10 18:57:49 +01:00
Viktor	33fcfe4b63	Update ROADMAP.md	2024-11-10 18:57:15 +01:00
Viktor	a31a3b53c4	Merge pull request #122 from MarginaliaSearch/fetch-rss-feeds Automatic RSS feed polling	2024-11-10 18:35:28 +01:00
Viktor Lofgren	a456ec9599	(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished	2024-11-10 18:30:28 +01:00
Viktor Lofgren	a2bc9a98c0	(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished	2024-11-10 17:45:20 +01:00
Viktor Lofgren	e24a98390c	(feed) Update API to allow specifying clean vs refresh update Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.	2024-11-09 18:43:47 +01:00
Viktor Lofgren	6f858cd627	(feed) Decrease update interval to 24 hours	2024-11-09 18:17:51 +01:00
Viktor Lofgren	a293266ccd	(feed) Wipe the feeds db and start over from system URLs periodically.	2024-11-09 18:17:16 +01:00
Viktor Lofgren	b8e0dc93d7	(search) Correctly show the feeds view when items are present ... otherwise show samples. This commit also removes the (Experimental) bit, as this is getting fairly mature.	2024-11-09 17:56:43 +01:00
Viktor Lofgren	d774c39031	(feeds) Reduce log spam	2024-11-09 17:56:43 +01:00
Viktor Lofgren	ab17af99da	(feeds) Refresh the feed db using the previous db, when it is available.	2024-11-09 17:56:43 +01:00
Viktor Lofgren	b0ac3c586f	(feeds) Correct parallelism using SimpleBlockingThreadPool	2024-11-09 17:56:43 +01:00
Viktor Lofgren	139fa85b18	(feeds) Add working heartbeat tracking progress	2024-11-09 17:56:43 +01:00
Viktor Lofgren	bfeb9a4538	(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service	2024-11-09 17:56:43 +01:00
Viktor	3d6c79ae5f	Merge pull request #121 from MarginaliaSearch/headless-setup Headless deterministic setup	2024-11-08 13:50:54 +01:00
Viktor Lofgren	c9e9f73ea9	(setup) Break out installation action into non-interactive script	2024-11-08 13:38:40 +01:00
Viktor Lofgren	80e482b155	(setup) Add progress bar to downloads for better feedback	2024-11-08 13:38:40 +01:00


2538 Commits