MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	55aeb03c4a	(feeds) Replace rssreader based parsing with a custom jsoup based rss parser This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.	2025-01-09 18:29:55 +01:00
Viktor Lofgren	faa589962f	(live-capture) Browserless now requires a token	2025-01-09 14:51:11 +01:00
Viktor Lofgren	3da8337ba6	(feeds) Add system property for exporting fetched feeds to a slop table for debugging	2025-01-08 20:49:16 +01:00
Viktor Lofgren	3772bfd387	(query) Fix handling of optional ranking parameters	2025-01-08 17:11:22 +01:00
Viktor Lofgren	a1fb92468f	(refac) Remove ResultRankingParameters, QueryLimits class and use protobuf classes directly instead This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-servi->query-service->index-service communication chain.	2025-01-08 16:15:57 +01:00
Viktor Lofgren	983d6d067c	(search-service) Add indexing indicator to sibling domains listing	2025-01-08 12:58:34 +01:00
Viktor Lofgren	a84a06975c	(ranking-params) Add disable penalties flag to ranking params This will help debugging ranking issues. Later it may be added to some filters.	2025-01-08 00:16:49 +01:00
Viktor Lofgren	7c90b6b414	(query) Don't blindly make tokens containing a colon into a non-ranking advice term	2025-01-07 15:18:05 +01:00
Viktor Lofgren	2315fdc731	(search) Vendor rssreader and modify it to be able to consume the nlnet atom feed Also dial down the logging a bit for the rssreader package.	2025-01-06 17:58:50 +01:00
Viktor Lofgren	8dde502cc9	Merge branch 'master' into serp-redesign	2025-01-05 23:33:35 +01:00
Viktor Lofgren	3e66767af3	(search) Adjust query parsing to trim tokens in quoted search terms Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path. This solves issue #143.	2025-01-05 23:33:09 +01:00
Viktor Lofgren	8c69dc31b8	Merge branch 'master' into serp-redesign	2025-01-05 18:52:51 +01:00
Viktor Lofgren	4da3563d8a	(service) Clean up exceptions when requestScreengrab is not available	2025-01-04 14:45:51 +01:00
Viktor Lofgren	594df64b20	(domain-info) Use appropriate sqlite database when fetching feed status	2025-01-02 20:20:36 +01:00
Viktor Lofgren	06efb5abfc	Merge branch 'master' into serp-redesign	2025-01-02 18:42:12 +01:00
Viktor Lofgren	78eb1417a7	(service) Only block on SingleNodeChannelPool creation in QueryClient The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason. This should make most services start faster as a result.	2025-01-02 18:42:01 +01:00
Viktor Lofgren	8c8f2ad5ee	(search) Add an indicator when a link has a feed in the similar/linked domains views	2025-01-02 18:11:57 +01:00
Viktor Lofgren	67edc8f90d	(domain-info) Only flag domains with rss feed items as having a feed	2025-01-02 17:41:52 +01:00
Viktor Lofgren	5f576b7d0c	(query-parser) Strip leading underlines This addresses issue #140, where __builtin_ffs gives no results.	2025-01-02 14:39:03 +01:00
Viktor Lofgren	0376f2e6e3	Merge branch 'master' into serp-redesign # Conflicts: # code/services-application/search-service/resources/templates/search/index/index.hdb	2025-01-01 18:15:09 +01:00
Viktor Lofgren	0b65164f60	(chore) Fix broken test	2025-01-01 18:06:29 +01:00
Viktor Lofgren	9be477de33	(domain-info) Add a feed flag to domain info This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.	2025-01-01 18:02:33 +01:00
Viktor Lofgren	710af4999a	(feed-fetcher) Add &quot; entity mapping in feed fetcher	2025-01-01 15:45:17 +01:00
Viktor Lofgren	baeb4a46cd	(search) Reintroduce query rewriting for recipes, add rules for wikis and forums	2024-12-31 16:05:00 +01:00
Viktor Lofgren	3bc99639a0	(feed-fetcher) Make feed fetcher requests conditional Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary. A new table was added to the FeedDb to hold one etag per domain. If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received. This completes the changes for Issue #136.	2024-12-27 15:10:15 +01:00
Viktor Lofgren	927bc0b63c	(live-crawler) Add Accept-Encoding: gzip to outbound requests This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data. The change addresses issue #136, save for making the fetcher's requests conditional.	2024-12-27 03:59:34 +01:00
Viktor Lofgren	41a59dcf45	(feed) Sanitize illegal HTML entities out of the feed XML before parsing	2024-12-25 14:53:28 +01:00
Viktor Lofgren	2b222efa75	Merge branch 'master' into serp-redesign	2024-12-25 14:22:42 +01:00
Viktor Lofgren	94d4d2edb7	(live-crawler) Add refresh date to feeds API For now this is just the ctime for the feeds db. We may want to store this per-record in the future.	2024-12-25 14:20:48 +01:00
Viktor Lofgren	b66879ccb1	(feed) Add support for date discovery through atom:issued and atom:created This is specifically to help parse monadnock.net's Atom feed.	2024-12-23 20:05:58 +01:00
Viktor Lofgren	0da2047eae	(live-capture) Correctly update processed count, disable poll rate adjustment based on freshness.	2024-12-23 15:56:27 +01:00
Viktor Lofgren	5ca8523220	(math) Reduce log error spam from null unit conversions	2024-12-21 18:51:45 +01:00
Viktor Lofgren	8c963bd4ba	(feeds) Remove Content-Encoding: gzip from feed fetcher We don't support decompressing gzip, so this just gives us errors at this point should the server support it.	2024-12-18 22:23:44 +01:00
Viktor Lofgren	6a079c1c75	(feeds) Add per-domain throttling for feed fetcher.	2024-12-18 22:06:46 +01:00
Viktor Lofgren	2dc9f2e639	(feeds) Make feed XML parsing more lenient ... by consuming BOM markers and leading whitespace.	2024-12-18 17:18:41 +01:00
Viktor Lofgren	b66fb9caf6	(feeds) Improve error handling in the feed fetcher.	2024-12-18 17:02:13 +01:00
Viktor Lofgren	eab61cd48a	Merge branch 'master' into serp-redesign	2024-12-11 17:09:27 +01:00
Viktor Lofgren	cf7f84f033	(rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking	2024-12-10 22:04:12 +01:00
Viktor Lofgren	f3382b5bd8	(search) Completely remove all old hdb templates Create new views for conversion results, dictionary results, and site crosstalk.	2024-12-10 15:04:49 +01:00
Viktor Lofgren	f050bf5c4c	(WIP) Initial semi-working transformation to new tailwind UI Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod. There's also a lot of polish remaining everywhere, dead links, etc.	2024-12-05 14:00:17 +01:00
Viktor Lofgren	c97c66a41c	(ranking) Reduce the verbatim score multiplier	2024-11-28 13:37:11 +01:00
Viktor Lofgren	923ebbac81	(feeds) Add logic to handle URI fragments in feed items Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.	2024-11-23 16:38:56 +01:00
Viktor Lofgren	4d23fe6261	(feeds) Simplify RSS User-Agent header Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.	2024-11-21 16:43:56 +01:00
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	c728a1e2f2	(rss) Add endpoint for extracting URLs changed within a timespan.	2024-11-18 14:59:32 +01:00
Viktor Lofgren	d874d76a09	(rss) Add an endpoint that can be used for identifying when RSS data has changed	2024-11-18 14:22:17 +01:00
Viktor Lofgren	9eb16cb667	(test) Remove tests from fast suite Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI. Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.	2024-11-17 19:45:59 +01:00
Viktor Lofgren	e5db3f11e1	(chore) Clean up some of the uglier delomboking artifacts	2024-11-15 13:57:20 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	a5b4951f23	(chore) Remove use of deprecated STR.-style string templates	2024-11-11 18:02:28 +01:00


173 Commits