Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.
This solves issue #143.
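A minimal sketch of the idea, with illustrative names rather than the actual query parser classes: quoted terms are normalized the same way the index side already normalizes them, by dropping the possessive suffix.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of stripping possessive suffixes from quoted query terms.
 *  Names are hypothetical, not the actual query parser classes. */
class PossessiveSuffixNormalizer {
    /** Strip a trailing 's (or bare apostrophe) so that quoted terms match
     *  the suffix-free form stored in the index. */
    static String stripPossessive(String word) {
        if (word.endsWith("'s") || word.endsWith("\u2019s")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("'") || word.endsWith("\u2019")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    static List<String> normalizeQuotedPhrase(List<String> words) {
        List<String> out = new ArrayList<>(words.size());
        for (String w : words) {
            out.add(stripPossessive(w));
        }
        return out;
    }

    public static void main(String[] args) {
        // "bob's burgers" should now match documents indexed as "bob burgers"
        System.out.println(normalizeQuotedPhrase(List.of("bob's", "burgers")));
    }
}
```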
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.
This should make most services start faster as a result.
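One way to avoid that kind of stall, sketched here with hypothetical types rather than the actual client code, is to move the readiness wait off the start-up path and only block the first caller that actually needs the connection.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only: instead of blocking for up to 5 seconds during
 *  construction while the remote end becomes available, start the wait
 *  asynchronously. RemoteEndpoint and awaitReady() are hypothetical stand-ins. */
class LazyRemoteClient {
    private final CompletableFuture<RemoteEndpoint> endpoint;

    LazyRemoteClient(RemoteEndpoint remote) {
        // Start-up no longer stalls; the readiness wait happens off-thread.
        endpoint = CompletableFuture.supplyAsync(() -> {
            remote.awaitReady(5, TimeUnit.SECONDS);
            return remote;
        });
    }

    RemoteEndpoint get() {
        return endpoint.join();  // blocks only callers that actually need the remote
    }

    interface RemoteEndpoint {
        void awaitReady(long timeout, TimeUnit unit);
    }
}
```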
Also correct DbDomainQueries.getDomainId so that it throws NoSuchElementException when the domain id is missing, rather than an UncheckedExecutionException wrapped by the Cache.
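A sketch of the intended behaviour, assuming a Guava Cache in front of the database lookup; the SQL query itself is stubbed out here.

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.util.concurrent.UncheckedExecutionException;

import java.util.NoSuchElementException;
import java.util.OptionalInt;
import java.util.concurrent.ExecutionException;

/** Sketch: unwrap the cache's exception so callers see NoSuchElementException. */
class DomainIdLookup {
    private final Cache<String, Integer> cache = CacheBuilder.newBuilder()
            .maximumSize(10_000)
            .build();

    int getDomainId(String domainName) {
        try {
            return cache.get(domainName, () -> queryDomainId(domainName)
                    .orElseThrow(() -> new NoSuchElementException("Unknown domain: " + domainName)));
        }
        catch (UncheckedExecutionException e) {
            // Guava wraps unchecked loader exceptions; rethrow the real cause
            if (e.getCause() instanceof NoSuchElementException missing) {
                throw missing;
            }
            throw e;
        }
        catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
    }

    private OptionalInt queryDomainId(String domainName) {
        return OptionalInt.empty();  // placeholder for the actual SQL lookup
    }
}
```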
This gives the live crawler and the big boy crawler the same upper limit, though the live crawler will reject items that are too large, while the big crawler will truncate them at that point.
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.
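A sketch of the reordered check, with hypothetical method names: decide whether there is anything to fetch at all before paying for a robots.txt request.

```java
import java.util.List;

/** Illustrative sketch; the actual crawler code is structured differently. */
class LiveCrawlOrdering {
    void crawlDomain(String domain, List<String> pendingUrls, RobotsFetcher robots) {
        if (pendingUrls.isEmpty()) {
            return;  // nothing to do; don't fetch robots.txt at all
        }
        var rules = robots.fetch(domain);
        for (String url : pendingUrls) {
            if (rules.isAllowed(url)) {
                // ... fetch the document
            }
        }
    }

    interface RobotsFetcher { Rules fetch(String domain); }
    interface Rules { boolean isAllowed(String url); }
}
```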
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
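A rough sketch of the conditional fetch, using java.net.http for illustration; the real fetcher and etag storage are structured differently, but the header logic follows the same idea.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

/** Sketch of a conditional feed fetch with If-None-Match / If-Modified-Since. */
class ConditionalFeedFetch {
    private final HttpClient client = HttpClient.newHttpClient();

    Optional<String> fetch(URI feedUrl, String previousEtag, ZonedDateTime feedDbCreated)
            throws IOException, InterruptedException {

        HttpRequest.Builder builder = HttpRequest.newBuilder(feedUrl).GET();

        if (previousEtag != null) {
            builder.header("If-None-Match", previousEtag);
        }
        // The feed db creation date is the earliest point an update could have
        // been recorded, so it doubles as the If-Modified-Since cutoff.
        builder.header("If-Modified-Since",
                feedDbCreated.withZoneSameInstant(ZoneOffset.UTC)
                        .format(DateTimeFormatter.RFC_1123_DATE_TIME));

        HttpResponse<String> rsp = client.send(builder.build(),
                HttpResponse.BodyHandlers.ofString());

        if (rsp.statusCode() == 304) {
            return Optional.empty();  // not modified; reuse the stored feed
        }

        rsp.headers().firstValue("ETag")
           .ifPresent(etag -> storeEtagForDomain(feedUrl.getHost(), etag));

        return Optional.of(rsp.body());
    }

    private void storeEtagForDomain(String domain, String etag) {
        // placeholder for the one-etag-per-domain table in the FeedDb
    }
}
```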
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
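For illustration, a minimal sketch of the request header and decompression step using java.net.http; the actual clients may be structured differently.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.zip.GZIPInputStream;

/** Sketch of requesting gzip-encoded responses and decoding them. */
class GzipFetch {
    private final HttpClient client = HttpClient.newHttpClient();

    byte[] fetch(URI url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(url)
                .header("Accept-Encoding", "gzip")
                .GET()
                .build();

        HttpResponse<byte[]> rsp = client.send(request,
                HttpResponse.BodyHandlers.ofByteArray());

        boolean gzipped = rsp.headers()
                .firstValue("Content-Encoding")
                .map("gzip"::equalsIgnoreCase)
                .orElse(false);

        if (!gzipped) {
            return rsp.body();
        }

        // Server honoured Accept-Encoding: gzip, so decompress before use
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(rsp.body()))) {
            return in.readAllBytes();
        }
    }
}
```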
Discovered feed URLs were stored in the DB with a double slash after their domain name. The extra slash is removed by the URL normalizer, so the URLs were still viable, but the commit fixes the issue regardless.
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset, holding a simple summary of the crawl job, including any feed URLs that have been discovered.
Solves issue #135
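A sketch of the fallback probing, with hypothetical names and an assumed list of candidate paths; the real code runs inside the crawler and records hits in the per-crawlset sqlite database.

```java
import java.util.List;
import java.util.Optional;

/** Illustrative sketch of probing likely feed endpoints when no link tag exists. */
class FeedEndpointProber {
    // Common feed locations to try when the page has no feed <link> tag
    private static final List<String> LIKELY_FEED_PATHS = List.of(
            "/rss.xml", "/atom.xml", "/feed.xml", "/feed", "/rss", "/index.xml");

    Optional<String> probe(String baseUrl, FeedRequester requester) {
        for (String path : LIKELY_FEED_PATHS) {
            // join carefully to avoid the double slash that bit the discovered URLs
            String candidate = baseUrl.endsWith("/")
                    ? baseUrl.substring(0, baseUrl.length() - 1) + path
                    : baseUrl + path;
            if (requester.looksLikeFeed(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    /** Hypothetical interface; in practice this would issue a real request
     *  and inspect the content type and body. */
    interface FeedRequester {
        boolean looksLikeFeed(String url);
    }
}
```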
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
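A sketch of the charset-fallback and lazy-decoding idea, with illustrative names rather than the actual CrawledDocument fields.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Sketch: resolve the declared charset if the JVM supports it, otherwise
 *  fall back to UTF-8, and only decode the raw byte[] when a String view is
 *  actually requested. */
class LazyDocumentBody {
    private final byte[] rawBytes;
    private final String declaredCharset;
    private String decoded;  // lazily populated

    LazyDocumentBody(byte[] rawBytes, String declaredCharset) {
        this.rawBytes = rawBytes;
        this.declaredCharset = declaredCharset;
    }

    String documentBody() {
        if (decoded == null) {
            decoded = new String(rawBytes, resolveCharset(declaredCharset));
        }
        return decoded;
    }

    static Charset resolveCharset(String name) {
        if (name == null || name.isBlank()) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(name);
        }
        catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Unsupported or bogus charset declarations fall back to UTF-8
            return StandardCharsets.UTF_8;
        }
    }
}
```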
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality and updates to the handling and writing of crawl data, and introduces support for Slop in domain readers and converters.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
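A sketch of the preference ordering, using a stand-in record for the result type; note that the comparator only decides which duplicate survives, it does not reorder the user-facing results.

```java
import java.util.Comparator;

/** Illustrative sketch: HTTPS beats HTTP, and a proper domain name beats a
 *  raw IP address, when choosing which duplicate to keep. */
class DuplicatePreference {
    record ResultUrl(String scheme, String host) {}

    static final Comparator<ResultUrl> PREFER_KEEPING = Comparator
            .comparing((ResultUrl u) -> "https".equalsIgnoreCase(u.scheme()) ? 0 : 1)
            .thenComparing(u -> isRawIp(u.host()) ? 1 : 0);

    static boolean isRawIp(String host) {
        // crude check; good enough to demote e.g. 192.0.2.7 below example.com
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }
}
```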
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
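For illustration only, assuming the setting is read as a system property with that exact name.

```java
/** Hypothetical sketch of reading the thread count with the new default of 16. */
class ValuationSettings {
    static int valuationThreads() {
        return Integer.getInteger("index.valuationThreads", 16);
    }
}
```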
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
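A sketch of a suffix-based filter; the exact extension list in the crawler may differ.

```java
import java.util.Set;

/** Illustrative sketch of skipping obviously non-HTML links during link parsing. */
class BinarySuffixFilter {
    private static final Set<String> BINARY_SUFFIXES = Set.of(
            ".jpg", ".jpeg", ".png", ".gif", ".webp",
            ".zip", ".gz", ".tar", ".pdf", ".exe", ".mp3", ".mp4");

    static boolean shouldCrawl(String urlPath) {
        String lower = urlPath.toLowerCase();
        for (String suffix : BINARY_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;  // skip binary content, it won't parse as HTML anyway
            }
        }
        return true;
    }
}
```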
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help with searching only for the root document of a website; neat stuff ahead :D
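A sketch of the idea; the flag enum and keyword plumbing here are assumed rather than taken from the codebase.

```java
import java.util.function.BiConsumer;

/** Illustrative sketch: tag root documents with a synthetic keyword. */
class RootDocumentFlagger {
    enum WordFlags { Synthetic }

    void addSyntheticTerms(String urlPath, BiConsumer<String, WordFlags> addKeyword) {
        if ("/".equals(urlPath)) {
            // lets queries target only the root document of each site
            addKeyword.accept("special:root", WordFlags.Synthetic);
        }
    }
}
```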
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.