MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	0ca43f0c9c	(live-crawler) Improve live crawler short-circuit logic We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.	2024-12-27 20:54:42 +01:00
Viktor Lofgren	3bc99639a0	(feed-fetcher) Make feed fetcher requests conditional Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary. A new table was added to the FeedDb to hold one etag per domain. If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received. This completes the changes for Issue #136.	2024-12-27 15:10:15 +01:00
Viktor Lofgren	927bc0b63c	(live-crawler) Add Accept-Encoding: gzip to outbound requests This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data. The change addresses issue #136, save for making the fetcher's requests conditional.	2024-12-27 03:59:34 +01:00
Viktor Lofgren	d968801dc1	(converter) Drop feed data from SlopDomainRecord Also remove feed extraction from converter. This is the crawler's responsibility now.	2024-12-26 17:57:08 +01:00
Viktor Lofgren	89db69d360	(crawler) Correct feed URLs in domain state db Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.	2024-12-26 15:18:31 +01:00
Viktor Lofgren	895cee7004	(crawler) Improved feed discovery, new domain state db per crawlset Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered. Solves issue #135	2024-12-26 15:05:52 +01:00
Viktor Lofgren	4bb71b8439	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:26:23 +01:00
Viktor Lofgren	e4a41f7dd1	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:13:17 +01:00
Viktor	69ad6287b1	Update ROADMAP.md	2024-12-25 21:16:38 +00:00
Viktor Lofgren	41a59dcf45	(feed) Sanitize illegal HTML entities out of the feed XML before parsing	2024-12-25 14:53:28 +01:00
Viktor Lofgren	94d4d2edb7	(live-crawler) Add refresh date to feeds API For now this is just the ctime for the feeds db. We may want to store this per-record in the future.	2024-12-25 14:20:48 +01:00
Viktor Lofgren	7ae19a92ba	(deploy) Improve deployment script to allow specification of partitions	2024-12-24 11:16:15 +01:00
Viktor Lofgren	56d14e56d7	(live-crawler) Improve LiveCrawlActor resilience to FeedService outages	2024-12-23 23:33:54 +01:00
Viktor Lofgren	a557c7ae7f	(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler	2024-12-23 23:31:03 +01:00
Viktor Lofgren	b66879ccb1	(feed) Add support for date discovery through atom:issued and atom:created This is specifically to help parse monadnock.net's Atom feed.	2024-12-23 20:05:58 +01:00
Viktor Lofgren	f1b7157ca2	(deploy) Add basic linting ability to deployment script.	2024-12-23 16:21:29 +01:00
Viktor Lofgren	7622335e84	(deploy) Correct deploy script, set correct name for assistant	2024-12-23 15:59:02 +01:00
Viktor Lofgren	0da2047eae	(live-capture) Correctly update processed count, disable poll rate adjustment based on freshness.	2024-12-23 15:56:27 +01:00
Viktor Lofgren	5ee4321110	(ci) Correct deploy script	2024-12-22 20:08:37 +01:00
Viktor Lofgren	9459b9933b	(ci) Correct deploy script	2024-12-22 19:40:32 +01:00
Viktor Lofgren	87fb564f89	(ci) Add script for automatic deployment based on git tags	2024-12-22 19:24:54 +01:00
Viktor Lofgren	5ca8523220	(math) Reduce log error spam from null unit conversions	2024-12-21 18:51:45 +01:00
Viktor Lofgren	1118657ffd	(system) Supply local IP to service discovery if multiFace is enabled	2024-12-19 22:20:19 +01:00
Viktor Lofgren	b1f970152d	(system) To support configurations with multiple docker networks, bind to the "most local" interface. Make the behavior optional.	2024-12-19 20:26:31 +01:00
Viktor Lofgren	e1783891ab	(system) To support configurations with multiple docker networks, bind to the "most local" interface.	2024-12-19 20:18:57 +01:00
Viktor Lofgren	64d32471dd	(deploy) Deploy executor test	2024-12-19 17:45:47 +01:00
Viktor Lofgren	232cc465d9	(deploy) Deploy executor test	2024-12-19 17:35:38 +01:00
Viktor Lofgren	8c963bd4ba	(feeds) Remove Content-Encoding: gzip from feed fetcher We don't support decompressing gzip, so this just gives us errors at this point should the server support it.	2024-12-18 22:23:44 +01:00
Viktor Lofgren	6a079c1c75	(feeds) Add per-domain throttling for feed fetcher.	2024-12-18 22:06:46 +01:00
Viktor Lofgren	2dc9f2e639	(feeds) Make feed XML parsing more lenient ... by consuming BOM markers and leading whitespace.	2024-12-18 17:18:41 +01:00
Viktor Lofgren	b66fb9caf6	(feeds) Improve error handling in the feed fetcher.	2024-12-18 17:02:13 +01:00
Viktor Lofgren	eb2fe18867	(sideload) Add LSH generation for sideloaded StackExchange data Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.	2024-12-13 02:10:52 +01:00
Viktor Lofgren	a7468c8d23	(converter) Ensure paths are created for converter batch writer	2024-12-13 01:35:07 +01:00
Viktor Lofgren	fb2beb1eac	(converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data	2024-12-13 01:19:30 +01:00
Viktor Lofgren	0fb03e3d62	(export) Add logging to AtagExporter for error handling	2024-12-12 22:54:32 +01:00
Viktor Lofgren	67db3f295e	(index) Revert some optimization changes	2024-12-12 22:14:24 +01:00
Viktor Lofgren	dafaab3ef7	(index) Additional optimization pass	2024-12-12 18:57:33 +01:00
Viktor Lofgren	3f11ca409f	(index) Increase thread limit and optimize search result handling Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.	2024-12-12 17:07:06 +01:00
Viktor Lofgren	694eed79ef	(index) Increase thread limit and optimize search result handling Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.	2024-12-12 15:32:31 +01:00
Viktor Lofgren	4220169119	(index) Increase thread limit and optimize search result handling Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.	2024-12-12 15:31:11 +01:00
Viktor Lofgren	0a53ac68a0	Add specialization for steam store and GOG	2024-12-11 18:32:45 +01:00
Viktor Lofgren	e65d75a0f9	(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets	2024-12-11 17:01:52 +01:00
Viktor Lofgren	3b99cffb3d	(link-parser) Filter out URLs with binary file suffixes in LinkParser Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.	2024-12-11 16:42:47 +01:00
Viktor Lofgren	a97c05107e	Add synthetic meta flag for root path documents If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D	2024-12-11 16:10:44 +01:00
Viktor Lofgren	5002870d1f	(converter) Refactor sideloaders to improve feature handling and keyword logic Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.	2024-12-11 16:01:38 +01:00
Viktor Lofgren	73861e613f	(ranking) Downtune score boost for unordered heading matches	2024-12-11 15:44:29 +01:00
Viktor Lofgren	461bc3eb1a	(generator) Add special workaround to flag fextralife as a wiki	2024-12-10 22:22:52 +01:00
Viktor Lofgren	cf7f84f033	(rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking	2024-12-10 22:04:12 +01:00
Viktor Lofgren	9fc82574f0	(fingerprint) Add FluxGarden as a wiki generator #130	2024-12-10 13:51:42 +01:00
Viktor	589f4dafb9	Merge pull request #129 from MarginaliaSearch/atags-counts (WIP) Improve atag sentence matching	2024-12-10 12:42:34 +00:00

2568 Commits