MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	abec83582d	Fix refactoring gore	2025-01-21 15:08:04 +01:00
Viktor Lofgren	088310e998	(converter) Improve simple processing performance There was a regression introduced in the recent slop migration changes in the performance of the simple conversion track. This reverts the issue.	2025-01-21 14:13:33 +01:00
Viktor Lofgren	4c74e280d3	(crawler) Fix urlencoding in sitemap fetcher	2025-01-21 13:33:35 +01:00
Viktor Lofgren	5b347e17ac	(crawler) Automatically migrate to slop from parquet when crawling	2025-01-21 13:33:14 +01:00
Viktor Lofgren	55d6ab933f	Merge branch 'master' into slop-crawl-data-spike	2025-01-21 13:32:58 +01:00
Viktor Lofgren	43b74e9706	(crawler) Fix exception handler and resource leak in WarcRecorder	2025-01-20 23:45:28 +01:00
Viktor Lofgren	579a115243	(crawler) Reduce log spam from error handling in new sitemap fetcher	2025-01-20 23:17:13 +01:00
Viktor Lofgren	78a958e2b0	(crawler) Fix broken test that started failing after the search engine moved to a new domain	2025-01-20 18:52:14 +01:00
Viktor Lofgren	4e939389b2	(crawler) New Jsoup based sitemap parser	2025-01-20 14:37:44 +01:00
Viktor Lofgren	e67a9bdb91	(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.	2025-01-19 15:07:11 +01:00
Viktor Lofgren	567e4e1237	(crawler) Fast detection and bail-out for crawler traps Improve logging and exclude robots.txt from this logic.	2025-01-18 15:28:54 +01:00
Viktor Lofgren	4342e42722	(crawler) Fast detection and bail-out for crawler traps Nepenthes has been doing the rounds in social media, adding an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. Out of the box crawl limits will also deal with this type of attack, but this fix is faster.	2025-01-17 13:02:57 +01:00
Viktor Lofgren	59e2dd4c26	(specialization) Soften length requirements for wiki-specialized documents (incl. cppreference)	2025-01-07 15:41:30 +01:00
Viktor Lofgren	ca1807caae	(specialization) Add new specialization for cppreference.com Give this reference website some synthetically generated tokens to improve the likelihood of a good match.	2025-01-07 15:41:05 +01:00
Viktor Lofgren	26c20e18ac	(keyword-extraction) Soften constraints on keyword patterns, allowing for longer segmented words	2025-01-07 15:20:50 +01:00
Viktor Lofgren	bae44497fe	(crawler) Add a new system property crawler.maxFetchSize This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.	2024-12-30 15:10:11 +01:00
Viktor Lofgren	0d59202aca	(crawler) Do not remove W/-prefix on weak e-tags The server expects to get them back prefixed, as we received them.	2024-12-27 20:56:42 +01:00
Viktor Lofgren	0ca43f0c9c	(live-crawler) Improve live crawler short-circuit logic We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.	2024-12-27 20:54:42 +01:00
Viktor Lofgren	927bc0b63c	(live-crawler) Add Accept-Encoding: gzip to outbound requests This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data. The change addresses issue #136, save for making the fetcher's requests conditional.	2024-12-27 03:59:34 +01:00
Viktor Lofgren	d968801dc1	(converter) Drop feed data from SlopDomainRecord Also remove feed extraction from converter. This is the crawler's responsibility now.	2024-12-26 17:57:08 +01:00
Viktor Lofgren	89db69d360	(crawler) Correct feed URLs in domain state db Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.	2024-12-26 15:18:31 +01:00
Viktor Lofgren	895cee7004	(crawler) Improved feed discovery, new domain state db per crawlset Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered. Solves issue #135	2024-12-26 15:05:52 +01:00
Viktor Lofgren	4bb71b8439	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:26:23 +01:00
Viktor Lofgren	e4a41f7dd1	(crawler) Correct content type probing to only run on URLs that are suspected to be binary	2024-12-26 14:13:17 +01:00
Viktor Lofgren	a557c7ae7f	(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler	2024-12-23 23:31:03 +01:00
Viktor Lofgren	47e58a21c6	Refactor documentBody method and ContentType charset handling Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.	2024-12-17 17:11:37 +01:00
Viktor Lofgren	3714104976	Add loader for slop data in converter. Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.	2024-12-17 15:40:24 +01:00
Viktor Lofgren	f6f036b9b1	Switch to new Slop format for crawl data storage and processing. Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.	2024-12-15 19:34:03 +01:00
Viktor Lofgren	b510b7feb8	Spike for storing crawl data in slop instead of parquet This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds. On-disk size is virtually identical.	2024-12-15 15:49:47 +01:00
Viktor Lofgren	eb2fe18867	(sideload) Add LSH generation for sideloaded StackExchange data Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.	2024-12-13 02:10:52 +01:00
Viktor Lofgren	a7468c8d23	(converter) Ensure paths are created for converter batch writer	2024-12-13 01:35:07 +01:00
Viktor Lofgren	fb2beb1eac	(converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data	2024-12-13 01:19:30 +01:00
Viktor Lofgren	0fb03e3d62	(export) Add logging to AtagExporter for error handling	2024-12-12 22:54:32 +01:00
Viktor Lofgren	0a53ac68a0	Add specialization for steam store and GOG	2024-12-11 18:32:45 +01:00
Viktor Lofgren	e65d75a0f9	(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets	2024-12-11 17:01:52 +01:00
Viktor Lofgren	3b99cffb3d	(link-parser) Filter out URLs with binary file suffixes in LinkParser Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.	2024-12-11 16:42:47 +01:00
Viktor Lofgren	a97c05107e	Add synthetic meta flag for root path documents If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D	2024-12-11 16:10:44 +01:00
Viktor Lofgren	5002870d1f	(converter) Refactor sideloaders to improve feature handling and keyword logic Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.	2024-12-11 16:01:38 +01:00
Viktor Lofgren	461bc3eb1a	(generator) Add special workaround to flag fextralife as a wiki	2024-12-10 22:22:52 +01:00
Viktor Lofgren	9fc82574f0	(fingerprint) Add FluxGarden as a wiki generator #130	2024-12-10 13:51:42 +01:00
Viktor	589f4dafb9	Merge pull request #129 from MarginaliaSearch/atags-counts (WIP) Improve atag sentence matching	2024-12-10 12:42:34 +00:00
Viktor Lofgren	c5d657ef98	(live-crawler) Flag live crawled documents with a special keyword	2024-12-10 13:42:10 +01:00
Viktor Lofgren	3c2bb566da	(converter) Wipe the converter output path on initialization to avoid lingering stale data.	2024-12-10 13:41:05 +01:00
Viktor Lofgren	e0c0ed27bc	(keyword-extraction) Clean up code and add tests for position and spans calculation This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.	2024-12-08 14:14:52 +01:00
Viktor Lofgren	20abb91657	(loader) Correct DocumentLoaderService to properly do bulk inserts Fixes issue #128	2024-12-08 13:12:52 +01:00
Viktor Lofgren	291ca8daf1	(converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links This change breaks the format of the atags.parquet file.	2024-12-08 00:27:11 +01:00
Viktor Lofgren	ee2d5496d0	Revert "(experiment) Modify atags exporter to permit duplicates from different source domains" This reverts commit `5c858a2b94`.	2024-12-07 14:01:50 +01:00
Viktor Lofgren	5c858a2b94	(experiment) Modify atags exporter to permit duplicates from different source domains This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.	2024-12-06 14:10:15 +01:00
Viktor Lofgren	fdc3efa250	(setup) Remove OpenNLP tokenization model This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.	2024-11-28 16:03:05 +01:00
Viktor Lofgren	52bc0272f8	(atag) Add alias domain support and improve domain handling Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.	2024-11-27 14:26:44 +01:00

516 Commits