Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
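As a rough sketch of the idea (not the sideloader's actual code; names and the bit-sampling scheme are illustrative), a locality-sensitive hash sets bits derived from a document's features, so similar documents produce hashes with a small Hamming distance rather than colliding outright:

```java
import java.util.List;

/** Hypothetical sketch of a locality-sensitive hash over document details.
 *  Similar documents set mostly the same bits, so a small Hamming distance
 *  between two hashes indicates likely duplicates. */
class DocumentDetailsLsh {
    static long hashFeatures(List<String> features) {
        long hash = 0;
        for (String feature : features) {
            // map each feature to one of 64 bit positions
            int bit = Math.floorMod(feature.hashCode(), 64);
            hash |= (1L << bit);
        }
        return hash;
    }

    static int hammingDistance(long a, long b) {
        return Long.bitCount(a ^ b);
    }
}
```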
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
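A minimal sketch of such a filter step, assuming a hypothetical helper and suffix list rather than the crawler's actual code:

```java
import java.util.Set;

/** Illustrative filter that rejects URLs ending in common binary suffixes
 *  before they reach the link parser; the suffix list is an assumption. */
class BinarySuffixFilter {
    private static final Set<String> BINARY_SUFFIXES = Set.of(
            ".png", ".jpg", ".jpeg", ".gif", ".webp",
            ".zip", ".tar", ".gz", ".exe", ".mp3", ".mp4");

    static boolean isProbablyHtml(String url) {
        String lowerCased = url.toLowerCase();
        for (String suffix : BINARY_SUFFIXES) {
            if (lowerCased.endsWith(suffix))
                return false;
        }
        return true;
    }
}
```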
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This makes it possible to search for only the root documents of websites, neat stuff ahead :D
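The "special:root" keyword comes from the change itself, but the builder type below is a stand-in for the real keyword builder (which is where the Synthetic metadata bit would be set), so treat this as a sketch:

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of tagging the root document of a site. */
class RootDocumentTagger {
    static class SyntheticKeywords {
        final Set<String> terms = new HashSet<>();
        void addSynthetic(String term) { terms.add(term); }
    }

    static void tagIfRoot(String urlPath, SyntheticKeywords keywords) {
        if ("/".equals(urlPath)) {
            keywords.addSynthetic("special:root");
        }
    }
}
```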
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
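As a loose illustration of the centralized feature handling, each detected HTML feature contributes a synthetic search term; the enum values and naming scheme here are illustrative, not the project's actual feature set or `applyFeatures` signature:

```java
import java.util.EnumSet;
import java.util.Set;
import java.util.stream.Collectors;

/** Rough sketch of mapping detected HTML features to synthetic keywords. */
class FeatureKeywords {
    enum Feature { JS, COOKIES, AFFILIATE_LINK }

    static Set<String> applyFeatures(EnumSet<Feature> features) {
        return features.stream()
                .map(f -> "special:" + f.name().toLowerCase())
                .collect(Collectors.toSet());
    }
}
```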
This update removes all references to the OpenNLP token model from the setup script, configuration, and test files, as the model file is no longer used.
Introduced optional alias domain functionality in the EdgeDomain class to handle domain variations such as a "www" prefix when matching anchor tag (atags) data, as the atags data commonly contains a number of relevant but glancing near-misses.
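A hedged sketch of the alias idea, treating "www.example.com" and "example.com" as the same domain when matching; the method names are illustrative, not necessarily EdgeDomain's actual API:

```java
/** Sketch of domain aliasing for anchor-tag matching. */
class DomainAliasSketch {
    static String aliasDomain(String domain) {
        if (domain.startsWith("www.")) {
            return domain.substring("www.".length());
        }
        return "www." + domain;
    }

    static boolean matchesWithAlias(String a, String b) {
        return a.equals(b) || aliasDomain(a).equals(b);
    }
}
```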
Adding a tracking message to the export actor makes it possible to run the exports in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
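A very rough sketch of how the pieces above might fit together: trigger one export at a time, use the tracking message id to wait for completion before starting the next, and report progress via the heartbeat. The service and heartbeat types are placeholders, not the actual actor framework:

```java
import java.util.List;

/** Illustrative precession loop over a list of export targets. */
class ExportPrecessionSketch {
    interface ExportService {
        long triggerExport(int nodeId);
        void awaitCompletion(long trackingMsgId);
    }
    interface Heartbeat {
        void progress(int current, int total);
    }

    static void runPrecession(List<Integer> nodes, ExportService exports, Heartbeat heartbeat) {
        for (int i = 0; i < nodes.size(); i++) {
            long msgId = exports.triggerExport(nodes.get(i)); // tracking message for this export
            exports.awaitCompletion(msgId);                   // run exports one after another
            heartbeat.progress(i + 1, nodes.size());          // report progress
        }
    }
}
```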
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed, helps maintain data consistency, and keeps memory usage down.
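The guard amounts to something like the following sketch (names illustrative; content types may carry a charset suffix such as "text/html; charset=utf-8"):

```java
/** Skip any record whose content type is not text/html before parsing. */
class ContentTypeGuard {
    static boolean shouldProcess(String contentType) {
        if (contentType == null)
            return false;
        return contentType.toLowerCase().startsWith("text/html");
    }
}
```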
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
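A minimal sketch of what `normalizeUtf8` does per the description above; the real method may cover additional characters:

```java
class UrlPathNormalizer {
    /** Replace the en-dash with a plain hyphen so the path encodes cleanly. */
    static String normalizeUtf8(String path) {
        return path.replace('\u2013', '-');  // en-dash (U+2013) -> '-'
    }
}
```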
Fixes issue #109.
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
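A sketch of the "dice roll" insert: only a fraction of failed fetches are recorded, which bounds the table's growth while still suppressing most retries over time. The 1-in-10 probability and the table interface are assumptions:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Probabilistically record failed URLs to a bad-URLs table. */
class BadUrlRecorder {
    interface BadUrlsTable {
        void insert(String url);
    }

    static void maybeRecordFailure(String url, BadUrlsTable table) {
        if (ThreadLocalRandom.current().nextInt(10) == 0) {
            table.insert(url);
        }
    }
}
```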
Since some of the export tasks have been memory-hungry, sometimes killing the executor services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, ProcessMainClass was given utilities for the boilerplate around receiving mq requests and responding to them, and some effort was also put toward making the boot sequence of the processes more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
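A loose sketch of the request/response boilerplate described above: block on the inbox for a request, run the task, then reply. The inbox and handler types are placeholders for the project's mq classes, not the actual ProcessMainClass API:

```java
import java.util.function.Function;

/** Illustrative mq request loop: receive, handle, respond. */
class MqRequestLoopSketch {
    interface Inbox {
        String waitForRequest() throws InterruptedException;
        void respond(String response);
    }

    static void serveOneRequest(Inbox inbox, Function<String, String> handler)
            throws InterruptedException {
        String request = inbox.waitForRequest();   // blocks until a request arrives
        String response = handler.apply(request);  // run the export task
        inbox.respond(response);                   // reply so the actor can proceed
    }
}
```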
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
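A sketch of the offset scheme: live-crawled documents are numbered from a large base so they never collide with batch-crawled ordinals for the same domain. The constant below is a hypothetical stand-in, chosen only to match the "up to 100k documents per domain" bound mentioned above:

```java
/** Illustrative ordinal offset for live-crawled documents. */
class LiveCrawlOrdinals {
    private static final int LIVE_CRAWL_ORDINAL_OFFSET = 1_000_000; // hypothetical base

    static int liveOrdinal(int documentIndex) {
        // documentIndex < 100_000 keeps live ordinals within their reserved range
        return LIVE_CRAWL_ORDINAL_OFFSET + documentIndex;
    }
}
```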
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the data makes it all the way to being indexable.
Adding a new @Tag("flaky") for tests that do not reliably pass. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
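For illustration, the tags are applied with JUnit 5's @Tag annotation (the test bodies here are placeholders); CI can then exclude them via JUnit Platform tag filtering, e.g. Gradle's `excludeTags`:

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class NetworkDependentTest {
    @Tag("flaky")
    @Test
    void fetchRemoteFeed() {
        // depends on an external service, so it may not reliably pass
    }

    @Tag("slow")
    @Test
    void fullCorpusRegression() {
        // takes long enough to be excluded from the quick CI run
    }
}
```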
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
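A sketch of sizing the crawl frontier from the configured depth instead of the (usually empty) seed URL list; the sizing heuristic is an assumption, not the crawler's actual code:

```java
import java.util.ArrayDeque;
import java.util.HashSet;

/** Illustrative depth-based pre-allocation of the crawl frontier. */
class FrontierAllocation {
    static ArrayDeque<String> newQueue(int crawlDepth) {
        return new ArrayDeque<>(Math.max(16, crawlDepth));
    }

    static HashSet<String> newVisitedSet(int crawlDepth) {
        // HashSet takes an initial capacity; oversize a bit to limit rehashing
        return new HashSet<>(Math.max(16, crawlDepth * 2));
    }
}
```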