MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	3c2bb566da	(converter) Wipe the converter output path on initialization to avoid lingering stale data.	2024-12-10 13:41:05 +01:00
Viktor Lofgren	fdc3efa250	(setup) Remove OpenNLP tokenization model. This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.	2024-11-28 16:03:05 +01:00
Viktor Lofgren	52bc0272f8	(atag) Add alias domain support and improve domain handling. Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.	2024-11-27 14:26:44 +01:00
Viktor Lofgren	b9842b57e0	(encyclopedia-sideloader) Add test suite and clean up urlencoding logic	2024-11-26 13:34:15 +01:00
Viktor Lofgren	95776e9bee	(encyclopedia) Fix commit gore resulting in bad SQL query	2024-11-26 12:44:49 +01:00
Viktor Lofgren	9ec41e27c6	(keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended	2024-11-25 18:30:22 +01:00
Viktor Lofgren	200743c84f	(minor) Remove delombok debris	2024-11-25 18:29:21 +01:00
Viktor Lofgren	ff17473105	Fix UTF-8 URL normalization issue in sideloader. Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue. Fixes issue #109.	2024-11-25 14:25:47 +01:00
Viktor Lofgren	51e46ad2b0	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants. Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them; some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.	2024-11-21 16:00:09 +01:00
Viktor Lofgren	f94911541a	(live-crawl) Reduce the risk of id collisions with the main indexes. This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.	2024-11-20 16:01:10 +01:00
Viktor Lofgren	79ce4de2ab	(model) Remove deprecated fields from CrawledDocument and CrawledDomain	2024-11-20 15:27:05 +01:00
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP. Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	9eb16cb667	(test) Remove tests from fast suite. Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI. Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.	2024-11-17 19:45:59 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok. There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	a5b4951f23	(chore) Remove use of deprecated STR.-style string templates	2024-11-11 18:02:28 +01:00
Viktor Lofgren	8b8bf0748f	(feature-extraction) Add new DocumentHeaders class encapsulating Html headers. Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.	2024-11-11 13:26:15 +01:00
Viktor Lofgren	7305afa0f8	(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris	2024-10-15 17:27:59 +02:00
Viktor Lofgren	ab486323f2	(converter) Increase the number of links the converter will pick up per document	2024-10-15 13:46:19 +02:00
Viktor Lofgren	d84a2c183f	(*) Remove the crawl spec abstraction. The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.	2024-10-03 13:41:17 +02:00
Viktor Lofgren	e9e8580913	(converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers	2024-09-25 12:18:56 +02:00
Viktor Lofgren	162fc25ebc	(minor) Fix accidental commit errors	2024-09-23 18:03:09 +02:00
Viktor Lofgren	e9854f194c	(crawler) Refactor: restructure the code to make a bit more sense; store full headers in crawl data; fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong	2024-09-23 17:51:07 +02:00
Viktor Lofgren	9c292a4f62	(doc) Fix outdated links in documentation	2024-09-22 13:56:17 +02:00
Viktor Lofgren	8047e77757	(doc) Correct dead links and stale information in the docs	2024-09-13 11:01:05 +02:00
Viktor Lofgren	8f367d96f8	Merge branch 'master' into term-positions # Conflicts: # code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java # code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java # code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java # code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java # code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java # code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java	2024-09-08 10:14:43 +02:00
Viktor Lofgren	f78ef36cd4	(slop) Upgrade to 0.0.8, add encodings to string columns.	2024-09-04 15:19:00 +02:00
Viktor Lofgren	dc67c81f99	(summary) Fix a few cases where noscript tags would sometimes be used for document summary	2024-09-04 15:00:40 +02:00
Viktor Lofgren	185b79f2a5	(converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated.	2024-09-01 11:30:25 +02:00
Viktor Lofgren	abab5bdc8a	(index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data	2024-08-26 14:20:39 +02:00
Viktor Lofgren	266d6e4bea	(slop) Replace SlopPageRef<T> with SlopTable.Ref<T>	2024-08-21 10:13:49 +02:00
Viktor Lofgren	b0a874a842	(*) Upgrade slop library -> 0.0.5	2024-08-18 11:05:27 +02:00
Viktor Lofgren	0a383a712d	(qdebug) Accurately display positions when intersecting with spans	2024-08-15 11:44:17 +02:00
Viktor Lofgren	75b0888032	(slop) Migrate to latest Slop version	2024-08-14 11:44:35 +02:00
Viktor Lofgren	623ee5570f	(slop) Break slop out into its own repository	2024-08-13 09:50:05 +02:00
Viktor Lofgren	fd2bad39f3	(keyword-extraction) Add body field for terms that are not otherwise part of a field	2024-08-13 09:49:26 +02:00
Viktor Lofgren	680ad19c7d	(keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors	2024-08-06 11:16:56 +02:00
Viktor Lofgren	2080e31616	(converter) Store link text positions. To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends. Integrating this information with the ranking is not performed here.	2024-08-04 12:00:29 +02:00
Viktor Lofgren	e48f52faba	(experiment) Add ad-hoc filter runner	2024-08-03 13:24:03 +02:00
Viktor Lofgren	4430a39120	(loader) Clean up	2024-08-02 12:32:47 +02:00
Viktor Lofgren	ac67b6b5da	(converter) Fix exception handling while reading crawl data	2024-08-02 10:39:49 +02:00
Viktor Lofgren	1a268c24c8	(perf) Reduce DomPruningFilter hash table recalculation	2024-08-01 12:04:55 +02:00
Viktor Lofgren	b316b55be9	(index) Experimental initial integration of document spans into index	2024-07-30 12:01:53 +02:00
Viktor Lofgren	80900107f7	(restructure) Clean up repo by moving stray features into converter-process and crawler-process	2024-07-30 10:14:00 +02:00
Viktor Lofgren	7e4efa45b8	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:21:21 +02:00
Viktor Lofgren	86ea28d6bc	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:18:52 +02:00
Viktor Lofgren	34703da144	(slop) Support for nested array types and array-of-object types. Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.	2024-07-29 14:00:43 +02:00
Viktor Lofgren	1282f78bc5	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 11:01:18 +02:00
Viktor Lofgren	2d5d965f7f	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 10:34:33 +02:00
Viktor Lofgren	7d51cf882f	(loader) Move rssFeeds to a different column group to avoid errors	2024-07-28 21:30:10 +02:00
Viktor Lofgren	9685993adb	(loader) Add spans to a different column group from spanCodes, as they are not in sync	2024-07-28 21:20:09 +02:00
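
Commit e9854f194c above describes a Retry-After bug: the header value was read as milliseconds and then clamped to a lower bound of 500ms. A minimal sketch of the corrected handling might look like the following. It is not taken from the MarginaliaSearch code: the class name, the method names, and the 60 second upper bound are all assumptions made for illustration. Only the 500ms lower bound comes from the commit message, and only the delay-seconds form of the header is handled (the HTTP-date form falls back to the minimum).

```java
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the MarginaliaSearch code base. */
final class RetryAfterSketch {
    // Lower bound mentioned in the commit message.
    static final Duration MIN_DELAY = Duration.ofMillis(500);
    // Assumed upper bound; the real crawler may use a different value or none at all.
    static final Duration MAX_DELAY = Duration.ofSeconds(60);

    private RetryAfterSketch() {}

    /** Parses the delay-seconds form of Retry-After; empty if the value is not a plain non-negative integer. */
    static Optional<Duration> parseDelaySeconds(String headerValue) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String trimmed = headerValue.trim();
        if (trimmed.isEmpty() || !trimmed.chars().allMatch(c -> c >= '0' && c <= '9')) {
            return Optional.empty();
        }
        try {
            // The header carries seconds, so convert with ofSeconds rather than ofMillis.
            return Optional.of(Duration.ofSeconds(Long.parseLong(trimmed)));
        } catch (NumberFormatException e) {
            return Optional.empty(); // digits only, but too large for a long
        }
    }

    /** Restricts a delay to the range [MIN_DELAY, MAX_DELAY]. */
    static Duration clamp(Duration delay) {
        if (delay.compareTo(MIN_DELAY) < 0) {
            return MIN_DELAY;
        }
        if (delay.compareTo(MAX_DELAY) > 0) {
            return MAX_DELAY;
        }
        return delay;
    }

    /** Delay to wait before retrying; falls back to MIN_DELAY when the header is absent or unparseable. */
    static Duration delayFor(String headerValue) {
        return parseDelaySeconds(headerValue).map(RetryAfterSketch::clamp).orElse(MIN_DELAY);
    }
}
```

With these assumed bounds, `RetryAfterSketch.delayFor("5")` yields a five second delay. Reading the same value as milliseconds would yield 5ms, which the 500ms lower bound would then raise to 500ms regardless of what the server asked for.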

270 Commits