MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	abf7a8d78d	(coded-sequence) Correct implementation of Elias gamma Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.	2024-07-10 14:28:28 +02:00
Viktor Lofgren	e8ab1e14e0	(keyword-extraction) Update upper limit to number of positions per word After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.	2024-07-02 20:52:32 +02:00
Viktor Lofgren	a6e15cb338	(keyword-extraction) Update upper limit to number of positions per word 100 was a bit too low, let's try 256.	2024-06-30 22:46:56 +02:00
Viktor Lofgren	4fbb863a10	(keyword-extraction) Add upper limit to number of positions per word Also adding some logging for this event to get a feel for how big these lists get with realistic data. To be cleaned up later.	2024-06-30 22:41:38 +02:00
Viktor Lofgren	6ee4d1eb90	(keyword) Increase the work area for position encoding The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.	2024-06-28 16:42:39 +02:00
Viktor Lofgren	dae22ccbe0	(test) Integration test from crawl->query	2024-06-25 22:17:26 +02:00
Viktor Lofgren	4a8afa6b9f	(index, WIP) Position data partially integrated with forward and reverse indexes. There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.	2024-06-06 12:54:52 +02:00
Viktor Lofgren	9b922af075	(converter) Amend existing modifications to use gamma coded positions lists ... instead of serialized RoaringBitmaps as was the initial take on the problem.	2024-05-30 14:20:36 +02:00
Viktor Lofgren	619392edf9	(keywords) Add position information to keywords	2024-05-28 16:54:53 +02:00
Viktor Lofgren	0894822b68	(converter) Add position information to serialized document data This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.	2024-05-28 14:18:03 +02:00
Viktor Lofgren	89aae93e60	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	b09ddd0036	(crawler/converter) Remove legacy junk from parquet migration	2024-04-22 12:34:28 +02:00
Viktor Lofgren	c538c25008	(term-freq-exporter) Reduce thread count and memory usage	2024-04-10 17:11:23 +02:00
Viktor Lofgren	4b47fadbab	(term-freq-exporter) Extract ngrams in term-frequency-exporter	2024-04-10 16:58:05 +02:00
Viktor Lofgren	bd0704d5a4	(*) Fix JDK22 migration issues A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	0bd3365c24	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-03-19 14:28:42 +01:00
Viktor Lofgren	d8f4e7d72b	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-03-19 10:42:09 +01:00
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	e696fd9e92	(docs) Begin un-fucking the docs after refactoring	2024-02-27 21:22:21 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	1b8b97b8ec	(sample-exporter) Add some limits on sizes and lengths Tar files will reject entries with filenames over 100b, so we need a limit there. Also added a maximum size limit to keep the file sizes reasonable.	2024-01-25 11:51:53 +01:00
Viktor Lofgren	805afad4fe	(control) New GUI for exporting crawl data samples Not going to win any beauty pageants, but this is pretty peripheral functionality.	2024-01-23 17:08:21 +01:00
Viktor Lofgren	0081328aca	(converter) Adjust which flags are set by anchor text keywords It's a mistake to let it bleed into Title, as this is a high quality signal. We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.	2024-01-23 11:54:00 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	c41e68aaab	(control) New export actions for RSS/Atom feeds and term frequency data This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.	2024-01-15 14:54:26 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	1cbf23e7e7	(test) Don't fail test if atags.parquet is not in ~vlofgren	2023-11-15 09:11:38 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	72afa0341f	duckdb connection may need to be synchronized?	2023-11-04 14:30:25 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	4415f52e18	(keyword-extraction) Fix broken test	2023-10-27 12:19:33 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	93dc80000c	(bugfix) Fix NPE in KeywordExtractor due to bad SoftReference handling	2023-09-26 17:16:41 +02:00
Viktor Lofgren	9b781f8404	(keyword-extractor) Address very rare race condition in memoization logic	2023-09-25 18:28:04 +02:00
Viktor Lofgren	8ca20f184d	(keyword-extraction) Chasing my tail looking for a bug	2023-09-24 19:39:48 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	70aa04c047	(converter, stackexchange-xml) Add the ability to sideload stackexchange data	2023-09-21 12:48:33 +02:00
Viktor Lofgren	5b0a6d7ec1	(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s	2023-09-20 15:15:13 +02:00
Viktor Lofgren	3b4d08f52b	(stackexchange-integration) Add better comments	2023-09-20 14:43:06 +02:00
Viktor Lofgren	6bbf40d7d2	(stackexchange-integration) Tools for reading stackexchange xml files	2023-09-20 14:17:33 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	c68d17d482	(keyword-extraction) Fix bug leading to position data missing on some keywords. This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.	2023-09-02 14:48:55 +02:00
Viktor Lofgren	676e7c7947	(keywords) Add Serializable properties that went missing as the record became a class	2023-09-02 09:52:01 +02:00
Viktor Lofgren	5f427d2b4c	(keywords) Clean up leaky abstractions, clean up tests	2023-09-01 13:52:00 +02:00

89 Commits