MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor	8ed5b51a32	Merge branch 'master' into term-positions	2024-07-15 07:05:31 +02:00
Viktor Lofgren	e0459d0c0d	(build) Upgrade parquet dependencies to 1.14.0 This gets rid of a vulnerable transitive dependency.	2024-06-12 08:57:22 +02:00
Viktor Lofgren	9b922af075	(converter) Amend existing modifications to use gamma coded positions lists ... instead of serialized RoaringBitmaps as was the initial take on the problem.	2024-05-30 14:20:36 +02:00
Viktor Lofgren	0112ae725c	(gamma) Implement a small library for Elias gamma coding an integer sequence	2024-05-30 14:19:13 +02:00
Viktor Lofgren	24bf29d369	(*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed	2024-05-20 18:03:21 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor	cfd9a7187f	(query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term. A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model. A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data. The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine". This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.	2024-04-16 15:31:05 +02:00
Viktor Lofgren	be55f3f937	(zim) Fix title extractor	2024-04-13 19:33:47 +02:00
Viktor Lofgren	448a941de2	(encyclopedia) Fix memory issue in preconversion step Use SimpleBlockingThreadPool pool instead of Java's Workstealing Pool as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier if it can't hold any more tasks.	2024-04-05 16:57:53 +02:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	fe8d583fdd	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:27:13 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor Lofgren	ba26f6ce84	(doc) Documentation corrections	2024-02-10 14:16:01 +01:00
Viktor Lofgren	7286596fb4	(deps) Remove monkey patched GSON The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data. Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.	2024-02-06 12:11:39 +01:00
Viktor Lofgren	52a0255814	() Add flag for disabling ASCII flattening The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an experimental* system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	3a325845c7	(mq) Add better error handling in fsm and mq java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs. These are now caught, acted on, and re-thrown. MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	6a1bfd6270	(array) Remove unused 'madvise' code and 3rd party dependency on 'uppend' This wasn't actually hooked in anywhere. Removing the dependency and code. If it turns out we need madvise in the future, we'll re-introduce it.	2024-01-22 13:01:57 +01:00
Viktor Lofgren	27ffb8fa8a	(converter) Integrate zim->db conversion into automatic encyclopedia processing workflow Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically. The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.	2024-01-19 13:59:03 +01:00
Viktor Lofgren	787a20cbaa	(crawling-model) Implement a parquet format for crawl data This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.	2023-12-13 16:22:19 +01:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	35996d0adb	(docs) Update the documentation with up-to-date information	2023-09-14 11:33:36 +02:00
Viktor Lofgren	9f672a0cf4	(parquet-floor) Modify the parquet library to permit list-fields.	2023-09-13 15:56:35 +02:00
Viktor Lofgren	a00cabe223	(parquet-floor) Patch in support for writing and reading repeated values	2023-09-11 14:06:43 +02:00
Viktor Lofgren	dbe974f510	(parquet) Use ZSTD compression by default.	2023-09-11 09:02:58 +02:00
Viktor Lofgren	a284682deb	(parquet) Add parquet library This small library, while great, will require some modifications to fit the project's needs, so it goes into third-party directly.	2023-09-05 10:38:51 +02:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	86a5cc5c5f	(hash) Modified version of common codec's Murmur3 hash	2023-08-01 14:57:40 +02:00
Viktor Lofgren	186a02acfd	Optimize RDRPosTagger to use integer comparisons instead of string comparisons. Also reduce the cache-thrashing by deconstructing the tree's nodes into arrays.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	6f2a7977c1	(Minor) Remove character debris in build.gradle	2023-06-19 17:58:19 +02:00
Viktor Lofgren	266ad2e4de	Re-introduce monkey patched GSON to make converter run better. fixup! Re-introduce monkey patched GSON to make converter run better. fixup! Re-introduce monkey patched GSON to make converter run better.	2023-06-19 17:58:19 +02:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	616effdb3c	The refactoring will continue until morale improves.	2023-03-12 10:04:48 +01:00
Viktor Lofgren	b945fd7f39	A lot of readmes, some refactoring.	2023-03-06 18:32:13 +01:00
Viktor Lofgren	4fdaaa16ba	Restructuring the git repo	2023-03-04 13:19:01 +01:00
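
Commit 0112ae725c mentions Elias gamma coding an integer sequence. As a rough illustration of the general technique only (not the project's library, whose API is not shown in this log), the sketch below codes each positive integer as N-1 zero bits followed by its N-bit binary form. The class and method names are hypothetical, and a string of '0'/'1' characters stands in for a real bit buffer to keep the example readable.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of Elias gamma coding; not Marginalia's implementation.
public class EliasGammaSketch {

    /** Encodes positive integers as a string of '0' and '1' characters. */
    public static String encode(int... values) {
        StringBuilder bits = new StringBuilder();
        for (int v : values) {
            if (v <= 0) {
                throw new IllegalArgumentException("Elias gamma codes positive integers only");
            }
            String binary = Integer.toBinaryString(v);
            // N-1 zeros announce that an N-bit binary number follows
            bits.append("0".repeat(binary.length() - 1));
            bits.append(binary);
        }
        return bits.toString();
    }

    /** Decodes a well-formed bit string produced by encode(). */
    public static List<Integer> decode(String bits) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < bits.length()) {
            // Count the zero prefix to learn how many bits the value occupies
            int zeros = 0;
            while (bits.charAt(i) == '0') {
                zeros++;
                i++;
            }
            out.add(Integer.parseInt(bits.substring(i, i + zeros + 1), 2));
            i += zeros + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String coded = encode(1, 2, 5, 9);
        System.out.println(coded);          // 1010001010001001
        System.out.println(decode(coded));  // [1, 2, 5, 9]
    }
}
```

Small values get short codes (1 takes a single bit), which is why the scheme suits sequences dominated by small numbers. Whether the project stores raw positions or gaps between them is not stated in the commit message.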

37 Commits
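
The merge commit cfd9a7187f describes a compiled query syntax in which parenthesized groups of alternatives expand to every combination. The sketch below is a minimal, assumption-laden illustration of that expansion: it handles only the flat form quoted in the commit message (groups in sequence, no nesting), and the class and method names are hypothetical rather than taken from the index code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of expanding "( a | b ) ( c | d )" style queries.
// Only flat, non-nested groups are supported in this sketch.
public class QueryExpansionSketch {

    public static List<String> expand(String compiled) {
        List<String> results = new ArrayList<>(List.of(""));
        int i = 0;
        while (i < compiled.length()) {
            char c = compiled.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
                continue;
            }
            List<String> alternatives = new ArrayList<>();
            if (c == '(') {
                // A group: split its contents on '|' into alternatives
                int close = compiled.indexOf(')', i);
                if (close < 0) {
                    throw new IllegalArgumentException("unbalanced parenthesis");
                }
                for (String alt : compiled.substring(i + 1, close).split("\\|")) {
                    alternatives.add(alt.trim());
                }
                i = close + 1;
            } else {
                // Bare text up to the next group is a single alternative
                int end = i;
                while (end < compiled.length() && compiled.charAt(end) != '(') {
                    end++;
                }
                alternatives.add(compiled.substring(i, end).trim());
                i = end;
            }
            // Cartesian product of what we have so far with the new alternatives
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String alt : alternatives) {
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        expand("( wiby | marginalia | kagi ) ( search engine | searchengine )")
                .forEach(System.out::println);
    }
}
```

Run against the example from the commit message, this prints the same six queries the message lists, from "wiby search engine" through "kagi searchengine". According to the commit, the real index parses the compiled expression itself rather than enumerating every expansion up front, so this is only a way to see what the syntax means.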