The priority index documents file can be compressed to a large degree with a trivial scheme.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
The implementation was incorrectly using one bit more than it should. The change also adds a put method for Elias delta and cleans up the interface a bit.
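As a point of reference for the two codes in the schema above, here is a minimal, hypothetical illustration of Elias gamma and delta coding. It emits the bits as strings for readability; it is not the project's actual bit-writer interface.
```
public class EliasCodes {
    /** Elias gamma: floor(log2 n) zero bits, then n in binary (MSB first). */
    static String gamma(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        int bits = 32 - Integer.numberOfLeadingZeros(n); // bit length of n
        return "0".repeat(bits - 1) + Integer.toBinaryString(n);
    }

    /** Elias delta: gamma-code the bit length of n, then n without its leading 1 bit. */
    static String delta(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be >= 1");
        int bits = 32 - Integer.numberOfLeadingZeros(n);
        return gamma(bits) + Integer.toBinaryString(n).substring(1);
    }

    public static void main(String[] args) {
        System.out.println(gamma(9)); // 0001001
        System.out.println(delta(9)); // 00100001
    }
}
```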
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and the methods the library offers to query sequences, iterate over values, access data, and decode sequences.
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
The reason retain() is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already-intersected documents, where the data is more sparse.
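A hypothetical illustration of the two access patterns (not the actual index code): a retain-style merge over two sorted document lists benefits from long adjacent runs, while fetching data for the already-intersected documents is a scattered, per-document probe with nothing for a merge-style rewrite to exploit.
```
// Merge-style retain: walks both sorted lists in lockstep, so long runs of
// adjacent document ids translate into cheap, predictable sequential access.
static int retain(long[] docs, int numDocs, long[] other) {
    int n = 0, j = 0;
    for (int i = 0; i < numDocs; i++) {
        while (j < other.length && other[j] < docs[i]) j++;
        if (j < other.length && other[j] == docs[i]) docs[n++] = docs[i];
    }
    return n; // number of documents retained
}

// Per-document lookup: the surviving documents are sparse relative to the
// data being probed, so each access lands far from the previous one.
static long lookup(long[] keys, long[] values, long doc) {
    int i = java.util.Arrays.binarySearch(keys, doc);
    return i >= 0 ? values[i] : -1;
}
```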
IntArray gets the YAGNI axe. The array library had two implementations: one for longs, which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we can fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quicksort and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
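A rough sketch of what such a specialization looks like, with hypothetical method names and operating on a plain long[] rather than the array library's own types:
```
static void sort(long[] a, int start, int end) {
    int n = end - start;
    if (n <= 1)
        return;
    if (n == 2) { // sz=2 special case: a single compare-and-swap
        if (a[start] > a[start + 1]) {
            long tmp = a[start];
            a[start] = a[start + 1];
            a[start + 1] = tmp;
        }
        return;
    }
    insertionSort(a, start, end); // general case
}

static void insertionSort(long[] a, int start, int end) {
    for (int i = start + 1; i < end; i++) {
        long v = a[i];
        int j = i - 1;
        while (j >= start && a[j] > v) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = v;
    }
}
```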
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
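For reference, a minimal sketch of a branchless lower-bound search over a sorted long[]; this illustrates the technique, not the actual LongArray implementation. Note that a miss is never encoded as a negative value.
```
// Returns the index of the first element >= key, or a.length if none exists.
static int lowerBound(long[] a, long key) {
    int base = 0;
    int len = a.length;
    while (len > 1) {
        int half = len / 2;
        // The ternary is intended to compile to a conditional move rather
        // than a branch, avoiding branch mispredictions on random data.
        base += (a[base + half - 1] < key) ? half : 0;
        len -= half;
    }
    if (base < a.length && a[base] < key)
        base++;
    return base;
}
```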
Roll back to JDK 21 for now, and make the Java version configurable in the root build.gradle.
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
The change set cleans up the data model for the term-level data, which used to contain a bunch of fields with document-level metadata. This data duplication meant a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
The sign of the counter is used to indicate whether a term has appeared in the title. Until it's seen in the title, it's provisionally saved as a negative count.
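A hypothetical sketch of the sign trick; the class, field, and method names are invented for illustration:
```
import java.util.HashMap;
import java.util.Map;

class TermCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    void observe(String term, boolean inTitle) {
        int old = counts.getOrDefault(term, 0);
        boolean seenInTitle = inTitle || old > 0; // once positive, stays positive
        int count = Math.abs(old) + 1;
        counts.put(term, seenInTitle ? count : -count);
    }

    int count(String term) {
        return Math.abs(counts.getOrDefault(term, 0));
    }

    boolean appearsInTitle(String term) {
        return counts.getOrDefault(term, 0) > 0;
    }
}
```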
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
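In essence, the fix looks like the sketch below, where ExpensiveModel stands in for the real NgramLexicon and RDRPOSTagger types; the actual constructors and surrounding class differ.
```
class ModelHolder {
    static class ExpensiveModel { /* large, costly to construct */ }

    private static ExpensiveModel ngramLexicon;
    private static ExpensiveModel rdrposTagger;

    static synchronized void initModels() {
        // Only instantiate the models if the static fields are still null,
        // instead of unconditionally re-creating them on every call.
        if (ngramLexicon == null) {
            ngramLexicon = new ExpensiveModel();
        }
        if (rdrposTagger == null) {
            rdrposTagger = new ExpensiveModel();
        }
    }
}
```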
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing led to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. It will now just spawn a Java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```
channelPool.call(grpcStub::method)
    .async(executor) // <-- optional
    .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now, the project is all-in on ZooKeeper.
* Service discovery is now based on APIs and not services. This theoretically means we could ship the same code either as a monolith or as a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Still missing are documentation, testing, and some more breaking apart of the code.