MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	0d227f3543	(cleanup) Remove next-prime library only used in tests	2024-07-17 13:48:03 +02:00
Viktor Lofgren	0b31c4cfbb	(coded-sequence) Replace GCS usage with an interface	2024-07-16 14:37:50 +02:00
Viktor Lofgren	ae87e41cec	(index) Fix rare BitReader.takeWhileZero bug Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty. The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte. Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.	2024-07-16 11:03:56 +02:00
Viktor	8ed5b51a32	Merge branch 'master' into term-positions	2024-07-15 07:05:31 +02:00
Viktor Lofgren	9d0e5dee02	Fix gitignore issue that caused .so files not to be ignored correctly.	2024-07-15 05:18:10 +02:00
Viktor Lofgren	ffd970036d	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.	2024-07-15 05:16:17 +02:00
Viktor Lofgren	fa162698c2	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.	2024-07-15 05:15:30 +02:00
Viktor Lofgren	179a6002c2	(coded-sequence) Add a callback for re-filling underlying buffer	2024-07-12 23:50:28 +02:00
Viktor Lofgren	31881874a9	(coded-sequence) Correct indicator of next-value It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamma code. This is incorrect in this case, as we're able to provide a negative offset. Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	12590d3449	(index-reverse) Added compression to priority index The priority index documents file can be trivially compressed to a large degree. Compression schema: ``` 00b -> diff docord (E gamma) 01b -> diff domainid (E delta) + (1 + docord) (E delta) 10b -> rank (E gamma) + domainid,docord (raw) 11b -> 30 bit size header, followed by 1 raw doc id (61 bits) ```	2024-07-11 16:13:23 +02:00
Viktor Lofgren	abf7a8d78d	(coded-sequence) Correct implementation of Elias gamma Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.	2024-07-10 14:28:28 +02:00
Viktor Lofgren	ecfe17521a	(coded-sequence) Correct implementation of Elias gamma The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.	2024-07-09 17:28:21 +02:00
Viktor Lofgren	02df421c94	(*) Trim the stopwords list Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".	2024-06-26 12:22:57 +02:00
Viktor Lofgren	b805f6daa8	(gamma) Fix readCount() behavior in EGC	2024-06-25 22:17:54 +02:00
Viktor Lofgren	9d00243d7f	(index) Partial re-implementation of position constraints	2024-06-24 15:55:54 +02:00
Viktor Lofgren	5461634616	(doc) Add readme.md for coded-sequence library This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.	2024-06-24 14:28:51 +02:00
Viktor Lofgren	40bca93884	(gamma) Minor clean-up	2024-06-24 13:56:43 +02:00
Viktor Lofgren	fff2ce5721	(gamma) Correctly decode zero-length sequences	2024-06-24 13:11:41 +02:00
Jaseem Abid	0dd14a4bd0	Specify C++ standard in build command The default C++ language standard on macOS is gnu++98, which won't build this module. Full error: ``` > Task :code:libraries:array:cpp:compileCpp FAILED src/main/cpp/cpphelpers.cpp:28:5: error: expected expression [](const p64x2& fst, const p64x2& snd) { ^ ```	2024-06-12 12:47:10 +01:00
Jaseem Abid	9974b31a09	Don't track build files(libcpp.so) with git	2024-06-12 12:45:49 +01:00
Viktor Lofgren	36160988e2	(index) Integrate positions data with indexes WIP This change integrates the new positions data with the forward and reverse indexes. The ranking code is still only partially re-written.	2024-06-10 15:09:06 +02:00
Viktor Lofgren	a07cf1ba93	(array/cpp) Update gitignore to properly exclude libcpp.so	2024-06-06 13:06:08 +02:00
Viktor Lofgren	4a8afa6b9f	(index, WIP) Position data partially integrated with forward and reverse indexes. There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.	2024-06-06 12:54:52 +02:00
Sam Storment	e2f68d9ccf	Add a theme select to the header that lets users toggle their theme independent of their OS theme	2024-06-02 21:02:52 -05:00
Viktor Lofgren	0112ae725c	(gamma) Implement a small library for Elias gamma coding an integer sequence	2024-05-30 14:19:13 +02:00
Viktor Lofgren	89aae93e60	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
Viktor Lofgren	24bf29d369	(*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed	2024-05-20 18:03:21 +02:00
Viktor Lofgren	4fcd4a8197	(index) Refactor to reduce the level of indirection	2024-05-19 12:40:33 +02:00
Viktor Lofgren	daf2a8df54	(btree) Roll back optimization of queryDataWithIndex It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect. The reason retain is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.	2024-05-19 11:29:28 +02:00
Viktor Lofgren	88997a1c4f	(btree) Clean up code	2024-05-18 18:38:46 +02:00
Viktor Lofgren	d12c77305c	(btree) Clean up code	2024-05-18 18:03:17 +02:00
Viktor Lofgren	ab4e2b222e	(array) Fix broken benchmarks	2024-05-18 13:41:24 +02:00
Viktor Lofgren	b867eadbef	(big-string) Remove the unused bigstring library	2024-05-18 13:40:03 +02:00
Viktor Lofgren	19163fa883	(array) Clean up the Array library IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it) Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs. Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.	2024-05-18 13:23:06 +02:00
Viktor Lofgren	650f3843bb	(array) Clean up search function jungle Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values. Replaced binary search function with a branchless version that is much faster. Cleaned up benchmark code.	2024-05-17 14:31:02 +02:00
Viktor Lofgren	9e766bc056	(array) Clean up search function jungle Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values. Replaced binary search function with a branchless version that is much faster. Cleaned up benchmark code.	2024-05-17 14:30:06 +02:00
Viktor Lofgren	48aff52e00	(array) Increase LongArray on-heap alignment to 16 bytes This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.	2024-05-16 19:12:36 +02:00
Viktor Lofgren	9d7616317e	(array) Clean up native code a bit	2024-05-16 14:47:10 +02:00
Viktor Lofgren	f48cf77c4d	(array, experimental) Add benchmark results for quicksort	2024-05-14 18:15:30 +02:00
Viktor Lofgren	3549be216f	(array, experimental) Documentation for native algos	2024-05-14 17:43:05 +02:00
Viktor Lofgren	55a7c1db00	(array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java	2024-05-14 12:54:14 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	deaba0152d	(index) Explicitly free LongQueryBuffers	2024-04-16 19:23:00 +02:00
Viktor Lofgren	b6d365bacd	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-15 16:04:07 +02:00
Viktor Lofgren	52f0c0d336	(ngram) Grab titles separately when extracting ngrams from wiki data	2024-04-13 19:34:16 +02:00
Viktor Lofgren	1329d4abd8	(ngram) Correct size value in ngram lexicon generation, trim the terms better	2024-04-13 17:51:02 +02:00
Viktor Lofgren	f064992137	(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.	2024-04-13 17:07:23 +02:00
Viktor Lofgren	8a81a480a1	(ngram) Only extract frequencies of title words, but use the body to increment the counters... The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.	2024-04-12 18:08:31 +02:00
Viktor Lofgren	6a67043537	(ngram) Clean up ngram lexicon code This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.	2024-04-12 17:45:06 +02:00
Viktor Lofgren	bb6b51ad91	(ngram) Fix index range in NgramLexicon to avoid an exception	2024-04-12 10:13:25 +02:00
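
Several of the coded-sequence and gamma commits above concern Elias gamma coding. As a rough illustration of the textbook code only, and not the repository's implementation, the sketch below encodes a positive integer as floor(log2 n) zero bits followed by the binary form of n. The class name `EliasGammaSketch` and the use of a `String` of '0'/'1' characters in place of a bit buffer are assumptions made for readability. It also shows why zero and negative values are not representable in plain gamma code, which is the limitation mentioned in commit 31881874a9.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of textbook Elias gamma coding; not the project's code. */
public final class EliasGammaSketch {
    private EliasGammaSketch() {}

    /** Encode a positive integer: (bit length - 1) zeros, then the value in binary. */
    public static String encode(int value) {
        if (value <= 0) {
            // Plain gamma code has no representation for zero or negative values.
            throw new IllegalArgumentException("Elias gamma encodes only positive integers");
        }
        String binary = Integer.toBinaryString(value);
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Decode a concatenation of gamma codes; assumes the input is well-formed. */
    public static List<Integer> decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            // Count the leading zeros; they give the number of bits after the first 1.
            int zeros = 0;
            while (bits.charAt(pos) == '0') {
                zeros++;
                pos++;
            }
            int end = pos + zeros + 1;
            values.add(Integer.parseInt(bits.substring(pos, end), 2));
            pos = end;
        }
        return values;
    }
}
```

Small values get short codes (1 becomes `1`, 2 becomes `010`, 5 becomes `00101`), which is why this family of codes suits sequences of small gaps between sorted values.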

170 Commits