MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	f090f0101b	(index-construction) Gather up preindex writes Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	9881cac2da	(index-reader) Correctly handle negative offset values When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	12590d3449	(index-reverse) Added compression to priority index The priority index documents file can be trivially compressed to a large degree. Compression schema: 00b -> diff docord (E gamma); 01b -> diff domainid (E delta) + (1 + docord) (E delta); 10b -> rank (E gamma) + domainid,docord (raw); 11b -> 30 bit size header, followed by 1 raw doc id (61 bits)	2024-07-11 16:13:23 +02:00
Viktor Lofgren	abf7a8d78d	(coded-sequence) Correct implementation of Elias gamma Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.	2024-07-10 14:28:28 +02:00
Viktor Lofgren	ecfe17521a	(coded-sequence) Correct implementation of Elias gamma The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.	2024-07-09 17:28:21 +02:00
Viktor Lofgren	0d29e2a39d	(index-reverse) Entry Sources reset() their LongQueryBuffer Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad results.	2024-07-09 01:39:40 +02:00
Viktor Lofgren	12a2ab93db	(actor) Improve error messages for convert-and-load Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.	2024-07-08 19:19:30 +02:00
Viktor Lofgren	d90bd340bb	(index-reverse) Removing btree indexes from prio documents file Btree index adds overhead and disk space and doesn't fill any function for the prio index. * Update finalize logic with a new IO transformer that copies the data and prepends a size * Update the reader to read the new format * Added a test	2024-07-08 17:20:17 +02:00
Viktor Lofgren	21afe94096	(index-reverse) Don't use 128 bit merge function for prio index	2024-07-07 21:36:10 +02:00
Viktor Lofgren	fa36689597	(index-reverse) Simplify priority index * Do not emit a documents file * Do not interlace metadata or offsets with doc ids	2024-07-06 18:04:08 +02:00
Viktor Lofgren	85c99ae808	(index-reverse) Split index construction into separate packages for full and priority index	2024-07-06 15:44:47 +02:00
Viktor Lofgren	a4ecd5f4ce	(minor) Fix non-compiling test due to previous refactor	2024-07-06 15:11:43 +02:00
Viktor Lofgren	a6b03a66dc	(crawl) Reduce Charset.forName() object churn Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again. Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which just caches the last 2 values seen, doesn't cope well with how we're hitting it with a wide array of random charsets.	2024-07-04 20:49:07 +02:00
Viktor Lofgren	d023e399d2	(index) Remove unnecessary allocations in journal reader The term data iterator is quite hot and was performing buffer slice operations that were not necessary. Replacing with a fixed pointer alias that can be repositioned to the relevant data. The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped. Removed this unnecessary step and move to copying the buffer directly instead.	2024-07-04 15:38:22 +02:00
Viktor Lofgren	e8ab1e14e0	(keyword-extraction) Update upper limit to number of positions per word After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.	2024-07-02 20:52:32 +02:00
Viktor Lofgren	a6e15cb338	(keyword-extraction) Update upper limit to number of positions per word 100 was a bit too low, let's try 256.	2024-06-30 22:46:56 +02:00
Viktor Lofgren	4fbb863a10	(keyword-extraction) Add upper limit to number of positions per word Also adding some logging for this event to get a feel for how big these lists get with realistic data. To be cleaned up later.	2024-06-30 22:41:38 +02:00
Viktor Lofgren	6ee4d1eb90	(keyword) Increase the work area for position encoding The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.	2024-06-28 16:42:39 +02:00
Viktor Lofgren	738e0e5fed	(process) Add option for automatic profiling The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory. The change also adds a JVM parameter that makes it shut up about native access.	2024-06-27 13:58:36 +02:00
Viktor Lofgren	0e4dd3d76d	(minor) Remove accidentally committed debug printf	2024-06-27 13:40:53 +02:00
Viktor Lofgren	10fe5a78cb	(log) Prevent tests from trying to log to file They would never have succeeded, but it adds an annoying preamble of error spam in the console window.	2024-06-27 13:19:48 +02:00
Viktor Lofgren	975b8ae2e9	(minor) Tidy code	2024-06-27 13:15:31 +02:00
Viktor Lofgren	935234939c	(test) Add query parsing to IntegrationTest	2024-06-27 13:15:20 +02:00
Viktor Lofgren	87e38e6181	(search-query) refac: Move query factory	2024-06-27 13:14:47 +02:00
Viktor Lofgren	f73fc8dd57	(search-query) Fix end-inclusion bug in QWordGraphIterator	2024-06-27 13:13:42 +02:00
Viktor Lofgren	3faa5bf521	(search-query) Tidy up QueryGRPCService and IndexClient	2024-06-26 14:03:30 +02:00
Viktor Lofgren	6973712480	(query) Tidy up code	2024-06-26 13:40:06 +02:00
Viktor Lofgren	02df421c94	(*) Trim the stopwords list Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".	2024-06-26 12:22:57 +02:00
Viktor Lofgren	95b9af92a0	(index) Implement working optional TermCoherences	2024-06-26 12:22:06 +02:00
Viktor Lofgren	8ee64c0771	(index) Correct TermCoherence requirements	2024-06-25 22:18:10 +02:00
Viktor Lofgren	b805f6daa8	(gamma) Fix readCount() behavior in EGC	2024-06-25 22:17:54 +02:00
Viktor Lofgren	dae22ccbe0	(test) Integration test from crawl->query	2024-06-25 22:17:26 +02:00
Viktor Lofgren	9d00243d7f	(index) Partial re-implementation of position constraints	2024-06-24 15:55:54 +02:00
Viktor Lofgren	5461634616	(doc) Add readme.md for coded-sequence library This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.	2024-06-24 14:28:51 +02:00
Viktor Lofgren	40bca93884	(gamma) Minor clean-up	2024-06-24 13:56:43 +02:00
Viktor Lofgren	b798f28443	(journal) Fixing journal encoding Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.	2024-06-24 13:56:27 +02:00
Viktor Lofgren	fff2ce5721	(gamma) Correctly decode zero-length sequences	2024-06-24 13:11:41 +02:00
Viktor Lofgren	23759a7243	(loader) Correctly clamp document size	2024-06-10 18:29:14 +02:00
Viktor Lofgren	55b2b7636b	(loader) Correctly load the positions column in the keyword projection	2024-06-10 18:27:15 +02:00
Viktor Lofgren	36160988e2	(index) Integrate positions data with indexes WIP This change integrates the new positions data with the forward and reverse indexes. The ranking code is still only partially re-written.	2024-06-10 15:09:06 +02:00
Viktor Lofgren	9f982a0c3d	(index) Integrate positions file properly	2024-06-06 16:45:42 +02:00
Viktor Lofgren	dcbec9414f	(index) Fix non-compiling tests	2024-06-06 16:35:09 +02:00
Viktor Lofgren	4a8afa6b9f	(index, WIP) Position data partially integrated with forward and reverse indexes. There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.	2024-06-06 12:54:52 +02:00
Viktor Lofgren	9b922af075	(converter) Amend existing modifications to use gamma coded positions lists ... instead of serialized RoaringBitmaps as was the initial take on the problem.	2024-05-30 14:20:36 +02:00
Viktor Lofgren	0112ae725c	(gamma) Implement a small library for Elias gamma coding an integer sequence	2024-05-30 14:19:13 +02:00
Viktor Lofgren	619392edf9	(keywords) Add position information to keywords	2024-05-28 16:54:53 +02:00
Viktor Lofgren	0894822b68	(converter) Add position information to serialized document data This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.	2024-05-28 14:18:03 +02:00
Viktor Lofgren	206a7ce6c1	Merge remote-tracking branch 'origin/master'	2024-05-28 14:15:57 +02:00
Viktor Lofgren	a69ab311c7	(qword) Fix tests that broke due to stopword removal	2024-05-28 14:15:45 +02:00
Viktor	a61327fa0b	Update ROADMAP.md	2024-05-24 13:57:50 +02:00
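
Several commits in this range concern Elias gamma coding (0112ae725c, ecfe17521a, abf7a8d78d). As a reading aid, here is a minimal sketch of the textbook scheme: a positive integer n is written as floor(log2 n) zero bits followed by the binary form of n, for 2*floor(log2 n) + 1 bits in total. The class name `EliasGammaSketch` and the use of '0'/'1' strings are assumptions made for readability; this is not the repository's coded-sequence API, and it does not reproduce how that library stores bits.

```java
public class EliasGammaSketch {

    /** Encode n >= 1 as a string of '0'/'1' characters (illustrative, not efficient). */
    static String encode(int n) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias gamma encodes positive integers only");
        }
        // Binary form without leading zeros; always starts with '1'.
        String binary = Integer.toBinaryString(n);
        // Prefix with (length - 1) zeros so a decoder can tell how many bits follow.
        return "0".repeat(binary.length() - 1) + binary;
    }

    /** Decode the single value that starts at the beginning of the bit string. */
    static int decode(String bits) {
        int zeros = 0;
        while (bits.charAt(zeros) == '0') {
            zeros++;
        }
        // The value occupies the next (zeros + 1) bits, starting at the first '1'.
        return Integer.parseInt(bits.substring(zeros, 2 * zeros + 1), 2);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 5, 17}) {
            String code = encode(n);
            System.out.println(n + " -> " + code + " (" + code.length() + " bits)");
        }
    }
}
```

Counting the bits this way makes it easy to spot an off-by-one in the code length, which is the kind of error ecfe17521a describes fixing.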

2142 Commits