The first change runs index construction in parallel again. That was previously how it was done, but it was switched to sequential to see how that would affect performance; performance got worse, so the change is reverted.
It has been noted, though, that sorting in parallel is likely not a good idea, as it leads to a lot of I/O thrashing, so sorting is changed to run sequentially.
The most common error when dealing with Slop columns is that they can fall out of sync with each other, which happens when the programmer does a conditional read and forgets to skip the corresponding rows in the columns that weren't read.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that tracks the lifecycle of all slop columns and verifies, on close, that they are in sync.
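To make the failure mode concrete, here is a minimal sketch of the idea, assuming a simplified column that just counts consumed rows; ColumnSketch and SlopTableSketch are illustrative stand-ins, not the actual slop API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a column that tracks how many rows it has consumed,
// and a table that registers columns and verifies alignment on close.
class ColumnSketch {
    final String name;
    long row = 0;                      // rows consumed so far

    ColumnSketch(String name) { this.name = name; }
    void read()       { row++; }       // stand-in for an actual column read
    void skip(long n) { row += n; }    // what a conditional read must call
}

class SlopTableSketch implements AutoCloseable {
    private final List<ColumnSketch> columns = new ArrayList<>();

    ColumnSketch register(String name) {
        var col = new ColumnSketch(name);
        columns.add(col);
        return col;
    }

    @Override
    public void close() {
        // The sync check: every registered column must be on the same row.
        for (var col : columns) {
            if (col.row != columns.get(0).row)
                throw new IllegalStateException("Column " + col.name
                        + " is out of sync: " + col.row
                        + " rows consumed vs " + columns.get(0).row);
        }
    }
}
```

With this in place, a conditional read that forgets its skip() surfaces as a hard failure at close time, rather than as silently garbled data further down the line.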
Refactoring keyword extraction to extract span information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a Katamari Damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (Elias gamma)
01b -> diff domainid (Elias delta) + (1 + docord) (Elias delta)
10b -> rank (Elias gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
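The Elias gamma and delta entries above are the standard Elias universal codes. As a rough illustration of how they pack small values into few bits (a toy encoder with a StringBuilder standing in for the real bit writer; not the project's actual implementation):

```java
// Toy Elias coder: gamma writes (N-1) zero bits followed by the N-bit
// binary value; delta gamma-codes the length N, then the low N-1 bits.
// Both require x >= 1, which is why the schema encodes (1 + docord).
class EliasCoderSketch {
    private final StringBuilder bits = new StringBuilder();  // stand-in bit sink

    void putGamma(long x) {
        assert x >= 1;
        int n = 64 - Long.numberOfLeadingZeros(x);   // significant bits in x
        for (int i = 0; i < n - 1; i++) bits.append('0');
        for (int i = n - 1; i >= 0; i--) bits.append((x >>> i) & 1);
    }

    void putDelta(long x) {
        assert x >= 1;
        int n = 64 - Long.numberOfLeadingZeros(x);
        putGamma(n);                                  // encode the bit length
        for (int i = n - 2; i >= 0; i--) bits.append((x >>> i) & 1);
    }

    @Override
    public String toString() { return bits.toString(); }
}
```

Since most diffs between adjacent document ordinals are small, gamma-coding them yields codes of only a few bits, which is where the bulk of the compression comes from.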
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad outcomes.
The btree index adds overhead and disk space, and serves no function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size (sketched below)
* Update the reader to read the new format
* Add a test
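For illustration, the size-prepending copy could be as simple as the sketch below; the method name and the 8-byte little-endian header are assumptions, and the real transformer's interface is not shown:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SizePrependingCopySketch {
    // Copy 'source' into 'dest', prepending the payload size so the reader
    // knows how much data to expect without a btree to tell it.
    static void copyWithSizeHeader(Path source, Path dest) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dest,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {

            ByteBuffer header = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN);
            header.putLong(in.size()).flip();
            while (header.hasRemaining())
                out.write(header);

            // Copy the payload after the header.
            long pos = 0, size = in.size();
            while (pos < size)
                pos += in.transferTo(pos, size - pos, out);
        }
    }
}
```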
The term data iterator is quite hot and was performing unnecessary buffer slice operations.
These are replaced with a fixed pointer alias that can be repositioned over the relevant data.
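Roughly the shape of the change, assuming NIO buffers; TermDataCursor and at() are hypothetical names, but the duplicate-once-and-reposition trick is the general idea:

```java
import java.nio.ByteBuffer;

class TermDataCursor {
    private final ByteBuffer view;

    TermDataCursor(ByteBuffer backing) {
        // One duplicate up front: shares the backing data but owns its own
        // position/limit, so it can be repositioned freely and cheaply.
        this.view = backing.duplicate();
    }

    // Instead of backing.slice(offset, length), which allocates a new
    // buffer object on every call, reposition the single alias.
    ByteBuffer at(int offset, int length) {
        view.limit(offset + length).position(offset);
        return view;
    }
}
```

The trade-off is that every call returns the same object, so a caller has to consume one record before asking for the next; on a hot iterator that is the normal access pattern anyway.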
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step, moving to copying the buffer directly instead.
IntArray gets the YAGNI axe. The array library had two implementations: one for longs, which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally, adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
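The specialization amounts to short-circuiting two-element ranges into a single compare-and-swap before the general logic runs. A minimal sketch over a plain long[] (the real code operates on LongArray):

```java
class SortSketch {
    // Sorts the half-open range [lo, hi).
    static void quickSort(long[] a, int lo, int hi) {
        int n = hi - lo;
        if (n <= 1) return;
        if (n == 2) {                        // sz=2 specialization:
            if (a[lo] > a[lo + 1]) {         // one compare, at most one swap,
                long tmp = a[lo];            // no partitioning machinery
                a[lo] = a[lo + 1];
                a[lo + 1] = tmp;
            }
            return;
        }
        // ... general case: partition and recurse, falling back to
        // insertion sort (with the same sz=2 check) for small ranges ...
        insertionSort(a, lo, hi);            // stand-in for the general case
    }

    static void insertionSort(long[] a, int lo, int hi) {
        for (int i = lo + 1; i < hi; i++) {
            long v = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > v) { a[j + 1] = a[j]; j--; }
            a[j + 1] = v;
        }
    }
}
```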
Seems to work; tests are green and initial testing finds no errors. Still a bit under-tested, but committing WIP as-is because it would suck to lose weeks of work to a drive failure or something.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.