MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	2ef66ce0ca	(actor) Reset NEW flag earlier when auto-deletion is disabled Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.	2024-07-31 10:31:03 +02:00
Viktor Lofgren	dc5c668940	(index) Re-enable parallelization of index construction, disable parallel sorting during construction The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance. It got worse, so the change is reverted. However, sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so sorting is now done sequentially.	2024-07-31 10:06:53 +02:00
Viktor Lofgren	6d7b886aaa	(converter) Correct sort order of files in control storage GUI Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the current date, leading to a bizarre sort order where files created today were typically shown first, followed by the rest of the files with the oldest date first.	2024-07-30 19:43:27 +02:00
Viktor Lofgren	b316b55be9	(index) Experimental initial integration of document spans into index	2024-07-30 12:01:53 +02:00
Viktor Lofgren	80900107f7	(restructure) Clean up repo by moving stray features into converter-process and crawler-process	2024-07-30 10:14:00 +02:00
Viktor Lofgren	7e4efa45b8	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:21:21 +02:00
Viktor Lofgren	86ea28d6bc	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:18:52 +02:00
Viktor Lofgren	34703da144	(slop) Support for nested array types and array-of-object types Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.	2024-07-29 14:00:43 +02:00
Viktor Lofgren	1282f78bc5	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 11:01:18 +02:00
Viktor Lofgren	2d5d965f7f	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 10:34:33 +02:00
Viktor Lofgren	afe56c7cf1	(loader) Tidy up code	2024-07-28 21:36:42 +02:00
Viktor Lofgren	7d51cf882f	(loader) Move rssFeeds to a different column group to avoid errors	2024-07-28 21:30:10 +02:00
Viktor Lofgren	499deac2ef	(slop) Fix test that broke when we split get into int get() and long getLong()	2024-07-28 21:20:37 +02:00
Viktor Lofgren	9685993adb	(loader) Add spans to a different column group from spanCodes, as they are not in sync	2024-07-28 21:20:09 +02:00
Viktor Lofgren	261dcdadc8	(loader) Additional tracking for the control GUI	2024-07-28 21:19:45 +02:00
Viktor Lofgren	314a901bf0	(slop) Clean up build.gradle from unnecessary copy-paste garbage	2024-07-28 13:22:20 +02:00
Viktor Lofgren	1caad7e19e	(slop) Update existing code to use the altered Slop interfaces	2024-07-28 13:21:08 +02:00
Viktor Lofgren	e585116dab	(slop) Add 32 bit read method for Varint along with the old 64 bit version	2024-07-28 13:20:18 +02:00
Viktor Lofgren	40f42bf654	(slop) Add signed 16 bit column type "short"	2024-07-28 13:19:44 +02:00
Viktor Lofgren	eaf7fbb9e9	(slop) Improve Conveniences for Enum * New fixed width 8 bit version of Enum * Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn	2024-07-28 13:19:15 +02:00
Viktor Lofgren	d05a2e57e9	(index-forward) Spans Writer should not be in the index page loop context	2024-07-27 15:17:04 +02:00
Viktor Lofgren	f8684118f3	(slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations Added a test that should find any additional broken implementations, as it's very important that this function is correct.	2024-07-27 14:35:30 +02:00
Viktor Lofgren	2e1f669aea	(slop) Remove additional vestigial seek() implementations	2024-07-27 14:35:30 +02:00
Viktor Lofgren	6c3abff664	(slop) Move GCS Slop column to the coded-sequence package This lets the slop library be stand-alone without dependence on coded-sequence. The change also gets rid of the vestigial seek() method in ColumnReader.	2024-07-27 13:58:45 +02:00
Viktor Lofgren	dcb43a3308	(slop) Introduce table concept to keep track of positions and simplify closing The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip. The second most common error is forgetting to close one of the columns in a reader or writer. To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.	2024-07-27 13:47:47 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	52a9a0d410	(slop) Translate nulls to empty strings when passed to the StringColumnWriters.	2024-07-25 18:26:41 +02:00
Viktor Lofgren	4123e99469	(slop) Handle empty compressed files correctly The CompressingStorageReader would incorrectly report having data when a file was empty. Preemptively attempting to fill the backing buffer fixes the behavior.	2024-07-25 18:26:13 +02:00
Viktor Lofgren	51a8a242ac	(slop) First commit of slop library Slop is a low-abstraction data storage convention for column based storage of complex data.	2024-07-25 15:08:41 +02:00
Viktor Lofgren	60ef826e07	(loader) Add heartbeat to update domain-ids step	2024-07-25 15:08:41 +02:00
Viktor Lofgren	2ad564404e	(loader) Add heartbeat to update domain-ids step	2024-07-23 15:28:52 +02:00
Viktor Lofgren	2bb9f18411	(dld) Refactor DocumentLanguageData Reduce the usage of raw arrays	2024-07-19 12:24:55 +02:00
Viktor Lofgren	7a1edc0880	(term-freq) Reduce the number of low-relevance words in the dictionary Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.	2024-07-19 12:23:28 +02:00
Viktor Lofgren	b812e96c6d	(language-processing) Select the appropriate language filter The incorrect filter was selected based on the provided parameter, this has been corrected.	2024-07-19 12:22:32 +02:00
Viktor Lofgren	22b35d5d91	(sentence-extractor) Add tag information to document language data Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers. The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.	2024-07-18 15:57:48 +02:00
Viktor Lofgren	d36055a2d0	(keyword-extractor) Retire TfIdfHigh WordFlag This will bring the word flags count down to 8, and let us pack every value in a byte.	2024-07-17 13:54:39 +02:00
Viktor Lofgren	0d227f3543	(cleanup) Remove next-prime library only used in tests	2024-07-17 13:48:03 +02:00
Viktor Lofgren	0b31c4cfbb	(coded-sequence) Replace GCS usage with an interface	2024-07-16 14:37:50 +02:00
Viktor Lofgren	5c098005cc	(index) Fix broken test Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.	2024-07-16 12:37:59 +02:00
Viktor Lofgren	ae87e41cec	(index) Fix rare BitReader.takeWhileZero bug Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty. The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte. Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.	2024-07-16 11:03:56 +02:00
Viktor Lofgren	dfd19b5eb9	(index) Reduce the number of abstractions around result ranking The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.	2024-07-16 08:18:54 +02:00
Viktor	8ed5b51a32	Merge branch 'master' into term-positions	2024-07-15 07:05:31 +02:00
Viktor Lofgren	9d0e5dee02	Fix gitignore issue that caused .so files not to be ignored correctly.	2024-07-15 05:18:10 +02:00
Viktor Lofgren	ffd970036d	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.	2024-07-15 05:16:17 +02:00
Viktor Lofgren	fa162698c2	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.	2024-07-15 05:15:30 +02:00
Viktor Lofgren	ad3857938d	(search-api, ranking) Update with new ranking parameters Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm. The change also cleans out several parameters that no longer filled any function.	2024-07-15 04:49:40 +02:00
Viktor Lofgren	179a6002c2	(coded-sequence) Add a callback for re-filling underlying buffer	2024-07-12 23:50:28 +02:00
Viktor Lofgren	d28fc86956	(index-prio) Add fuzz test for prio index	2024-07-11 19:22:36 +02:00
Viktor Lofgren	6303977e9c	(index-prio) Fail louder when size is 0 in PrioDocIdsTransformer We can't deal with this scenario and should complain very loudly	2024-07-11 19:22:05 +02:00
Viktor Lofgren	97695693f2	(index-prio) Don't increment readItems counter when the output buffer is full This behavior was causing the reader to sometimes discard trailing entries in the list.	2024-07-11 19:21:36 +02:00

2226 Commits