MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 21:29:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a Katamari Damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	2bb9f18411	(dld) Refactor DocumentLanguageData Reduce the usage of raw arrays	2024-07-19 12:24:55 +02:00
Viktor Lofgren	b812e96c6d	(language-processing) Select the appropriate language filter The incorrect filter was selected based on the provided parameter, this has been corrected.	2024-07-19 12:22:32 +02:00
Viktor Lofgren	22b35d5d91	(sentence-extractor) Add tag information to document language data Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers. The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.	2024-07-18 15:57:48 +02:00
Viktor Lofgren	fa162698c2	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning up multiple SentenceExtractor instances at once.	2024-07-15 05:15:30 +02:00
Viktor Lofgren	7f7021ce64	(sentence-extractor) Fix resource leak in sentence extractor The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation. The modified behavior checks for nullity before creating a new instance.	2024-04-05 18:52:58 +02:00
Viktor Lofgren	0bd3365c24	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-03-19 14:28:42 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00

8 Commits
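
Commits fa162698c2 and 7f7021ce64 both concern how shared static members are initialized: one describes synchronizing on the instance rather than on something shared, the other describes re-creating the static fields even when they were already set. The following is a minimal sketch of the general pattern those messages describe, not the project's actual code. The class `LazySharedResources`, the nested `Lexicon` type, and the `loadCount()` helper are all hypothetical names introduced for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class LazySharedResources {
    // Hypothetical stand-in for an expensive shared model (e.g. a lexicon or tagger).
    static final class Lexicon { }

    // Counts how many times the expensive resource was actually built.
    private static final AtomicInteger loads = new AtomicInteger();

    // Shared by every instance, so it must be guarded by a lock that is also shared.
    private static volatile Lexicon lexicon;

    public LazySharedResources() {
        // Lock on the class object, not on `this`: each new instance has its own
        // monitor, so synchronized(this) would not exclude other constructors.
        synchronized (LazySharedResources.class) {
            // Check for null first so an existing instance is reused
            // instead of being replaced by a fresh allocation.
            if (lexicon == null) {
                loads.incrementAndGet();
                lexicon = new Lexicon();
            }
        }
    }

    static int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // Construct several instances concurrently.
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> new LazySharedResources());
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("loads = " + loadCount());
    }
}
```

Because the lock and the null check both refer to state shared across instances, the resource is built at most once per JVM regardless of how many instances are constructed in parallel.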