MarginaliaSearch/code/processes/converting-process/java/nu/marginalia/converting
Viktor Lofgren 22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data from this rather memory-hungry object. Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant number of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
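The commit message above mentions encoding separator information as a bit set instead of an array of integers. The following is a minimal sketch of that general idea, not the actual Marginalia code: the class name `SeparatorBitSetSketch`, the `SPACE`/`COMMA` constants, and both helper methods are hypothetical, and the sketch assumes there are only two separator kinds so that one bit per word is enough.

```java
import java.util.BitSet;

/** Hypothetical illustration only; these names do not come from the Marginalia code base. */
public class SeparatorBitSetSketch {
    // Assumed separator kinds: with only two kinds, one bit per word suffices.
    static final int SPACE = 0;
    static final int COMMA = 1;

    /** Packs an int-per-word separator array into a BitSet; a set bit marks a COMMA. */
    static BitSet encode(int[] separators) {
        BitSet bits = new BitSet(separators.length);
        for (int i = 0; i < separators.length; i++) {
            if (separators[i] == COMMA) {
                bits.set(i);
            }
        }
        return bits;
    }

    /** Returns true if the word at wordIndex is followed by a COMMA separator. */
    static boolean isCommaAfter(BitSet bits, int wordIndex) {
        return bits.get(wordIndex);
    }

    public static void main(String[] args) {
        int[] separators = { SPACE, COMMA, SPACE, SPACE };
        BitSet bits = encode(separators);

        System.out.println(isCommaAfter(bits, 1)); // true
        System.out.println(isCommaAfter(bits, 2)); // false
        System.out.println(bits.cardinality());    // 1
    }
}
```

An `int[]` spends 32 bits per word on the separator, while a `BitSet` spends roughly one bit per word, which is the kind of saving that matters on an object held in memory for every sentence of a document.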
model (btree) Clean up code 2024-05-18 18:03:17 +02:00
processor (sentence-extractor) Add tag information to document language data 2024-07-18 15:57:48 +02:00
sideload (keywords) Add position information to keywords 2024-05-28 16:54:53 +02:00
util (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
writer (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
ConverterMain.java (converter) Do not suppress exceptions in the converter 2024-04-30 18:24:35 +02:00
ConverterModule.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00