MarginaliaSearch/code/features-convert/keyword-extraction/test/nu/marginalia
Viktor Lofgren 22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
..
keyword (sentence-extractor) Add tag information to document language data 2024-07-18 15:57:48 +02:00
test/util (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-03-19 10:42:09 +01:00