
Keyword Extraction

This code identifies keywords in a document, their positions in the document, their importance based on TF-IDF, and their grammatical functions based on POS tags.
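
To illustrate what TF-IDF-based importance means, here is a minimal sketch of the textbook formula: a word's frequency within the document, multiplied by the logarithm of how rare the word is across the corpus. This is not the module's actual implementation; the class `TfIdfSketch`, its method names, and the document frequency table are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of textbook TF-IDF; not taken from this module.
public class TfIdfSketch {

    /** Term frequency: occurrences of each word divided by the document length. */
    public static Map<String, Double> termFrequencies(List<String> words) {
        Map<String, Double> tf = new HashMap<>();
        for (String w : words) {
            tf.merge(w, 1.0, Double::sum);
        }
        tf.replaceAll((w, count) -> count / words.size());
        return tf;
    }

    /** Inverse document frequency: ln(totalDocs / docsWithTerm), guarding against zero. */
    public static double inverseDocumentFrequency(int totalDocs, int docsWithTerm) {
        return Math.log((double) totalDocs / Math.max(1, docsWithTerm));
    }

    /** TF-IDF score per word; words missing from docFreq are assumed to occur in one document. */
    public static Map<String, Double> score(List<String> words,
                                            Map<String, Integer> docFreq,
                                            int totalDocs) {
        Map<String, Double> scores = new HashMap<>();
        termFrequencies(words).forEach((w, tf) ->
                scores.put(w, tf * inverseDocumentFrequency(totalDocs, docFreq.getOrDefault(w, 1))));
        return scores;
    }

    public static void main(String[] args) {
        // Assumed corpus statistics: 1000 documents, with made-up document frequencies.
        Map<String, Integer> docFreq = Map.of("search", 10, "engine", 50, "the", 1000);
        List<String> words = List.of("search", "engine", "search", "the");

        // "the" appears in every document, so its IDF and its score are both zero.
        score(words, docFreq, 1000)
                .forEach((w, s) -> System.out.println(w + " -> " + s));
    }
}
```

In this sketch a word that is frequent in the document but rare in the corpus scores highest, while a word that appears in every document scores zero regardless of how often it occurs.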

Central Classes

See Also