Mirror/MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

Viktor Lofgren aebb2652e8 (wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a Katamari Damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.		2024-07-27 11:44:13 +02:00
adblock	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
anchor-keywords	(sentence-extractor) Add tag information to document language data	2024-07-18 15:57:48 +02:00
data-extractors	(wip) Extract and encode spans data	2024-07-27 11:44:13 +02:00
keyword-extraction	(wip) Extract and encode spans data	2024-07-27 11:44:13 +02:00
pubdate	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
reddit-json	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
stackexchange-xml	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
summary-extraction	(keywords) Add position information to keywords	2024-05-28 16:54:53 +02:00
topic-detection	(dld) Refactor DocumentLanguageData	2024-07-19 12:24:55 +02:00
readme.md	Update features-convert/readme.md	2023-03-25 12:43:58 +01:00

Converter Features

Major features:

keyword-extraction - Identifies keywords to index in a document
summary-extraction - Generates an excerpt/quote from a website to display on the search results page

Smaller features:

adblock - Simulates Adblock
pubdate - Determines when a document was published
topic-detection - Tries to identify the topic of a website