mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-23 21:18:58 +00:00
Clean up summary extractor module.
parent 8def95e849
commit 950c49d80f
@@ -3,6 +3,14 @@
 This feature attempts to find a descriptive passage of text that summarizes
 what a search result "is about". It's the text you see below a search result.
 
+It must solve two problems:
+
+1. Identify which part of the document contains "the text".
+The crux is that the document may date from anywhere between 1993 and the present, with era-appropriate
+formatting. Headings may be <center>ed <font>-tags, or semantic HTML5.
+
+2. Identify which part of "the text" best describes the document.
+
 It uses several naive heuristics to try to find something that makes sense,
 and there is probably room for improvement.
 
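To make the two problems above concrete, here is a minimal sketch of what a naive heuristic could look like. It is not the logic in `SummaryExtractor`. The class name `NaiveSummarySketch`, the thresholds, and the scoring rule are all hypothetical, and it assumes the HTML has already been split into plain-text blocks by some earlier step.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical illustration only; not the actual SummaryExtractor logic. */
final class NaiveSummarySketch {
    private static final int MIN_LENGTH = 40;    // assumed: shorter blocks are menus, headings, captions
    private static final int MAX_SUMMARY = 200;  // assumed: cap on the returned snippet length

    /** Scores a text block; longer blocks with sentence punctuation score higher. */
    static int score(String block) {
        String text = block.trim();
        if (text.length() < MIN_LENGTH) {
            return 0;
        }
        int sentenceEnds = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '.' || c == '!' || c == '?') {
                sentenceEnds++;
            }
        }
        // Cap the length contribution so a single huge block does not always win.
        return Math.min(text.length(), 500) + 50 * sentenceEnds;
    }

    /** Picks the best-scoring block and truncates it to the snippet length. */
    static Optional<String> summarize(List<String> blocks) {
        String best = null;
        int bestScore = 0;
        for (String block : blocks) {
            int s = score(block);
            if (s > bestScore) {
                bestScore = s;
                best = block.trim();
            }
        }
        if (best == null) {
            return Optional.empty();
        }
        if (best.length() > MAX_SUMMARY) {
            best = best.substring(0, MAX_SUMMARY);
        }
        return Optional.of(best);
    }

    public static void main(String[] args) {
        List<String> blocks = List.of(
            "Home | About | Contact",
            "Welcome to my page about vintage synthesizers. It covers repair guides and schematics.",
            "Copyright 1998");
        System.out.println(summarize(blocks).orElse("(no summary)"));
    }
}
```

A single linear pass over each block keeps the per-document cost low, which is relevant given how many documents have to be processed.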
@@ -10,6 +18,7 @@ There are many good techniques for doing this, but they've sadly not proved
 particularly fast. Whatever solution is used needs to be able to summarize on the
 order of 100,000,000 documents with a time budget of a couple of hours.
 
+
 ## Central Classes
 
 * [SummaryExtractor](src/main/java/nu/marginalia/summary/SummaryExtractor.java)
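The time budget quoted in the README text can be turned into a rough per-document figure. The sketch below assumes "a couple of hours" means two hours and that the work is spread over 16 workers. Both numbers are illustrative assumptions, and `SummaryBudgetSketch` is a hypothetical name.

```java
/** Back-of-the-envelope throughput arithmetic; all inputs except the document count are assumptions. */
final class SummaryBudgetSketch {

    /** Documents per second needed to finish within the budget. */
    static double requiredDocsPerSecond(long documents, double budgetHours) {
        return documents / (budgetHours * 3600.0);
    }

    /** Wall-clock microseconds each worker may spend on one document. */
    static double microsPerDocument(long documents, double budgetHours, int workers) {
        return workers * 1_000_000.0 / requiredDocsPerSecond(documents, budgetHours);
    }

    public static void main(String[] args) {
        long documents = 100_000_000L; // figure from the README text
        double budgetHours = 2.0;      // assumption: "a couple of hours" read as two
        int workers = 16;              // assumption: hypothetical worker count

        System.out.printf("required throughput: %.0f docs/s%n",
                requiredDocsPerSecond(documents, budgetHours));
        System.out.printf("budget per document per worker: %.0f microseconds%n",
                microsPerDocument(documents, budgetHours, workers));
    }
}
```

Under these assumptions the extractor has to sustain roughly 13,900 documents per second, or a little over one millisecond per document per worker.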