Clean up summary extractor module.

2025-02-23 21:18:58 +00:00 · 2023-03-18 10:28:48 +01:00 · 2023-03-18 10:28:48 +01:00 · 950c49d80f
commit 950c49d80f
parent 8def95e849
1 changed file with 9 additions and 0 deletions
--- a/code/features-convert/summary-extraction/readme.md
+++ b/code/features-convert/summary-extraction/readme.md
@@ -3,6 +3,14 @@
 This feature attempts to find a descriptive passage of text that summarizes
 what a search result "is about". It's the text you see below a search result.
 It must solve two problems:
 1. Identify which part of the document contains "the text".
 The crux is that the document may date from anywhere between 1993 and the present, with era-appropriate
 formatting. Headings may be &lt;center&gt;ed &lt;font&gt;-tags, or semantic HTML5.
 2. Identify which part of "the text" best describes the document.
 It uses several naive heuristics to try to find something that makes sense,
 and there is probably room for improvement. 
@@ -10,6 +18,7 @@ There are many good techniques for doing this, but they've sadly not proved
 particularly fast. Whatever solution is used needs to be able to summarize on
 the order of 100,000,000 documents with a time budget of a couple of hours.
 ## Central Classes
 * [SummaryExtractor](src/main/java/nu/marginalia/summary/SummaryExtractor.java)
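
To make the two problems described in the readme concrete, here is a minimal sketch of one naive heuristic. It is not the code in `SummaryExtractor`: the class name `SummarySketch`, the `MIN_WORDS` threshold, and the tag lists are all assumptions made for illustration, and it uses regular expressions from the standard library instead of a real HTML parser. It splits the markup into blocks regardless of era (`<center>`, `<td>`, `<p>`, HTML5 sectioning tags), then picks the first block that looks like a sentence of prose.

```java
import java.util.regex.Pattern;

// Illustrative only: a naive summary heuristic, not the real SummaryExtractor.
final class SummarySketch {
    // Hypothetical threshold: shorter blocks are assumed to be headings or menus.
    private static final int MIN_WORDS = 8;

    // Elements whose contents are unlikely to describe the document.
    private static final Pattern NON_CONTENT = Pattern.compile(
            "(?is)<(script|style|nav|header|footer)\\b.*?</\\1\\s*>");

    // Tags treated as block boundaries, covering both 1990s and HTML5 markup.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
            "(?i)</?(p|div|br|center|table|tr|td|ul|ol|li|h[1-6]|article|section)\\b[^>]*>");

    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    /** Returns the first prose-like block, cut to at most maxLength characters, or "". */
    static String summarize(String html, int maxLength) {
        String withoutNoise = NON_CONTENT.matcher(html).replaceAll(" ");
        for (String block : BLOCK_BREAK.split(withoutNoise)) {
            // Strip remaining inline tags (e.g. <font>) and collapse whitespace.
            // HTML entities are left undecoded in this sketch.
            String text = ANY_TAG.matcher(block).replaceAll(" ")
                    .replaceAll("\\s+", " ").trim();
            if (looksLikeProse(text)) {
                return truncate(text, maxLength);
            }
        }
        return "";
    }

    // Prose is assumed to have several words and some sentence punctuation.
    static boolean looksLikeProse(String text) {
        if (text.isEmpty()) return false;
        boolean hasPunctuation = text.indexOf('.') >= 0
                || text.indexOf('!') >= 0 || text.indexOf('?') >= 0;
        return hasPunctuation && text.split(" ").length >= MIN_WORDS;
    }

    // Cuts at the last space at or before maxLength so words are not split.
    static String truncate(String text, int maxLength) {
        if (text.length() <= maxLength) return text;
        int cut = text.lastIndexOf(' ', maxLength);
        if (cut <= 0) cut = maxLength;
        return text.substring(0, cut);
    }
}
```

A single linear pass with precompiled patterns keeps the per-document cost low, which matters given the time budget mentioned above, at the price of being easy to fool with unusual markup.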