diff --git a/code/features-convert/summary-extraction/readme.md b/code/features-convert/summary-extraction/readme.md
index dc14b366..225ceb3c 100644
--- a/code/features-convert/summary-extraction/readme.md
+++ b/code/features-convert/summary-extraction/readme.md
@@ -3,6 +3,14 @@
 This feature attempts to find a descriptive passage of text that summarizes
 what a search result "is about". It's the text you see below a search result.
 
+It must solve two problems:
+
+1. Identify which part of the document contains "the text".
+The crux is that the document may date from anywhere between 1993 and the present, with era-appropriate
+formatting. Headings may be <center>ed <font>-tags, or semantic HTML5.
+
+2. Identify which part of "the text" best describes the document.
+
 It uses several naive heuristics to try to find something that makes sense,
 and there is probably room for improvement.
 
@@ -10,6 +18,7 @@
 There are many good techniques for doing this, but they've sadly not proved
 particularly fast. Whatever solution is used needs to be able to summarize of order
 of a 100,000,000 documents with a time budget of a couple of hours.
+
 ## Central Classes
 
 * [SummaryExtractor](src/main/java/nu/marginalia/summary/SummaryExtractor.java)