2137 Commits

Author SHA1 Message Date
Viktor Lofgren
2f0b648fad (index) Add jaccard index term to boost results based on term overlap 2024-04-24 14:44:39 +02:00
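For readers unfamiliar with the term, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below is only an illustration of that formula applied to query terms and document terms; the class and method names are hypothetical, and how the commit actually weighs the value into the ranking is not stated in the log.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not taken from the repository.
class JaccardSketch {
    /** Jaccard index |A ∩ B| / |A ∪ B|; defined here as 0 when both sets are empty. */
    static double jaccard(Set<String> queryTerms, Set<String> documentTerms) {
        if (queryTerms.isEmpty() && documentTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(queryTerms);
        intersection.retainAll(documentTerms);
        int unionSize = queryTerms.size() + documentTerms.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }
}
```

A higher value means more of the terms are shared, which is presumably what "term overlap" refers to in the commit subject.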
Viktor Lofgren
de0e56f027 (index) Remove position overlap check, coherences will do the work instead 2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13 (index) Omit absent terms from coherence checks 2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85 (index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus 2024-04-24 14:44:39 +02:00
Viktor Lofgren
c583a538b1 (search) Add implicit coherence constraints based on segmentation 2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4 (index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9 (index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026 (index) Experimental performance regression fix 2024-04-24 14:44:39 +02:00
Viktor Lofgren
1bb88968c5 (test) Fix broken test 2024-04-24 14:44:39 +02:00
Viktor Lofgren
df75e8f4aa (index) Explicitly free LongQueryBuffers 2024-04-24 14:44:39 +02:00
Viktor Lofgren
adf846bfd2 (index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1748fcc5ac (valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
08416393e0 (valuation) Impose stronger constraints on locality of terms 2024-04-24 14:44:39 +02:00
Viktor Lofgren
fce26015c9 (encyclopedia) Index the full articles
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d (index) Fix priority search terms
This functionality fell into disrepair some while ago.  It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f3255e080d (ngram) Grab titles separately when extracting ngrams from wiki data 2024-04-24 14:44:39 +02:00
Viktor Lofgren
0da03d4cfc (zim) Fix title extractor 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5f6a3ef9d0 (ngram) Correct |s|^|s|-normalization to use length and not count 2024-04-24 14:44:39 +02:00
Viktor Lofgren
afc4fed591 (ngram) Correct size value in ngram lexicon generation, trim the terms better 2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb505f98ef (ngram) Use simple blocking pool instead of FJP; split on underscores in article names. 2024-04-24 14:44:39 +02:00
Viktor Lofgren
a0b3634cb6 (ngram) Only extract frequencies of title words, but use the body to increment the counters...
The sign of the counter is used to indicate whether a term has appeared as title.  Until it's seen in the title, it's provisionally saved as a negative count.
2024-04-24 14:44:39 +02:00
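The signed-counter idea in the commit above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the class name, the method names, and the choice to count the title occurrence itself (so that a title-only term never sits at zero) are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the sign of the count records whether the term was seen in a title.
class TitleTermCounter {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Body occurrence: grow the magnitude and keep the sign. Unseen terms start at -1. */
    void seenInBody(String term) {
        counts.merge(term, -1, (old, unused) -> old < 0 ? old - 1 : old + 1);
    }

    /** Title occurrence: make the count positive and add one (assumption: titles count too). */
    void seenInTitle(String term) {
        counts.merge(term, 1, (old, unused) -> Math.abs(old) + 1);
    }

    /** A positive count means the term has appeared in at least one title. */
    boolean isTitleTerm(String term) {
        return counts.getOrDefault(term, 0) > 0;
    }

    /** The magnitude is the frequency, regardless of whether the count is still provisional. */
    int frequency(String term) {
        return Math.abs(counts.getOrDefault(term, 0));
    }
}
```

The appeal of the trick is that no second map or flag is needed: body counts gathered before the title is seen are preserved and simply change sign.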
Viktor Lofgren
e23359bae9 (query, minor) Remove debug statement 2024-04-24 14:44:39 +02:00
Viktor Lofgren
5531ed632a (query, minor) Remove debug statement 2024-04-24 14:44:39 +02:00
Viktor Lofgren
150ee21f3c (ngram) Clean up ngram lexicon code
This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
c96da0ce1e (segmentation) Pick best segmentation using |s|^|s|-style normalization
This is better than doing all segmentations possible at the same time.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a0d9e66ff7 (ngram) Fix index range in NgramLexicon to avoid an exception 2024-04-24 14:44:38 +02:00
Viktor Lofgren
55f627ed4c (index) Clean up the code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
7dd8c78c6b (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8bf7d090fd (qs) Clean up parsing code using new record matching 2024-04-24 14:44:38 +02:00
Viktor Lofgren
6bfe04b609 (term-freq-exporter) Reduce thread count and memory usage 2024-04-24 14:44:38 +02:00
Viktor Lofgren
491d6bec46 (term-freq-exporter) Extract ngrams in term-frequency-exporter 2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692 (search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.

For the API service, we'll simulate the old behavior to keep the API stable.

For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6cba6aef3b (minor) Remove dead code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
7e216db463 (index) Add origin trace information for index readers
This used to be supported by the system but got lost in refactoring at some point.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
adc90c8f1e (sentence-extractor) Fix resource leak in sentence extractor
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.

The modified behavior checks for nullity before creating a new instance.
2024-04-24 14:44:38 +02:00
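The fix described in the commit above follows a common pattern: only populate a shared static field when it is still null. The sketch below shows that pattern under stated assumptions. `ExpensiveModel` is a hypothetical stand-in for the lexicon and tagger, and the `synchronized` block is an addition for thread safety that the commit message does not mention.

```java
// Hypothetical stand-in for a costly resource such as a lexicon or tagger.
class ExpensiveModel {
    static int loads = 0;   // counts how many instances were ever constructed

    ExpensiveModel() {
        loads++;
    }
}

class SentenceExtractorSketch {
    // Shared across all extractor instances.
    private static ExpensiveModel model;

    SentenceExtractorSketch() {
        // Before the fix, the equivalent of this assignment ran unconditionally,
        // so every new extractor allocated a fresh model.
        synchronized (SentenceExtractorSketch.class) {
            if (model == null) {
                model = new ExpensiveModel();
            }
        }
    }
}
```

With the null check in place, constructing any number of extractors loads the model once.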
Viktor Lofgren
e3316a3672 (index) Clean up new index query code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b (qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors.  Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8cb9455c32 (qs, WIP) Fix edge cases in query compilation
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w).  The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
dc65b2ee01 (qs, WIP) Clean up dead code 2024-04-24 14:44:38 +02:00
Viktor Lofgren
98a1adbf81 (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
0bd1e15cce (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
eda926767e (qs, WIP) Tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
cd1a18c045 (qs, WIP) Break up code and tidy it up a bit 2024-04-24 14:44:38 +02:00
Viktor Lofgren
6f567fbea8 (qs, WIP) Fix output determinism, fix tests 2024-04-24 14:44:38 +02:00
Viktor Lofgren
0ebadd03a5 (WIP) Query rendering finally beginning to look like it works 2024-04-24 14:44:38 +02:00
Viktor Lofgren
2253b556b2 WIP 2024-04-24 14:44:17 +02:00
Viktor Lofgren
6a7a7009c7 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-04-24 14:44:17 +02:00
Viktor Lofgren
212d101727 (control) GUI for exporting segmentation data from a wikipedia zim 2024-04-24 14:44:17 +02:00