Commit Graph

1180 Commits

Author SHA1 Message Date
Viktor Lofgren
0a73b02a00 (query) Mark flaky test, correct assert on test 2024-04-21 12:30:14 +02:00
Viktor Lofgren
8769704462 (ranking) TermCoherenceFactory should be run for size=2 queries 2024-04-21 12:29:25 +02:00
Viktor Lofgren
214551f1df (converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation. 2024-04-19 20:36:01 +02:00
Viktor Lofgren
2cc74c005a (query) Always generate an ngram alternative, suppresses generation of multiple identical query branches 2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2 (ranking) Set regularMask correctly 2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0 (ranking) Cleanup 2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314 (ranking) Suppress NaN:s in ranking output 2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45 (ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N 2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898 (index, bugfix) Pass url quality to query service 2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5 (index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp 2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82 (qs) Additional info in query debug UI 2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840 (qs) Additional info in query debug UI 2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422 (qs) Basic query debug feature 2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c (proto) Improve handling of omitted parameters 2024-04-18 10:47:12 +02:00
Viktor Lofgren
975d92912c (qs) Improve logging 2024-04-18 10:44:08 +02:00
Viktor Lofgren
8bbaf457de (query) Minor code cleanup 2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31 (query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34 (query) Modify tokenizer to match the behavior of the sentence extractor
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-17 17:54:32 +02:00
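A minimal sketch of why the two sides have to agree, assuming the sentence extractor drops a trailing possessive from each keyword. The function names and the regular expressions here are hypothetical and are not taken from the repository.

```python
import re

# Hypothetical rule: drop a trailing 's (ASCII or typographic apostrophe).
_POSSESSIVE = re.compile("['’]s$")


def normalize_token(token: str) -> str:
    """Lower-case a token and strip a trailing possessive."""
    return _POSSESSIVE.sub("", token.lower())


def extract_keywords(sentence: str) -> list[str]:
    """Index side: split a sentence into normalized keywords."""
    words = re.findall("[\\w'’]+", sentence)
    return [normalize_token(w) for w in words]


def tokenize_query(query: str) -> list[str]:
    """Query side: must apply the same normalization as the index side."""
    return [normalize_token(t) for t in query.split()]
```

If the query side skipped `normalize_token`, the term `plato's` would never match the indexed keyword `plato`.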
Viktor Lofgren
d64bd227cf (index) Clean up jaccard index term code and down-tune the parameter's importance a bit 2024-04-17 17:40:16 +02:00
Viktor Lofgren
c5ab0a9054 (index) Add jaccard index term to boost results based on term overlap 2024-04-17 16:50:26 +02:00
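For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below only illustrates that definition; which sets the ranking code actually compares is not stated in the commit message, so the function name and the choice of inputs are assumptions.

```python
def jaccard_index(a: set, b: set) -> float:
    """|a ∩ b| / |a ∪ b|, defined here as 0.0 when both sets are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Two sets sharing two of four distinct members, for example `{"a", "b", "c"}` and `{"b", "c", "d"}`, give 0.5.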
Viktor Lofgren
dac948973d (index) Remove position overlap check, coherences will do the work instead 2024-04-17 14:20:01 +02:00
Viktor Lofgren
9d008d1d6f (index) Omit absent terms from coherence checks 2024-04-17 14:12:16 +02:00
Viktor Lofgren
f52457213e (index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus 2024-04-17 14:05:02 +02:00
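A rough sketch of the idea, assuming the textbook BM25 term formula and a hypothetical `ngram_weight` parameter that scales the ngram score before it is added as a bonus. The real scoring code, its parameters, and its defaults are not shown in this log.

```python
import math


def bm25_term(tf, df, n_docs, doc_len, avg_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term (k1 and b are common defaults)."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))


def combined_score(regular, ngrams, ngram_weight, **doc):
    """Score regular keywords and ngrams separately; ngrams act as a bonus.

    `regular` and `ngrams` are lists of (term_frequency, document_frequency).
    `doc` carries n_docs, doc_len and avg_len for bm25_term.
    """
    base = sum(bm25_term(tf, df, **doc) for tf, df in regular)
    bonus = sum(bm25_term(tf, df, **doc) for tf, df in ngrams)
    return base + ngram_weight * bonus
```

Keeping the two sums apart means a document is never penalized for lacking the ngram; it only gains when the ngram is present.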
Viktor Lofgren
579295a673 (search) Add implicit coherence constraints based on segmentation 2024-04-17 14:03:35 +02:00
Viktor Lofgren
af8ff8ce99 (index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-16 22:51:03 +02:00
Viktor Lofgren
7fa3e86e64 (index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-16 19:59:27 +02:00
Viktor Lofgren
3359f72239 (index) Experimental performance regression fix 2024-04-16 19:48:14 +02:00
Viktor Lofgren
41fa154aa6 (test) Fix broken test 2024-04-16 19:48:14 +02:00
Viktor Lofgren
deaba0152d (index) Explicitly free LongQueryBuffers 2024-04-16 19:23:00 +02:00
Viktor Lofgren
feaef6093e (index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-16 18:07:43 +02:00
Viktor Lofgren
078fa4fdd0 (valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-16 17:22:58 +02:00
Viktor Lofgren
2dc77a0638 (valuation) Impose stronger constraints on locality of terms 2024-04-16 17:15:21 +02:00
Viktor Lofgren
2353c73c57 (encyclopedia) Index the full articles
An earlier experimental change indexed only the first paragraph, with the intent of reducing the number of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00
Viktor Lofgren
599e719ad4 (index) Fix priority search terms
This functionality fell into disrepair some while ago.  It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-15 16:44:08 +02:00
Viktor Lofgren
b6d365bacd (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
52f0c0d336 (ngram) Grab titles separately when extracting ngrams from wiki data 2024-04-13 19:34:16 +02:00
Viktor Lofgren
fda1c05164 (ngram) Correct |s|^|s|-normalization to use length and not count 2024-04-13 18:05:30 +02:00
Viktor Lofgren
1329d4abd8 (ngram) Correct size value in ngram lexicon generation, trim the terms better 2024-04-13 17:51:02 +02:00
Viktor Lofgren
f064992137 (ngram) Use simple blocking pool instead of FJP; split on underscores in article names. 2024-04-13 17:07:23 +02:00
Viktor Lofgren
8a81a480a1 (ngram) Only extract frequencies of title words, but use the body to increment the counters...
The sign of the counter is used to indicate whether a term has appeared in a title.  Until it's seen in a title, it's provisionally saved as a negative count.
2024-04-12 18:08:31 +02:00
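The signed-counter trick can be sketched as follows. This is an illustration only: the input shape, the function name, and the choice to count a title sighting as one occurrence (so that the count can never sit at an ambiguous zero) are assumptions, not the repository's code.

```python
def count_title_terms(articles):
    """Count term occurrences, keeping only terms that appear in some title.

    `articles` is an iterable of (title_terms, body_terms) pairs.
    A negative count means "seen in bodies only, so far".
    """
    counts = {}
    for title_terms, body_terms in articles:
        for term in title_terms:
            # A title sighting flips the provisional count positive.
            counts[term] = abs(counts.get(term, 0)) + 1
        for term in body_terms:
            current = counts.get(term, 0)
            # Move away from zero, preserving the sign.
            counts[term] = current + 1 if current > 0 else current - 1
    # Terms still negative were never seen in a title and are dropped.
    return {term: n for term, n in counts.items() if n > 0}
```

This avoids a separate "seen in title" set: one integer per term carries both the frequency and the flag.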
Viktor Lofgren
d729c400e5 (query, minor) Remove debug statement 2024-04-12 17:52:55 +02:00
Viktor Lofgren
ad4810d991 (query, minor) Remove debug statement 2024-04-12 17:45:26 +02:00
Viktor Lofgren
6a67043537 (ngram) Clean up ngram lexicon code
This is both an optimization that removes some GC churn and a clean-up of the code that removes references to outdated concepts.
2024-04-12 17:45:06 +02:00
Viktor Lofgren
864d6c28e7 (segmentation) Pick best segmentation using |s|^|s|-style normalization
This is better than applying all possible segmentations at the same time.
2024-04-12 17:44:14 +02:00
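One way to read "|s|^|s|-style normalization", sketched under assumptions: each multi-word segment s found in an ngram lexicon contributes |s|^|s| times its frequency, where |s| is the number of words in the segment, and the segmentation with the highest total wins. The lexicon shape, the function names, and the exact scoring rule are hypothetical; the commit message does not spell them out.

```python
from itertools import combinations


def segmentations(words):
    """Yield every way to split `words` into contiguous segments."""
    n = len(words)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tuple(words[i:j]) for i, j in zip(bounds, bounds[1:])]


def score(segmentation, lexicon):
    """Sum |s|^|s| * frequency over multi-word segments known to the lexicon."""
    total = 0
    for seg in segmentation:
        if len(seg) >= 2:
            total += len(seg) ** len(seg) * lexicon.get(seg, 0)
    return total


def best_segmentation(words, lexicon):
    """Return the single highest-scoring segmentation (first one on ties)."""
    if not words:
        return []
    return max(segmentations(words), key=lambda s: score(s, lexicon))
```

The |s|^|s| factor compensates for longer ngrams being rarer than their parts, so a moderately frequent three-word phrase can beat a more frequent two-word one.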
Viktor Lofgren
bb6b51ad91 (ngram) Fix index range in NgramLexicon to avoid an exception 2024-04-12 10:13:25 +02:00
Viktor Lofgren
65e3caf402 (index) Clean up the code 2024-04-11 18:50:21 +02:00
Viktor Lofgren
b7d9a7ae89 (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
ed73d79ec1 (qs) Clean up parsing code using new record matching 2024-04-11 17:36:08 +02:00
Viktor Lofgren
c538c25008 (term-freq-exporter) Reduce thread count and memory usage 2024-04-10 17:11:23 +02:00
Viktor Lofgren
4b47fadbab (term-freq-exporter) Extract ngrams in term-frequency-exporter 2024-04-10 16:58:05 +02:00