Viktor Lofgren
95b9af92a0
(index) Implement working optional TermCoherences
2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771
(index) Correct TermCoherence requirements
2024-06-25 22:18:10 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f
(index) Partial re-implementation of position constraints
2024-06-24 15:55:54 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
4fcd4a8197
(index) Refactor to reduce the level of indirection
2024-05-19 12:40:33 +02:00
Viktor Lofgren
19163fa883
(array) Clean up the Array library
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Viktor
2d49071e96
Merge branch 'master' into run-outside-docker
2024-04-25 18:53:26 +02:00
Viktor Lofgren
e4b34b6ee6
(index) Correctly detect the presence of an all-virtual path through the query
2024-04-25 14:01:46 +02:00
Viktor Lofgren
f46733a47a
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15
(ranking) Set regularMask correctly
2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528
(ranking) Cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f623b37577
(ranking) Suppress NaN:s in ranking output
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f4a2fea451
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a748fc5448
(index, bugfix) Pass url quality to query service
2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44b33798f3
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
2f0b648fad
(index) Add jaccard index term to boost results based on term overlap
2024-04-24 14:44:39 +02:00
Viktor Lofgren
de0e56f027
(index) Remove position overlap check, coherences will do the work instead
2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13
(index) Omit absent terms from coherence checks
2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4
(index) Improve recall for small queries
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9
(index) Remove dead code
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026
(index) Experimental performance regression fix
2024-04-24 14:44:39 +02:00
Viktor Lofgren
df75e8f4aa
(index) Explicitly free LongQueryBuffers
2024-04-24 14:44:39 +02:00
Viktor Lofgren
adf846bfd2
(index) Fix term coherence evaluation
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1748fcc5ac
(valuation) Impose stronger constraints on locality of terms
Clean up logic a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
08416393e0
(valuation) Impose stronger constraints on locality of terms
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d
(index) Fix priority search terms
This functionality fell into disrepair some time ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe
(index) Clean up data model
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
55f627ed4c
(index) Clean up the code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692
(search) Fix outdated assumptions about the results
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6cba6aef3b
(minor) Remove dead code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
7e216db463
(index) Add origin trace information for index readers
This used to be supported by the system but got lost in refactoring at some point.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
e3316a3672
(index) Clean up new index query code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b
(qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8769704462
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-21 12:29:25 +02:00
Viktor Lofgren
ed250f57f2
(ranking) Set regularMask correctly
2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0
(ranking) Cleanup
2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314
(ranking) Suppress NaN:s in ranking output
2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898
(index, bugfix) Pass url quality to query service
2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82
(qs) Additional info in query debug UI
2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840
(qs) Additional info in query debug UI
2024-04-19 11:46:27 +02:00
Viktor Lofgren
7641a02f31
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
d64bd227cf
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-17 17:40:16 +02:00