Viktor Lofgren
2cc74c005a
(query) Always generate an ngram alternative; suppress generation of multiple identical query branches
2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2
(ranking) Set regularMask correctly
2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0
(ranking) Cleanup
2024-04-19 14:13:12 +02:00
Viktor Lofgren
41782a0ab5
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82
(qs) Additional info in query debug UI
2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840
(qs) Additional info in query debug UI
2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422
(qs) Basic query debug feature
2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c
(proto) Improve handling of omitted parameters
2024-04-18 10:47:12 +02:00
Viktor Lofgren
8bbaf457de
(query) Minor code cleanup
2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34
(query) Modify tokenizer to match the behavior of the sentence extractor
...
The two must match; otherwise a query like "plato's republic" won't match the indexed keywords, since those have the possessive stripped.
2024-04-17 17:54:32 +02:00
Viktor Lofgren
f52457213e
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-17 14:05:02 +02:00
Viktor Lofgren
579295a673
(search) Add implicit coherence constraints based on segmentation
2024-04-17 14:03:35 +02:00
Viktor Lofgren
599e719ad4
(index) Fix priority search terms
...
This functionality fell into disrepair some time ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-15 16:44:08 +02:00
Viktor Lofgren
b6d365bacd
(index) Clean up data model
...
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data duplication meant a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
fda1c05164
(ngram) Correct |s|^|s|-normalization to use length and not count
2024-04-13 18:05:30 +02:00
Viktor Lofgren
d729c400e5
(query, minor) Remove debug statement
2024-04-12 17:52:55 +02:00
Viktor Lofgren
ad4810d991
(query, minor) Remove debug statement
2024-04-12 17:45:26 +02:00
Viktor Lofgren
864d6c28e7
(segmentation) Pick best segmentation using |s|^|s|-style normalization
...
This is better than doing all possible segmentations at the same time.
2024-04-12 17:44:14 +02:00
Viktor Lofgren
b7d9a7ae89
(ngrams) Remove the vestigial logic for capturing permutations of n-grams
...
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
ed73d79ec1
(qs) Clean up parsing code using new record matching
2024-04-11 17:36:08 +02:00
Viktor Lofgren
fcdc843c15
(search) Fix outdated assumptions about the results
...
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-07 12:09:44 +02:00
Viktor Lofgren
ae7c760772
(index) Clean up new index query code
2024-04-05 13:30:49 +02:00
Viktor Lofgren
81815f3e0a
(qs, index) New query model integrated with index service.
...
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-04 20:17:58 +02:00
Viktor Lofgren
87bb93e1d4
(qs, WIP) Fix edge cases in query compilation
...
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-03-29 12:40:27 +01:00
Viktor Lofgren
e596c929ac
(qs, WIP) Clean up dead code
2024-03-28 16:37:23 +01:00
Viktor Lofgren
9852b0e609
(qs, WIP) Tidy it up a bit
2024-03-28 14:18:26 +01:00
Viktor Lofgren
51b0d6c0d3
(qs, WIP) Tidy it up a bit
2024-03-28 14:09:17 +01:00
Viktor Lofgren
15391c7a88
(qs, WIP) Tidy it up a bit
2024-03-28 13:54:30 +01:00
Viktor Lofgren
fe62593286
(qs, WIP) Break up code and tidy it up a bit
2024-03-28 13:26:54 +01:00
Viktor Lofgren
4cc11e183c
(qs, WIP) Fix output determinism, fix tests
2024-03-28 13:11:26 +01:00
Viktor Lofgren
f82ebd7716
(WIP) Query rendering finally beginning to look like it works
2024-03-28 13:01:21 +01:00
Viktor Lofgren
002afca1c5
(sys) Upgrade to JDK22
...
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
a4b810f511
WIP
2024-03-21 14:33:26 +01:00
Viktor Lofgren
0bd3365c24
(convert) Initial integration of segmentation data into the converter's keyword extraction logic
2024-03-19 14:28:42 +01:00
Viktor Lofgren
d8f4e7d72b
(qs) Retire NGramBloomFilter, integrate new segmentation model instead
2024-03-19 10:42:09 +01:00
Viktor Lofgren
00ef4f9803
(WIP) Partial integration of new query expansion code into the query-service
2024-03-18 13:16:49 +01:00
Viktor Lofgren
07e4d7ec6d
(WIP) Improve data extraction from wikipedia data
2024-03-18 13:16:00 +01:00
Viktor Lofgren
8ae1f08095
(WIP) Implement first take of new query segmentation algorithm
2024-03-12 13:12:50 +01:00
Viktor Lofgren
46423612e3
(refac) Merge service-discovery and service modules
...
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
9f1649636e
Clean up documentation and rename domain-links to link-graph
2024-02-28 11:40:39 +01:00
Viktor Lofgren
5604e9f531
(query) Bump query length, see what happens :P
2024-02-27 21:22:17 +01:00
Viktor Lofgren
427f3e922f
(index) Retire count operation, clean up index code.
2024-02-27 21:22:17 +01:00
Viktor Lofgren
9429bf5c45
(index) Clean up
2024-02-27 21:22:17 +01:00
Viktor Lofgren
fc00701a1e
(index) Experimental refactoring of the indexing functionality
2024-02-25 11:05:10 +01:00
Viktor Lofgren
1d34224416
(refac) Remove src/main from all source code paths.
...
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
f8e7f75831
Move index to top level of code
2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63
* Extract the index functionality
2024-02-22 17:31:25 +01:00
Viktor Lofgren
3fd2a83184
* Extract the search-query function
2024-02-22 15:27:39 +01:00