Viktor Lofgren
32fe864a33
(build) Java 22 and its consequences has been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make the Java version configurable in the root build.gradle.
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
ad2ac8eee3
(query) Mark flaky test, correct assert on test
2024-04-24 14:44:39 +02:00
Viktor Lofgren
64baa41e64
(query) Always generate an ngram alternative; this suppresses generation of multiple identical query branches
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15
(ranking) Set regularMask correctly
2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528
(ranking) Cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e79ab0c70e
(qs) Basic query debug feature
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e419e26f3a
(proto) Improve handling of omitted parameters
2024-04-24 14:44:39 +02:00
Viktor Lofgren
def36719d3
(query) Minor code cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a09c84e1b8
(query) Modify tokenizer to match the behavior of the sentence extractor
...
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-24 14:44:39 +02:00
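To illustrate the tokenizer change in a09c84e1b8: the sketch below shows why the query side and the indexing side have to agree on how possessives are handled. `KeywordNormalizer` and its methods are hypothetical names, not Marginalia's actual classes, and the rule shown (lowercase, drop a trailing 's) is an assumption based on the commit message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class KeywordNormalizer {
    // Hypothetical shared rule: lowercase, then strip a trailing possessive ('s or ’s).
    public static String normalize(String token) {
        String t = token.toLowerCase(Locale.ROOT);
        if (t.endsWith("'s") || t.endsWith("\u2019s")) {
            t = t.substring(0, t.length() - 2);
        }
        return t;
    }

    // Split on whitespace and normalize each token with the same rule.
    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (!raw.isEmpty()) {
                out.add(normalize(raw));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Query side and index side both produce the same keywords.
        System.out.println(tokenize("Plato's Republic")); // prints [plato, republic]
    }
}
```

If only one side applied `normalize`, the query term `plato's` would never equal the indexed keyword `plato`, which is the mismatch the commit describes.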
Viktor Lofgren
cb4b824a85
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-24 14:44:39 +02:00
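As a rough illustration of cb4b824a85, the sketch below computes a textbook BM25 score separately for regular keywords and for ngram keywords, then adds the ngram score as a weighted bonus. The class name, the parameter values (`k1`, `b`) and the bonus weight are all assumptions made for the example; the commit does not state the values actually used.

```java
public class Bm25Sketch {
    // Textbook BM25 over parallel arrays of term frequency and document frequency.
    public static double bm25(int[] termFreqs, int[] docFreqs,
                              int docLen, double avgDocLen, int docCount,
                              double k1, double b) {
        double sum = 0;
        for (int i = 0; i < termFreqs.length; i++) {
            double idf = Math.log(1 + (docCount - docFreqs[i] + 0.5) / (docFreqs[i] + 0.5));
            double norm = termFreqs[i] + k1 * (1 - b + b * docLen / avgDocLen);
            sum += idf * termFreqs[i] * (k1 + 1) / norm;
        }
        return sum;
    }

    // Regular score plus a weighted ngram bonus; the weight 0.25 is an assumed value.
    public static double combined(int[] regularTf, int[] regularDf,
                                  int[] ngramTf, int[] ngramDf,
                                  int docLen, double avgDocLen, int docCount) {
        double k1 = 1.2;              // assumed, common default
        double b = 0.75;              // assumed, common default
        double ngramBonusWeight = 0.25; // hypothetical
        double regular = bm25(regularTf, regularDf, docLen, avgDocLen, docCount, k1, b);
        double ngram = bm25(ngramTf, ngramDf, docLen, avgDocLen, docCount, k1, b);
        return regular + ngramBonusWeight * ngram;
    }

    public static void main(String[] args) {
        double score = combined(new int[]{3}, new int[]{10},
                                new int[]{1}, new int[]{5},
                                100, 120.0, 1000);
        System.out.println(score);
    }
}
```

Keeping the two sums apart means a document that lacks the ngram still gets its full regular score, and a matching ngram can only raise it.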
Viktor Lofgren
c583a538b1
(search) Add implicit coherence constraints based on segmentation
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d
(index) Fix priority search terms
...
This functionality fell into disrepair some time ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
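The behavior described in 155be1078d can be sketched as follows: mandatory terms decide whether a document matches at all, while priority terms only add to the score when present. `PriorityTermSketch`, the additive boost and its size are hypothetical; the commit does not describe how the boost is computed.

```java
import java.util.OptionalDouble;
import java.util.Set;

public class PriorityTermSketch {
    // Returns empty if a mandatory term is missing; otherwise the base score
    // plus a fixed boost for every priority term found in the document.
    public static OptionalDouble score(Set<String> docTerms,
                                       Set<String> mandatory,
                                       Set<String> priority,
                                       double baseScore,
                                       double boostPerTerm) {
        if (!docTerms.containsAll(mandatory)) {
            return OptionalDouble.empty();
        }
        long hits = priority.stream().filter(docTerms::contains).count();
        return OptionalDouble.of(baseScore + boostPerTerm * hits);
    }

    public static void main(String[] args) {
        Set<String> mandatory = Set.of("search");
        Set<String> priority = Set.of("engine");
        System.out.println(score(Set.of("search", "index"), mandatory, priority, 1.0, 0.5));
        System.out.println(score(Set.of("search", "engine"), mandatory, priority, 1.0, 0.5));
    }
}
```

A document without the priority term still matches; it just ranks below an otherwise equal document that contains it.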
Viktor Lofgren
6efc0f21fe
(index) Clean up data model
...
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5f6a3ef9d0
(ngram) Correct |s|^|s|-normalization to use length and not count
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e23359bae9
(query, minor) Remove debug statement
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5531ed632a
(query, minor) Remove debug statement
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c96da0ce1e
(segmentation) Pick best segmentation using |s|^|s|-style normalization
...
This is better than doing all segmentations possible at the same time.
2024-04-24 14:44:38 +02:00
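A minimal sketch of the selection step in c96da0ce1e (with the length fix from 5f6a3ef9d0), assuming the score of a segmentation is the sum, over its multi-word segments, of |s|^|s| times the segment's corpus count, where |s| is the segment's length in words. That formula is one common form of this normalization and is an assumption here; `SegmentationSketch` and the count map are hypothetical.

```java
import java.util.List;
import java.util.Map;

public class SegmentationSketch {
    // Score = sum over multi-word segments of len^len * count(segment).
    // Note that len is the number of words in the segment, not its count.
    public static double score(List<String> segmentation, Map<String, Long> counts) {
        double total = 0;
        for (String segment : segmentation) {
            int len = segment.split(" ").length;
            if (len < 2) {
                continue; // single words contribute nothing
            }
            long count = counts.getOrDefault(segment, 0L);
            total += Math.pow(len, len) * count;
        }
        return total;
    }

    // Pick the single highest-scoring candidate instead of using all of them.
    public static List<String> best(List<List<String>> candidates, Map<String, Long> counts) {
        List<String> best = null;
        double bestScore = -1;
        for (List<String> candidate : candidates) {
            double s = score(candidate, counts);
            if (s > bestScore) {
                bestScore = s;
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = Map.of(
                "new york", 1000L,
                "york times", 200L,
                "new york times", 300L);
        List<List<String>> candidates = List.of(
                List.of("new", "york", "times"),
                List.of("new york", "times"),
                List.of("new", "york times"),
                List.of("new york times"));
        System.out.println(best(candidates, counts));
    }
}
```

With these made-up counts the candidates score 0, 4·1000 = 4000, 4·200 = 800 and 27·300 = 8100, so the three-word segment wins even though it is rarer than "new york"; that bias toward longer segments is the point of the |s|^|s| factor.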
Viktor Lofgren
7dd8c78c6b
(ngrams) Remove the vestigial logic for capturing permutations of n-grams
...
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8bf7d090fd
(qs) Clean up parsing code using new record matching
2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692
(search) Fix outdated assumptions about the results
...
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
e3316a3672
(index) Clean up new index query code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b
(qs, index) New query model integrated with index service.
...
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8cb9455c32
(qs, WIP) Fix edge cases in query compilation
...
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
dc65b2ee01
(qs, WIP) Clean up dead code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
98a1adbf81
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
0bd1e15cce
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
eda926767e
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
cd1a18c045
(qs, WIP) Break up code and tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6f567fbea8
(qs, WIP) Fix output determinism, fix tests
2024-04-24 14:44:38 +02:00
Viktor Lofgren
0ebadd03a5
(WIP) Query rendering finally beginning to look like it works
2024-04-24 14:44:38 +02:00
Viktor Lofgren
2253b556b2
WIP
2024-04-24 14:44:17 +02:00
Viktor Lofgren
6a7a7009c7
(convert) Initial integration of segmentation data into the converter's keyword extraction logic
2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd
(qs) Retire NGramBloomFilter, integrate new segmentation model instead
2024-04-24 14:44:17 +02:00
Viktor Lofgren
760b80659d
(WIP) Partial integration of new query expansion code into the query-service
2024-04-24 14:44:17 +02:00
Viktor Lofgren
04879c005d
(WIP) Improve data extraction from Wikipedia data
2024-04-24 14:44:17 +02:00
Viktor Lofgren
cb82927756
(WIP) Implement first take of new query segmentation algorithm
2024-04-24 14:44:17 +02:00
Viktor Lofgren
fe8d583fdd
(sys) Upgrade to JDK22
...
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
46423612e3
(refac) Merge service-discovery and service modules
...
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
9689f3faee
(domain-info) Fix incorrect array indexing
2024-02-29 18:56:09 +01:00
Viktor Lofgren
93fa58c93d
(domain-info) Fix incorrect array indexing
...
Using the id instead of idx when addressing the ranksArray caused exceptions.
2024-02-29 17:54:23 +01:00
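The class of bug fixed in 93fa58c93d can be shown with a small sketch: when ranks live in a dense array but domain ids are sparse, the array has to be addressed by position (idx), not by id. Only the name `ranksArray` comes from the commit message; `RankLookupSketch` and the id-to-idx map are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class RankLookupSketch {
    private final Map<Integer, Integer> idToIdx = new HashMap<>();
    private final double[] ranksArray;

    // domainIds[i] is the (sparse) id of the domain whose rank is ranks[i].
    public RankLookupSketch(int[] domainIds, double[] ranks) {
        this.ranksArray = ranks.clone();
        for (int idx = 0; idx < domainIds.length; idx++) {
            idToIdx.put(domainIds[idx], idx);
        }
    }

    public double rankOf(int domainId) {
        Integer idx = idToIdx.get(domainId);
        if (idx == null) {
            throw new IllegalArgumentException("Unknown domain id " + domainId);
        }
        // Correct: address by idx. ranksArray[domainId] would read the wrong
        // slot or throw ArrayIndexOutOfBoundsException for large ids.
        return ranksArray[idx];
    }

    public static void main(String[] args) {
        RankLookupSketch lookup = new RankLookupSketch(new int[]{1000, 42}, new double[]{0.5, 0.9});
        System.out.println(lookup.rankOf(1000)); // prints 0.5
    }
}
```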
Viktor Lofgren
41abd8982f
(math) Clean up error handling
2024-02-28 14:19:50 +01:00
Viktor Lofgren
9415539b38
(docs) Update docs
2024-02-28 12:25:19 +01:00
Viktor Lofgren
84bab2783d
(docs) Fix fake news in docs
2024-02-28 12:16:45 +01:00
Viktor Lofgren
9f1649636e
Clean up documentation and rename domain-links to link-graph
2024-02-28 11:40:39 +01:00
Viktor Lofgren
c943954bb4
(domain-info) Reduce memory usage
2024-02-27 21:22:21 +01:00
Viktor Lofgren
eaf836dc66
(service/grpc) Reduce thread count
...
By default, Netty and gRPC spawn a very large number of threads on CPUs with many cores, which adds up to a fair bit of RAM usage.
Add custom executors that throttle this behavior.
2024-02-27 21:22:21 +01:00
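One way to throttle thread creation as described in eaf836dc66 is a fixed-size pool whose size is capped regardless of core count. The sketch below uses only `java.util.concurrent`; `BoundedExecutors`, the cap and the thread naming are assumptions, and the commit does not show how its executors are actually built.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedExecutors {
    // Never more than maxThreads, never fewer than one.
    public static int boundedThreadCount(int availableCores, int maxThreads) {
        return Math.max(1, Math.min(availableCores, maxThreads));
    }

    // Fixed pool of daemon threads, sized by core count but capped at maxThreads.
    public static ExecutorService create(String name, int maxThreads) {
        int n = boundedThreadCount(Runtime.getRuntime().availableProcessors(), maxThreads);
        AtomicInteger seq = new AtomicInteger();
        ThreadFactory factory = runnable -> {
            Thread t = new Thread(runnable, name + "-" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
        return Executors.newFixedThreadPool(n, factory);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService executor = create("grpc-worker", 8); // cap of 8 is hypothetical
        System.out.println(executor.submit(() -> Thread.currentThread().getName()).get());
        executor.shutdown();
    }
}
```

An executor like this can be handed to gRPC through `ServerBuilder.executor(...)` or `ManagedChannelBuilder.executor(...)`, so that a 64-core machine does not end up with dozens of mostly idle threads, each reserving its own stack.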