Viktor Lofgren
760b80659d
(WIP) Partial integration of new query expansion code into the query-service
2024-04-24 14:44:17 +02:00
Viktor Lofgren
04879c005d
(WIP) Improve data extraction from wikipedia data
2024-04-24 14:44:17 +02:00
Viktor Lofgren
cb82927756
(WIP) Implement first take of new query segmentation algorithm
2024-04-24 14:44:17 +02:00
Viktor Lofgren
8b9629f2f6
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb
(build) Java 22 and its consequences have been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make the Java version configurable in the root build.gradle.
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad
(crawler) Emulate if-modified-since for domains that don't support the header
...
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
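The commit message for dcf9d9caad above does not describe how the emulation works. A minimal sketch of one plausible approach, assuming the crawler can obtain a Last-Modified header from a cheap probe request and knows when it last fetched the document: compare the two and skip the full fetch when nothing has changed. The class and method names are hypothetical and not taken from the Marginalia code base.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper, not the project's SoftIfModifiedSinceProber.
class SoftIfModifiedSinceSketch {

    /**
     * Returns true when the Last-Modified header says the document has not
     * changed since the previous fetch. A missing or unparseable header is
     * treated as "possibly changed", so the document is fetched again.
     */
    static boolean isUnchanged(String lastModifiedHeader, Instant previousFetch) {
        if (lastModifiedHeader == null) {
            return false;
        }
        try {
            // HTTP dates use the RFC 1123 format, e.g. "Mon, 22 Apr 2024 12:00:00 GMT"
            Instant lastModified = ZonedDateTime
                    .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            return !lastModified.isAfter(previousFetch);
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Instant previousFetch = Instant.parse("2024-04-23T00:00:00Z");
        System.out.println(isUnchanged("Mon, 22 Apr 2024 12:00:00 GMT", previousFetch));
        System.out.println(isUnchanged("Wed, 24 Apr 2024 12:00:00 GMT", previousFetch));
    }
}
```

Falling back to a full fetch on any doubt keeps the sketch conservative: a bad header costs one extra request rather than a stale document.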
Viktor Lofgren
7a69b76001
(crawler) Remove accidental log spam
2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f
(crawler) Code quality
2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
...
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
2024-04-22 14:31:05 +02:00
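For commit a28c6d7cfe above: a weak ETag is written with a W/ prefix in front of the quoted tag. A minimal sketch of the string handling the commit title describes is shown here. The commit does not state why the prefix is stripped, and the class and method names are hypothetical.

```java
// Hypothetical helper illustrating the W/-prefix handling only.
class EtagSketch {

    /** Removes a leading weak-validator marker ("W/") from an ETag value, if present. */
    static String stripWeakPrefix(String etag) {
        if (etag == null) {
            return null;
        }
        String trimmed = etag.trim();
        return trimmed.startsWith("W/") ? trimmed.substring(2) : trimmed;
    }

    public static void main(String[] args) {
        // The stored ETag from an earlier response, reused as the If-None-Match value
        System.out.println(stripWeakPrefix("W/\"abc123\""));
        System.out.println(stripWeakPrefix("\"abc123\""));
    }
}
```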
Viktor Lofgren
d816f048f5
(crawler) Ensure all appropriate headers are recorded on the request
2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036
(crawler/converter) Remove legacy junk from parquet migration
2024-04-22 12:34:28 +02:00
Viktor Lofgren
0a73b02a00
(query) Mark flaky test, correct assert on test
2024-04-21 12:30:14 +02:00
Viktor Lofgren
8769704462
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-21 12:29:25 +02:00
Viktor Lofgren
214551f1df
(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.
2024-04-19 20:36:01 +02:00
Viktor Lofgren
2cc74c005a
(query) Always generate an ngram alternative; suppress generation of multiple identical query branches
2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2
(ranking) Set regularMask correctly
2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0
(ranking) Cleanup
2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314
(ranking) Suppress NaNs in ranking output
2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898
(index, bugfix) Pass url quality to query service
2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82
(qs) Additional info in query debug UI
2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840
(qs) Additional info in query debug UI
2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422
(qs) Basic query debug feature
2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c
(proto) Improve handling of omitted parameters
2024-04-18 10:47:12 +02:00
Viktor Lofgren
975d92912c
(qs) Improve logging
2024-04-18 10:44:08 +02:00
Viktor Lofgren
8bbaf457de
(query) Minor code cleanup
2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34
(query) Modify tokenizer to match the behavior of the sentence extractor
...
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-17 17:54:32 +02:00
Viktor Lofgren
d64bd227cf
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-17 17:40:16 +02:00
Viktor Lofgren
c5ab0a9054
(index) Add jaccard index term to boost results based on term overlap
2024-04-17 16:50:26 +02:00
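For commit c5ab0a9054 above: the Jaccard index of two sets is the size of their intersection divided by the size of their union. The sketch below computes it over sets of strings purely for illustration. Which sets the index code actually compares, and how the result is weighted in ranking, is not stated in the commit message, so the names and inputs here are assumptions.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration of the Jaccard index, not the index service's code.
class JaccardSketch {

    /** Returns |A ∩ B| / |A ∪ B|, or 0.0 by convention when both sets are empty. */
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<String> queryTerms = Set.of("plato", "republic");
        Set<String> documentTerms = Set.of("plato", "republic", "book", "review");
        System.out.println(jaccard(queryTerms, documentTerms));
    }
}
```

A value near 1 means the two sets overlap almost completely, and a value of 0 means they share nothing, which is what makes the measure usable as a bounded boost.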
Viktor Lofgren
dac948973d
(index) Remove position overlap check, coherences will do the work instead
2024-04-17 14:20:01 +02:00
Viktor Lofgren
9d008d1d6f
(index) Omit absent terms from coherence checks
2024-04-17 14:12:16 +02:00
Viktor Lofgren
f52457213e
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-17 14:05:02 +02:00
Viktor Lofgren
579295a673
(search) Add implicit coherence constraints based on segmentation
2024-04-17 14:03:35 +02:00
Viktor Lofgren
af8ff8ce99
(index) Improve recall for small queries
...
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-16 22:51:03 +02:00
Viktor Lofgren
7fa3e86e64
(index) Remove dead code
...
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-16 19:59:27 +02:00
Viktor Lofgren
3359f72239
(index) Experimental performance regression fix
2024-04-16 19:48:14 +02:00
Viktor Lofgren
41fa154aa6
(test) Fix broken test
2024-04-16 19:48:14 +02:00
Viktor Lofgren
deaba0152d
(index) Explicitly free LongQueryBuffers
2024-04-16 19:23:00 +02:00
Viktor Lofgren
feaef6093e
(index) Fix term coherence evaluation
...
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-16 18:07:43 +02:00
Viktor Lofgren
078fa4fdd0
(valuation) Impose stronger constraints on locality of terms
...
Clean up logic a bit
2024-04-16 17:22:58 +02:00
Viktor Lofgren
2dc77a0638
(valuation) Impose stronger constraints on locality of terms
2024-04-16 17:15:21 +02:00
Viktor
cfd9a7187f
(query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation
...
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new, cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.
A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.
A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.
The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".
This compiled query is passed to the index, which parses the expression and uses it to execute the search and rank the results.
2024-04-16 15:31:05 +02:00
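The expansion described in the merge commit cfd9a7187f above can be sketched as a cartesian product over the groups of the compiled expression. This is a simplified illustration under stated assumptions, not the index's parser: it handles only flat (non-nested) groups with whitespace-separated tokens, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; assumes flat groups and whitespace-separated tokens.
class QueryExpansionSketch {

    /** Expands e.g. "( a | b ) ( c d | cd )" into every concrete query it describes. */
    static List<String> expand(String expr) {
        String[] tokens = expr.trim().split("\\s+");
        List<String> results = new ArrayList<>();
        results.add("");

        int i = 0;
        while (i < tokens.length) {
            List<String> alternatives = new ArrayList<>();
            if (tokens[i].equals("(")) {
                i++; // skip "("
                StringBuilder current = new StringBuilder();
                while (i < tokens.length && !tokens[i].equals(")")) {
                    if (tokens[i].equals("|")) {
                        // "|" closes the current alternative and starts the next one
                        alternatives.add(current.toString());
                        current.setLength(0);
                    } else {
                        if (current.length() > 0) current.append(' ');
                        current.append(tokens[i]);
                    }
                    i++;
                }
                if (i == tokens.length) {
                    throw new IllegalArgumentException("Unbalanced parenthesis in: " + expr);
                }
                alternatives.add(current.toString());
                i++; // skip ")"
            } else {
                // A bare token is a group with a single alternative
                alternatives.add(tokens[i]);
                i++;
            }

            // Combine every query built so far with every alternative of this group
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                for (String alt : alternatives) {
                    next.add(prefix.isEmpty() ? alt : prefix + " " + alt);
                }
            }
            results = next;
        }
        return results;
    }

    public static void main(String[] args) {
        expand("( wiby | marginalia | kagi ) ( search engine | searchengine )")
                .forEach(System.out::println);
    }
}
```

For the example from the commit message, the sketch yields the same six queries in the same order: three alternatives in the first group times two in the second.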
Viktor Lofgren
f434a8b492
(build) Upgrade jib plugin version
2024-04-16 15:25:23 +02:00
Viktor Lofgren
d2658d6f84
(sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier.
2024-04-16 13:25:15 +02:00
Viktor Lofgren
8c559c8121
(conf) Add additional logic for discovering system root
2024-04-16 12:37:18 +02:00
Viktor Lofgren
2353c73c57
(encyclopedia) Index the full articles
...
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00