Commit Graph

1613 Commits

Author SHA1 Message Date
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b (index) Optimize DocumentSpan 2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317 (index) Remove vestigial parameter 2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18 (index) Correct weightedCounts calculations 2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e (index) Optimize ranking calculations 2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74 (index) Optimize ranking calculations 2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96 (index) Adjust sensible defaults for ranking parameters 2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731 (index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc (index) Try harmonic mean for avgMinDist 2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667 (index) Try harmonic mean for avgMinDist 2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411 (index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches 2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84 (index) Correct handling of full phrase match group 2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835 (index) Give ranking components more consistent names 2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc (index) Fix verbatim match score after moving full phrase group to a separate entity 2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5 (index) Address broken tests
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572 (search) Clean up inconsistent usage of MathClient in SearchOperator
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0 (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694 (search) Clean up SearchQueryIndexService and surrounding code 2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea (slop) Replace SlopPageRef<T> with SlopTable.Ref<T> 2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842 (*) Upgrade slop library -> 0.0.5 2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937 (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c (*) Refactor termCoherences and rename them to phrase constraints. 2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, stopword handling is moved to the index where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97 (index) Attenuate bm25 score based on query length 2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31 (index) Add body position match to qdebug fields 2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276 (index) Consume the new 'body' span in index to make it used in ranking 2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a (*) Clean up 2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe (index) Add index-side deduplication in selectBestResults 2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b (index) Add more qdebug factors 2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044 (index) Give BODY matches a verbatim match value 2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52 (index) Correct handling of firstPosition to avoid d/z 2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab (index) Simplify verbatim match calculation 2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49 (query-service) Clean up qdebug UI a bit 2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1 (index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886 (index) Backport bugfix from term-positions branch
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search.  This is no bueno.

This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
Viktor Lofgren
df89661ed2 (index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin 2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d (search-query) Always generate the "all"-segmentation 2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593 (wip) Repair qdebug utility and show new ranking details 2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5 (index) Remove intermediate models 2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b (index) Don't load fwd index offsets into a hash table at start.
This makes the service take forever to start up.  Memory map the data instead and binary search.  This is a bit slower, but not by much.
2024-08-06 11:16:28 +02:00
Viktor Lofgren
df6a05b9a7 (index) Avoid hypothetical divide-by-zero in tcfAvgDist 2024-08-06 10:55:57 +02:00
Viktor Lofgren
8569bb8e11 (index) Avoid divide-by-zero when minDist returns 0 2024-08-06 10:34:05 +02:00
Viktor Lofgren
ca6e2db2b9 (index) Include external link texts in verbatim score 2024-08-06 10:23:23 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c (slop) Update readme 2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b (slop) VarintLE implementation, correct enum8 column 2024-08-04 10:57:52 +02:00
Viktor Lofgren
ee49c01d86 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:47:23 +02:00
Viktor Lofgren
b21f8538a8 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:41:38 +02:00
Viktor Lofgren
dd15676d33 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:18:04 +02:00
Viktor Lofgren
ec5a17ad13 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:07:02 +02:00
Viktor Lofgren
e48f52faba (experiment) Add add-hoc filter runner 2024-08-03 13:24:03 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
bf26ead010 (index) Remove hasPrioTerm check as we should sort this out in ranking 2024-08-03 13:06:50 +02:00
Viktor Lofgren
c2cedfa83c (index) Experimental ranking signals 2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf (index) Return some variant of the previously removed 'Bm25PrioGraphVisitor' 2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5 (index) Adding a few experimental relevance signals 2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120 (loader) Clean up 2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1 (loader) Reduce log spam 2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da (converter) Fix exception handling while reading crawl data 2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8 (perf) Reduce DomPruningFilter hash table recalculation 2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec (index) Add span information for anchor tags, tweak ranking params 2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e (index) Coherences need to be able to deal with null values among positions 2024-07-31 22:00:14 +02:00
Viktor Lofgren
696fd8909d (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172 (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 20:21:23 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752 (build) Upgrade jib to 3.4.3 2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca (actor) Reset NEW flag earlier when auto-deletion is disabled
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940 (index) Re-enable parallelization of index construction, disable parallel sorting during construction
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance.  It got worse, so the change is reverted.

Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a (search) Restrict site-search by passing domain id along with the site:-term
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa (converter) Correct sort order of files in control storage GUI
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today was typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8 (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5 (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1 (loader) Tidy up code 2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f (loader) Move rssFeeds to a different column group to avoid errors 2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef (slop) Fix test that broke when we split get into int get() and long getLong() 2024-07-28 21:20:37 +02:00