Commit Graph

1654 Commits

Author SHA1 Message Date
Viktor Lofgren
99523ca079 (query-parser) Remove test that is no longer relevant 2024-09-10 10:35:56 +02:00
Viktor Lofgren
35f49bbb60 (coded-sequence) Add equals and hashCode to VCS 2024-09-10 10:33:56 +02:00
Viktor Lofgren
50ec922c2b (index) Fix broken index tests
Also cleaned up the tests to be less fragile to ranking algorithm changes.
2024-09-10 10:23:46 +02:00
Viktor Lofgren
cfbbeaa26e (ranking) Clean up ranking test code 2024-09-08 15:46:51 +02:00
Viktor Lofgren
a3b0189934 Fix build errors after merge 2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8 Merge branch 'master' into term-positions
# Conflicts:
#	code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
#	code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
#	code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
#	code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
f78ef36cd4 (slop) Upgrade to 0.0.8, add encodings to string columns. 2024-09-04 15:19:00 +02:00
Viktor Lofgren
dc67c81f99 (summary) Fix a few cases where noscript tags would sometimes be used for document summary 2024-09-04 15:00:40 +02:00
Viktor Lofgren
50ba8fd099 (query-parsing) Correct handling of trailing parentheses 2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68 (query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar 2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
7a69dff6cf (search) Correct handling of languages on fandom 2024-09-01 13:46:01 +02:00
Viktor Lofgren
bfb7ed2c99 (search) Translate cursed medium URLs to scribe.rip links via the search application 2024-09-01 13:32:14 +02:00
Viktor Lofgren
e19dc9b13e (search) Translate cursed fandom URLs to breezewiki links via the search application 2024-09-01 13:23:35 +02:00
Viktor Lofgren
74148c790e (crawler) Pull additional new domains from node-affinity 0
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110 (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
Viktor Lofgren
ab6a4b1749 (control) Correct id value for domain addition tool 2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7 (control) Add utility for adding domains from an external URL 2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5 (converter) Fix bug where sideloaded reddit content was errouneously categoriszed as wiki-generated. 2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650 (crawler) Grab favicons as part of root sniff 2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e (control) New view for domains
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca (control) New view for domains
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
bb5d946c26 (index, EXPERIMENTAL) Clean up ranking code 2024-08-29 11:34:23 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
77efce0673 (paper-doll) Fix compilation 2024-08-26 12:51:29 +02:00
Viktor Lofgren
67a98fb0b0 (coded-sequence) Handle weird legacy HTML that puts everything in a heading 2024-08-26 12:49:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
fdf05cedae (index) Optimize DocumentSpan.countIntersections 2024-08-25 14:12:30 +02:00
Viktor Lofgren
9c5f463775 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:59:11 +02:00
Viktor Lofgren
893fae6d59 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:51:43 +02:00
Viktor Lofgren
5660f291af (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:43:29 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
Viktor Lofgren
d94373f4b1 (index) Optimize calculatePositionsMask 2024-08-25 13:24:37 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b (index) Optimize DocumentSpan 2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317 (index) Remove vestigial parameter 2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18 (index) Correct weightedCounts calculations 2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e (index) Optimize ranking calculations 2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74 (index) Optimize ranking calculations 2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96 (index) Adjust sensible defaults for ranking parameters 2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731 (index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc (index) Try harmonic mean for avgMinDist 2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667 (index) Try harmonic mean for avgMinDist 2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411 (index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches 2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84 (index) Correct handling of full phrase match group 2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835 (index) Give ranking components more consistent names 2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc (index) Fix verbatim match score after moving full phrase group to a separate entity 2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5 (index) Address broken tests
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572 (search) Clean up inconsistent usage of MathClient in SearchOperator
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0 (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694 (search) Clean up SearchQueryIndexService and surrounding code 2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea (slop) Replace SlopPageRef<T> with SlopTable.Ref<T> 2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842 (*) Upgrade slop library -> 0.0.5 2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937 (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c (*) Refactor termCoherences and rename them to phrase constraints. 2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, stopword handling is moved to the index where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97 (index) Attenuate bm25 score based on query length 2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31 (index) Add body position match to qdebug fields 2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276 (index) Consume the new 'body' span in index to make it used in ranking 2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a (*) Clean up 2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe (index) Add index-side deduplication in selectBestResults 2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b (index) Add more qdebug factors 2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044 (index) Give BODY matches a verbatim match value 2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52 (index) Correct handling of firstPosition to avoid d/z 2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab (index) Simplify verbatim match calculation 2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49 (query-service) Clean up qdebug UI a bit 2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1 (index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886 (index) Backport bugfix from term-positions branch
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search.  This is no bueno.

This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
Viktor Lofgren
df89661ed2 (index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin 2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d (search-query) Always generate the "all"-segmentation 2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593 (wip) Repair qdebug utility and show new ranking details 2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5 (index) Remove intermediate models 2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b (index) Don't load fwd index offsets into a hash table at start.
This makes the service take forever to start up.  Memory map the data instead and binary search.  This is a bit slower, but not by much.
2024-08-06 11:16:28 +02:00
Viktor Lofgren
df6a05b9a7 (index) Avoid hypothetical divide-by-zero in tcfAvgDist 2024-08-06 10:55:57 +02:00
Viktor Lofgren
8569bb8e11 (index) Avoid divide-by-zero when minDist returns 0 2024-08-06 10:34:05 +02:00