Viktor Lofgren
8569bb8e11
(index) Avoid divide-by-zero when minDist returns 0
2024-08-06 10:34:05 +02:00
Viktor Lofgren
ca6e2db2b9
(index) Include external link texts in verbatim score
2024-08-06 10:23:23 +02:00
Viktor Lofgren
2080e31616
(converter) Store link text positions
...
To help offer verbatim matches for external link texts, we assign the link texts positions slightly past the end of the actual document. This commit does not integrate the information into the ranking.
2024-08-04 12:00:29 +02:00
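A minimal sketch of the idea in the commit body above, under stated assumptions: the class name, the method name and the gap size are hypothetical and not taken from the converter code. Link text words get positions that start a fixed gap past the last document position, so they can never be adjacent to real document text.

```java
import java.util.ArrayList;
import java.util.List;

public class LinkTextPositions {
    // Hypothetical gap between the end of the document and the first link text position
    public static final int GAP = 16;

    // Returns one position per link text word, starting GAP positions past the document's end
    public static List<Integer> assignPositions(int documentLength, int linkTextWordCount) {
        List<Integer> positions = new ArrayList<>();
        int start = documentLength + GAP;
        for (int i = 0; i < linkTextWordCount; i++) {
            positions.add(start + i);
        }
        return positions;
    }
}
```

With a document of 100 positions and a three-word link text, this sketch yields positions 116, 117 and 118.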
Viktor Lofgren
c379be846c
(slop) Update readme
2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b
(slop) VarintLE implementation, correct enum8 column
2024-08-04 10:57:52 +02:00
Viktor Lofgren
ee49c01d86
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:47:23 +02:00
Viktor Lofgren
b21f8538a8
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:41:38 +02:00
Viktor Lofgren
dd15676d33
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:18:04 +02:00
Viktor Lofgren
ec5a17ad13
(index) Tune ranking for verbatim matches in the title, rewarding shorter titles
2024-08-03 14:07:02 +02:00
Viktor Lofgren
e48f52faba
(experiment) Add ad-hoc filter runner
2024-08-03 13:24:03 +02:00
Viktor Lofgren
8462e88b8f
(index) Add min-dist factor and adjust rankings
2024-08-03 13:07:00 +02:00
Viktor Lofgren
bf26ead010
(index) Remove hasPrioTerm check as we should sort this out in ranking
2024-08-03 13:06:50 +02:00
Viktor Lofgren
c2cedfa83c
(index) Experimental ranking signals
2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361
(index) Experimental ranking signals
2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf
(index) Return some variant of the previously removed 'Bm25PrioGraphVisitor'
2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5
(index) Adding a few experimental relevance signals
2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242
(coded-sequence) Varint sequence
2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120
(loader) Clean up
2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1
(loader) Reduce log spam
2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da
(converter) Fix exception handling while reading crawl data
2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8
(perf) Reduce DomPruningFilter hash table recalculation
2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f
(perf) Code was still spending a lot of time resolving charsets
...
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec
(index) Add span information for anchor tags, tweak ranking params
2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e
(index) Coherences need to be able to deal with null values among positions
2024-07-31 22:00:14 +02:00
Viktor Lofgren
696fd8909d
(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones
2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172
(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones
2024-07-31 20:21:23 +02:00
Viktor Lofgren
285e657f68
Merge branch 'master' into term-positions
...
# Conflicts:
# code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752
(build) Upgrade jib to 3.4.3
2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca
(actor) Reset NEW flag earlier when auto-deletion is disabled
...
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940
(index) Re-enable parallelization of index construction, disable parallel sorting during construction
...
The first change restores earlier behaviour: index construction used to run in parallel, and was switched to run sequentially to see how that would affect performance. Performance got worse, so that change is reverted.
Sorting in parallel, however, is likely not a good idea as it leads to a lot of I/O thrashing, so sorting is now done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a
(search) Restrict site-search by passing domain id along with the site:-term
...
This helps site:-queries for domains that do not have a subdomain avoid dragging up subdomains as well, since subdomains are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa
(converter) Correct sort order of files in control storage GUI
...
Previously it was sorted on a field that would switch to showing just the time whenever the date was the same as the current date, leading to a bizarre sort order where files created today were typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
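The bug described in the commit body above can be illustrated with a small sketch. All names here are hypothetical and not taken from the control GUI code, and newest-first is an arbitrary choice for the example. The point is only that sorting uses the underlying timestamp rather than the shortened display string.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

public class StorageFileSort {
    // Hypothetical file entry: the raw creation time plus the shortened string shown in the GUI
    public record FileEntry(String name, LocalDateTime created, String displayTime) {}

    // Sorts newest first on the timestamp, ignoring the display string entirely
    public static List<FileEntry> newestFirst(List<FileEntry> files) {
        List<FileEntry> sorted = new ArrayList<>(files);
        sorted.sort((a, b) -> b.created().compareTo(a.created()));
        return sorted;
    }
}
```

Sorting on `displayTime` instead would compare strings such as "09:00" against "2024-07-29", which is what produces the mixed ordering the commit describes.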
Viktor Lofgren
b316b55be9
(index) Experimental initial integration of document spans into index
2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7
(restructure) Clean up repo by moving stray features into converter-process and crawler-process
2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8
(converter/loader) Simplify document record writing to not require predicated reads
2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc
(converter/loader) Simplify document record writing to not require predicated reads
2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144
(slop) Support for nested array types and array-of-object types
...
Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5
(slop-models) Fix incorrect column grouping leading to errors in converter
2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f
(slop-models) Fix incorrect column grouping leading to errors in converter
2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1
(loader) Tidy up code
2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f
(loader) Move rssFeeds to a different column group to avoid errors
2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef
(slop) Fix test that broke when we split get into int get() and long getLong()
2024-07-28 21:20:37 +02:00
Viktor Lofgren
9685993adb
(loader) Add spans to a different column group from spanCodes, as they are not in sync
2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8
(loader) Additional tracking for the control GUI
2024-07-28 21:19:45 +02:00
Viktor Lofgren
314a901bf0
(slop) Clean up build.gradle from unnecessary copy-paste garbage
2024-07-28 13:22:20 +02:00
Viktor Lofgren
1caad7e19e
(slop) Update existing code to use the altered Slop interfaces
2024-07-28 13:21:08 +02:00
Viktor Lofgren
e585116dab
(slop) Add 32 bit read method for Varint along with the old 64 bit version
2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654
(slop) Add signed 16 bit column type "short"
2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9
(slop) Improve Conveniences for Enum
...
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
d05a2e57e9
(index-forward) Spans Writer should not be in the index page loop context
2024-07-27 15:17:04 +02:00