Commit Graph

2247 Commits

Author SHA1 Message Date
Viktor Lofgren
c2cedfa83c (index) Experimental ranking signals 2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf (index) Return some variant of the previously removed 'Bm25PrioGraphVisitor' 2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5 (index) Adding a few experimental relevance signals 2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120 (loader) Clean up 2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1 (loader) Reduce log spam 2024-08-02 12:21:03 +02:00
Viktor Lofgren
1a268c24c8 (perf) Reduce DomPruningFilter hash table recalculation 2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec (index) Add span information for anchor tags, tweak ranking params 2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e (index) Coherences need to be able to deal with null values among positions 2024-07-31 22:00:14 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752 (build) Upgrade jib to 3.4.3 2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca (actor) Reset NEW flag earlier when auto-deletion is disabled
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940 (index) Re-enable parallelization of index construction, disable parallel sorting during construction
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance.  It got worse, so the change is reverted.

Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a (search) Restrict site-search by passing domain id along with the site:-term
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa (converter) Correct sort order of files in control storage GUI
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today was typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8 (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5 (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1 (loader) Tidy up code 2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f (loader) Move rssFeeds to a different column group to avoid errors 2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef (slop) Fix test that broke when we split get into int get() and long getLong() 2024-07-28 21:20:37 +02:00
Viktor Lofgren
9685993adb (loader) Add spans to a different column group from spanCodes, as they are not in sync 2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8 (loader) Additional tracking for the control GUI 2024-07-28 21:19:45 +02:00
Viktor Lofgren
314a901bf0 (slop) Clean up build.gradle from unnecessary copy-paste garbage 2024-07-28 13:22:20 +02:00
Viktor Lofgren
1caad7e19e (slop) Update existing code to use the altered Slop interfaces 2024-07-28 13:21:08 +02:00
Viktor Lofgren
e585116dab (slop) Add 32 bit read method for Varint along with the old 64 bit version 2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654 (slop) Add signed 16 bit column type "short" 2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9 (slop) Improve Conveniences for Enum
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
d05a2e57e9 (index-forward) Spans Writer should not be in the index page loop context 2024-07-27 15:17:04 +02:00
Viktor Lofgren
f8684118f3 (slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
2e1f669aea (slop) Remove additional vestigial seek() implementations 2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308 (slop) Introduce table concept to keep track of positions and simplify closing
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.

The second most common error is forgetting to close one of the columns in a reader or writer.

To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410 (slop) Translate nulls to empty strings when passed to the StringColumnWriters. 2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469 (slop) Handle empty compressed files correctly
The CompressingStorageReader would incorrectly report having data when a file was empty.  Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac (slop) First commit of slop library
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
60ef826e07 (loader) Add heartbeat to update domain-ids step 2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e (loader) Add heartbeat to update domain-ids step 2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
7a1edc0880 (term-freq) Reduce the number of low-relevance words in the dictionary
Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.
2024-07-19 12:23:28 +02:00
Viktor Lofgren
b812e96c6d (language-processing) Select the appropriate language filter
The incorrect filter was selected based on the provided parameter, this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00