Commit Graph

236 Commits

Author SHA1 Message Date
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
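The position assignment described in this commit could look roughly like the sketch below. The class name, method name, and gap size are all hypothetical; the commit message only says the positions land "a bit after" the end of the document.

```java
// Hypothetical sketch; not taken from the converter code.
class LinkTextPositions {
    // Assumed gap between the last real document position and the first
    // link text position. The actual value is not stated in the commit.
    static final int GAP = 16;

    /** Assign consecutive positions to link text words, starting past the document end. */
    static int[] assign(int documentLength, int linkTextWordCount) {
        int[] positions = new int[linkTextWordCount];
        int start = documentLength + GAP;
        for (int i = 0; i < linkTextWordCount; i++) {
            positions[i] = start + i;
        }
        return positions;
    }
}
```

Keeping a gap means a phrase match can never straddle the boundary between the document body and the link text.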
Viktor Lofgren
e48f52faba (experiment) Add ad-hoc filter runner 2024-08-03 13:24:03 +02:00
Viktor Lofgren
4430a39120 (loader) Clean up 2024-08-02 12:32:47 +02:00
Viktor Lofgren
1a268c24c8 (perf) Reduce DomPruningFilter hash table recalculation 2024-08-01 12:04:55 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8 (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5 (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 10:34:33 +02:00
Viktor Lofgren
7d51cf882f (loader) Move rssFeeds to a different column group to avoid errors 2024-07-28 21:30:10 +02:00
Viktor Lofgren
9685993adb (loader) Add spans to a different column group from spanCodes, as they are not in sync 2024-07-28 21:20:09 +02:00
Viktor Lofgren
1caad7e19e (slop) Update existing code to use the altered Slop interfaces 2024-07-28 13:21:08 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308 (slop) Introduce table concept to keep track of positions and simplify closing
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.

The second most common error is forgetting to close one of the columns in a reader or writer.

To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
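A minimal sketch of the idea behind this commit, assuming a table object that owns its columns and compares their positions on close. All names here (`SyncedColumnTable`, `IntColumn`, `register`) are hypothetical and do not reflect the actual Slop API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the slop library.
class SyncedColumnTable implements AutoCloseable {
    /** An in-memory column that tracks how many records were read or skipped. */
    static class IntColumn implements AutoCloseable {
        private final int[] data;
        private int position = 0;
        private boolean closed = false;

        IntColumn(int[] data) { this.data = data; }

        int get() { return data[position++]; }
        void skip(int n) { position += n; }
        int position() { return position; }
        boolean isClosed() { return closed; }

        @Override
        public void close() { closed = true; }
    }

    private final List<IntColumn> columns = new ArrayList<>();

    /** Register a column so that the table owns its lifecycle. */
    IntColumn register(IntColumn column) {
        columns.add(column);
        return column;
    }

    /** Close every registered column, then verify they all stopped at the same position. */
    @Override
    public void close() {
        for (IntColumn column : columns) {
            column.close();
        }
        for (IntColumn column : columns) {
            int expected = columns.get(0).position();
            if (column.position() != expected) {
                throw new IllegalStateException(
                        "columns out of sync: " + expected + " vs " + column.position());
            }
        }
    }
}
```

With this shape, a conditional read that forgets the matching `skip()` surfaces as an exception when the table is closed, and no column can be left open by accident.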
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant number of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
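To illustrate the bit set encoding mentioned in this commit, here is a small sketch using `java.util.BitSet`. It assumes there are only two separator kinds (space and comma), which is what makes one bit per word sufficient; the constants and class name are hypothetical.

```java
import java.util.BitSet;

// Hypothetical sketch; assumes exactly two kinds of separator.
class SeparatorBits {
    static final int SPACE = 0;
    static final int COMMA = 1;

    /** Encode one separator code per word as a single bit: set means comma, clear means space. */
    static BitSet encode(int[] separators) {
        BitSet bits = new BitSet(separators.length);
        for (int i = 0; i < separators.length; i++) {
            if (separators[i] == COMMA) {
                bits.set(i);
            }
        }
        return bits;
    }

    /** True if the word at wordIndex is followed by a comma. */
    static boolean isCommaAfter(BitSet bits, int wordIndex) {
        return bits.get(wordIndex);
    }
}
```

An `int[]` costs 32 bits per word, while the bit set costs one, which matters on an object that is already memory hungry.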
Viktor Lofgren
d36055a2d0 (keyword-extractor) Retire TfIdfHigh WordFlag
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
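Why eight flags is the magic number: each flag can take one bit of a byte. The sketch below shows one way to pack an `EnumSet` into a byte and back. The flag names are placeholders made up for illustration; only the count of eight comes from the commit message.

```java
import java.util.EnumSet;

// Hypothetical sketch; the flag names are placeholders, not the real WordFlags values.
class WordFlagPacking {
    enum Flag { TITLE, SUBJECT, NAME, SITE, URL_DOMAIN, URL_PATH, EXTERNAL_LINK, SYNTHETIC }

    /** Pack a set of flags into one byte, one bit per enum ordinal. */
    static byte pack(EnumSet<Flag> flags) {
        int bits = 0;
        for (Flag flag : flags) {
            bits |= 1 << flag.ordinal();
        }
        return (byte) bits;
    }

    /** Recover the flag set from a packed byte. */
    static EnumSet<Flag> unpack(byte packed) {
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        for (Flag flag : Flag.values()) {
            if ((packed & (1 << flag.ordinal())) != 0) {
                flags.add(flag);
            }
        }
        return flags;
    }
}
```

A ninth flag would have ordinal 8 and silently fall outside the byte, which is why retiring one flag is what makes the packing possible.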
Viktor Lofgren
0b31c4cfbb (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
Viktor Lofgren
6ee4d1eb90 (keyword) Increase the work area for position encoding
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
9b922af075 (converter) Amend existing modifications to use gamma coded positions lists
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
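For readers unfamiliar with gamma coding, the sketch below shows the general technique: Elias gamma coding applied to the gaps between ascending positions. It is an illustration only and makes no claim about the converter's actual bit layout; bits are kept in a `String` for readability where a real encoder would write to a byte buffer.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of Elias gamma coded position lists; not the project's implementation.
class GammaPositions {
    /** Encode strictly ascending positions (all >= 1) as gamma coded deltas. */
    static String encode(int[] positions) {
        StringBuilder bits = new StringBuilder();
        int previous = 0;
        for (int position : positions) {
            int delta = position - previous;
            if (delta < 1) {
                throw new IllegalArgumentException("positions must be strictly ascending and >= 1");
            }
            // Gamma code: (bit length - 1) zeros, followed by the binary digits of the value
            String binary = Integer.toBinaryString(delta);
            bits.append("0".repeat(binary.length() - 1)).append(binary);
            previous = position;
        }
        return bits.toString();
    }

    /** Decode a bit string produced by encode() back into absolute positions. */
    static List<Integer> decode(String bits) {
        List<Integer> positions = new ArrayList<>();
        int i = 0;
        int previous = 0;
        while (i < bits.length()) {
            int zeros = 0;
            while (bits.charAt(i) == '0') {
                zeros++;
                i++;
            }
            int delta = Integer.parseInt(bits.substring(i, i + zeros + 1), 2);
            i += zeros + 1;
            previous += delta;
            positions.add(previous);
        }
        return positions;
    }
}
```

Small gaps take very few bits (a gap of 1 is a single bit), so densely repeated terms compress well.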
Viktor Lofgren
619392edf9 (keywords) Add position information to keywords 2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68 (converter) Add position information to serialized document data
This is not hooked in yet, and the term metadata is still left intact.  It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
f83f777fff (converter) Experimental support for searching by URL
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
d12c77305c (btree) Clean up code 2024-05-18 18:03:17 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
38aedb50ac (converter) Do not suppress exceptions in the converter 2024-04-30 18:24:35 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences have been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
b09ddd0036 (crawler/converter) Remove legacy junk from parquet migration 2024-04-22 12:34:28 +02:00
Viktor Lofgren
214551f1df (converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation. 2024-04-19 20:36:01 +02:00
Viktor Lofgren
2353c73c57 (encyclopedia) Index the full articles
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits.  This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00
Viktor Lofgren
bd0704d5a4 (*) Fix JDK22 migration issues
A few bizarre build errors cropped up when migrating to JDK22.  Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
002afca1c5 (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
d8f4e7d72b (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-03-19 10:42:09 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
29bf473d74 (encyclopedia) Add URLencoding to path element
This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.
2024-03-01 17:28:09 +01:00
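A sketch of the kind of fix this commit describes, assuming the article name is appended to a base URL as a single path element. The class, method names, and base URL are hypothetical. Note that `URLEncoder` produces form encoding, where a space becomes `+`, so the sketch rewrites that to `%20` for use in a path.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the sideloader's actual code.
class ArticleLinks {
    /** Percent-encode a single path element so that reserved characters cannot corrupt the URL. */
    static String encodePathElement(String element) {
        // A literal '+' in the input is already encoded as %2B at this point,
        // so any remaining '+' stands for a space.
        return URLEncoder.encode(element, StandardCharsets.UTF_8).replace("+", "%20");
    }

    /** Build a link from a base URL and an unencoded article name. */
    static String articleUrl(String baseUrl, String articleName) {
        return baseUrl + "/" + encodePathElement(articleName);
    }
}
```

For example, an article named `AC/DC` would otherwise be read as two path segments.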
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
f4ff7185f0 (refac) Move process-mqapi out of api directory 2024-02-23 11:18:29 +01:00
Viktor Lofgren
f8e7f75831 Move index to top level of code 2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63 * Extract the index functionality 2024-02-22 17:31:25 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor Lofgren
37a7296759 (sideload) Clean up the sideloading code
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.

The Reddit sideloader also properly sets js-attributes for the sideloaded documents.

The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
2024-02-17 14:32:36 +01:00