Commit Graph

67 Commits

Author SHA1 Message Date
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
b6d365bacd (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
002afca1c5 (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
2201b1a506 (refac) Clean up code issues 2024-02-23 11:39:19 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor Lofgren
6aee27a3f1 (*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style. 2023-12-29 16:36:01 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
bcad6492d6 (sideloader) Fix integration problems with sideloaders
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.

Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.

Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
2023-12-17 13:28:17 +01:00
Viktor Lofgren
b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
2023-12-11 19:32:58 +01:00
Viktor Lofgren
072b5fcd12 Implement Warc-recording wrapper for OkHttp3 client
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted.  This component is currently not hooked into anything.

The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.

The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
2023-12-08 13:49:16 +01:00
Viktor Lofgren
7aa2f80117 (domain) id.au should be treated as a TLD 2023-11-06 19:07:47 +01:00
Viktor Lofgren
2b77184281 (converter) Integrate atags with the topology field 2023-11-06 13:46:44 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
3889c4bdd9 (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
ec6c9bca62 (common) Fix factual error in comments 2023-09-24 19:40:19 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
eaeb23d41e (refactor) Remove converting-model package completely 2023-09-14 11:21:44 +02:00
Viktor Lofgren
1e6800565a (system) Remove EdgeId<T> and similar objects
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
c909120ae1 (search) Basic working integration of linkdb in search service 2023-08-24 17:24:56 +02:00
Viktor Lofgren
9894f37412 (index) Implement new URL ID coding scheme.
Also refactor along the way.  Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
c70670bacb (common) New UrlIdCodec class
Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.
2023-08-24 11:41:07 +02:00
Viktor Lofgren
7bb3e44a76 (common) Deprecate EdgeId and similar 2023-08-24 11:16:28 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
15912f31d0 (control-service) Basic GUI for deleting bad links from exploration mode 2023-08-21 18:35:26 +02:00
Viktor Lofgren
6cb784df75 (minor) Improve comment 2023-08-18 11:25:36 +02:00
Viktor Lofgren
c019a029ec (flags) Documentation and preventative bugfix 2023-08-17 17:42:31 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
ce293029c7 (converter) Treat adtech tracking as advertisement. 2023-08-09 14:23:53 +02:00
Viktor Lofgren
5411950b87 (minor) Tidy up EdgeDomain class a bit, no functional difference 2023-07-31 10:31:29 +02:00
Viktor Lofgren
77d5e39fe0 Make processed data Serializable 2023-07-28 18:11:19 +02:00
Viktor Lofgren
7470c170b1 (minor) EdgeUrl.parse() should deal with null 2023-07-24 15:06:57 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f92d8a0975 EdgeUrl conversion to/from java.net.URL 2023-06-27 10:57:54 +02:00
Viktor Lofgren
5abaf13192 Fix serialization bug with CompressedBigString 2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
54c2be893b TRIVIAL: Remove unused import. 2023-06-22 17:21:47 +02:00
Viktor Lofgren
b5ef67ed28 Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
d1a004bea6 (minor) Clean up StringPool 2023-06-19 17:58:19 +02:00
Viktor Lofgren
2cda57355a More word metadata tests 2023-05-28 11:57:06 +02:00
Viktor Lofgren
2ab26f37b8 Bug fix for document metadata encoding that breaks year based queries. 2023-04-14 16:56:49 +02:00
Viktor
a278fc6296
Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
716ab35b4e Search ranking debuggability improvements. 2023-04-02 13:43:24 +02:00
Viktor Lofgren
cc4e089a5d Consider average sentence length when selecting search results. This promotes proses over code listings, tabular data, etc. 2023-03-30 15:46:15 +02:00