This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.
Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.
The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reorered old results instead.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing lead to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```channelPool.call(grpcStub::method)
.async(executor) // <-- optional
.run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically means we could ship the same code either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Missing is documentation and testing, and some more breaking apart of code.
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".
It may come back in some shape or form in the future, with some additional tweaking of the rankings...
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.
The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.
These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.
The vintage filter is modified to add a temporal bias for the past.
Adds experimental support for clustering search results by e.g. domain. At a first stage, this is only enabled for the wiki and forum filters.
The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.
Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
!bang query handling seems to have fallen victim to an overzealous refactoring effort, and broken.
It's now repaired, and a test is in place to ensure we know if it breaks again.
Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.
The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true.
The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.
This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service.
A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.
The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.
This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates.
The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.
This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.
We do both ip2location and ASN data.
The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
The code now intercepts and deals with potential exceptions during the parsing of search parameters. This is in response to constant bad requests from bots which were cluttering the logs. A catch clause is added that suppresses these errors and redirects to the base URL.
The site info view can't blindly assume that every website supports https. To figure out which schema to use when linking to a site, execute a single-result search for site:domain.name and then grab the schema off the result.
To allow this, a count parameter is introduced to doSiteSearch() in SearchOperator.
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.
The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.
The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.
The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this.
The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.
Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
In the future this logic probably needs to move into a separate
service, as it's still quite slow to load. But this fixes response
times and DOS potential of previous version.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.