This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing lead to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```channelPool.call(grpcStub::method)
.async(executor) // <-- optional
.run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically means we could ship the same code either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Missing is documentation and testing, and some more breaking apart of code.
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
In the scenario where an operator
* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data
The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log making it impossible to load!
To mitigate the impact similar problems, the change saves a backup of the old crawl log, as well as complains about this happening.
More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state.
This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.
The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
Adds experimental sideloading support for pusshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, he sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will shortcircuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
The change also refactors the sideloading a bit since it was a bit messy.
* (executor-api) Make executor API talk GRPC
The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed...
The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.
ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().
The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after...
To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.
The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.
Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.
Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.
To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.
It also refactors out some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed allocateStorage. The storage was never temporary in any scenario...
It also doesn't take a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.
Removed the tool itself.
This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.
This commit extracts several previously hardcoded configuration properties, and makes then available through system.properties.
The documentation is updated to reflect the change.
Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base was lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.
The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.
The existing RELATED_ID field has too many semantics associated with them,
among other things the FSM code uses them this field in tracking state changes.
The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.
The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.
Finally the form for triggering an export was overhauled.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better.
It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.
The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true.
The commit also adds curl to the docker container, to enable docker health checks and interdependencies.