This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.
Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.
The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing lead to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```channelPool.call(grpcStub::method)
.async(executor) // <-- optional
.run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services. Theoretically means we could ship the same code either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Missing is documentation and testing, and some more breaking apart of code.
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
In the scenario where an operator
* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data
The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the crawl log making it impossible to load!
To mitigate the impact similar problems, the change saves a backup of the old crawl log, as well as complains about this happening.
More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state.
This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.
Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.
The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicing StackexchangeSideloader's cruder approach.
The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents.
The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.