Commit Graph

218 Commits

Author SHA1 Message Date
Viktor Lofgren
47dfbacb00 (conf) Introduce a new concept of node profiles
Node profiles decide which actors are started, and which views are available in the control GUI.  This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
2024-11-20 18:15:22 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
76e9053dd0 (setup) Move some file-downloads from setup script to the first boot of the control node of the system
We can only do this for files that are not required for unit tests.

As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions.  The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
8f367d96f8 Merge branch 'master' into term-positions
# Conflicts:
#	code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
#	code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
#	code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
#	code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
ab6a4b1749 (control) Correct id value for domain addition tool 2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7 (control) Add utility for adding domains from an external URL 2024-09-01 12:14:21 +02:00
Viktor Lofgren
b1bfe6f76e (control) New view for domains
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca (control) New view for domains
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
046ffc7752 (build) Upgrade jib to 3.4.3 2024-07-31 10:39:50 +02:00
Viktor Lofgren
6d7b886aaa (converter) Correct sort order of files in control storage GUI
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today was typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
d0d6bb173c (control) Fix warc data http status filter default value 2024-06-17 12:40:25 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
59ec70eb73 (*) Clean up code related to crawl parquet inspection 2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b (control) Improve pagination for crawl data inspector 2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee (control) Improve pagination for crawl data inspector 2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4 (control) Add filter functionality for crawl data inspector 2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c (control) Clean up UX for crawl data inspector 2024-05-21 18:27:24 +02:00
Viktor Lofgren
17dc00d05f (control) Partial implementation of inspection utility for crawl data
Uses duckdb and range queries to read the parquet files directly from the index partitions.

UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
7d1cafc070 (control) Add skip link for navigation in control GUI 2024-05-04 12:36:44 +02:00
Viktor Lofgren
4021a0ae98 (search) Add en-US language tags to all templates 2024-05-04 11:40:59 +02:00
Viktor Lofgren
2ad0bfda1e (*) Fix boot orchestration for the services
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.

A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database.  Move the first boot check into the MainClass instead of the Service constructor.

The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
4772e0b59d (service) Deprecate /public prefix on HTTP
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.

Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.

The change removes the path prefix, and updates the docker templates to reflect the change.  This will require a migration for existing systems.
2024-04-30 14:46:18 +02:00
Viktor Lofgren
6690e9bde8 (service) Ensure the service discovery starts early
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
2024-04-25 15:08:33 +02:00
Viktor Lofgren
32fe864a33 (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
212d101727 (control) GUI for exporting segmentation data from a wikipedia zim 2024-04-24 14:44:17 +02:00
Viktor Lofgren
f434a8b492 (build) Upgrade jib plugin version 2024-04-16 15:25:23 +02:00
Viktor Lofgren
fe8d583fdd (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
20fc0ef13c (gradle) Add task alias 'docker' for 'jibDockerBuild'
The change also moves the jib boilerplate to an include.
2024-02-28 11:59:15 +01:00
Viktor Lofgren
9f1649636e Clean up documentation and rename domain-links to link-graph 2024-02-28 11:40:39 +01:00
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
f7f0100174 (build) Make docker image registry and tag configurable in root build.gradle 2024-02-25 11:08:49 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
56d35aa596 (refac) Move execution API out of executor service 2024-02-23 13:26:11 +01:00
Viktor Lofgren
2201b1a506 (refac) Clean up code issues 2024-02-23 11:39:19 +01:00
Viktor Lofgren
5cdb07023b (refac) Clean up unused imports 2024-02-23 11:27:20 +01:00
Viktor Lofgren
f4ff7185f0 (refac) Move process-mqapi out of api directory 2024-02-23 11:18:29 +01:00
Viktor Lofgren
f8e7f75831 Move index to top level of code 2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63 * Extract the index functionality 2024-02-22 17:31:25 +01:00
Viktor Lofgren
3fd2a83184 * Extract the search-query function 2024-02-22 15:27:39 +01:00
Viktor Lofgren
66c1281301 (zk-registry) epic jak shaving WIP
Cleaning out a lot of old junk from the code, and one thing lead to another...

* Build is improved, now constructing docker images with 'jib'.  Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter.  Will now just spawn a java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```channelPool.call(grpcStub::method)
              .async(executor) // <-- optional
              .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all in on zookeeper
* Service discovery is now based on APIs and not services.  Theoretically means we could ship the same code either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service.  WIP!

Missing is documentation and testing, and some more breaking apart of code.
2024-02-22 14:01:23 +01:00
Viktor Lofgren
ee8e0497ae (refac) Move service discovery injection to a separate guice module 2024-02-20 15:41:04 +01:00
Viktor
f85ec28a16
Merge branch 'master' into service-discovery 2024-02-20 11:44:12 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00