Commit Graph

320 Commits

Author SHA1 Message Date
Viktor Lofgren
3791ea1e18 (service) Add a new application service for external liveness monitoring
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor Lofgren
e5db3f11e1 (chore) Clean up some of the uglier delomboking artifacts 2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f (feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
Also adds a few new html features for CDNs and  S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor Lofgren
bfeb9a4538 (feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service 2024-11-09 17:56:43 +01:00
Viktor Lofgren
76e9053dd0 (setup) Move some file-downloads from setup script to the first boot of the control node of the system
We can only do this for files that are not required for unit tests.

As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions.  The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
45d3e6aa71 (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
23cce0c78a Add a new function 'Live Capture' for on-demand screenshot capture
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
1bd29a586c (service-discovery) Add common base interface to all Grpc services
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
2ad93ad41a (*) Clean up 2024-08-14 11:43:45 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
d36055a2d0 (keyword-extractor) Retire TfIdfHigh WordFlag
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
Viktor Lofgren
ae87e41cec (index) Fix rare BitReader.takeWhileZero bug
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer.  This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.

The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.

Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
dfd19b5eb9 (index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor Lofgren
12590d3449 (index-reverse) Added compression to priority index
The priority index documents file can be trivially compressed to a large degree.

Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
10fe5a78cb (log) Prevent tests from trying to log to file
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
3faa5bf521 (search-query) Tidy up QueryGRPCService and IndexClient 2024-06-26 14:03:30 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab (registry) Fix broken test 2024-05-23 14:15:01 +02:00
Viktor Lofgren
59ec70eb73 (*) Clean up code related to crawl parquet inspection 2024-05-22 12:55:08 +02:00
Viktor Lofgren
17dc00d05f (control) Partial implementation of inspection utility for crawl data
Uses duckdb and range queries to read the parquet files directly from the index partitions.

UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
b7a95be731 (search) Create a small mocking framework for running the search service in isolation. 2024-05-04 11:40:59 +02:00
Viktor Lofgren
616649f040 (logs) Fix logdir location 2024-05-04 11:40:59 +02:00
Viktor Lofgren
2ad0bfda1e (*) Fix boot orchestration for the services
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.

A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database.  Move the first boot check into the MainClass instead of the Service constructor.

The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
08f8b6e022 (system) Log loaded properties to the console 2024-04-30 18:29:11 +02:00
Viktor Lofgren
800ed6b1e9 (zk) Terminately immediately if zookeeper isn't found
This makes debugging easier
2024-04-30 18:28:49 +02:00
Viktor Lofgren
908535a3a0 (single-service) Ensure single-service spawner can specify the node 2024-04-30 18:27:46 +02:00
Viktor Lofgren
7fe2ab6f39 (file-storage) Ensure file storage root location can be overridden when running outside of docker 2024-04-30 18:26:15 +02:00
Viktor Lofgren
4772e0b59d (service) Deprecate /public prefix on HTTP
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.

Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.

The change removes the path prefix, and updates the docker templates to reflect the change.  This will require a migration for existing systems.
2024-04-30 14:46:18 +02:00
Viktor Lofgren
89889ecbbd (single-service) Skip starting Prometheus if it's not explicitly enabled 2024-04-25 17:54:07 +02:00
Viktor Lofgren
c8ee354d0b (log) Make log dir configurable via environment variable 2024-04-25 15:09:18 +02:00
Viktor Lofgren
3952ef6ca5 (service) Let singleservice configure ports and bind addresses 2024-04-25 13:49:57 +02:00
Viktor Lofgren
32fe864a33 (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe (index) Clean up data model
The change set cleans up the data model for the term-level data.  This used to contain a bunch of fields with document-level metadata.  This data-duplication means a larger memory footprint and worse memory locality.

The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking.  This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6a7a7009c7 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd (qs) Retire NGramBloomFilter, integrate new segmentation model instead 2024-04-24 14:44:17 +02:00
Viktor Lofgren
8c559c8121 (conf) Add additional logic for discovering system root 2024-04-16 12:37:18 +02:00
Viktor Lofgren
fe8d583fdd (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
57e6a12d08 (registry) Correct registerMonitor() behavior
The previous behavior would listen to too many changes, and based on zookeeper and not curator assumptions about behavior, add an additional monitor on each invocation of each monitor, (which always trigger on service state changes), leading to each monitor re-registering and effectively doubling monitors in numbers whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other.

This re-registering behavior is no longer done.
2024-03-06 12:22:15 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00