MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	6efc0f21fe	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6a7a7009c7	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-04-24 14:44:17 +02:00
Viktor Lofgren	3c75057dcd	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-04-24 14:44:17 +02:00
Viktor Lofgren	8c559c8121	(conf) Add additional logic for discovering system root	2024-04-16 12:37:18 +02:00
Viktor Lofgren	fe8d583fdd	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:27:13 +01:00
Viktor Lofgren	57e6a12d08	(registry) Correct registerMonitor() behavior The previous behavior would listen to too many changes, and based on zookeeper and not curator assumptions about behavior, add an additional monitor on each invocation of each monitor, (which always trigger on service state changes), leading to each monitor re-registering and effectively doubling monitors in numbers whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other. This re-registering behavior is no longer done.	2024-03-06 12:22:15 +01:00
Viktor Lofgren	46423612e3	(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.	2024-03-03 10:49:23 +01:00
Viktor Lofgren	144f967dbf	(misc) Tweak pool sizes	2024-02-28 16:23:02 +01:00
Viktor Lofgren	b31c9bb726	(docs) Update process docs	2024-02-28 15:21:33 +01:00
Viktor Lofgren	c0820b5e5c	(docs) Update service docs	2024-02-28 15:19:31 +01:00
Viktor Lofgren	65b8a1d5d9	(grpc) Reduce error spam	2024-02-28 14:44:48 +01:00
Viktor Lofgren	a0648844fb	(grpc) Reduce error spam	2024-02-28 14:35:29 +01:00
Viktor Lofgren	c4a27003c6	(docs) Fix formatting	2024-02-28 14:22:57 +01:00
Viktor Lofgren	86bbc1043e	(service) Clean up thread pool creation	2024-02-28 14:06:32 +01:00
Viktor Lofgren	a8ec59eb75	(conf) Add migration warning when ZOOKEEPER_HOSTS is not set.	2024-02-28 12:09:38 +01:00
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	3a65fe8917	Add offload executor to GrpcChannelPoolFactory	2024-02-27 22:08:39 +01:00
Viktor Lofgren	e696fd9e92	(docs) Begin un-fucking the docs after refactoring	2024-02-27 21:22:21 +01:00
Viktor Lofgren	eaf836dc66	(service/grpc) Reduce thread count Netty and GRPC by default spawn an incredible number of threads on high-core CPUs, which amounts to a fair bit of RAM usage. Add custom executors that throttle this behavior.	2024-02-27 21:22:21 +01:00
Viktor Lofgren	dbf64b0987	(logs) Add the option for json logging	2024-02-27 21:22:20 +01:00
Viktor Lofgren	ff0ef1eebc	(cleanup) Minor cleanups	2024-02-24 15:33:56 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	2201b1a506	(refac) Clean up code issues	2024-02-23 11:39:19 +01:00
Viktor Lofgren	5cdb07023b	(refac) Clean up unused imports	2024-02-23 11:27:20 +01:00
Viktor Lofgren	6357d30ea0	Clean up docs	2024-02-22 19:53:20 +01:00
Viktor Lofgren	8d4ef982d0	Clean up docs	2024-02-22 19:37:59 +01:00
Viktor Lofgren	4740156cfa	Clean up docs	2024-02-22 18:18:58 +01:00
Viktor Lofgren	085137ca63	* Extract the index functionality	2024-02-22 17:31:25 +01:00
Viktor Lofgren	66c1281301	(zk-registry) epic yak shaving WIP Cleaning out a lot of old junk from the code, and one thing led to another... * Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds. * The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs. * Project is migrated to GraalVM * gRPC clients are re-written with a neat fluent/functional style. e.g. ```channelPool.call(grpcStub::method) .async(executor) /* <-- optional */ .run(argument); ``` This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. * For now the project is all in on zookeeper * Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh. * To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP! Missing is documentation and testing, and some more breaking apart of code.	2024-02-22 14:01:23 +01:00
Viktor Lofgren	73947d9eca	(zk-registry) Filter out phantom addresses in the registry The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve. This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.	2024-02-20 18:09:11 +01:00
Viktor Lofgren	a69c0b2718	(grpc-client) Fix warmup crash The warmup would sometimes crash during a cold start-up, because it could not get an API. Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.	2024-02-20 18:03:57 +01:00
Viktor Lofgren	6c764bceeb	(doc) Update documentation for `service-discovery`	2024-02-20 16:09:49 +01:00
Viktor Lofgren	273aeb7bae	(doc) Update documentation with new gRPC service setup	2024-02-20 16:06:05 +01:00
Viktor Lofgren	d185858266	(minor) Add missing query parameter to ServiceEndpoint.toURL	2024-02-20 15:49:43 +01:00
Viktor Lofgren	453bd6064b	(minor) Add warm-up to GrpcMultiNodeChannelPool to speed up the initial messages Without doing this, connections would be created lazily, which is probably never desirable.	2024-02-20 15:45:16 +01:00
Viktor Lofgren	ee8e0497ae	(refac) Move service discovery injection to a separate guice module	2024-02-20 15:41:04 +01:00
Viktor Lofgren	30bdb4b4e9	(config) Clean up service configuration for IP addresses Adds new ways to configure the bind and external IP addresses for a service. Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry. The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.	2024-02-20 14:22:48 +01:00
Viktor Lofgren	2ee492fb74	(gRPC) Bind gRPC services to an interface By default, gRPC magically decides on an interface. The change will explicitly tell it what to use.	2024-02-20 14:22:47 +01:00
Viktor Lofgren	36a5c8b44c	(cleanup) Clean up code	2024-02-20 14:22:47 +01:00
Viktor Lofgren	07b625c58d	(query-client) Add support for fault-tolerant requests to single node services Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.	2024-02-20 14:16:05 +01:00
Viktor Lofgren	746a865106	(client) Fix handling of channel refreshes The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update. This led to storms of closing and opening channels whenever an update was received. The new code is correctly aware that we may talk to multiple nodes.	2024-02-20 14:14:09 +01:00
Viktor	f85ec28a16	Merge branch 'master' into service-discovery	2024-02-20 11:44:12 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor	d05c916491	Merge pull request #80 from MarginaliaSearch/ranking-algorithms Clean up domain ranking code	2024-02-18 09:52:34 +01:00
Viktor Lofgren	e61e7f44b9	(blacklist) Delay startup of blacklist To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	f9b6ac03c6	(api) Clean up incorrect error handling in GrpcChannelPool	2024-02-18 08:45:35 +01:00
Viktor Lofgren	296ccc5f8e	(blacklist) Clean up blacklist impl The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod. This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.	2024-02-18 08:16:48 +01:00
Viktor Lofgren	92717a4832	(client) Refactor GrpcStubPool to handle error states Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub. The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.	2024-02-17 14:42:26 +01:00
Viktor Lofgren	9ec262ae00	(domain-ranking) Integrate new ranking logic The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.	2024-02-16 20:22:01 +01:00
Viktor Lofgren	b15f47d80e	(db) Retire the EC_DOMAIN_LINK table Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.	2024-02-08 15:52:30 +01:00

277 Commits