MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	a91ab4c203	(live-crawler) Crude first-try process for live crawling #WIP Some refactoring is still needed, but a dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.	2024-11-19 19:35:01 +01:00
Viktor Lofgren	9f47ce8d15	(chore) Remove lombok There are likely some instances of delombok gore with this commit.	2024-11-11 21:14:38 +01:00
Viktor Lofgren	a5b4951f23	(chore) Remove use of deprecated STR.-style string templates	2024-11-11 18:02:28 +01:00
Viktor Lofgren	a456ec9599	(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished	2024-11-10 18:30:28 +01:00
Viktor Lofgren	a2bc9a98c0	(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished	2024-11-10 17:45:20 +01:00
Viktor Lofgren	e24a98390c	(feed) Update API to allow specifying clean vs refresh update Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.	2024-11-09 18:43:47 +01:00
Viktor Lofgren	6f858cd627	(feed) Decrease update interval to 24 hours	2024-11-09 18:17:51 +01:00
Viktor Lofgren	bfeb9a4538	(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service	2024-11-09 17:56:43 +01:00
Viktor Lofgren	db5faeceee	(download-sample) Break apart actor for better error recovery Change also adds logged events to give more feedback that something is happening.	2024-10-04 13:39:43 +02:00
Viktor Lofgren	45d3e6aa71	(download-sample) Break apart actor for better error recovery Change also adds logged events to give more feedback that something is happening.	2024-10-04 13:19:09 +02:00
Viktor Lofgren	d84a2c183f	(*) Remove the crawl spec abstraction The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.	2024-10-03 13:41:17 +02:00
Viktor Lofgren	938431e514	(scrape-feeds-actor) Add deduplication of insertion data To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.	2024-09-28 14:41:14 +02:00
Viktor Lofgren	b2de3c70fa	(scrape-feeds-actor) Add explicit commit in case it's disabled	2024-09-28 14:36:57 +02:00
Viktor Lofgren	596a7fb4ea	(actor) Disable the feed scraper on all nodes but the first	2024-09-28 12:36:16 +02:00
Viktor Lofgren	c3f726a01f	(actor) Add a feed scraping actor Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job. The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.	2024-09-28 12:33:29 +02:00
Viktor Lofgren	1bd29a586c	(service-discovery) Add common base interface to all Grpc services To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.	2024-09-27 13:46:34 +02:00
Viktor Lofgren	b09e2dbeb7	(build) Fix dependency churn from testcontainers Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.	2024-08-25 10:35:48 +02:00
Viktor Lofgren	2ef66ce0ca	(actor) Reset NEW flag earlier when auto-deletion is disabled Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.	2024-07-31 10:31:03 +02:00
Viktor Lofgren	80900107f7	(restructure) Clean up repo by moving stray features into converter-process and crawler-process	2024-07-30 10:14:00 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor	8ed5b51a32	Merge branch 'master' into term-positions	2024-07-15 07:05:31 +02:00
Viktor Lofgren	12a2ab93db	(actor) Improve error messages for convert-and-load Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.	2024-07-08 19:19:30 +02:00
Viktor Lofgren	d86926be5f	(crawl) Add new functionality for re-crawling a single domain	2024-07-05 15:31:55 +02:00
Viktor Lofgren	738e0e5fed	(process) Add option for automatic profiling The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory. The change also adds a JVM parameter that makes it shut up about native access.	2024-06-27 13:58:36 +02:00
Viktor Lofgren	89aae93e60	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
Viktor Lofgren	59ec70eb73	(*) Clean up code related to crawl parquet inspection	2024-05-22 12:55:08 +02:00
Viktor Lofgren	17dc00d05f	(control) Partial implementation of inspection utility for crawl data Uses duckdb and range queries to read the parquet files directly from the index partitions. UX is a bit rough but is in working order.	2024-05-20 18:02:46 +02:00
Viktor Lofgren	908535a3a0	(single-service) Ensure single-service spawner can specify the node	2024-04-30 18:27:46 +02:00
Viktor Lofgren	c9ee0c909e	(download-sample) Set +x permissions on directories created during this job	2024-04-30 18:25:07 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	b7d9a7ae89	(ngrams) Remove the vestigial logic for capturing permutations of n-grams The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.	2024-04-11 18:12:01 +02:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	0bd3365c24	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-03-19 14:28:42 +01:00
Viktor Lofgren	d8f4e7d72b	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-03-19 10:42:09 +01:00
Viktor Lofgren	afc047cd27	(control) GUI for exporting segmentation data from a wikipedia zim	2024-03-18 13:45:23 +01:00
Viktor Lofgren	46423612e3	(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.	2024-03-03 10:49:23 +01:00
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	dbf64b0987	(logs) Add the option for json logging	2024-02-27 21:22:20 +01:00
Viktor Lofgren	09447f2ad2	(process service) Inherit parent's assertion status	2024-02-24 18:32:37 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	56d35aa596	(refac) Move execution API out of executor service	2024-02-23 13:26:11 +01:00

41 Commits