Viktor Lofgren
89d8af640d
(live-crawl) Rename the live crawler code module to be more consistent with the other processes
2024-11-20 15:55:15 +01:00
Viktor Lofgren
6e4252cf4c
(live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing.
...
Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.
2024-11-20 15:36:25 +01:00
Viktor Lofgren
79ce4de2ab
(model) Remove deprecated fields from CrawledDocument and CrawledDomain
2024-11-20 15:27:05 +01:00
Viktor Lofgren
d6575dfee4
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the crawled data makes it all the way to being indexable.
2024-11-19 21:00:18 +01:00
Viktor Lofgren
a91ab4c203
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the crawled data makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
6a3079a167
(search) Fix missing getter for proto
2024-11-18 21:05:22 +01:00
Viktor Lofgren
c728a1e2f2
(rss) Add endpoint for extracting URLs changed within a timespan.
2024-11-18 14:59:32 +01:00
Viktor Lofgren
d874d76a09
(rss) Add an endpoint that can be used for identifying when RSS data has changed
2024-11-18 14:22:17 +01:00
Viktor Lofgren
70bc8831f5
(test) Fix excludeTags
2024-11-17 20:07:49 +01:00
Viktor Lofgren
41c11be075
(status) Clean up the status page a bit
2024-11-17 20:00:44 +01:00
Viktor Lofgren
163ce19846
(test) Tag status service endpoint tests as flaky
...
These tests have outside dependencies that inherently make them unreliable and unsuitable for CI.
2024-11-17 19:48:01 +01:00
Viktor Lofgren
9eb16cb667
(test) Remove tests from fast suite
...
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
2024-11-17 19:45:59 +01:00
Viktor Lofgren
af40fa327b
(status-service) Correct measurement pruning to use correct sqlite datetimes, so as not to delete the database
2024-11-17 18:35:34 +01:00
Viktor Lofgren
cf6d28e71e
(status-service) Enable auto-commit
2024-11-17 18:25:15 +01:00
Viktor Lofgren
3791ea1e18
(service) Add a new application service for external liveness monitoring
...
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor
34258b92d1
Merge pull request #124 from MarginaliaSearch/jdk-23+delombok
...
Friendship with lombok over, now JDK 23 is my best friend
2024-11-16 14:00:49 +00:00
Viktor Lofgren
e5db3f11e1
(chore) Clean up some of the uglier delomboking artifacts
2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15
(chore) Remove lombok
...
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23
(chore) Remove use of deprecated STR.-style string templates
2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f
(feature-extraction) Add new DocumentHeaders class encapsulating HTML headers.
...
Also adds a few new HTML features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor
5cc71ae586
Merge pull request #123 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2024-11-10 18:57:49 +01:00
Viktor
33fcfe4b63
Update ROADMAP.md
2024-11-10 18:57:15 +01:00
Viktor
a31a3b53c4
Merge pull request #122 from MarginaliaSearch/fetch-rss-feeds
...
Automatic RSS feed polling
2024-11-10 18:35:28 +01:00
Viktor Lofgren
a456ec9599
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c
(feed) Update API to allow specifying clean vs refresh update
...
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
Viktor Lofgren
6f858cd627
(feed) Decrease update interval to 24 hours
2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd
(feed) Wipe the feeds db and start over from system URLs periodically.
2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7
(search) Correctly show the feeds view when items are present
...
... otherwise show samples. This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031
(feeds) Reduce log spam
2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da
(feeds) Refresh the feed db using the previous db, when it is available.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
b0ac3c586f
(feeds) Correct parallelism using SimpleBlockingThreadPool
2024-11-09 17:56:43 +01:00
Viktor Lofgren
139fa85b18
(feeds) Add working heartbeat tracking progress
2024-11-09 17:56:43 +01:00
Viktor Lofgren
bfeb9a4538
(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service
2024-11-09 17:56:43 +01:00
Viktor
3d6c79ae5f
Merge pull request #121 from MarginaliaSearch/headless-setup
...
Headless deterministic setup
2024-11-08 13:50:54 +01:00
Viktor Lofgren
c9e9f73ea9
(setup) Break out installation action into non-interactive script
2024-11-08 13:38:40 +01:00
Viktor Lofgren
80e482b155
(setup) Add progress bar to downloads for better feedback
2024-11-08 13:38:40 +01:00
Viktor Lofgren
9351593495
(setup) Use huggingface for versioned hosting of language models
2024-11-08 13:38:40 +01:00
Viktor Lofgren
d74436f546
(setup) Use checksums for rdrpostagger and opennlp files
...
Also use versioned URLs for rdrpostagger
2024-11-08 13:38:40 +01:00
Viktor Lofgren
76e9053dd0
(setup) Move some file-downloads from setup script to the first boot of the control node of the system
...
We can only do this for files that are not required for unit tests.
As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions. The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e
(crawler) Use a better hashInt implementation in CrawlDataReference
...
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8
(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris
2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70
(crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
...
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556
(crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains
2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b
(link-parser) Make mailing list blocking optional
2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2
(converter) Increase the number of links the converter will pick up per document
2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107
(index) Short-circuit rankResults when there are no results
2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c
(query-parser) Fix regression where advice terms weren't parsed properly
2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:57:27 +02:00