Commit Graph

2784 Commits

Author SHA1 Message Date
Viktor Lofgren
e9f71ee39b (search) Move linked/similar domains to a popover style menu on mobile 2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd (search) Reintroduce query rewriting for recipes, add rules for wikis and forums 2024-12-31 16:05:00 +01:00
Viktor Lofgren
5e2a8e9f27 (deploy) Add capability of adding tags to deploy script 2024-12-31 16:04:13 +01:00
Viktor
cc1a5bdf90
Merge pull request #138 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2024-12-31 14:41:02 +01:00
Viktor
7f7b1ffaba
Update ROADMAP.md 2024-12-31 14:40:34 +01:00
Viktor Lofgren
0ea8092350 (search) Add link promoting the redesign beta 2024-12-30 15:47:13 +01:00
Viktor Lofgren
483d29497e (deploy) Add hashbang to deploy script 2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe (crawler) Add a new system property crawler.maxFetchSize
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
Viktor Lofgren
0d59202aca (crawler) Do not remove W/-prefix on weak e-tags
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c (live-crawler) Improve live crawler short-circuit logic
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch!  This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0 (feed-fetcher) Make feed fetcher requests conditional
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests.  On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.

A new table was added to the FeedDb to hold one etag per domain.

If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.

This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
Viktor Lofgren
927bc0b63c (live-crawler) Add Accept-Encoding: gzip to outbound requests
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.

The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
Viktor Lofgren
d968801dc1 (converter) Drop feed data from SlopDomainRecord
Also remove feed extraction from converter.  This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360 (crawler) Correct feed URLs in domain state db
Discovered feed URLs were given a double slash after their domain name in the DB.  This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004 (crawler) Improved feed discovery, new domain state db per crawlset
Feed discover is improved with by probing a few likely endpoints when no feed link tag is provided.  To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.

Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1 (crawler) Correct content type probing to only run on URLs that are suspected to be binary 2024-12-26 14:13:17 +01:00
Viktor
69ad6287b1
Update ROADMAP.md 2024-12-25 21:16:38 +00:00
Viktor Lofgren
81cdd6385d Add rendering tests for most major views
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f Correct dark mode for infobox in site focused search 2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea Fix tests 2024-12-25 15:05:41 +01:00
Viktor Lofgren
41a59dcf45 (feed) Sanitize illegal HTML entities out of the feed XML before parsing 2024-12-25 14:53:28 +01:00
Viktor Lofgren
df4bc1d7e9 Add update time to front page subscriptions 2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75 Merge branch 'master' into serp-redesign 2024-12-25 14:22:42 +01:00
Viktor Lofgren
94d4d2edb7 (live-crawler) Add refresh date to feeds API
For now this is just the ctime for the feeds db.  We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
7ae19a92ba (deploy) Improve deployment script to allow specification of partitions 2024-12-24 11:16:15 +01:00
Viktor Lofgren
56d14e56d7 (live-crawler) Improve LiveCrawlActor resilience to FeedService outages 2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f (live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler 2024-12-23 23:31:03 +01:00
Viktor Lofgren
b66879ccb1 (feed) Add support for date discovery through atom:issued and atom:created
This is specifically to help parse monadnock.net's Atom feed.
2024-12-23 20:05:58 +01:00
Viktor Lofgren
f1b7157ca2 (deploy) Add basic linting ability to deployment script. 2024-12-23 16:21:29 +01:00
Viktor Lofgren
7622335e84 (deploy) Correct deploy script, set correct name for assistant 2024-12-23 15:59:02 +01:00
Viktor Lofgren
0da2047eae (live-capture) Correctly update processed count, disable poll rate adjustment based on freshness. 2024-12-23 15:56:27 +01:00
Viktor Lofgren
5ee4321110 (ci) Correct deploy script 2024-12-22 20:08:37 +01:00
Viktor Lofgren
9459b9933b (ci) Correct deploy script 2024-12-22 19:40:32 +01:00
Viktor Lofgren
87fb564f89 (ci) Add script for automatic deployment based on git tags 2024-12-22 19:24:54 +01:00
Viktor Lofgren
5ca8523220 (math) Reduce log error spam from null unit conversions 2024-12-21 18:51:45 +01:00
Viktor Lofgren
1118657ffd (system) Supply local IP to service discovery if multiFace is enabled 2024-12-19 22:20:19 +01:00
Viktor Lofgren
b1f970152d (system) To support configurations with multiple docker networks, bind to the "most local" interface.
Make the behavior optional.
2024-12-19 20:26:31 +01:00
Viktor Lofgren
e1783891ab (system) To support configurations with multiple docker networks, bind to the "most local" interface. 2024-12-19 20:18:57 +01:00
Viktor Lofgren
64d32471dd (deploy) Deploy executor test 2024-12-19 17:45:47 +01:00
Viktor Lofgren
232cc465d9 (deploy) Deploy executor test 2024-12-19 17:35:38 +01:00
Viktor Lofgren
8c963bd4ba (feeds) Remove Content-Encoding: gzip from feed fetcher
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75 (feeds) Add per-domain throttling for feed fetcher. 2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639 (feeds) Make feed XML parsing more lenient
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
Viktor Lofgren
b66fb9caf6 (feeds) Improve error handling in the feed fetcher. 2024-12-18 17:02:13 +01:00
Viktor Lofgren
6d18e6d840 (search) Add clustering to subscriptions view 2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209 (search) Exclude generated style.css from git 2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef (search) Add site subscription feature that puts RSS updates on the front page 2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6 Refactor documentBody method and ContentType charset handling
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00