Viktor Lofgren
8862100f7e
(crawler) Improve logging and error handling
2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de
(crawler) Smarter parquet->slop crawl data migration
2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d
Fix refactoring gore
2025-01-21 15:08:04 +01:00
Viktor Lofgren
569520c9b6
(index) Add manual adjustments for rankings based on domain
2025-01-21 15:07:43 +01:00
Viktor Lofgren
088310e998
(converter) Improve simple processing performance
The recent slop migration changes introduced a performance regression in the simple conversion track. This change fixes that regression.
2025-01-21 14:13:33 +01:00
Viktor
270cab874b
Merge pull request #134 from MarginaliaSearch/slop-crawl-data-spike
Store crawl data in slop instead of parquet
2025-01-21 13:34:22 +01:00
Viktor Lofgren
4c74e280d3
(crawler) Fix urlencoding in sitemap fetcher
2025-01-21 13:33:35 +01:00
Viktor Lofgren
5b347e17ac
(crawler) Automatically migrate to slop from parquet when crawling
2025-01-21 13:33:14 +01:00
Viktor Lofgren
55d6ab933f
Merge branch 'master' into slop-crawl-data-spike
2025-01-21 13:32:58 +01:00
Viktor Lofgren
43b74e9706
(crawler) Fix exception handler and resource leak in WarcRecorder
2025-01-20 23:45:28 +01:00
Viktor Lofgren
579a115243
(crawler) Reduce log spam from error handling in new sitemap fetcher
2025-01-20 23:17:13 +01:00
Viktor
2c67f50a43
Merge pull request #150 from MarginaliaSearch/httpclient-in-crawler
Reduce the use of 3rd party code in the crawler
2025-01-20 19:35:30 +01:00
Viktor Lofgren
78a958e2b0
(crawler) Fix broken test that started failing after the search engine moved to a new domain
2025-01-20 18:52:14 +01:00
Viktor Lofgren
4e939389b2
(crawler) New Jsoup based sitemap parser
2025-01-20 14:37:44 +01:00
Viktor Lofgren
e67a9bdb91
(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.
2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237
(crawler) Fast detection and bail-out for crawler traps
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722
(crawler) Fast detection and bail-out for crawler traps
Nepenthes has been doing the rounds on social media, so this adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. The out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
bc818056e6
(run) Fix templates for mariadb
Apparently the docker image contract changed at some point, and now we should spawn mariadbd and not mysqld; mariadb-admin and not mysqladmin.
2025-01-16 15:27:02 +01:00
Viktor Lofgren
de2feac238
(chore) Upgrade jib from 3.4.3 to 3.4.4
2025-01-16 15:10:45 +01:00
Viktor Lofgren
1e770205a5
(search) Dyslexia fix
2025-01-12 20:40:14 +01:00
Viktor
e44ecd6d69
Merge pull request #149 from MarginaliaSearch/vlofgren-patch-1
Update ROADMAP.md
2025-01-12 20:38:36 +01:00
Viktor
5b93a0e633
Update ROADMAP.md
2025-01-12 20:38:11 +01:00
Viktor
08fb0e5efe
Update ROADMAP.md
2025-01-12 20:37:43 +01:00
Viktor
bcf67782ea
Update ROADMAP.md
2025-01-12 20:37:09 +01:00
Viktor Lofgren
ef3f175ede
(search) Don't clobber the search query URL with default values
2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd
Revert experimental changes
2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:44:44 +01:00
Viktor Lofgren
20b24133fb
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:34:48 +01:00
Viktor Lofgren
f2567677e8
(index-client) Clean up index client code
Improve error handling. This should be a relatively rare case, but we don't want one bad index partition to blow up the entire query.
2025-01-10 15:17:07 +01:00
Viktor Lofgren
bc2c2061f2
(index-client) Clean up index client code
This should have the rpc stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
2025-01-10 15:14:42 +01:00
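The idea in the commit above (bc2c2061f2) can be illustrated with a minimal sketch: submit one task per index partition so the responses are received concurrently, then merge the results. It also skips a failing partition instead of failing the whole query, in the spirit of the follow-up commit f2567677e8. Every name here (`ParallelPartitionQueryDemo`, `queryAll`, `demoQueries`) is hypothetical and the `Callable`s only stand in for rpc streams; this is not the actual index client code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: each Callable stands in for receiving one partition's rpc stream.
final class ParallelPartitionQueryDemo {

    /** Runs every partition query on its own thread and merges whatever succeeds. */
    static List<String> queryAll(List<Callable<List<String>>> partitionQueries) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitionQueries.size()));
        try {
            // Submit everything up front so the partitions are read concurrently,
            // instead of blocking on them one at a time.
            List<Future<List<String>>> futures = new ArrayList<>();
            for (Callable<List<String>> query : partitionQueries) {
                futures.add(pool.submit(query));
            }

            List<String> merged = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                try {
                    merged.addAll(future.get());
                } catch (ExecutionException e) {
                    // One bad partition should not blow up the entire query: log and move on.
                    System.err.println("Partition failed, skipping: " + e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("Interrupted while waiting for partitions", e);
                }
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    /** Three fake partitions, the second of which always fails. */
    static List<Callable<List<String>>> demoQueries() {
        List<Callable<List<String>>> queries = new ArrayList<>();
        queries.add(() -> List.of("a", "b"));
        queries.add(() -> { throw new IllegalStateException("partition 2 unavailable"); });
        queries.add(() -> List.of("c"));
        return queries;
    }

    public static void main(String[] args) {
        System.out.println(queryAll(demoQueries())); // prints [a, b, c]
    }
}
```

Results are merged in submission order, so the output stays deterministic even though the partitions are read concurrently.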
Viktor Lofgren
1c7f5a31a5
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:17:29 +01:00
Viktor Lofgren
59a8ea60f7
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:15:22 +01:00
Viktor Lofgren
aa9b1244ea
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:56:04 +01:00
Viktor Lofgren
2d17233366
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:53:56 +01:00
Viktor Lofgren
b245cc9f38
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:46:19 +01:00
Viktor Lofgren
6614d05bdf
(db) Make db pool size configurable
2025-01-09 20:20:51 +01:00
Viktor Lofgren
55aeb03c4a
(feeds) Replace rssreader based parsing with a custom jsoup based rss parser
This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.
2025-01-09 18:29:55 +01:00
Viktor Lofgren
faa589962f
(live-capture) Browserless now requires a token
2025-01-09 14:51:11 +01:00
Viktor Lofgren
c7edd6b39f
(live-capture) Browserless now requires a token
2025-01-09 14:46:05 +01:00
Viktor Lofgren
79da622e3b
(search) Update front page with new banner about move
2025-01-08 21:38:19 +01:00
Viktor Lofgren
3da8337ba6
(feeds) Add system property for exporting fetched feeds to a slop table for debugging
2025-01-08 20:49:16 +01:00
Viktor Lofgren
a32d230f0a
(special) Trigger deployment
2025-01-08 20:07:54 +01:00
Viktor Lofgren
3772bfd387
(query) Fix handling of optional ranking parameters
2025-01-08 17:11:22 +01:00
Viktor Lofgren
02a7900d1a
(search) Correct search-in-title toggle in search UI
2025-01-08 16:51:10 +01:00
Viktor Lofgren
a1fb92468f
(refac) Remove the ResultRankingParameters and QueryLimits classes and use protobuf classes directly instead
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
2025-01-08 16:15:57 +01:00
Viktor Lofgren
b7f0a2a98e
(search-service) Fix metrics for errors and request times
This was previously in place, but broke during the jooby migration.
2025-01-08 14:10:43 +01:00
Viktor Lofgren
5fb76b2e79
(search-service) Fix metrics for errors and request times
This was previously in place, but broke during the jooby migration.
2025-01-08 14:06:03 +01:00
Viktor Lofgren
ad8c97f342
(search-service) Begin replacement of the crawl queue mechanism with node_affinity flagging
Previously a special db table was used to hold domains slated for crawling, but this is now deprecated. Instead, each domain has a node_affinity flag that decides its indexing state: a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to an index partition.
The change set also adds a test case validating the modified behavior.
2025-01-08 13:25:56 +01:00
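The node_affinity convention described in the commit above (ad8c97f342) could be read as a small mapping, sketched below. The class, enum, and method names are hypothetical and do not come from the codebase. Rejecting values below -1 is an assumption of this sketch, since the commit message does not say how they are treated.

```java
// Hypothetical helper illustrating the node_affinity values from the commit message.
final class NodeAffinityDemo {

    enum IndexingState {
        NOT_CRAWLED,               // node_affinity == -1
        QUEUED_FOR_NEXT_PARTITION, // node_affinity == 0
        ASSIGNED_TO_PARTITION      // node_affinity > 0
    }

    static IndexingState stateOf(int nodeAffinity) {
        if (nodeAffinity > 0) {
            return IndexingState.ASSIGNED_TO_PARTITION;
        }
        if (nodeAffinity == 0) {
            return IndexingState.QUEUED_FOR_NEXT_PARTITION;
        }
        if (nodeAffinity == -1) {
            return IndexingState.NOT_CRAWLED;
        }
        // Assumption: other negative values are not described, so this sketch rejects them.
        throw new IllegalArgumentException("Unexpected node_affinity: " + nodeAffinity);
    }

    public static void main(String[] args) {
        for (int affinity : new int[] { -1, 0, 3 }) {
            System.out.println(affinity + " -> " + stateOf(affinity));
        }
    }
}
```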
Viktor Lofgren
dc1b6373eb
(search-service) Clean up readme
2025-01-08 13:04:39 +01:00
Viktor Lofgren
983d6d067c
(search-service) Add indexing indicator to sibling domains listing
2025-01-08 12:58:34 +01:00