Viktor Lofgren
44d6bc71b7
(assistant) Migrate to Jooby framework
2025-02-15 13:28:12 +01:00
Viktor Lofgren
9d302e2973
(assistant) Migrate to Jooby framework
2025-02-15 13:26:04 +01:00
Viktor Lofgren
f553701224
(assistant) Migrate to Jooby framework
2025-02-15 13:21:48 +01:00
Viktor Lofgren
f076d05595
(deps) Upgrade slf4j to latest
2025-02-15 12:50:16 +01:00
Viktor Lofgren
b513809710
(*) Stopgap fix for metrics server initialization errors bringing down services
2025-02-14 17:09:48 +01:00
Viktor Lofgren
7519b28e21
(search) Correct exception from misbehaving bots feeding invalid urls
2025-02-14 17:05:24 +01:00
Viktor Lofgren
3eac4dd57f
(search) Correct exception in error handler when page is missing
2025-02-14 17:00:21 +01:00
Viktor Lofgren
4c2810720a
(search) Add redirect handler for full URLs in the /site endpoint
2025-02-14 16:31:11 +01:00
Viktor Lofgren
8480ba8daa
(live-capture) Code cleanup
2025-02-04 14:05:36 +01:00
Viktor Lofgren
fbba392491
(live-capture) Send a UA-string from the browserless fetcher as well
...
The change also introduces a somewhat convoluted wiremock test to intercept and verify that these headers are in fact sent
2025-02-04 13:36:49 +01:00
Viktor Lofgren
530eb35949
(update-rss) Do not fail the feed fetcher control actor if it takes a long time to complete.
2025-02-03 11:35:32 +01:00
Viktor Lofgren
c2dd2175a2
(search) Add new query expansion rule contracting WORD NUM pairs into WORD-NUM and WORDNUM
2025-02-01 13:13:30 +01:00
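The contraction rule from c2dd2175a2 can be illustrated with a minimal sketch. The class name `WordNumContraction`, the method `contractions`, and the definitions of "word" (letters only) and "number" (digits only) are assumptions made for this example; they are not taken from the actual query parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of contracting WORD NUM pairs; not the project's code.
class WordNumContraction {

    /** For each adjacent WORD NUM pair, emit the WORD-NUM and WORDNUM variants. */
    static List<String> contractions(List<String> tokens) {
        List<String> variants = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            String word = tokens.get(i);
            String num = tokens.get(i + 1);
            if (isWord(word) && isNumber(num)) {
                variants.add(word + "-" + num); // WORD-NUM
                variants.add(word + num);       // WORDNUM
            }
        }
        return variants;
    }

    // Assumption: a "word" consists only of letters.
    static boolean isWord(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isLetter);
    }

    // Assumption: a "number" consists only of digits.
    static boolean isNumber(String s) {
        return !s.isEmpty() && s.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        // prints [windows-95, windows95]
        System.out.println(contractions(List.of("windows", "95")));
    }
}
```

In a real query expansion step these variants would presumably be added as alternatives alongside the original terms rather than replacing them.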
Viktor Lofgren
b8581b0f56
(crawler) Safe sanitization of headers during warc->slop conversion
...
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.
Fixing this by manually filtering these out. Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
2025-01-31 12:47:42 +01:00
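The kind of pre-filtering described in b8581b0f56 might look roughly like the sketch below. The commit does not say which headers were rejected, so this assumes the two cases that `java.net.http.HttpHeaders.of` documents as illegal: header names that are blank after trimming, and names that differ only by case. The class `HeaderSanitizer` and its `sanitize` method are hypothetical names for this example.

```java
import java.net.http.HttpHeaders;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of sanitizing headers before building HttpHeaders.
class HeaderSanitizer {

    /**
     * Drops blank header names and merges names that differ only by case,
     * so that HttpHeaders.of() does not throw IllegalArgumentException.
     */
    static HttpHeaders sanitize(Map<String, List<String>> raw) {
        // A case-insensitive map merges e.g. "Content-Type" and "content-type".
        Map<String, List<String>> clean = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

        raw.forEach((name, values) -> {
            if (name == null || name.isBlank() || values == null) {
                return; // skip entries HttpHeaders.of() would reject outright
            }
            List<String> merged = clean.computeIfAbsent(name.trim(), k -> new ArrayList<>());
            for (String value : values) {
                if (value != null) {
                    merged.add(value);
                }
            }
        });

        // The predicate accepts everything; the filtering has already happened above.
        return HttpHeaders.of(clean, (k, v) -> true);
    }
}
```

Filtering before the call, rather than through the predicate argument, matches the commit's observation that the predicate is applied too late to prevent the failure.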
Viktor Lofgren
2ea34767d8
(crawler) Use the response URL when resolving relative links
...
The crawler was incorrectly using the request URL as the base URL when resolving relative links. This caused problems when encountering redirects.
For example, if we fetch /log, which redirects to /log/, and find links to foo/ and bar/, these would resolve to /foo/ and /bar/ rather than /log/foo/ and /log/bar/.
2025-01-31 12:40:13 +01:00
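The base URL problem fixed in 2ea34767d8 can be reproduced with the standard `java.net.URI` resolver. This is only an illustration of the resolution behaviour: the class `RelativeLinkExample`, the helper `resolve`, and the host `www.example.com` are assumptions, and the crawler's own URL types are not shown.

```java
import java.net.URI;

// Hypothetical illustration of why the response URL must be the base URL.
class RelativeLinkExample {

    /** Resolve a link found in a document against the given base URL. */
    static URI resolve(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href);
    }

    public static void main(String[] args) {
        String requestUrl = "https://www.example.com/log";   // what was asked for
        String responseUrl = "https://www.example.com/log/"; // where the redirect led

        // Request URL as base: the last path segment "log" is replaced.
        // prints https://www.example.com/foo/
        System.out.println(resolve(requestUrl, "foo/"));

        // Response URL as base: the link resolves inside /log/.
        // prints https://www.example.com/log/foo/
        System.out.println(resolve(responseUrl, "foo/"));
    }
}
```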
Viktor Lofgren
e9af838231
(actor) Fix migration actor final steps
2025-01-30 11:48:21 +01:00
Viktor Lofgren
ae0cad47c4
(actor) Utility method for getting a json prototype for actor states
...
If we can hook this into the control gui somehow, it'll make for a nice QOL upgrade when manually interacting with the actors.
2025-01-29 15:20:25 +01:00
Viktor Lofgren
5fbc8ef998
(misc) Tidying
2025-01-29 15:17:04 +01:00
Viktor Lofgren
32c6dd9e6a
(actor) Delete old data in the migration actor
2025-01-29 14:51:46 +01:00
Viktor Lofgren
6ece6a6cfb
(actor) Improve resilience for the migration actor
2025-01-29 14:43:09 +01:00
Viktor Lofgren
39cd1c18f8
Automatically run npm install tailwindcss@3 via setup.sh, as the new default version of the package is incompatible with the project
2025-01-29 12:21:08 +01:00
Viktor
eb65daaa88
Merge pull request #151 from Lionstiger/master
...
fix small grammar error in footerLegal.jte
2025-01-28 21:49:50 +01:00
Viktor
0bebdb6e33
Merge branch 'master' into master
2025-01-28 21:49:36 +01:00
Viktor Lofgren
1e50e392c6
(actor) Improve logging and error handling for data migration actor
2025-01-28 15:34:36 +01:00
Viktor Lofgren
fb673de370
(crawler) Change the header 'User-agent' to 'User-Agent'
2025-01-28 15:34:16 +01:00
Viktor Lofgren
eee73ab16c
(crawler) Be more lenient when performing a domain probe
2025-01-28 15:24:30 +01:00
Viktor Lofgren
5354e034bf
(search) Minor grammar fix
2025-01-27 18:36:31 +01:00
Magnus Wulf
72384ad6ca
fix small grammar error
2025-01-27 15:04:57 +01:00
Viktor Lofgren
a2b076f9be
(converter) Add progress tracking for big domains in converter
2025-01-26 18:03:59 +01:00
Viktor Lofgren
c8b0a32c0f
(crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams
2025-01-26 15:40:17 +01:00
Viktor Lofgren
f0d74aa3bb
(converter) Fix close() ordering to prevent converter crash
2025-01-26 14:47:36 +01:00
Viktor Lofgren
74a1f100f4
(converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream
2025-01-26 14:46:50 +01:00
Viktor Lofgren
eb049658e4
(converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
...
Refactor to do this without introducing additional copies
2025-01-26 14:28:53 +01:00
Viktor Lofgren
db138b2a6f
(converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents
2025-01-26 14:25:57 +01:00
Viktor Lofgren
1673fc284c
(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains
2025-01-26 13:21:46 +01:00
Viktor Lofgren
503ea57d5b
(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains
2025-01-26 13:18:14 +01:00
Viktor Lofgren
18ca926c7f
(converter) Truncate excessively long strings in SentenceExtractor; malformed data was effectively DoS-ing the converter
2025-01-26 12:52:54 +01:00
Viktor Lofgren
db99242db2
(converter) Adding some logging around the simple processing track to investigate an issue with the converter stalling
2025-01-26 12:02:00 +01:00
Viktor Lofgren
2b9d2985ba
(doc) Update readme with up-to-date install instructions.
2025-01-24 18:51:41 +01:00
Viktor Lofgren
eeb6ecd711
(search) Make it clearer that the affiliate marker applies to the result, and not the search engine's relation to the result.
2025-01-24 18:50:00 +01:00
Viktor Lofgren
1f58aeadbf
(build) Upgrade JIB
2025-01-24 18:49:28 +01:00
Viktor Lofgren
3d68be64da
(crawler) Add default CT when it's missing for icons
2025-01-22 13:55:47 +01:00
Viktor Lofgren
668f3b16ef
(search) Redirect ^/site/$ to /site
2025-01-22 13:35:18 +01:00
Viktor Lofgren
98a340a0d1
(crawler) Add favicon data to domain state db in its own table
2025-01-22 11:41:20 +01:00
Viktor Lofgren
8862100f7e
(crawler) Improve logging and error handling
2025-01-21 21:44:21 +01:00
Viktor Lofgren
274941f6de
(crawler) Smarter parquet->slop crawl data migration
2025-01-21 21:26:12 +01:00
Viktor Lofgren
abec83582d
Fix refactoring gore
2025-01-21 15:08:04 +01:00
Viktor Lofgren
569520c9b6
(index) Add manual adjustments for rankings based on domain
2025-01-21 15:07:43 +01:00
Viktor Lofgren
088310e998
(converter) Improve simple processing performance
...
The recent slop migration changes introduced a performance regression in the simple conversion track. This change fixes that regression.
2025-01-21 14:13:33 +01:00
Viktor
270cab874b
Merge pull request #134 from MarginaliaSearch/slop-crawl-data-spike
...
Store crawl data in slop instead of parquet
2025-01-21 13:34:22 +01:00
Viktor Lofgren
4c74e280d3
(crawler) Fix urlencoding in sitemap fetcher
2025-01-21 13:33:35 +01:00