Commit Graph

147 Commits

Author SHA1 Message Date
Viktor Lofgren
f18f82e229 (crawler) Write etags and last-modified on reference copy
This commit also fixes a test that broke with a previous change.
2023-12-25 01:40:13 +01:00
Viktor Lofgren
67ef2b45fa (crawler) Reduce logging 2023-12-25 01:10:03 +01:00
Viktor Lofgren
d72e871265 (warc) Fix resync 2023-12-25 01:03:03 +01:00
Viktor Lofgren
4c9bc13309 (warc) Reduce log spam 2023-12-25 00:58:31 +01:00
Viktor Lofgren
84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK 2023-12-25 00:55:05 +01:00
Viktor Lofgren
c5aab7e8db (warc) Fix NPE in WarcRecorder 2023-12-25 00:54:38 +01:00
Viktor Lofgren
1755b646b8 (warc) Fix NPE in WarcRecorder 2023-12-25 00:48:42 +01:00
Viktor Lofgren
e1a155a9c8 (crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under.  This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl.  GOOD_URLS is based on how many documents successfully process, which is typically fairly small.  Switching to KNOWN_URLS should let this grow faster.

The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.

The floor is also increased to 250 from 200.
2023-12-23 13:22:10 +01:00
Viktor Lofgren
a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams 2023-12-20 15:46:21 +01:00
Viktor Lofgren
bfae478251 Refactor CrawlerRevisitor for better consistency 2023-12-20 15:21:49 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add a fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tags
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
2023-12-16 15:58:27 +01:00
Viktor Lofgren
54ed3b86ba (minor) Remove dead code. 2023-12-15 21:49:35 +01:00
Viktor Lofgren
fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies
This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference, use whatever value we last saw.  This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.
2023-12-15 16:37:53 +01:00
Viktor Lofgren
9fea22b90d (warc) Further tidying
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
2023-12-15 15:38:23 +01:00
Viktor Lofgren
0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
Viktor Lofgren
1328bc4938 (warc) Clean up parquet conversion
This commit cleans up the warc->parquet conversion.  Records with a http status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
2023-12-14 16:05:48 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
2023-12-11 19:32:58 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
e6a1052ba7 Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default. 2023-12-08 20:24:01 +01:00
Viktor Lofgren
968dce50fc (crawler) Refactored IpInterceptingNetworkInterceptor for clarity. 2023-12-08 17:45:46 +01:00
Viktor Lofgren
3bbffd3c22 (crawler) Refactor HttpFetcher to integrate WarcRecorder
Partially hook in the WarcRecorder into the crawler process.  So far it's not read, but should record the crawled documents.

The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
2023-12-08 17:12:51 +01:00
Viktor Lofgren
072b5fcd12 Implement Warc-recording wrapper for OkHttp3 client
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted.  This component is currently not hooked into anything.

The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.

The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
2023-12-08 13:49:16 +01:00
Viktor Lofgren
064265b0b9 (crawler) Move content type/charset sniffing to a separate microlibrary
This functionality needs to be accessed by the WarcSideloader, which is in the converter.  The resultant microlibrary is tiny, but I think in this case it's justifiable.
2023-12-07 15:16:37 +01:00
Viktor Lofgren
166a391eae (docs) Improve architectural documentation for the crawler. 2023-11-30 21:30:57 +01:00
Viktor Lofgren
09917837d0 (process) Ensure construction exceptions are logged
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.

The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
2023-11-22 18:32:06 +01:00
Viktor Lofgren
7617b4cbc2 (crawler) Fix NPE in crawler caused by not having fetched the domains list yet 2023-11-06 18:16:38 +01:00
Viktor Lofgren
ebd10a5f28 (crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized 2023-11-06 16:14:58 +01:00
Viktor Lofgren
8f74dbdbb4 (crawler) Set more lenient parameters for recrawl 2023-10-30 11:35:30 +01:00
Viktor Lofgren
fd5a7eac87 (crawler) Exit crawler retriever on thread interrupted 2023-10-30 11:34:16 +01:00
Viktor Lofgren
a497e4c920 (crawler) Terminate crawler after a few hours of no progress 2023-10-26 12:49:28 +02:00
Viktor Lofgren
81dd3809e9 (*) WIP Add node affinity to EC_DOMAIN
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
4baf9527d7 (*) WIP Control GUI redesign, executor-service, multi-node mq
This turned out to be very difficult to do in small isolated steps.

* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697 (*) Add node-affinity to services, processes and file storage. 2023-10-10 12:32:22 +02:00
Viktor Lofgren
3889c4bdd9 (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d895f83520 (blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
5c040f7a46 (crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
eaeb23d41e (refactor) Remove converting-model package completely 2023-09-14 11:21:44 +02:00
Viktor Lofgren
39c1857c61 (heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction. 2023-08-29 13:07:55 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
1d486bddee (crawler) Reduce log spam 2023-08-16 11:12:09 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
251fc63b42 (*) Fix merge gore 2023-08-09 13:33:28 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program 2023-08-07 12:53:43 +02:00
Viktor Lofgren
c22feaf42e (crawl) Make crawler limiter request a GC when throttling 2023-08-03 17:58:18 +02:00
Viktor Lofgren
e5c9791b14 (crawler) Fix rare ConcurrentModificationError due to HashSet 2023-08-01 17:28:29 +02:00
Viktor Lofgren
37c4cc68ed TODO 2023-07-31 10:34:42 +02:00
Viktor Lofgren
5c071ce4d3 (crawler) Clean up the code and remove unnecessary logging 2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8 (crawler) Fix rare issue with NPEs if the crawl queue is empty 2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4 (crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f (crawler) Reduce log spam 2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0 (crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size. 2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48 (crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended. 2023-07-29 19:19:09 +02:00
Viktor Lofgren
05ba3bab96 (crawler) Make SitemapRetriever abort on too large sitemaps. 2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c (crawler) Reduce log spam in HttpFetcherImpl 2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d (crawler) Reduce long term memory allocation in DomainCrawlFrontier
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
e237df4a10 (converter) Use a dumb thread pool instead of Java's executor service. 2023-07-28 18:15:16 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
35b29e4f9e (crawler) Clean up and refactor the code a bit 2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf (crawler) Clean up and refactor the code a bit 2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182 (crawler) Clean up crawl data reference and recrawl logic 2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c (crawler) Support for X-Robots-Tag 2023-07-22 18:42:21 +02:00
Viktor Lofgren
58f2f86ea8 (crawler) Don't read all the data into RAM when doing a refresh-crawl 2023-07-21 19:47:52 +02:00
Viktor Lofgren
f91d92cccb (crawler) WIP 2023-07-20 21:05:16 +02:00
Viktor Lofgren
5deec63667 (work-log) Better tests 2023-07-12 18:04:06 +02:00
Viktor Lofgren
74caf9e38a (processes) Remove forEach-constructs in favor of iterators. 2023-07-12 17:47:36 +02:00
Viktor Lofgren
4c016b0318 Process monitoring
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b (crawler) Fix bug poor handling of duplicate ids
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
2619d196bb (crawler) Fix bug poor handling of duplicate ids
* Also clean up the code a bit
2023-07-07 19:56:14 +02:00
Viktor Lofgren
647bbfa617 Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:23 +02:00
Viktor Lofgren
b73fcc19fe Fix so that crawler tests don't sometimes fetch real sitemaps when they're run. 2023-07-06 18:05:03 +02:00
Viktor Lofgren
24dce8c03b Remove link filtering for mediawiki, it's too strict and not every site uses the /wiki/-pattern. 2023-07-01 19:32:25 +02:00
Viktor Lofgren
7d86586594 Remove annoying log spam in sitemap retriever 2023-06-30 17:08:35 +02:00
Viktor Lofgren
11c26e700e Remove annoying log spam in crawler retriever 2023-06-30 17:08:24 +02:00
Viktor Lofgren
d71124961e Better tests for crawling and processing. 2023-06-27 16:11:27 +02:00
Viktor Lofgren
fbdedf53de Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
d167ad2017 Remove sitemap related log spam 2023-06-27 13:59:47 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06 Set default timeouts for java.net.URL-connections 2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151 Tests for crawler specialization + testdata 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0 Sitemap support, refined crawler specialization 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
e4372289a5 Use fixed buffers for BigString compression and decompression to reduce GC churn.
fixup! Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
eb2ca942d5 Up the default crawl delay to 1 second. 2023-06-07 22:02:17 +02:00
Viktor Lofgren
e332faa07e Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu. 2023-05-28 13:46:24 +02:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00