Viktor Lofgren
|
d895f83520
|
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
|
2023-09-20 10:11:49 +02:00 |
|
Viktor Lofgren
|
5c040f7a46
|
(crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
|
2023-09-17 09:41:34 +02:00 |
|
Viktor Lofgren
|
eaeb23d41e
|
(refactor) Remove converting-model package completely
|
2023-09-14 11:21:44 +02:00 |
|
Viktor Lofgren
|
39c1857c61
|
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
|
2023-08-29 13:07:55 +02:00 |
|
Viktor Lofgren
|
ebc84c22fb
|
Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
|
2023-08-23 14:34:32 +00:00 |
|
Viktor Lofgren
|
aa0d256d6a
|
Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
|
2023-08-23 13:37:49 +00:00 |
|
Viktor Lofgren
|
1d486bddee
|
(crawler) Reduce log spam
|
2023-08-16 11:12:09 +02:00 |
|
Viktor Lofgren
|
e7192a9cad
|
(mq) Refactor the mq and actor libraries and move them out of common into libraries
|
2023-08-15 10:53:23 +02:00 |
|
Viktor Lofgren
|
251fc63b42
|
(*) Fix merge gore
|
2023-08-09 13:33:28 +02:00 |
|
Viktor
|
52e2ab45bf
|
Merge branch 'master' into master-control-program
|
2023-08-07 12:53:43 +02:00 |
|
Viktor Lofgren
|
c22feaf42e
|
(crawl) Make crawler limiter request a GC when throttling
|
2023-08-03 17:58:18 +02:00 |
|
Viktor Lofgren
|
e5c9791b14
|
(crawler) Fix rare ConcurrentModificationException due to HashSet
|
2023-08-01 17:28:29 +02:00 |
|
Viktor Lofgren
|
37c4cc68ed
|
TODO
|
2023-07-31 10:34:42 +02:00 |
|
Viktor Lofgren
|
5c071ce4d3
|
(crawler) Clean up the code and remove unnecessary logging
|
2023-07-30 16:53:39 +02:00 |
|
Viktor Lofgren
|
caf3d231a8
|
(crawler) Fix rare issue with NPEs if the crawl queue is empty
|
2023-07-30 16:53:13 +02:00 |
|
Viktor Lofgren
|
730e8f74e4
|
(crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
|
2023-07-30 14:19:55 +02:00 |
|
Viktor Lofgren
|
aba134284f
|
(crawler) Reduce log spam
|
2023-07-29 19:22:58 +02:00 |
|
Viktor Lofgren
|
2a6183f9e0
|
(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.
|
2023-07-29 19:20:09 +02:00 |
|
Viktor Lofgren
|
ee143bbc48
|
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
|
2023-07-29 19:19:09 +02:00 |
|
Viktor Lofgren
|
05ba3bab96
|
(crawler) Make SitemapRetriever abort on too large sitemaps.
|
2023-07-29 19:18:12 +02:00 |
|
Viktor Lofgren
|
d2b6b2044c
|
(crawler) Reduce log spam in HttpFetcherImpl
|
2023-07-29 19:18:12 +02:00 |
|
Viktor Lofgren
|
7611b7900d
|
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
|
2023-07-29 19:18:12 +02:00 |
|
Viktor Lofgren
|
e237df4a10
|
(converter) Use a dumb thread pool instead of Java's executor service.
|
2023-07-28 18:15:16 +02:00 |
|
Viktor Lofgren
|
667b0ca0b0
|
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
|
2023-07-24 16:28:30 +02:00 |
|
Viktor Lofgren
|
a56953c798
|
(converter, WIP) Refactor converter to not have to load everything into RAM.
|
2023-07-24 15:25:09 +02:00 |
|
Viktor Lofgren
|
35b29e4f9e
|
(crawler) Clean up and refactor the code a bit
|
2023-07-23 19:06:37 +02:00 |
|
Viktor Lofgren
|
69f333c0bf
|
(crawler) Clean up and refactor the code a bit
|
2023-07-23 18:59:14 +02:00 |
|
Viktor Lofgren
|
c069c8c182
|
(crawler) Clean up crawl data reference and recrawl logic
|
2023-07-22 18:42:21 +02:00 |
|
Viktor Lofgren
|
9e4aa7da7c
|
(crawler) Support for X-Robots-Tag
|
2023-07-22 18:42:21 +02:00 |
|
Viktor Lofgren
|
58f2f86ea8
|
(crawler) Don't read all the data into RAM when doing a refresh-crawl
|
2023-07-21 19:47:52 +02:00 |
|
Viktor Lofgren
|
f91d92cccb
|
(crawler) WIP
|
2023-07-20 21:05:16 +02:00 |
|
Viktor Lofgren
|
5deec63667
|
(work-log) Better tests
|
2023-07-12 18:04:06 +02:00 |
|
Viktor Lofgren
|
74caf9e38a
|
(processes) Remove forEach-constructs in favor of iterators.
|
2023-07-12 17:47:36 +02:00 |
|
Viktor Lofgren
|
4c016b0318
|
Process monitoring
* Also refactored the SQL tables a bit
|
2023-07-11 14:46:21 +02:00 |
|
Viktor
|
cbbf60a599
|
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
|
2023-07-10 18:58:43 +02:00 |
|
Viktor Lofgren
|
f03146de4b
|
(crawler) Fix poor handling of duplicate ids
* Also clean up the code a bit
|
2023-07-10 18:58:43 +02:00 |
|
Viktor
|
0f9b90eb1c
|
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
|
2023-07-10 17:36:12 +02:00 |
|
Viktor Lofgren
|
2619d196bb
|
(crawler) Fix poor handling of duplicate ids
* Also clean up the code a bit
|
2023-07-07 19:56:14 +02:00 |
|
Viktor Lofgren
|
647bbfa617
|
Fix crawler tests so that they no longer occasionally fetch real sitemaps when run.
|
2023-07-06 18:05:23 +02:00 |
|
Viktor Lofgren
|
b73fcc19fe
|
Fix crawler tests so that they no longer occasionally fetch real sitemaps when run.
|
2023-07-06 18:05:03 +02:00 |
|
Viktor Lofgren
|
24dce8c03b
|
Remove link filtering for MediaWiki; it's too strict, and not every site uses the /wiki/ pattern.
|
2023-07-01 19:32:25 +02:00 |
|
Viktor Lofgren
|
7d86586594
|
Remove annoying log spam in sitemap retriever
|
2023-06-30 17:08:35 +02:00 |
|
Viktor Lofgren
|
11c26e700e
|
Remove annoying log spam in crawler retriever
|
2023-06-30 17:08:24 +02:00 |
|
Viktor Lofgren
|
d71124961e
|
Better tests for crawling and processing.
|
2023-06-27 16:11:27 +02:00 |
|
Viktor Lofgren
|
fbdedf53de
|
Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
|
2023-06-27 15:50:38 +02:00 |
|
Viktor Lofgren
|
d167ad2017
|
Remove sitemap related log spam
|
2023-06-27 13:59:47 +02:00 |
|
Viktor Lofgren
|
f8f9f04158
|
Specialized logic for processing Lemmy-based websites.
|
2023-06-27 10:57:54 +02:00 |
|
Viktor Lofgren
|
b0c7480d06
|
Set default timeouts for java.net.URL-connections
|
2023-06-27 10:57:54 +02:00 |
|
Viktor Lofgren
|
e7af77e151
|
Tests for crawler specialization + testdata
|
2023-06-27 10:57:54 +02:00 |
|
Viktor Lofgren
|
ec940e36d0
|
Sitemap support, refined crawler specialization
|
2023-06-27 10:57:54 +02:00 |
|