Viktor Lofgren
166a391eae
(docs) Improve architectural documentation for the crawler.
2023-11-30 21:30:57 +01:00
Viktor Lofgren
09917837d0
(process) Ensure construction exceptions are logged
...
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.
The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
2023-11-22 18:32:06 +01:00
Viktor Lofgren
7617b4cbc2
(crawler) Fix NPE in crawler caused by not having fetched the domains list yet
2023-11-06 18:16:38 +01:00
Viktor Lofgren
ebd10a5f28
(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized
2023-11-06 16:14:58 +01:00
Viktor Lofgren
8f74dbdbb4
(crawler) Set more lenient parameters for recrawl
2023-10-30 11:35:30 +01:00
Viktor Lofgren
fd5a7eac87
(crawler) Exit crawler retriever on thread interrupted
2023-10-30 11:34:16 +01:00
Viktor Lofgren
a497e4c920
(crawler) Terminate crawler after a few hours of no progress
2023-10-26 12:49:28 +02:00
Viktor Lofgren
81dd3809e9
(*) WIP Add node affinity to EC_DOMAIN
...
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
4baf9527d7
(*) WIP Control GUI redesign, executor-service, multi-node mq
...
This turned out to be very difficult to do in small isolated steps.
* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697
(*) Add node-affinity to services, processes and file storage.
2023-10-10 12:32:22 +02:00
Viktor Lofgren
3889c4bdd9
(refactor) Remove features-search and update documentation
2023-10-09 15:12:30 +02:00
Viktor Lofgren
c51159672e
(build) Move unit test configuration to root build.gradle
2023-10-04 12:46:22 +02:00
Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
...
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
ebc84c22fb
Upgrade antique lombok plugin
...
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a
Upgrade code to Java 20.
...
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
1d486bddee
(crawler) Reduce log spam
2023-08-16 11:12:09 +02:00
Viktor Lofgren
e7192a9cad
(mq) Refactor mq and actor library and move it to libraries out of common
2023-08-15 10:53:23 +02:00
Viktor Lofgren
251fc63b42
(*) Fix merge gore
2023-08-09 13:33:28 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program
2023-08-07 12:53:43 +02:00
Viktor Lofgren
c22feaf42e
(crawl) Make crawler limiter request a GC when throttling
2023-08-03 17:58:18 +02:00
Viktor Lofgren
e5c9791b14
(crawler) Fix rare ConcurrentModificationError due to HashSet
2023-08-01 17:28:29 +02:00
Viktor Lofgren
37c4cc68ed
TODO
2023-07-31 10:34:42 +02:00
Viktor Lofgren
5c071ce4d3
(crawler) Clean up the code and remove unnecessary logging
2023-07-30 16:53:39 +02:00
Viktor Lofgren
caf3d231a8
(crawler) Fix rare issue with NPEs if the crawl queue is empty
2023-07-30 16:53:13 +02:00
Viktor Lofgren
730e8f74e4
(crawler) Even more memory optimizations.
...
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
aba134284f
(crawler) Reduce log spam
2023-07-29 19:22:58 +02:00
Viktor Lofgren
2a6183f9e0
(crawler) Dynamic throttling of the number of active crawl jobs permitted to spawn; reduce queue size.
2023-07-29 19:20:09 +02:00
Viktor Lofgren
ee143bbc48
(crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended.
2023-07-29 19:19:09 +02:00
Viktor Lofgren
05ba3bab96
(crawler) Make SitemapRetriever abort on too large sitemaps.
2023-07-29 19:18:12 +02:00
Viktor Lofgren
d2b6b2044c
(crawler) Reduce log spam in HttpFetcherImpl
2023-07-29 19:18:12 +02:00
Viktor Lofgren
7611b7900d
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
...
(crawler) Reduce long term memory allocation in DomainCrawlFrontier
2023-07-29 19:18:12 +02:00
Viktor Lofgren
e237df4a10
(converter) Use a dumb thread pool instead of Java's executor service.
2023-07-28 18:15:16 +02:00
Viktor Lofgren
667b0ca0b0
(converter, WIP) Refactor CrawledDomainReader to not return iterators.
...
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798
(converter, WIP) Refactor converter to not have to load everything into RAM.
2023-07-24 15:25:09 +02:00
Viktor Lofgren
35b29e4f9e
(crawler) Clean up and refactor the code a bit
2023-07-23 19:06:37 +02:00
Viktor Lofgren
69f333c0bf
(crawler) Clean up and refactor the code a bit
2023-07-23 18:59:14 +02:00
Viktor Lofgren
c069c8c182
(crawler) Clean up crawl data reference and recrawl logic
2023-07-22 18:42:21 +02:00
Viktor Lofgren
9e4aa7da7c
(crawler) Support for X-Robots-Tag
2023-07-22 18:42:21 +02:00
Viktor Lofgren
58f2f86ea8
(crawler) Don't read all the data into RAM when doing a refresh-crawl
2023-07-21 19:47:52 +02:00
Viktor Lofgren
f91d92cccb
(crawler) WIP
2023-07-20 21:05:16 +02:00
Viktor Lofgren
5deec63667
(work-log) Better tests
2023-07-12 18:04:06 +02:00
Viktor Lofgren
74caf9e38a
(processes) Remove forEach-constructs in favor of iterators.
2023-07-12 17:47:36 +02:00
Viktor Lofgren
4c016b0318
Process monitoring
...
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
f03146de4b
(crawler) Fix bug poor handling of duplicate ids
...
* Also clean up the code a bit
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting ( #35 )
...
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00