MarginaliaSearch/code/processes/crawling-process/java/nu/marginalia/crawl
Viktor Lofgren 481f999b70 (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
fetcher (crawler, EXPERIMENT) Disable content type probing and use Accept header instead 2024-09-30 14:53:01 +02:00
logic (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl 2024-09-24 15:08:22 +02:00
retreival (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full. 2024-10-15 14:22:40 +02:00
spec (crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains 2024-10-15 14:21:59 +02:00
warc (crawler) Code quality 2024-04-22 15:37:35 +02:00
AbortMonitor.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00
CrawlerMain.java (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:55:59 +02:00
CrawlerModule.java (refac) Remove src/main from all source code paths. 2024-02-23 16:13:40 +01:00