MarginaliaSearch/code/processes/crawling-process/java/nu/marginalia/crawl/retreival
Viktor Lofgren 2ea34767d8 (crawler) Use the response URL when resolving relative links
The crawler was incorrectly using the request URL as the base URL when resolving relative links. This caused problems when encountering redirects.

For example, if we fetch /log, which redirects to /log/, and find links to foo/ and bar/, these would resolve to /foo and /bar rather than /log/foo and /log/bar.
2025-01-31 12:40:13 +01:00
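The effect described in the commit message can be reproduced with the standard java.net.URI class. The following is only a minimal sketch, not the crawler's actual code: the class name RelativeLinkResolution, the helper resolveLink, and the host www.example.com are hypothetical and chosen for illustration.

```java
import java.net.URI;

// Illustrative sketch only; names here are hypothetical and not taken from CrawlerRetreiver.
class RelativeLinkResolution {

    // Resolve a relative href against the given base URL using standard URI resolution rules.
    static URI resolveLink(URI baseUrl, String href) {
        return baseUrl.resolve(href);
    }

    public static void main(String[] args) {
        // The URL we asked for, and the URL the server redirected us to.
        URI requestUrl = URI.create("https://www.example.com/log");
        URI responseUrl = URI.create("https://www.example.com/log/");

        // Using the request URL as the base drops the "log" segment.
        System.out.println(resolveLink(requestUrl, "foo/"));  // https://www.example.com/foo/

        // Using the response URL as the base keeps links inside /log/.
        System.out.println(resolveLink(responseUrl, "foo/")); // https://www.example.com/log/foo/
    }
}
```

The trailing slash is what matters: without it, the last path segment of the base is treated as a file name and is replaced by the relative link, which is why the post-redirect URL is the safer base.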
revisit (crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams 2025-01-26 15:40:17 +01:00
CrawlDataReference.java (crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams 2025-01-26 15:40:17 +01:00
CrawlDelayTimer.java (live-crawler) Crude first-try process for live crawling #WIP 2024-11-19 19:35:01 +01:00
CrawlerRetreiver.java (crawler) Use the response URL when resolving relative links 2025-01-31 12:40:13 +01:00
CrawlerWarcResynchronizer.java (crawler) Refactor 2024-09-23 17:51:07 +02:00
DomainCrawlFrontier.java (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
DomainProber.java (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl 2024-09-24 15:08:22 +02:00