(crawler) Use the response URL when resolving relative links

The crawler incorrectly used the request URL as the base URL when resolving relative links. This gave wrong results whenever the fetch followed a redirect. For example, if we fetch /log, are redirected to /log/, and find links to foo/ and bar/, these would resolve to /foo/ and /bar/ instead of /log/foo/ and /log/bar/.
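
The effect of the base URL can be reproduced outside the crawler. The following is a minimal sketch that uses the JDK's java.net.URI as a stand-in for the crawler's own EdgeUrl class; the host name, the class name RelativeLinkDemo, and the helper resolve are made up for illustration and are not part of the codebase.

```java
import java.net.URI;

class RelativeLinkDemo {
    // Resolve a relative href against a base URL (hypothetical helper,
    // standing in for whatever the crawler does internally).
    static String resolve(String base, String href) {
        return URI.create(base).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed scenario: /log redirects to /log/
        String requestUrl = "https://www.example.com/log";
        String responseUrl = "https://www.example.com/log/";

        // Old behaviour: the request URL is the base, so the last path
        // segment ("log") is dropped before the href is appended.
        System.out.println(resolve(requestUrl, "foo/"));  // https://www.example.com/foo/

        // New behaviour: the response URL is the base, so the link
        // stays underneath /log/.
        System.out.println(resolve(responseUrl, "foo/")); // https://www.example.com/log/foo/
    }
}
```

The only difference between the two calls is the trailing slash on the base, which is exactly what the redirect adds.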
2025-02-22 20:48:59 +00:00 · 2025-01-31 12:40:13 +01:00 · 2025-01-31 12:40:13 +01:00 · 2ea34767d8
commit 2ea34767d8
parent e9af838231
1 changed file with 4 additions and 2 deletions
--- a/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
+++ b/code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
@@ -381,8 +381,10 @@ public class CrawlerRetreiver implements AutoCloseable {
                if (docOpt.isPresent()) {
                    var doc = docOpt.get();

-                    crawlFrontier.enqueueLinksFromDocument(top, doc);
-                    crawlFrontier.addVisited(new EdgeUrl(ok.uri()));
+                    var responseUrl = new EdgeUrl(ok.uri());
+
+                    crawlFrontier.enqueueLinksFromDocument(responseUrl, doc);
+                    crawlFrontier.addVisited(responseUrl);
                }
            }
            else if (fetchedDoc instanceof HttpFetchResult.Result304Raw && reference.doc() != null) {