(crawler) Correct feed URLs in domain state db

Discovered feed URLs were stored in the DB with a double slash after the domain name. The URL normalizer removes the extra slash, so the stored URLs are still viable, but this commit fixes the issue regardless.
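
As a rough illustration of how a double slash like this can arise, here is a minimal Java sketch. It assumes each endpoint is appended to a root URL that already ends in a slash; the actual concatenation code is not part of this diff, and `FeedUrlSketch`, `naiveJoin`, and the example.com URL are hypothetical names used only for illustration.

```java
import java.util.List;

public class FeedUrlSketch {
    // Hypothetical helper: plain string concatenation, which is only an
    // assumption about how the crawler combines the root URL and the endpoint.
    static String naiveJoin(String rootUrl, String endpoint) {
        return rootUrl + endpoint;
    }

    public static void main(String[] args) {
        // Assumed: the domain's root URL already ends with a slash.
        String rootUrl = "https://www.example.com/";

        // First entry mimics the old endpoint list, second the corrected one.
        for (String endpoint : List.of("/rss.xml", "rss.xml")) {
            System.out.println(naiveJoin(rootUrl, endpoint));
        }
    }
}
```

Under that assumption, the endpoint with a leading slash produces `https://www.example.com//rss.xml`, while the endpoint without it produces `https://www.example.com/rss.xml`.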
Viktor Lofgren 2024-12-26 15:18:31 +01:00
parent 895cee7004
commit 89db69d360


@@ -297,16 +297,16 @@ public class CrawlerRetreiver implements AutoCloseable {
     }
     private final List<String> likelyFeedEndpoints = List.of(
-            "/rss.xml",
-            "/atom.xml",
-            "/feed.xml",
-            "/index.xml",
-            "/feed",
-            "/rss",
-            "/atom",
-            "/feeds",
-            "/blog/feed",
-            "/blog/rss"
+            "rss.xml",
+            "atom.xml",
+            "feed.xml",
+            "index.xml",
+            "feed",
+            "rss",
+            "atom",
+            "feeds",
+            "blog/feed",
+            "blog/rss"
     );
     private Optional<String> guessFeedUrl(CrawlDelayTimer timer) throws InterruptedException {