MarginaliaSearch/code/features-crawl
Viktor Lofgren 785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
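
To illustrate the behavior the commit message describes, here is a minimal sketch of extracting a meta-tag redirect target and adding it to a crawl frontier. It is not the crawler's actual code: the class `MetaRedirectSketch`, the method `findMetaRedirect`, and the `Deque`-based frontier are hypothetical names chosen for this example. It uses a regular expression instead of a real HTML parser, and it assumes the `http-equiv` attribute appears before `content`.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not the crawler's implementation.
class MetaRedirectSketch {
    // Assumes http-equiv comes before content and the content value is quoted.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta\\s+[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Picks the target out of a content value such as "0; url=/new-location".
    private static final Pattern URL_PART =
            Pattern.compile("url\\s*=\\s*(\\S+)", Pattern.CASE_INSENSITIVE);

    /** Returns the redirect target resolved against the document URL, if any. */
    public static Optional<URI> findMetaRedirect(String html, URI documentUrl) {
        Matcher meta = META_REFRESH.matcher(html);
        if (!meta.find()) {
            return Optional.empty();
        }
        Matcher url = URL_PART.matcher(meta.group(1));
        if (!url.find()) {
            return Optional.empty();
        }
        try {
            // resolve() handles both relative and absolute targets
            return Optional.of(documentUrl.resolve(url.group(1)));
        } catch (IllegalArgumentException e) {
            // Malformed target: treat as "no redirect" rather than failing the crawl
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        Deque<URI> frontier = new ArrayDeque<>(); // stand-in for the crawl frontier
        URI documentUrl = URI.create("https://www.example.com/old");
        String html = "<html><head>"
                + "<meta http-equiv=\"refresh\" content=\"0; url=/new-location\">"
                + "</head></html>";

        // Queue the redirect target so the crawler visits it later
        findMetaRedirect(html, documentUrl).ifPresent(frontier::add);
        System.out.println(frontier);
    }
}
```

Because the target is resolved against the document URL, a redirect to another domain yields an absolute URI with a different host, which is the case the commit message describes as a link between documents.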
content-type (crawler) Fix rare exception in content type handling due to improper length checking of a split() array 2024-01-18 11:05:45 +01:00
crawl-blocklist (*) Overhaul settings and properties 2024-01-13 17:12:18 +01:00
link-parser (crawler) Improve meta-tag redirect handling, add tests for redirects. 2024-02-01 20:30:43 +01:00
readme.md Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
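
The content-type entry above mentions an exception caused by improper length checking of a `split()` array. The commit message does not say which call was at fault, so the following is only a sketch of that general class of bug and a defensive way to avoid it. `ContentTypeSketch`, `Parsed`, and `parse` are hypothetical names, not the module's API.

```java
import java.util.Locale;

// Hypothetical sketch of length-checked Content-Type parsing.
class ContentTypeSketch {
    /** Simple holder for the two parts of a Content-Type value. */
    public static final class Parsed {
        public final String mediaType;
        public final String charset;

        Parsed(String mediaType, String charset) {
            this.mediaType = mediaType;
            this.charset = charset;
        }
    }

    public static Parsed parse(String header) {
        if (header == null || header.isBlank()) {
            return new Parsed("", "");
        }

        // String.split drops trailing empty strings, so ";".split(";") has
        // length 0 and indexing parts[0] without a check would throw.
        String[] parts = header.split(";");
        String mediaType = parts.length > 0
                ? parts[0].trim().toLowerCase(Locale.ROOT)
                : "";

        String charset = "";
        for (int i = 1; i < parts.length; i++) {
            // A parameter without '=' yields a one-element array; check before reading kv[1]
            String[] kv = parts[i].split("=", 2);
            if (kv.length == 2 && kv[0].trim().equalsIgnoreCase("charset")) {
                charset = kv[1].trim();
            }
        }
        return new Parsed(mediaType, charset);
    }

    public static void main(String[] args) {
        Parsed p = parse("text/html; charset=UTF-8");
        System.out.println(p.mediaType + " / " + p.charset);
    }
}
```

The point of the sketch is that every array produced by `split()` is length-checked before indexing, so degenerate header values produce empty fields instead of an `ArrayIndexOutOfBoundsException`.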

Crawl Features

These are relatively isolated pieces of search-engine business logic that are clearer when kept separate from the rest of the crawling code.