MarginaliaSearch/code/process-models/crawling-model
Viktor Lofgren 9fea22b90d (warc) Further tidying
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
2023-12-15 15:38:23 +01:00
src (warc) Further tidying 2023-12-15 15:38:23 +01:00
build.gradle (crawling-model) Implement a parquet format for crawl data 2023-12-13 16:22:19 +01:00
readme.md (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00

Crawling Models

Contains models shared by the crawling-process and converting-process.

Central Classes

Serialization