MarginaliaSearch/code/processes/crawling-process/model/java/nu/marginalia
Viktor Lofgren b8581b0f56 (crawler) Safe sanitization of headers during warc->slop conversion
The warc->slop converter was rejecting some items because they had headers that were representable in the Warc code's MessageHeader map implementation, but illegal in the HttpHeaders' implementation.

Fixing this by manually filtering these out.  Ostensibly the constructor has a filtering predicate, but this annoyingly runs too late and fails to prevent the problem.
2025-01-31 12:47:42 +01:00
..
io (converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream 2025-01-26 14:46:50 +01:00
model (crawler) Safe sanitization of headers during warc->slop conversion 2025-01-31 12:47:42 +01:00
parquet/crawldata (crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead. 2025-01-19 15:07:11 +01:00
slop (crawler) Reduce long retention of CrawlDataReference objects and their associated SerializableCrawlDataStreams 2025-01-26 15:40:17 +01:00
ContentTypes.java (crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets 2024-12-11 17:01:52 +01:00