MarginaliaSearch/code/processes/crawling-process/model/java/nu/marginalia
2025-01-26 14:46:50 +01:00
io (converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream 2025-01-26 14:46:50 +01:00
model (converter) Add truncation at the parser step to prevent the converter from spending too much time on excessively large documents 2025-01-26 14:28:53 +01:00
parquet/crawldata (crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead. 2025-01-19 15:07:11 +01:00
slop Merge branch 'master' into slop-crawl-data-spike 2025-01-21 13:32:58 +01:00
ContentTypes.java (crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets 2024-12-11 17:01:52 +01:00