MarginaliaSearch/code/process-models/crawling-model/src
Viktor Lofgren dec3b1092d (converter) Fix bugs in conversion
This commit adds a safety check that the document's URL belongs to the correct domain.
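
A minimal sketch of what such a domain check could look like. The class name `DomainCheck` and the method `isFromDomain` are hypothetical and are not taken from the converter; the actual check may well compare parsed URL objects rather than strings.

```java
import java.net.URI;

// Hypothetical helper, not the converter's actual code.
public class DomainCheck {
    /**
     * Returns true if the URL's host matches the expected domain
     * (case-insensitive). Unparseable or host-less URLs are rejected.
     */
    public static boolean isFromDomain(String url, String expectedDomain) {
        try {
            String host = URI.create(url).getHost();
            return host != null && host.equalsIgnoreCase(expectedDomain);
        } catch (IllegalArgumentException e) {
            // URI.create throws this for syntactically invalid input
            return false;
        }
    }
}
```

A document whose URL fails a check like this would be skipped rather than written under the wrong domain.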

It also adds a sizeHint() method to SerializableCrawlDataStream, which *may* indicate whether the stream is very large and would benefit from sideload-style processing (which is slow).
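
A rough illustration of the idea of an advisory size hint, under stated assumptions: the interface `HintedStream`, the `int` return type, the "0 means unknown" convention, and the threshold value are all hypothetical and do not reflect the real signature of SerializableCrawlDataStream.sizeHint().

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; names and threshold are assumptions.
public class SizeHintSketch {
    public interface HintedStream<T> extends Iterator<T> {
        /** Advisory estimate of the stream size; 0 means "unknown". */
        default int sizeHint() { return 0; }
    }

    // Assumed cut-off above which the slower sideload-style path is chosen
    public static final int SIDELOAD_THRESHOLD = 10_000;

    /** The hint is only advisory: an unknown size falls back to the normal path. */
    public static boolean shouldSideload(HintedStream<?> stream) {
        return stream.sizeHint() > SIDELOAD_THRESHOLD;
    }

    /** Wraps a list as a stream that reports the given hint. */
    public static <T> HintedStream<T> of(List<T> items, int hint) {
        Iterator<T> it = items.iterator();
        return new HintedStream<T>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            @Override public T next() { return it.next(); }
            @Override public int sizeHint() { return hint; }
        };
    }
}
```

Because the default implementation reports "unknown", existing stream implementations keep working unchanged and only streams that can cheaply estimate their size need to override it.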

It furthermore addresses a bug where ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
2023-12-29 13:58:08 +01:00
main/java (converter) Fix bugs in conversion 2023-12-29 13:58:08 +01:00
test/java/nu/marginalia/crawling (warc) Add fields for etags and last-modified headers to the new crawl data formats 2023-12-18 17:45:54 +01:00