Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git, synced 2025-02-24 05:18:58 +00:00
This update adds timestamps to the parquet format for crawl data, extracted from the WARC stream. The parquet format stores the timestamp as a 64-bit long, in seconds since the Unix epoch, without a logical type. This avoids format conversions when writing and reading the data. This parquet field populates the timestamp field in CrawledDocument.

| Name |
|---|
| src |
| build.gradle |
| readme.md |

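The timestamp encoding described in the update note can be illustrated with a minimal sketch. It assumes the in-memory value is a `java.time.Instant`; the class and method names below are hypothetical and are not taken from the repository, and the actual type of the timestamp field in CrawledDocument may differ.

```java
import java.time.Instant;

// Hypothetical helper, for illustration only: shows how a plain 64-bit
// "seconds since the Unix epoch" value maps to and from an Instant.
class EpochSecondsSketch {

    // Value that would be written to the parquet column as a plain long.
    // Sub-second precision is dropped.
    static long toEpochSeconds(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    // Value reconstructed when the column is read back.
    static Instant fromEpochSeconds(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }

    public static void main(String[] args) {
        Instant fetched = Instant.parse("2025-02-24T05:18:58Z");
        long stored = toEpochSeconds(fetched);
        System.out.println(stored);
        System.out.println(fromEpochSeconds(stored));
    }
}
```

Because the column carries no logical type, a reader has to know by convention that the long is in seconds rather than milliseconds; the trade-off stated in the update note is that no format conversion is needed on either side.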
Crawling Models
Contains models shared by the crawling-process and converting-process.