mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Crawling Models

Contains crawl data models shared by the crawling-process and converting-process.

To ensure backward compatibility with older versions of the data, the serialization is abstracted away from the model classes.

Data is now serialized as parquet files.

Previously it was serialized as zstd-compressed JSON. That format is still supported for now, but parquet is preferred because it is more compact, significantly faster to read, and much more portable. JSON support will be removed in the future.

Central Classes

Serialization

These serialization classes automatically negotiate the serialization format based on the file extension.

Data is accessed through a SerializableCrawlDataStream, which is a somewhat enhanced Iterator that can be used to read data.
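The iterator-like access pattern can be sketched roughly as follows. This is an illustration only, not the actual SerializableCrawlDataStream API: the RecordStream interface, the fromList helper, and the count method are hypothetical names, and the sketch assumes the stream's methods may throw IOException and that the stream must be closed after use.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.List;

class CrawlDataStreamSketch {

    /** Hypothetical stand-in for the real stream type: an iterator whose
     *  methods may fail with IOException and which is closed after use. */
    public interface RecordStream<T> extends AutoCloseable {
        boolean hasNext() throws IOException;
        T next() throws IOException;
        @Override
        void close() throws IOException;
    }

    /** Wraps an in-memory list so the sketch runs without crawl data on disk. */
    public static <T> RecordStream<T> fromList(List<T> items) {
        Iterator<T> it = items.iterator();
        return new RecordStream<>() {
            public boolean hasNext() { return it.hasNext(); }
            public T next() { return it.next(); }
            public void close() { /* nothing to release for an in-memory list */ }
        };
    }

    /** Drains the stream and counts its records; try-with-resources closes it. */
    public static <T> int count(RecordStream<T> stream) {
        int n = 0;
        try (stream) {
            while (stream.hasNext()) {
                stream.next();
                n++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count(fromList(List.of("a", "b", "c")))); // prints 3
    }
}
```

Closing the stream in a try-with-resources block matters more with file-backed data, where an unclosed stream would leak a file handle.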

CrawledDomainReader
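Choosing a format from the file extension might look something like the sketch below. It does not reproduce CrawledDomainReader: the Format enum and detect method are hypothetical, and the ".parquet" and ".zstd" suffixes are assumptions made for illustration.

```java
import java.nio.file.Path;
import java.util.Locale;

class CrawlDataFormatSketch {

    /** The two on-disk formats described in this readme. */
    public enum Format { PARQUET, LEGACY_JSON_ZSTD }

    /** Picks a format from the file name alone.
     *  The ".parquet" and ".zstd" suffixes are assumed, not taken from the real code. */
    public static Format detect(Path file) {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        if (name.endsWith(".parquet")) return Format.PARQUET;
        if (name.endsWith(".zstd")) return Format.LEGACY_JSON_ZSTD;
        throw new IllegalArgumentException("Unrecognized crawl data file: " + file);
    }

    public static void main(String[] args) {
        System.out.println(detect(Path.of("crawl-data", "example.parquet"))); // prints PARQUET
    }
}
```

Keeping this decision in one place lets calling code stay the same when the legacy JSON format is eventually removed.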

Parquet Serialization

The parquet serialization is done using the CrawledDocumentParquetRecordFileReader and CrawledDocumentParquetRecordFileWriter classes, which read and write parquet files respectively.

The model classes are serialized to parquet using CrawledDocumentParquetRecord.

The record has the following fields:

domain - The domain of the document
url - The URL of the document
ip - The IP address of the server the document was fetched from
cookies - Whether cookies were set when fetching the document
httpStatus - The HTTP status code of the document
timestamp - The timestamp of the document
contentType - The content type of the document
body - The body of the document
etagHeader - The ETag header of the document
lastModifiedHeader - The Last-Modified header of the document
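For orientation, the field list above could be mirrored in a Java record along these lines. This is a sketch, not the real CrawledDocumentParquetRecord: apart from httpStatus, which the DuckDB output in this readme shows as int32, every field type here is an assumption, and DocumentRecord and example are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.time.Instant;

class ParquetRecordSketch {

    /** Hypothetical mirror of the documented fields; all types except
     *  httpStatus are assumptions made for illustration. */
    public record DocumentRecord(
            String domain,
            String url,
            String ip,
            boolean cookies,
            int httpStatus,
            Instant timestamp,
            String contentType,
            byte[] body,
            String etagHeader,
            String lastModifiedHeader) {}

    /** Builds a made-up record; none of the values come from real crawl data. */
    public static DocumentRecord example() {
        return new DocumentRecord(
                "www.example.com",
                "https://www.example.com/",
                "127.0.0.1",
                false,
                200,
                Instant.EPOCH,
                "text/html",
                "<html></html>".getBytes(StandardCharsets.UTF_8),
                null,   // no ETag header in this made-up response
                null);  // no Last-Modified header either
    }

    public static void main(String[] args) {
        DocumentRecord r = example();
        System.out.println(r.url() + " -> " + r.httpStatus());
    }
}
```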

The easiest way to interact with parquet files is to use DuckDB, which lets you run SQL queries on parquet files (and almost anything else).

For example, to count the documents in a file by HTTP status code:

$ select httpStatus, count(*) as cnt 
       from 'my-file.parquet' 
       group by httpStatus;
┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘