Viktor Lofgren, commit 0307c55f9f (2024-02-20): (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
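
As a purely illustrative sketch, not the actual interface in the codebase, such a registry abstraction might look roughly like this, with one implementation backed by Zookeeper and another that returns fixed docker hostnames:

// Hypothetical sketch only: the real interface in the codebase and its
// method names may differ.  Implementations would include a Zookeeper-backed
// registry and a hard-coded, docker-only one.
import java.net.InetSocketAddress;
import java.util.List;

public interface ServiceRegistry {
    // Announce that an instance of the named service is reachable at host:port.
    void register(String serviceName, String host, int port);

    // Look up the currently known endpoints for the named service.
    List<InetSocketAddress> resolve(String serviceName);
}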

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.
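
For illustration only, and not the project's actual library, a minimal gRPC channel manager could cache one channel per endpoint using the standard grpc-java builder API:

// Hypothetical sketch: caches one plaintext gRPC channel per host:port.
// Only the grpc-java calls (ManagedChannelBuilder, ManagedChannel) are real;
// the class itself is illustrative.
import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class GrpcChannelPool {
    private final Map<String, ManagedChannel> channels = new ConcurrentHashMap<>();

    public ManagedChannel channelFor(String host, int port) {
        return channels.computeIfAbsent(host + ":" + port,
                key -> ManagedChannelBuilder.forAddress(host, port)
                        .usePlaintext()
                        .build());
    }

    public void shutdownAll() {
        channels.values().forEach(ManagedChannel::shutdown);
    }
}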

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.

Crawling Models

Contains crawl data models shared by the crawling-process and converting-process.

To ensure backward compatibility with older versions of the data, the serialization is abstracted away from the model classes.

The new serialization format is parquet files.

The old format was zstd-compressed JSON. It is still supported for now, but parquet is preferred: it is not only more compact, but also significantly faster to read and much more portable. JSON support will be removed in the future.

Central Classes

Serialization

The serialization classes automatically negotiate the serialization format based on the file extension.

Data is accessed through a SerializableCrawlDataStream, which is a somewhat enhanced Iterator that can be used to read data.
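
As a rough usage sketch, reading crawl data then looks something like the snippet below. The factory method is a hypothetical stand-in, and hasNext()/next() are assumed from the Iterator analogy above; the project's real API may differ.

// Illustrative only: shows the iterator-style access pattern described above.
// openCrawlData() is a stand-in for whatever factory the library provides.
import java.nio.file.Path;

public class ReadCrawlDataExample {
    void readAll(Path file) throws Exception {
        // The serialization format (parquet vs. legacy zstd-JSON) is chosen
        // from the file extension, so the caller never picks it explicitly.
        SerializableCrawlDataStream stream = openCrawlData(file);

        while (stream.hasNext()) {       // iterator-style traversal
            var data = stream.next();    // one crawl data item at a time
            System.out.println(data);
        }
    }

    // Stand-in for the library's actual factory/constructor.
    private SerializableCrawlDataStream openCrawlData(Path file) {
        throw new UnsupportedOperationException("illustrative stub");
    }
}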

Parquet Serialization

The parquet serialization is done using the CrawledDocumentParquetRecordFileReader and CrawledDocumentParquetRecordFileWriter classes, which read and write parquet files respectively.

The model classes are serialized to parquet using the CrawledDocumentParquetRecord class.

The record has the following fields (an illustrative code sketch follows the list):

  • domain - The domain of the document
  • url - The URL of the document
  • ip - The IP address the document was fetched from
  • cookies - Whether the document has cookies
  • httpStatus - The HTTP status code of the response
  • timestamp - The time the document was fetched
  • contentType - The content type of the document
  • body - The body of the document
  • etagHeader - The ETag header of the response
  • lastModifiedHeader - The Last-Modified header of the response

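For illustration, the record corresponds roughly to a Java record with one component per field above. The Java types shown here are assumptions; the field names and meanings are the documented ones.

// Rough sketch of the record shape based on the field list above; the actual
// class may use different Java types or a different structure entirely.
import java.time.Instant;

public record CrawledDocumentParquetRecordSketch(
        String domain,              // the domain of the document
        String url,                 // the URL of the document
        String ip,                  // the IP address the document was fetched from
        boolean cookies,            // whether the document has cookies
        int httpStatus,             // the HTTP status code of the response
        Instant timestamp,          // the time the document was fetched
        String contentType,         // the content type of the document
        byte[] body,                // the body of the document
        String etagHeader,          // the ETag header of the response
        String lastModifiedHeader   // the Last-Modified header of the response
) {}
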
The easiest way to interact with parquet files is to use DuckDB, which lets you run SQL queries on parquet files (and almost anything else).

e.g.

$ select httpStatus, count(*) as cnt 
       from 'my-file.parquet' 
       group by httpStatus;
┌────────────┬───────┐
│ httpStatus │  cnt  │
│   int32    │ int64 │
├────────────┼───────┤
│        200 │    43 │
│        304 │     4 │
│        500 │     1 │
└────────────┴───────┘