Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saving them as WARC files, and then converts these into parquet models. Both formats are described in crawling-process/model.
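
To make the fetch-and-record idea concrete, here is a minimal, hypothetical sketch in plain Java: it fetches a single URL and writes the raw body to disk. It stands in for the crawler's real fetch loop, which records full request/response pairs as WARC records; the class name, user agent, and file name here are illustrative only.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Hypothetical sketch: fetch one URL and persist the raw response,
    // standing in for the real crawler's WARC-recording fetch loop.
    public class FetchSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newBuilder()
                    .followRedirects(HttpClient.Redirect.NORMAL)
                    .build();

            HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
                    .header("User-Agent", "example-crawler")
                    .GET()
                    .build();

            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            // The real process appends request+response records to a WARC file;
            // here we just dump the body so the sketch stays self-contained.
            Files.write(Path.of("fetched.html"), response.body());
            System.out.println("Status: " + response.statusCode());
        }
    }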

2. Converting Process

The converting-process reads the crawl data produced by the crawling step and processes it, extracting keywords and metadata, and saves the results as parquet files described in converting-process/model.
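
As an illustration of the analysis step, the sketch below uses jsoup, a common Java HTML parser, to pull outgoing links and a naive keyword frequency map out of a document. The real converter does considerably more (term weighting, feature extraction, language handling); everything here besides the jsoup calls is invented for the example.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of the analysis step: parse HTML, pull out links
    // and a naive keyword frequency map.
    public class ConvertSketch {
        public static void main(String[] args) {
            String html = "<html><head><title>Example</title></head>"
                        + "<body><p>Search engines index keywords.</p>"
                        + "<a href='/about'>About</a></body></html>";

            Document doc = Jsoup.parse(html, "https://example.com/");

            // Extract outgoing links, resolved against the base URI
            for (Element link : doc.select("a[href]")) {
                System.out.println("Link: " + link.attr("abs:href"));
            }

            // Naive keyword counting over the visible text
            Map<String, Integer> keywords = new HashMap<>();
            for (String word : doc.text().toLowerCase().split("\\W+")) {
                if (word.length() > 3)
                    keywords.merge(word, 1, Integer::sum);
            }
            System.out.println(keywords);
        }
    }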

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
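
A minimal sketch of the database-loading side, assuming a plain JDBC connection to MariaDB; the table and column names are hypothetical stand-ins for the real schema. The key detail is batching the inserts rather than issuing one round trip per row:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    // Hypothetical sketch of bulk-loading domains via JDBC batching.
    // Requires the MariaDB JDBC driver on the classpath.
    public class LoadSketch {
        record Domain(String name, int nodeAffinity) {}

        public static void main(String[] args) throws Exception {
            List<Domain> domains = List.of(
                    new Domain("example.com", 1),
                    new Domain("example.org", 1));

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mariadb://localhost:3306/search", "user", "pass")) {
                conn.setAutoCommit(false);

                // Illustrative table/columns, not the actual schema
                try (PreparedStatement stmt = conn.prepareStatement(
                        "INSERT INTO DOMAINS (DOMAIN_NAME, NODE_AFFINITY) VALUES (?, ?)")) {
                    for (Domain d : domains) {
                        stmt.setString(1, d.name());
                        stmt.setInt(2, d.nodeAffinity());
                        stmt.addBatch();  // queue rows instead of one round trip per insert
                    }
                    stmt.executeBatch();
                }
                conn.commit();
            }
        }
    }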

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.
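
The essence of index construction is turning journal entries of (document, keyword) pairs into sorted posting lists keyed by keyword. The sketch below shows that core idea in memory; the actual process builds compressed on-disk structures, so treat all names and data here as illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical sketch: build in-memory posting lists from journal-like data.
    public class IndexSketch {
        public static void main(String[] args) {
            // (docId, keyword) pairs as they might appear in an index journal;
            // the values are invented for the example
            long[] docIds     = {1, 2, 2, 3};
            String[] keywords = {"search", "search", "engine", "engine"};

            // Group document ids by keyword into posting lists
            Map<String, List<Long>> postings = new TreeMap<>();
            for (int i = 0; i < keywords.length; i++) {
                postings.computeIfAbsent(keywords[i], k -> new ArrayList<>())
                        .add(docIds[i]);
            }

            // Sorted posting lists allow binary search and fast intersection
            // when answering multi-keyword queries
            postings.values().forEach(Collections::sort);

            System.out.println(postings); // {engine=[2, 3], search=[1, 2]}
        }
    }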

Overview

Schematically, the full crawling-to-indexing pipeline looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Slop   :         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+
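
Read as code, the diagram amounts to four steps where each consumes the previous step's output. The sketch below models that chaining with hypothetical paths and a toy interface; in the real system the steps are separate programs coordinated via a message queue rather than direct method calls.

    import java.nio.file.Path;

    // Hypothetical sketch of the pipeline as sequential steps; all paths
    // and the ProcessStep interface are invented for illustration.
    public class PipelineSketch {
        interface ProcessStep {
            void run(Path input, Path output) throws Exception;
        }

        public static void main(String[] args) throws Exception {
            // Stub steps that just announce themselves; the real steps are whole programs
            ProcessStep crawl     = (in, out) -> System.out.println("crawl " + in + " -> " + out);
            ProcessStep convert   = (in, out) -> System.out.println("convert " + in + " -> " + out);
            ProcessStep load      = (in, out) -> System.out.println("load " + in + " -> " + out);
            ProcessStep construct = (in, out) -> System.out.println("construct " + in + " -> " + out);

            Path crawlData     = Path.of("data/crawl");      // WARC-derived crawl files
            Path processedData = Path.of("data/processed");  // converter output
            Path loadedData    = Path.of("data/loaded");     // journal + link db
            Path index         = Path.of("data/index");      // searchable indices

            crawl.run(Path.of("data/specs"), crawlData);     // CRAWLING STEP
            convert.run(crawlData, processedData);           // CONVERTING STEP
            load.run(processedData, loadedData);             // LOADING STEP
            construct.run(loadedData, index);                // CONSTRUCT INDEX
        }
    }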