MarginaliaSearch/code/processes
Viktor Lofgren d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
..
converting-process (*) Remove the crawl spec abstraction 2024-10-03 13:41:17 +02:00
crawling-process (*) Remove the crawl spec abstraction 2024-10-03 13:41:17 +02:00
index-constructor-process (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
loading-process (loader) Fix it so that the loader doesn't explode if it sees an invalid URL 2024-09-12 11:36:00 +02:00
process-mq-api (*) Remove the crawl spec abstraction 2024-10-03 13:41:17 +02:00
test-data (build) Java 22 and its consequences has been a disaster for Marginalia Search 2024-04-24 13:54:04 +02:00
website-adjacencies-calculator (build) Fix dependency churn from testcontainers 2024-08-25 10:35:48 +02:00
readme.md (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00

Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saving them as WARC files, and then re-converts them into parquet models. Both are described in crawling-process/model.

2. Converting Process

The converting-process reads crawl data from the crawling step and processes them, extracting keywords and metadata and saves them as parquet files described in converting-process/model.

3. Loading Process

The loading-process reads the processed data.

It has creates an index journal, a link database, and loads domains and domain-links into the MariaDB database.

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.

Overview

Schematically the crawling and loading process looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords 
    +------------+  features, links, URLs
          |
    //==================\\
    || Slop   :         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+