mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Viktor Lofgren fb673de370 (crawler) Change the header 'User-agent' to 'User-Agent'		2025-01-28 15:34:16 +01:00
converting-process	(converter) Add progress tracking for big domains in converter	2025-01-26 18:03:59 +01:00
crawling-process	(crawler) Change the header 'User-agent' to 'User-Agent'	2025-01-28 15:34:16 +01:00
export-task-process	(converter) Refactor to remove CrawledDomainReader and move its functionality into SerializableCrawlDataStream	2025-01-26 14:46:50 +01:00
index-constructor-process	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents	2024-11-21 16:00:09 +01:00
live-crawling-process	(converter) Reduce lock contention in converter by separating the processing of full and simple-track domains	2025-01-26 13:21:46 +01:00
loading-process	(loader) Correct DocumentLoaderService to properly do bulk inserts	2024-12-08 13:12:52 +01:00
process-mq-api	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents	2024-11-21 16:00:09 +01:00
test-data	(build) Java 22 and its consequences has been a disaster for Marginalia Search	2024-04-24 13:54:04 +02:00
readme.md	(doc) Fix outdated links in documentation	2024-09-22 13:56:17 +02:00

readme.md

Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saves them as WARC files, and then converts them into parquet models. Both formats are described in crawling-process/model.

2. Converting Process

The converting-process reads the crawl data produced by the crawling step and processes it, extracting keywords and metadata, and saves the results as parquet files described in converting-process/model.

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain links into the MariaDB database.

4. Index Construction Process

The index-constructor-process constructs indices from the data generated by the loader.

Overview

Schematically, the crawling and loading process looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Slop   :         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into MariaDB
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+
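
To make the data flow in the diagram concrete, here is a minimal in-memory sketch of the four steps chained together. It is an illustration only and does not use the project's code: the class `PipelineSketch`, the records `CrawledDocument`, `ProcessedDocument` and `JournalEntry`, and all method names are hypothetical. Plain Java collections stand in for the WARC, parquet and Slop files, the crawler returns canned HTML instead of fetching anything, and a map stands in for the index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PipelineSketch {
    // Hypothetical stand-ins for the records passed between steps.
    record CrawledDocument(String url, int status, String html) {}
    record ProcessedDocument(String url, List<String> keywords) {}
    record JournalEntry(String keyword, String url) {}

    /** Crawling step: one record per URL. No network access; the HTML is canned. */
    static List<CrawledDocument> crawl(List<String> urls) {
        List<CrawledDocument> out = new ArrayList<>();
        for (String url : urls) {
            out.add(new CrawledDocument(url, 200, "<html><body>hello search engine</body></html>"));
        }
        return out;
    }

    /** Converting step: strip tags and treat each remaining word as a keyword. */
    static List<ProcessedDocument> convert(List<CrawledDocument> crawled) {
        List<ProcessedDocument> out = new ArrayList<>();
        for (CrawledDocument doc : crawled) {
            if (doc.status() != 200) continue; // skip failed fetches
            String text = doc.html().replaceAll("<[^>]*>", " ").trim().toLowerCase();
            List<String> keywords = text.isEmpty() ? List.of() : List.of(text.split("\\s+"));
            out.add(new ProcessedDocument(doc.url(), keywords));
        }
        return out;
    }

    /** Loading step: flatten the processed documents into a journal of (keyword, url) entries. */
    static List<JournalEntry> load(List<ProcessedDocument> processed) {
        List<JournalEntry> journal = new ArrayList<>();
        for (ProcessedDocument doc : processed) {
            for (String keyword : doc.keywords()) {
                journal.add(new JournalEntry(keyword, doc.url()));
            }
        }
        return journal;
    }

    /** Index construction step: group journal entries by keyword so they can be looked up. */
    static Map<String, List<String>> constructIndex(List<JournalEntry> journal) {
        Map<String, List<String>> index = new HashMap<>();
        for (JournalEntry entry : journal) {
            index.computeIfAbsent(entry.keyword(), k -> new ArrayList<>()).add(entry.url());
        }
        return index;
    }

    /** Runs the four steps in the order shown in the diagram. */
    public static Map<String, List<String>> run(List<String> urls) {
        return constructIndex(load(convert(crawl(urls))));
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = run(List.of("https://a.example/", "https://b.example/"));
        System.out.println(index.get("search"));
    }
}
```

Each step consumes only the output of the previous one, which mirrors how the real processes hand data to each other through files on disk rather than through direct calls.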