Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saving them as WARC files, which are then converted into parquet models. Both formats are described in crawling-model.

The operation is optionally defined by a crawl specification, which can be created in the control GUI.
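
As a rough illustration of what this step does conceptually, the Java sketch below fetches a single URL and writes the raw response to disk. The class name, URL, user agent, and output path are made up for the example; the real crawler records responses as WARC files and then converts them to parquet, as described above.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Illustrative sketch only: fetch one URL and persist the raw body to disk.
    // The actual crawling-process writes WARC records, not loose files.
    class FetchSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.example.com/"))
                    .header("User-Agent", "example-crawler")
                    .build();

            HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            Path out = Path.of("crawl-output.bin"); // stand-in for a WARC file
            Files.write(out, response.body());

            System.out.println("Fetched " + response.uri() + " with status " + response.statusCode());
        }
    }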

2. Converting Process

The converting-process reads the crawl data from the crawling step, extracts keywords and metadata from each document, and saves the results as parquet files, as described in processed-data.
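
To make the extraction step concrete, here is a simplified Java sketch using the jsoup HTML parser to pull a title, outgoing links, and a crude token list out of a document. The record and method names are invented for the example, and the real process performs far richer keyword and feature extraction before writing parquet.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.util.List;
    import java.util.stream.Collectors;

    // Illustrative sketch only: parse HTML and extract a title, outgoing links,
    // and a naive token list. The real converting-process does much more.
    class ConvertSketch {
        record ProcessedDocument(String title, List<String> links, List<String> keywords) {}

        static ProcessedDocument process(String url, String html) {
            Document doc = Jsoup.parse(html, url);

            List<String> links = doc.select("a[href]").stream()
                    .map(a -> a.attr("abs:href"))
                    .collect(Collectors.toList());

            // Naive tokenization as a stand-in for real keyword extraction
            List<String> keywords = List.of(doc.body().text().toLowerCase().split("\\W+"));

            return new ProcessedDocument(doc.title(), links, keywords);
        }
    }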

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
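
A minimal JDBC sketch of the domain-loading part is shown below. The connection string, credentials, and table and column names are hypothetical, and the index journal and link database writing are omitted entirely.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    // Illustrative sketch only: batch-insert domain names into MariaDB.
    // Table and column names are hypothetical, not the actual schema.
    class LoadSketch {
        static void loadDomains(List<String> domainNames) throws Exception {
            String url = "jdbc:mariadb://localhost:3306/search"; // assumed connection string
            try (Connection conn = DriverManager.getConnection(url, "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                         "INSERT IGNORE INTO DOMAINS (DOMAIN_NAME) VALUES (?)")) {

                for (String domain : domainNames) {
                    stmt.setString(1, domain);
                    stmt.addBatch();
                }
                stmt.executeBatch();
            }
        }
    }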

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.
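
The following sketch only illustrates the general shape of that transformation, building an in-memory inverted index from (document id, keywords) records; the actual process produces compressed on-disk structures rather than a map.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeSet;

    // Illustrative sketch only: turn (document id -> keywords) records into a
    // keyword -> sorted set of document ids mapping. The real index is an
    // on-disk, compressed structure rather than an in-memory map.
    class IndexSketch {
        static Map<String, TreeSet<Long>> buildInvertedIndex(Map<Long, List<String>> docKeywords) {
            Map<String, TreeSet<Long>> index = new HashMap<>();

            docKeywords.forEach((docId, keywords) -> {
                for (String keyword : keywords) {
                    index.computeIfAbsent(keyword, k -> new TreeSet<>()).add(docId);
                }
            });

            return index;
        }
    }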

Overview

Schematically the crawling and loading process looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords, 
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+