Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saving them as WARC files, and then converts them into parquet models. Both formats are described in crawling-process/model.
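
As a rough illustration of this step (not the actual crawler code), the sketch below fetches a single URL with the JDK's built-in HttpClient and appends the response body to a local file, which stands in for the real WARC output. All names here are illustrative.

    // Simplified stand-in for the crawl step: fetch one URL, persist the body.
    // The real process writes WARC records and later converts them to parquet.
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    public class CrawlSketch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest request = HttpRequest.newBuilder(URI.create("https://www.marginalia.nu/"))
                    .header("User-Agent", "example-crawler")
                    .build();

            HttpResponse<byte[]> response =
                    client.send(request, HttpResponse.BodyHandlers.ofByteArray());

            // Persist status and body; a placeholder for writing a WARC record
            Path out = Path.of("crawl-data.bin");
            Files.write(out, response.body(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            System.out.println(response.uri() + " -> " + response.statusCode());
        }
    }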

2. Converting Process

The converting-process reads crawl data from the crawling step and processes it, extracting keywords and metadata, and saves the results as parquet files described in converting-process/model.
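
A toy sketch of the keyword-extraction part of this step, assuming jsoup for HTML parsing: parse a document and rank terms by naive frequency. The real converter is considerably more sophisticated; this only shows the general shape of the work.

    // Illustrative only: parse crawled HTML and derive keywords by term
    // frequency. The actual converter also extracts metadata, features,
    // and links, and writes its output to processed files.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    import java.util.HashMap;
    import java.util.Map;

    public class ConvertSketch {
        public static void main(String[] args) {
            String html = "<html><head><title>Example</title></head>"
                        + "<body><p>search engines index the web</p></body></html>";

            Document doc = Jsoup.parse(html);
            Map<String, Integer> frequencies = new HashMap<>();

            // Tokenize the visible text and count term occurrences
            for (String token : doc.body().text().toLowerCase().split("\\W+")) {
                if (token.length() > 2)
                    frequencies.merge(token, 1, Integer::sum);
            }

            frequencies.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(10)
                    .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
        }
    }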

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
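
A hedged sketch of the MariaDB write, using plain JDBC batch inserts. The table and column names (EC_DOMAIN, DOMAIN_NAME) and the connection details are assumptions for illustration, not a statement of the actual schema.

    // Illustrative loading-step database write: batch-insert domain names
    // into MariaDB over JDBC. Requires a MariaDB JDBC driver on the classpath.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class LoadSketch {
        public static void main(String[] args) throws Exception {
            List<String> domains = List.of("example.com", "marginalia.nu");

            try (Connection conn = DriverManager.getConnection(
                    "jdbc:mariadb://localhost:3306/search", "user", "password");
                 PreparedStatement stmt = conn.prepareStatement(
                    // Hypothetical table/column names
                    "INSERT IGNORE INTO EC_DOMAIN (DOMAIN_NAME) VALUES (?)")) {

                for (String domain : domains) {
                    stmt.setString(1, domain);
                    stmt.addBatch();
                }
                stmt.executeBatch(); // one round-trip for the whole batch
            }
        }
    }

Batching keeps the number of database round-trips low, which matters when loading millions of rows.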

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.
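
For intuition, a minimal in-memory inverted index shows the shape of the data this step produces: a mapping from each keyword to the documents that contain it. The real process builds compressed on-disk structures from the index journal, so this is purely illustrative.

    // Minimal inverted-index sketch: keyword -> list of document ids.
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class IndexSketch {
        private final Map<String, List<Long>> postings = new HashMap<>();

        void add(long documentId, List<String> keywords) {
            for (String keyword : keywords) {
                postings.computeIfAbsent(keyword, k -> new ArrayList<>())
                        .add(documentId);
            }
        }

        List<Long> lookup(String keyword) {
            return postings.getOrDefault(keyword, List.of());
        }

        public static void main(String[] args) {
            IndexSketch index = new IndexSketch();
            index.add(1L, List.of("search", "engine"));
            index.add(2L, List.of("search", "index"));
            System.out.println(index.lookup("search")); // [1, 2]
        }
    }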

Overview

Schematically, the crawling and loading process looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Slop   :         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+