mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00


Processes

1. Crawl Process

The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.

The operation is specified by a crawl job specification. This is generated by tools/crawl-job-extractor based on the content in the database.
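
As a rough illustration of a crawl job specification, the sketch below round-trips records shaped like the `ID, Domain, Urls[]` rows in the overview diagram. The field names, the one-object-per-line layout, and the use of gzip are assumptions made for this example; this README only says the files are compressed JSON and does not state the codec or the exact schema.

```python
import gzip
import json
from dataclasses import asdict, dataclass
from typing import Iterator, List


@dataclass
class CrawlSpecRecord:
    # Hypothetical field names, modelled on "ID, Domain, Urls[]".
    id: str
    domain: str
    urls: List[str]


def encode_specs(records: List[CrawlSpecRecord]) -> bytes:
    """Serialize records as one JSON object per line, then gzip the result.

    gzip is an assumption; the real codec is not named in this README.
    """
    lines = "".join(json.dumps(asdict(r)) + "\n" for r in records)
    return gzip.compress(lines.encode("utf-8"))


def decode_specs(data: bytes) -> Iterator[CrawlSpecRecord]:
    """Yield records one at a time from a compressed specification blob."""
    text = gzip.decompress(data).decode("utf-8")
    for line in text.splitlines():
        if line.strip():
            yield CrawlSpecRecord(**json.loads(line))
```

Keeping one record per line means a reader can process a specification domain by domain without holding the whole list in memory.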

2. Converting Process

The converting-process reads the crawl data produced by the crawling step and processes it, extracting keywords and metadata, and saves the results as compressed JSON models described in converting-model.
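
To make the conversion step more concrete, here is a minimal sketch that pulls a title, outgoing links, and a few keywords out of one HTML document. It uses only the Python standard library and a naive word-frequency ranking; the function name `convert_document`, the output field names, and the ranking rule are all hypothetical and are not taken from the actual converter.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class _PageParser(HTMLParser):
    """Collects title text, body text, and link targets from one page."""

    def __init__(self) -> None:
        super().__init__()
        self.title_parts: List[str] = []
        self.text_parts: List[str] = []
        self.links: List[str] = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script and style contents
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def convert_document(url: str, html: str, max_keywords: int = 10) -> Dict[str, object]:
    """Return a processed record for one crawled page (illustrative only)."""
    parser = _PageParser()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z0-9]+", " ".join(parser.text_parts).lower())
    counts = Counter(w for w in words if len(w) > 2)  # drop very short tokens
    return {
        "url": url,
        "title": "".join(parser.title_parts).strip(),
        "links": [urljoin(url, href) for href in parser.links],  # resolve relative links
        "keywords": [w for w, _ in counts.most_common(max_keywords)],
    }
```

A real converter would likely do considerably more than count words; the sketch is only meant to show the shape of the input and output.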

3. Loading Process

The loading-process reads the processed data, creates an index journal and lexicon, and loads domains and addresses into the MariaDB database.
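
The following sketch shows one way the loading step could be structured: a lexicon that maps each keyword to an integer ID, an append-only journal of (URL ID, word ID) pairs, and SQL tables for domains and URLs. It is written under several assumptions: `sqlite3` stands in for MariaDB so the example runs without a server, the table and column names are invented, and the in-memory lexicon and journal only approximate whatever on-disk formats the real loader uses.

```python
import sqlite3
from typing import Dict, Iterable, List, Tuple


class Loader:
    """Illustrative loader; schema and data structures are hypothetical."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        self.lexicon: Dict[str, int] = {}          # keyword -> word ID
        self.journal: List[Tuple[int, int]] = []   # (url ID, word ID), append-only
        conn.execute(
            "CREATE TABLE IF NOT EXISTS domain ("
            "id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
        )
        conn.execute(
            "CREATE TABLE IF NOT EXISTS url ("
            "id INTEGER PRIMARY KEY, "
            "domain_id INTEGER REFERENCES domain(id), "
            "url TEXT UNIQUE)"
        )

    def _word_id(self, word: str) -> int:
        # Assign the next free ID the first time a word is seen.
        return self.lexicon.setdefault(word, len(self.lexicon))

    def load_document(self, domain: str, url: str, keywords: Iterable[str]) -> int:
        """Insert the domain and URL if missing, then journal the keywords."""
        self.conn.execute("INSERT OR IGNORE INTO domain (name) VALUES (?)", (domain,))
        (domain_id,) = self.conn.execute(
            "SELECT id FROM domain WHERE name = ?", (domain,)
        ).fetchone()
        self.conn.execute(
            "INSERT OR IGNORE INTO url (domain_id, url) VALUES (?, ?)",
            (domain_id, url),
        )
        (url_id,) = self.conn.execute(
            "SELECT id FROM url WHERE url = ?", (url,)
        ).fetchone()
        for word in keywords:
            self.journal.append((url_id, self._word_id(word)))
        return url_id
```

Usage might look like `Loader(sqlite3.connect(":memory:")).load_document("example.com", "https://example.com/", ["search"])`. Inserting with `INSERT OR IGNORE` keeps repeated loads from duplicating domains or URLs.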

Overview

Schematically, the crawling, converting, and loading processes look like this:

    //====================\\
    || Compressed JSON:   ||  Specifications
    || ID, Domain, Urls[] ||  File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Compressed JSON: ||  Processed
    ||  URLs[]          ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    ||  Keywords[]      ||
    ||    ...           ||
    ||  URLs[]          ||
    ||  Domains[]       ||
    ||  Links[]         ||    
    ||  Keywords[]      ||
    ||    ...           ||
    \\==================//
          |
    +------------+
    |  LOADING   | Insert URLs in DB
    |    STEP    | Insert keywords in Index
    +------------+
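
The flow in the diagram can also be read as three stages chained together, each consuming the previous stage's output. The sketch below models that with Python generators. It is a simplification under stated assumptions: the real processes exchange compressed files on disk rather than in-memory iterators, the `fetch` callable is a stand-in for an HTTP client, and the whitespace-based keyword extraction is a placeholder for the actual analysis.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def crawl(
    specs: Iterable[dict], fetch: Callable[[str], Tuple[int, str]]
) -> Iterator[dict]:
    """Crawling step: fetch each URL in each specification record."""
    for spec in specs:
        for url in spec["urls"]:
            status, html = fetch(url)
            yield {"domain": spec["domain"], "url": url, "status": status, "html": html}


def convert(crawled: Iterable[dict]) -> Iterator[dict]:
    """Converting step: skip failed fetches, extract placeholder keywords."""
    for page in crawled:
        if page["status"] != 200:
            continue
        keywords = sorted(set(page["html"].lower().split()))
        yield {"domain": page["domain"], "url": page["url"], "keywords": keywords}


def load(processed: Iterable[dict]) -> Dict[str, List[str]]:
    """Loading step: build a tiny keyword -> URLs index in memory."""
    index: Dict[str, List[str]] = {}
    for doc in processed:
        for keyword in doc["keywords"]:
            index.setdefault(keyword, []).append(doc["url"])
    return index
```

With a dictionary of canned responses, the whole pipeline can be run as `load(convert(crawl(specs, pages.__getitem__)))`. Because each stage is a generator, documents stream through one at a time rather than being collected into a list between steps.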