mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

History

Viktor Lofgren f655ec5a5c (*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.		2023-12-10 17:30:43 +01:00
..
converting-process	(*) Refactor GeoIP-related code	2023-12-10 17:30:43 +01:00
crawling-process	(*) Refactor GeoIP-related code	2023-12-10 17:30:43 +01:00
index-constructor-process	(process) Ensure construction exceptions are logged	2023-11-22 18:32:06 +01:00
loading-process	(minor) Fix broken test	2023-12-10 14:39:20 +01:00
test-data	(convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary.	2023-11-30 20:04:46 +01:00
website-adjacencies-calculator	(executor-service) Embed dist/ in executor-service's docker image	2023-10-19 17:48:34 +02:00
readme.md	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00

readme.md

Processes

1. Crawl Process

The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.

The operation is specified by a crawl specification, which can be created in the control GUI.

2. Converting Process

The converting-process reads crawl data from the crawling step and processes them, extracting keywords and metadata and saves them as parquet files described in processed-data.

3. Loading Process

The loading-process reads the processed data.

It has creates an index journal, a link database, and loads domains and domain-links into the MariaDB database.

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.

Overview

Schematically the crawling and loading process looks like this:

    //====================\\
    || Compressed JSON:   ||  Specifications
    || ID, Domain, Urls[] ||  File
    || ID, Domain, Urls[] ||
    || ID, Domain, Urls[] ||
    ||      ...           ||
    \\====================//
          |
    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Compressed JSON:      || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords 
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+