MarginaliaSearch/code/processes
Viktor Lofgren 22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
..
converting-process (sentence-extractor) Add tag information to document language data 2024-07-18 15:57:48 +02:00
crawling-process (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
index-constructor-process (keyword-extractor) Retire TfIdfHigh WordFlag 2024-07-17 13:54:39 +02:00
loading-process (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
test-data (build) Java 22 and its consequences has been a disaster for Marginalia Search 2024-04-24 13:54:04 +02:00
website-adjacencies-calculator (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
readme.md Clean up documentation and rename domain-links to link-graph 2024-02-28 11:40:39 +01:00

Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saving them as WARC files, and then re-converts them into parquet models. Both are described in crawling-model.

The operation is optionally defined by a crawl specification, which can be created in the control GUI.

2. Converting Process

The converting-process reads crawl data from the crawling step and processes them, extracting keywords and metadata and saves them as parquet files described in processed-data.

3. Loading Process

The loading-process reads the processed data.

It has creates an index journal, a link database, and loads domains and domain-links into the MariaDB database.

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.

Overview

Schematically the crawling and loading process looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords 
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+