MarginaliaSearch/code/processes
Viktor Lofgren 66c1281301 (zk-registry) epic jak shaving WIP
Cleaning out a lot of old junk from the code, and one thing led to another...

* The build is improved, now constructing Docker images with 'jib'.  A clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter.  It will now just spawn a Java process instead of relying on the application plugin's generated outputs.
* The project is migrated to GraalVM.
* gRPC clients are rewritten in a neat fluent/functional style, e.g.
```
channelPool.call(grpcStub::method)
              .async(executor) // <-- optional
              .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall (a rough sketch of such a wrapper follows the list below).
* For now, the project is all in on ZooKeeper.
* Service discovery is now based on APIs rather than services.  In theory, this means we could ship the same code as either a monolith or a service mesh.
* To this end, work has begun on modularizing a few of the APIs so that they aren't strongly "living" in a particular service.  Still WIP!
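
A minimal sketch of how such a fluent wrapper might be put together is shown below. Only the call()/async()/run() shape is taken from the example above; the class names, the error handling, and the choice to always return a CompletableFuture are assumptions for illustration, not the actual implementation.

```
import io.grpc.StatusRuntimeException;

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.Function;

class ChannelPool {

    /** Entry point for the fluent style: channelPool.call(grpcStub::method) */
    public <ARG, RET> FluentCall<ARG, RET> call(Function<ARG, RET> method) {
        return new FluentCall<ARG, RET>(method);
    }

    public class FluentCall<ARG, RET> {
        private final Function<ARG, RET> method;
        private Executor executor; // null means run the call synchronously

        private FluentCall(Function<ARG, RET> method) {
            this.method = method;
        }

        /** Optional: dispatch the call on the given executor. */
        public FluentCall<ARG, RET> async(Executor executor) {
            this.executor = executor;
            return this;
        }

        /** Execute the call; channel errors funnel through a single place. */
        public CompletableFuture<RET> run(ARG argument) {
            if (executor != null) {
                return CompletableFuture.supplyAsync(() -> invoke(argument), executor);
            }
            return CompletableFuture.completedFuture(invoke(argument));
        }

        private RET invoke(ARG argument) {
            try {
                return method.apply(argument);
            }
            catch (StatusRuntimeException ex) {
                // This is where ManagedChannel failures can be observed,
                // e.g. marking the channel unhealthy or retrying elsewhere.
                throw ex;
            }
        }
    }
}
```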

Still missing are documentation and testing, and some further breaking apart of the code.

Processes

1. Crawl Process

The crawling-process fetches website contents, temporarily saves them as WARC files, and then converts them into parquet models. Both formats are described in crawling-model.

The operation is optionally defined by a crawl specification, which can be created in the control GUI.
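
As a rough sketch of what the crawl step does, the snippet below fetches a single URL with the JDK's HttpClient and persists the response body to a file. This is only an illustration: the real process maintains a crawl frontier and writes proper WARC records that are later turned into parquet files, all of which is elided here.

```
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

class CrawlSketch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        // A single URL stands in for the crawl frontier here.
        HttpRequest request = HttpRequest.newBuilder(URI.create("https://example.com/"))
                .header("User-Agent", "example-crawler")
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // The real crawler appends a WARC record (and later parquet rows);
        // writing the raw body to a temp file is just a stand-in for that.
        Path out = Files.createTempFile("crawl-", ".html");
        Files.writeString(out, response.body());

        System.out.println(response.statusCode() + " -> " + out);
    }
}
```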

2. Converting Process

The converting-process reads the crawl data from the crawling step and processes it, extracting keywords and metadata, and saves the results as parquet files described in processed-data.
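
As an illustration of the kind of work this step does, the sketch below parses an HTML document and pulls out its title, links, and naive keyword counts. The use of jsoup for parsing is an assumption here, and the actual converter's keyword extraction, feature detection, and parquet output are considerably more involved.

```
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ConvertSketch {
    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head>"
                    + "<body><p>Birds migrate south. Migration is seasonal.</p>"
                    + "<a href='/about'>About</a></body></html>";

        Document doc = Jsoup.parse(html, "https://example.com/");

        String title = doc.title();
        List<String> links = doc.select("a[href]").stream()
                .map(e -> e.attr("abs:href"))
                .toList();

        // Naive keyword counting; the real converter uses proper
        // term extraction and metadata features.
        Map<String, Integer> keywords = new HashMap<>();
        for (String word : doc.body().text().toLowerCase().split("\\W+")) {
            if (word.length() > 3) {
                keywords.merge(word, 1, Integer::sum);
            }
        }

        System.out.println(title + " " + links + " " + keywords);
    }
}
```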

3. Loading Process

The loading-process reads the processed data.

It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
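
The sketch below illustrates only the domain-loading part, using plain JDBC batch inserts against MariaDB. The connection details and the DOMAIN/DOMAIN_NAME table and column names are placeholders rather than the project's actual schema, and the index journal and link database writes are omitted.

```
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

class LoadSketch {
    public static void main(String[] args) throws Exception {
        List<String> domains = List.of("example.com", "example.org");

        // Connection details and schema are placeholders.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mariadb://localhost:3306/search", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT IGNORE INTO DOMAIN (DOMAIN_NAME) VALUES (?)")) {

            for (String domain : domains) {
                stmt.setString(1, domain);
                stmt.addBatch();
            }
            // Batched inserts keep round trips down when loading many domains.
            stmt.executeBatch();
        }
    }
}
```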

4. Index Construction Process

The index-construction-process constructs indices from the data generated by the loader.
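
Conceptually, this means inverting the loader's document-to-keyword data into keyword-to-document posting lists. The toy sketch below shows that core idea in memory; the real process builds compressed on-disk index structures.

```
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class IndexSketch {
    public static void main(String[] args) {
        // Document id -> keywords, roughly as the loader might have recorded them.
        Map<Long, List<String>> documents = Map.of(
                1L, List.of("bird", "migration"),
                2L, List.of("bird", "seasonal"));

        // Invert: keyword -> sorted list of document ids.
        Map<String, List<Long>> inverted = new HashMap<>();
        for (var entry : documents.entrySet()) {
            for (String keyword : entry.getValue()) {
                inverted.computeIfAbsent(keyword, k -> new ArrayList<>())
                        .add(entry.getKey());
            }
        }
        inverted.values().forEach(Collections::sort);

        // A query then amounts to intersecting posting lists.
        System.out.println(inverted.get("bird")); // [1, 2]
    }
}
```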

Overview

Schematically the crawling and loading process looks like this:

    +-----------+  
    |  CRAWLING |  Fetch each URL and 
    |    STEP   |  output to file
    +-----------+
          |
    //========================\\
    ||  Parquet:              || Crawl
    ||  Status, HTML[], ...   || Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||     ...                ||
    \\========================//
          |
    +------------+
    | CONVERTING |  Analyze HTML and 
    |    STEP    |  extract keywords,
    +------------+  features, links, URLs
          |
    //==================\\
    || Parquet:         ||  Processed
    ||  Documents[]     ||  Files
    ||  Domains[]       ||
    ||  Links[]         ||  
    \\==================//
          |
    +------------+ Insert domains into mariadb
    |  LOADING   | Insert URLs, titles in link DB
    |    STEP    | Insert keywords in Index
    +------------+    
          |
    +------------+
    | CONSTRUCT  | Make the data searchable
    |   INDEX    | 
    +------------+