mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-23 21:18:58 +00:00
![]() Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now. |
||
---|---|---|
.. | ||
converting-process | ||
crawling-process | ||
export-task-process | ||
index-constructor-process | ||
live-crawling-process | ||
loading-process | ||
process-mq-api | ||
test-data | ||
readme.md |
Processes
1. Crawl Process
The crawling-process fetches website contents, temporarily saving them as WARC files, and then re-converts them into parquet models. Both are described in crawling-process/model.
2. Converting Process
The converting-process reads crawl data from the crawling step and processes them, extracting keywords and metadata and saves them as parquet files described in converting-process/model.
3. Loading Process
The loading-process reads the processed data.
It has creates an index journal, a link database, and loads domains and domain-links into the MariaDB database.
4. Index Construction Process
The index-construction-process constructs indices from the data generated by the loader.
Overview
Schematically the crawling and loading process looks like this:
+-----------+
| CRAWLING | Fetch each URL and
| STEP | output to file
+-----------+
|
//========================\\
|| Parquet: || Crawl
|| Status, HTML[], ... || Files
|| Status, HTML[], ... ||
|| Status, HTML[], ... ||
|| ... ||
\\========================//
|
+------------+
| CONVERTING | Analyze HTML and
| STEP | extract keywords
+------------+ features, links, URLs
|
//==================\\
|| Slop : || Processed
|| Documents[] || Files
|| Domains[] ||
|| Links[] ||
\\==================//
|
+------------+ Insert domains into mariadb
| LOADING | Insert URLs, titles in link DB
| STEP | Insert keywords in Index
+------------+
|
+------------+
| CONSTRUCT | Make the data searchable
| INDEX |
+------------+