mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-23 13:09:00 +00:00
Processes
1. Crawl Process
The crawling-process fetches website contents, temporarily saves them as WARC files, and then converts them into parquet models. Both formats are described in crawling-process/model.
2. Converting Process
The converting-process reads the crawl data produced by the crawling step and processes it, extracting keywords and metadata, and saves the results as parquet files described in converting-process/model.
3. Loading Process
The loading-process reads the processed data.
It creates an index journal and a link database, and loads domains and domain links into the MariaDB database.
4. Index Construction Process
The index-constructor-process constructs indices from the data generated by the loader.
Overview
Schematically, the four processes fit together like this:
           +------------+
           |  CRAWLING  |  Fetch each URL and
           |    STEP    |  output to file
           +------------+
                 |
    //========================\\
    ||  Parquet:              ||  Crawl
    ||  Status, HTML[], ...   ||  Files
    ||  Status, HTML[], ...   ||
    ||  Status, HTML[], ...   ||
    ||  ...                   ||
    \\========================//
                 |
           +------------+
           | CONVERTING |  Analyze HTML and
           |    STEP    |  extract keywords
           +------------+  features, links, URLs
                 |
       //==================\\
       ||  Slop:           ||  Processed
       ||  Documents[]     ||  Files
       ||  Domains[]       ||
       ||  Links[]         ||
       \\==================//
                 |
           +------------+  Insert domains into mariadb
           |  LOADING   |  Insert URLs, titles in link DB
           |    STEP    |  Insert keywords in Index
           +------------+
                 |
           +------------+
           | CONSTRUCT  |  Make the data searchable
           |   INDEX    |
           +------------+
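
As a rough illustration of the flow in the diagram, the following Java sketch models the four stages as an enum, each with the module that implements it and what it reads and writes. The type and method names (`PipelineSketch`, `Stage`, `next`, `describe`) are hypothetical and do not come from the Marginalia codebase; only the module names and the inputs and outputs are taken from this readme.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types are not part of the Marginalia codebase.
class PipelineSketch {

    /** The four processes from this readme, declared in the order they run. */
    enum Stage {
        CRAWLING("crawling-process", "URLs to fetch", "crawl files (Parquet)"),
        CONVERTING("converting-process", "crawl files (Parquet)", "processed files (Slop)"),
        LOADING("loading-process", "processed files (Slop)",
                "index journal, link database, MariaDB domain data"),
        INDEX_CONSTRUCTION("index-constructor-process", "data generated by the loader",
                "searchable index");

        final String module;
        final String input;
        final String output;

        Stage(String module, String input, String output) {
            this.module = module;
            this.input = input;
            this.output = output;
        }

        /** Returns the stage that runs after this one, or null for the last stage. */
        Stage next() {
            int i = ordinal() + 1;
            return i < values().length ? values()[i] : null;
        }
    }

    /** Builds one summary line per stage, in execution order. */
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (Stage s : Stage.values()) {
            lines.add(String.format("%d. %s: %s -> %s",
                    s.ordinal() + 1, s.module, s.input, s.output));
        }
        return lines;
    }

    public static void main(String[] args) {
        describe().forEach(System.out::println);
    }
}
```

Running `main` prints one summary line per stage, and `Stage.CRAWLING.next()` yields `Stage.CONVERTING`. Declaring the stages in execution order keeps the hand-off between processes explicit without any extra bookkeeping.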