mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-23 21:18:58 +00:00
![]() The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt. |
||
---|---|---|
.. | ||
converting-process | ||
crawling-process | ||
index-constructor-process | ||
loading-process | ||
test-data | ||
website-adjacencies-calculator | ||
readme.md |
Processes
1. Crawl Process
The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.
The operation is specified by a crawl specification, which can be created in the control GUI.
2. Converting Process
The converting-process reads crawl data from the crawling step and processes them, extracting keywords and metadata and saves them as parquet files described in processed-data.
3. Loading Process
The loading-process reads the processed data.
It has creates an index journal, a link database, and loads domains and domain-links into the MariaDB database.
4. Index Construction Process
The index-construction-process constructs indices from the data generated by the loader.
Overview
Schematically the crawling and loading process looks like this:
//====================\\
|| Compressed JSON: || Specifications
|| ID, Domain, Urls[] || File
|| ID, Domain, Urls[] ||
|| ID, Domain, Urls[] ||
|| ... ||
\\====================//
|
+-----------+
| CRAWLING | Fetch each URL and
| STEP | output to file
+-----------+
|
//========================\\
|| Compressed JSON: || Crawl
|| Status, HTML[], ... || Files
|| Status, HTML[], ... ||
|| Status, HTML[], ... ||
|| ... ||
\\========================//
|
+------------+
| CONVERTING | Analyze HTML and
| STEP | extract keywords
+------------+ features, links, URLs
|
//==================\\
|| Parquet: || Processed
|| Documents[] || Files
|| Domains[] ||
|| Links[] ||
\\==================//
|
+------------+ Insert domains into mariadb
| LOADING | Insert URLs, titles in link DB
| STEP | Insert keywords in Index
+------------+
|
+------------+
| CONSTRUCT | Make the data searchable
| INDEX |
+------------+