Processes
1. Crawl Process
The crawling-process fetches website contents and saves them as compressed JSON models described in crawling-model.
The operation is specified by a crawl specification, which can be created in the control GUI.
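As a rough illustration of the data shape, the sketch below writes crawl specification records as line-delimited, gzip-compressed JSON. The record fields (id, domain, urls) mirror the schematic in the overview below, but the class names are hypothetical and not the actual crawling-model types.

```java
import com.google.gson.Gson;

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Hypothetical, simplified record; the real crawling-model classes differ.
record CrawlSpecRecord(String id, String domain, List<String> urls) {}

class CrawlSpecWriter {
    private static final Gson gson = new Gson();

    // Write one JSON object per line into a gzip-compressed file.
    static void write(Path out, List<CrawlSpecRecord> records) throws IOException {
        try (var writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(out)), StandardCharsets.UTF_8))) {
            for (var record : records) {
                writer.write(gson.toJson(record));
                writer.newLine();
            }
        }
    }
}
```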
2. Converting Process
The converting-process reads crawl data from the crawling step and processes it, extracting keywords and metadata, and saves the results as parquet files described in processed-data.
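To give a flavor of what this step does, here is a heavily simplified keyword-extraction sketch using jsoup. The actual converter does far more (language identification, term weighting, feature and link extraction), and the class and method names here are made up for illustration.

```java
import org.jsoup.Jsoup;

import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.stream.Collectors;

class KeywordExtractorSketch {
    // Return the n most frequent terms in the document's visible text.
    static List<String> topKeywords(String html, int n) {
        String text = Jsoup.parse(html).text().toLowerCase(Locale.ROOT);

        Map<String, Long> counts = Arrays.stream(text.split("\\W+"))
                .filter(w -> w.length() > 2)   // drop very short tokens
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .toList();
    }
}
```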
3. Loading Process
The loading-process reads the processed data.
It creates an index journal and a link database, and loads domains and domain-links into the MariaDB database.
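A minimal sketch of the domain-loading part using plain JDBC follows; the table and column names (DOMAINS, DOMAIN_NAME) are placeholders, not the actual MariaDB schema.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class DomainLoaderSketch {
    // Batch-insert domain names; INSERT IGNORE skips rows that already exist.
    static void loadDomains(Connection conn, List<String> domains) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT IGNORE INTO DOMAINS (DOMAIN_NAME) VALUES (?)")) {
            for (String domain : domains) {
                ps.setString(1, domain);
                ps.addBatch();
            }
            ps.executeBatch(); // one round-trip for the whole batch
        }
    }
}
```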
4. Index Construction Process
The index-construction-process constructs indices from the data generated by the loader.
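At its core, index construction builds an inverted index: a mapping from each keyword to the documents that contain it. The toy in-memory version below shows the idea; the real process writes compressed on-disk index files instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class InvertedIndexSketch {
    // term -> ids of documents containing that term
    private final Map<String, List<Long>> postings = new HashMap<>();

    // Called once per document with the keywords extracted for it.
    void add(long docId, List<String> keywords) {
        for (String keyword : keywords) {
            postings.computeIfAbsent(keyword, k -> new ArrayList<>()).add(docId);
        }
    }

    // Documents matching a single-term query, or an empty list.
    List<Long> lookup(String term) {
        return postings.getOrDefault(term, List.of());
    }
}
```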
Overview
Schematically, the crawling and loading process looks like this:
//====================\\
|| Compressed JSON:   ||   Specifications
|| ID, Domain, Urls[] ||   File
|| ID, Domain, Urls[] ||
|| ID, Domain, Urls[] ||
||        ...         ||
\\====================//
           |
     +-----------+
     |  CRAWLING |   Fetch each URL and
     |    STEP   |   output to file
     +-----------+
           |
//========================\\
||  Compressed JSON:      ||   Crawl
||  Status, HTML[], ...   ||   Files
||  Status, HTML[], ...   ||
||  Status, HTML[], ...   ||
||          ...           ||
\\========================//
           |
     +------------+
     | CONVERTING |   Analyze HTML and
     |    STEP    |   extract keywords,
     +------------+   features, links, URLs
           |
  //==================\\
  ||  Parquet:        ||   Processed
  ||  Documents[]     ||   Files
  ||  Domains[]       ||
  ||  Links[]         ||
  \\==================//
           |
     +------------+   Insert domains into mariadb
     |  LOADING   |   Insert URLs, titles in link DB
     |    STEP    |   Insert keywords in Index
     +------------+
           |
     +------------+
     | CONSTRUCT  |   Make the data searchable
     |   INDEX    |
     +------------+