(doc) Fix outdated links in documentation

Viktor Lofgren 2024-09-22 13:56:17 +02:00
parent edb42836da
commit 9c292a4f62
5 changed files with 6 additions and 16 deletions

View File

@@ -13,7 +13,7 @@ a binary index that only offers information about which documents has a specific
 The priority index is also compressed, while the full index at this point is not.
 [1] See WordFlags in [common/model](../../common/model/) and
-KeywordMetadata in [features-convert/keyword-extraction](../../features-convert/keyword-extraction).
+KeywordMetadata in [converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction).
 ## Construction
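The WordFlags and KeywordMetadata referenced in this hunk determine which keywords land in the priority index. As a rough illustration only (the flag names below are invented, not the actual WordFlags members), per-keyword flags can be packed into a single long bitmask:

```java
// Hypothetical sketch of flag-bearing keyword metadata; names are illustrative.
enum WordFlag {
    TITLE, SITE, SUBJECT, EXTERNAL_LINK;

    long asBit() { return 1L << ordinal(); }
}

class KeywordMetadataSketch {
    private long flags = 0;

    void set(WordFlag flag)    { flags |= flag.asBit(); }
    boolean has(WordFlag flag) { return (flags & flag.asBit()) != 0; }

    // the priority index only records *whether* a document has a keyword
    // with an interesting flag, hence "binary index" in the text above
    boolean isPriority() { return has(WordFlag.TITLE) || has(WordFlag.SUBJECT); }
}
```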

View File

@@ -10,5 +10,5 @@ its words, how they stem, POS tags, and so on.
 ## See Also
-[features-convert/keyword-extraction](../../features-convert/keyword-extraction) uses this code to identify which keywords
+[converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction) uses this code to identify which keywords
 are important.
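The sentence model this file describes amounts to parallel arrays over a sentence's tokens; a hypothetical sketch, not the actual API:

```java
// Parallel arrays: one word, one stem, one POS tag per token.
record SentenceSketch(String[] words, String[] stems, String[] posTags) {
    // e.g. words[i] = "running", stems[i] = "run", posTags[i] = "VBG"
    int length() { return words.length; }
}
```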

View File

@@ -49,7 +49,3 @@ has HTML-specific logic related to a document, keywords and identifies features
 * [DomainProcessor](java/nu/marginalia/converting/processor/DomainProcessor.java) converts each document and
   generates domain-wide metadata such as link graphs.
-## See Also
-* [features-convert](../../features-convert/)
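The link-graph generation mentioned above boils down to accumulating domain-to-domain edges as documents are processed; a minimal sketch with invented names, assuming nothing about the real DomainProcessor beyond the description:

```java
import java.util.*;

// Each processed document contributes edges to a domain-wide link graph.
class LinkGraphSketch {
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addLinks(String sourceDomain, Collection<String> destDomains) {
        edges.computeIfAbsent(sourceDomain, k -> new HashSet<>())
             .addAll(destDomains);
    }

    Set<String> linksFrom(String domain) {
        return edges.getOrDefault(domain, Set.of());
    }
}
```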

View File

@@ -36,7 +36,3 @@ On top of organic links, the crawler can use sitemaps and rss-feeds to discover
   visits known addresses from a domain and downloads each document.
 * [HttpFetcher](java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
   fetches URLs.
-## See Also
-* [features-crawl](../../features-crawl/)
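The fetch loop these bullets describe can be pictured as a frontier of known and newly discovered URLs that the fetcher drains; a simplified sketch, standing in for (not reproducing) HttpFetcherImpl:

```java
import java.net.URI;
import java.net.http.*;
import java.util.*;

class CrawlerSketch {
    private final HttpClient client = HttpClient.newHttpClient();

    void crawl(Deque<URI> frontier) throws Exception {
        Set<URI> visited = new HashSet<>();
        while (!frontier.isEmpty()) {
            URI url = frontier.poll();
            if (!visited.add(url))     // skip already-fetched addresses
                continue;
            HttpResponse<String> rsp = client.send(
                    HttpRequest.newBuilder(url).GET().build(),
                    HttpResponse.BodyHandlers.ofString());
            // a real crawler would archive rsp.body() as WARC here, and
            // push links discovered in the page onto the frontier
        }
    }
}
```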

View File

@@ -3,15 +3,13 @@
 ## 1. Crawl Process
 The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
-re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).
-The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
+re-converts them into parquet models. Both are described in [crawling-process/model](crawling-process/model/).
 ## 2. Converting Process
 The [converting-process](converting-process/) reads crawl data from the crawling step and
 processes them, extracting keywords and metadata and saves them as parquet files
-described in [processed-data](../process-models/processed-data/).
+described in [converting-process/model](converting-process/model/).
 ## 3. Loading Process
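Taken together, the numbered steps form a file-based pipeline; a hedged, schematic sketch of the data flow (the interface and names are invented, the real processes are separate programs communicating through files on disk):

```java
import java.nio.file.Path;
import java.util.List;

interface PipelineSketch {
    Path crawl(List<String> domains);   // 1. fetch sites, write crawl data
    Path convert(Path crawlData);       // 2. extract keywords and metadata
    void load(Path processedData);      // 3. load into the index
}
```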
@@ -51,7 +49,7 @@ Schematically the crawling and loading process looks like this:
 +------------+          features, links, URLs
       |
 //==================\\
-||  Parquet:       ||  Processed
+||  Slop :         ||  Processed
 ||  Documents[]    ||  Files
 ||  Domains[]      ||
 ||  Links[]        ||
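The Documents[]/Domains[]/Links[] tables in the diagram might be pictured as record types like the following (field names are guesses; the actual schemas live in converting-process/model):

```java
import java.util.List;

// Hypothetical shapes mirroring the three tables of processed files.
record DocumentSketch(String url, String title, List<String> keywords) {}
record DomainSketch(String name, int knownUrls, int goodUrls) {}
record LinkSketch(String sourceDomain, String destDomain) {}
```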