(doc) Fix outdated links in documentation

Viktor Lofgren 2024-09-22 13:56:17 +02:00
parent edb42836da
commit 9c292a4f62
5 changed files with 6 additions and 16 deletions

@@ -13,7 +13,7 @@ a binary index that only offers information about which documents has a specific
The priority index is also compressed, while the full index at this point is not.
[1] See WordFlags in [common/model](../../common/model/) and
-KeywordMetadata in [features-convert/keyword-extraction](../../features-convert/keyword-extraction).
+KeywordMetadata in [converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction).
## Construction
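
The priority index described in this hunk is, in essence, a keyword-to-document-id posting list. The sketch below illustrates that idea with simple delta + varint compression; it is only an illustration with invented names, not Marginalia's actual on-disk format.

```java
import java.io.ByteArrayOutputStream;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a keyword -> document-id posting list that only answers
// "which documents contain this keyword", compressed with delta + varint coding.
// Marginalia's real priority index uses its own binary layout.
class PriorityIndexSketch {
    private final Map<String, long[]> postings = new HashMap<>();

    void put(String keyword, long[] sortedDocIds) {
        postings.put(keyword, sortedDocIds);
    }

    /** The only question this index answers: which documents have the keyword? */
    long[] documentsWith(String keyword) {
        return postings.getOrDefault(keyword, new long[0]);
    }

    /** Delta + varint encode a sorted doc-id list, standing in for "compressed". */
    static byte[] compress(long[] sortedDocIds) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long id : sortedDocIds) {
            long delta = id - prev;
            prev = id;
            while ((delta & ~0x7FL) != 0) {
                out.write((int) ((delta & 0x7F) | 0x80)); // low 7 bits + continuation bit
                delta >>>= 7;
            }
            out.write((int) delta);
        }
        return out.toByteArray();
    }
}
```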

@@ -10,5 +10,5 @@ its words, how they stem, POS tags, and so on.
## See Also
-[features-convert/keyword-extraction](../../features-convert/keyword-extraction) uses this code to identify which keywords
+[converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction) uses this code to identify which keywords
are important.
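
To give a feel for what "identify which keywords are important" involves, here is a toy frequency-based ranking. The real ft-keyword-extraction module combines many more signals (position, POS tags, word flags); the class and method names below are made up for the example.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Toy example: rank a document's terms by raw frequency and keep the top n.
// A stand-in for ft-keyword-extraction, which weighs many additional signals.
class KeywordRankSketch {
    static List<String> topKeywords(List<String> terms, int n) {
        Map<String, Long> counts = terms.stream()
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```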

@@ -49,7 +49,3 @@ has HTML-specific logic related to a document, keywords and identifies features
* [DomainProcessor](java/nu/marginalia/converting/processor/DomainProcessor.java) converts each document and
generates domain-wide metadata such as link graphs.
-## See Also
-* [features-convert](../../features-convert/)
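
To make "domain-wide metadata such as link graphs" concrete, the sketch below aggregates per-document outgoing links into domain-to-domain edges. It uses only JDK types and invented names; it is not DomainProcessor's actual API.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: collapse per-document links into a domain-level link graph,
// the kind of domain-wide metadata the readme above refers to.
class LinkGraphSketch {
    // source domain -> set of destination domains it links to
    private final Map<String, Set<String>> edges = new HashMap<>();

    void addDocumentLinks(String documentUrl, List<String> outgoingUrls) {
        String sourceDomain = URI.create(documentUrl).getHost();
        for (String url : outgoingUrls) {
            String destDomain = URI.create(url).getHost();
            if (destDomain != null && !destDomain.equals(sourceDomain)) {
                edges.computeIfAbsent(sourceDomain, k -> new HashSet<>()).add(destDomain);
            }
        }
    }

    Set<String> domainsLinkedFrom(String domain) {
        return edges.getOrDefault(domain, Set.of());
    }
}
```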

@@ -35,8 +35,4 @@ On top of organic links, the crawler can use sitemaps and rss-feeds to discover
* [CrawlerRetreiver](java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java)
visits known addresses from a domain and downloads each document.
* [HttpFetcher](java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
-fetches URLs.
-## See Also
-* [features-crawl](../../features-crawl/)
+fetches URLs.
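
The division of labour between the two classes listed above is roughly the one sketched below: a loop that walks a domain's known addresses, and a fetch step that downloads each document. This is a self-contained, JDK-only illustration, not the project's actual CrawlerRetreiver/HttpFetcher code, which also handles robots.txt, politeness, and WARC recording.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// JDK-only sketch of the retrieval loop: visit each known address for a domain
// and download the document body. The real crawler adds robots.txt handling,
// adaptive politeness delays, and WARC recording.
class CrawlSketch {
    private final HttpClient client = HttpClient.newHttpClient();

    Map<String, String> crawlDomain(List<String> knownUrls) throws Exception {
        Map<String, String> documents = new HashMap<>();
        for (String url : knownUrls) {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() == 200) {
                documents.put(url, response.body());
            }
            Thread.sleep(1000); // crude fixed politeness delay between requests
        }
        return documents;
    }
}
```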

@@ -3,15 +3,13 @@
## 1. Crawl Process
The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
-re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).
-The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
+re-converts them into parquet models. Both are described in [crawling-process/model](crawling-process/model/).
## 2. Converting Process
The [converting-process](converting-process/) reads crawl data from the crawling step and
processes them, extracting keywords and metadata and saves them as parquet files
-described in [processed-data](../process-models/processed-data/).
+described in [converting-process/model](converting-process/model/).
## 3. Loading Process
@@ -51,7 +49,7 @@ Schematically the crawling and loading process looks like this:
+------------+ features, links, URLs
|
//==================\\
-|| Parquet: || Processed
+|| Slop : || Processed
|| Documents[] || Files
|| Domains[] ||
|| Links[] ||
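
Read end to end, the readme edited above describes a three-stage batch pipeline. The interface below is only a schematic restatement of those stages with invented method names; in the real system each stage is a separate process operating on directories of files rather than three method calls.

```java
import java.nio.file.Path;
import java.util.List;

// Schematic restatement of the pipeline described in the readme; hypothetical names.
interface PipelineSketch {
    List<Path> crawl(List<String> domains);    // 1. fetch sites, write crawl data
    List<Path> convert(List<Path> crawlData);  // 2. extract keywords and metadata, write processed files
    void load(List<Path> processedData);       // 3. build the index and populate the database
}
```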