(doc) Fix outdated links in documentation
commit 9c292a4f62
parent edb42836da
@@ -13,7 +13,7 @@ a binary index that only offers information about which documents has a specific
 The priority index is also compressed, while the full index at this point is not.
 
 [1] See WordFlags in [common/model](../../common/model/) and
-KeywordMetadata in [features-convert/keyword-extraction](../../features-convert/keyword-extraction).
+KeywordMetadata in [converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction).
 
 ## Construction
 
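The hunk above touches the doc's note that the priority index is a binary index keyed on word flags. As a minimal sketch of how per-keyword flags might be packed into a single metadata word and tested later, assuming invented flag names and bit positions (not the actual WordFlags/KeywordMetadata definitions):

```java
// Illustrative only: a hypothetical flag set packed into a long, in the spirit
// of WordFlags/KeywordMetadata. Names and bit positions are assumptions.
enum KeywordFlag {
    TITLE, SUBJECT, SITE, URL_PATH, EXTERNAL_LINK;

    long asBit() {
        return 1L << ordinal();
    }
}

class KeywordMetadataSketch {
    private long flags = 0;

    void set(KeywordFlag flag) {
        flags |= flag.asBit();
    }

    boolean has(KeywordFlag flag) {
        return (flags & flag.asBit()) != 0;
    }

    /** A keyword might qualify for the priority index if any "important" flag is set. */
    boolean isPriority() {
        return (flags & (KeywordFlag.TITLE.asBit() | KeywordFlag.SUBJECT.asBit())) != 0;
    }
}
```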
@@ -10,5 +10,5 @@ its words, how they stem, POS tags, and so on.
 
 ## See Also
 
-[features-convert/keyword-extraction](../../features-convert/keyword-extraction) uses this code to identify which keywords
+[converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction) uses this code to identify which keywords
 are important.
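The file changed here describes identifying important keywords from a document's words, stems and POS tags. The following is a rough stand-in showing only the general shape of such a ranking pass, using plain term frequency; it is not the ft-keyword-extraction algorithm.

```java
import java.util.*;
import java.util.stream.*;

// Minimal keyword-ranking sketch: tokenize, normalize, count, take the top N.
// The real extraction logic also considers stems, POS tags and positions;
// this example only illustrates the overall pipeline shape.
class KeywordSketch {
    static List<String> topKeywords(String text, int n) {
        Map<String, Long> counts = Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> w.length() > 2)
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));

        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topKeywords("the quick brown fox jumps over the lazy dog the fox", 3));
    }
}
```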
@@ -49,7 +49,3 @@ has HTML-specific logic related to a document, keywords and identifies features
 
 * [DomainProcessor](java/nu/marginalia/converting/processor/DomainProcessor.java) converts each document and
 generates domain-wide metadata such as link graphs.
-
-## See Also
-
-* [features-convert](../../features-convert/)
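To make "generates domain-wide metadata such as link graphs" concrete, here is a small sketch of collapsing per-document outgoing links into domain-to-domain edges; the types are invented for the example and do not mirror DomainProcessor's actual classes.

```java
import java.net.URI;
import java.util.*;

// Hypothetical aggregation sketch: collapse per-document links into
// a set of (sourceDomain -> destinationDomain) edges.
class DomainLinkGraphSketch {
    record Edge(String source, String destination) {}

    static Set<Edge> buildEdges(String sourceDomain, List<URI> documentLinks) {
        Set<Edge> edges = new HashSet<>();
        for (URI link : documentLinks) {
            String dest = link.getHost();
            if (dest != null && !dest.equals(sourceDomain)) {
                edges.add(new Edge(sourceDomain, dest));
            }
        }
        return edges;
    }
}
```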
@@ -35,8 +35,4 @@ On top of organic links, the crawler can use sitemaps and rss-feeds to discover
 * [CrawlerRetreiver](java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java)
 visits known addresses from a domain and downloads each document.
 * [HttpFetcher](java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
 fetches URLs.
-
-## See Also
-
-* [features-crawl](../../features-crawl/)
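As a rough illustration of the visit-and-download loop that CrawlerRetreiver and HttpFetcher perform, here is a minimal fetch loop using the JDK's built-in HttpClient; it omits robots.txt handling, politeness delays, WARC capture and error handling, all of which the real crawler has.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Simplified stand-in for the crawler's fetch loop. Real crawling adds
// robots.txt checks, rate limiting, retries and WARC recording.
class FetchLoopSketch {
    private final HttpClient client = HttpClient.newHttpClient();

    void crawl(List<URI> knownUrls) throws Exception {
        for (URI url : knownUrls) {
            HttpRequest request = HttpRequest.newBuilder(url)
                    .header("User-Agent", "example-crawler-sketch")
                    .GET()
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(url + " -> " + response.statusCode()
                    + " (" + response.body().length() + " chars)");
        }
    }
}
```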
@@ -3,15 +3,13 @@
 ## 1. Crawl Process
 
 The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
-re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).
-
-The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
+re-converts them into parquet models. Both are described in [crawling-process/model](crawling-process/model/).
 
 ## 2. Converting Process
 
 The [converting-process](converting-process/) reads crawl data from the crawling step and
 processes them, extracting keywords and metadata and saves them as parquet files
-described in [processed-data](../process-models/processed-data/).
+described in [converting-process/model](converting-process/model/).
 
 ## 3. Loading Process
 
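The numbered steps in this readme (crawl to WARC, convert to processed model files, then load) amount to a hand-off pipeline. The sketch below only captures that shape with invented interfaces; the actual processes are separate batch programs, not method calls in one JVM.

```java
import java.nio.file.Path;
import java.util.List;

// Pipeline-shape sketch with hypothetical interfaces, for orientation only.
interface CrawlStep   { List<Path> crawl(List<String> domains); }    // -> WARC files
interface ConvertStep { List<Path> convert(List<Path> warcFiles); }  // -> processed model files
interface LoadStep    { void load(List<Path> processedFiles); }      // -> index + database

class PipelineSketch {
    static void run(CrawlStep crawl, ConvertStep convert, LoadStep load, List<String> domains) {
        List<Path> warcs = crawl.crawl(domains);
        List<Path> processed = convert.convert(warcs);
        load.load(processed);
    }
}
```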
@@ -51,7 +49,7 @@ Schematically the crawling and loading process looks like this:
     +------------+          features, links, URLs
            |
     //==================\\
-    ||  Parquet:        ||  Processed
+    ||  Slop   :        ||  Processed
     ||  Documents[]     ||  Files
     ||  Domains[]       ||
     ||  Links[]         ||
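The diagram's processed output (Documents[], Domains[], Links[]) can be pictured as three record streams. The fields below are placeholders for illustration only and do not reflect the real Slop/Parquet column layout.

```java
import java.util.List;

// Placeholder record shapes for the three processed-data streams in the diagram.
// Field names are assumptions for the example, not the real schema.
record ProcessedDocument(String url, String title, List<String> keywords) {}
record ProcessedDomain(String domainName, int knownUrls, String state) {}
record ProcessedLink(String sourceDomain, String destinationDomain) {}
```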