mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git (synced 2025-02-24 05:18:58 +00:00)
(doc) Fix outdated links in documentation
parent edb42836da
commit 9c292a4f62
@@ -13,7 +13,7 @@ a binary index that only offers information about which documents has a specific
 The priority index is also compressed, while the full index at this point is not.
 
 [1] See WordFlags in [common/model](../../common/model/) and
-KeywordMetadata in [features-convert/keyword-extraction](../../features-convert/keyword-extraction).
+KeywordMetadata in [converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction).
 
 ## Construction
 
@@ -10,5 +10,5 @@ its words, how they stem, POS tags, and so on.
 
 ## See Also
 
-[features-convert/keyword-extraction](../../features-convert/keyword-extraction) uses this code to identify which keywords
+[converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction) uses this code to identify which keywords
 are important.
@@ -49,7 +49,3 @@ has HTML-specific logic related to a document, keywords and identifies features
 
 * [DomainProcessor](java/nu/marginalia/converting/processor/DomainProcessor.java) converts each document and
 generates domain-wide metadata such as link graphs.
-
-## See Also
-
-* [features-convert](../../features-convert/)
@@ -36,7 +36,3 @@ On top of organic links, the crawler can use sitemaps and rss-feeds to discover
 visits known addresses from a domain and downloads each document.
 * [HttpFetcher](java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
 fetches URLs.
-
-## See Also
-
-* [features-crawl](../../features-crawl/)
@@ -3,15 +3,13 @@
 ## 1. Crawl Process
 
 The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
-re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).
+re-converts them into parquet models. Both are described in [crawling-process/model](crawling-process/model/).
 
-The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
-
 ## 2. Converting Process
 
 The [converting-process](converting-process/) reads crawl data from the crawling step and
 processes them, extracting keywords and metadata and saves them as parquet files
-described in [processed-data](../process-models/processed-data/).
+described in [converting-process/model](converting-process/model/).
 
 ## 3. Loading Process
 
@@ -51,7 +49,7 @@ Schematically the crawling and loading process looks like this:
 +------------+ features, links, URLs
 |
 //==================\\
-|| Parquet: || Processed
+|| Slop : || Processed
 || Documents[] || Files
 || Domains[] ||
 || Links[] ||