diff --git a/code/index/index-reverse/readme.md b/code/index/index-reverse/readme.md
index 0874bf8d..4b53c1fb 100644
--- a/code/index/index-reverse/readme.md
+++ b/code/index/index-reverse/readme.md
@@ -13,7 +13,7 @@ a binary index that only offers information about which documents has a specific
 The priority index is also compressed, while the full index at this point is not.
 
 [1] See WordFlags in [common/model](../../common/model/) and
-KeywordMetadata in [features-convert/keyword-extraction](../../features-convert/keyword-extraction).
+KeywordMetadata in [converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction).
 
 ## Construction
diff --git a/code/libraries/language-processing/readme.md b/code/libraries/language-processing/readme.md
index 7b8ee049..d996de05 100644
--- a/code/libraries/language-processing/readme.md
+++ b/code/libraries/language-processing/readme.md
@@ -10,5 +10,5 @@ its words, how they stem, POS tags, and so on.
 
 ## See Also
 
-[features-convert/keyword-extraction](../../features-convert/keyword-extraction) uses this code to identify which keywords
+[converting-process/ft-keyword-extraction](../../processes/converting-process/ft-keyword-extraction) uses this code to identify which keywords
 are important.
\ No newline at end of file
diff --git a/code/processes/converting-process/readme.md b/code/processes/converting-process/readme.md
index 936ca7fe..8dab7911 100644
--- a/code/processes/converting-process/readme.md
+++ b/code/processes/converting-process/readme.md
@@ -49,7 +49,3 @@ has HTML-specific logic related to a document, keywords and identifies features
 * [DomainProcessor](java/nu/marginalia/converting/processor/DomainProcessor.java) converts each document and generates domain-wide
   metadata such as link graphs.
-
-## See Also
-
-* [features-convert](../../features-convert/)
\ No newline at end of file
diff --git a/code/processes/crawling-process/readme.md b/code/processes/crawling-process/readme.md
index 0f72cb87..d40ed8bd 100644
--- a/code/processes/crawling-process/readme.md
+++ b/code/processes/crawling-process/readme.md
@@ -35,8 +35,4 @@ On top of organic links, the crawler can use sitemaps and rss-feeds to discover
 * [CrawlerRetreiver](java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java)
   visits known addresses from a domain and downloads each document.
 * [HttpFetcher](java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java)
-  fetches URLs.
-
-## See Also
-
-* [features-crawl](../../features-crawl/)
\ No newline at end of file
+  fetches URLs.
\ No newline at end of file
diff --git a/code/processes/readme.md b/code/processes/readme.md
index 3bdc0970..27142bfc 100644
--- a/code/processes/readme.md
+++ b/code/processes/readme.md
@@ -3,15 +3,13 @@
 ## 1. Crawl Process
 
 The [crawling-process](crawling-process/) fetches website contents, temporarily saving them as WARC files, and then
-re-converts them into parquet models. Both are described in [crawling-model](../process-models/crawling-model/).
-
-The operation is optionally defined by a [crawl specification](../process-models/crawl-spec), which can be created in the control GUI.
+re-converts them into parquet models. Both are described in [crawling-process/model](crawling-process/model/).
 
 ## 2. Converting Process
 
 The [converting-process](converting-process/) reads crawl data from the crawling step and processes them, extracting
 keywords and metadata and saves them as parquet files
-described in [processed-data](../process-models/processed-data/).
+described in [converting-process/model](converting-process/model/).
 
 ## 3. Loading Process
@@ -51,7 +49,7 @@ Schematically the crawling and loading process looks like this:
     +------------+          features, links, URLs
            |
            //==================\\
-          || Parquet:          ||  Processed
+          || Slop:             ||  Processed
           || Documents[]       ||  Files
           || Domains[]         ||
           || Links[]           ||