MarginaliaSearch/code/processes/converting-process
Viktor Lofgren 440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet, and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
src (crawler) WIP integration of WARC files into the crawler and converter process. 2023-12-13 15:33:42 +01:00
build.gradle Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
readme.md (docs) Improve architectural documentation for the converter. 2023-11-30 20:43:22 +01:00

Converting Process

The converting process reads crawl data and extracts the information to be fed into the index, such as keywords, metadata, URLs, and descriptions.

Structure

Most information is extracted from the document itself within DocumentProcessor, but some information is extracted from the context of the document, such as other documents on the same domain. This is done in DomainProcessor.
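The split between document-level and domain-level extraction can be sketched as below. This is a minimal illustration only: the types `RawDocument`, `ProcessedDocument`, `SketchDocumentProcessor` and `SketchDomainProcessor` are hypothetical and are not the real classes, and the "mentioned by another document" check is an invented stand-in for whatever context the actual DomainProcessor uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types for illustration; these are not the real Marginalia classes.
record RawDocument(String url, String body) {}
record ProcessedDocument(String url, int wordCount, boolean mentionedOnSameDomain) {}

class SketchDocumentProcessor {
    // Document-level step: uses nothing but the document itself.
    static int countWords(RawDocument doc) {
        String trimmed = doc.body().trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }
}

class SketchDomainProcessor {
    // Domain-level step: sees every document on the domain, so it can add
    // context, here whether some other document on the domain mentions this URL.
    static List<ProcessedDocument> process(List<RawDocument> domainDocs) {
        List<ProcessedDocument> out = new ArrayList<>();
        for (RawDocument doc : domainDocs) {
            boolean mentioned = domainDocs.stream()
                    .anyMatch(other -> other != doc && other.body().contains(doc.url()));
            int words = SketchDocumentProcessor.countWords(doc);
            out.add(new ProcessedDocument(doc.url(), words, mentioned));
        }
        return out;
    }
}
```

The point of the sketch is the direction of the dependency: the domain-level step calls into the document-level step, and only the former needs the full set of documents.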

To support multiple document formats, the converting process is pluggable. Each plugin is responsible for converting a single document format, such as HTML or plain text.
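One way to picture such a plugin arrangement is sketched below. The interface `FormatPlugin`, its methods, the two plugin classes and `PluginRegistry` are all assumptions made for illustration; the real plugin API may look quite different, and the tag stripping is deliberately crude.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical plugin interface; not the converter's actual API.
interface FormatPlugin {
    boolean accepts(String contentType);
    String extractText(String rawBody);
}

class HtmlPlugin implements FormatPlugin {
    public boolean accepts(String contentType) {
        return contentType.startsWith("text/html");
    }

    // Crude tag stripping followed by whitespace normalisation, for illustration only.
    public String extractText(String rawBody) {
        return rawBody.replaceAll("<[^>]*>", " ").trim().replaceAll("\\s+", " ");
    }
}

class PlainTextPlugin implements FormatPlugin {
    public boolean accepts(String contentType) {
        return contentType.startsWith("text/plain");
    }

    public String extractText(String rawBody) {
        return rawBody.trim();
    }
}

class PluginRegistry {
    private final List<FormatPlugin> plugins = List.of(new HtmlPlugin(), new PlainTextPlugin());

    // Picks the first plugin that accepts the content type; empty if none does.
    Optional<String> convert(String contentType, String rawBody) {
        return plugins.stream()
                .filter(p -> p.accepts(contentType))
                .findFirst()
                .map(p -> p.extractText(rawBody));
    }
}
```

With this shape, supporting a new format means adding one class to the registry, and unsupported content types fall through as an empty result rather than an error.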

Further, the HTML plugin supports specializations, which refine the conversion process for specific server software, such as Javadoc, MediaWiki, PhpBB, etc. This improves the processing of common types of websites, since it's hard to build a one-size-fits-all heuristic for deciding which parts of a document are important that does justice to every website.
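As a rough sketch of how a specialization might be chosen, the example below assumes the server software can be recognised from a generator string, such as the value of an HTML `<meta name="generator">` tag. That detection method, along with the names `HtmlSpecialization` and `SpecializationSelector`, is an assumption for illustration and is not taken from the converter's code.

```java
import java.util.Locale;

// Hypothetical names; the software list mirrors the examples in the text above.
enum HtmlSpecialization { DEFAULT, JAVADOC, MEDIAWIKI, PHPBB }

class SpecializationSelector {
    // Assumption: the generator string identifies the server software.
    // Anything unrecognised falls back to the generic HTML handling.
    static HtmlSpecialization select(String generator) {
        if (generator == null) {
            return HtmlSpecialization.DEFAULT;
        }
        String g = generator.toLowerCase(Locale.ROOT);
        if (g.contains("javadoc")) return HtmlSpecialization.JAVADOC;
        if (g.contains("mediawiki")) return HtmlSpecialization.MEDIAWIKI;
        if (g.contains("phpbb")) return HtmlSpecialization.PHPBB;
        return HtmlSpecialization.DEFAULT;
    }
}
```

The fallback to `DEFAULT` matters here: a specialization only refines the generic heuristic, so a site that is not recognised is still processed.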

Anchor Text

The converting process also supports supplementing the data with external information, such as anchor texts. This is done automatically if atags.parquet is available in the data/ directory. atags.parquet can be downloaded from here.

The rationale for doing this, as well as the details of how the file is generated, are described in this blog post: https://www.marginalia.nu/log/93_atags/
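The "automatic if the file is available" behaviour can be sketched as a simple presence check. The class `AnchorTextSource` and its `locate` method are hypothetical; only the file name atags.parquet and the data directory come from the text above, and the sketch does not read the parquet file itself.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;

// Hypothetical helper; not part of the actual converter.
class AnchorTextSource {
    static final String FILE_NAME = "atags.parquet";

    // Returns the path to atags.parquet if it exists as a regular file in
    // the given data directory, or empty if anchor texts are unavailable.
    static Optional<Path> locate(Path dataDir) {
        Path candidate = dataDir.resolve(FILE_NAME);
        return Files.isRegularFile(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}
```

A caller would treat an empty result as "convert without anchor texts" rather than as a failure, which matches the optional nature of the supplement.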

Central Classes

See Also