
# Converting Process

The converting process reads crawl data and extracts information to be fed into the index, such as keywords, metadata, URLs, and descriptions.

The converter reads crawl data in the form of parquet files, and writes the extracted data to parquet files in a different format. These files are then passed to the loader process, which does the additional processing needed to feed the data into the index.

The reason for splitting the process into two parts is that the heavier converting process can be terminated and restarted without losing progress, while the lighter loader process needs to run in a single pass (or be restarted from the beginning if it crashes or is terminated).

The converter's output is also generally more portable and can be used for other tasks, whereas the loader's output is heavily tailored to the index and of little use for anything else.

## Structure

Most information is extracted from the document itself within DocumentProcessor, but some information is extracted from the context of the document, such as other documents on the same domain. This is done in DomainProcessor.
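To illustrate the split, here is a minimal sketch of the two levels of processing. The class and method names are hypothetical, chosen for illustration; they are not the actual Marginalia classes beyond the DocumentProcessor/DomainProcessor distinction described above.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: per-document extraction versus a domain-level
// pass that can see all documents on the same domain at once.
record Doc(String url, String title) {}

class DocumentProcessorSketch {
    // Information derivable from the document alone.
    String extractTitleKeyword(Doc doc) {
        return doc.title().toLowerCase();
    }
}

class DomainProcessorSketch {
    // Information that needs the context of sibling documents,
    // e.g. how many pages each domain has.
    Map<String, Integer> documentCountsByDomain(List<Doc> docs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Doc d : docs) {
            String host = d.url().split("/")[2]; // crude host extraction
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }
}
```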

To support multiple document formats, the converting process is pluggable. Each plugin is responsible for converting a single document format, such as HTML or plain text.

Further, the HTML plugin supports specializations, which refine the conversion process for specific server software, such as Javadoc, MediaWiki, PhpBB, etc. This improves the processing of common types of websites, and compensates for the fact that no one-size-fits-all heuristic for deciding which parts of a document are important can do justice to every website.
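A hedged sketch of what such a pluggable design might look like, with an HTML plugin that delegates to a matching specialization and falls back to a generic heuristic. All names here are illustrative assumptions, not the actual Marginalia API.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: each plugin handles one document format, and the
// HTML plugin may delegate to a specialization for known server software.
record ProcessedDocument(String url, List<String> keywords) {}

interface ConverterPlugin {
    boolean canConvert(String contentType);
    ProcessedDocument convert(String url, String body);
}

interface HtmlSpecialization {
    boolean appliesTo(String body);            // e.g. detect a MediaWiki generator tag
    List<String> extractKeywords(String body); // refined extraction for this site type
}

class HtmlConverterPlugin implements ConverterPlugin {
    private final List<HtmlSpecialization> specializations;

    HtmlConverterPlugin(List<HtmlSpecialization> specializations) {
        this.specializations = specializations;
    }

    public boolean canConvert(String contentType) {
        return contentType.startsWith("text/html");
    }

    public ProcessedDocument convert(String url, String body) {
        // Prefer a matching specialization; fall back to generic heuristics.
        Optional<HtmlSpecialization> special = specializations.stream()
                .filter(s -> s.appliesTo(body))
                .findFirst();
        List<String> keywords = special
                .map(s -> s.extractKeywords(body))
                .orElseGet(() -> genericExtract(body));
        return new ProcessedDocument(url, keywords);
    }

    private List<String> genericExtract(String body) {
        // Stand-in for the real keyword extraction logic.
        return List.of(body.toLowerCase().split("\\W+"));
    }
}
```

The design keeps format detection (`canConvert`) separate from site-type detection (`appliesTo`), so new formats and new specializations can be added independently.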

## Anchor Text

The converting process also supports supplementing the data with external information, such as anchor texts. This is done automatically if `atags.parquet` is available in the `data/` directory. `atags.parquet` can be downloaded from here.

The rationale for doing this, as well as the details of how the file is generated, is described in this blog post: https://www.marginalia.nu/log/93_atags/

## Central Classes