Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.
The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.
Finally the form for triggering an export was overhauled.
Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.
A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.
Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.
The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.
The warc data is saved in a directory warc/ under the crawl data storage.
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better.
It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.
The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true.
The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills.
The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format.
To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order.
This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.
Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.
This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service.
A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.
The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
There was as bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.
Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!
This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
There was as bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.
Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!
This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service.
This reduces thread churn in the converter sideloader style processing of regular crawl data.
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number.
This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
This commit adds a safety check that the URL of the document is from the correct domain.
It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).
It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process.
These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.
With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while.
The first step is to move stuff out of the domain processor into the document processor.
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.