Commit Graph

287 Commits

Author SHA1 Message Date
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
Viktor Lofgren
6d2e14a656 (build) Remove false depdencency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:17:29 +01:00
Viktor Lofgren
60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-03 23:14:03 +01:00
Viktor Lofgren
f7560cb1d8 (feature) More trackers 2024-01-03 17:31:02 +01:00
Viktor Lofgren
1f66568d59 (feature) More trackers 2024-01-03 17:27:25 +01:00
Viktor Lofgren
7af07cef95 (feature) Add another doubleclick variant to the adtech trackers 2024-01-03 17:21:12 +01:00
Viktor Lofgren
41a540a629 (converter) Penalize chatgpt content farm spam 2024-01-03 17:04:38 +01:00
Viktor Lofgren
f599944942 (converter) Penalize chatgpt content farm spam 2024-01-03 16:51:26 +01:00
Viktor Lofgren
4ce692ccaf (converter) Use SimpleBlockingThreadPool in ProcessingIterator 2024-01-03 14:27:47 +01:00
Viktor Lofgren
f0d9618dfc (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:34:58 +01:00
Viktor Lofgren
dc90c9ac65 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-01 16:19:38 +01:00
Viktor Lofgren
75d87c73d1 (crawler) Disable Java's infinite DNS caching 2023-12-31 16:59:08 +01:00
Viktor Lofgren
0fe44c9bf2 (crawler) Fix broken test
A necessary step was accidentally deleted when cleaning up these tests previously.
2023-12-30 13:56:44 +01:00
Viktor Lofgren
7a1d20ed0a (converter) Better use of ProcessingIterator
Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service.

This reduces thread churn in the converter sideloader style processing of regular crawl data.
2023-12-30 13:53:55 +01:00
Viktor Lofgren
70c83b60a1 (converter) Clean up fullProcessing()
This function made some very flimsy-looking assumptions about the order of an iterable.  These are still made, but more explicitly so.
2023-12-30 13:36:18 +01:00
Viktor Lofgren
7ba296ccdf (converter) Route sizeHint to SideloadProcessing
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number.

This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
2023-12-30 13:05:10 +01:00
Viktor Lofgren
0b112cb4d4 (warc) Update URL encoding in WarcProtocolReconstructor
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
2023-12-29 19:41:37 +01:00
Viktor Lofgren
e7dd28b926 (converter) Optimize sideload-loading
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
2023-12-29 14:25:48 +01:00
Viktor Lofgren
dec3b1092d (converter) Fix bugs in conversion
This commit adds a safety check that the URL of the document is from the correct domain.

It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).

It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
2023-12-29 13:58:08 +01:00
Viktor Lofgren
c488599879 (converter) Fix NPE in converter 2023-12-28 19:52:26 +01:00
Viktor Lofgren
c847d83011 (converter) Add size hint to converter sideload processing 2023-12-28 19:14:16 +01:00
Viktor Lofgren
5ce46a61d4 Merge branch 'master' into converter-optimizations 2023-12-28 13:26:19 +01:00
Viktor Lofgren
00a974a721 (crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions
This commit also improves the test coverage for this part of the code.
2023-12-27 20:02:17 +01:00
Viktor Lofgren
7428ba2dd7 (converter) Basic test coverage for sideloading-style processing 2023-12-27 19:29:26 +01:00
Viktor Lofgren
b37223c053 (converter) Basic test coverage for sideloading-style processing 2023-12-27 18:33:16 +01:00
Viktor Lofgren
24051fec03 (converter) WIP Run sideload-style processing for large domains
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis.   This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process.

These websites now receive a simplified treatment.  This is executed in the converter batch writer thread.  This is slower, but the documents will not be persisted in memory.
2023-12-27 18:20:03 +01:00
Viktor Lofgren
f811a29f87 (crawler) Fix resource leak in crawler
A 10 MB thread local buffer wasn't static.  Oops.
2023-12-27 16:32:17 +01:00
Viktor Lofgren
acf7bcc7a6 (converter) Refactor the DomainProcessor for new format of crawl data
With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter.  This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while.

The first step is to move stuff out of the domain processor into the document processor.
2023-12-27 13:57:59 +01:00
Viktor Lofgren
9e5fe71f5b (crawler) Switch hash function in crawler
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler.   This switches to the modified Murmur hash function used throughout Marginalia.
2023-12-27 13:29:00 +01:00
Viktor Lofgren
3ea1ddae22 (crawler) Roll back switch to virtual thread pool in crawler
This seems to cause a resource leak, it seems the http library uses thread locals?
2023-12-26 19:37:34 +01:00
Viktor Lofgren
25d086c4e1 (crawler) Clean up stale warc files
We should probably have an option to keep them, but not by default!
2023-12-25 15:07:36 +01:00
Viktor Lofgren
f779f760c4 (crawler) Even more lenient resyncing 2023-12-25 01:44:18 +01:00
Viktor Lofgren
f18f82e229 (crawler) Write etags and last-modified on reference copy
This commit also fixes a test that broke with a previous change.
2023-12-25 01:40:13 +01:00
Viktor Lofgren
67ef2b45fa (crawler) Reduce logging 2023-12-25 01:10:03 +01:00
Viktor Lofgren
d72e871265 (warc) Fix resync 2023-12-25 01:03:03 +01:00
Viktor Lofgren
4c9bc13309 (warc) Reduce log spam 2023-12-25 00:58:31 +01:00
Viktor Lofgren
84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK 2023-12-25 00:55:05 +01:00
Viktor Lofgren
c5aab7e8db (warc) Fix NPE in WarcRecorder 2023-12-25 00:54:38 +01:00
Viktor Lofgren
1755b646b8 (warc) Fix NPE in WarcRecorder 2023-12-25 00:48:42 +01:00
Viktor Lofgren
e1a155a9c8 (crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under.  This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl.  GOOD_URLS is based on how many documents successfully process, which is typically fairly small.  Switching to KNOWN_URLS should let this grow faster.

The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.

The floor is also increased to 250 from 200.
2023-12-23 13:22:10 +01:00
Viktor Lofgren
dc773c5c20 (adjacencies) Clean up AdjacenciesLoader
Make JDBC batching more consistent, also adds a test case for the loader.
2023-12-21 14:14:22 +01:00
Viktor Lofgren
b6253b03c2 (adjacencies) Fix bug in AdjacenciesLoader
This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created.  This fails and does nothing.

Furthermore, added the logging that would have warned about this failure, had it been in place.
2023-12-21 13:12:31 +01:00
Viktor Lofgren
a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams 2023-12-20 15:46:21 +01:00
Viktor Lofgren
bfae478251 Refactor CrawlerRevisitor for better consistency 2023-12-20 15:21:49 +01:00
Viktor Lofgren
a7cd490593 (minor) Remove dead code. 2023-12-19 18:58:33 +01:00
Viktor Lofgren
dd8fb04886 (converter) Add sizeloadSizeAdvice field to several ProcessedDomain
Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking.  This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.
2023-12-19 18:37:51 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add a fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
126ac3816f (converter) Reduce queue size in ConverterWriter
The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM.  This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.
2023-12-18 13:42:40 +01:00