Commit Graph

280 Commits

Author SHA1 Message Date
Viktor Lofgren
785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
Viktor Lofgren
3fff7f6878 (converter) Fix issue where quality limits were no longer enforced 2024-01-23 11:42:17 +01:00
Viktor Lofgren
41d896ba3e (converter) Refactor content type check in PlainTextDocumentProcessorPlugin
The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accomodate contentTypes that append a charset as well.
2024-01-22 17:52:14 +01:00
Viktor Lofgren
40c9d2050f (control) Fully automatic conversion
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.

Removed the tool itself.

This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency.  This has been fixed, and :third-party:xz was removed.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
fd1eec99b5 (cleanup) Fix broken tests 2024-01-15 15:44:33 +01:00
Viktor Lofgren
c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentations to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
Viktor Lofgren
60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-03 23:14:03 +01:00
Viktor Lofgren
f7560cb1d8 (feature) More trackers 2024-01-03 17:31:02 +01:00
Viktor Lofgren
1f66568d59 (feature) More trackers 2024-01-03 17:27:25 +01:00
Viktor Lofgren
7af07cef95 (feature) Add another doubleclick variant to the adtech trackers 2024-01-03 17:21:12 +01:00
Viktor Lofgren
41a540a629 (converter) Penalize chatgpt content farm spam 2024-01-03 17:04:38 +01:00
Viktor Lofgren
f599944942 (converter) Penalize chatgpt content farm spam 2024-01-03 16:51:26 +01:00
Viktor Lofgren
4ce692ccaf (converter) Use SimpleBlockingThreadPool in ProcessingIterator 2024-01-03 14:27:47 +01:00
Viktor Lofgren
f0d9618dfc (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:34:58 +01:00
Viktor Lofgren
dc90c9ac65 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-01 16:19:38 +01:00
Viktor Lofgren
7a1d20ed0a (converter) Better use of ProcessingIterator
Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service.

This reduces thread churn in the converter sideloader style processing of regular crawl data.
2023-12-30 13:53:55 +01:00
Viktor Lofgren
70c83b60a1 (converter) Clean up fullProcessing()
This function made some very flimsy-looking assumptions about the order of an iterable.  These are still made, but more explicitly so.
2023-12-30 13:36:18 +01:00
Viktor Lofgren
7ba296ccdf (converter) Route sizeHint to SideloadProcessing
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number.

This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
2023-12-30 13:05:10 +01:00
Viktor Lofgren
e7dd28b926 (converter) Optimize sideload-loading
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
2023-12-29 14:25:48 +01:00
Viktor Lofgren
dec3b1092d (converter) Fix bugs in conversion
This commit adds a safety check that the URL of the document is from the correct domain.

It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).

It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
2023-12-29 13:58:08 +01:00
Viktor Lofgren
c488599879 (converter) Fix NPE in converter 2023-12-28 19:52:26 +01:00
Viktor Lofgren
c847d83011 (converter) Add size hint to converter sideload processing 2023-12-28 19:14:16 +01:00
Viktor Lofgren
7428ba2dd7 (converter) Basic test coverage for sideloading-style processing 2023-12-27 19:29:26 +01:00
Viktor Lofgren
b37223c053 (converter) Basic test coverage for sideloading-style processing 2023-12-27 18:33:16 +01:00
Viktor Lofgren
24051fec03 (converter) WIP Run sideload-style processing for large domains
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis.   This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process.

These websites now receive a simplified treatment.  This is executed in the converter batch writer thread.  This is slower, but the documents will not be persisted in memory.
2023-12-27 18:20:03 +01:00
Viktor Lofgren
acf7bcc7a6 (converter) Refactor the DomainProcessor for new format of crawl data
With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter.  This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while.

The first step is to move stuff out of the domain processor into the document processor.
2023-12-27 13:57:59 +01:00
Viktor Lofgren
dd8fb04886 (converter) Add sizeloadSizeAdvice field to several ProcessedDomain
Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking.  This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.
2023-12-19 18:37:51 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add a fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
126ac3816f (converter) Reduce queue size in ConverterWriter
The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM.  This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.
2023-12-18 13:42:40 +01:00
Viktor Lofgren
c422f0b9fb (geo-ip) Tidy up error handling 2023-12-17 16:06:51 +01:00
Viktor Lofgren
c92f1b8df8 (geo-ip) Revert removal of ip2location logic
We do both ip2location and ASN data.

The change also adds some keywords based on autonomous system information, on a somewhat experimental basis.  It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
2023-12-17 15:03:00 +01:00
Viktor Lofgren
bde68ba48b Merge branch 'master' into asn-info 2023-12-17 14:00:23 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
edf9aa2c23 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 13:59:54 +01:00
Viktor Lofgren
bcad6492d6 (sideloader) Fix integration problems with sideloaders
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.

Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.

Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
2023-12-17 13:28:17 +01:00
Viktor Lofgren
d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net.
Doesn't really make sense to use ip2location as a middle man for information that is already freely available...
2023-12-16 21:55:04 +01:00
Viktor Lofgren
3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tags
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
2023-12-16 15:58:27 +01:00
Viktor Lofgren
2001d0f707 (converter) Add @Deprecated annotation to a few fields that should no longer be used. 2023-12-15 21:42:00 +01:00
Viktor Lofgren
0f9cd9c87d (warc) More accurate filering of advisory records
Further create records for resources that were blocked due to robots.txt; as well as tests to verify this happens.
2023-12-15 21:37:02 +01:00
Viktor Lofgren
2e7db61808 (warc) More accurate filering of advisory records
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.

Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.

The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
2023-12-15 21:31:16 +01:00
Viktor Lofgren
5329968155 (crawler) Update CrawlingThenConvertingIntegrationTest
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
2023-12-15 21:04:06 +01:00
Viktor Lofgren
2e536e3141 (crawler) Add timestamp to CrawledDocument records
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.

The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type.  This is to avoid having to do format conversions when writing and reading the data.

This parquet field populates the timestamp field in CrawledDocument.
2023-12-15 20:23:27 +01:00
Viktor Lofgren
cf935a5331 (converter) Read cookie information
Add an optional new field to CrawledDocument containing information about whether the domain has cookies.  This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.

Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
2023-12-15 18:09:53 +01:00
Viktor Lofgren
0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be likely to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
2023-12-11 19:32:58 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
30bc3f9281 (converter) Use the prefix ip: instead of geopip: for country codes
This is the same as the prefix for the IP address, but I don't think that substantially matters, the as two have such different namespaces there can be no confusion.
2023-12-11 13:59:23 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified ot query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
5c46af0edb (converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g a database query in parallel and returning the computed results as an iterator.

The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().
2023-12-09 15:20:53 +01:00
Viktor Lofgren
b6511fbfe2 (converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing
The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance.

It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.
2023-12-09 15:20:52 +01:00
Viktor Lofgren
d0982e7ba5 (converter) Add error handling and lazy load external domain links
The converter was not properly initiating the external links for each domain, causing an NPE in conversion.  This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.

Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
2023-12-09 12:33:39 +01:00
Viktor Lofgren
fc30da0d48 (converter) Add academia recognition to DomainProcessor
The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like *.ac.ccTld or *.edu.ccTld.

 If these conditions are met, the search term "special:academia" is added to the domain.

 The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well.  The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.
2023-12-08 20:31:34 +01:00
Viktor Lofgren
3bbffd3c22 (crawler) Refactor HttpFetcher to integrate WarcRecorder
Partially hook in the WarcRecorder into the crawler process.  So far it's not read, but should record the crawled documents.

The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
2023-12-08 17:12:51 +01:00
Viktor Lofgren
fabffa80f0 (warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader 2023-12-07 15:26:01 +01:00
Viktor Lofgren
2d5d11645d (warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer 2023-12-06 19:00:29 +01:00
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
f615cf2391 (convert) Loosen up the rules enforcement for documents that have external links. 2023-12-01 17:44:29 +01:00
Viktor Lofgren
5fb24bb27f (docs) Improve architectural documentation for the converter. 2023-11-30 20:43:22 +01:00
Viktor Lofgren
5a5430b383 (convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary. 2023-11-30 20:04:46 +01:00
Viktor Lofgren
09917837d0 (process) Ensure construction exceptions are logged
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.

The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
2023-11-22 18:32:06 +01:00
Viktor Lofgren
5de37cb820 (converter) Set feature flags appropriately on stackexchange posts 2023-11-12 15:48:08 +01:00
Viktor Lofgren
e5cee1f46d (sideload) Fix sideloading so that it doesn't get disproportionately good rankings
Also add type flags so that e.g. wikipedia shows up in the wikis filter.
2023-11-12 14:57:57 +01:00
Viktor Lofgren
e0c769fd19 (converter) Integrate atags.parquet with the encyclopedia sideloader
Also clean up stackexchange and dirtree a bit.
2023-11-06 18:03:01 +01:00
Viktor Lofgren
ebd10a5f28 (crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized 2023-11-06 16:14:58 +01:00
Viktor Lofgren
2b77184281 (converter) Integrate atags with the topology field 2023-11-06 13:46:44 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
f613f4f2df (array) Fix spurious search results
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.

It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
2023-10-26 15:27:02 +02:00
Viktor Lofgren
d7686b665e Refactoring
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
e06a8c1de2 (converter) Put upper limit on number of worker threads. 2023-10-22 14:03:09 +02:00
Viktor Lofgren
81dd3809e9 (*) WIP Add node affinity to EC_DOMAIN
Very messy commit due to fractalline yak shaving
2023-10-19 17:48:34 +02:00
Viktor Lofgren
4baf9527d7 (*) WIP Control GUI redesign, executor-service, multi-node mq
This turned out to be very difficult to do in small isolated steps.

* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into to a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697 (*) Add node-affinity to services, processes and file storage. 2023-10-10 12:32:22 +02:00
Viktor Lofgren
8375237de5 (converter) Add special keyword for websites with a tilde url. 2023-10-09 17:02:32 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
e0cd3cd991 (converter) Alter StackexchangeSideloader's summary length to align with the rest of the system. 2023-09-26 12:19:43 +02:00
Viktor Lofgren
81ae501e73 (converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well 2023-09-25 18:28:34 +02:00
Viktor Lofgren
f797a92f87 (converter, minor) Use domain name in task heartbeat progress 2023-09-25 18:27:04 +02:00
Viktor Lofgren
a433bbbe45 (converter) Fix rare sentence extractor bug
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d (keyword-extraction) Chasing my tail looking for a bug 2023-09-24 19:39:48 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
ad660cf420 (converter) Bugfix: Don't try to Path.of() on optional field 2023-09-21 13:27:09 +02:00
Viktor Lofgren
70aa04c047 (converter, stackexchange-xml) Add the ability to sideload stackexchange data 2023-09-21 12:48:33 +02:00
Viktor Lofgren
d895f83520 (blocking-thread-pool) Move DumbThreadPool to its own micro-library
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb (converter) JavadocSpecialization should truncate its summary if it gets too long 2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028 (converter) DirtreeSideloader now trims /index.html from the URL if present
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc (converter) Make it possible to sideload documents from a directory tree 2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46 (crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f (converter) Write dummy processor log when sideloading 2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e (converter, control) Re-enable sideloading encyclopedia data 2023-09-14 12:12:07 +02:00
Viktor Lofgren
eaeb23d41e (refactor) Remove converting-model package completely 2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417 (converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup 2023-09-14 10:11:57 +02:00
Viktor Lofgren
4799dd769e (converting) WIP begin to remove converting-model and the old InstructionsCompiler 2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96 (converter,loader) Converter outputs parquet files instead of compressed json. 2023-09-13 16:13:41 +02:00
Viktor Lofgren
39c1857c61 (heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction. 2023-08-29 13:07:55 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
194a6057dd (index,control) Recoverable index backups 2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2 (db) Remove EC_URL and EC_PAGE_DATA from mariadb database 2023-08-25 13:45:03 +02:00
Viktor Lofgren
6a04cdfddf (loader) Implement new linkdb in loader
Deprecate the LoadUrl instruction entirely. We no longer need to be told upfront about which URLs to expect, as IDs are generated from the domain id and document ordinal.

For now, we no longer store new URLs in different domains.  We need to re-implement this somehow, probably in a different job or a as a different output.
2023-08-24 13:07:54 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
bf92c270dc (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:16:57 +02:00
Viktor Lofgren
e507844616 (language) Rollback language filter change a bit.
It appears to lead to too much junk in the lexicon.
2023-08-23 10:03:25 +02:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
fcfe07fb7d (valuator) Clean up code 2023-08-18 11:26:56 +02:00
Viktor Lofgren
ccf4990add (minor) Clean up code 2023-08-18 11:26:39 +02:00
Viktor Lofgren
f2638dd845 (feature-extractor) More adtech nonsense 2023-08-18 11:26:19 +02:00
Viktor Lofgren
239980ecae (minor) Improve comment 2023-08-18 11:26:05 +02:00
Viktor Lofgren
bee815b1c4 (converter) Add monsterinsights as an adtech tracker 2023-08-17 17:44:11 +02:00
Viktor Lofgren
e296b02649 (converter) Optimize LSH based within-domain deduplication 2023-08-17 17:43:46 +02:00
Viktor Lofgren
46d761f34f (language) fasttext based language filter 2023-08-16 15:48:12 +02:00
Viktor Lofgren
4598c7f40f (valuation) Penalize wordpress style kebab case urls 2023-08-16 13:11:24 +02:00
Viktor Lofgren
d8073f0dde (feature-extractor) Add mail.ru counter to non-adtech trackers 2023-08-15 19:10:43 +02:00
Viktor Lofgren
e7192a9cad (mq) Refactor mq and actor library and move it to libraries out of common 2023-08-15 10:53:23 +02:00
Viktor Lofgren
ce293029c7 (converter) Treat adtech tracking as advertisement. 2023-08-09 14:23:53 +02:00
Viktor Lofgren
4ab1cd9502 (*) last touches 2023-08-07 12:57:44 +02:00
Viktor
52e2ab45bf
Merge branch 'master' into master-control-program 2023-08-07 12:53:43 +02:00
Viktor Lofgren
2f8488610a (loader) Fix bug where trailing deferred domain meta inserts weren't executed 2023-07-31 14:23:23 +02:00
Viktor Lofgren
37c4cc68ed TODO 2023-07-31 10:34:42 +02:00
Viktor Lofgren
1c948eb3d8 (minor) Alter DumbThreadPool in Converter to not claim the threads are crawlers. 2023-07-31 10:33:15 +02:00
Viktor Lofgren
cd90ca820f YAGNI filter over ConverterDomainTypes 2023-07-31 10:32:47 +02:00
Viktor Lofgren
6f4e767a04 (minor) Re-enable monkey-patch-json for converter 2023-07-31 10:31:46 +02:00
Viktor Lofgren
730e8f74e4 (crawler) Even more memory optimizations.
* Fix minor resource leak in zstd streams
* Use pools for zstd streams
* Reduce the SSL session cache size
2023-07-30 14:19:55 +02:00
Viktor Lofgren
ee143bbc48 (crawler, converter) Fix so that DumbThreadPool actually waits for termination as intended. 2023-07-29 19:19:09 +02:00
Viktor Lofgren
d3f01bd171 (crawler, converter) Remove monkey patched gson from dependencies 2023-07-29 19:18:12 +02:00
Viktor Lofgren
f11103d31d (WIP) Make it possible to sideload encyclopedia data.
This is mostly a pilot track for sideloading other large websites.

Also change coverter to produce a more compact output (java serialization instead of json).
2023-07-28 18:14:43 +02:00
Viktor Lofgren
507f26ad47 (converter) Refactor converter to not keep instructions list in RAM.
(converter) Refactor converter to not keep instructions list in RAM.

(converter) Refactor converter to not keep instructions list in RAM.
2023-07-25 22:06:46 +02:00
Viktor Lofgren
fd44e09ebd (loader) Don't delete the entire link database when the loader runs 2023-07-24 18:37:35 +02:00
Viktor Lofgren
667b0ca0b0 (converter, WIP) Refactor CrawledDomainReader to not return iterators.
Instead return a closable class SerializableCrawlDataStream.
2023-07-24 16:28:30 +02:00
Viktor Lofgren
a56953c798 (converter, WIP) Refactor converter to not have to load everything into RAM. 2023-07-24 15:25:09 +02:00
Viktor Lofgren
f91d92cccb (crawler) WIP 2023-07-20 21:05:16 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
bca4bbb6c8 (*) Refactor MQ and MQSM 2023-07-17 13:57:32 +02:00
Viktor Lofgren
e618aa34e9 (control) Name change process->fsm, new fsm:s
* FSM for spawning processes when messages appear for them
* FSM for removing data flagged for purging
2023-07-17 12:27:27 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
74caf9e38a (processes) Remove forEach-constructs in favor of iterators. 2023-07-12 17:47:36 +02:00
Viktor Lofgren
ac2d7034db (minor) Bugfix in Path handling 2023-07-11 21:24:29 +02:00
Viktor Lofgren
3c7c77fe21 (minor) Bugfix in Path handling 2023-07-11 17:06:52 +02:00
Viktor Lofgren
4c016b0318 Process monitoring
* Also refactored the SQL tables a bit
2023-07-11 14:46:21 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor
0f9b90eb1c
Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 17:36:12 +02:00
Viktor Lofgren
98d1898610 Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:12:26 +02:00
Viktor Lofgren
1400fb4a9b Bugfix: Don't run the xenforo specialization on phpBB. 2023-07-06 18:11:19 +02:00
Adrthegamedev
78f21dd19a (an attempt to) Add wikidot to wiki generators list 2023-07-05 18:03:36 +02:00
Viktor Lofgren
2cb209ae9c Better wordpress fingerprinting 2023-07-05 18:03:36 +02:00
Viktor Lofgren
979a620ead Bugfix where DocumentGeneratorExtractor out of bounded for generators starting with 'microsoft' or 'adobe' but having no followup string. 2023-07-05 18:03:36 +02:00