Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
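A minimal sketch of the idea, with hypothetical names, assuming the raw bytes and resolved charset are kept on the record and only decoded to a String when a caller actually needs character data:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: hold the response body as bytes and decode lazily,
// instead of eagerly allocating a large String for every crawled document.
class CrawledDocumentSketch {
    private final byte[] documentBodyBytes;   // raw bytes straight from the fetcher
    private final Charset charset;            // charset resolved from Content-Type

    CrawledDocumentSketch(byte[] body, Charset charset) {
        this.documentBodyBytes = body;
        this.charset = charset != null ? charset : StandardCharsets.UTF_8;
    }

    /** Zero-copy access for consumers that can work on bytes directly. */
    byte[] documentBodyBytes() {
        return documentBodyBytes;
    }

    /** Decode on demand, only for callers that genuinely need a String. */
    String documentBody() {
        return new String(documentBodyBytes, charset);
    }
}
```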
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
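A hedged sketch of the preference order with illustrative names; the real comparator likely weighs additional signals:

```java
import java.util.Comparator;
import java.util.regex.Pattern;

// Hypothetical sketch of picking which duplicate to keep:
// HTTPS beats HTTP, a named domain beats a raw IP, and ties keep the original order.
class DuplicatePreferenceSketch {
    private static final Pattern IPV4 = Pattern.compile("^\\d{1,3}(\\.\\d{1,3}){3}$");

    record Result(String scheme, String host, int originalPosition) {}

    static final Comparator<Result> PREFERENCE = Comparator
            .<Result>comparingInt(r -> r.scheme().equals("https") ? 0 : 1)   // prefer https
            .thenComparingInt(r -> IPV4.matcher(r.host()).matches() ? 1 : 0)  // prefer named domains over IPs
            .thenComparingInt(Result::originalPosition);                      // otherwise keep presentation order
}
```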
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
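A minimal sketch of how such a setting might be read; the property name and default come from the change, while the lookup mechanism is an assumption:

```java
// Assumed lookup via a system property; the real configuration plumbing may differ.
class IndexSettingsSketch {
    static int valuationThreads() {
        return Integer.getInteger("index.valuationThreads", 16);
    }
}
```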
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
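A hedged sketch of the suffix check; the actual extension list and integration point are assumptions:

```java
import java.util.Set;

// Hypothetical sketch of filtering out links that point at obviously binary content.
class BinaryUrlFilterSketch {
    private static final Set<String> BINARY_SUFFIXES = Set.of(
            ".png", ".jpg", ".jpeg", ".gif", ".webp",
            ".zip", ".gz", ".tar", ".pdf", ".exe", ".iso");

    static boolean looksLikeBinary(String urlPath) {
        String lower = urlPath.toLowerCase();
        return BINARY_SUFFIXES.stream().anyMatch(lower::endsWith);
    }
}
```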
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This makes it possible to search only for the root document of a website, neat stuff ahead :D
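A self-contained, hypothetical sketch of the idea; the real flag and keyword-store APIs differ:

```java
import java.util.Map;

// Hypothetical sketch: tag a site's root document with a synthetic keyword
// so queries can target it directly. The bit value stands in for the real Synthetic flag.
class RootFlagSketch {
    static final long SYNTHETIC_BIT = 1L << 0;

    static void tagRootDocument(String urlPath, Map<String, Long> termMeta) {
        if ("/".equals(urlPath)) {
            termMeta.merge("special:root", SYNTHETIC_BIT, (a, b) -> a | b);
        }
    }
}
```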
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
Introduced optional alias domain functionality in the EdgeDomain class to handle domain variations such as "www" in the anchor tags code, since the atags data commonly contains relevant but near-miss domain references.
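A hedged sketch of what an alias lookup might look like, e.g. treating "www.example.com" and "example.com" as the same site when matching atags data; the real EdgeDomain API differs:

```java
import java.util.Optional;

// Hypothetical sketch of deriving an alias domain for matching purposes.
class DomainAliasSketch {
    static Optional<String> aliasDomain(String domain) {
        if (domain.startsWith("www.")) {
            return Optional.of(domain.substring(4));
        }
        return Optional.empty();
    }
}
```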
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
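A minimal sketch of the guard; the exporters' actual record types and control flow are assumptions:

```java
// Hypothetical sketch: skip anything that isn't served as text/html before parsing.
class HtmlOnlyGuardSketch {
    static boolean isProcessableHtml(String contentType) {
        return contentType != null
                && contentType.toLowerCase().startsWith("text/html");
    }
}
```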
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
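A simplified, hypothetical sketch of what an exact-span check means here: the matched positions must cover the whole span rather than merely fall inside it. The real DocumentSpan API differs:

```java
// Hypothetical sketch of an exact-range match over span boundaries.
class SpanMatchSketch {
    static boolean containsRangeExact(int spanStart, int spanEnd, int rangeStart, int rangeEnd) {
        return spanStart == rangeStart && spanEnd == rangeEnd;
    }
}
```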
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
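A minimal sketch of the normalization step; the real `normalizeUtf8` may cover more characters than shown here:

```java
// Hypothetical sketch: map en-dash (U+2013) to a plain hyphen to avoid encoding errors downstream.
class UrlNormalizationSketch {
    static String normalizeUtf8(String path) {
        return path.replace('\u2013', '-');
    }
}
```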
Fixes issue #109.
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
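A hedged sketch of the uniqueness test, with illustrative names: fragments are kept only when stripping them would collapse distinct feed items onto the same URL:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of deciding whether URI fragments carry identity for a feed's items.
class FragmentPolicySketch {
    static boolean fragmentsAreDisambiguating(List<String> itemUrls) {
        Set<String> withoutFragments = new HashSet<>();
        for (String url : itemUrls) {
            int hash = url.indexOf('#');
            withoutFragments.add(hash >= 0 ? url.substring(0, hash) : url);
        }
        // Collisions after stripping mean the fragments are meaningful and should be retained.
        return withoutFragments.size() < itemUrls.size();
    }
}
```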
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad-URLs table that prevents further attempts at fetching them.
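A minimal sketch of the dice roll; the actual probability is an assumption:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch: only a fraction of failures are recorded, which is enough to
// eventually suppress persistently broken URLs without bloating the table on transient errors.
class BadUrlSketch {
    static boolean shouldRecordAsBadUrl() {
        return ThreadLocalRandom.current().nextInt(10) == 0; // assumed ~1-in-10 chance
    }
}
```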
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This helps keep the feed fetcher from triggering bot mitigation that would otherwise accept the regular UA string.
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, and some effort was also put toward making the process boot sequence a bit more uniform. It's still somewhat heterogeneous between different processes, but a bit less so for now.
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
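A hedged sketch of the scheme; the offset constant and ordinal layout shown here are assumptions beyond the stated headroom of roughly 100k live-crawled documents per domain:

```java
// Hypothetical sketch: batch-loaded documents keep their low ordinals,
// while live-crawled documents sit above a large constant offset.
class LiveCrawlOrdinalSketch {
    static final int LIVE_CRAWL_ORDINAL_OFFSET = 1_000_000; // assumed value

    static int liveCrawlOrdinal(int documentIndex) {
        return LIVE_CRAWL_ORDINAL_OFFSET + documentIndex;
    }
}
```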
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the data makes it all the way to being indexable.
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
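Illustrative usage of the new tag; CI can then exclude tests tagged "flaky" while developers still run them locally:

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

// Hypothetical example test exercising an external service that does not reliably respond.
class ExternalEndpointTest {
    @Tag("flaky")
    @Test
    void fetchesRemoteFeed() {
        // test body depends on a remote endpoint and may fail intermittently
    }
}
```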
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
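A minimal sketch of the counter logic; the interval and naming are assumptions:

```java
// Hypothetical sketch: most runs do an incremental refresh, and every Nth run
// performs a full clean update.
class UpdateModeSketch {
    static final int CLEAN_UPDATE_INTERVAL = 10; // assumed interval

    static boolean shouldRunCleanUpdate(int updateCounter) {
        return updateCounter % CLEAN_UPDATE_INTERVAL == 0;
    }
}
```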