Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
Fixes issue #109.
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are on a dice roll added to a bad URLs table, that prevents further attempts at fetching them.
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.