This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-servi->query-service->index-service communication chain.
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.
This solves issue #143.
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.
This should make most services start faster as a result.
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
Some refactoring is still needed, but an dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.