Viktor Lofgren
e67a9bdb91
(crawler) Migrate away from using OkHttp in the crawler, use Java's HttpClient instead.
2025-01-19 15:07:11 +01:00
Viktor Lofgren
567e4e1237
(crawler) Fast detection and bail-out for crawler traps
...
Improve logging and exclude robots.txt from this logic.
2025-01-18 15:28:54 +01:00
Viktor Lofgren
4342e42722
(crawler) Fast detection and bail-out for crawler traps
...
Nepenthes has been doing the rounds on social media. This adds an easy detection and mitigation mechanism for this type of trap, as sadly not all webmasters set up their robots.txt correctly. The out-of-the-box crawl limits will also deal with this type of attack, but this fix is faster.
2025-01-17 13:02:57 +01:00
Viktor Lofgren
1e770205a5
(search) Dyslexia fix
2025-01-12 20:40:14 +01:00
Viktor Lofgren
ef3f175ede
(search) Don't clobber the search query URL with default values
2025-01-10 15:57:30 +01:00
Viktor Lofgren
bbe4b5d9fd
Revert experimental changes
2025-01-10 15:52:02 +01:00
Viktor Lofgren
c67a635103
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:44:44 +01:00
Viktor Lofgren
20b24133fb
(search, experimental) Add a few debugging tracks to the search UI
2025-01-10 15:34:48 +01:00
Viktor Lofgren
f2567677e8
(index-client) Clean up index client code
...
Improve error handling. This should be a relatively rare case, but we don't want one bad index partition to blow up the entire query.
2025-01-10 15:17:07 +01:00
Viktor Lofgren
bc2c2061f2
(index-client) Clean up index client code
...
This should have the RPC stream reception be performed in parallel in separate threads, rather than blocking sequentially in the main thread, hopefully giving a slight performance boost.
2025-01-10 15:14:42 +01:00
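A minimal sketch of the idea behind bc2c2061f2 and f2567677e8 above, not the actual index client code: each partition is queried on its own thread, and a failing partition contributes no results instead of failing the whole query. The class name, method name and the use of `CompletableFuture` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Hypothetical sketch; none of these names are taken from the index client.
class ParallelPartitionQuery {

    /** Runs one task per partition concurrently and merges whatever succeeds. */
    static <T> List<T> queryAll(List<Supplier<List<T>>> partitions, ExecutorService pool) {
        List<CompletableFuture<List<T>>> pending = new ArrayList<>();
        for (Supplier<List<T>> partition : partitions) {
            CompletableFuture<List<T>> future = CompletableFuture
                    .supplyAsync(partition, pool)
                    // A failing partition yields an empty list rather than an exception
                    .exceptionally(error -> List.of());
            pending.add(future);
        }

        // Collect in submission order; join() only waits for tasks still running
        List<T> merged = new ArrayList<>();
        for (CompletableFuture<List<T>> future : pending) {
            merged.addAll(future.join());
        }
        return merged;
    }
}
```

Swallowing the error silently is a simplification; a real client would presumably log it.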
Viktor Lofgren
1c7f5a31a5
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:17:29 +01:00
Viktor Lofgren
59a8ea60f7
(search) Further reduce the number of db queries by adding more caching to DbDomainQueries.
2025-01-10 14:15:22 +01:00
Viktor Lofgren
aa9b1244ea
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:56:04 +01:00
Viktor Lofgren
2d17233366
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:53:56 +01:00
Viktor Lofgren
b245cc9f38
(search) Reduce the number of db queries a bit by caching data that doesn't change too often
2025-01-10 13:46:19 +01:00
Viktor Lofgren
6614d05bdf
(db) Make db pool size configurable
2025-01-09 20:20:51 +01:00
Viktor Lofgren
55aeb03c4a
(feeds) Replace rssreader based parsing with a custom jsoup based rss parser
...
This solves some issues with the rssreader based parser, which was very picky about the XML being valid. Jsoup is much more lenient when parsing malformed XML.
2025-01-09 18:29:55 +01:00
Viktor Lofgren
faa589962f
(live-capture) Browserless now requires a token
2025-01-09 14:51:11 +01:00
Viktor Lofgren
c7edd6b39f
(live-capture) Browserless now requires a token
2025-01-09 14:46:05 +01:00
Viktor Lofgren
79da622e3b
(search) Update front page with new banner about move
2025-01-08 21:38:19 +01:00
Viktor Lofgren
3da8337ba6
(feeds) Add system property for exporting fetched feeds to a slop table for debugging
2025-01-08 20:49:16 +01:00
Viktor Lofgren
3772bfd387
(query) Fix handling of optional ranking parameters
2025-01-08 17:11:22 +01:00
Viktor Lofgren
02a7900d1a
(search) Correct search-in-title toggle in search UI
2025-01-08 16:51:10 +01:00
Viktor Lofgren
a1fb92468f
(refac) Remove ResultRankingParameters, QueryLimits class and use protobuf classes directly instead
...
This is primarily to make the code a bit easier to reason about, and will reduce the level of indirection and data copying in the search-service->query-service->index-service communication chain.
2025-01-08 16:15:57 +01:00
Viktor Lofgren
b7f0a2a98e
(search-service) Fix metrics for errors and request times
...
This was previously in place, but broke during the jooby migration.
2025-01-08 14:10:43 +01:00
Viktor Lofgren
5fb76b2e79
(search-service) Fix metrics for errors and request times
...
This was previously in place, but broke during the jooby migration.
2025-01-08 14:06:03 +01:00
Viktor Lofgren
ad8c97f342
(search-service) Begin replacement of the crawl queue mechanism with node_affinity flagging
...
Previously a special db table held the domains slated for crawling. This is now deprecated; instead, each domain has a node_affinity flag that decides its indexing state: a value of -1 indicates it shouldn't be crawled, a value of 0 means it's slated for crawling by the next index partition to be crawled, and a positive value means it's assigned to an index partition.
The change set also adds a test case validating the modified behavior.
2025-01-08 13:25:56 +01:00
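The node_affinity values described in ad8c97f342 above could be read roughly as follows. This is a sketch only: the enum and method names are hypothetical, and treating every negative value like -1 is an assumption, since the commit message only names -1.

```java
// Hypothetical helper; only the value meanings come from the commit message.
enum IndexingState {
    NOT_CRAWLED,
    QUEUED_FOR_NEXT_PARTITION,
    ASSIGNED_TO_PARTITION;

    static IndexingState fromNodeAffinity(int nodeAffinity) {
        if (nodeAffinity > 0) {
            return ASSIGNED_TO_PARTITION;     // value is the index partition
        }
        if (nodeAffinity == 0) {
            return QUEUED_FOR_NEXT_PARTITION; // picked up by the next partition to crawl
        }
        return NOT_CRAWLED;                   // -1; other negatives are assumed equivalent
    }
}
```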
Viktor Lofgren
dc1b6373eb
(search-service) Clean up readme
2025-01-08 13:04:39 +01:00
Viktor Lofgren
983d6d067c
(search-service) Add indexing indicator to sibling domains listing
2025-01-08 12:58:34 +01:00
Viktor Lofgren
a84a06975c
(ranking-params) Add disable penalties flag to ranking params
...
This will help debugging ranking issues. Later it may be added to some filters.
2025-01-08 00:16:49 +01:00
Viktor Lofgren
d2864c13ec
(query-params) Add additional permitted query params
2025-01-07 20:21:44 +01:00
Viktor Lofgren
03ba53ce51
(legacy-search) Update nav bar with correct links
2025-01-07 17:44:52 +01:00
Viktor Lofgren
59e2dd4c26
(specialization) Soften length requirements for wiki-specialized documents (incl. cppreference)
2025-01-07 15:41:30 +01:00
Viktor Lofgren
ca1807caae
(specialization) Add new specialization for cppreference.com
...
Give this reference website some synthetically generated tokens to improve the likelihood of a good match.
2025-01-07 15:41:05 +01:00
Viktor Lofgren
26c20e18ac
(keyword-extraction) Soften constraints on keyword patterns, allowing for longer segmented words
2025-01-07 15:20:50 +01:00
Viktor Lofgren
7c90b6b414
(query) Don't blindly make tokens containing a colon into a non-ranking advice term
2025-01-07 15:18:05 +01:00
Viktor Lofgren
b63c54c4ce
(search) Update opensearch.xml to point to non-redirecting domains.
2025-01-07 00:23:09 +01:00
Viktor Lofgren
39e420de88
(search) Add wayback machine link to siteinfo
2025-01-06 20:33:10 +01:00
Viktor Lofgren
87d1c89701
(search) Add listing of sibling subdomains to site overview
2025-01-06 20:17:36 +01:00
Viktor Lofgren
a42a7769e2
(legacy-search) Remove legacy paperdoll class
2025-01-06 20:17:36 +01:00
Viktor Lofgren
2315fdc731
(search) Vendor rssreader and modify it to be able to consume the nlnet atom feed
...
Also dial down the logging a bit for the rssreader package.
2025-01-06 17:58:50 +01:00
Viktor Lofgren
b5469bd8a1
(search) Turn relative feed URLs absolute when dealing with RSS/Atom item URLs
2025-01-06 16:56:24 +01:00
Viktor Lofgren
6a6318d04c
(search) Add separate websiteUrl property to legacy service
2025-01-06 16:26:08 +01:00
Viktor Lofgren
55933f8d40
(search) Ensure we respect old URL contracts
...
/explore/random should be equivalent to /explore
2025-01-06 16:20:53 +01:00
Viktor Lofgren
45e771f96b
(api) Update the / API redirect to the new documentation stub.
2025-01-06 16:07:32 +01:00
Viktor Lofgren
8dde502cc9
Merge branch 'master' into serp-redesign
2025-01-05 23:33:35 +01:00
Viktor Lofgren
3e66767af3
(search) Adjust query parsing to trim tokens in quoted search terms
...
Quoted search queries that contained keywords with possessive 's endings were not returning any results, as the index does not retain that suffix, and the query parser was not stripping it away in this code path.
This solves issue #143.
2025-01-05 23:33:09 +01:00
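To illustrate the fix in 3e66767af3 above, here is a rough sketch of trimming possessive endings from the tokens of a quoted phrase so that they line up with an index that does not retain the suffix. The class and method are hypothetical and far simpler than the real query parser.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the query parser's actual code path.
class QuotedTermTrimmer {

    /** Splits a quoted phrase on whitespace and drops a trailing possessive 's. */
    static List<String> trimTokens(String quotedPhrase) {
        List<String> tokens = new ArrayList<>();
        for (String raw : quotedPhrase.trim().split("\\s+")) {
            String token = raw;
            // Handle both the ASCII apostrophe and the typographic one
            if (token.endsWith("'s") || token.endsWith("\u2019s")) {
                token = token.substring(0, token.length() - 2);
            }
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

With this, a quoted query for "bob's burgers" would be looked up as the tokens bob and burgers.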
Viktor Lofgren
9ec9d1b338
Merge branch 'master' into serp-redesign
2025-01-05 21:10:20 +01:00
Viktor Lofgren
dcad0d7863
(search) Tweak token formation.
2025-01-05 21:01:09 +01:00
Viktor Lofgren
94e1aa0baf
(search) Tweak token formation to still break apart emails in brackets.
2025-01-05 20:55:44 +01:00
Viktor Lofgren
b62f043910
(search) Adjust token formation rules to be more lenient to C++ and PHP code.
...
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor Lofgren
6ea22d0d21
(search) Update front page with work-in-progress note
2025-01-05 19:08:02 +01:00
Viktor Lofgren
8c69dc31b8
Merge branch 'master' into serp-redesign
2025-01-05 18:52:51 +01:00
Viktor Lofgren
00734ea87f
(search) Add hover text for matchogram
2025-01-05 18:50:44 +01:00
Viktor Lofgren
3009713db4
(search) Fix broken tests
2025-01-05 18:50:27 +01:00
Viktor Lofgren
a9e312b8b1
(service) Add links to marginalia-search.com where appropriate
2025-01-05 16:56:38 +01:00
Viktor Lofgren
4da3563d8a
(service) Clean up exceptions when requestScreengrab is not available
2025-01-04 14:45:51 +01:00
Viktor Lofgren
48d0a3089a
(service) Improve logging around grpc
...
This change adds a marker for the gRPC-specific logging, as well as improves the clarity and meaningfulness of the log messages.
2025-01-02 20:40:53 +01:00
Viktor Lofgren
594df64b20
(domain-info) Use appropriate sqlite database when fetching feed status
2025-01-02 20:20:36 +01:00
Viktor Lofgren
06efb5abfc
Merge branch 'master' into serp-redesign
2025-01-02 18:42:12 +01:00
Viktor Lofgren
78eb1417a7
(service) Only block on SingleNodeChannelPool creation in QueryClient
...
The code was always blocking for up to 5s while waiting for the remote end to become available, meaning some services would stall for several seconds on start-up for no sensible reason.
This should make most services start faster as a result.
2025-01-02 18:42:01 +01:00
Viktor Lofgren
8c8f2ad5ee
(search) Add an indicator when a link has a feed in the similar/linked domains views
2025-01-02 18:11:57 +01:00
Viktor Lofgren
f71e79d10f
(search) Add a copy of the old UI as a separate service, search-service-legacy
2025-01-02 18:03:42 +01:00
Viktor Lofgren
1b27c5cf06
(search) Add a copy of the old UI as a separate service, search-service-legacy
2025-01-02 18:02:17 +01:00
Viktor Lofgren
67edc8f90d
(domain-info) Only flag domains with rss feed items as having a feed
2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c
(query-parser) Strip leading underlines
...
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
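The change in 5f576b7d0c above might look something like the sketch below, assuming the indexed keyword carries no leading underscores. The helper name is made up for illustration.

```java
// Hypothetical sketch; the real query parser is not shown in the commit log.
class LeadingUnderscores {

    /** Removes any run of '_' at the start of a token, leaving the rest untouched. */
    static String strip(String token) {
        int start = 0;
        while (start < token.length() && token.charAt(start) == '_') {
            start++;
        }
        return token.substring(start);
    }
}
```

Under that assumption, `__builtin_ffs` is looked up as `builtin_ffs`, while a token such as `size_t` is left alone.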
Viktor Lofgren
8b05c788fd
(Search) Enable gzip compression of responses
2025-01-01 18:34:42 +01:00
Viktor Lofgren
236f033bc9
(Search) Reduce whitespace in explore view on all resolutions
2025-01-01 18:23:35 +01:00
Viktor Lofgren
510fc75121
(Search) Reduce whitespace in explorer view on mobile
2025-01-01 18:18:09 +01:00
Viktor Lofgren
0376f2e6e3
Merge branch 'master' into serp-redesign
...
# Conflicts:
# code/services-application/search-service/resources/templates/search/index/index.hdb
2025-01-01 18:15:09 +01:00
Viktor Lofgren
0b65164f60
(chore) Fix broken test
2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33
(domain-info) Add a feed flag to domain info
...
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
84f55b84ff
(search) Add experimental OPML-export function for feed subscriptions
2025-01-01 17:17:54 +01:00
Viktor Lofgren
ab5c30ad51
(search) Fix site info view for completely unknown domains
...
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
2025-01-01 16:29:01 +01:00
Viktor Lofgren
0c839453c5
(search) Fix crosstalk link
2025-01-01 16:09:19 +01:00
Viktor Lofgren
5e4c5d03ae
(search) Clean up breakpoints in site overview
2025-01-01 16:06:08 +01:00
Viktor Lofgren
710af4999a
(feed-fetcher) Add &quot; entity mapping in feed fetcher
2025-01-01 15:45:17 +01:00
Viktor Lofgren
a5b0a1ae62
(search) Move linked/similar domains to a popover style menu on mobile
...
Fix scroll
2025-01-01 15:37:35 +01:00
Viktor Lofgren
e9f71ee39b
(search) Move linked/similar domains to a popover style menu on mobile
2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd
(search) Reintroduce query rewriting for recipes, add rules for wikis and forums
2024-12-31 16:05:00 +01:00
Viktor Lofgren
0ea8092350
(search) Add link promoting the redesign beta
2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe
(crawler) Add a new system property crawler.maxFetchSize
...
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items that are too large, while the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
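A sketch of the two behaviours described in bae44497fe above. Only the property name `crawler.maxFetchSize` comes from the commit; the default value, the assumption that the limit is a byte count, and all class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical sketch; only the property name is taken from the commit message.
class FetchSizeLimit {

    /** Reads the limit, assumed to be in bytes; the 10 MB default is invented. */
    static long maxFetchSize() {
        return Long.getLong("crawler.maxFetchSize", 10L * 1024 * 1024);
    }

    /** Live-crawler style: an oversized body is rejected outright. */
    static Optional<byte[]> rejectIfTooLarge(byte[] body, long limit) {
        return body.length > limit ? Optional.empty() : Optional.of(body);
    }

    /** Big-crawler style: an oversized body is cut off at the limit. */
    static byte[] truncate(byte[] body, long limit) {
        int keep = (int) Math.max(0, Math.min(body.length, limit));
        return Arrays.copyOf(body, keep);
    }
}
```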
Viktor Lofgren
0d59202aca
(crawler) Do not remove W/-prefix on weak e-tags
...
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c
(live-crawler) Improve live crawler short-circuit logic
...
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0
(feed-fetcher) Make feed fetcher requests conditional
...
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
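As an illustration of 3bc99639a0 above, a conditional request could be assembled like this with Java's `java.net.http` API. Whether the feed fetcher used that API at this point is an assumption, and the class, method and parameter names are hypothetical. The stored e-tag is sent back verbatim, including any `W/` prefix, in line with 0d59202aca.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical sketch; not the feed fetcher's actual code.
class ConditionalFeedRequest {

    // HTTP-date in the fixed-width form, e.g. "Fri, 27 Dec 2024 14:10:15 GMT"
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    /** Either conditional value may be null, in which case its header is omitted. */
    static HttpRequest build(URI feedUri, String storedEtag, Instant feedDbCreated) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(feedUri).GET();
        if (storedEtag != null) {
            builder.header("If-None-Match", storedEtag); // sent back exactly as received
        }
        if (feedDbCreated != null) {
            builder.header("If-Modified-Since", HTTP_DATE.format(feedDbCreated));
        }
        return builder.build();
    }
}
```

A server that honours these headers can answer 304 Not Modified with no body, which is where the bandwidth saving comes from.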
Viktor Lofgren
927bc0b63c
(live-crawler) Add Accept-Encoding: gzip to outbound requests
...
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
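The decoding half of 927bc0b63c above can be sketched as follows, assuming the response body has already been read into a byte array. The class and method names are hypothetical; only `Accept-Encoding: gzip` comes from the commit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

// Hypothetical sketch; not the live crawler's actual decoding logic.
class GzipBody {

    /** Header value to send with outbound requests. */
    static final String ACCEPT_ENCODING = "gzip";

    /** Inflates the body if the server says it is gzip-encoded, else returns it as-is. */
    static byte[] decode(byte[] body, String contentEncoding) throws IOException {
        if (contentEncoding == null || !contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return body; // server ignored Accept-Encoding or used no compression
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return in.readAllBytes();
        }
    }
}
```

Checking `Content-Encoding` matters because servers are free to ignore the request header and reply uncompressed.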
Viktor Lofgren
d968801dc1
(converter) Drop feed data from SlopDomainRecord
...
Also remove feed extraction from converter. This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360
(crawler) Correct feed URLs in domain state db
...
Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004
(crawler) Improved feed discovery, new domain state db per crawlset
...
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.
Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:13:17 +01:00
Viktor Lofgren
81cdd6385d
Add rendering tests for most major views
...
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f
Correct dark mode for infobox in site focused search
2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea
Fix tests
2024-12-25 15:05:41 +01:00
Viktor Lofgren
41a59dcf45
(feed) Sanitize illegal HTML entities out of the feed XML before parsing
2024-12-25 14:53:28 +01:00
Viktor Lofgren
df4bc1d7e9
Add update time to front page subscriptions
2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75
Merge branch 'master' into serp-redesign
2024-12-25 14:22:42 +01:00
Viktor Lofgren
94d4d2edb7
(live-crawler) Add refresh date to feeds API
...
For now this is just the ctime for the feeds db. We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
56d14e56d7
(live-crawler) Improve LiveCrawlActor resilience to FeedService outages
2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f
(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler
2024-12-23 23:31:03 +01:00