Viktor Lofgren
67edc8f90d
(domain-info) Only flag domains with rss feed items as having a feed
2025-01-02 17:41:52 +01:00
Viktor Lofgren
5f576b7d0c
(query-parser) Strip leading underlines
...
This addresses issue #140, where __builtin_ffs gives no results.
2025-01-02 14:39:03 +01:00
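A minimal sketch of what stripping leading underlines from a query token could look like. It is not the query parser's actual code: the class and method names are hypothetical, and only the `__builtin_ffs` case comes from the commit message.

```java
final class QueryTokenCleaner {
    /**
     * Removes any run of leading '_' characters from a query token.
     * Underscores inside the token are left untouched.
     * (Hypothetical helper, not the real parser method.)
     */
    static String stripLeadingUnderscores(String token) {
        int i = 0;
        while (i < token.length() && token.charAt(i) == '_') {
            i++;
        }
        return token.substring(i);
    }
}
```

Under this sketch, `QueryTokenCleaner.stripLeadingUnderscores("__builtin_ffs")` yields `builtin_ffs`, which keeps the inner underscore intact.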
Viktor Lofgren
8b05c788fd
(Search) Enable gzip compression of responses
2025-01-01 18:34:42 +01:00
Viktor Lofgren
236f033bc9
(Search) Reduce whitespace in explore view on all resolutions
2025-01-01 18:23:35 +01:00
Viktor Lofgren
510fc75121
(Search) Reduce whitespace in explorer view on mobile
2025-01-01 18:18:09 +01:00
Viktor Lofgren
0376f2e6e3
Merge branch 'master' into serp-redesign
...
# Conflicts:
# code/services-application/search-service/resources/templates/search/index/index.hdb
2025-01-01 18:15:09 +01:00
Viktor Lofgren
0b65164f60
(chore) Fix broken test
2025-01-01 18:06:29 +01:00
Viktor Lofgren
9be477de33
(domain-info) Add a feed flag to domain info
...
This is a bit of a sketchy solution that requires both assistant services to run on the same physical machine.
2025-01-01 18:02:33 +01:00
Viktor Lofgren
84f55b84ff
(search) Add experimental OPML-export function for feed subscriptions
2025-01-01 17:17:54 +01:00
Viktor Lofgren
ab5c30ad51
(search) Fix site info view for completely unknown domains
...
Also correct the DbDomainQueries.getDomainId so that it throws NoSuchElementException when domain id is missing, and not UncheckedExecutionException via Cache.
2025-01-01 16:29:01 +01:00
Viktor Lofgren
0c839453c5
(search) Fix crosstalk link
2025-01-01 16:09:19 +01:00
Viktor Lofgren
5e4c5d03ae
(search) Clean up breakpoints in site overview
2025-01-01 16:06:08 +01:00
Viktor Lofgren
710af4999a
(feed-fetcher) Add &quot; entity mapping in feed fetcher
2025-01-01 15:45:17 +01:00
Viktor Lofgren
a5b0a1ae62
(search) Move linked/similar domains to a popover style menu on mobile
...
Fix scroll
2025-01-01 15:37:35 +01:00
Viktor Lofgren
e9f71ee39b
(search) Move linked/similar domains to a popover style menu on mobile
2025-01-01 15:23:25 +01:00
Viktor Lofgren
baeb4a46cd
(search) Reintroduce query rewriting for recipes, add rules for wikis and forums
2024-12-31 16:05:00 +01:00
Viktor Lofgren
0ea8092350
(search) Add link promoting the redesign beta
2024-12-30 15:47:13 +01:00
Viktor Lofgren
bae44497fe
(crawler) Add a new system property crawler.maxFetchSize
...
This gives the same upper limit to the live crawler and the big boy crawler, though the live crawler will reject items too large, and the big crawler will truncate at that point.
2024-12-30 15:10:11 +01:00
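The difference between the two crawlers' handling of the shared limit can be sketched as follows. Only the property name `crawler.maxFetchSize` comes from the commit. The class, the method names and the fallback value are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Optional;

final class FetchSizeLimit {
    /**
     * Reads the limit from the system property named in the commit.
     * The 10 MiB fallback is a placeholder, not the project's real default.
     */
    static int maxFetchSize() {
        return Integer.getInteger("crawler.maxFetchSize", 10 * 1024 * 1024);
    }

    /** Main-crawler style: keep at most 'limit' bytes of the body. */
    static byte[] truncate(byte[] body, int limit) {
        return body.length <= limit ? body : Arrays.copyOf(body, limit);
    }

    /** Live-crawler style: discard a body that exceeds the limit. */
    static Optional<byte[]> rejectIfTooLarge(byte[] body, int limit) {
        return body.length <= limit ? Optional.of(body) : Optional.empty();
    }
}
```

The property would be set on the JVM command line, for example `-Dcrawler.maxFetchSize=1048576`.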
Viktor Lofgren
0d59202aca
(crawler) Do not remove W/-prefix on weak e-tags
...
The server expects to get them back prefixed, as we received them.
2024-12-27 20:56:42 +01:00
Viktor Lofgren
0ca43f0c9c
(live-crawler) Improve live crawler short-circuit logic
...
We should not wait until we've fetched robots.txt to decide whether we have any data to fetch! This makes the live crawler very slow and leads to unnecessary requests.
2024-12-27 20:54:42 +01:00
Viktor Lofgren
3bc99639a0
(feed-fetcher) Make feed fetcher requests conditional
...
Add `If-None-Match` and `If-Modified-Since` headers as appropriate to the feed fetcher's requests. On well-configured web servers, this should short-circuit the request and reduce the amount of bandwidth and processing that is necessary.
A new table was added to the FeedDb to hold one etag per domain.
If-Modified-Since semantics are based on the creation date for the feed database, which should serve as a cutoff date for the earliest update we can have received.
This completes the changes for Issue #136.
2024-12-27 15:10:15 +01:00
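A hedged sketch of how such a conditional request might be assembled with the JDK's `java.net.http` API. The class name, method name and parameters are hypothetical, and the real feed fetcher may use a different HTTP client. The stored validator is echoed back unchanged, including any `W/` prefix, and the date is written in HTTP-date format.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ConditionalFeedRequest {
    // HTTP-date (IMF-fixdate) layout, always expressed in GMT.
    private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
            .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
            .withZone(ZoneOffset.UTC);

    /**
     * Builds a GET request that carries validators when they are available.
     * storedEtag and cutoff may be null, in which case the header is omitted.
     */
    static HttpRequest build(URI feedUri, String storedEtag, Instant cutoff) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(feedUri).GET();
        if (storedEtag != null) {
            // Send the ETag back exactly as it was received.
            builder.header("If-None-Match", storedEtag);
        }
        if (cutoff != null) {
            builder.header("If-Modified-Since", HTTP_DATE.format(cutoff));
        }
        return builder.build();
    }
}
```

A server that honours these headers can answer `304 Not Modified` with no body, which is where the bandwidth saving comes from.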
Viktor Lofgren
927bc0b63c
(live-crawler) Add Accept-Encoding: gzip to outbound requests
...
This change adds `Accept-Encoding: gzip` to all outbound requests from the live crawler and feed fetcher, and the corresponding decoding logic for the compressed response data.
The change addresses issue #136, save for making the fetcher's requests conditional.
2024-12-27 03:59:34 +01:00
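The decoding side of this change can be illustrated with a small sketch. `GzipBodyDecoder` and its signature are assumptions, not the project's code. The sketch only shows the general pattern of unwrapping a body when the response's `Content-Encoding` is `gzip`.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

final class GzipBodyDecoder {
    /**
     * Returns the decoded body. If the Content-Encoding header value is
     * absent or is not "gzip", the bytes are returned unchanged.
     */
    static byte[] decode(byte[] body, String contentEncoding) throws IOException {
        if (contentEncoding == null || !contentEncoding.trim().equalsIgnoreCase("gzip")) {
            return body;
        }
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(body))) {
            return in.readAllBytes();
        }
    }
}
```

A server is free to ignore `Accept-Encoding: gzip`, so the pass-through branch matters as much as the decompressing one.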
Viktor Lofgren
d968801dc1
(converter) Drop feed data from SlopDomainRecord
...
Also remove feed extraction from converter. This is the crawler's responsibility now.
2024-12-26 17:57:08 +01:00
Viktor Lofgren
89db69d360
(crawler) Correct feed URLs in domain state db
...
Discovered feed URLs were given a double slash after their domain name in the DB. This will go away in the URL normalizer, so the URLs are still viable, but the commit fixes the issue regardless.
2024-12-26 15:18:31 +01:00
Viktor Lofgren
895cee7004
(crawler) Improved feed discovery, new domain state db per crawlset
...
Feed discovery is improved by probing a few likely endpoints when no feed link tag is provided. To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.
Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
4bb71b8439
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:26:23 +01:00
Viktor Lofgren
e4a41f7dd1
(crawler) Correct content type probing to only run on URLs that are suspected to be binary
2024-12-26 14:13:17 +01:00
Viktor Lofgren
81cdd6385d
Add rendering tests for most major views
...
This will prevent accidentally deploying a broken search service
2024-12-25 15:22:26 +01:00
Viktor Lofgren
e76c42329f
Correct dark mode for infobox in site focused search
2024-12-25 15:06:05 +01:00
Viktor Lofgren
e6ef4734ea
Fix tests
2024-12-25 15:05:41 +01:00
Viktor Lofgren
41a59dcf45
(feed) Sanitize illegal HTML entities out of the feed XML before parsing
2024-12-25 14:53:28 +01:00
Viktor Lofgren
df4bc1d7e9
Add update time to front page subscriptions
2024-12-25 14:42:00 +01:00
Viktor Lofgren
2b222efa75
Merge branch 'master' into serp-redesign
2024-12-25 14:22:42 +01:00
Viktor Lofgren
94d4d2edb7
(live-crawler) Add refresh date to feeds API
...
For now this is just the ctime for the feeds db. We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
56d14e56d7
(live-crawler) Improve LiveCrawlActor resilience to FeedService outages
2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f
(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler
2024-12-23 23:31:03 +01:00
Viktor Lofgren
b66879ccb1
(feed) Add support for date discovery through atom:issued and atom:created
...
This is specifically to help parse monadnock.net's Atom feed.
2024-12-23 20:05:58 +01:00
Viktor Lofgren
0da2047eae
(live-capture) Correctly update processed count, disable poll rate adjustment based on freshness.
2024-12-23 15:56:27 +01:00
Viktor Lofgren
5ca8523220
(math) Reduce log error spam from null unit conversions
2024-12-21 18:51:45 +01:00
Viktor Lofgren
1118657ffd
(system) Supply local IP to service discovery if multiFace is enabled
2024-12-19 22:20:19 +01:00
Viktor Lofgren
b1f970152d
(system) To support configurations with multiple docker networks, bind to the "most local" interface.
...
Make the behavior optional.
2024-12-19 20:26:31 +01:00
Viktor Lofgren
e1783891ab
(system) To support configurations with multiple docker networks, bind to the "most local" interface.
2024-12-19 20:18:57 +01:00
Viktor Lofgren
8c963bd4ba
(feeds) Remove Content-Encoding: gzip from feed fetcher
...
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75
(feeds) Add per-domain throttling for feed fetcher.
2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639
(feeds) Make feed XML parsing more lenient
...
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
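One way to consume a byte order mark and leading whitespace before handing the document to an XML parser is sketched below. The class and method names are hypothetical, and the actual feed fetcher may do this at the byte or stream level instead.

```java
final class FeedXmlPreprocessor {
    /**
     * Drops any leading BOM (U+FEFF) and whitespace characters so the
     * document starts at its first meaningful character.
     */
    static String stripLeadingJunk(String xml) {
        int i = 0;
        while (i < xml.length()
                && (xml.charAt(i) == '\uFEFF' || Character.isWhitespace(xml.charAt(i)))) {
            i++;
        }
        return xml.substring(i);
    }
}
```

Strict XML parsers reject content before an XML declaration, which is why trimming this prefix makes parsing more lenient.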
Viktor Lofgren
b66fb9caf6
(feeds) Improve error handling in the feed fetcher.
2024-12-18 17:02:13 +01:00
Viktor Lofgren
6d18e6d840
(search) Add clustering to subscriptions view
2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209
(search) Exclude generated style.css from git
2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef
(search) Add site subscription feature that puts RSS updates on the front page
2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6
Refactor documentBody method and ContentType charset handling
...
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976
Add loader for slop data in converter.
...
Also alter CrawledDocument to not require String parsing of the underlying byte[] data. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1
Switch to new Slop format for crawl data storage and processing.
...
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8
Spike for storing crawl data in slop instead of parquet
...
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds. On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
c08203e2ed
(search) Prevent paperdoll from being run as a test by CI
2024-12-14 20:35:57 +01:00
Viktor Lofgren
86497fd32f
(site-info) Mobile layout fix
2024-12-14 16:19:56 +01:00
Viktor Lofgren
3b998573fd
Adjust colors on dark mode for site overview
2024-12-13 21:51:25 +01:00
Viktor Lofgren
e161882ec7
(search) Fix layout for light mode
2024-12-13 21:47:29 +01:00
Viktor Lofgren
357f349e30
(search) Table layout fixes for dictionary lookup
2024-12-13 21:47:08 +01:00
Viktor Lofgren
e4769f541d
(search) Sort and deduplicate search results for better relevance.
...
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
2024-12-13 21:47:08 +01:00
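The approach described here (pick the best representative of each duplicate group, then emit survivors in their original order) can be sketched as below. Everything in it is an assumption for illustration: the `Result` record, the use of a content hash as the duplicate key, and the scoring weights are not taken from the search service.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ResultDeduplicator {
    /** Hypothetical result type; contentHash stands in for the duplicate key. */
    record Result(String url, long contentHash) {}

    private static boolean isRawIp(String host) {
        return host != null && host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    /** Lower is better: plain HTTP and raw-IP hosts are penalised. */
    private static int badness(Result r) {
        URI uri = URI.create(r.url());
        int score = 0;
        if (!"https".equalsIgnoreCase(uri.getScheme())) score += 1;
        if (isRawIp(uri.getHost())) score += 2;
        return score;
    }

    static List<Result> deduplicate(List<Result> results) {
        // Pass 1: choose the least "bad" result for each duplicate key.
        Map<Long, Result> best = new HashMap<>();
        for (Result r : results) {
            best.merge(r.contentHash(), r, (a, b) -> badness(b) < badness(a) ? b : a);
        }
        // Pass 2: walk the input again so survivors keep their original order.
        List<Result> kept = new ArrayList<>();
        for (Result r : results) {
            if (best.get(r.contentHash()) == r) kept.add(r);
        }
        return kept;
    }
}
```

Splitting selection from emission is what keeps the user-facing order stable: the preference ordering only decides which duplicate survives, not where it appears.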
Viktor Lofgren
2a173e2861
(search) Dark Mode
2024-12-13 21:47:07 +01:00
Viktor Lofgren
a6a900266c
(search) Fix redirects
2024-12-13 02:40:51 +01:00
Viktor Lofgren
bdba53f055
(site) Update domain parameter type from PathParam to QueryParam
2024-12-13 02:15:35 +01:00
Viktor Lofgren
eb2fe18867
(sideload) Add LSH generation for sideloaded StackExchange data
...
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
2024-12-13 02:10:52 +01:00
Viktor Lofgren
a7468c8d23
(converter) Ensure paths are created for converter batch writer
2024-12-13 01:35:07 +01:00
Viktor Lofgren
fb2beb1eac
(converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data
2024-12-13 01:19:30 +01:00
Viktor Lofgren
0fb03e3d62
(export) Add logging to AtagExporter for error handling
2024-12-12 22:54:32 +01:00
Viktor Lofgren
67db3f295e
(index) Revert some optimization changes
2024-12-12 22:14:24 +01:00
Viktor Lofgren
dafaab3ef7
(index) Additional optimization pass
2024-12-12 18:57:33 +01:00
Viktor Lofgren
3f11ca409f
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 17:07:06 +01:00
Viktor Lofgren
694eed79ef
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:32:31 +01:00
Viktor Lofgren
4220169119
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:31:11 +01:00
Viktor Lofgren
bbdde789e7
Merge branch 'master' into serp-redesign
2024-12-11 19:45:17 +01:00
Viktor Lofgren
0a53ac68a0
Add specialization for steam store and GOG
2024-12-11 18:32:45 +01:00
Viktor Lofgren
eab61cd48a
Merge branch 'master' into serp-redesign
2024-12-11 17:09:27 +01:00
Viktor Lofgren
e65d75a0f9
(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets
2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d
(link-parser) Filter out URLs with binary file suffixes in LinkParser
...
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
a97c05107e
Add synthetic meta flag for root path documents
...
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
2024-12-11 16:10:44 +01:00
Viktor Lofgren
5002870d1f
(converter) Refactor sideloaders to improve feature handling and keyword logic
...
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
2024-12-11 16:01:38 +01:00
Viktor Lofgren
73861e613f
(ranking) Downtune score boost for unordered heading matches
2024-12-11 15:44:29 +01:00
Viktor Lofgren
0ce2ba9ad9
(jooby) Fix asset handler
2024-12-11 14:38:04 +01:00
Viktor Lofgren
3ddcebaa36
(search) Give serp/start a more consistent name to siteinfo/start
...
The change also cleans up the layout a bit.
2024-12-11 14:33:57 +01:00
Viktor Lofgren
b91463383e
(jooby) Clean up initialization process
2024-12-11 14:33:18 +01:00
Viktor Lofgren
7444a2f36c
(site-info) Add placeholder when a feed item lacks a title.
2024-12-10 22:46:12 +01:00
Viktor Lofgren
461bc3eb1a
(generator) Add special workaround to flag fextralife as a wiki
2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033
(rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking
2024-12-10 22:04:12 +01:00
Viktor Lofgren
fdee07048d
(search) Remove Spark and migrate to Jooby for the search service
2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761
(search) Adjust crosstalk flex-basis
2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434
(search) Add crosstalk to paperdoll
2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8
(search) Completely remove all old hdb templates
...
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9fc82574f0
(fingerprint) Add FluxGarden as a wiki generator
...
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9
Merge pull request #129 from MarginaliaSearch/atags-counts
...
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98
(live-crawler) Flag live crawled documents with a special keyword
2024-12-10 13:42:10 +01:00
Viktor Lofgren
3c2bb566da
(converter) Wipe the converter output path on initialization to avoid lingering stale data.
2024-12-10 13:41:05 +01:00
Viktor Lofgren
9287ee0141
(search) Improve hyphenation logic for titles
2024-12-09 15:29:10 +01:00
Viktor Lofgren
2769c8f869
(search) Remove sticky search bar to aid with performance on firefox (and iOS?)
2024-12-09 15:20:33 +01:00
Viktor Lofgren
ddb66f33ba
(search) Add more feedback when pressing some buttons
2024-12-09 15:07:23 +01:00
Viktor Lofgren
79500b8fbc
(search) Move search bar back up top on mobile, put filter button at the bottom instead.
2024-12-09 14:55:37 +01:00
Viktor Lofgren
187eea43a4
(search) Remove redundant @if
2024-12-09 14:46:02 +01:00
Viktor Lofgren
a89ed6fa9f
(search) Fix rendering on site overview, more dense serp layout on mobile
2024-12-09 14:45:45 +01:00
Viktor Lofgren
e0c0ed27bc
(keyword-extraction) Clean up code and add tests for position and spans calculation
...
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
20abb91657
(loader) Correct DocumentLoaderService to properly do bulk inserts
...
Fixes issue #128
2024-12-08 13:12:52 +01:00
Viktor Lofgren
291ca8daf1
(converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
...
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
8d168be138
(search) Typeahead search, etc.
2024-12-07 15:47:01 +01:00
Viktor Lofgren
6e1aa7b391
(search) Make style.css depend on jte file changes
...
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516
(search) Clean up start views for search and site-info
2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a
(search) Add proper tailwind build and host fontawesome locally
2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3
(explore) Add lazy loading and alt attributes to images
2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483
(site-info) Add whitespace-nowrap to pubDay span in overview.jte
2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e
(serp) Add wayback link to search results
2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f
(site) Adjust sizing of navbars
2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353
(site) Layout changes site-info
2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196
(site) Mobile layout fixes
2024-12-07 14:11:16 +01:00
Viktor Lofgren
ee2d5496d0
Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
...
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94
(experiment) Modify atags exporter to permit duplicates from different source domains
...
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fb75a3827d
(site) Adjust coloration of search results
2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a
(site) Make SearchParameters generate relative URLs instead of absolute
2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a
(site-info) Increase contrast in search results for forums, wikis
2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a
(site-info) Fix layout
2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78
(site-info) Fix pagination in backlinks and documents views
2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526
(site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
...
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236
(site-info) Make the search box in the site viewer functional
2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764
(site-info) Only show samples if feed is absent, never both.
2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9
(serp) Layout fixes for mobile
2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c
(WIP) Initial semi-working transformation to new tailwind UI
...
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
fdc3efa250
(setup) Remove OpenNLP tokenization model
...
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
c97c66a41c
(ranking) Reduce the verbatim score multiplier
2024-11-28 13:37:11 +01:00
Viktor Lofgren
7b64377fd6
(ranking) Promote documents with multiple phrase matches with a log-scale bonus
2024-11-28 13:36:56 +01:00
Viktor Lofgren
e11ebf18e5
(span) Correct intersection counting logic, add comprehensive tests
2024-11-28 13:36:25 +01:00
Viktor Lofgren
ba47d72bf4
(ranking) Adjust scores for external link matches
2024-11-27 14:27:23 +01:00
Viktor Lofgren
52bc0272f8
(atag) Add alias domain support and improve domain handling
...
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
2024-11-27 14:26:44 +01:00
Viktor Lofgren
d4bce13a03
(export) Add export actors to precession
...
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
2024-11-26 15:07:03 +01:00
Viktor Lofgren
b9842b57e0
(encyclopedia-sideloader) Add test suite and clean up urlencoding logic
2024-11-26 13:34:15 +01:00
Viktor Lofgren
95776e9bee
(encyclopedia) Fix commit gore resulting in bad SQL query
2024-11-26 12:44:49 +01:00
Viktor Lofgren
077d8dcd11
(result-score) Adjust ranking parameters a tiny bit
2024-11-25 18:30:59 +01:00
Viktor Lofgren
9ec41e27c6
(keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended
2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f
(minor) Remove delombok debris
2024-11-25 18:29:21 +01:00
Viktor Lofgren
6d7998e349
(index) Correct behavior of debug function positionValues(), which was misleadingly incorrect
2024-11-25 18:28:53 +01:00
Viktor Lofgren
7d1ef08a0f
(index) Correct ranking bonus for external linktext appearances
2024-11-25 17:40:15 +01:00
Viktor Lofgren
3ec9c4c5fa
(export) Filter non-HTML documents in exporters
...
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
2024-11-25 15:06:42 +01:00
Viktor Lofgren
0b6b5dab07
(index) Add score bonuses for single-word anchor tag spans
...
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
2024-11-25 14:44:41 +01:00
Viktor Lofgren
ff17473105
Fix UTF-8 URL normalization issue in sideloader.
...
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
Fixes issue #109.
2024-11-25 14:25:47 +01:00
Viktor Lofgren
dc5f97e737
(index) Add bonus for single-word title matches when the title is also a single word
2024-11-25 13:24:12 +01:00
Viktor Lofgren
d919179ba3
(index) Correct off-by-1 error in DocumentSpan.containsRange
2024-11-25 13:24:03 +01:00
Viktor Lofgren
f09669a5b0
(index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size()
...
The latter counts the number of spans, and is not what you want here.
2024-11-25 13:11:55 +01:00
Viktor Lofgren
b3b0f6fed3
(actor) Add side-load profile to PROC_CONVERTER_SPAWNER.
...
This fell off during the profile split, but is necessary for sideloading.
2024-11-25 12:40:14 +01:00
Viktor Lofgren
88caca60f9
(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list
2024-11-23 17:07:16 +01:00
Viktor Lofgren
923ebbac81
(feeds) Add logic to handle URI fragments in feed items
...
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
2024-11-23 16:38:56 +01:00
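The commit does not spell out the uniqueness rule. One plausible reading is to strip fragments unless doing so would make distinct item URLs collide. The sketch below implements that assumed rule, and all names in it are hypothetical.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

final class FeedFragmentPolicy {
    /** Removes everything from the first '#' onward, if present. */
    static String stripFragment(String url) {
        int idx = url.indexOf('#');
        return idx < 0 ? url : url.substring(0, idx);
    }

    /** True when dropping fragments would merge distinct item URLs (assumed rule). */
    static boolean shouldKeepFragments(List<String> itemUrls) {
        Set<String> withFragments = new HashSet<>(itemUrls);
        Set<String> withoutFragments = itemUrls.stream()
                .map(FeedFragmentPolicy::stripFragment)
                .collect(Collectors.toSet());
        return withoutFragments.size() < withFragments.size();
    }

    /** Returns the item URLs with fragments stripped only where that is safe. */
    static List<String> normalize(List<String> itemUrls) {
        if (shouldKeepFragments(itemUrls)) {
            return itemUrls;
        }
        return itemUrls.stream()
                .map(FeedFragmentPolicy::stripFragment)
                .collect(Collectors.toList());
    }
}
```

Under this rule, a feed whose items are `https://example.com/log#a` and `https://example.com/log#b` keeps its fragments, while one whose items are `https://example.com/1#top` and `https://example.com/2#top` has them removed.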
Viktor Lofgren
552b246099
(live-crawl) Improve error handling for errors during robots.txt-retrieval
...
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
Viktor Lofgren
80e6d0069c
(live-crawl-actor) Clear index journal before starting live crawl
...
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
2024-11-22 14:04:57 +01:00
Viktor Lofgren
b941604135
(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.
2024-11-22 13:58:57 +01:00