Viktor Lofgren
461bc3eb1a
(generator) Add special workaround to flag fextralife as a wiki
2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033
(rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking
2024-12-10 22:04:12 +01:00
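The rule in cf7f84f033 can be pictured with a minimal sketch. It assumes a convention where penalties are non-positive and the domain rank bonus is non-negative; the class and method names are hypothetical and are not taken from the repository.

```java
public final class DomainRankBonusSketch {
    private DomainRankBonusSketch() {}

    /**
     * Combine penalties with the domain rank bonus.
     *
     * Assumed convention (not from the source): penalties <= 0, rankBonus >= 0,
     * and a larger result means a better ranking.
     */
    public static double adjust(double penalties, double rankBonus) {
        // The bonus may offset penalties, but the net adjustment is capped at zero,
        // so it can never push a document above its unpenalized score.
        return Math.min(0.0, penalties + rankBonus);
    }
}
```

With these assumptions, a bonus of 1.0 against penalties of -3.0 leaves -2.0, while a bonus of 5.0 against penalties of -1.0 is clamped to 0.0.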
Viktor Lofgren
fdee07048d
(search) Remove Spark and migrate to Jooby for the search service
2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761
(search) Adjust crosstalk flex-basis
2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434
(search) Add crosstalk to paperdoll
2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8
(search) Completely remove all old hdb templates
...
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9fc82574f0
(fingerprint) Add FluxGarden as a wiki generator
...
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9
Merge pull request #129 from MarginaliaSearch/atags-counts
...
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98
(live-crawler) Flag live crawled documents with a special keyword
2024-12-10 13:42:10 +01:00
Viktor Lofgren
3c2bb566da
(converter) Wipe the converter output path on initialization to avoid lingering stale data.
2024-12-10 13:41:05 +01:00
Viktor Lofgren
9287ee0141
(search) Improve hyphenation logic for titles
2024-12-09 15:29:10 +01:00
Viktor Lofgren
2769c8f869
(search) Remove sticky search bar to aid with performance on firefox (and iOS?)
2024-12-09 15:20:33 +01:00
Viktor Lofgren
ddb66f33ba
(search) Add more feedback when pressing some buttons
2024-12-09 15:07:23 +01:00
Viktor Lofgren
79500b8fbc
(search) Move search bar back up top on mobile, put filter button at the bottom instead.
2024-12-09 14:55:37 +01:00
Viktor Lofgren
187eea43a4
(search) Remove redundant @if
2024-12-09 14:46:02 +01:00
Viktor Lofgren
a89ed6fa9f
(search) Fix rendering on site overview, more dense serp layout on mobile
2024-12-09 14:45:45 +01:00
Viktor Lofgren
e0c0ed27bc
(keyword-extraction) Clean up code and add tests for position and spans calculation
...
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
20abb91657
(loader) Correct DocumentLoaderService to properly do bulk inserts
...
Fixes issue #128
2024-12-08 13:12:52 +01:00
Viktor Lofgren
291ca8daf1
(converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
...
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
8d168be138
(search) Typeahead search, etc.
2024-12-07 15:47:01 +01:00
Viktor Lofgren
6e1aa7b391
(search) Make style.css depend on jte file changes
...
Also add a hack to ensure classes generated from java code get included in the stylesheet as intended.
2024-12-07 14:11:22 +01:00
Viktor Lofgren
deab9b9516
(search) Clean up start views for search and site-info
2024-12-07 14:11:22 +01:00
Viktor Lofgren
39d99a906a
(search) Add proper tailwind build and host fontawesome locally
2024-12-07 14:11:22 +01:00
Viktor Lofgren
6f72e6e0d3
(explore) Add lazy loading and alt attributes to images
2024-12-07 14:11:22 +01:00
Viktor Lofgren
d786d79483
(site-info) Add whitespace-nowrap to pubDay span in overview.jte
2024-12-07 14:11:22 +01:00
Viktor Lofgren
01510f6c2e
(serp) Add wayback link to search results
2024-12-07 14:11:22 +01:00
Viktor Lofgren
7ba43e9e3f
(site) Adjust sizing of navbars
2024-12-07 14:11:16 +01:00
Viktor Lofgren
97bfcd1353
(site) Layout changes site-info
2024-12-07 14:11:16 +01:00
Viktor Lofgren
aa3c85c196
(site) Mobile layout fixes
2024-12-07 14:11:16 +01:00
Viktor Lofgren
ee2d5496d0
Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
...
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94
(experiment) Modify atags exporter to permit duplicates from different source domains
...
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fb75a3827d
(site) Adjust coloration of search results
2024-12-05 16:58:00 +01:00
Viktor Lofgren
7d546d0e2a
(site) Make SearchParameters generate relative URLs instead of absolute
2024-12-05 16:47:22 +01:00
Viktor Lofgren
8fcb6ffd7a
(site-info) Increase contrast in search results for forums, wikis
2024-12-05 16:42:16 +01:00
Viktor Lofgren
f97de0c15a
(site-info) Fix layout
2024-12-05 16:33:46 +01:00
Viktor Lofgren
be9e192b78
(site-info) Fix pagination in backlinks and documents views
2024-12-05 16:26:11 +01:00
Viktor Lofgren
75ae1c9526
(site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0
...
This indicates the domain is already slated for crawling.
2024-12-05 16:18:46 +01:00
Viktor Lofgren
33761a0236
(site-info) Make the search box in the site viewer functional
2024-12-05 16:16:29 +01:00
Viktor Lofgren
19b69b1764
(site-info) Only show samples if feed is absent, never both.
2024-12-05 16:05:03 +01:00
Viktor Lofgren
8b804359a9
(serp) Layout fixes for mobile
2024-12-05 15:59:33 +01:00
Viktor Lofgren
f050bf5c4c
(WIP) Initial semi-working transformation to new tailwind UI
...
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
2024-12-05 14:00:17 +01:00
Viktor Lofgren
fdc3efa250
(setup) Remove OpenNLP tokenization model
...
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
c97c66a41c
(ranking) Reduce the verbatim score multiplier
2024-11-28 13:37:11 +01:00
Viktor Lofgren
7b64377fd6
(ranking) Promote documents with multiple phrase matches with a log-scale bonus
2024-11-28 13:36:56 +01:00
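One way a log-scale bonus like the one in 7b64377fd6 could look is sketched below. The formula, the weight parameter, and the names are assumptions for illustration; the commit message does not state the actual expression.

```java
public final class PhraseMatchBonusSketch {
    private PhraseMatchBonusSketch() {}

    /**
     * Hypothetical bonus: zero for a single match, then growing with the
     * natural logarithm of the match count so repeated matches give
     * diminishing returns.
     */
    public static double bonus(int phraseMatches, double weight) {
        if (phraseMatches <= 1) {
            return 0.0; // only "multiple" matches are promoted
        }
        return weight * Math.log(phraseMatches);
    }
}
```

The logarithm keeps a document with a hundred matches from dwarfing one with a handful, which is the usual motivation for a log scale.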
Viktor Lofgren
e11ebf18e5
(span) Correct intersection counting logic, add comprehensive tests
2024-11-28 13:36:25 +01:00
Viktor Lofgren
ba47d72bf4
(ranking) Adjust scores for external link matches
2024-11-27 14:27:23 +01:00
Viktor Lofgren
52bc0272f8
(atag) Add alias domain support and improve domain handling
...
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
2024-11-27 14:26:44 +01:00
Viktor Lofgren
d4bce13a03
(export) Add export actors to precession
...
Adding a tracking message to the export actor means it's possible to run them in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
2024-11-26 15:07:03 +01:00
Viktor Lofgren
b9842b57e0
(encyclopedia-sideloader) Add test suite and clean up urlencoding logic
2024-11-26 13:34:15 +01:00
Viktor Lofgren
95776e9bee
(encyclopedia) Fix commit gore resulting in bad SQL query
2024-11-26 12:44:49 +01:00
Viktor Lofgren
077d8dcd11
(result-score) Adjust ranking parameters a tiny bit
2024-11-25 18:30:59 +01:00
Viktor Lofgren
9ec41e27c6
(keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended
2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f
(minor) Remove delombok debris
2024-11-25 18:29:21 +01:00
Viktor Lofgren
6d7998e349
(index) Correct behavior of debug function positionValues(), which was misleadingly incorrect
2024-11-25 18:28:53 +01:00
Viktor Lofgren
7d1ef08a0f
(index) Correct ranking bonus for external linktext appearances
2024-11-25 17:40:15 +01:00
Viktor Lofgren
3ec9c4c5fa
(export) Filter non-HTML documents in exporters
...
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
2024-11-25 15:06:42 +01:00
Viktor Lofgren
0b6b5dab07
(index) Add score bonuses for single-word anchor tag spans
...
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
2024-11-25 14:44:41 +01:00
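The difference between a containment check and an exact span match, as described in 0b6b5dab07, might be sketched as follows. The data layout (half-open start/end pairs in a flat array) and the class name are assumptions; only the method names `containsRange` and `containsRangeExact` come from the commit messages in this log.

```java
public final class DocumentSpanSketch {
    // Assumed layout: start0, end0, start1, end1, ... with half-open [start, end) spans
    private final int[] bounds;

    public DocumentSpanSketch(int... bounds) {
        if (bounds.length % 2 != 0) {
            throw new IllegalArgumentException("bounds must be start/end pairs");
        }
        this.bounds = bounds.clone();
    }

    /** True if [start, start+len) lies entirely inside some span. */
    public boolean containsRange(int start, int len) {
        for (int i = 0; i < bounds.length; i += 2) {
            if (bounds[i] <= start && start + len <= bounds[i + 1]) {
                return true;
            }
        }
        return false;
    }

    /** True only if [start, start+len) coincides with a whole span. */
    public boolean containsRangeExact(int start, int len) {
        for (int i = 0; i < bounds.length; i += 2) {
            if (bounds[i] == start && bounds[i + 1] == start + len) {
                return true;
            }
        }
        return false;
    }
}
```

Under this layout, a single-word anchor text is a span of length 1, so a one-word query matches it exactly only when the query position is the span's sole position.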
Viktor Lofgren
ff17473105
Fix UTF-8 URL normalization issue in sideloader.
...
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
Fixes issue #109.
2024-11-25 14:25:47 +01:00
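The en-dash replacement described in ff17473105 amounts to a single character substitution. The sketch below is an illustration under that reading, with a hypothetical class and method name; it is not the `normalizeUtf8` method from the repository.

```java
public final class UrlDashSketch {
    private UrlDashSketch() {}

    /** Replace the en-dash (U+2013) with an ASCII hyphen-minus. */
    public static String normalizeDashes(String path) {
        return path.replace('\u2013', '-');
    }
}
```

For example, a path written with an en-dash between two names comes out with a plain hyphen, and paths without an en-dash are returned unchanged.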
Viktor Lofgren
dc5f97e737
(index) Add bonus for single-word title matches when the title is also a single word
2024-11-25 13:24:12 +01:00
Viktor Lofgren
d919179ba3
(index) Correct off-by-1 error in DocumentSpan.containsRange
2024-11-25 13:24:03 +01:00
Viktor Lofgren
f09669a5b0
(index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size()
...
The latter counts the number of spans, and is not what you want here.
2024-11-25 13:11:55 +01:00
Viktor Lofgren
b3b0f6fed3
(actor) Add side-load profile to PROC_CONVERTER_SPAWNER.
...
This fell off during the profile split, but is necessary for sideloading.
2024-11-25 12:40:14 +01:00
Viktor Lofgren
88caca60f9
(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list
2024-11-23 17:07:16 +01:00
Viktor Lofgren
923ebbac81
(feeds) Add logic to handle URI fragments in feed items
...
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
2024-11-23 16:38:56 +01:00
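A possible reading of the fragment rule in 923ebbac81 is: strip fragments unless doing so would make two feed items share a URL. The sketch below implements that reading only; the decision criterion and all names are assumptions, since the commit message does not spell out the exact test.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public final class FeedFragmentSketch {
    private FeedFragmentSketch() {}

    static String stripFragment(String url) {
        int idx = url.indexOf('#');
        return idx < 0 ? url : url.substring(0, idx);
    }

    /** Fragments are kept if stripping them would make two item URLs collide. */
    public static boolean shouldKeepFragments(List<String> urls) {
        Set<String> seen = new HashSet<>();
        for (String url : urls) {
            if (!seen.add(stripFragment(url))) {
                return true; // fragment is what tells these items apart
            }
        }
        return false;
    }

    /** Returns the item URLs, with fragments stripped where that is safe. */
    public static List<String> cleanUrls(List<String> urls) {
        if (shouldKeepFragments(urls)) {
            return new ArrayList<>(urls);
        }
        List<String> cleaned = new ArrayList<>();
        for (String url : urls) {
            cleaned.add(stripFragment(url));
        }
        return cleaned;
    }
}
```

A feed whose items all point into one page (a changelog with per-entry anchors, say) keeps its fragments, while a feed that merely appends `#comments` to distinct post URLs gets clean URLs.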
Viktor Lofgren
552b246099
(live-crawl) Improve error handling for errors during robots.txt-retrieval
...
Reduce log-spam and don't treat errors other than 404 as "all is permitted".
2024-11-22 14:15:32 +01:00
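The policy change in 552b246099 can be illustrated with a small status-to-decision mapping. The enum, the method, and the choice to skip the domain on other errors are assumptions for the sketch; the commit message only says that errors other than 404 are no longer treated as "all is permitted".

```java
public final class RobotsFetchSketch {
    private RobotsFetchSketch() {}

    public enum Policy { PARSE_RULES, ALLOW_ALL, SKIP_DOMAIN }

    /** Decide how to proceed based on the HTTP status of the robots.txt fetch. */
    public static Policy policyFor(int httpStatus) {
        if (httpStatus == 200) {
            return Policy.PARSE_RULES; // a robots.txt exists, obey it
        }
        if (httpStatus == 404) {
            return Policy.ALLOW_ALL;   // no robots.txt, nothing is forbidden
        }
        // Assumed behavior: any other outcome (5xx, 403, ...) is treated as
        // "unknown" and the domain is left alone rather than crawled freely.
        return Policy.SKIP_DOMAIN;
    }
}
```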
Viktor Lofgren
80e6d0069c
(live-crawl-actor) Clear index journal before starting live crawl
...
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
2024-11-22 14:04:57 +01:00
Viktor Lofgren
b941604135
(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.
2024-11-22 13:58:57 +01:00
Viktor Lofgren
52eb5bc84f
(live-crawler) Keep track of bad URLs
...
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are added, on a dice roll, to a bad URLs table that prevents further attempts at fetching them.
2024-11-22 00:55:46 +01:00
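The dice-roll idea in 52eb5bc84f could be sketched as below. The in-memory set stands in for the bad URLs table, and the probability parameter and all names are hypothetical; the commit message does not give the actual odds or schema.

```java
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

public final class BadUrlSketch {
    private final Set<String> badUrls = new HashSet<>(); // stand-in for the table
    private final Random random;
    private final double flagProbability;

    public BadUrlSketch(Random random, double flagProbability) {
        this.random = random;
        this.flagProbability = flagProbability;
    }

    /** Called when a fetch fails; flags the URL with the configured probability. */
    public void onFetchFailure(String url) {
        if (random.nextDouble() < flagProbability) {
            badUrls.add(url);
        }
    }

    /** Flagged URLs are not attempted again. */
    public boolean shouldFetch(String url) {
        return !badUrls.contains(url);
    }
}
```

Flagging probabilistically rather than on the first failure gives a URL that failed transiently a few more chances, while persistently broken URLs are still weeded out after a handful of attempts.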
Viktor Lofgren
4d23fe6261
(feeds) Simplify RSS User-Agent header
...
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.
2024-11-21 16:43:56 +01:00
Viktor Lofgren
14519294d2
Merge branch 'master' into live-search
2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0
(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents
...
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them. Some effort was also put toward making the boot sequence of the processes a bit more uniform. It's still a bit heterogeneous between different processes, but less so than before.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3
(model) Fix resource leak in partially read crawl data streams.
...
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
47dfbacb00
(conf) Introduce a new concept of node profiles
...
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
2024-11-20 18:15:22 +01:00
Viktor Lofgren
f94911541a
(live-crawl) Reduce the risk of id collisions with the main indexes
...
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
2024-11-20 16:01:10 +01:00
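The offset scheme in f94911541a can be sketched with made-up numbers. The 26-bit ordinal width and the derived offset are assumptions chosen so that 100k ordinals remain above the offset; the commit message gives only the 100k figure, not the actual constant or bit layout.

```java
public final class LiveOrdinalSketch {
    private LiveOrdinalSketch() {}

    static final int ORDINAL_BITS = 26;                         // assumed width
    static final int MAX_ORDINAL = (1 << ORDINAL_BITS) - 1;
    static final int LIVE_CAPACITY = 100_000;                   // from the commit message
    static final int LIVE_OFFSET = MAX_ORDINAL + 1 - LIVE_CAPACITY; // hypothetical constant

    /** Map the n-th live crawled document of a domain to its ordinal. */
    public static int liveOrdinal(int n) {
        if (n < 0 || n >= LIVE_CAPACITY) {
            throw new IllegalArgumentException("live crawl ordinal out of range: " + n);
        }
        // Regular crawl ordinals start at 0 and count up, so placing live
        // documents at the top of the range keeps the two apart unless a
        // domain has close to the maximum number of regular documents.
        return LIVE_OFFSET + n;
    }
}
```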
Viktor Lofgren
89d8af640d
(live-crawl) Rename the live crawler code module to be more consistent with the other processes
2024-11-20 15:55:15 +01:00
Viktor Lofgren
6e4252cf4c
(live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing.
...
Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.
2024-11-20 15:36:25 +01:00
Viktor Lofgren
79ce4de2ab
(model) Remove deprecated fields from CrawledDocument and CrawledDomain
2024-11-20 15:27:05 +01:00
Viktor Lofgren
d6575dfee4
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place, as is a process that crawls URLs from the livecapture service's RSS endpoints; the crawled data makes it all the way to being indexable.
2024-11-19 21:00:18 +01:00
Viktor Lofgren
a91ab4c203
(live-crawler) Crude first-try process for live crawling #WIP
...
Some refactoring is still needed, but a dummy actor is in place, as is a process that crawls URLs from the livecapture service's RSS endpoints; the crawled data makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
6a3079a167
(search) Fix missing getter for proto
2024-11-18 21:05:22 +01:00
Viktor Lofgren
c728a1e2f2
(rss) Add endpoint for extracting URLs changed within a timespan.
2024-11-18 14:59:32 +01:00
Viktor Lofgren
d874d76a09
(rss) Add an endpoint that can be used for identifying when RSS data has changed
2024-11-18 14:22:17 +01:00
Viktor Lofgren
41c11be075
(status) Clean up the status page a bit
2024-11-17 20:00:44 +01:00
Viktor Lofgren
163ce19846
(test) Tag status service endpoint tests as flaky
...
These tests have outside dependencies that inherently make them unreliable and unsuitable for CI.
2024-11-17 19:48:01 +01:00
Viktor Lofgren
9eb16cb667
(test) Remove tests from fast suite
...
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
2024-11-17 19:45:59 +01:00
Viktor Lofgren
af40fa327b
(status-service) Correct measurement pruning to use correct sqlite datetimes, so as not to delete the database
2024-11-17 18:35:34 +01:00
Viktor Lofgren
cf6d28e71e
(status-service) Enable auto-commit
2024-11-17 18:25:15 +01:00
Viktor Lofgren
3791ea1e18
(service) Add a new application service for external liveness monitoring
...
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
2024-11-17 18:01:08 +01:00
Viktor Lofgren
e5db3f11e1
(chore) Clean up some of the uglier delomboking artifacts
2024-11-15 13:57:20 +01:00
Viktor Lofgren
9f47ce8d15
(chore) Remove lombok
...
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23
(chore) Remove use of deprecated STR.-style string templates
2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f
(feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
...
Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor Lofgren
a456ec9599
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c
(feed) Update API to allow specifying clean vs refresh update
...
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
Viktor Lofgren
6f858cd627
(feed) Decrease update interval to 24 hours
2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd
(feed) Wipe the feeds db and start over from system URLs periodically.
2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7
(search) Correctly show the feeds view when items are present
...
... otherwise show samples. This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031
(feeds) Reduce log spam
2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da
(feeds) Refresh the feed db using the previous db, when it is available.
2024-11-09 17:56:43 +01:00