Commit Graph

2743 Commits

Author SHA1 Message Date
Viktor Lofgren
8c963bd4ba (feeds) Remove Content-Encoding: gzip from feed fetcher
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75 (feeds) Add per-domain throttling for feed fetcher. 2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639 (feeds) Make feed XML parsing more lenient
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
Viktor Lofgren
b66fb9caf6 (feeds) Improve error handling in the feed fetcher. 2024-12-18 17:02:13 +01:00
Viktor Lofgren
6d18e6d840 (search) Add clustering to subscriptions view 2024-12-18 15:36:05 +01:00
Viktor Lofgren
2a3c63f209 (search) Exclude generated style.css from git 2024-12-18 15:24:31 +01:00
Viktor Lofgren
9f70cecaef (search) Add site subscription feature that puts RSS updates on the front page 2024-12-18 15:24:31 +01:00
Viktor Lofgren
47e58a21c6 Refactor documentBody method and ContentType charset handling
Updated the `documentBody` method to improve parsing retries and error handling. Refactored `ContentType` charset processing with cleaner logic, removing redundant handling for unsupported charsets. Also, updated the version of the `slop` library in dependency settings.
2024-12-17 17:11:37 +01:00
Viktor Lofgren
3714104976 Add loader for slop data in converter.
Also alter CrawledDocument to not require String parsing of the underlying byte[] data.  This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
2024-12-17 15:40:24 +01:00
Viktor Lofgren
f6f036b9b1 Switch to new Slop format for crawl data storage and processing.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality, updates to handling and writing of crawl data, and introduces support for SLOP in domain readers and converters.
2024-12-15 19:34:03 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
c08203e2ed (search) Prevent paperdoll from being run as a test by CI 2024-12-14 20:35:57 +01:00
Viktor Lofgren
86497fd32f (site-info) Mobile layout fix 2024-12-14 16:19:56 +01:00
Viktor Lofgren
3b998573fd Adjust colors on dark mode for site overview 2024-12-13 21:51:25 +01:00
Viktor Lofgren
e161882ec7 (search) Fix layout for light mode 2024-12-13 21:47:29 +01:00
Viktor Lofgren
357f349e30 (search) Table layout fixes for dictionary lookup 2024-12-13 21:47:08 +01:00
Viktor Lofgren
e4769f541d (search) Sort and deduplicate search results for better relevance.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
2024-12-13 21:47:08 +01:00
Viktor Lofgren
2a173e2861 (search) Dark Mode 2024-12-13 21:47:07 +01:00
Viktor Lofgren
a6a900266c (search) Fix redirects 2024-12-13 02:40:51 +01:00
Viktor Lofgren
bdba53f055 (site) Update domain parameter type from PathParam to QueryParam 2024-12-13 02:15:35 +01:00
Viktor Lofgren
eb2fe18867 (sideload) Add LSH generation for sideloaded StackExchange data
Previously, the sideloader did not generate a locality-sensitive hashCode for document details.  This caused all documents from the same domain to be considered duplicates by the deduplication logic.
2024-12-13 02:10:52 +01:00
Viktor Lofgren
a7468c8d23 (converter) Ensure paths are created for converter batch writer 2024-12-13 01:35:07 +01:00
Viktor Lofgren
fb2beb1eac (converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data 2024-12-13 01:19:30 +01:00
Viktor Lofgren
0fb03e3d62 (export) Add logging to AtagExporter for error handling 2024-12-12 22:54:32 +01:00
Viktor Lofgren
67db3f295e (index) Revert some optimization changes 2024-12-12 22:14:24 +01:00
Viktor Lofgren
dafaab3ef7 (index) Additional optimization pass 2024-12-12 18:57:33 +01:00
Viktor Lofgren
3f11ca409f (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 17:07:06 +01:00
Viktor Lofgren
694eed79ef (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:32:31 +01:00
Viktor Lofgren
4220169119 (index) Increase thread limit and optimize search result handling
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:31:11 +01:00
Viktor Lofgren
bbdde789e7 Merge branch 'master' into serp-redesign 2024-12-11 19:45:17 +01:00
Viktor Lofgren
0a53ac68a0 Add specialization for steam store and GOG 2024-12-11 18:32:45 +01:00
Viktor Lofgren
eab61cd48a Merge branch 'master' into serp-redesign 2024-12-11 17:09:27 +01:00
Viktor Lofgren
e65d75a0f9 (crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets 2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d (link-parser) Filter out URLs with binary file suffixes in LinkParser
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
a97c05107e Add synthetic meta flag for root path documents
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
2024-12-11 16:10:44 +01:00
Viktor Lofgren
5002870d1f (converter) Refactor sideloaders to improve feature handling and keyword logic
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
2024-12-11 16:01:38 +01:00
Viktor Lofgren
73861e613f (ranking) Downtune score boost for unordered heading matces 2024-12-11 15:44:29 +01:00
Viktor Lofgren
0ce2ba9ad9 (jooby) Fix asset handler 2024-12-11 14:38:04 +01:00
Viktor Lofgren
3ddcebaa36 (search) Give serp/start a more consistent name to siteinfo/start
The change also cleans up the layout a bit.
2024-12-11 14:33:57 +01:00
Viktor Lofgren
b91463383e (jooby) Clean up initialization process 2024-12-11 14:33:18 +01:00
Viktor Lofgren
7444a2f36c (site-info) Add placeholder when a feed item lacks a title. 2024-12-10 22:46:12 +01:00
Viktor Lofgren
461bc3eb1a (generator) Add special workaround to flag fextralife as a wiki 2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033 (rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking 2024-12-10 22:04:12 +01:00
Viktor Lofgren
fdee07048d (search) Remove Spark and migrate to Jooby for the search service 2024-12-10 19:13:13 +01:00
Viktor Lofgren
2fbf201761 (search) Adjust crosstalk flex-basis 2024-12-10 15:12:51 +01:00
Viktor Lofgren
4018e4c434 (search) Add crosstalk to paperdoll 2024-12-10 15:12:39 +01:00
Viktor Lofgren
f3382b5bd8 (search) Completely remove all old hdb templates
Create new views for conversion results, dictionary results, and site crosstalk.
2024-12-10 15:04:49 +01:00
Viktor Lofgren
9fc82574f0 (fingerprint) Add FluxGarden as a wiki generator
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9
Merge pull request #129 from MarginaliaSearch/atags-counts
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98 (live-crawler) Flag live crawled documents with a special keyword 2024-12-10 13:42:10 +01:00