Viktor Lofgren
41a59dcf45
(feed) Sanitize illegal HTML entities out of the feed XML before parsing
2024-12-25 14:53:28 +01:00
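A minimal sketch of the kind of pre-parse sanitization commit 41a59dcf45 describes, not the actual implementation. It assumes that "illegal" means named entities other than the five XML predefines (`amp`, `lt`, `gt`, `quot`, `apos`) and that they are simply dropped. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the real feed code may sanitize differently.
final class FeedEntitySanitizer {
    // Matches named entity references such as &nbsp; but not the five
    // entities XML predefines, and not numeric references like &#160;.
    private static final Pattern ILLEGAL_ENTITY =
            Pattern.compile("&(?!(?:amp|lt|gt|quot|apos);)[A-Za-z][A-Za-z0-9]*;");

    // Assumption: offending entities are removed rather than translated.
    static String sanitize(String xml) {
        return ILLEGAL_ENTITY.matcher(xml).replaceAll("");
    }
}
```

Dropping the entity is the simplest choice; mapping common HTML entities to numeric character references would preserve more of the text at the cost of a lookup table.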
Viktor Lofgren
94d4d2edb7
(live-crawler) Add refresh date to feeds API
...
For now this is just the ctime for the feeds db. We may want to store this per-record in the future.
2024-12-25 14:20:48 +01:00
Viktor Lofgren
7ae19a92ba
(deploy) Improve deployment script to allow specification of partitions
2024-12-24 11:16:15 +01:00
Viktor Lofgren
56d14e56d7
(live-crawler) Improve LiveCrawlActor resilience to FeedService outages
2024-12-23 23:33:54 +01:00
Viktor Lofgren
a557c7ae7f
(live-crawler) Limit concurrent accesses per domain using DomainLocks from main crawler
2024-12-23 23:31:03 +01:00
Viktor Lofgren
b66879ccb1
(feed) Add support for date discovery through atom:issued and atom:created
...
This is specifically to help parse monadnock.net's Atom feed.
2024-12-23 20:05:58 +01:00
Viktor Lofgren
f1b7157ca2
(deploy) Add basic linting ability to deployment script.
2024-12-23 16:21:29 +01:00
Viktor Lofgren
7622335e84
(deploy) Correct deploy script, set correct name for assistant
2024-12-23 15:59:02 +01:00
Viktor Lofgren
0da2047eae
(live-capture) Correctly update processed count, disable poll rate adjustment based on freshness.
2024-12-23 15:56:27 +01:00
Viktor Lofgren
5ee4321110
(ci) Correct deploy script
2024-12-22 20:08:37 +01:00
Viktor Lofgren
9459b9933b
(ci) Correct deploy script
2024-12-22 19:40:32 +01:00
Viktor Lofgren
87fb564f89
(ci) Add script for automatic deployment based on git tags
2024-12-22 19:24:54 +01:00
Viktor Lofgren
5ca8523220
(math) Reduce log error spam from null unit conversions
2024-12-21 18:51:45 +01:00
Viktor Lofgren
1118657ffd
(system) Supply local IP to service discovery if multiFace is enabled
2024-12-19 22:20:19 +01:00
Viktor Lofgren
b1f970152d
(system) To support configurations with multiple docker networks, bind to the "most local" interface.
...
Make the behavior optional.
2024-12-19 20:26:31 +01:00
Viktor Lofgren
e1783891ab
(system) To support configurations with multiple docker networks, bind to the "most local" interface.
2024-12-19 20:18:57 +01:00
Viktor Lofgren
64d32471dd
(deploy) Deploy executor test
2024-12-19 17:45:47 +01:00
Viktor Lofgren
232cc465d9
(deploy) Deploy executor test
2024-12-19 17:35:38 +01:00
Viktor Lofgren
8c963bd4ba
(feeds) Remove Content-Encoding: gzip from feed fetcher
...
We don't support decompressing gzip, so this just gives us errors at this point should the server support it.
2024-12-18 22:23:44 +01:00
Viktor Lofgren
6a079c1c75
(feeds) Add per-domain throttling for feed fetcher.
2024-12-18 22:06:46 +01:00
Viktor Lofgren
2dc9f2e639
(feeds) Make feed XML parsing more lenient
...
... by consuming BOM markers and leading whitespace.
2024-12-18 17:18:41 +01:00
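A minimal sketch of the leniency described in commit 2dc9f2e639, assuming the feed has already been decoded to a string so that a byte order mark shows up as U+FEFF. The class and method names are hypothetical and not taken from the repository.

```java
// Hypothetical helper illustrating BOM and leading-whitespace removal.
final class FeedPreamble {
    // Returns the input with any leading BOM characters and whitespace removed.
    static String stripPreamble(String xml) {
        int i = 0;
        while (i < xml.length()) {
            char c = xml.charAt(i);
            // Character.isWhitespace does not treat U+FEFF as whitespace,
            // so the BOM needs an explicit check.
            if (c == '\uFEFF' || Character.isWhitespace(c)) {
                i++;
            } else {
                break;
            }
        }
        return xml.substring(i);
    }
}
```

This matters because an XML declaration must be the very first thing in the document, so stray characters before `<?xml` make strict parsers reject an otherwise valid feed.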
Viktor Lofgren
b66fb9caf6
(feeds) Improve error handling in the feed fetcher.
2024-12-18 17:02:13 +01:00
Viktor Lofgren
eb2fe18867
(sideload) Add LSH generation for sideloaded StackExchange data
...
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
2024-12-13 02:10:52 +01:00
Viktor Lofgren
a7468c8d23
(converter) Ensure paths are created for converter batch writer
2024-12-13 01:35:07 +01:00
Viktor Lofgren
fb2beb1eac
(converter) Fix data-loss bug where the converter writer would remove all but the last batch of processed data
2024-12-13 01:19:30 +01:00
Viktor Lofgren
0fb03e3d62
(export) Add logging to AtagExporter for error handling
2024-12-12 22:54:32 +01:00
Viktor Lofgren
67db3f295e
(index) Revert some optimization changes
2024-12-12 22:14:24 +01:00
Viktor Lofgren
dafaab3ef7
(index) Additional optimization pass
2024-12-12 18:57:33 +01:00
Viktor Lofgren
3f11ca409f
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 17:07:06 +01:00
Viktor Lofgren
694eed79ef
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:32:31 +01:00
Viktor Lofgren
4220169119
(index) Increase thread limit and optimize search result handling
...
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
2024-12-12 15:31:11 +01:00
Viktor Lofgren
0a53ac68a0
Add specialization for steam store and GOG
2024-12-11 18:32:45 +01:00
Viktor Lofgren
e65d75a0f9
(crawler) Reintroduce content type probing and clean out bad content type data from the existing crawl sets
2024-12-11 17:01:52 +01:00
Viktor Lofgren
3b99cffb3d
(link-parser) Filter out URLs with binary file suffixes in LinkParser
...
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
a97c05107e
Add synthetic meta flag for root path documents
...
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
2024-12-11 16:10:44 +01:00
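A minimal sketch of the rule in commit a97c05107e: a document whose URL path is exactly "/" gets the keyword "special:root". How the "Synthetic" bit is stored is not shown in the log, so this sketch only returns the keyword in a plain set. The class and method names are hypothetical.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper; the real converter attaches metadata bits as well.
final class RootPathFlag {
    // Returns the synthetic keywords implied by the document URL.
    static Set<String> syntheticKeywords(URI url) {
        Set<String> keywords = new LinkedHashSet<>();
        // Only an explicit "/" path counts; URI.getPath() returns an empty
        // string for a URL such as https://example.com with no trailing slash.
        if ("/".equals(url.getPath())) {
            keywords.add("special:root");
        }
        return keywords;
    }
}
```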
Viktor Lofgren
5002870d1f
(converter) Refactor sideloaders to improve feature handling and keyword logic
...
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
2024-12-11 16:01:38 +01:00
Viktor Lofgren
73861e613f
(ranking) Downtune score boost for unordered heading matches
2024-12-11 15:44:29 +01:00
Viktor Lofgren
461bc3eb1a
(generator) Add special workaround to flag fextralife as a wiki
2024-12-10 22:22:52 +01:00
Viktor Lofgren
cf7f84f033
(rank) Reduce the impact of domain rank bonus, and only apply it to cancel out negative penalties, never to increase the ranking
2024-12-10 22:04:12 +01:00
Viktor Lofgren
9fc82574f0
(fingerprint) Add FluxGarden as a wiki generator
...
#130
2024-12-10 13:51:42 +01:00
Viktor
589f4dafb9
Merge pull request #129 from MarginaliaSearch/atags-counts
...
(WIP) Improve atag sentence matching
2024-12-10 12:42:34 +00:00
Viktor Lofgren
c5d657ef98
(live-crawler) Flag live crawled documents with a special keyword
2024-12-10 13:42:10 +01:00
Viktor Lofgren
3c2bb566da
(converter) Wipe the converter output path on initialization to avoid lingering stale data.
2024-12-10 13:41:05 +01:00
Viktor Lofgren
e0c0ed27bc
(keyword-extraction) Clean up code and add tests for position and spans calculation
...
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
20abb91657
(loader) Correct DocumentLoaderService to properly do bulk inserts
...
Fixes issue #128
2024-12-08 13:12:52 +01:00
Viktor Lofgren
291ca8daf1
(converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
...
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
ee2d5496d0
Revert "(experiment) Modify atags exporter to permit duplicates from different source domains"
...
This reverts commit 5c858a2b94.
2024-12-07 14:01:50 +01:00
Viktor Lofgren
5c858a2b94
(experiment) Modify atags exporter to permit duplicates from different source domains
...
This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.
2024-12-06 14:10:15 +01:00
Viktor Lofgren
fdc3efa250
(setup) Remove OpenNLP tokenization model
...
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
5fdd2c71f8
(setup) Update OpenNLP model URLs to archive.apache.org
...
Changed the URLs for downloading the OpenNLP sentence and token models from downloads.apache.org to archive.apache.org, as the previous links have died.
2024-11-28 15:58:25 +01:00