Also alter CrawledDocument so it no longer requires parsing the underlying byte[] data into a String. This should reduce the number of large memory allocations quite significantly, hopefully reducing the GC churn a bit.
Replaces Parquet output and processing with the new Slop-based format. Includes data migration functionality and updates to the handling and writing of crawl data, and introduces SLOP support in domain readers and converters.
Added a custom sorting mechanism to prioritize HTTPS over HTTP and domain-based URLs over raw IPs during deduplication. Ensures "bad duplicates" are discarded while maintaining the original presentation order for user-facing results.
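A minimal sketch of the preference order, using java.net.URI for illustration (the real code presumably works on the project's own URL type):

```java
import java.net.URI;
import java.util.Comparator;

class DuplicatePreference {
    // Lower rank sorts first: HTTPS beats HTTP, a domain name beats a raw IPv4
    static final Comparator<URI> ORDER = Comparator
            .comparingInt((URI u) -> "https".equalsIgnoreCase(u.getScheme()) ? 0 : 1)
            .thenComparingInt(u -> isRawIp(u.getHost()) ? 1 : 0);

    private static boolean isRawIp(String host) {
        // Crude IPv4 test; sufficient for ranking purposes
        return host != null && host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }
}
```

The comparator only decides which duplicate gets discarded; the surviving results keep their original presentation order.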
Previously, the sideloader did not generate a locality-sensitive hashCode for document details. This caused all documents from the same domain to be considered duplicates by the deduplication logic.
Updated the default "index.valuationThreads" to 16 for improved concurrency. Expanded buffer sizes and restructured result handling logic for better memory management and performance.
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
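In sketch form; the suffix list here is illustrative, not exhaustive:

```java
import java.util.Set;

// Reject links whose path ends in a known binary suffix before they
// ever reach the fetcher
private static final Set<String> BINARY_SUFFIXES =
        Set.of("zip", "exe", "tar", "gz", "iso", "jpg", "png", "pdf");

static boolean hasBinarySuffix(String path) {
    int dot = path.lastIndexOf('.');
    return dot >= 0 && BINARY_SUFFIXES.contains(path.substring(dot + 1).toLowerCase());
}
```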
If the document's URL path is "/", a "special:root" meta flag is now added with the "Synthetic" bit set. This will help searching only for the root document of a website, neat stuff ahead :D
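In sketch form, with the keyword-builder names standing in for the real machinery:

```java
// Illustrative sketch; addSynthetic and the surrounding names are stand-ins
if ("/".equals(document.url.path)) {
    // Tag the root document so queries can target it directly;
    // the term is stored with the Synthetic flag bit set
    keywords.addSynthetic("special:root");
}
```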
Centralized HTML feature handling with `applyFeatures` in StackexchangeSideloader and added dynamic synthetic term generation. Improved HTML structure in RedditSideloader and enhanced metadata processing with feature-based keywords. Updated DomainLinks to correctly compute link counts using individual link occurrences.
Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod.
There's also a lot of polish remaining everywhere, dead links, etc.
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.
Adding a tracking message to the export actor makes it possible to run the exports in a precession.
Adding a new precession actor, and some GUI components for triggering exports.
The change also adds a heartbeat to the export process.
Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.
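The guard is simple; in sketch form (variable names are illustrative):

```java
// Skip non-HTML documents before any parsing happens; startsWith() also
// accepts values like "text/html; charset=utf-8"
if (contentType == null || !contentType.toLowerCase().startsWith("text/html")) {
    continue;
}
```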
Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.
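A sketch of what `containsRangeExact` might look like, assuming spans are stored as interleaved (start, end) position pairs; the actual representation in `DocumentSpan` may differ:

```java
// True only when some span starts AND ends exactly at the given positions,
// i.e. the query covers the whole atag span, not just part of it
public boolean containsRangeExact(int start, int end) {
    for (int i = 0; i + 1 < positions.length; i += 2) {
        if (positions[i] == start && positions[i + 1] == end)
            return true;
    }
    return false;
}
```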
Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue.
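`normalizeUtf8` in sketch form; the real method may cover more characters than the en-dash:

```java
// En-dash (U+2013) in paths trips up URL encoding downstream;
// replace it with a plain ASCII hyphen
private static String normalizeUtf8(String path) {
    return path.replace('\u2013', '-');
}
```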
Fixes issue #109.
Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.
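A sketch of the idea; the FeedItem accessor and stripFragment helper are hypothetical. If removing fragments would make two item URLs collide, the fragments evidently carry meaning and should be retained:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

static boolean fragmentsAreSignificant(List<FeedItem> items) {
    Set<String> seen = new HashSet<>();
    // anyMatch short-circuits on the first collision among de-fragmented URLs
    return items.stream()
            .map(item -> stripFragment(item.url()))
            .anyMatch(url -> !seen.add(url));
}
```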
This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.
To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are, on a dice roll, added to a bad URLs table that prevents further attempts at fetching them.
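The dice roll in sketch form; the 1-in-10 odds and the table accessor are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;

// Probabilistically blacklist failing URLs so the table doesn't balloon,
// while repeat offenders are still caught eventually
if (ThreadLocalRandom.current().nextInt(10) == 0) {
    badUrlsTable.insert(url);  // suppress further fetch attempts for this URL
}
```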
Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This helps avoid triggering bot mitigation that would otherwise accept the regular UA string.
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.
While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, and some effort was also put toward making process boot a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.
Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.
This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.
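In sketch form; the exact constant here is assumed, the point is a fixed offset that keeps live-crawl ordinals clear of the batch-crawl range:

```java
// Live-crawled documents get ordinals far above anything the batch
// crawler will assign, so the two never collide
private static final int LIVE_CRAWL_ORDINAL_OFFSET = 10_000_000;

int ordinal = LIVE_CRAWL_ORDINAL_OFFSET + documentIndex;
```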
Some refactoring is still needed, but a dummy actor is in place, along with a process that crawls URLs from the livecapture service's RSS endpoints; the data makes it all the way to being indexable.
Adding a new @Tag("flaky") for tests that do not reliably return successes. These may still be valuable during development, but should not run in CI.
Also tagging a few of the slower tests with the old @Tag("slow"), to speed up the run-time.
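In use, the tags look like this (test names are made up):

```java
import org.junit.jupiter.api.Tag;
import org.junit.jupiter.api.Test;

class FeedFetcherTest {
    @Tag("flaky")   // depends on external endpoints; useful locally, skipped in CI
    @Test
    void fetchLiveFeed() { /* ... */ }

    @Tag("slow")    // long-running; excluded from the default test run
    @Test
    void fullCorpusRoundTrip() { /* ... */ }
}
```

CI can then skip them with JUnit Platform's tag filtering, e.g. `excludeTags("flaky", "slow")` in the Gradle test task.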
The new service 'status-service' will poll public endpoints periodically, and publish a basic read-only UI with the results, as well as publish the results to prometheus.
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
We can only do this for files that are not required for unit tests.
As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions. The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
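A sketch of the sizing change; the variable names and the String element type are stand-ins:

```java
import java.util.ArrayDeque;
import java.util.HashSet;

// Size work structures from the configured crawl depth instead of the
// (typically empty) seed URL list
var queue   = new ArrayDeque<String>(crawlDepth);
var visited = new HashSet<String>(2 * crawlDepth);
```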
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
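The fix in essence, assuming the buffer is in fill mode when close is called:

```java
// A single write() call is not guaranteed to drain the buffer,
// so loop until nothing remains before closing
buffer.flip();
while (buffer.hasRemaining()) {
    writeChannel.write(buffer);
}
writeChannel.close();
```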
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses the edge case where an empty list is encountered.
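The guard at the top of the method, in sketch form; the representation of the position lists is assumed:

```java
// An empty list means no possible match: "infinitely far apart",
// not a zero-distance match
for (int[] positions : positionLists) {
    if (positions.length == 0)
        return Integer.MAX_VALUE;
}
```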
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
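The corrected calculation in sketch form; variable names are illustrative:

```java
// Bytes still sitting in the unflushed write buffer count toward the
// logical position in the output file
long writeOffset = writeChannel.position() + writeBuffer.position();
```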
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.
The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.
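The startup guard in sketch form; the file path comes straight from the text above:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

Path scrapeUrls = Path.of("data/scrape-urls.txt");
if (!Files.exists(scrapeUrls)) {
    return;  // no URL list configured; the actor shuts down
}
List<String> urls = Files.readAllLines(scrapeUrls);
```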
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
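The pattern in sketch form; fetchScreenshot and uploadScreenshot are hypothetical helpers:

```java
import java.io.IOException;

byte[] screenshot;
try {
    screenshot = fetchScreenshot(domain);
} catch (IOException e) {
    screenshot = new byte[0];  // empty array instead of null on failure
}
if (screenshot.length > 0) {
    uploadScreenshot(domain, screenshot);  // never store invalid/empty data
}
```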
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to a maximum of 5 per view, to avoid overloading.
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService, extending BindableService, was added.
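The shape of the interface, in sketch form; the method name is a guess at what "whether to enable a service on a particular runtime" looks like in code:

```java
import io.grpc.BindableService;

public interface DiscoverableService extends BindableService {
    // Whether this runtime should register/expose the service
    boolean shouldRegisterService();
}
```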
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix a bug in Retry-After handling that assumed the timeout was in milliseconds and then clamped it to a lower bound of 500ms, meaning it was almost always handled wrong (see the sketch below)
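A sketch of the corrected interpretation; per the HTTP spec, the delay form of Retry-After is whole seconds, though it may alternatively be an HTTP-date, which this sketch glosses over:

```java
import java.time.Duration;

static Duration parseRetryAfter(String headerValue) {
    try {
        // Delay form: whole seconds, not milliseconds
        return Duration.ofSeconds(Long.parseLong(headerValue.trim()));
    } catch (NumberFormatException e) {
        // HTTP-date form, or garbage; fall back to a conservative default
        return Duration.ofSeconds(30);
    }
}
```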