MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	5c858a2b94	(experiment) Modify atags exporter to permit duplicates from different source domains This is an attempt to provide higher resolution term frequency data that will need evaluation when the data is processed.	2024-12-06 14:10:15 +01:00
Viktor Lofgren	fb75a3827d	(site) Adjust coloration of search results	2024-12-05 16:58:00 +01:00
Viktor Lofgren	7d546d0e2a	(site) Make SearchParameters generate relative URLs instead of absolute	2024-12-05 16:47:22 +01:00
Viktor Lofgren	8fcb6ffd7a	(site-info) Increase contrast in search results for forums, wikis	2024-12-05 16:42:16 +01:00
Viktor Lofgren	f97de0c15a	(site-info) Fix layout	2024-12-05 16:33:46 +01:00
Viktor Lofgren	be9e192b78	(site-info) Fix pagination in backlinks and documents views	2024-12-05 16:26:11 +01:00
Viktor Lofgren	75ae1c9526	(site-info) Do not show 'suggest for crawling' when the node affinity is already set to 0 This indicates the domain is already slated for crawling.	2024-12-05 16:18:46 +01:00
Viktor Lofgren	33761a0236	(site-info) Make the search box in the site viewer functional	2024-12-05 16:16:29 +01:00
Viktor Lofgren	19b69b1764	(site-info) Only show samples if feed is absent, never both.	2024-12-05 16:05:03 +01:00
Viktor Lofgren	8b804359a9	(serp) Layout fixes for mobile	2024-12-05 15:59:33 +01:00
Viktor Lofgren	f050bf5c4c	(WIP) Initial semi-working transformation to new tailwind UI Still missing is a proper build, we're currently pulling in tailwind from a CDN, which is no bueno in prod. There's also a lot of polish remaining everywhere, dead links, etc.	2024-12-05 14:00:17 +01:00
Viktor Lofgren	fdc3efa250	(setup) Remove OpenNLP tokenization model This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.	2024-11-28 16:03:05 +01:00
Viktor Lofgren	5fdd2c71f8	(setup) Update OpenNLP model URLs to `archive.apache.org` Changed the URLs for downloading OpenNLP sentence and tokens models from downloads.apache.org to archive.apache.org; as the previous link has died.	2024-11-28 15:58:25 +01:00
Viktor Lofgren	c97c66a41c	(ranking) Reduce the verbatim score multiplier	2024-11-28 13:37:11 +01:00
Viktor Lofgren	7b64377fd6	(ranking) Promote documents with multiple phrase matches with a log-scale bonus	2024-11-28 13:36:56 +01:00
Viktor Lofgren	e11ebf18e5	(span) Correct intersection counting logic, add comprehensive tests	2024-11-28 13:36:25 +01:00
Viktor Lofgren	ba47d72bf4	(ranking) Adjust scores for external link matches	2024-11-27 14:27:23 +01:00
Viktor Lofgren	52bc0272f8	(atag) Add alias domain support and improve domain handling Introduced optional alias domain functionality in EdgeDomain class to handle domain variations such as "www" in the anchor tags code, as there are commonly a number of relevant but glancing misses in the atags data.	2024-11-27 14:26:44 +01:00
Viktor Lofgren	d4bce13a03	(export) Add export actors to precession Adding a tracking message to the export actor means it's possible to run them in a precession. Adding a new precession actor, and some GUI components for triggering exports. The change also adds a heartbeat to the export process.	2024-11-26 15:07:03 +01:00
Viktor Lofgren	b9842b57e0	(encyclopedia-sideloader) Add test suite and clean up urlencoding logic	2024-11-26 13:34:15 +01:00
Viktor Lofgren	95776e9bee	(encyclopedia) Fix commit gore resulting in bad SQL query	2024-11-26 12:44:49 +01:00
Viktor Lofgren	077d8dcd11	(result-score) Adjust ranking parameters a tiny bit	2024-11-25 18:30:59 +01:00
Viktor Lofgren	9ec41e27c6	(keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended	2024-11-25 18:30:22 +01:00
Viktor Lofgren	200743c84f	(minor) Remove delombok debris	2024-11-25 18:29:21 +01:00
Viktor Lofgren	6d7998e349	(index) Correct behavior of debug function positionValues(), which was misleadingly incorrect	2024-11-25 18:28:53 +01:00
Viktor Lofgren	7d1ef08a0f	(index) Correct ranking bonus for external linktext appearances	2024-11-25 17:40:15 +01:00
Viktor Lofgren	ea6b148df2	(docker) Add restart: always to executor nodes The system will perform a janitor reset on these nodes when the node profile is switched, so it's important they restart automatically.	2024-11-25 15:31:45 +01:00
Viktor Lofgren	3ec9c4c5fa	(export) Filter non-HTML documents in exporters Add a check to ensure only documents with "text/html" content type are processed in FeedExporter, AtagExporter, and TermFrequencyExporter. This prevents non-HTML documents from being parsed and helps maintain data consistency and keep the memory usage down.	2024-11-25 15:06:42 +01:00
Viktor Lofgren	0b6b5dab07	(index) Add score bonuses for single-word anchor tag spans Enhanced scoring logic to add bonuses when the query matches single-word anchor (atag) spans exactly. Implemented this by adding conditions in `IndexResultScoreCalculator.java` and creating a new method `containsRangeExact` in `DocumentSpan.java` to check for exact span matches.	2024-11-25 14:44:41 +01:00
Viktor Lofgren	ff17473105	Fix UTF-8 URL normalization issue in sideloader. Normalize URLs by replacing en-dash with hyphen to prevent encoding errors. This ensures correct handling of a small subset of articles with improperly normalized UTF-8 paths. Added `normalizeUtf8` method to address this issue. Fixes issue #109.	2024-11-25 14:25:47 +01:00
Viktor Lofgren	dc5f97e737	(index) Add bonus for single-word title matches when the title is also a single word	2024-11-25 13:24:12 +01:00
Viktor Lofgren	d919179ba3	(index) Correct off-by-1 error in DocumentSpan.containsRange	2024-11-25 13:24:03 +01:00
Viktor Lofgren	f09669a5b0	(index) Correct usage of DocumentSpan.length() instead of DocumentSpan.size() The latter counts the number of spans, and is not what you want here.	2024-11-25 13:11:55 +01:00
Viktor Lofgren	b3b0f6fed3	(actor) Add side-load profile to PROC_CONVERTER_SPAWNER. This fell off during the profile split, but is necessary for sideloading.	2024-11-25 12:40:14 +01:00
Viktor Lofgren	88caca60f9	(live-crawl) Flag URLs that don't pass robots.txt as bad so we don't keep fetching robots.txt every day for an empty link list	2024-11-23 17:07:16 +01:00
Viktor Lofgren	923ebbac81	(feeds) Add logic to handle URI fragments in feed items Introduced a method to decide whether to retain URI fragments in feed items based on their uniqueness. Enhanced FeedItem processing to conditionally strip fragments to maintain clean URLs where applicable.	2024-11-23 16:38:56 +01:00
Viktor	df298df852	Merge pull request #125 from MarginaliaSearch/live-search Add near real-time crawling from RSS feeds to supplement the slower batch based crawls	2024-11-22 16:38:37 +00:00
Viktor Lofgren	552b246099	(live-crawl) Improve error handling for errors during robots.txt-retrieval Reduce log-spam and don't treat errors other than 404 as "all is permitted".	2024-11-22 14:15:32 +01:00
Viktor Lofgren	80e6d0069c	(live-crawl-actor) Clear index journal before starting live crawl This is to prevent data corruption. This shouldn't be necessary for the regular loader path, but the live crawler is a bit different and needs some paving of the road ahead of it.	2024-11-22 14:04:57 +01:00
Viktor Lofgren	b941604135	(live-crawler) Alter DbDomainIdRegistry to make inserts if an id is missing, as this is apparently a rare scenario we need to deal with.	2024-11-22 13:58:57 +01:00
Viktor Lofgren	52eb5bc84f	(live-crawler) Keep track of bad URLs To avoid hammering the same invalid URLs for up to two months, URLs that fail to fetch correctly are on a dice roll added to a bad URLs table, that prevents further attempts at fetching them.	2024-11-22 00:55:46 +01:00
Viktor Lofgren	4d23fe6261	(feeds) Simplify RSS User-Agent header Removed the redundant "RSS Feed Fetcher" suffix from the User-Agent header in the FeedFetcherService. This will help avoid making the feed fetcher trigger bot mitigation that accepts the regular UA-string.	2024-11-21 16:43:56 +01:00
Viktor Lofgren	14519294d2	Merge branch 'master' into live-search	2024-11-21 16:00:20 +01:00
Viktor Lofgren	51e46ad2b0	(refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendants Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx. While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform. It's still a bit heterogeneous between different processes, but a bit less so for now.	2024-11-21 16:00:09 +01:00
Viktor Lofgren	665c8831a3	(model) Fix resource leak in partially read crawl data streams. Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.	2024-11-20 19:29:13 +01:00
Viktor Lofgren	47dfbacb00	(conf) Introduce a new concept of node profiles Node profiles decide which actors are started, and which views are available in the control GUI. This helps keep the system organized, and hides real-time clutter from the batch-oriented nodes.	2024-11-20 18:15:22 +01:00
Viktor Lofgren	f94911541a	(live-crawl) Reduce the risk of id collisions with the main indexes This is done by applying a large constant offset to the ordinals for the live crawled documents. The chosen value still permits up to 100k documents to be fetched for a single domain with the live crawler, which is ridiculously large.	2024-11-20 16:01:10 +01:00
Viktor Lofgren	89d8af640d	(live-crawl) Rename the live crawler code module to be more consistent with the other processes	2024-11-20 15:55:15 +01:00
Viktor Lofgren	6e4252cf4c	(live-crawl) Make the actor poll for feeds changes instead of being a one-shot thing. Also changes the live crawl process to store the live crawl data in a fixed directory in the storage base rather than versioned directories.	2024-11-20 15:36:25 +01:00
Viktor Lofgren	79ce4de2ab	(model) Remove deprecated fields from CrawledDocument and CrawledDomain	2024-11-20 15:27:05 +01:00

2722 Commits