MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 04:58:59 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	01a16ff388	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:55:59 +02:00
Viktor Lofgren	eb60ddb729	(crawler) Properly enqueue links from the root document in the crawler	2024-10-05 17:49:39 +02:00
Viktor Lofgren	db5faeceee	(download-sample) Break apart actor for better error recovery. The change also adds logged events to give more feedback that something is happening.	2024-10-04 13:39:43 +02:00
Viktor Lofgren	45d3e6aa71	(download-sample) Break apart actor for better error recovery. The change also adds logged events to give more feedback that something is happening.	2024-10-04 13:19:09 +02:00
Viktor Lofgren	d84a2c183f	(*) Remove the crawl spec abstraction. The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled. A while back, a new domains listing view was added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs. This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.	2024-10-03 13:41:17 +02:00
Viktor Lofgren	ecb5eedeae	(crawler, EXPERIMENT) Disable content type probing and use Accept header instead. There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.	2024-09-30 14:53:01 +02:00
Viktor Lofgren	90a2d4ae38	(index) Fix partial buffer writing in PrioDocIdsTransformer. Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.	2024-09-29 17:53:40 +02:00
Viktor Lofgren	2b8ab97ec1	(bit-writer) Do not clear buffer when creating a bit writer	2024-09-29 17:52:43 +02:00
Viktor Lofgren	43ca9c8a12	(sequence) Return Integer.MAX_VALUE for empty position lists. Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.	2024-09-29 17:21:17 +02:00
Viktor Lofgren	69d99c91dd	(index) Optimize buffer handling in PrioDocIdsTransformer	2024-09-29 17:20:49 +02:00
Viktor Lofgren	a8cc98a0f6	(index) Fix write offset calculation in PrioDocIdsTransformer. Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.	2024-09-29 17:20:29 +02:00
Viktor Lofgren	2ee58f4bc9	(index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition	2024-09-29 15:33:12 +02:00
Viktor Lofgren	938431e514	(scrape-feeds-actor) Add deduplication of insertion data. To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.	2024-09-28 14:41:14 +02:00
Viktor Lofgren	b2de3c70fa	(scrape-feeds-actor) Add explicit commit in case it's disabled	2024-09-28 14:36:57 +02:00
Viktor Lofgren	542690d9f6	(search-service) Hide pagination when there is only 1 page of results	2024-09-28 13:48:09 +02:00
Viktor Lofgren	596a7fb4ea	(actor) Disable the feed scraper on all nodes but the first	2024-09-28 12:36:16 +02:00
Viktor Lofgren	c3f726a01f	(actor) Add a feed scraping actor. Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job. The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.	2024-09-28 12:33:29 +02:00
Viktor Lofgren	4538ade156	(live-capture) Add readme to live-capture function	2024-09-28 11:35:46 +02:00
Viktor Lofgren	f4709d8f32	(live-capture) Handle case when screenshot bytes are empty. Add logic to flag the domain as fetched when the pngBytes array is empty. This ensures we won't try to re-fetch this domain again for a while.	2024-09-27 15:53:17 +02:00
Viktor Lofgren	3dda8c228c	(live-capture) Handle failed screenshot fetch in BrowserlessClient. Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.	2024-09-27 14:52:05 +02:00
Viktor Lofgren	ccf6b7caf3	(assistant) Refactor scheduling of tasks within SimilarDomainsService. Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.	2024-09-27 14:43:19 +02:00
Viktor Lofgren	fed33ed64a	(search-service) Update screenshot request handling. Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to a maximum of 5 requests per view to avoid overloading.	2024-09-27 14:27:25 +02:00
Viktor Lofgren	ca27d95ce1	(assistant) Add bounds checks for domain idx	2024-09-27 14:24:04 +02:00
Viktor Lofgren	3566fe296a	(assistant) Add scheduled update job for screenshot information	2024-09-27 14:16:28 +02:00
Viktor Lofgren	c91435e314	(assistant) Don't attempt to respond to similarity and linkedness queries before the data is ready. This will reduce the number of exceptions in the assistant logs quite significantly.	2024-09-27 14:08:08 +02:00
Viktor Lofgren	31f30069a4	(live-capture) Dial down logging a bit	2024-09-27 14:00:55 +02:00
Viktor	e5726a75d2	Merge pull request #120 from MarginaliaSearch/live-capture-function Add a new function 'Live Capture' for on-demand screenshot capture	2024-09-27 13:48:53 +02:00
Viktor Lofgren	c757d116bf	(misc) Fix Broken Tests	2024-09-27 13:46:34 +02:00
Viktor Lofgren	23cce0c78a	Add a new function 'Live Capture' for on-demand screenshot capture. The screenshots are requested by the site-service, and triggered via the site-info view.	2024-09-27 13:46:34 +02:00
Viktor Lofgren	1bd29a586c	(service-discovery) Add common base interface to all Grpc services. To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.	2024-09-27 13:46:34 +02:00
Viktor Lofgren	4565bfe359	(crawler) Make the crawler report crawling progress correctly when stopped and resumed.	2024-09-26 18:30:29 +02:00
Viktor Lofgren	336d6fdd14	(index-client) Fix error when zero results are found	2024-09-25 20:23:13 +02:00
Viktor Lofgren	95cde242ca	(assistant) Fix NPE when IP information is absent	2024-09-25 20:19:17 +02:00
Viktor	9224176202	Merge pull request #119 from MarginaliaSearch/result-pagination Add pagination support for the search results	2024-09-25 14:29:24 +02:00
Viktor Lofgren	0d2390fd13	(search-service) Only autofocus on the query when the query is empty	2024-09-25 14:27:03 +02:00
Viktor Lofgren	4a0356e26f	(search-service) Add pagination support to the search GUI	2024-09-25 14:26:49 +02:00
Viktor Lofgren	73f973cc06	(search-query) Add pagination to search query API and the direct query-service interface	2024-09-25 14:20:59 +02:00
Viktor Lofgren	e9e8580913	(converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers	2024-09-25 12:18:56 +02:00
Viktor Lofgren	8b85a58fea	(search UX) Autofocus on the search form	2024-09-24 15:56:03 +02:00
Viktor Lofgren	40512511af	(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl. This code is still a bit too complex, but it's slowly getting better.	2024-09-24 15:08:22 +02:00
Viktor	10d8fc4fe7	Update ROADMAP.md	2024-09-24 14:57:30 +02:00
Viktor	9899d45ea8	Merge pull request #118 from MarginaliaSearch/vlofgren-patch-1 Update ROADMAP.md	2024-09-24 14:13:47 +02:00
Viktor	3eea471ca6	Update ROADMAP.md	2024-09-24 14:13:32 +02:00
Viktor Lofgren	3dec4b6b34	(index) Fix bug where tcfFirstPosition lit up because one term was in the title and the other was missing from the document. This was because the firstPosition calculation was not invalidated when positions were missing.	2024-09-24 13:33:37 +02:00
Viktor Lofgren	162fc25ebc	(minor) Fix accidental commit errors	2024-09-23 18:03:09 +02:00
Viktor Lofgren	e9854f194c	(crawler) Refactor. Restructure the code to make a bit more sense; store full headers in crawl data; fix bug in retry-after header handling that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong.	2024-09-23 17:51:07 +02:00
Viktor Lofgren	9c292a4f62	(doc) Fix outdated links in documentation	2024-09-22 13:56:17 +02:00
Viktor Lofgren	edb42836da	(vcs) Fix shared state issues with VarintCodedSequence's iterators. Also cleans up the code a bit.	2024-09-21 16:09:15 +02:00
Viktor Lofgren	1ff88ff0bc	(vcs) Stopgap fix for quoted queries with the same term appearing multiple times. There are reentrance issues with VarintCodedSequence; this hides the symptom, but these need to be corrected properly.	2024-09-21 14:07:59 +02:00
Viktor Lofgren	28e7c8e5e0	Increase temporal bias weight to give the recent results filter a bit more recency	2024-09-17 18:11:40 +02:00
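
Note on commit 90a2d4ae38: the message describes looping until the buffer is fully drained when writing to a channel, since a single write call may accept only part of the buffer. The following is a minimal Java sketch of that general pattern, not the project's PrioDocIdsTransformer code. The names DrainExample, writeFully and TrickleChannel are hypothetical and exist only for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

// Hypothetical illustration; not code from the MarginaliaSearch repository.
final class DrainExample {

    /**
     * Writes all remaining bytes of the buffer to the channel.
     * WritableByteChannel.write may consume fewer bytes than remain,
     * so the call is repeated until the buffer is empty.
     */
    static int writeFully(WritableByteChannel channel, ByteBuffer buffer) throws IOException {
        int total = 0;
        while (buffer.hasRemaining()) {
            total += channel.write(buffer);
        }
        return total;
    }

    /** Channel that accepts at most maxPerWrite bytes per call, to simulate partial writes. */
    static final class TrickleChannel implements WritableByteChannel {
        final ByteArrayOutputStream sink = new ByteArrayOutputStream();
        private final int maxPerWrite;
        private boolean open = true;

        TrickleChannel(int maxPerWrite) {
            this.maxPerWrite = maxPerWrite;
        }

        @Override
        public int write(ByteBuffer src) {
            int n = Math.min(maxPerWrite, src.remaining());
            for (int i = 0; i < n; i++) {
                sink.write(src.get());
            }
            return n;
        }

        @Override
        public boolean isOpen() {
            return open;
        }

        @Override
        public void close() {
            open = false;
        }
    }
}
```

With a TrickleChannel limited to 3 bytes per call, a single write of a 10-byte buffer would leave 7 bytes behind; writeFully keeps calling write until nothing remains.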

2675 Commits
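
Note on commit e9854f194c: the message mentions a bug where the retry-after value was treated as milliseconds. Under the HTTP specification, Retry-After carries either a number of seconds or an HTTP date. The following is a minimal Java sketch of parsing it under that assumption; it is not the crawler's actual code, and RetryAfterExample and parseRetryAfter are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical illustration; not code from the MarginaliaSearch repository.
final class RetryAfterExample {

    /**
     * Interprets a Retry-After header value as a delay relative to 'now'.
     * A purely numeric value is a count of seconds (not milliseconds);
     * anything else is tried as an RFC 1123 HTTP date.
     */
    static Optional<Duration> parseRetryAfter(String headerValue, Instant now) {
        if (headerValue == null) {
            return Optional.empty();
        }
        String value = headerValue.trim();
        if (value.isEmpty()) {
            return Optional.empty();
        }

        // delay-seconds form, e.g. "120"
        if (value.chars().allMatch(c -> c >= '0' && c <= '9')) {
            try {
                return Optional.of(Duration.ofSeconds(Long.parseLong(value)));
            } catch (NumberFormatException e) {
                return Optional.empty(); // too many digits to fit in a long
            }
        }

        // HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        try {
            Instant when = ZonedDateTime
                    .parse(value, DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
            Duration delay = Duration.between(now, when);
            return Optional.of(delay.isNegative() ? Duration.ZERO : delay);
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

A caller would typically still clamp the result to its own minimum and maximum wait; the sketch leaves that policy out because the commit message does not state what bounds the crawler uses after the fix.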