Commit Graph

2436 Commits

Author SHA1 Message Date
Viktor Lofgren
76e9053dd0 (setup) Move some file-downloads from setup script to the first boot of the control node of the system
We can only do this for files that are not required for unit tests.

As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions.  The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e (crawler) Use a better hashInt implementation in CrawlDataReference
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8 (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70 (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556 (crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains 2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b (link-parser) Make mailing list blocking optional 2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2 (converter) Increase the number of links the converter will pick up per document 2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107 (index) Short-circuit rankResults when there are no results 2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c (query-parser) Fix regression where advice terms weren't parsed properly 2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:49:39 +02:00
Viktor Lofgren
db5faeceee (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:39:43 +02:00
Viktor Lofgren
45d3e6aa71 (download-sample) Break apart actor for better error recovery
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae (crawler, EXPERIMENT) Disable content type probing and use Accept header instead
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
90a2d4ae38 (index) Fix partial buffer writing in PrioDocIdsTransformer
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
2024-09-29 17:53:40 +02:00
Viktor Lofgren
2b8ab97ec1 (bit-writer) Do not clear buffer when creating a bit writer 2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12 (sequence) Return Integer.MAX_VALUE for empty position lists.
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and address edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
Viktor Lofgren
69d99c91dd (index) Optimize buffer handling in PrioDocIdsTransformer 2024-09-29 17:20:49 +02:00
Viktor Lofgren
a8cc98a0f6 (index) Fix write offset calculation in PrioDocIdsTransformer
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
2024-09-29 17:20:29 +02:00
Viktor Lofgren
2ee58f4bc9 (index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition 2024-09-29 15:33:12 +02:00
Viktor Lofgren
938431e514 (scrape-feeds-actor) Add deduplication of insertion data
To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.
2024-09-28 14:41:14 +02:00
Viktor Lofgren
b2de3c70fa (scrape-feeds-actor) Add explicit commit in case it's disabled 2024-09-28 14:36:57 +02:00
Viktor Lofgren
542690d9f6 (search-service) Hide pagination when there is only 1 page of results 2024-09-28 13:48:09 +02:00
Viktor Lofgren
596a7fb4ea (actor) Disable the feed scraper on all nodes but the first 2024-09-28 12:36:16 +02:00
Viktor Lofgren
c3f726a01f (actor) Add a feed scraping actor
Add a new actor that polls an URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.

The URLs are specified in data/scrape-urls.txt.  If this file is absent, the actor shuts down.
2024-09-28 12:33:29 +02:00
Viktor Lofgren
4538ade156 (live-capture) Add readme to live-capture function 2024-09-28 11:35:46 +02:00
Viktor Lofgren
f4709d8f32 (live-capture) Handle case when screenshot bytes are empty.
Add logic to flag the domain as fetched when the pngBytes array is empty. This ensures we won't try to re-fetch this domain again for a while.
2024-09-27 15:53:17 +02:00
Viktor Lofgren
3dda8c228c (live-capture) Handle failed screenshot fetch in BrowserlessClient
Return an empty byte array when screenshot fetch fails, ensuring downstream processes are not impacted by null responses. Additionally, only attempt to upload the screenshot if the byte array is non-empty, preventing invalid data from being stored.
2024-09-27 14:52:05 +02:00
Viktor Lofgren
ccf6b7caf3 (assistant) Refactor scheduling of tasks within SimilarDomainsService
Changed the scheduling function to use a single schedule call instead of a fixed delay for the init task. The updateScreenshotInfo method was also moved and slightly refactored for clearer readability and consistency.
2024-09-27 14:43:19 +02:00
Viktor Lofgren
fed33ed64a (search-service) Update screenshot request handling
Always request the main site screenshot to ensure staleness checks and necessary updates. Limit additional screenshot requests for similar and linking domains to avoid overloading with a maximum of 5 requests per view.
2024-09-27 14:27:25 +02:00
Viktor Lofgren
ca27d95ce1 (assistant) Add bounds checks for domain idx 2024-09-27 14:24:04 +02:00
Viktor Lofgren
3566fe296a (assistant) Add scheduled update job for screenshot information 2024-09-27 14:16:28 +02:00
Viktor Lofgren
c91435e314 (assistant) Don't attempt to respond to similarity and linkedness queries before the data is ready
This will reduce the number of exceptions in the assistant logs quite significantly.
2024-09-27 14:08:08 +02:00
Viktor Lofgren
31f30069a4 (live-capture) Dial down logging a bit 2024-09-27 14:00:55 +02:00
Viktor
e5726a75d2
Merge pull request #120 from MarginaliaSearch/live-capture-function
Add a new function 'Live Capture' for on-demand screenshot capture
2024-09-27 13:48:53 +02:00
Viktor Lofgren
c757d116bf (misc) Fix Broken Tests 2024-09-27 13:46:34 +02:00
Viktor Lofgren
23cce0c78a Add a new function 'Live Capture' for on-demand screenshot capture
The screenshots are requested by the site-service, and triggered via the site-info view.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
1bd29a586c (service-discovery) Add common base interface to all Grpc services
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
4565bfe359 (crawler) Make the crawler report crawling progress correctly when stopped and resumed. 2024-09-26 18:30:29 +02:00
Viktor Lofgren
336d6fdd14 (index-client) Fix error when zero results are found 2024-09-25 20:23:13 +02:00
Viktor Lofgren
95cde242ca (assistant) Fix NPE when IP information is absent 2024-09-25 20:19:17 +02:00
Viktor
9224176202
Merge pull request #119 from MarginaliaSearch/result-pagination
Add pagination support for the search results
2024-09-25 14:29:24 +02:00
Viktor Lofgren
0d2390fd13 (search-service) Only autofocus on the query when the query is empty 2024-09-25 14:27:03 +02:00
Viktor Lofgren
4a0356e26f (search-service) Add pagination support to the search GUI 2024-09-25 14:26:49 +02:00
Viktor Lofgren
73f973cc06 (search-query) Add pagination to search query API and the direct query-service interface 2024-09-25 14:20:59 +02:00
Viktor Lofgren
e9e8580913 (converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers 2024-09-25 12:18:56 +02:00
Viktor Lofgren
8b85a58fea (search UX) Autofocus on the search form 2024-09-24 15:56:03 +02:00