Viktor Lofgren
a5b4951f23
(chore) Remove use of deprecated STR.-style string templates
2024-11-11 18:02:28 +01:00
Viktor Lofgren
8b8bf0748f
(feature-extraction) Add new DocumentHeaders class encapsulating Html headers.
...
Also adds a few new html features for CDNs and S3 hosting for use in ranking and query refinement.
2024-11-11 13:26:15 +01:00
Viktor
5cc71ae586
Merge pull request #123 from MarginaliaSearch/vlofgren-patch-1
...
Update ROADMAP.md
2024-11-10 18:57:49 +01:00
Viktor
33fcfe4b63
Update ROADMAP.md
2024-11-10 18:57:15 +01:00
Viktor
a31a3b53c4
Merge pull request #122 from MarginaliaSearch/fetch-rss-feeds
...
Automatic RSS feed polling
2024-11-10 18:35:28 +01:00
Viktor Lofgren
a456ec9599
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0
(feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished
2024-11-10 17:45:20 +01:00
Viktor Lofgren
e24a98390c
(feed) Update API to allow specifying clean vs refresh update
...
Move the logic deciding which operation to perform into the actor, updating its state graph to incorporate a counter that runs a clean update once in a blue moon.
2024-11-09 18:43:47 +01:00
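A minimal sketch of the counter idea in the commit above. The class name `UpdateModeCounter` and the configurable interval are hypothetical; the actor's real state graph and the actual interval are not shown in this log.

```java
/** Hypothetical sketch: returns CLEAN on every Nth run and REFRESH otherwise. */
final class UpdateModeCounter {
    enum Mode { CLEAN, REFRESH }

    private final int cleanInterval; // assumed parameter, not taken from the source
    private int runs = 0;

    UpdateModeCounter(int cleanInterval) {
        if (cleanInterval < 1) {
            throw new IllegalArgumentException("cleanInterval must be >= 1");
        }
        this.cleanInterval = cleanInterval;
    }

    /** Returns the mode for the next run and advances the counter. */
    Mode next() {
        runs++;
        if (runs >= cleanInterval) {
            runs = 0; // start counting towards the next clean update
            return Mode.CLEAN;
        }
        return Mode.REFRESH;
    }
}
```

Keeping the decision in one small counter means the caller only asks for the next mode, which matches the commit's intent of moving the choice into the actor rather than the service.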
Viktor Lofgren
6f858cd627
(feed) Decrease update interval to 24 hours
2024-11-09 18:17:51 +01:00
Viktor Lofgren
a293266ccd
(feed) Wipe the feeds db and start over from system URLs periodically.
2024-11-09 18:17:16 +01:00
Viktor Lofgren
b8e0dc93d7
(search) Correctly show the feeds view when items are present
...
... otherwise show samples. This commit also removes the (Experimental) bit, as this is getting fairly mature.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
d774c39031
(feeds) Reduce log spam
2024-11-09 17:56:43 +01:00
Viktor Lofgren
ab17af99da
(feeds) Refresh the feed db using the previous db, when it is available.
2024-11-09 17:56:43 +01:00
Viktor Lofgren
b0ac3c586f
(feeds) Correct parallelism using SimpleBlockingThreadPool
2024-11-09 17:56:43 +01:00
Viktor Lofgren
139fa85b18
(feeds) Add working heartbeat tracking progress
2024-11-09 17:56:43 +01:00
Viktor Lofgren
bfeb9a4538
(feeds) Retire feedlot the feed bot, move RSS capture into the live-capture service
2024-11-09 17:56:43 +01:00
Viktor
3d6c79ae5f
Merge pull request #121 from MarginaliaSearch/headless-setup
...
Headless deterministic setup
2024-11-08 13:50:54 +01:00
Viktor Lofgren
c9e9f73ea9
(setup) Break out installation action into non-interactive script
2024-11-08 13:38:40 +01:00
Viktor Lofgren
80e482b155
(setup) Add progress bar to downloads for better feedback
2024-11-08 13:38:40 +01:00
Viktor Lofgren
9351593495
(setup) Use huggingface for versioned hosting of language models
2024-11-08 13:38:40 +01:00
Viktor Lofgren
d74436f546
(setup) Use checksums for rdrpostagger and opennlp files
...
Also use versioned URLs for rdrpostagger
2024-11-08 13:38:40 +01:00
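A rough illustration of checksum verification for a downloaded file. The log does not say which digest the setup script uses, so SHA-256 is an assumption here, and `ChecksumCheck` is a hypothetical helper rather than code from the repository.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

/** Hypothetical helper: compares a SHA-256 digest against an expected hex string. */
final class ChecksumCheck {
    static String sha256Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            return HexFormat.of().formatHex(md.digest(data));
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be supported by every Java platform implementation
            throw new IllegalStateException(e);
        }
    }

    /** True when the data hashes to the expected checksum (hex, case-insensitive). */
    static boolean matches(byte[] data, String expectedHex) {
        return sha256Hex(data).equalsIgnoreCase(expectedHex);
    }
}
```

Pinning a checksum next to a versioned URL means a changed or corrupted upstream file is rejected instead of being used silently.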
Viktor Lofgren
76e9053dd0
(setup) Move some file-downloads from setup script to the first boot of the control node of the system
...
We can only do this for files that are not required for unit tests.
As it is illegal to run more than one instance of the control service, this should be fine with regard to race conditions. The boot orchestration will also ensure that no other services will boot up before the downloading is complete.
2024-11-06 15:28:20 +01:00
Viktor Lofgren
dbb8bcdd8e
(crawler) Use a better hashInt implementation in CrawlDataReference
...
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8
(crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris
2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70
(crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
...
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556
(crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains
2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b
(link-parser) Make mailing list blocking optional
2024-10-15 13:48:32 +02:00
Viktor Lofgren
ab486323f2
(converter) Increase the number of links the converter will pick up per document
2024-10-15 13:46:19 +02:00
Viktor Lofgren
6460c11107
(index) Short-circuit rankResults when there are no results
2024-10-14 13:47:35 +02:00
Viktor Lofgren
89f7f3c17c
(query-parser) Fix regression where advice terms weren't parsed properly
2024-10-14 13:46:37 +02:00
Viktor Lofgren
fe800b3af7
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:49:39 +02:00
Viktor Lofgren
db5faeceee
(download-sample) Break apart actor for better error recovery
...
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:39:43 +02:00
Viktor Lofgren
45d3e6aa71
(download-sample) Break apart actor for better error recovery
...
Change also adds logged events to give more feedback that something is happening.
2024-10-04 13:19:09 +02:00
Viktor Lofgren
d84a2c183f
(*) Remove the crawl spec abstraction
...
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
A domains listing view was added to the control view a while back, giving direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae
(crawler, EXPERIMENT) Disable content type probing and use Accept header instead
...
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
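A sketch of the general idea of declaring acceptable content types up front instead of probing first. It uses the JDK's `java.net.http` API purely for illustration; the crawler's actual HTTP client and the exact Accept value are not given in this log, so `AcceptHeaderRequest` and the media type list are assumptions.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch: one GET with an Accept header, no separate probe request. */
final class AcceptHeaderRequest {
    // Assumed media types; the real value used by the crawler is not in the log.
    static final String ACCEPT = "text/html, application/xhtml+xml;q=0.9, text/plain;q=0.8";

    static HttpRequest build(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Accept", ACCEPT)
                .GET()
                .build();
    }
}
```

The trade-off is that a server may ignore the header, so the response's Content-Type still has to be checked after the fetch.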
Viktor Lofgren
90a2d4ae38
(index) Fix partial buffer writing in PrioDocIdsTransformer
...
Ensure all data is written to writeChannel by looping until the buffer is fully drained. This prevents potential data loss during the close operation and maintains data integrity.
2024-09-29 17:53:40 +02:00
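The drain loop described above can be sketched as follows. `ChannelWrites.writeFully` is a hypothetical helper, not the code in PrioDocIdsTransformer; it relies only on the documented behaviour that `WritableByteChannel.write` may write fewer bytes than remain in the buffer.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

/** Hypothetical helper illustrating a write loop that tolerates partial writes. */
final class ChannelWrites {
    /** Keeps calling write() until the buffer has no remaining bytes. */
    static void writeFully(WritableByteChannel channel, ByteBuffer buffer) throws IOException {
        while (buffer.hasRemaining()) {
            channel.write(buffer); // may consume only part of the buffer per call
        }
    }
}
```

A single `channel.write(buffer)` call followed by `close()` can drop whatever the channel did not accept, which is the kind of loss the commit guards against.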
Viktor Lofgren
2b8ab97ec1
(bit-writer) Do not clear buffer when creating a bit writer
2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12
(sequence) Return Integer.MAX_VALUE for empty position lists.
...
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and addresses edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
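The log does not name the method that was changed, so the following is only an illustration of the convention: a hypothetical `minSpan` that finds the smallest window covering one position from each sorted list, and returns Integer.MAX_VALUE when any list is empty so that "no match" never looks like a perfect distance of 0.

```java
/** Hypothetical sketch of the empty-list convention; not the repository's method. */
final class PositionLists {
    /** Smallest (max - min) over one position per list; lists must be sorted ascending. */
    static int minSpan(int[]... lists) {
        if (lists.length == 0) return Integer.MAX_VALUE;
        for (int[] list : lists) {
            if (list.length == 0) return Integer.MAX_VALUE; // empty list: no valid window
        }
        int[] idx = new int[lists.length];
        int best = Integer.MAX_VALUE;
        while (true) {
            int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE, minList = 0;
            for (int i = 0; i < lists.length; i++) {
                int v = lists[i][idx[i]];
                if (v < min) { min = v; minList = i; }
                if (v > max) max = v;
            }
            best = Math.min(best, max - min);
            // Advance the list holding the smallest position; stop when it runs out.
            if (++idx[minList] >= lists[minList].length) return best;
        }
    }
}
```

Returning a large sentinel keeps a ranking function that prefers small distances from rewarding documents where a term has no positions at all.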
Viktor Lofgren
69d99c91dd
(index) Optimize buffer handling in PrioDocIdsTransformer
2024-09-29 17:20:49 +02:00
Viktor Lofgren
a8cc98a0f6
(index) Fix write offset calculation in PrioDocIdsTransformer
...
Adjust the write offset calculation by adding the position of the write buffer. Updated the test to validate the transformation process and ensure correctness of output file positions.
2024-09-29 17:20:29 +02:00
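One way to read the fix above, as a sketch under assumptions: when output is staged in a buffer, the offset of the next byte in the file is the number of bytes already flushed plus the buffer's current position. `WriteOffsets` and its parameter names are hypothetical and not taken from PrioDocIdsTransformer.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of an offset calculation that accounts for buffered bytes. */
final class WriteOffsets {
    /** Offset in the output file at which the next buffered byte will land. */
    static long currentOffset(long bytesAlreadyFlushed, ByteBuffer writeBuffer) {
        // Bytes sitting in the buffer have not reached the file yet,
        // but they still occupy the positions directly after the flushed data.
        return bytesAlreadyFlushed + writeBuffer.position();
    }
}
```

Omitting the buffer position would make every recorded offset lag by however much data is still waiting to be flushed.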
Viktor Lofgren
2ee58f4bc9
(index) Adjust ranking parameters to dial down the importance of tcfProximity and firstPosition
2024-09-29 15:33:12 +02:00
Viktor Lofgren
938431e514
(scrape-feeds-actor) Add deduplication of insertion data
...
To avoid unnecessary db churn, the domains to be added are put in a set instead of a list, ensuring that they are unique.
2024-09-28 14:41:14 +02:00
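A small sketch of the set-instead-of-list approach from the commit above. `DomainDedup` is a hypothetical name, and the choice of `LinkedHashSet` (to keep first-seen order) is an assumption; the log only says a set is used.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical sketch: collapse repeated domains before inserting them. */
final class DomainDedup {
    /** Returns each domain once, in first-seen order. */
    static Set<String> uniqueDomains(Collection<String> scraped) {
        return new LinkedHashSet<>(scraped);
    }
}
```

With duplicates removed before the insert, each domain results in at most one database write per scrape.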
Viktor Lofgren
b2de3c70fa
(scrape-feeds-actor) Add explicit commit in case it's disabled
2024-09-28 14:36:57 +02:00
Viktor Lofgren
542690d9f6
(search-service) Hide pagination when there is only 1 page of results
2024-09-28 13:48:09 +02:00
Viktor Lofgren
596a7fb4ea
(actor) Disable the feed scraper on all nodes but the first
2024-09-28 12:36:16 +02:00
Viktor Lofgren
c3f726a01f
(actor) Add a feed scraping actor
...
Add a new actor that polls a URL every 6 hours and amends the domain database with any unseen domains, flagging them to be crawled by the next crawl job.
The URLs are specified in data/scrape-urls.txt. If this file is absent, the actor shuts down.
2024-09-28 12:33:29 +02:00
Viktor Lofgren
4538ade156
(live-capture) Add readme to live-capture function
2024-09-28 11:35:46 +02:00