Commit Graph

190 Commits

Author SHA1 Message Date
Viktor Lofgren
3b99cffb3d (link-parser) Filter out URLs with binary file suffixes in LinkParser
Added an additional filter step to ensure URLs with binary suffixes are excluded during crawling. This prevents unnecessary processing of non-HTML content, improving the efficiency of the link parsing process.
2024-12-11 16:42:47 +01:00
Viktor Lofgren
14519294d2 Merge branch 'master' into live-search 2024-11-21 16:00:20 +01:00
Viktor Lofgren
51e46ad2b0 (refac) Move export tasks to a process and clean up process initialization for all ProcessMainClass descendents
Since some of the export tasks have been memory hungry, sometimes killing the executor-services, they've been moved to a separate process that can be given a larger Xmx.

While doing this, the ProcessMainClass was given utilities for the boilerplate surrounding receiving mq requests and responding to them, some effort was also put toward making the process boot process a bit more uniform.  It's still a bit heterogeneous between different processes, but a bit less so for now.
2024-11-21 16:00:09 +01:00
Viktor Lofgren
665c8831a3 (model) Fix resource leak in partially read crawl data streams.
Ensuring proper resource management by closing the underlying stream in the `close` method to prevent potential resource leaks.
2024-11-20 19:29:13 +01:00
Viktor Lofgren
79ce4de2ab (model) Remove deprecated fields from CrawledDocument and CrawledDomain 2024-11-20 15:27:05 +01:00
Viktor Lofgren
a91ab4c203 (live-crawler) Crude first-try process for live crawling #WIP
Some refactoring is still needed, but an dummy actor is in place and a process that crawls URLs from the livecapture service's RSS endpoints; that makes it all the way to being indexable.
2024-11-19 19:35:01 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
dbb8bcdd8e (crawler) Use a better hashInt implementation in CrawlDataReference
Guava's hash functions are slow as hell.
2024-10-15 18:25:55 +02:00
Viktor Lofgren
7305afa0f8 (crawler) Clean up the crawler code a bit, removing vestigial abstractions and historical debris 2024-10-15 17:27:59 +02:00
Viktor Lofgren
481f999b70 (crawler) Make DomainCrawlFrontier a bit less aggressive with throwing away excess links when it's approaching full.
Also be a bit smarter about pre-allocating queues and sets based on depth rather than the number of provided URLs, which was always zero outside of tests.
2024-10-15 14:22:40 +02:00
Viktor Lofgren
4b16022556 (crawler) Correct Spec Provider so that it uses VISITED_URLS rather than KNOWN_URLS when growing domains 2024-10-15 14:21:59 +02:00
Viktor Lofgren
89dd201a7b (link-parser) Make mailing list blocking optional 2024-10-15 13:48:32 +02:00
Viktor Lofgren
fe800b3af7 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729 (crawler) Properly enqueue links from the root document in the crawler 2024-10-05 17:49:39 +02:00
Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

Since a while back, a new domains listing view has been added to the control view that allows direct access to the domains table.  This is much preferred and means the operator can directly manage domains without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae (crawler, EXPERIMENT) Disable content type probing and use Accept header instead
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
4565bfe359 (crawler) Make the crawler report crawling progress correctly when stopped and resumed. 2024-09-26 18:30:29 +02:00
Viktor Lofgren
40512511af (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor Lofgren
162fc25ebc (minor) Fix accidental commit errors 2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c (crawler) Refactor
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in retry-after header that assumed the timeout was in milliseconds, and then clamped it to a lower bound of 500ms, meaning this was almost always handled wrong
2024-09-23 17:51:07 +02:00
Viktor Lofgren
9c292a4f62 (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
a3b0189934 Fix build errors after merge 2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8 Merge branch 'master' into term-positions
# Conflicts:
#	code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
#	code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
#	code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
#	code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
74148c790e (crawler) Pull additional new domains from node-affinity 0
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110 (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
Viktor Lofgren
8d0f9652c7 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650 (crawler) Grab favicons as part of root sniff 2024-08-31 11:32:56 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To let up the pressure on domains with lot sof subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429:s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
b4eac2516e (crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results 2024-06-02 16:30:34 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
70e2e41955 (crawler) Content type prober should not swallow exceptions 2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc (crawler) Modify crawl set growth to grow small domains faster than larger ones 2024-04-27 17:36:27 +02:00