Commit Graph

223 Commits

Viktor Lofgren
d84a2c183f (*) Remove the crawl spec abstraction
The crawl spec abstraction was used to upload lists of domains into the system for future crawling.  This was fairly clunky, and it was difficult to understand what was going to be crawled.

For some time now, the control view has had a domains listing view that allows direct access to the domains table.  This is much preferred, as it lets the operator manage domains directly, without specs.

This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae (crawler, EXPERIMENT) Disable content type probing and use Accept header instead
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
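A minimal sketch of the idea, using java.net.http rather than the crawler's actual HTTP client; the Accept value and content-type check are illustrative:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class AcceptHeaderFetch {
    static final HttpClient client = HttpClient.newHttpClient();

    // Instead of a separate probe request to establish the content type,
    // declare what we can process up front and verify the response after.
    static HttpResponse<byte[]> fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("Accept", "text/html, application/xhtml+xml;q=0.9, */*;q=0.1")
                .GET()
                .build();
        var rsp = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        String contentType = rsp.headers().firstValue("Content-Type").orElse("");
        if (!contentType.startsWith("text/html") && !contentType.startsWith("application/xhtml"))
            throw new IllegalStateException("Unexpected content type: " + contentType);
        return rsp;
    }
}
```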
2024-09-30 14:53:01 +02:00
Viktor Lofgren
4565bfe359 (crawler) Make the crawler report crawling progress correctly when stopped and resumed. 2024-09-26 18:30:29 +02:00
Viktor Lofgren
40512511af (crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor Lofgren
162fc25ebc (minor) Fix accidental commit errors 2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c (crawler) Refactor
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix a bug in retry-after handling that assumed the timeout was in milliseconds and then clamped it to a lower bound of 500 ms, meaning the header was almost always handled wrong (see the sketch below)
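For reference, a sketch of the corrected reading: per the HTTP spec, Retry-After carries delta-seconds or an HTTP-date, never milliseconds. The fallback handling here is an assumption:

```java
import java.time.Duration;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

class RetryAfter {
    // Retry-After is delta-seconds or an HTTP-date, not milliseconds.
    static Duration parse(String value, Duration fallback) {
        try {
            return Duration.ofSeconds(Long.parseLong(value.trim()));
        } catch (NumberFormatException e) {
            try { // e.g. "Fri, 31 Dec 1999 23:59:59 GMT"
                var date = ZonedDateTime.parse(value, DateTimeFormatter.RFC_1123_DATE_TIME);
                return Duration.between(ZonedDateTime.now(date.getZone()), date);
            } catch (DateTimeParseException e2) {
                return fallback;
            }
        }
    }
}
```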
2024-09-23 17:51:07 +02:00
Viktor Lofgren
9c292a4f62 (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
a3b0189934 Fix build errors after merge 2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8 Merge branch 'master' into term-positions
# Conflicts:
#	code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
#	code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
#	code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
#	code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
#	code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
74148c790e (crawler) Pull additional new domains from node-affinity 0
Previously a bit ambiguously defined, node affinity 0 now indicates that a domain is up for grabs for the next crawler.
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110 (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
Viktor Lofgren
8d0f9652c7 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650 (crawler) Grab favicons as part of root sniff 2024-08-31 11:32:56 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only one lock per domain means the crawler chokes hard on large hosting sites such as WordPress.  These are now given a slightly larger allowance.
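Roughly the shape of the mechanism, as a sketch; the permit counts and the host list are made up for illustration:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

class DomainLocks {
    // One semaphore per top domain; most domains get a single permit,
    // big hosting platforms get a slightly larger allowance.
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();
    private static final Set<String> BIG_HOSTS = Set.of("wordpress.com", "substack.com");

    Semaphore lockFor(String topDomain) {
        return locks.computeIfAbsent(topDomain,
                d -> new Semaphore(BIG_HOSTS.contains(d) ? 4 : 1));
    }
}
```

Each crawl task would acquire the semaphore before fetching from a subdomain and release it when done.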
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To let up the pressure on domains with lots of subdomains, such as Substack, Medium, Neocities, etc., a per-domain mutex is added that limits crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
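Schematically, with all constants assumed rather than taken from the code:

```java
class RevisitBudget {
    // Scale the recrawl effort by how much the site changed last time,
    // clamped between a floor and a ceiling derived from the crawl set size.
    static int recrawlBudget(int crawlSetSize, double changeRatio) {
        int lower = Math.max(10, crawlSetSize / 10); // assumed floor
        int upper = crawlSetSize;                    // assumed ceiling
        int scaled = (int) (crawlSetSize * changeRatio);
        return Math.min(upper, Math.max(lower, scaled));
    }
}
```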
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
b4eac2516e (crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results 2024-06-02 16:30:34 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
b867eadbef (big-string) Remove the unused bigstring library 2024-05-18 13:40:03 +02:00
Viktor Lofgren
70e2e41955 (crawler) Content type prober should not swallow exceptions 2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc (crawler) Modify crawl set growth to grow small domains faster than larger ones 2024-04-27 17:36:27 +02:00
Viktor Lofgren
7eb5e6aa66 (crawler) Abort recrawl if error count is too high 2024-04-24 21:46:40 +02:00
Viktor Lofgren
8b9629f2f6 (crawler) Remove unnecessary double-fetch of the root document 2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313 (crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber 2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences have been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad (crawler) Emulate if-modified-since for domains that don't support the header
This will help reduce the strain on some server software, in particular Discourse.
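The gist as a sketch: keep the Last-Modified value from the previous crawl, and synthesize the 304 locally when the server ignores If-Modified-Since. Names here are illustrative:

```java
import java.net.http.HttpResponse;

class IfModifiedSinceEmulation {
    // If the server returns 200 with the same Last-Modified we already have,
    // treat the document as unchanged, as if it had answered 304.
    static boolean effectivelyNotModified(String previousLastModified, HttpResponse<?> rsp) {
        if (rsp.statusCode() == 304)
            return true;
        return previousLastModified != null
            && previousLastModified.equals(
                   rsp.headers().firstValue("Last-Modified").orElse(null));
    }
}
```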
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001 (crawler) Remove accidental log spam 2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f (crawler) Code quality 2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872 (crawler) Use the probe-result to reduce the likelihood of crawling both http and https
This should drastically reduce the number of fetched documents on many domains
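One plausible reading of this, as a sketch: if the root probe settled on https, rewrite http frontier links to https before enqueueing them, so each document is fetched only once:

```java
import java.net.URI;
import java.net.URISyntaxException;

class ProbedScheme {
    // Align a discovered link with the scheme the probe settled on.
    static URI withProbedScheme(URI link, URI probedRoot) {
        if (link.getScheme().equals(probedRoot.getScheme()))
            return link;
        try {
            return new URI(probedRoot.getScheme(), link.getAuthority(),
                           link.getPath(), link.getQuery(), link.getFragment());
        } catch (URISyntaxException e) {
            return link; // leave odd URLs untouched
        }
    }
}
```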
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe (crawler) Strip W/-prefix from the etag when supplied as If-None-Match 2024-04-22 14:31:05 +02:00
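The transformation is tiny; as a sketch:

```java
class Etags {
    // Some servers won't match a weak validator in If-None-Match,
    // so the W/ prefix is stripped before the etag is echoed back.
    static String forIfNoneMatch(String etag) {
        return etag.startsWith("W/") ? etag.substring(2) : etag;
    }
}
```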
Viktor Lofgren
d816f048f5 (crawler) Ensure all appropriate headers are recorded on the request 2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036 (crawler/converter) Remove legacy junk from parquet migration 2024-04-22 12:34:28 +02:00
Viktor Lofgren
fe8d583fdd (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
46423612e3 (refac) Merge service-discovery and service modules
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
f4ff7185f0 (refac) Move process-mqapi out of api directory 2024-02-23 11:18:29 +01:00
Viktor Lofgren
f8e7f75831 Move index to top level of code 2024-02-22 18:01:35 +01:00
Viktor Lofgren
085137ca63 * Extract the index functionality 2024-02-22 17:31:25 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
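The registry boundary might look something like this; the names below are illustrative, not the actual Marginalia interface:

```java
import java.net.InetSocketAddress;
import java.util.List;

// One implementation backs onto Zookeeper; another hard-codes
// docker-compose hostnames for the simple single-machine case.
public interface ServiceRegistry {
    void register(String serviceId, int instance, InetSocketAddress endpoint);
    List<InetSocketAddress> discover(String serviceId);
}
```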
2024-02-20 11:41:14 +01:00
Viktor Lofgren
c73e43f5c9 (recrawl) Mitigate recrawl-before-load footgun
In the scenario where an operator

* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data

The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the old crawl log and making the data impossible to load!

To mitigate the impact of similar problems, the change saves a backup of the old crawl log, and complains loudly when this happens.

More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl.  This should help the DbCrawlSpecProvider to find them regardless of loaded state.

This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so the extra safety is merited.
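The backup half could be as simple as this sketch (the file name is assumed):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CrawlLogBackup {
    // Keep a timestamped copy of the old crawl log before it is rewritten.
    static void backup(Path crawlLog) throws IOException {
        if (Files.exists(crawlLog)) {
            Path bak = crawlLog.resolveSibling(
                    crawlLog.getFileName() + "." + System.currentTimeMillis() + ".bak");
            Files.copy(crawlLog, bak, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```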
2024-02-18 09:23:20 +01:00
Viktor Lofgren
cee707abd8 (crawler) Implement domain shuffling in DbCrawlSpecProvider
Modified the DbCrawlSpecProvider to shuffle domains after loading, to ensure a good mix for each crawl. This change prevents overloading the same server by crawling it in parallel from different subdomains, and avoids crawling big domains all at once.
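The core of it is little more than this (sketch):

```java
import java.util.Collections;
import java.util.List;

class CrawlOrder {
    // Loading in table order clusters subdomains of the same server together;
    // shuffling spreads them out across the crawl run.
    static void shuffle(List<String> domains) {
        Collections.shuffle(domains);
    }
}
```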
2024-02-17 17:47:38 +01:00
Viktor Lofgren
e66d0b7431 (warc) Minor code clean-up.
Remove redundant String$getBytes().  This is mainly an improvement in code consistency.
2024-02-10 18:30:33 +01:00
Viktor Lofgren
929caed0b9 (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
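A sketch of the shape of the fix: strip the Content-Encoding before the body reaches the WARC writer, so record-level gzip isn't wrapping an already-gzipped payload:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

class TransparentBody {
    // WARC payloads should be stored with content codings removed.
    static InputStream decoded(String contentEncoding, InputStream body) throws IOException {
        if ("gzip".equalsIgnoreCase(contentEncoding))
            return new GZIPInputStream(body);
        return body;
    }
}
```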
2024-02-09 20:07:01 +01:00
Viktor Lofgren
8340aa2b6c (warc) Improve WARC standard adherence
The WARC specification says the records should transparently remove compression.  This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
2024-02-09 17:29:21 +01:00
Viktor Lofgren
785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
Viktor Lofgren
93a2d5afbf (*) Fix poorly named test
Likely old refactoring gore.
2024-02-01 20:08:15 +01:00
Viktor Lofgren
d60c6b18d4 (doc) Update the crawler's readmes, as they've grown stale. 2024-02-01 18:10:55 +01:00
Viktor Lofgren
91c7960800 (crawler) Extract additional configuration properties
This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties.

The documentation is updated to reflect the change.

Dead code was also removed in the process. CrawlSpecGenerator still feels a bit over-engineered, since it was built for a more general case whose other implementations have since been removed, but we'll leave it as it is for now, as it remains fairly readable.
2024-01-20 10:36:04 +01:00
Viktor Lofgren
22c8fb3f59 (crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified
This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity.  This can be removed in a few months.
2024-01-18 16:02:27 +01:00
Viktor Lofgren
fd1eec99b5 (cleanup) Fix broken tests 2024-01-15 15:44:33 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentation to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
0caef1b307 (warc) Toggle for saving WARC data
Add a toggle for saving the WARC data generated by the search engine's crawler.  Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.

The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.

The warc data is saved in a directory warc/ under the crawl data storage.
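The index entries presumably carry something along these lines; this shape is hypothetical, not the actual on-disk format:

```java
// One entry per archived warc segment: which file it came from, which
// domain it covers, and where it sits in the concatenated archive.
record WarcIndexEntry(String filename, String domain, long offset, long size) {}
```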
2024-01-12 13:45:14 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered so that the domain record comes first, followed by a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans ahead for the domain record before running through the sequence of document records, presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
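In outline, ignoring the streaming aspect and with a stand-in for the real model type:

```java
import java.util.ArrayList;
import java.util.List;

class CompatibleReader {
    // Old zstd/gson data may put the domain record anywhere in the file;
    // find it first, then replay the document records in their original order.
    // (The real reader streams the file twice; loading everything keeps the sketch short.)
    static List<Object> domainFirst(List<Object> records) {
        List<Object> out = new ArrayList<>(records.size());
        records.stream().filter(CompatibleReader::isDomainRecord).findFirst().ifPresent(out::add);
        records.stream().filter(r -> !isDomainRecord(r)).forEach(out::add);
        return out;
    }

    static boolean isDomainRecord(Object r) {
        return r instanceof DomainRecord;
    }

    record DomainRecord() {} // stand-in for the real crawl data model class
}
```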
2024-01-08 19:16:49 +01:00
Viktor Lofgren
75d87c73d1 (crawler) Disable Java's infinite DNS caching 2023-12-31 16:59:08 +01:00
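For reference, the standard knob for this; the 60-second TTL is an arbitrary choice:

```java
class DnsCacheConfig {
    static void configure() {
        // By default the JVM can cache successful DNS lookups indefinitely
        // (notably when a security manager is installed); cap the TTL instead.
        java.security.Security.setProperty("networkaddress.cache.ttl", "60");
    }
}
```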
Viktor Lofgren
0fe44c9bf2 (crawler) Fix broken test
A necessary step was accidentally deleted when cleaning up these tests previously.
2023-12-30 13:56:44 +01:00
Viktor Lofgren
0b112cb4d4 (warc) Update URL encoding in WarcProtocolReconstructor
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
2023-12-29 19:41:37 +01:00
Viktor Lofgren
00a974a721 (crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions
This commit also improves the test coverage for this part of the code.
2023-12-27 20:02:17 +01:00
Viktor Lofgren
f811a29f87 (crawler) Fix resource leak in crawler
A 10 MB thread local buffer wasn't static.  Oops.
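The fix in miniature:

```java
class BufferHolder {
    // Non-static, every instance pins its own 10 MB buffer per thread;
    // static, there is one buffer per thread regardless of instance count.
    private static final ThreadLocal<byte[]> BUFFER =
            ThreadLocal.withInitial(() -> new byte[10 * 1024 * 1024]);

    static byte[] buffer() {
        return BUFFER.get();
    }
}
```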
2023-12-27 16:32:17 +01:00
Viktor Lofgren
9e5fe71f5b (crawler) Switch hash function in crawler
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler.   This switches to the modified Murmur hash function used throughout Marginalia.
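To illustrate the flavor of the change, an allocation-free hash over a CharSequence; this is FNV-1a, not the actual modified Murmur used in Marginalia:

```java
class CheapHash {
    // Hashes directly off the chars; no byte[] copies, no Hasher objects.
    static long hash(CharSequence s) {
        long h = 0xcbf29ce484222325L;          // FNV offset basis
        for (int i = 0; i < s.length(); i++) {
            h ^= s.charAt(i);
            h *= 0x100000001b3L;               // FNV prime
        }
        return h;
    }
}
```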
2023-12-27 13:29:00 +01:00
Viktor Lofgren
3ea1ddae22 (crawler) Roll back switch to virtual thread pool in crawler
This seems to cause a resource leak; possibly the HTTP library uses thread locals?
2023-12-26 19:37:34 +01:00
Viktor Lofgren
25d086c4e1 (crawler) Clean up stale warc files
We should probably have an option to keep them, but not by default!
2023-12-25 15:07:36 +01:00
Viktor Lofgren
f779f760c4 (crawler) Even more lenient resyncing 2023-12-25 01:44:18 +01:00
Viktor Lofgren
f18f82e229 (crawler) Write etags and last-modified on reference copy
This commit also fixes a test that broke with a previous change.
2023-12-25 01:40:13 +01:00
Viktor Lofgren
67ef2b45fa (crawler) Reduce logging 2023-12-25 01:10:03 +01:00
Viktor Lofgren
d72e871265 (warc) Fix resync 2023-12-25 01:03:03 +01:00
Viktor Lofgren
4c9bc13309 (warc) Reduce log spam 2023-12-25 00:58:31 +01:00
Viktor Lofgren
84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK 2023-12-25 00:55:05 +01:00
Viktor Lofgren
c5aab7e8db (warc) Fix NPE in WarcRecorder 2023-12-25 00:54:38 +01:00
Viktor Lofgren
1755b646b8 (warc) Fix NPE in WarcRecorder 2023-12-25 00:48:42 +01:00
Viktor Lofgren
e1a155a9c8 (crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under.  This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl.  GOOD_URLS is based on how many documents are successfully processed, which is typically a fairly small number.  Switching to KNOWN_URLS should let this grow faster.

The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.

The floor is also increased to 250 from 200.
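The described formula in code; how the 1.5x recrawl modifier composes with the floor is an assumption:

```java
class CrawlDepth {
    // New formula: MAX(250, 1.25 x KNOWN_URLS), with a 1.5x bump on recrawls.
    static int crawlLimit(int knownUrls, boolean isRecrawl) {
        int limit = Math.max(250, (int) (1.25 * knownUrls));
        return isRecrawl ? (int) (1.5 * limit) : limit;
    }
}
```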
2023-12-23 13:22:10 +01:00
Viktor Lofgren
a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams 2023-12-20 15:46:21 +01:00
Viktor Lofgren
bfae478251 Refactor CrawlerRevisitor for better consistency 2023-12-20 15:21:49 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add a fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place, yielding something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tags
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
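A simplified sketch of such a filter; the real X-Robots-Tag grammar, with its optional per-bot prefixes, is handled naively here:

```java
import java.util.List;

class XRobotsFilter {
    // Reject a response if any X-Robots-Tag value forbids indexing,
    // either unconditionally or for our user-agent token specifically.
    static boolean isBlocked(List<String> xRobotsTags, String uaToken) {
        for (String tag : xRobotsTags) {
            String t = tag.toLowerCase();
            boolean addressedToUs = !t.contains(":") || t.startsWith(uaToken.toLowerCase() + ":");
            if (addressedToUs && (t.contains("noindex") || t.contains("none")))
                return true;
        }
        return false;
    }
}
```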
2023-12-16 15:58:27 +01:00
Viktor Lofgren
54ed3b86ba (minor) Remove dead code. 2023-12-15 21:49:35 +01:00
Viktor Lofgren
fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies
This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference, use whatever value we last saw.  This isn't 100% deterministic and may result in false negatives, but it permits websites that used cookies and have since stopped to repent, and have the change reflected in the search engine more quickly.
2023-12-15 16:37:53 +01:00
Viktor Lofgren
9fea22b90d (warc) Further tidying
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
2023-12-15 15:38:23 +01:00
Viktor Lofgren
0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, and adds information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
Viktor Lofgren
1328bc4938 (warc) Clean up parquet conversion
This commit cleans up the warc->parquet conversion.  Records with an HTTP status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
2023-12-14 16:05:48 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format for long-term storage of the crawl data, something like parquet, and use WARCs only for intermediate storage, so the crawler can be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
2023-12-11 19:32:58 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified to query this data to add a geoip:-keyword to documents, permitting searches to be limited to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
e6a1052ba7 Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default. 2023-12-08 20:24:01 +01:00
Viktor Lofgren
968dce50fc (crawler) Refactored IpInterceptingNetworkInterceptor for clarity. 2023-12-08 17:45:46 +01:00