MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	8b9629f2f6	(crawler) Remove unnecessary double-fetch of the root document	2024-04-24 14:38:59 +02:00
Viktor Lofgren	f6db16b313	(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber	2024-04-24 14:10:03 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences have been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no fewer than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	dcf9d9caad	(crawler) Emulate if-modified-since for domains that don't support the header This will help reduce the strain on some server software, in particular Discourse.	2024-04-22 17:26:31 +02:00
Viktor Lofgren	7a69b76001	(crawler) Remove accidental log spam	2024-04-22 15:51:37 +02:00
Viktor Lofgren	ac07ef822f	(crawler) Code quality	2024-04-22 15:37:35 +02:00
Viktor Lofgren	e7d4bcd872	(crawler) Use the probe-result to reduce the likelihood of crawling both http and https This should drastically reduce the number of fetched documents on many domains	2024-04-22 15:36:43 +02:00
Viktor Lofgren	a28c6d7cfe	(crawler) Strip W/-prefix from the etag when supplied as If-None-Match	2024-04-22 14:31:05 +02:00
Viktor Lofgren	d816f048f5	(crawler) Ensure all appropriate headers are recorded on the request	2024-04-22 14:14:24 +02:00
Viktor Lofgren	b09ddd0036	(crawler/converter) Remove legacy junk from parquet migration	2024-04-22 12:34:28 +02:00
Viktor Lofgren	214551f1df	(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.	2024-04-19 20:36:01 +02:00
Viktor Lofgren	2353c73c57	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-16 12:10:13 +02:00
Viktor Lofgren	bd0704d5a4	(*) Fix JDK22 migration issues A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	d8f4e7d72b	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-03-19 10:42:09 +01:00
Viktor Lofgren	46423612e3	(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.	2024-03-03 10:49:23 +01:00
Viktor Lofgren	29bf473d74	(encyclopedia) Add URLencoding to path element This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.	2024-03-01 17:28:09 +01:00
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	e696fd9e92	(docs) Begin un-fucking the docs after refactoring	2024-02-27 21:22:21 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	f4ff7185f0	(refac) Move process-mqapi out of api directory	2024-02-23 11:18:29 +01:00
Viktor Lofgren	f8e7f75831	Move index to top level of code	2024-02-22 18:01:35 +01:00
Viktor Lofgren	085137ca63	* Extract the index functionality	2024-02-22 17:31:25 +01:00
Viktor Lofgren	66c1281301	(zk-registry) epic yak shaving WIP Cleaning out a lot of old junk from the code, and one thing led to another... * Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds. * The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs. * Project is migrated to GraalVM * gRPC clients are re-written with a neat fluent/functional style. e.g. ```channelPool.call(grpcStub::method) .async(executor) // <-- optional .run(argument); ``` This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. * For now the project is all in on zookeeper * Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh. * To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP! Missing is documentation and testing, and some more breaking apart of code.	2024-02-22 14:01:23 +01:00
Viktor Lofgren	c600d7aa47	(refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator	2024-02-20 15:42:32 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor Lofgren	c73e43f5c9	(recrawl) Mitigate recrawl-before-load footgun In the scenario where an operator * Performs a new crawl from spec * Doesn't load the data into the index * Recrawls the data The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the crawl log and making it impossible to load! To mitigate the impact of similar problems, the change saves a backup of the old crawl log, as well as complains about this happening. More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state. This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	cee707abd8	(crawler) Implement domain shuffling in DbCrawlSpecProvider Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.	2024-02-17 17:47:38 +01:00
Viktor Lofgren	37a7296759	(sideload) Clean up the sideloading code Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach. The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents. The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.	2024-02-17 14:32:36 +01:00
Viktor Lofgren	dcc5cfb7c0	(index-journal) Improve documentation and code quality	2024-02-15 10:51:49 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will shortcircuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	02dd5c5853	(converter) Look at properties when deciding pool size Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter. If true, a much more conservative default is used, limiting the risk of running out of memory.	2024-02-12 16:24:19 +01:00
Viktor Lofgren	9d68062553	(converter) Make processing pool size configurable	2024-02-10 20:59:08 +01:00
Viktor Lofgren	e66d0b7431	(warc) Minor code clean-up. Remove redundant String$getBytes(). This is mainly an improvement in code consistency.	2024-02-10 18:30:33 +01:00
Viktor Lofgren	929caed0b9	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 20:07:01 +01:00
Viktor Lofgren	8340aa2b6c	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 17:29:21 +01:00
Viktor Lofgren	467ba5be20	(index-construction) Split repartition into two actions This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after... To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one. The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader. Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data. Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead. To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.	2024-02-06 17:20:07 +01:00
Viktor Lofgren	29ddf9e61d	(doc) Update docs	2024-02-06 16:29:55 +01:00
Viktor Lofgren	7286596fb4	(deps) Remove monkey patched GSON The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data. Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.	2024-02-06 12:11:39 +01:00
Viktor Lofgren	fa145f632b	(sideload) Add special handling for sideloaded wiki documents This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which generally tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.	2024-02-02 21:22:07 +01:00
Viktor Lofgren	785d8deadd	(crawler) Improve meta-tag redirect handling, add tests for redirects. Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file. This works as intended. Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier. Added logic to handle this case, amended the test case to verify the new behavior. Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.	2024-02-01 20:30:43 +01:00
Viktor Lofgren	93a2d5afbf	(*) Fix poorly named test Likely old refactoring gore.	2024-02-01 20:08:15 +01:00
Viktor Lofgren	d60c6b18d4	(doc) Update the readmes for the crawler, as they've grown stale.	2024-02-01 18:10:55 +01:00
Viktor Lofgren	52a0255814	() Add flag for disabling ASCII flattening The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an experimental* system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	3fff7f6878	(converter) Fix issue where quality limits were no longer enforced	2024-01-23 11:42:17 +01:00
Viktor Lofgren	41d896ba3e	(converter) Refactor content type check in PlainTextDocumentProcessorPlugin The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate contentTypes that append a charset as well.	2024-01-22 17:52:14 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	91c7960800	(crawler) Extract additional configuration properties This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties. The documentation is updated to reflect the change. Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.	2024-01-20 10:36:04 +01:00
Viktor Lofgren	27ffb8fa8a	(converter) Integrate zim->db conversion into automatic encyclopedia processing workflow Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically. The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.	2024-01-19 13:59:03 +01:00
Viktor Lofgren	22c8fb3f59	(crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity. This can be removed in a few months.	2024-01-18 16:02:27 +01:00

345 Commits