MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	0a383a712d	(qdebug) Accurately display positions when intersecting with spans	2024-08-15 11:44:17 +02:00
Viktor Lofgren	75b0888032	(slop) Migrate to latest Slop version	2024-08-14 11:44:35 +02:00
Viktor Lofgren	623ee5570f	(slop) Break slop out into its own repository	2024-08-13 09:50:05 +02:00
Viktor Lofgren	fd2bad39f3	(keyword-extraction) Add body field for terms that are not otherwise part of a field	2024-08-13 09:49:26 +02:00
Viktor Lofgren	680ad19c7d	(keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors	2024-08-06 11:16:56 +02:00
Viktor Lofgren	2080e31616	(converter) Store link text positions To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends. Integrating this information with the ranking is not performed here.	2024-08-04 12:00:29 +02:00
Viktor Lofgren	e48f52faba	(experiment) Add add-hoc filter runner	2024-08-03 13:24:03 +02:00
Viktor Lofgren	4430a39120	(loader) Clean up	2024-08-02 12:32:47 +02:00
Viktor Lofgren	6228f46af1	(loader) Reduce log spam	2024-08-02 12:21:03 +02:00
Viktor Lofgren	ac67b6b5da	(converter) Fix exception handling while reading crawl data	2024-08-02 10:39:49 +02:00
Viktor Lofgren	1a268c24c8	(perf) Reduce DomPruningFilter hash table recalculation	2024-08-01 12:04:55 +02:00
Viktor Lofgren	38e2089c3f	(perf) Code was still spending a lot of time resolving charsets ... in the failure case which wasn't captured by memoization.	2024-08-01 11:58:59 +02:00
Viktor Lofgren	285e657f68	Merge branch 'master' into term-positions # Conflicts: # code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java	2024-07-31 10:44:01 +02:00
Viktor Lofgren	b316b55be9	(index) Experimental initial integration of document spans into index	2024-07-30 12:01:53 +02:00
Viktor Lofgren	80900107f7	(restructure) Clean up repo by moving stray features into converter-process and crawler-process	2024-07-30 10:14:00 +02:00
Viktor Lofgren	7e4efa45b8	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:21:21 +02:00
Viktor Lofgren	86ea28d6bc	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:18:52 +02:00
Viktor Lofgren	34703da144	(slop) Support for nested array types and array-of-object types Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.	2024-07-29 14:00:43 +02:00
Viktor Lofgren	1282f78bc5	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 11:01:18 +02:00
Viktor Lofgren	2d5d965f7f	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 10:34:33 +02:00
Viktor Lofgren	afe56c7cf1	(loader) Tidy up code	2024-07-28 21:36:42 +02:00
Viktor Lofgren	7d51cf882f	(loader) Move rssFeeds to a different column group to avoid errors	2024-07-28 21:30:10 +02:00
Viktor Lofgren	9685993adb	(loader) Add spans to a different column group from spanCodes, as they are not in sync	2024-07-28 21:20:09 +02:00
Viktor Lofgren	261dcdadc8	(loader) Additional tracking for the control GUI	2024-07-28 21:19:45 +02:00
Viktor Lofgren	1caad7e19e	(slop) Update existing code to use the altered Slop interfaces	2024-07-28 13:21:08 +02:00
Viktor Lofgren	6c3abff664	(slop) Move GCS Slop column to the coded-sequence package This lets the slop library be stand-alone without dependence on coded-sequence. The change also gets rid of the vestigial seek() method in ColumnReader.	2024-07-27 13:58:45 +02:00
Viktor Lofgren	dcb43a3308	(slop) Introduce table concept to keep track of positions and simplify closing The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip. The second most common error is forgetting to close one of the columns in a reader or writer. To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.	2024-07-27 13:47:47 +02:00
Viktor Lofgren	ec600b967d	(crawler) Adjust domain locking Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.	2024-07-27 11:54:46 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	60ef826e07	(loader) Add heartbeat to update domain-ids step	2024-07-25 15:08:41 +02:00
Viktor Lofgren	2ad564404e	(loader) Add heartbeat to update domain-ids step	2024-07-23 15:28:52 +02:00
Viktor Lofgren	2bb9f18411	(dld) Refactor DocumentLanguageData Reduce the usage of raw arrays	2024-07-19 12:24:55 +02:00
Viktor Lofgren	22b35d5d91	(sentence-extractor) Add tag information to document language data Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers. The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.	2024-07-18 15:57:48 +02:00
Viktor Lofgren	d36055a2d0	(keyword-extractor) Retire TfIdfHigh WordFlag This will bring the word flags count down to 8, and let us pack every value in a byte.	2024-07-17 13:54:39 +02:00
Viktor Lofgren	accc598967	(crawler) Add 1 second pause after probing domain to reduce request pressure	2024-07-16 16:55:07 +02:00
Viktor Lofgren	02c4a2d4ba	(crawler) Add a per-domain mutex for crawling To let up the pressure on domains with lot sof subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.	2024-07-16 16:44:59 +02:00
Viktor Lofgren	6665e447aa	(crawler) Add crawl delays around probe call and deal with 429:s properly during this phase	2024-07-16 15:33:24 +02:00
Viktor Lofgren	f4d79c203d	(crawler) Adjust revisit logic The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed. Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.	2024-07-16 15:12:38 +02:00
Viktor Lofgren	4d29581ea4	(crawler) Introduce absolute upper limit to crawl depth growth	2024-07-16 14:40:45 +02:00
Viktor Lofgren	0b31c4cfbb	(coded-sequence) Replace GCS usage with an interface	2024-07-16 14:37:50 +02:00
Viktor	8ed5b51a32	Merge branch 'master' into term-positions	2024-07-15 07:05:31 +02:00
Viktor Lofgren	1ab875a75d	(test) Correcting flaky tests Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	85c99ae808	(index-reverse) Split index construction into separate packages for full and priority index	2024-07-06 15:44:47 +02:00
Viktor Lofgren	d86926be5f	(crawl) Add new functionality for re-crawling a single domain	2024-07-05 15:31:55 +02:00
Viktor Lofgren	6ee4d1eb90	(keyword) Increase the work area for position encoding The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.	2024-06-28 16:42:39 +02:00
Viktor Lofgren	dae22ccbe0	(test) Integration test from crawl->query	2024-06-25 22:17:26 +02:00
Viktor Lofgren	0ffbbaf4b9	(crawler) Update WARC builder to use SHA-256 for digests	2024-06-12 09:14:12 +02:00
Viktor Lofgren	6839415a0b	(crawler) Fetch TLS instead of SSL context	2024-06-12 09:07:54 +02:00
Viktor Lofgren	36160988e2	(index) Integrate positions data with indexes WIP This change integrates the new positions data with the forward and reverse indexes. The ranking code is still only partially re-written.	2024-06-10 15:09:06 +02:00
Viktor Lofgren	9f982a0c3d	(index) Integrate positions file properly	2024-06-06 16:45:42 +02:00
Viktor Lofgren	4a8afa6b9f	(index, WIP) Position data partially integrated with forward and reverse indexes. There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.	2024-06-06 12:54:52 +02:00
Viktor Lofgren	b4eac2516e	(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results	2024-06-02 16:30:34 +02:00
Viktor Lofgren	9b922af075	(converter) Amend existing modifications to use gamma coded positions lists ... instead of serialized RoaringBitmaps as was the initial take on the problem.	2024-05-30 14:20:36 +02:00
Viktor Lofgren	619392edf9	(keywords) Add position information to keywords	2024-05-28 16:54:53 +02:00
Viktor Lofgren	0894822b68	(converter) Add position information to serialized document data This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.	2024-05-28 14:18:03 +02:00
Viktor Lofgren	f83f777fff	(converter) Experimental support for searching by URL Add up to synthetic 128 keywords per document, corresponding to links to other websites.	2024-05-23 17:10:57 +02:00
Viktor Lofgren	89aae93e60	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
Viktor Lofgren	d12c77305c	(btree) Clean up code	2024-05-18 18:03:17 +02:00
Viktor Lofgren	b867eadbef	(big-string) Remove the unused bigstring library	2024-05-18 13:40:03 +02:00
Viktor Lofgren	38aedb50ac	(converter) Do not suppress exceptions in the converter	2024-04-30 18:24:35 +02:00
Viktor Lofgren	70e2e41955	(crawler) Content type prober should not swallow exceptions	2024-04-27 18:27:23 +02:00
Viktor Lofgren	4d71c776fc	(crawler) Modify crawl set growth to grow small domains faster than larger ones	2024-04-27 17:36:27 +02:00
Viktor Lofgren	7eb5e6aa66	(crawler) Abort recrawl if error count is too high	2024-04-24 21:46:40 +02:00
Viktor Lofgren	8b9629f2f6	(crawler) Remove unnecessary double-fetch of the root document	2024-04-24 14:38:59 +02:00
Viktor Lofgren	f6db16b313	(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber	2024-04-24 14:10:03 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	dcf9d9caad	(crawler) Emulate if-modified-since for domains that don't support the header This will help reduce the strain on some server software, in particular Discourse.	2024-04-22 17:26:31 +02:00
Viktor Lofgren	7a69b76001	(crawler) Remove accidental log spam	2024-04-22 15:51:37 +02:00
Viktor Lofgren	ac07ef822f	(crawler) Code quality	2024-04-22 15:37:35 +02:00
Viktor Lofgren	e7d4bcd872	(crawler) Use the probe-result to reduce the likelihood of crawling both http and https This should drastically reduce the number of fetched documents on many domains	2024-04-22 15:36:43 +02:00
Viktor Lofgren	a28c6d7cfe	(crawler) Strip W/-prefix from the etag when supplied as If-None-Match	2024-04-22 14:31:05 +02:00
Viktor Lofgren	d816f048f5	(crawler) Ensure all appropriate headers are recorded on the request	2024-04-22 14:14:24 +02:00
Viktor Lofgren	b09ddd0036	(crawler/converter) Remove legacy junk from parquet migration	2024-04-22 12:34:28 +02:00
Viktor Lofgren	214551f1df	(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.	2024-04-19 20:36:01 +02:00
Viktor Lofgren	2353c73c57	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-16 12:10:13 +02:00
Viktor Lofgren	bd0704d5a4	(*) Fix JDK22 migration issues A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	d8f4e7d72b	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-03-19 10:42:09 +01:00
Viktor Lofgren	46423612e3	(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.	2024-03-03 10:49:23 +01:00
Viktor Lofgren	29bf473d74	(encyclopedia) Add URLencoding to path element This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.	2024-03-01 17:28:09 +01:00
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	e696fd9e92	(docs) Begin un-fucking the docs after refactoring	2024-02-27 21:22:21 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	f4ff7185f0	(refac) Move process-mqapi out of api directory	2024-02-23 11:18:29 +01:00
Viktor Lofgren	f8e7f75831	Move index to top level of code	2024-02-22 18:01:35 +01:00
Viktor Lofgren	085137ca63	* Extract the index functionality	2024-02-22 17:31:25 +01:00
Viktor Lofgren	66c1281301	(zk-registry) epic jak shaving WIP Cleaning out a lot of old junk from the code, and one thing lead to another... * Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds. * The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs. * Project is migrated to GraalVM * gRPC clients are re-written with a neat fluent/functional style. e.g. ```channelPool.call(grpcStub::method) .async(executor) // <-- optional .run(argument); ``` This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. * For now the project is all in on zookeeper * Service discovery is now based on APIs and not services. Theoretically means we could ship the same code either a monolith or a service mesh. * To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP! Missing is documentation and testing, and some more breaking apart of code.	2024-02-22 14:01:23 +01:00
Viktor Lofgren	c600d7aa47	(refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator	2024-02-20 15:42:32 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor Lofgren	c73e43f5c9	(recrawl) Mitigate recrawl-before-load footgun In the scenario where an operator * Performs a new crawl from spec * Doesn't load the data into the index * Recrawls the data The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the crawl log making it impossible to load! To mitigate the impact similar problems, the change saves a backup of the old crawl log, as well as complains about this happening. More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state. This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	cee707abd8	(crawler) Implement domain shuffling in DbCrawlSpecProvider Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.	2024-02-17 17:47:38 +01:00
Viktor Lofgren	37a7296759	(sideload) Clean up the sideloading code Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicing StackexchangeSideloader's cruder approach. The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents. The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.	2024-02-17 14:32:36 +01:00
Viktor Lofgren	dcc5cfb7c0	(index-journal) Improve documentation and code quality	2024-02-15 10:51:49 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pusshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, he sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these can not be found, the tests will shortcircuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	02dd5c5853	(converter) Look at properties when deciding pool size Look at whether the property 'system.conserveProperty' is enabled when deciding he default pool size for the converter. If true, a much more conservative default is used, limiting the risk of running out of memory.	2024-02-12 16:24:19 +01:00
Viktor Lofgren	9d68062553	(converter) Make processing pool size configurable	2024-02-10 20:59:08 +01:00
Viktor Lofgren	e66d0b7431	(warc) Minor code clean-up. Remove redundant String$getBytes(). This is mainly an improvement in code consistency.	2024-02-10 18:30:33 +01:00
Viktor Lofgren	929caed0b9	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 20:07:01 +01:00
Viktor Lofgren	8340aa2b6c	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 17:29:21 +01:00
Viktor Lofgren	467ba5be20	(index-construction) Split repartition into two actions This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after... To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one. The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader. Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data. Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead. To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.	2024-02-06 17:20:07 +01:00

1 2 3 4 5 ...

458 Commits