MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	e968365858	(index) Use new DomainRankingSets to configure ranking algos in index svc	2024-01-16 12:43:32 +01:00
Viktor Lofgren	36ad4c7466	(db) Add a new configuration object 'domain ranking set' for storing ranking parameters	2024-01-16 12:34:00 +01:00
Viktor Lofgren	5a62b3058f	(query-api) Make the search set identifier a string value in the API This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.	2024-01-16 10:55:24 +01:00
Viktor Lofgren	a1df9e886a	(control) Also clean up stale 'NEW' messages	2024-01-15 16:14:02 +01:00
Viktor Lofgren	fd1eec99b5	(cleanup) Fix broken tests	2024-01-15 15:44:33 +01:00
Viktor Lofgren	e162406d40	(control) New control-side actors for cleaning up stale service heartbeats and message queue entries	2024-01-15 15:44:23 +01:00
Viktor Lofgren	c41e68aaab	(control) New export actions for RSS/Atom feeds and term frequency data This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.	2024-01-15 14:54:26 +01:00
Viktor Lofgren	4665af6c42	(control) Move export data endpoint to actions controller	2024-01-15 11:06:22 +01:00
Viktor Lofgren	c0b15427fe	(control) New crawl view should use radio buttons as multiple specs aren't supported	2024-01-15 11:03:47 +01:00
Viktor Lofgren	f29a9d972d	(control) Move 'new crawl spec' to /node/:id/actions, out of /node/:id/storage	2024-01-15 11:02:00 +01:00
Viktor Lofgren	b192373ae7	(control) Highlight unavailable items (creating, deleting) in node actions views	2024-01-15 10:47:54 +01:00
Viktor Lofgren	c042650382	(docs) Improve query service documentation	2024-01-13 21:16:45 +01:00
Viktor Lofgren	07a916a720	(search) Give the swipe hint on mobile a nicer finish	2024-01-13 18:51:54 +01:00
Viktor Lofgren	5134044530	(assistant) Make assistant client more robust to the service going down This is especially important for the non-essential functions, like website similarities...	2024-01-13 18:29:30 +01:00
Viktor Lofgren	4c62065e74	(install) Add two separate templates for the install script One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.	2024-01-13 18:27:42 +01:00
Viktor Lofgren	d28fc99119	(MainClass) ensure logging isn't loaded before service name is known This causes logs all to have names like ${sys:service-name}, instead of the service name...	2024-01-13 18:19:50 +01:00
Viktor Lofgren	c9fb45c85f	(search) Fix control.hideMarginaliaApp handling	2024-01-13 17:24:15 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentation to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	176b9c9666	(convert) Add sizeHints to legacy serializable crawl data stream This reduces the maximum memory usage when processing legacy crawl data	2024-01-13 15:50:36 +01:00
Viktor Lofgren	ecd9c35233	(control) Clean up the event log * Generate fewer uninteresting event messages. * Display fewer irrelevant fields in the overview table.	2024-01-13 13:28:02 +01:00
Viktor Lofgren	71e32c57d9	(control) Add better timestamps for the events and message queue views Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.	2024-01-13 13:04:56 +01:00
Viktor Lofgren	2fefd0e4e3	(control) Add better timestamps for the events and message queue views Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.	2024-01-13 13:03:52 +01:00
Viktor Lofgren	81eaf79a25	(control) UX polish	2024-01-13 12:31:13 +01:00
Viktor Lofgren	8dea7217a6	(control) UX fixes, node GUI doesn't break when an executor service goes offline.	2024-01-13 12:17:30 +01:00
Viktor Lofgren	c0fb9e17e8	(control) Add filter dropdown to message queue table This makes inspecting the queues for processes much easier, as otherwise these important messages are often drowned out by FSM chatter.	2024-01-12 18:46:17 +01:00
Viktor Lofgren	83776a8dce	(control) Wean the ExportDataActor off EC_DOMAIN_LINK The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead. The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file. Finally the form for triggering an export was overhauled.	2024-01-12 17:09:11 +01:00
Viktor Lofgren	98c0972619	(control) Add a summary table for Actors in the Node overview	2024-01-12 16:32:15 +01:00
Viktor Lofgren	56d832d661	(control) Adjust the margins of the headings to be consistent	2024-01-12 16:16:57 +01:00
Viktor Lofgren	de3a350afe	(control) Disable broken actions and mark the actions view as WIP	2024-01-12 16:16:39 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	264e2db539	(control) UX-improvements for control service This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better. It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.	2024-01-12 12:33:05 +01:00
Viktor Lofgren	734996002c	(*) install script for deploying Marginalia outside the codebase The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true. The commit also adds curl to the docker container, to enable docker health checks and interdependencies.	2024-01-11 12:40:03 +01:00
Viktor Lofgren	a0f28a7f9b	(*) Add a barebones configuration This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills. The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.	2024-01-10 20:23:51 +01:00
Viktor Lofgren	14b7680328	(loader) Update the size of the keyword files created by the loader Previously these ended up being about 200 Mb each, which is wastefully small. Increasing the size of these files makes the index construction faster.	2024-01-10 17:09:19 +01:00
Viktor Lofgren	f44222ce53	(control) Add a 'cancel' button to the process list This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.	2024-01-10 15:02:42 +01:00
Viktor Lofgren	f310ad8d98	(control) Actor terminations work better Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.	2024-01-10 14:18:49 +01:00
Viktor Lofgren	d56b394bcc	(control) GUI for loading external WARC files	2024-01-10 12:13:30 +01:00
Viktor Lofgren	55c9501e57	(search) Serve proper content type for static resources	2024-01-10 10:46:51 +01:00
Viktor	fad9575154	Merge pull request #69 from MarginaliaSearch/converter-optimizations Refactor the DomainProcessor to take advantage of the new crawl data format	2024-01-10 09:46:54 +01:00
Viktor Lofgren	97e11e1ac9	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e6a1e164b2	(search) Swap swipe direction for more consistent experience	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e4f8f81e89	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	176b3bb526	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-10 09:37:39 +01:00
Viktor Lofgren	b07752fa9b	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	68fd0efbde	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	c80d3eb812	(search) Remove dead code	2024-01-10 09:37:35 +01:00
Viktor Lofgren	f9320995d6	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-10 09:37:13 +01:00
Viktor Lofgren	f592c9f04d	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:26:34 +01:00
Viktor Lofgren	bd7970fb1f	(search) Swap swipe direction for more consistent experience	2024-01-09 13:38:40 +01:00
Viktor Lofgren	c47730f2cc	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-09 13:30:30 +01:00
Viktor Lofgren	41cccfd2aa	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-09 11:36:49 +01:00
Viktor Lofgren	aff690f7d6	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-09 11:28:36 +01:00
Viktor Lofgren	d4b0539d39	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-08 20:57:40 +01:00
Viktor Lofgren	cb55273769	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-08 20:02:19 +01:00
Viktor Lofgren	fbad625126	(linkdb) Add delegating implementation of DomainLinkDb This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.	2024-01-08 19:56:33 +01:00
Viktor Lofgren	e49ba887e9	(crawl data) Add compatibility layer for old crawl data format The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.	2024-01-08 19:16:49 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	ef02b712ad	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	aca217cf9a	(qs) Better metrics for QS	2024-01-05 13:22:13 +01:00
Viktor Lofgren	9e3386dbbb	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	fdec565b34	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-05 13:22:13 +01:00
Viktor Lofgren	33c2188c87	(feature) More trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	b3c8fa74cc	(feature) Add another doubleclick variant to the adtech trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	e53bb70bef	(converter) Penalize chatgpt content farm spam	2024-01-05 13:22:13 +01:00
Viktor Lofgren	109bec372c	(index) Adjust BM25 parameters	2024-01-05 13:21:52 +01:00
Viktor Lofgren	5c2561d05d	(search) Add query strategy requiring link	2024-01-05 13:21:52 +01:00
Viktor Lofgren	0e970b8037	(valuation) Tweaking penalties a bit	2024-01-05 13:21:52 +01:00
Viktor Lofgren	1694b4d6ef	(valuation) Increase the penalty for adtech a bit	2024-01-05 13:21:34 +01:00
Viktor Lofgren	396299c1db	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-05 13:21:33 +01:00
Viktor Lofgren	71d789aab0	(index) Tweak result valuation renormalization	2024-01-05 13:21:33 +01:00
Viktor Lofgren	6d2e14a656	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:17:29 +01:00
Viktor Lofgren	4078708aea	(qs) Better metrics for QS	2024-01-04 13:27:14 +01:00
Viktor Lofgren	343ea9c6d8	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-04 13:18:07 +01:00
Viktor Lofgren	60361f88ed	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-03 23:14:03 +01:00
Viktor Lofgren	f7560cb1d8	(feature) More trackers	2024-01-03 17:31:02 +01:00
Viktor Lofgren	1f66568d59	(feature) More trackers	2024-01-03 17:27:25 +01:00
Viktor Lofgren	7af07cef95	(feature) Add another doubleclick variant to the adtech trackers	2024-01-03 17:21:12 +01:00
Viktor Lofgren	41a540a629	(converter) Penalize chatgpt content farm spam	2024-01-03 17:04:38 +01:00
Viktor Lofgren	f599944942	(converter) Penalize chatgpt content farm spam	2024-01-03 16:51:26 +01:00
Viktor Lofgren	1e06aee6a2	(index) Adjust BM25 parameters	2024-01-03 16:30:46 +01:00
Viktor Lofgren	7bbaedef97	(search) Add query strategy requiring link	2024-01-03 16:23:00 +01:00
Viktor Lofgren	87048511fe	(valuation) Tweaking penalties a bit	2024-01-03 16:02:25 +01:00
Viktor Lofgren	c770f0b68b	(valuation) Tweaking penalties a bit	2024-01-03 15:59:21 +01:00
Viktor Lofgren	78c00ad512	(valuation) Tweaking penalties a bit	2024-01-03 15:52:57 +01:00
Viktor Lofgren	a19879d494	(valuation) Tweaking penalties a bit	2024-01-03 15:32:33 +01:00
Viktor Lofgren	ac1aca36b0	(valuation) Increase the penalty for adtech a bit	2024-01-03 15:20:38 +01:00
Viktor Lofgren	1f3b89cf28	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-03 15:20:18 +01:00
Viktor Lofgren	f732f6ae6f	(index) Tweak result valuation renormalization	2024-01-03 14:53:53 +01:00
Viktor Lofgren	0b9f3d1751	(*) Remove accidental commit of debug logging	2024-01-03 14:32:00 +01:00
Viktor Lofgren	0806aa6dfe	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	32436d099c	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	4ce692ccaf	(converter) Use SimpleBlockingThreadPool in ProcessingIterator	2024-01-03 14:27:47 +01:00
Viktor Lofgren	3caa4eed75	Merge branch 'master' into converter-optimizations	2024-01-02 17:13:25 +01:00
Viktor Lofgren	c70f508ae8	(prometheus) Saner histogram buckets	2024-01-02 17:13:14 +01:00
Viktor Lofgren	9e64d7aaf9	Merge branch 'master' into converter-optimizations	2024-01-02 15:46:24 +01:00
Viktor Lofgren	72b773f06d	(search) fix search metrics labeling	2024-01-02 15:46:14 +01:00
Viktor Lofgren	5f978b865b	Merge branch 'master' into converter-optimizations	2024-01-02 15:41:48 +01:00
Viktor Lofgren	57a4f92722	(api) fix missing metrics label in api service	2024-01-02 15:41:38 +01:00
Viktor Lofgren	87351e89ca	Merge branch 'master' into converter-optimizations	2024-01-02 15:17:02 +01:00
Viktor Lofgren	192e356169	(prometheus) Add instrumentation to the api service	2024-01-02 15:12:44 +01:00
Viktor Lofgren	31232e49fb	(prometheus) Add instrumentation to the search, qs and index services.	2024-01-02 15:02:29 +01:00
Viktor Lofgren	9d93a31755	Merge branch 'master' into converter-optimizations	2024-01-02 12:36:16 +01:00
Viktor Lofgren	9f7df59945	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:35:59 +01:00
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	faa50bf578	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	f0d9618dfc	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:34:58 +01:00
Viktor Lofgren	310a880fa8	(index) Further ranking adjustments	2024-01-02 12:24:52 +01:00
Viktor Lofgren	fc6e3b6da0	(index) Further ranking adjustments	2024-01-01 18:51:03 +01:00
Viktor Lofgren	50771045d0	(index) Further ranking adjustments	2024-01-01 18:43:17 +01:00
Viktor Lofgren	8f522470ed	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-01 17:16:29 +01:00
Viktor Lofgren	dc90c9ac65	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-01 16:19:38 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	7f3f3f577c	(backup) Add task heartbeats to the backup service	2024-01-01 15:20:57 +01:00
Viktor Lofgren	75d87c73d1	(crawler) Disable Java's infinite DNS caching	2023-12-31 16:59:08 +01:00
Viktor Lofgren	0fe44c9bf2	(crawler) Fix broken test A necessary step was accidentally deleted when cleaning up these tests previously.	2023-12-30 13:56:44 +01:00
Viktor Lofgren	7a1d20ed0a	(converter) Better use of ProcessingIterator Modify ProcessingIterator to be constructed via a factory, to enable re-use of its backing executor service. This reduces thread churn in the converter sideloader style processing of regular crawl data.	2023-12-30 13:53:55 +01:00
Viktor Lofgren	70c83b60a1	(converter) Clean up fullProcessing() This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.	2023-12-30 13:36:18 +01:00
Viktor Lofgren	7ba296ccdf	(converter) Route sizeHint to SideloadProcessing Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.	2023-12-30 13:05:10 +01:00
Viktor Lofgren	0b112cb4d4	(warc) Update URL encoding in WarcProtocolReconstructor The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.	2023-12-29 19:41:37 +01:00
Viktor Lofgren	68ac8d3e09	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:27 +01:00
Viktor Lofgren	f6fa8bd722	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:00 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	401568033c	Merge branch 'master' into converter-optimizations	2023-12-29 15:55:57 +01:00
Viktor Lofgren	ea73be6831	(search) Remove the ugly placeholder screenshots from the site info view.	2023-12-29 15:55:46 +01:00
Viktor Lofgren	ba8a75c84b	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 15:10:32 +01:00
Viktor Lofgren	a1f3ccdd6d	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 14:59:39 +01:00
Viktor Lofgren	647d38007f	Reduce queue polling time in ProcessingIterator Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.	2023-12-29 14:27:58 +01:00
Viktor Lofgren	e7dd28b926	(converter) Optimize sideload-loading Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.	2023-12-29 14:25:48 +01:00
Viktor Lofgren	b5fc9673d9	Merge branch 'master' into converter-optimizations	2023-12-29 14:04:43 +01:00
Viktor Lofgren	a065040323	(search) Don't inject arbitrary HTML into the site info view xD	2023-12-29 14:04:26 +01:00
Viktor Lofgren	dec3b1092d	(converter) Fix bugs in conversion This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream which may provide an indication if the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...	2023-12-29 13:58:08 +01:00
Viktor Lofgren	407915a86e	(converter) Fix NPEs in converter due to the new data format	2023-12-28 22:54:53 +01:00
Viktor Lofgren	c488599879	(converter) Fix NPE in converter	2023-12-28 19:52:26 +01:00
Viktor Lofgren	bcecc93e39	(converter) Swallow errors in parquet data stream	2023-12-28 19:45:35 +01:00
Viktor Lofgren	ff7d1a250e	Merge branch 'master' into converter-optimizations	2023-12-28 19:35:00 +01:00
Viktor Lofgren	70f338c3de	(search) Fix NPE in layout selection	2023-12-28 19:34:46 +01:00
Viktor Lofgren	c847d83011	(converter) Add size hint to converter sideload processing	2023-12-28 19:14:16 +01:00
Viktor Lofgren	5ce46a61d4	Merge branch 'master' into converter-optimizations	2023-12-28 13:26:19 +01:00
Viktor	775974d5ec	Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info Add RSS Feeds to site info (WIP)	2023-12-28 13:25:38 +01:00
Viktor Lofgren	c7af40c368	(search) Change layout balance when feeds/samples are present	2023-12-28 13:16:10 +01:00
Viktor Lofgren	00a974a721	(crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions This commit also improves the test coverage for this part of the code.	2023-12-27 20:02:17 +01:00
Viktor Lofgren	7428ba2dd7	(converter) Basic test coverage for sideloading-style processing	2023-12-27 19:29:26 +01:00
Viktor Lofgren	b37223c053	(converter) Basic test coverage for sideloading-style processing	2023-12-27 18:33:16 +01:00
Viktor Lofgren	24051fec03	(converter) WIP Run sideload-style processing for large domains The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.	2023-12-27 18:20:03 +01:00
Viktor Lofgren	f811a29f87	(crawler) Fix resource leak in crawler A 10 MB thread local buffer wasn't static. Oops.	2023-12-27 16:32:17 +01:00
Viktor Lofgren	acf7bcc7a6	(converter) Refactor the DomainProcessor for new format of crawl data With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while. The first step is to move stuff out of the domain processor into the document processor.	2023-12-27 13:57:59 +01:00
Viktor Lofgren	9707366348	(test) Fix a few slow tests that broke due to domainCount	2023-12-27 13:29:59 +01:00
Viktor Lofgren	9e5fe71f5b	(crawler) Switch hash function in crawler Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.	2023-12-27 13:29:00 +01:00


1014 Commits