MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	55c9501e57	(search) Serve proper content type for static resources	2024-01-10 10:46:51 +01:00
Viktor	fad9575154	Merge pull request #69 from MarginaliaSearch/converter-optimizations Refactor the DomainProcessor to take advantage of the new crawl data format	2024-01-10 09:46:54 +01:00
Viktor Lofgren	97e11e1ac9	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e6a1e164b2	(search) Swap swipe direction for more consistent experience	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e4f8f81e89	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	176b3bb526	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-10 09:37:39 +01:00
Viktor Lofgren	b07752fa9b	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	68fd0efbde	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	c80d3eb812	(search) Remove dead code	2024-01-10 09:37:35 +01:00
Viktor Lofgren	f9320995d6	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-10 09:37:13 +01:00
Viktor Lofgren	f592c9f04d	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:26:34 +01:00
Viktor Lofgren	bd7970fb1f	(search) Swap swipe direction for more consistent experience	2024-01-09 13:38:40 +01:00
Viktor Lofgren	c47730f2cc	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-09 13:30:30 +01:00
Viktor Lofgren	41cccfd2aa	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-09 11:36:49 +01:00
Viktor Lofgren	aff690f7d6	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-09 11:28:36 +01:00
Viktor Lofgren	d4b0539d39	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-08 20:57:40 +01:00
Viktor Lofgren	cb55273769	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-08 20:02:19 +01:00
Viktor Lofgren	fbad625126	(linkdb) Add delegating implementation of DomainLinkDb This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.	2024-01-08 19:56:33 +01:00
Viktor Lofgren	e49ba887e9	(crawl data) Add compatibility layer for old crawl data format The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.	2024-01-08 19:16:49 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	ef02b712ad	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	aca217cf9a	(qs) Better metrics for QS	2024-01-05 13:22:13 +01:00
Viktor Lofgren	9e3386dbbb	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	fdec565b34	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-05 13:22:13 +01:00
Viktor Lofgren	33c2188c87	(feature) More trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	b3c8fa74cc	(feature) Add another doubleclick variant to the adtech trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	e53bb70bef	(converter) Penalize chatgpt content farm spam	2024-01-05 13:22:13 +01:00
Viktor Lofgren	109bec372c	(index) Adjust BM25 parameters	2024-01-05 13:21:52 +01:00
Viktor Lofgren	5c2561d05d	(search) Add query strategy requiring link	2024-01-05 13:21:52 +01:00
Viktor Lofgren	0e970b8037	(valuation) Tweaking penalties a bit	2024-01-05 13:21:52 +01:00
Viktor Lofgren	1694b4d6ef	(valuation) Increase the penalty for adtech a bit	2024-01-05 13:21:34 +01:00
Viktor Lofgren	396299c1db	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-05 13:21:33 +01:00
Viktor Lofgren	71d789aab0	(index) Tweak result valuation renormalization	2024-01-05 13:21:33 +01:00
Viktor Lofgren	6d2e14a656	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:17:29 +01:00
Viktor Lofgren	4078708aea	(qs) Better metrics for QS	2024-01-04 13:27:14 +01:00
Viktor Lofgren	343ea9c6d8	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-04 13:18:07 +01:00
Viktor Lofgren	60361f88ed	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-03 23:14:03 +01:00
Viktor Lofgren	f7560cb1d8	(feature) More trackers	2024-01-03 17:31:02 +01:00
Viktor Lofgren	1f66568d59	(feature) More trackers	2024-01-03 17:27:25 +01:00
Viktor Lofgren	7af07cef95	(feature) Add another doubleclick variant to the adtech trackers	2024-01-03 17:21:12 +01:00
Viktor Lofgren	41a540a629	(converter) Penalize chatgpt content farm spam	2024-01-03 17:04:38 +01:00
Viktor Lofgren	f599944942	(converter) Penalize chatgpt content farm spam	2024-01-03 16:51:26 +01:00
Viktor Lofgren	1e06aee6a2	(index) Adjust BM25 parameters	2024-01-03 16:30:46 +01:00
Viktor Lofgren	7bbaedef97	(search) Add query strategy requiring link	2024-01-03 16:23:00 +01:00
Viktor Lofgren	87048511fe	(valuation) Tweaking penalties a bit	2024-01-03 16:02:25 +01:00
Viktor Lofgren	c770f0b68b	(valuation) Tweaking penalties a bit	2024-01-03 15:59:21 +01:00
Viktor Lofgren	78c00ad512	(valuation) Tweaking penalties a bit	2024-01-03 15:52:57 +01:00
Viktor Lofgren	a19879d494	(valuation) Tweaking penalties a bit	2024-01-03 15:32:33 +01:00
Viktor Lofgren	ac1aca36b0	(valuation) Increase the penalty for adtech a bit	2024-01-03 15:20:38 +01:00
Viktor Lofgren	1f3b89cf28	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-03 15:20:18 +01:00
Viktor Lofgren	f732f6ae6f	(index) Tweak result valuation renormalization	2024-01-03 14:53:53 +01:00
Viktor Lofgren	0b9f3d1751	(*) Remove accidental commit of debug logging	2024-01-03 14:32:00 +01:00
Viktor Lofgren	0806aa6dfe	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	32436d099c	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	4ce692ccaf	(converter) Use SimpleBlockingThreadPool in ProcessingIterator	2024-01-03 14:27:47 +01:00
Viktor Lofgren	3caa4eed75	Merge branch 'master' into converter-optimizations	2024-01-02 17:13:25 +01:00
Viktor Lofgren	c70f508ae8	(prometheus) Saner histogram buckets	2024-01-02 17:13:14 +01:00
Viktor Lofgren	9e64d7aaf9	Merge branch 'master' into converter-optimizations	2024-01-02 15:46:24 +01:00
Viktor Lofgren	72b773f06d	(search) fix search metrics labeling	2024-01-02 15:46:14 +01:00
Viktor Lofgren	5f978b865b	Merge branch 'master' into converter-optimizations	2024-01-02 15:41:48 +01:00
Viktor Lofgren	57a4f92722	(api) fix missing metrics label in api service	2024-01-02 15:41:38 +01:00
Viktor Lofgren	87351e89ca	Merge branch 'master' into converter-optimizations	2024-01-02 15:17:02 +01:00
Viktor Lofgren	192e356169	(prometheus) Add instrumentation to the api service	2024-01-02 15:12:44 +01:00
Viktor Lofgren	31232e49fb	(prometheus) Add instrumentation to the search, qs and index services.	2024-01-02 15:02:29 +01:00
Viktor Lofgren	9d93a31755	Merge branch 'master' into converter-optimizations	2024-01-02 12:36:16 +01:00
Viktor Lofgren	9f7df59945	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:35:59 +01:00
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	faa50bf578	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	f0d9618dfc	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:34:58 +01:00
Viktor Lofgren	310a880fa8	(index) Further ranking adjustments	2024-01-02 12:24:52 +01:00
Viktor Lofgren	fc6e3b6da0	(index) Further ranking adjustments	2024-01-01 18:51:03 +01:00
Viktor Lofgren	50771045d0	(index) Further ranking adjustments	2024-01-01 18:43:17 +01:00
Viktor Lofgren	8f522470ed	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-01 17:16:29 +01:00
Viktor Lofgren	dc90c9ac65	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-01 16:19:38 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	7f3f3f577c	(backup) Add task heartbeats to the backup service	2024-01-01 15:20:57 +01:00
Viktor Lofgren	75d87c73d1	(crawler) Disable Java's infinite DNS caching	2023-12-31 16:59:08 +01:00
Viktor Lofgren	0fe44c9bf2	(crawler) Fix broken test A necessary step was accidentally deleted when cleaning up these tests previously.	2023-12-30 13:56:44 +01:00
Viktor Lofgren	7a1d20ed0a	(converter) Better use of ProcessingIterator Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service. This reduces thread churn in the converter sideloader style processing of regular crawl data.	2023-12-30 13:53:55 +01:00
Viktor Lofgren	70c83b60a1	(converter) Clean up fullProcessing() This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.	2023-12-30 13:36:18 +01:00
Viktor Lofgren	7ba296ccdf	(converter) Route sizeHint to SideloadProcessing Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.	2023-12-30 13:05:10 +01:00
Viktor Lofgren	0b112cb4d4	(warc) Update URL encoding in WarcProtocolReconstructor The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.	2023-12-29 19:41:37 +01:00
Viktor Lofgren	68ac8d3e09	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:27 +01:00
Viktor Lofgren	f6fa8bd722	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:00 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	401568033c	Merge branch 'master' into converter-optimizations	2023-12-29 15:55:57 +01:00
Viktor Lofgren	ea73be6831	(search) Remove the ugly placeholder screenshots from the site info view.	2023-12-29 15:55:46 +01:00
Viktor Lofgren	ba8a75c84b	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 15:10:32 +01:00
Viktor Lofgren	a1f3ccdd6d	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 14:59:39 +01:00
Viktor Lofgren	647d38007f	Reduce queue polling time in ProcessingIterator Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.	2023-12-29 14:27:58 +01:00
Viktor Lofgren	e7dd28b926	(converter) Optimize sideload-loading Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.	2023-12-29 14:25:48 +01:00
Viktor Lofgren	b5fc9673d9	Merge branch 'master' into converter-optimizations	2023-12-29 14:04:43 +01:00
Viktor Lofgren	a065040323	(search) Don't inject arbitrary HTML into the site info view xD	2023-12-29 14:04:26 +01:00
Viktor Lofgren	dec3b1092d	(converter) Fix bugs in conversion This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream which may provide an indication if the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...	2023-12-29 13:58:08 +01:00
Viktor Lofgren	407915a86e	(converter) Fix NPEs in converter due to the new data format	2023-12-28 22:54:53 +01:00
Viktor Lofgren	c488599879	(converter) Fix NPE in converter	2023-12-28 19:52:26 +01:00
Viktor Lofgren	bcecc93e39	(converter) Swallow errors in parquet data stream	2023-12-28 19:45:35 +01:00
Viktor Lofgren	ff7d1a250e	Merge branch 'master' into converter-optimizations	2023-12-28 19:35:00 +01:00
Viktor Lofgren	70f338c3de	(search) Fix NPE in layout selection	2023-12-28 19:34:46 +01:00


926 Commits