MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	17dc00d05f	(control) Partial implementation of inspection utility for crawl data Uses duckdb and range queries to read the parquet files directly from the index partitions. UX is a bit rough but is in working order.	2024-05-20 18:02:46 +02:00
Viktor Lofgren	4fcd4a8197	(index) Refactor to reduce the level of indirection	2024-05-19 12:40:33 +02:00
Viktor Lofgren	daf2a8df54	(btree) Roll back optimization of queryDataWithIndex It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect. The reason retain() is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.	2024-05-19 11:29:28 +02:00
Viktor Lofgren	88997a1c4f	(btree) Clean up code	2024-05-18 18:38:46 +02:00
Viktor Lofgren	d12c77305c	(btree) Clean up code	2024-05-18 18:03:17 +02:00
Viktor Lofgren	ab4e2b222e	(array) Fix broken benchmarks	2024-05-18 13:41:24 +02:00
Viktor Lofgren	b867eadbef	(big-string) Remove the unused bigstring library	2024-05-18 13:40:03 +02:00
Viktor Lofgren	19163fa883	(array) Clean up the Array library IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it) Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs. Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.	2024-05-18 13:23:06 +02:00
Viktor Lofgren	650f3843bb	(array) Clean up search function jungle Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values. Replaced binary search function with a branchless version that is much faster. Cleaned up benchmark code.	2024-05-17 14:31:02 +02:00
Viktor Lofgren	9e766bc056	(array) Clean up search function jungle Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values. Replaced binary search function with a branchless version that is much faster. Cleaned up benchmark code.	2024-05-17 14:30:06 +02:00
Viktor Lofgren	48aff52e00	(array) Increase LongArray on-heap alignment to 16 bytes This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.	2024-05-16 19:12:36 +02:00
Viktor Lofgren	9d7616317e	(array) Clean up native code a bit	2024-05-16 14:47:10 +02:00
Viktor Lofgren	d227a09fb1	(search) Extend paperdoll service mock with site info data and screenshots It's a bit of a hack job but will do, random exploration is available but only through a "browse:random"-style query	2024-05-15 12:40:55 +02:00
Viktor Lofgren	f48cf77c4d	(array, experimental) Add benchmark results for quicksort	2024-05-14 18:15:30 +02:00
Viktor Lofgren	3549be216f	(array, experimental) Documentation for native algos	2024-05-14 17:43:05 +02:00
Viktor Lofgren	c3e3a3dbc5	(search) Fix problem list in clustered search results	2024-05-14 13:05:52 +02:00
Viktor Lofgren	55a7c1db00	(array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java	2024-05-14 12:54:14 +02:00
Viktor Lofgren	c837321df1	(search) Provide a notification when no search results are found.	2024-05-06 20:11:39 +02:00
Viktor Lofgren	af7f6b89ec	(search) Delete vestigial stylesheet from the old design.	2024-05-06 19:52:29 +02:00
Viktor Lofgren	29a4d3df23	(search) Improve search-service paperdoll by mocking suggestions and news	2024-05-06 19:52:13 +02:00
Viktor Lofgren	7d1cafc070	(control) Add skip link for navigation in control GUI	2024-05-04 12:36:44 +02:00
Viktor Lofgren	5951c67a8b	(search) Center the search results page	2024-05-04 12:23:21 +02:00
Viktor Lofgren	c454007730	(search) Increase contrast for some UI elements	2024-05-04 12:02:52 +02:00
Viktor Lofgren	4e49cca43d	(search) Clean up SCSS code a bit	2024-05-04 11:58:54 +02:00
Viktor Lofgren	49a8c06095	(search) Improve contrast for text on random button	2024-05-04 11:51:19 +02:00
Viktor Lofgren	d01d9fa670	(search) Add screenreader-specific notification remark about when search results start.	2024-05-04 11:41:06 +02:00
Viktor Lofgren	a53a32f006	(search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation	2024-05-04 11:41:05 +02:00
Viktor Lofgren	3548d54cf6	(search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens.	2024-05-04 11:41:04 +02:00
Viktor Lofgren	01f242ac7e	(search) Add stylesheet class for screenreader-only items	2024-05-04 11:41:03 +02:00
Viktor Lofgren	2840d9d403	(search) Add screenreader-only positions count text to search results	2024-05-04 11:41:03 +02:00
Viktor Lofgren	9fecfc5025	(search) Add autocomplete attribute to search-form	2024-05-04 11:41:02 +02:00
Viktor Lofgren	1b901e01f2	(search) Add bypass link that skips navigation	2024-05-04 11:41:01 +02:00
Viktor Lofgren	974aa35558	(search) Add proper alt-text to random exploration mode	2024-05-04 11:41:00 +02:00
Viktor Lofgren	4021a0ae98	(search) Add en-US language tags to all templates	2024-05-04 11:40:59 +02:00
Viktor Lofgren	b7a95be731	(search) Create a small mocking framework for running the search service in isolation.	2024-05-04 11:40:59 +02:00
Viktor Lofgren	616649f040	(logs) Fix logdir location	2024-05-04 11:40:59 +02:00
Viktor Lofgren	6087f9635c	(qs) Move index.html out of public directory It was put there to simulate the /public interface paradigm that is now deprecated.	2024-05-01 12:56:12 +02:00
Viktor Lofgren	2ad0bfda1e	(*) Fix boot orchestration for the services This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated. A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor. The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.	2024-05-01 12:39:48 +02:00
Viktor Lofgren	08f8b6e022	(system) Log loaded properties to the console	2024-04-30 18:29:11 +02:00
Viktor Lofgren	800ed6b1e9	(zk) Terminate immediately if zookeeper isn't found This makes debugging easier	2024-04-30 18:28:49 +02:00
Viktor Lofgren	908535a3a0	(single-service) Ensure single-service spawner can specify the node	2024-04-30 18:27:46 +02:00
Viktor Lofgren	7fe2ab6f39	(file-storage) Ensure file storage root location can be overridden when running outside of docker	2024-04-30 18:26:15 +02:00
Viktor Lofgren	c9ee0c909e	(download-sample) Set +x permissions on directories created during this job	2024-04-30 18:25:07 +02:00
Viktor Lofgren	38aedb50ac	(converter) Do not suppress exceptions in the converter	2024-04-30 18:24:35 +02:00
Viktor Lofgren	4772e0b59d	(service) Deprecate /public prefix on HTTP Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations. Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths. The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.	2024-04-30 14:46:18 +02:00
Viktor Lofgren	70e2e41955	(crawler) Content type prober should not swallow exceptions	2024-04-27 18:27:23 +02:00
Viktor Lofgren	4d71c776fc	(crawler) Modify crawl set growth to grow small domains faster than larger ones	2024-04-27 17:36:27 +02:00
Viktor	2d49071e96	Merge branch 'master' into run-outside-docker	2024-04-25 18:53:26 +02:00
Viktor Lofgren	89889ecbbd	(single-service) Skip starting Prometheus if it's not explicitly enabled	2024-04-25 17:54:07 +02:00
Viktor Lofgren	c8ee354d0b	(log) Make log dir configurable via environment variable	2024-04-25 15:09:18 +02:00
Viktor Lofgren	4e5f069809	(build) Migrate ssr to the new root setting schema of java lang version	2024-04-25 15:08:56 +02:00
Viktor Lofgren	6690e9bde8	(service) Ensure the service discovery starts early This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.	2024-04-25 15:08:33 +02:00
Viktor Lofgren	e4b34b6ee6	(index) Correctly detect the presence of an all-virtual path through the query	2024-04-25 14:01:46 +02:00
Viktor Lofgren	3952ef6ca5	(service) Let singleservice configure ports and bind addresses	2024-04-25 13:49:57 +02:00
Viktor Lofgren	7eb5e6aa66	(crawler) Abort recrawl if error count is too high	2024-04-24 21:46:40 +02:00
Viktor Lofgren	282022d64e	(crawler) Remove unnecessary double-fetch of the root document	2024-04-24 14:44:39 +02:00
Viktor Lofgren	91a98a8807	(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber	2024-04-24 14:44:39 +02:00
Viktor Lofgren	32fe864a33	(build) Java 22 and its consequences have been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e1c9313396	(crawler) Emulate if-modified-since for domains that don't support the header This will help reduce the strain on some server software, in particular Discourse.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f430a084e8	(crawler) Remove accidental log spam	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a86b596897	(crawler) Code quality	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6dd87b0378	(crawler) Use the probe-result to reduce the likelihood of crawling both http and https This should drastically reduce the number of fetched documents on many domains	2024-04-24 14:44:39 +02:00
Viktor Lofgren	c9f029c214	(crawler) Strip W/-prefix from the etag when supplied as If-None-Match	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6b88db10ad	(crawler) Ensure all appropriate headers are recorded on the request	2024-04-24 14:44:39 +02:00
Viktor Lofgren	8a891c2159	(crawler/converter) Remove legacy junk from parquet migration	2024-04-24 14:44:39 +02:00
Viktor Lofgren	ad2ac8eee3	(query) Mark flaky test, correct assert on test	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f46733a47a	(ranking) TermCoherenceFactory should be run for size=2 queries	2024-04-24 14:44:39 +02:00
Viktor Lofgren	934167323d	(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	64baa41e64	(query) Always generate an ngram alternative, suppresses generation of multiple identical query branches	2024-04-24 14:44:39 +02:00
Viktor Lofgren	5165cf6d15	(ranking) Set regularMask correctly	2024-04-24 14:44:39 +02:00
Viktor Lofgren	4489b21528	(ranking) Cleanup	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f623b37577	(ranking) Suppress NaN:s in ranking output	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f4a2fea451	(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a748fc5448	(index, bugfix) Pass url quality to query service	2024-04-24 14:44:39 +02:00
Viktor Lofgren	0dcca0cb83	(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp	2024-04-24 14:44:39 +02:00
Viktor Lofgren	b80a83339b	(qs) Additional info in query debug UI	2024-04-24 14:44:39 +02:00
Viktor Lofgren	eb74d08f2a	(qs) Additional info in query debug UI	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e79ab0c70e	(qs) Basic query debug feature	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e419e26f3a	(proto) Improve handling of omitted parameters	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6102fd99bf	(qs) Improve logging	2024-04-24 14:44:39 +02:00
Viktor Lofgren	def36719d3	(query) Minor code cleanup	2024-04-24 14:44:39 +02:00
Viktor Lofgren	462aa9af26	(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a09c84e1b8	(query) Modify tokenizer to match the behavior of the sentence extractor This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	44b33798f3	(index) Clean up jaccard index term code and down-tune the parameter's importance a bit	2024-04-24 14:44:39 +02:00
Viktor Lofgren	2f0b648fad	(index) Add jaccard index term to boost results based on term overlap	2024-04-24 14:44:39 +02:00
Viktor Lofgren	de0e56f027	(index) Remove position overlap check, coherences will do the work instead	2024-04-24 14:44:39 +02:00
Viktor Lofgren	973ced7b13	(index) Omit absent terms from coherence checks	2024-04-24 14:44:39 +02:00
Viktor Lofgren	cb4b824a85	(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus	2024-04-24 14:44:39 +02:00
Viktor Lofgren	c583a538b1	(search) Add implicit coherence constraints based on segmentation	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e0224085b4	(index) Improve recall for small queries Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	44c1e1d6d9	(index) Remove dead code Since the performance fix in `3359f72239` had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	c620e9c026	(index) Experimental performance regression fix	2024-04-24 14:44:39 +02:00
Viktor Lofgren	1bb88968c5	(test) Fix broken test	2024-04-24 14:44:39 +02:00
Viktor Lofgren	df75e8f4aa	(index) Explicitly free LongQueryBuffers	2024-04-24 14:44:39 +02:00
Viktor Lofgren	adf846bfd2	(index) Fix term coherence evaluation The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	1748fcc5ac	(valuation) Impose stronger constraints on locality of terms Clean up logic a bit	2024-04-24 14:44:39 +02:00
Viktor Lofgren	08416393e0	(valuation) Impose stronger constraints on locality of terms	2024-04-24 14:44:39 +02:00
Viktor Lofgren	fce26015c9	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	155be1078d	(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6efc0f21fe	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-24 14:44:39 +02:00

1382 Commits