Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no fewer than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the number of noisy tangential hits. This was not a good idea, so the change is reverted.
This functionality fell into disrepair some time ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
The change set cleans up the data model for the term-level data, which used to contain a bunch of fields with document-level metadata. This data duplication meant a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
The sign of the counter is used to indicate whether a term has appeared in the title. Until it's seen in the title, it's provisionally saved as a negative count.
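A minimal sketch of the trick, assuming a per-term counter along these lines (the class and method names are illustrative, not the actual implementation):
```
class TermCount {
    private int count = 0;

    void observe(boolean inTitle) {
        if (count > 0) {
            count++;              // already confirmed as a title term
        } else if (inTitle) {
            count = -count + 1;   // first title sighting: flip the provisional count to positive
        } else {
            count--;              // still provisional: count negatively
        }
    }

    boolean seenInTitle() { return count > 0; }
    int occurrences()     { return Math.abs(count); }
}
```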
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
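In essence (the surrounding class and the model constructors here are stand-ins, not the real ones):
```
class SentenceExtractor {
    private static NgramLexicon ngramLexicon;
    private static RDRPOSTagger rdrposTagger;

    static synchronized void initModels() {
        // Only allocate the expensive models once
        if (ngramLexicon == null) {
            ngramLexicon = new NgramLexicon();
        }
        if (rdrposTagger == null) {
            rdrposTagger = new RDRPOSTagger();
        }
    }
}
```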
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
The previous behavior listened to too many changes and, based on zookeeper rather than curator assumptions about behavior, added an additional monitor on each invocation of each monitor (which always triggers on service state changes). Each monitor would thus re-register itself, effectively doubling the number of monitors whenever a service stopped or started, which in turn caused a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other.
This re-registering behavior is no longer done.
Netty and gRPC by default spawn an incredible number of threads on high-core CPUs, which amounts to a fair bit of RAM usage.
Add custom executors that throttle this behavior.
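Directionally, the throttling looks something like this (the pool sizes and class shape are illustrative assumptions, not the project's actual values):
```
import io.grpc.ManagedChannel;
import io.grpc.netty.NettyChannelBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioSocketChannel;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ChannelFactory {
    // Cap both the gRPC application executor and Netty's event loops,
    // instead of letting them default to roughly one thread per core
    private static final ExecutorService grpcExecutor = Executors.newFixedThreadPool(4);
    private static final NioEventLoopGroup eventLoopGroup = new NioEventLoopGroup(2);

    static ManagedChannel create(String host, int port) {
        return NettyChannelBuilder.forAddress(host, port)
                .executor(grpcExecutor)
                .eventLoopGroup(eventLoopGroup)
                .channelType(NioSocketChannel.class)
                .usePlaintext()
                .build();
    }
}
```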
Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reordered old results instead.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.
Cleaning out a lot of old junk from the code, and one thing led to another...
* Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds.
* The ProcessService's spawning is smarter. It will now just spawn a Java process instead of relying on the application plugin's generated outputs.
* Project is migrated to GraalVM
* gRPC clients are re-written with a neat fluent/functional style. e.g.
```
channelPool.call(grpcStub::method)
        .async(executor) // <-- optional
        .run(argument);
```
This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall.
* For now the project is all-in on Zookeeper
* Service discovery is now based on APIs and not services. Theoretically this means we could ship the same code as either a monolith or a service mesh.
* To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP!
Still missing are documentation, testing, and some more breaking apart of the code.
The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve. This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.
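The validation itself can be as simple as a resolution check along these lines (the registry integration is elided):
```
import java.net.InetAddress;
import java.net.UnknownHostException;

class EndpointValidator {
    /** Returns false for endpoints whose hostnames no longer resolve. */
    static boolean resolves(String hostname) {
        try {
            InetAddress.getByName(hostname);
            return true;
        } catch (UnknownHostException e) {
            // Stale endpoint, e.g. a docker container that has gone away
            return false;
        }
    }
}
```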
The warmup would sometimes crash during a cold start-up, because it could not get an API. Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.
The query service delegates and aggregates IndexDomainLinksApiGrpc
messages to the index services. The query client was accidentally
also doing this, instead of talking to the query service.
Fixed so it correctly talks to the query service and nothing else.
Adds new ways to configure the bind and external IP addresses for a service. Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry.
The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.
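A sketch of the selection logic (the class is a stand-in, and the non-docker external address is an assumption; the actual configuration code is more involved):
```
class ServiceAddressConfig {
    static boolean inDocker() {
        return System.getenv("WMSA_IN_DOCKER") != null;
    }

    /** Address announced in the service registry. */
    static String externalAddress() {
        return inDocker() ? System.getenv("HOSTNAME") : "127.0.0.1";
    }

    /** Bind to all interfaces only inside docker; loopback is the safer default. */
    static String bindAddress() {
        return inDocker() ? "0.0.0.0" : "127.0.0.1";
    }
}
```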
The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update. This led to storms of closing and opening channels whenever an update was received.
The new code is correctly aware that we may talk to multiple nodes.
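Conceptually, the bookkeeping goes from a single list to something like this (names are assumptions):
```
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class RouteTable {
    private final Map<Integer, Set<String>> routesByNode = new ConcurrentHashMap<>();

    void update(int node, Set<String> routes) {
        // Replace only this node's routes; other nodes' channels stay open
        routesByNode.put(node, Set.copyOf(routes));
    }

    Set<String> routesFor(int node) {
        return routesByNode.getOrDefault(node, Set.of());
    }
}
```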
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.
A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.
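The interface boils down to something like the following (method names are assumptions, not the actual API), with a ZkServiceRegistry implementation on top of Zookeeper and a fixed implementation that hard-codes docker hostnames:
```
import java.net.InetSocketAddress;
import java.util.List;

interface ServiceRegistry {
    /** Announce an endpoint for the given service. */
    void registerService(String serviceId, String host, int port) throws Exception;

    /** Look up the currently known endpoints for a service. */
    List<InetSocketAddress> getEndpoints(String serviceId);
}
```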
The last remaining REST service, the assistant-service, has been migrated to gRPC.
This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels.
Since it's no longer used by anything, RxJava has been removed as a dependency from the project.
Although the current state seems reasonably stable, this is a work-in-progress commit.
In the scenario where an operator
* Performs a new crawl from spec
* Doesn't load the data into the index
* Recrawls the data
The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file,
irrecoverably losing the original crawl log and making the data impossible to load!
To mitigate the impact of similar problems, the change saves a backup of the old crawl log, and complains loudly when this happens.
More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state.
This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so the precaution is merited.
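The backup step is roughly this (path handling and the .bak naming are illustrative):
```
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

class CrawlLogBackup {
    /** Copy the existing crawl log aside before it can be overwritten. */
    static void backup(Path crawlLog) throws IOException {
        if (Files.exists(crawlLog)) {
            Path backup = crawlLog.resolveSibling(crawlLog.getFileName() + ".bak");
            Files.copy(crawlLog, backup, StandardCopyOption.REPLACE_EXISTING);
        }
    }
}
```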
To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.
The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod.
This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.
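The pattern, in miniature (the loading logic itself is elided and the class shape is an assumption):
```
import java.util.Set;
import java.util.concurrent.CountDownLatch;

class DomainBlacklist {
    private final CountDownLatch loadedSignal = new CountDownLatch(1);
    private volatile Set<Integer> blacklist = Set.of();

    DomainBlacklist() {
        // Load on a separate thread so injection doesn't block start-up
        Thread loader = new Thread(() -> {
            blacklist = loadFromDatabase();
            loadedSignal.countDown();
        }, "blacklist-loader");
        loader.setDaemon(true);
        loader.start();
    }

    /** Blocks until the blacklist is definitely loaded. */
    void waitUntilLoaded() throws InterruptedException {
        loadedSignal.await();
    }

    boolean isBlacklisted(int domainId) {
        return blacklist.contains(domainId); // may be empty before loading finishes
    }

    private Set<Integer> loadFromDatabase() {
        return Set.of(); // placeholder for the actual database query
    }
}
```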
This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything".
It may come back in some shape or form in the future, with some additional tweaking of the rankings...
Modified the DbCrawlSpecProvider to shuffle domains after loading, to ensure a good mix for each crawl. This prevents overloading the same server by crawling several of its subdomains in parallel, and avoids hitting all the big domains at once.
Refactored the gRPC stub pool for better handling of the channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub.
The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.
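The re-creation logic is along these lines (the pool's actual shape in the codebase will differ):
```
import io.grpc.ConnectivityState;
import io.grpc.ManagedChannel;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class GrpcChannelPool {
    private final Map<String, ManagedChannel> channels = new ConcurrentHashMap<>();
    private final Function<String, ManagedChannel> channelFactory;

    GrpcChannelPool(Function<String, ManagedChannel> channelFactory) {
        this.channelFactory = channelFactory;
    }

    ManagedChannel getChannel(String target) {
        return channels.compute(target, (key, existing) -> {
            // Re-create any channel that has reached SHUTDOWN before handing it out
            if (existing == null || existing.getState(false) == ConnectivityState.SHUTDOWN) {
                return channelFactory.apply(key);
            }
            return existing;
        });
    }
}
```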
Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach.
The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents.
The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.
Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.
The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.
The domain ranking code was admittedly a bit of a clown fiesta; simultaneously buggy, fragile, and inscrutable.
Migrating over to use JGraphT to store the link graph
when doing rankings, and using their PageRank implementation. Also added a modified version that does PersonalizedPageRank.
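For reference, the JGraphT flavor of this looks roughly like so (a toy graph; the real one is built from the link database):
```
import org.jgrapht.Graph;
import org.jgrapht.alg.scoring.PageRank;
import org.jgrapht.graph.DefaultDirectedGraph;
import org.jgrapht.graph.DefaultEdge;

class PageRankDemo {
    public static void main(String[] args) {
        Graph<Integer, DefaultEdge> linkGraph = new DefaultDirectedGraph<>(DefaultEdge.class);
        linkGraph.addVertex(1);
        linkGraph.addVertex(2);
        linkGraph.addEdge(1, 2); // domain 1 links to domain 2

        PageRank<Integer, DefaultEdge> pageRank = new PageRank<>(linkGraph);
        System.out.println(pageRank.getVertexScore(2));
    }
}
```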
This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias.
The QueryStrategy makes it possible to e.g. require that a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period.
These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well.
The vintage filter is modified to add a temporal bias for the past.
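Something like the following captures how the criteria travel with a query (the enum values and record shape are assumptions based on the description, not the actual API):
```
enum QueryStrategy { AUTO, REQUIRE_FIELD_TITLE }
enum TemporalBias { NONE, RECENT, OLD }

record QueryOptions(QueryStrategy strategy, TemporalBias temporalBias) {
    // 'Search In Title' filter
    static final QueryOptions SEARCH_IN_TITLE =
            new QueryOptions(QueryStrategy.REQUIRE_FIELD_TITLE, TemporalBias.NONE);

    // 'Recent Results' filter
    static final QueryOptions RECENT_RESULTS =
            new QueryOptions(QueryStrategy.AUTO, TemporalBias.RECENT);

    // The vintage filter biases toward the past
    static final QueryOptions VINTAGE =
            new QueryOptions(QueryStrategy.AUTO, TemporalBias.OLD);
}
```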
(converter) Loader for reddit data
Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult.
Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more.
Tests were written for this, but they all require local reddit data, which can't be distributed with the source code. If the data can not be found, the tests short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run.
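The short-circuit uses the standard JUnit assumption mechanism, roughly like so (the data path is a placeholder):
```
import org.junit.jupiter.api.Assumptions;
import org.junit.jupiter.api.Test;

import java.nio.file.Files;
import java.nio.file.Path;

class RedditSideloaderTest {
    private static final Path TEST_DATA = Path.of("/data/reddit"); // placeholder

    @Test
    void loadsRedditData() {
        // Reported as skipped (and the suite stays green) when the data is absent
        Assumptions.assumeTrue(Files.exists(TEST_DATA));
        // ... assertions against the sideloaded documents go here
    }
}
```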
The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.
Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename.
The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.
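The ordering amounts to a two-key comparator, e.g. (assuming java.io.File items; the GUI's actual item type differs):
```
import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

class UploadDirListing {
    static File[] sorted(File uploadDir) {
        Comparator<File> order =
                Comparator.comparing((File f) -> !f.isDirectory())    // directories first
                          .thenComparing(File::getName, String.CASE_INSENSITIVE_ORDER);

        File[] items = uploadDir.listFiles();
        if (items == null) return new File[0];
        Arrays.sort(items, order);
        return items;
    }
}
```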
The sideload forms didn't properly set the labels' 'for' attribute, meaning that while label tags existed, they weren't appropriately clickable.
Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.
Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter.
If true, a much more conservative default is used, limiting the risk of running out of memory.
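In spirit (the actual pool-size numbers are illustrative):
```
class ConverterPoolSize {
    static int defaultPoolSize() {
        // Boolean.getBoolean reads the system property, defaulting to false
        if (Boolean.getBoolean("system.conserveProperty")) {
            return 2; // conservative: limits peak memory use
        }
        return Runtime.getRuntime().availableProcessors();
    }
}
```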
Adds experimental support for clustering search results by e.g. domain. At a first stage, this is only enabled for the wiki and forum filters.
The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.
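The clustering step is essentially a group-by, e.g. (UrlDetails here is a stand-in record, not the project's actual class):
```
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class ResultClustering {
    record UrlDetails(String domain, String url, String title) {}

    static Map<String, List<UrlDetails>> clusterByDomain(List<UrlDetails> results) {
        // LinkedHashMap keeps clusters in the order their domains first appear
        return results.stream().collect(Collectors.groupingBy(
                UrlDetails::domain,
                LinkedHashMap::new,
                Collectors.toList()));
    }
}
```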
The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.
Recent changes to the result ranking mean the no-filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
* (executor-api) Make executor API talk gRPC
The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use gRPC instead. gRPC is a bit of a pain with how verbose it is, but that is probably the lesser evil. This is a fairly straightforward change, but it's also large, so a solid round of testing is needed...
The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.
ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().
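To illustrate why the rename matters (this enum is illustrative, not the actual ServiceId):
```
enum ServiceId {
    SEARCH_SERVICE("search-service");

    public final String serviceName; // was 'name' before the rename

    ServiceId(String serviceName) {
        this.serviceName = serviceName;
    }

    // ServiceId.SEARCH_SERVICE.name()      -> "SEARCH_SERVICE" (built-in Enum.name())
    // ServiceId.SEARCH_SERVICE.serviceName -> "search-service" (the intended wire name)
}
```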
The boilerplate needed for gRPC was also extracted into a common Gradle file for inclusion into the appropriate build.gradle files.
!bang query handling seems to have fallen victim to an overzealous refactoring effort, and was broken.
It's now repaired, and a test is in place to ensure we know if it breaks again.
The readme for the array library was extremely out of date. Updating it with accurate information about how the library works, and a demo that should compile.
Also added a system property for disabling the use of sun.misc.Unsafe.
Continues 467ba5be20 by breaking out a constant with the name of the primary ranking set. Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after...
To avoid issues in handling the shotgun blast of MqNotifications, the Service class was switched over to use a synchronous message queue instead of an asynchronous one.
The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.
Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.
Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.
To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
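A minimal sketch of the persistence (the on-disk format and the class's real layout are assumptions):
```
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;

class DomainRankings {
    private final HashMap<Integer, Integer> rankByDomainId = new HashMap<>();

    void save(Path file) throws IOException {
        try (var out = new DataOutputStream(Files.newOutputStream(file))) {
            out.writeInt(rankByDomainId.size());
            for (var entry : rankByDomainId.entrySet()) {
                out.writeInt(entry.getKey());   // domain id
                out.writeInt(entry.getValue()); // rank
            }
        }
    }

    static DomainRankings load(Path file) throws IOException {
        var ret = new DomainRankings();
        try (var in = new DataInputStream(Files.newInputStream(file))) {
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                ret.rankByDomainId.put(in.readInt(), in.readInt());
            }
        }
        return ret;
    }
}
```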