MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 13:09:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	36cc62c10c	(proto) Improve handling of omitted parameters	2024-04-18 10:47:12 +02:00
Viktor Lofgren	975d92912c	(qs) Improve logging	2024-04-18 10:44:08 +02:00
Viktor Lofgren	8bbaf457de	(query) Minor code cleanup	2024-04-18 10:37:51 +02:00
Viktor Lofgren	7641a02f31	(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.	2024-04-18 10:36:15 +02:00
Viktor Lofgren	ce16239e34	(query) Modify tokenizer to match the behavior of the sentence extractor This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.	2024-04-17 17:54:32 +02:00
Viktor Lofgren	d64bd227cf	(index) Clean up jaccard index term code and down-tune the parameter's importance a bit	2024-04-17 17:40:16 +02:00
Viktor Lofgren	c5ab0a9054	(index) Add jaccard index term to boost results based on term overlap	2024-04-17 16:50:26 +02:00
Viktor Lofgren	dac948973d	(index) Remove position overlap check, coherences will do the work instead	2024-04-17 14:20:01 +02:00
Viktor Lofgren	9d008d1d6f	(index) Omit absent terms from coherence checks	2024-04-17 14:12:16 +02:00
Viktor Lofgren	f52457213e	(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus	2024-04-17 14:05:02 +02:00
Viktor Lofgren	579295a673	(search) Add implicit coherence constraints based on segmentation	2024-04-17 14:03:35 +02:00
Viktor Lofgren	af8ff8ce99	(index) Improve recall for small queries Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.	2024-04-16 22:51:03 +02:00
Viktor Lofgren	7fa3e86e64	(index) Remove dead code Since the performance fix in `3359f72239` had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.	2024-04-16 19:59:27 +02:00
Viktor Lofgren	3359f72239	(index) Experimental performance regression fix	2024-04-16 19:48:14 +02:00
Viktor Lofgren	41fa154aa6	(test) Fix broken test	2024-04-16 19:48:14 +02:00
Viktor Lofgren	deaba0152d	(index) Explicitly free LongQueryBuffers	2024-04-16 19:23:00 +02:00
Viktor Lofgren	feaef6093e	(index) Fix term coherence evaluation The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.	2024-04-16 18:07:43 +02:00
Viktor Lofgren	078fa4fdd0	(valuation) Impose stronger constraints on locality of terms Clean up logic a bit	2024-04-16 17:22:58 +02:00
Viktor Lofgren	2dc77a0638	(valuation) Impose stronger constraints on locality of terms	2024-04-16 17:15:21 +02:00
Viktor	cfd9a7187f	(query-segmentation) Merge pull request #89 from MarginaliaSearch/query-segmentation The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term. A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model. A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data. The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine". This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.	2024-04-16 15:31:05 +02:00
Viktor Lofgren	f434a8b492	(build) Upgrade jib plugin version	2024-04-16 15:25:23 +02:00
Viktor Lofgren	d2658d6f84	(sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier.	2024-04-16 13:25:15 +02:00
Viktor Lofgren	8c559c8121	(conf) Add additional logic for discovering system root	2024-04-16 12:37:18 +02:00
Viktor Lofgren	2353c73c57	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-16 12:10:13 +02:00
Viktor Lofgren	599e719ad4	(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-15 16:44:08 +02:00
Viktor Lofgren	b6d365bacd	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-15 16:04:07 +02:00
Viktor Lofgren	52f0c0d336	(ngram) Grab titles separately when extracting ngrams from wiki data	2024-04-13 19:34:16 +02:00
Viktor Lofgren	be55f3f937	(zim) Fix title extractor	2024-04-13 19:33:47 +02:00
Viktor Lofgren	fda1c05164	(ngram) Correct |s|^|s|-normalization to use length and not count	2024-04-13 18:05:30 +02:00
Viktor Lofgren	1329d4abd8	(ngram) Correct size value in ngram lexicon generation, trim the terms better	2024-04-13 17:51:02 +02:00
Viktor Lofgren	f064992137	(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.	2024-04-13 17:07:23 +02:00
Viktor Lofgren	8a81a480a1	(ngram) Only extract frequencies of title words, but use the body to increment the counters... The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.	2024-04-12 18:08:31 +02:00
Viktor Lofgren	d729c400e5	(query, minor) Remove debug statement	2024-04-12 17:52:55 +02:00
Viktor Lofgren	ad4810d991	(query, minor) Remove debug statement	2024-04-12 17:45:26 +02:00
Viktor Lofgren	6a67043537	(ngram) Clean up ngram lexicon code This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.	2024-04-12 17:45:06 +02:00
Viktor Lofgren	864d6c28e7	(segmentation) Pick best segmentation using |s|^|s|-style normalization This is better than doing all segmentations possible at the same time.	2024-04-12 17:44:14 +02:00
Viktor Lofgren	bb6b51ad91	(ngram) Fix index range in NgramLexicon to avoid an exception	2024-04-12 10:13:25 +02:00
Viktor Lofgren	65e3caf402	(index) Clean up the code	2024-04-11 18:50:21 +02:00
Viktor Lofgren	b7d9a7ae89	(ngrams) Remove the vestigial logic for capturing permutations of n-grams The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.	2024-04-11 18:12:01 +02:00
Viktor Lofgren	ed73d79ec1	(qs) Clean up parsing code using new record matching	2024-04-11 17:36:08 +02:00
Viktor Lofgren	c538c25008	(term-freq-exporter) Reduce thread count and memory usage	2024-04-10 17:11:23 +02:00
Viktor Lofgren	4b47fadbab	(term-freq-exporter) Extract ngrams in term-frequency-exporter	2024-04-10 16:58:05 +02:00
Viktor Lofgren	fcdc843c15	(search) Fix outdated assumptions about the results We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.	2024-04-07 12:09:44 +02:00
Viktor Lofgren	dbdcf459a7	(minor) Remove dead code	2024-04-06 16:27:16 +02:00
Viktor Lofgren	ef25d60666	(index) Add origin trace information for index readers This used to be supported by the system but got lost in refactoring at some point.	2024-04-06 13:28:14 +02:00
Viktor Lofgren	7f7021ce64	(sentence-extractor) Fix resource leak in sentence extractor The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation. The modified behavior checks for nullity before creating a new instance.	2024-04-05 18:52:58 +02:00
Viktor Lofgren	448a941de2	(encyclopedia) Fix memory issue in preconversion step Use SimpleBlockingThreadPool pool instead of Java's Workstealing Pool as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier if it can't hold any more tasks.	2024-04-05 16:57:53 +02:00
Viktor Lofgren	5766da69ec	(gradle) Upgrade to Gradle 8.7 This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.	2024-04-05 15:15:49 +02:00
Joshua Holland	617e633d7a	Update keywords docs use of explore to browse I can't tell when this happened, but the proper keyword now seems to be browse and not explore.	2024-04-05 15:15:49 +02:00
Viktor Lofgren	b770a1143f	(run) Fix traefik middleware configuration	2024-04-05 15:15:49 +02:00
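
The merge commit cfd9a7187f describes a compiled query syntax in which groups of alternatives expand into every concrete query. The Java sketch below illustrates that expansion for the example given in the commit message. It is not the project's parser: the class `QueryExpansionSketch` and the methods `expand` and `append` are hypothetical names, and the sketch assumes a flat sequence of parenthesized groups with no nesting.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not code from the MarginaliaSearch repository.
final class QueryExpansionSketch {

    /** Expands a flat sequence of "( a | b )" groups into every concrete query. */
    static List<String> expand(String expression) {
        List<String> results = new ArrayList<>();
        results.add("");

        int pos = 0;
        while (pos < expression.length()) {
            int open = expression.indexOf('(', pos);

            // Text outside a group is a literal shared by every expansion.
            String literal = (open < 0
                    ? expression.substring(pos)
                    : expression.substring(pos, open)).trim();
            if (!literal.isEmpty()) {
                results = append(results, List.of(literal));
            }
            if (open < 0) {
                break;
            }

            int close = expression.indexOf(')', open);
            if (close < 0) {
                throw new IllegalArgumentException("Unbalanced parenthesis in: " + expression);
            }

            // Split the group body on '|' to get its alternatives.
            List<String> alternatives = new ArrayList<>();
            for (String alt : expression.substring(open + 1, close).split("\\|")) {
                alternatives.add(alt.trim());
            }
            results = append(results, alternatives);
            pos = close + 1;
        }
        return results;
    }

    /** Cartesian product step: every prefix combined with every alternative. */
    private static List<String> append(List<String> prefixes, List<String> alternatives) {
        List<String> combined = new ArrayList<>();
        for (String prefix : prefixes) {
            for (String alt : alternatives) {
                combined.add(prefix.isEmpty() ? alt : prefix + " " + alt);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        expand("( wiby | marginalia | kagi ) ( search engine | searchengine )")
                .forEach(System.out::println);
    }
}
```

Run against the expression from the commit message, the sketch yields the six queries listed there, in the same order. According to the commit message the actual index parses the compiled expression rather than enumerating strings, so this only demonstrates what the syntax means.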

2111 Commits