MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 13:19:02 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	adf846bfd2	(index) Fix term coherence evaluation. The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	1748fcc5ac	(valuation) Impose stronger constraints on locality of terms. Clean up logic a bit.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	08416393e0	(valuation) Impose stronger constraints on locality of terms	2024-04-24 14:44:39 +02:00
Viktor Lofgren	fce26015c9	(encyclopedia) Index the full articles. Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	155be1078d	(index) Fix priority search terms. This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6efc0f21fe	(index) Clean up data model. The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f3255e080d	(ngram) Grab titles separately when extracting ngrams from wiki data	2024-04-24 14:44:39 +02:00
Viktor Lofgren	0da03d4cfc	(zim) Fix title extractor	2024-04-24 14:44:39 +02:00
Viktor Lofgren	5f6a3ef9d0	(ngram) Correct |s|^|s|-normalization to use length and not count	2024-04-24 14:44:39 +02:00
Viktor Lofgren	afc4fed591	(ngram) Correct size value in ngram lexicon generation, trim the terms better	2024-04-24 14:44:39 +02:00
Viktor Lofgren	cb505f98ef	(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a0b3634cb6	(ngram) Only extract frequencies of title words, but use the body to increment the counters... The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e23359bae9	(query, minor) Remove debug statement	2024-04-24 14:44:39 +02:00
Viktor Lofgren	5531ed632a	(query, minor) Remove debug statement	2024-04-24 14:44:39 +02:00
Viktor Lofgren	150ee21f3c	(ngram) Clean up ngram lexicon code. This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	c96da0ce1e	(segmentation) Pick best segmentation using |s|^|s|-style normalization. This is better than doing all segmentations possible at the same time.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	a0d9e66ff7	(ngram) Fix index range in NgramLexicon to avoid an exception	2024-04-24 14:44:38 +02:00
Viktor Lofgren	55f627ed4c	(index) Clean up the code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	7dd8c78c6b	(ngrams) Remove the vestigial logic for capturing permutations of n-grams. The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	8bf7d090fd	(qs) Clean up parsing code using new record matching	2024-04-24 14:44:38 +02:00
Viktor Lofgren	6bfe04b609	(term-freq-exporter) Reduce thread count and memory usage	2024-04-24 14:44:38 +02:00
Viktor Lofgren	491d6bec46	(term-freq-exporter) Extract ngrams in term-frequency-exporter	2024-04-24 14:44:38 +02:00
Viktor Lofgren	4fb86ac692	(search) Fix outdated assumptions about the results. We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	6cba6aef3b	(minor) Remove dead code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	7e216db463	(index) Add origin trace information for index readers. This used to be supported by the system but got lost in refactoring at some point.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	adc90c8f1e	(sentence-extractor) Fix resource leak in sentence extractor. The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation. The modified behavior checks for nullity before creating a new instance.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	e3316a3672	(index) Clean up new index query code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	a3a6d6292b	(qs, index) New query model integrated with index service. Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	8cb9455c32	(qs, WIP) Fix edge cases in query compilation. This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	dc65b2ee01	(qs, WIP) Clean up dead code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	98a1adbf81	(qs, WIP) Tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	0bd1e15cce	(qs, WIP) Tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	eda926767e	(qs, WIP) Tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	cd1a18c045	(qs, WIP) Break up code and tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	6f567fbea8	(qs, WIP) Fix output determinism, fix tests	2024-04-24 14:44:38 +02:00
Viktor Lofgren	0ebadd03a5	(WIP) Query rendering finally beginning to look like it works	2024-04-24 14:44:38 +02:00
Viktor Lofgren	2253b556b2	WIP	2024-04-24 14:44:17 +02:00
Viktor Lofgren	6a7a7009c7	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-04-24 14:44:17 +02:00
Viktor Lofgren	3c75057dcd	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-04-24 14:44:17 +02:00
Viktor Lofgren	212d101727	(control) GUI for exporting segmentation data from a wikipedia zim	2024-04-24 14:44:17 +02:00
Viktor Lofgren	760b80659d	(WIP) Partial integration of new query expansion code into the query-service	2024-04-24 14:44:17 +02:00
Viktor Lofgren	04879c005d	(WIP) Improve data extraction from wikipedia data	2024-04-24 14:44:17 +02:00
Viktor Lofgren	cb82927756	(WIP) Implement first take of new query segmentation algorithm	2024-04-24 14:44:17 +02:00
Viktor Lofgren	f434a8b492	(build) Upgrade jib plugin version	2024-04-16 15:25:23 +02:00
Viktor Lofgren	d2658d6f84	(sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier.	2024-04-16 13:25:15 +02:00
Viktor Lofgren	8c559c8121	(conf) Add additional logic for discovering system root	2024-04-16 12:37:18 +02:00
Viktor Lofgren	448a941de2	(encyclopedia) Fix memory issue in preconversion step. Use SimpleBlockingThreadPool instead of Java's work-stealing pool, as the latter causes runaway memory consumption in some circumstances, while SimpleBlockingThreadPool uses a bounded queue and always pushes back against the supplier if it can't hold any more tasks.	2024-04-05 16:57:53 +02:00
Viktor Lofgren	e1151ecf2a	(gradle) Upgrade to Gradle 8.7. This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.	2024-04-05 15:12:38 +02:00
Viktor	3890c413a3	Merge pull request #88 from jmholla/patch-1. Update keywords docs use of explore to browse	2024-04-01 09:14:02 +02:00
Joshua Holland	8e02f567d7	Update keywords docs use of explore to browse. I can't tell when this happened, but the proper keyword now seems to be browse and not explore.	2024-04-01 00:04:12 -05:00
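
Commit a0b3634cb6 describes counting n-gram frequencies from article bodies while using the sign of the counter to record whether the term has ever appeared as a title. The following is a minimal Python sketch of that idea, not the project's code (the project itself is Java). The function name `count_title_terms`, the input shape, and the choice to let a title occurrence add one to the count are all assumptions made for illustration.

```python
from collections import defaultdict


def count_title_terms(articles):
    """Count terms, keeping only those seen in at least one title.

    `articles` is an iterable of (title_terms, body_terms) pairs.
    A negative counter means "seen in bodies only, so far";
    a positive counter means "confirmed by a title".
    """
    counts = defaultdict(int)

    for title_terms, body_terms in articles:
        for term in body_terms:
            # Body occurrences grow the magnitude but preserve the sign.
            if counts[term] > 0:
                counts[term] += 1
            else:
                counts[term] -= 1

        for term in title_terms:
            # A title occurrence confirms the term: make the counter positive.
            # Adding one for the title itself is an assumption of this sketch.
            counts[term] = abs(counts[term]) + 1

    # Terms that never appeared as a title still hold a negative count.
    return {term: n for term, n in counts.items() if n > 0}
```

Because only the sign changes when a title is seen, body occurrences counted before the title was encountered are not lost, and no second lookup structure for "known titles" is required.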
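
Commit 448a941de2 attributes the memory fix to a pool with a bounded queue that pushes back against the supplier when it cannot hold more tasks. The sketch below illustrates that back-pressure pattern in Python using the standard `queue` and `threading` modules. `BoundedBlockingPool` is a hypothetical stand-in written for this example; it is not Marginalia's `SimpleBlockingThreadPool` and makes no claim about how that class is implemented.

```python
import queue
import threading


class BoundedBlockingPool:
    """Hypothetical worker pool whose submit() blocks when the queue is full."""

    def __init__(self, workers, queue_size):
        # A bounded queue caps how many pending tasks can be held in memory.
        self._tasks = queue.Queue(maxsize=queue_size)
        self._threads = [
            threading.Thread(target=self._run, daemon=True)
            for _ in range(workers)
        ]
        for thread in self._threads:
            thread.start()

    def _run(self):
        while True:
            task = self._tasks.get()
            if task is None:  # sentinel: no more work for this worker
                return
            task()

    def submit(self, task):
        # Blocks the producer until a slot is free, so a fast supplier
        # cannot pile up an unbounded backlog of tasks.
        self._tasks.put(task)

    def shutdown(self):
        # One sentinel per worker, queued after all real tasks (FIFO order).
        for _ in self._threads:
            self._tasks.put(None)
        for thread in self._threads:
            thread.join()
```

With an unbounded queue, a producer that reads articles faster than the workers can process them would keep every pending task in memory; here the producer simply waits in `submit()` instead.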


1887 Commits