MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 21:29:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	2353c73c57	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-16 12:10:13 +02:00
Viktor Lofgren	599e719ad4	(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-15 16:44:08 +02:00
Viktor Lofgren	b6d365bacd	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-15 16:04:07 +02:00
Viktor Lofgren	52f0c0d336	(ngram) Grab titles separately when extracting ngrams from wiki data	2024-04-13 19:34:16 +02:00
Viktor Lofgren	be55f3f937	(zim) Fix title extractor	2024-04-13 19:33:47 +02:00
Viktor Lofgren	fda1c05164	(ngram) Correct |s|^|s|-normalization to use length and not count	2024-04-13 18:05:30 +02:00
Viktor Lofgren	1329d4abd8	(ngram) Correct size value in ngram lexicon generation, trim the terms better	2024-04-13 17:51:02 +02:00
Viktor Lofgren	f064992137	(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.	2024-04-13 17:07:23 +02:00
Viktor Lofgren	8a81a480a1	(ngram) Only extract frequencies of title words, but use the body to increment the counters... The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.	2024-04-12 18:08:31 +02:00
Viktor Lofgren	d729c400e5	(query, minor) Remove debug statement	2024-04-12 17:52:55 +02:00
Viktor Lofgren	ad4810d991	(query, minor) Remove debug statement	2024-04-12 17:45:26 +02:00
Viktor Lofgren	6a67043537	(ngram) Clean up ngram lexicon code This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.	2024-04-12 17:45:06 +02:00
Viktor Lofgren	864d6c28e7	(segmentation) Pick best segmentation using |s|^|s|-style normalization This is better than doing all segmentations possible at the same time.	2024-04-12 17:44:14 +02:00
Viktor Lofgren	bb6b51ad91	(ngram) Fix index range in NgramLexicon to avoid an exception	2024-04-12 10:13:25 +02:00
Viktor Lofgren	65e3caf402	(index) Clean up the code	2024-04-11 18:50:21 +02:00
Viktor Lofgren	b7d9a7ae89	(ngrams) Remove the vestigial logic for capturing permutations of n-grams The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.	2024-04-11 18:12:01 +02:00
Viktor Lofgren	ed73d79ec1	(qs) Clean up parsing code using new record matching	2024-04-11 17:36:08 +02:00
Viktor Lofgren	c538c25008	(term-freq-exporter) Reduce thread count and memory usage	2024-04-10 17:11:23 +02:00
Viktor Lofgren	4b47fadbab	(term-freq-exporter) Extract ngrams in term-frequency-exporter	2024-04-10 16:58:05 +02:00
Viktor Lofgren	fcdc843c15	(search) Fix outdated assumptions about the results We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.	2024-04-07 12:09:44 +02:00
Viktor Lofgren	dbdcf459a7	(minor) Remove dead code	2024-04-06 16:27:16 +02:00
Viktor Lofgren	ef25d60666	(index) Add origin trace information for index readers This used to be supported by the system but got lost in refactoring at some point.	2024-04-06 13:28:14 +02:00
Viktor Lofgren	7f7021ce64	(sentence-extractor) Fix resource leak in sentence extractor The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation. The modified behavior checks for nullity before creating a new instance.	2024-04-05 18:52:58 +02:00
Viktor Lofgren	5766da69ec	(gradle) Upgrade to Gradle 8.7 This will reduce the hassle of juggling JDK versions for JDK 22, which was not supported by Gradle 8.5.	2024-04-05 15:15:49 +02:00
Joshua Holland	617e633d7a	Update keywords docs use of explore to browse I can't tell when this happened, but the proper keyword now seems to be browse and not explore.	2024-04-05 15:15:49 +02:00
Viktor Lofgren	b770a1143f	(run) Fix traefik middleware configuration	2024-04-05 15:15:49 +02:00
Viktor Lofgren	ae7c760772	(index) Clean up new index query code	2024-04-05 13:30:49 +02:00
Viktor Lofgren	81815f3e0a	(qs, index) New query model integrated with index service. Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.	2024-04-04 20:17:58 +02:00
Viktor Lofgren	87bb93e1d4	(qs, WIP) Fix edge cases in query compilation This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.	2024-03-29 12:40:27 +01:00
Viktor Lofgren	e596c929ac	(qs, WIP) Clean up dead code	2024-03-28 16:37:23 +01:00
Viktor Lofgren	9852b0e609	(qs, WIP) Tidy it up a bit	2024-03-28 14:18:26 +01:00
Viktor Lofgren	51b0d6c0d3	(qs, WIP) Tidy it up a bit	2024-03-28 14:09:17 +01:00
Viktor Lofgren	15391c7a88	(qs, WIP) Tidy it up a bit	2024-03-28 13:54:30 +01:00
Viktor Lofgren	fe62593286	(qs, WIP) Break up code and tidy it up a bit	2024-03-28 13:26:54 +01:00
Viktor Lofgren	4cc11e183c	(qs, WIP) Fix output determinism, fix tests	2024-03-28 13:11:26 +01:00
Viktor Lofgren	f82ebd7716	(WIP) Query rendering finally beginning to look like it works	2024-03-28 13:01:21 +01:00
Viktor Lofgren	bd0704d5a4	(*) Fix JDK22 migration issues A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	1968485881	(docs) Upgrade to JDK22	2024-03-21 14:33:27 +01:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Your Name	411b3f3138	(run/install.sh) fix docker compose file I was following the release demo video for v2024.01.0 https://www.youtube.com/watch?v=PNwMkenQQ24 and when I did 'docker compose up' the containers couldn't resolve the DNS name for 'zookeeper' I realized this was because the zookeeper container was using the default docker network, so I specified the wmsa network explicitly.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	a4b810f511	WIP	2024-03-21 14:33:26 +01:00
Viktor Lofgren	0bd3365c24	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-03-19 14:28:42 +01:00
Viktor Lofgren	d8f4e7d72b	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-03-19 10:42:09 +01:00
Viktor Lofgren	afc047cd27	(control) GUI for exporting segmentation data from a wikipedia zim	2024-03-18 13:45:23 +01:00
Viktor Lofgren	00ef4f9803	(WIP) Partial integration of new query expansion code into the query-service	2024-03-18 13:16:49 +01:00
Viktor Lofgren	07e4d7ec6d	(WIP) Improve data extraction from wikipedia data	2024-03-18 13:16:00 +01:00
Viktor Lofgren	8ae1f08095	(WIP) Implement first take of new query segmentation algorithm	2024-03-12 13:12:50 +01:00
Viktor Lofgren	57e6a12d08	(registry) Correct registerMonitor() behavior The previous behavior would listen to too many changes, and based on zookeeper and not curator assumptions about behavior, add an additional monitor on each invocation of each monitor, (which always trigger on service state changes), leading to each monitor re-registering and effectively doubling monitors in numbers whenever a service stopped or started, which in turn meant a lot of bizarre thrashing behavior even on changes in services that don't explicitly talk to each other. This re-registering behavior is no longer done.	2024-03-06 12:22:15 +01:00
Viktor Lofgren	46423612e3	(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.	2024-03-03 10:49:23 +01:00
Viktor Lofgren	29bf473d74	(encyclopedia) Add URLencoding to path element This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.	2024-03-01 17:28:09 +01:00
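
Commit 8a81a480a1 above describes using the sign of a counter to record whether a term has appeared in a title yet. The following is only a rough sketch of that idea, not the project's code (which is Java): the class name SignedTermCounter, its method names, and the choice to count the title occurrence itself are all assumptions made for illustration.

```python
from collections import defaultdict


class SignedTermCounter:
    """Hypothetical sketch: the sign of each count marks title status.

    count < 0: term seen only in article bodies so far (provisional)
    count > 0: term has been seen in at least one title
    """

    def __init__(self):
        self.counts = defaultdict(int)

    def add_body(self, term):
        # Grow the magnitude but keep the current sign; a term that has
        # never been seen starts out negative (provisional).
        c = self.counts[term]
        self.counts[term] = c + 1 if c > 0 else c - 1

    def add_title(self, term):
        # Seeing the term in a title makes the count positive.
        # Assumption: the title occurrence itself is also counted.
        self.counts[term] = abs(self.counts[term]) + 1

    def title_frequencies(self):
        # Only terms confirmed by a title are reported.
        return {t: c for t, c in self.counts.items() if c > 0}
```

With this layout a single integer per term carries both the frequency and the "seen in a title" flag, so no second lookup table is needed; terms that never occur in a title simply stay negative and are skipped on export.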

1877 Commits