MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	1ab875a75d	(test) Correcting flaky tests Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	87e38e6181	(search-query) refac: Move query factory	2024-06-27 13:14:47 +02:00
Viktor Lofgren	f73fc8dd57	(search-query) Fix end-inclusion bug in QWordGraphIterator	2024-06-27 13:13:42 +02:00
Viktor Lofgren	3faa5bf521	(search-query) Tidy up QueryGRPCService and IndexClient	2024-06-26 14:03:30 +02:00
Viktor Lofgren	6973712480	(query) Tidy up code	2024-06-26 13:40:06 +02:00
Viktor Lofgren	95b9af92a0	(index) Implement working optional TermCoherences	2024-06-26 12:22:06 +02:00
Viktor Lofgren	dae22ccbe0	(test) Integration test from crawl->query	2024-06-25 22:17:26 +02:00
Viktor Lofgren	9d00243d7f	(index) Partial re-implementation of position constraints	2024-06-24 15:55:54 +02:00
Viktor Lofgren	36160988e2	(index) Integrate positions data with indexes WIP This change integrates the new positions data with the forward and reverse indexes. The ranking code is still only partially re-written.	2024-06-10 15:09:06 +02:00
Viktor Lofgren	a69ab311c7	(qword) Fix tests that broke due to stopword removal	2024-05-28 14:15:45 +02:00
Viktor Lofgren	6985ab762a	(query) Improve handling of stopwords in queries	2024-05-23 20:50:55 +02:00
Viktor Lofgren	0b60411e5f	(query) Bugfix stopword issue Add a new rule that creates an alternative path that omits a word if it's a stopword. In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.	2024-05-23 20:15:14 +02:00
Viktor Lofgren	89aae93e60	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	0a73b02a00	(query) Mark flaky test, correct assert on test	2024-04-21 12:30:14 +02:00
Viktor Lofgren	2cc74c005a	(query) Always generate an ngram alternative, suppresses generation of multiple identical query branches	2024-04-19 19:42:30 +02:00
Viktor Lofgren	ed250f57f2	(ranking) Set regularMask correctly	2024-04-19 14:31:57 +02:00
Viktor Lofgren	e92c25f7e0	(ranking) Cleanup	2024-04-19 14:13:12 +02:00
Viktor Lofgren	41782a0ab5	(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp	2024-04-19 12:19:26 +02:00
Viktor Lofgren	9b06433b82	(qs) Additional info in query debug UI	2024-04-19 12:18:53 +02:00
Viktor Lofgren	def607d840	(qs) Additional info in query debug UI	2024-04-19 11:46:27 +02:00
Viktor Lofgren	2b811fb422	(qs) Basic query debug feature	2024-04-19 11:00:56 +02:00
Viktor Lofgren	36cc62c10c	(proto) Improve handling of omitted parameters	2024-04-18 10:47:12 +02:00
Viktor Lofgren	8bbaf457de	(query) Minor code cleanup	2024-04-18 10:37:51 +02:00
Viktor Lofgren	7641a02f31	(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.	2024-04-18 10:36:15 +02:00
Viktor Lofgren	ce16239e34	(query) Modify tokenizer to match the behavior of the sentence extractor This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.	2024-04-17 17:54:32 +02:00
Viktor Lofgren	f52457213e	(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus	2024-04-17 14:05:02 +02:00
Viktor Lofgren	579295a673	(search) Add implicit coherence constraints based on segmentation	2024-04-17 14:03:35 +02:00
Viktor Lofgren	599e719ad4	(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-15 16:44:08 +02:00
Viktor Lofgren	b6d365bacd	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-15 16:04:07 +02:00
Viktor Lofgren	fda1c05164	(ngram) Correct |s|^|s|-normalization to use length and not count	2024-04-13 18:05:30 +02:00
Viktor Lofgren	d729c400e5	(query, minor) Remove debug statement	2024-04-12 17:52:55 +02:00
Viktor Lofgren	ad4810d991	(query, minor) Remove debug statement	2024-04-12 17:45:26 +02:00
Viktor Lofgren	864d6c28e7	(segmentation) Pick best segmentation using |s|^|s|-style normalization This is better than doing all segmentations possible at the same time.	2024-04-12 17:44:14 +02:00
Viktor Lofgren	b7d9a7ae89	(ngrams) Remove the vestigial logic for capturing permutations of n-grams The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.	2024-04-11 18:12:01 +02:00
Viktor Lofgren	ed73d79ec1	(qs) Clean up parsing code using new record matching	2024-04-11 17:36:08 +02:00
Viktor Lofgren	fcdc843c15	(search) Fix outdated assumptions about the results We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.	2024-04-07 12:09:44 +02:00
Viktor Lofgren	ae7c760772	(index) Clean up new index query code	2024-04-05 13:30:49 +02:00
Viktor Lofgren	81815f3e0a	(qs, index) New query model integrated with index service. Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.	2024-04-04 20:17:58 +02:00
Viktor Lofgren	87bb93e1d4	(qs, WIP) Fix edge cases in query compilation This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.	2024-03-29 12:40:27 +01:00
Viktor Lofgren	e596c929ac	(qs, WIP) Clean up dead code	2024-03-28 16:37:23 +01:00
Viktor Lofgren	9852b0e609	(qs, WIP) Tidy it up a bit	2024-03-28 14:18:26 +01:00
Viktor Lofgren	51b0d6c0d3	(qs, WIP) Tidy it up a bit	2024-03-28 14:09:17 +01:00
Viktor Lofgren	15391c7a88	(qs, WIP) Tidy it up a bit	2024-03-28 13:54:30 +01:00
Viktor Lofgren	fe62593286	(qs, WIP) Break up code and tidy it up a bit	2024-03-28 13:26:54 +01:00
Viktor Lofgren	4cc11e183c	(qs, WIP) Fix output determinism, fix tests	2024-03-28 13:11:26 +01:00
Viktor Lofgren	f82ebd7716	(WIP) Query rendering finally beginning to look like it works	2024-03-28 13:01:21 +01:00
Viktor Lofgren	002afca1c5	(sys) Upgrade to JDK22 This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.	2024-03-21 14:33:27 +01:00
Viktor Lofgren	a4b810f511	WIP	2024-03-21 14:33:26 +01:00
Viktor Lofgren	0bd3365c24	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-03-19 14:28:42 +01:00
Viktor Lofgren	d8f4e7d72b	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-03-19 10:42:09 +01:00
Viktor Lofgren	00ef4f9803	(WIP) Partial integration of new query expansion code into the query-service	2024-03-18 13:16:49 +01:00
Viktor Lofgren	07e4d7ec6d	(WIP) Improve data extraction from wikipedia data	2024-03-18 13:16:00 +01:00
Viktor Lofgren	8ae1f08095	(WIP) Implement first take of new query segmentation algorithm	2024-03-12 13:12:50 +01:00
Viktor Lofgren	46423612e3	(refac) Merge service-discovery and service modules Also adds a few tests to the server/client code.	2024-03-03 10:49:23 +01:00
Viktor Lofgren	9689f3faee	(domain-info) Fix incorrect array indexing	2024-02-29 18:56:09 +01:00
Viktor Lofgren	93fa58c93d	(domain-info) Fix incorrect array indexing Using the id instead of idx when addressing the ranksArray caused exceptions.	2024-02-29 17:54:23 +01:00
Viktor Lofgren	41abd8982f	(math) Clean up error handling	2024-02-28 14:19:50 +01:00
Viktor Lofgren	9415539b38	(docs) Update docs	2024-02-28 12:25:19 +01:00
Viktor Lofgren	84bab2783d	(docs) Fix fake news in docs	2024-02-28 12:16:45 +01:00
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	c943954bb4	(domain-info) Reduce memory usage	2024-02-27 21:22:21 +01:00
Viktor Lofgren	eaf836dc66	(service/grpc) Reduce thread count Netty and GRPC by default spawns an incredible number of threads on high-core CPUs, which amount to a fair bit of RAM usage. Add custom executors that throttle this behavior.	2024-02-27 21:22:21 +01:00
Viktor Lofgren	5604e9f531	(query) Bump query length, see what happens :P	2024-02-27 21:22:17 +01:00
Viktor Lofgren	427f3e922f	(index) Retire count operation, clean up index code.	2024-02-27 21:22:17 +01:00
Viktor Lofgren	9429bf5c45	(index) Clean up	2024-02-27 21:22:17 +01:00
Viktor Lofgren	fc00701a1e	(index) Experimental refactoring of the indexing functionality	2024-02-25 11:05:10 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	5cdb07023b	(refac) Clean up unused imports	2024-02-23 11:27:20 +01:00
Viktor Lofgren	f8e7f75831	Move index to top level of code	2024-02-22 18:01:35 +01:00
Viktor Lofgren	085137ca63	* Extract the index functionality	2024-02-22 17:31:25 +01:00
Viktor Lofgren	3fd2a83184	* Extract the search-query function	2024-02-22 15:27:39 +01:00
Viktor Lofgren	66c1281301	(zk-registry) epic yak shaving WIP Cleaning out a lot of old junk from the code, and one thing led to another... * Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds. * The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs. * Project is migrated to GraalVM * gRPC clients are re-written with a neat fluent/functional style. e.g. ```channelPool.call(grpcStub::method) .async(executor) // <-- optional .run(argument); ``` This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. * For now the project is all in on zookeeper * Service discovery is now based on APIs and not services. Theoretically means we could ship the same code either as a monolith or a service mesh. * To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP! Missing is documentation and testing, and some more breaking apart of code.	2024-02-22 14:01:23 +01:00

173 Commits