Commit Graph

93 Commits

Author SHA1 Message Date
Viktor Lofgren
1bd29a586c (service-discovery) Add common base interface to all Grpc services
To be able to tell service discovery whether to enable a service on a particular runtime, a common base interface DiscoverableService extends BindableService was added.
2024-09-27 13:46:34 +02:00
Viktor Lofgren
c757d116bf (misc) Fix Broken Tests 2024-09-27 13:46:34 +02:00
Viktor Lofgren
4a0356e26f (search-service) Add pagination support to the search GUI 2024-09-25 14:26:49 +02:00
Viktor Lofgren
73f973cc06 (search-query) Add pagination to search query API and the direct query-service interface 2024-09-25 14:20:59 +02:00
Viktor Lofgren
28e7c8e5e0 Increase temporal bias weight to give the recent results filter a bit more recency 2024-09-17 18:11:40 +02:00
Viktor Lofgren
99523ca079 (query-parser) Remove test that is no longer relevant 2024-09-10 10:35:56 +02:00
Viktor Lofgren
50ec922c2b (index) Fix broken index tests
Also cleaned up the tests to be less fragile to ranking algorithm changes.
2024-09-10 10:23:46 +02:00
Viktor Lofgren
50ba8fd099 (query-parsing) Correct handling of trailing parentheses 2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68 (query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar 2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
bb5d946c26 (index, EXPERIMENTAL) Clean up ranking code 2024-08-29 11:34:23 +02:00
Viktor Lofgren
4fbcc02f96 (index) Adjust sensible defaults for ranking parameters 2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731 (index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572 (search) Clean up inconsistent usage of MathClient in SearchOperator
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
03d5dec24c (*) Refactor termCoherences and rename them to phrase constraints. 2024-08-15 11:02:19 +02:00
Viktor Lofgren
016a4c62e1 (index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords 2024-08-10 09:51:03 +02:00
Viktor Lofgren
41da4f422d (search-query) Always generate the "all"-segmentation 2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593 (wip) Repair qdebug utility and show new ranking details 2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5 (index) Remove intermediate models 2024-08-07 10:10:44 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
dfd19b5eb9 (index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor Lofgren
ad3857938d (search-api, ranking) Update with new ranking parameters
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.

The change also cleans out several parameters that no longer filled any function.
2024-07-15 04:49:40 +02:00
Viktor Lofgren
1ab875a75d (test) Correcting flaky tests
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
87e38e6181 (search-query) refac: Move query factory 2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57 (search-query) Fix end-inclusion bug in QWordGraphIterator 2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521 (search-query) Tidy up QueryGRPCService and IndexClient 2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480 (query) Tidy up code 2024-06-26 13:40:06 +02:00
Viktor Lofgren
95b9af92a0 (index) Implement working optional TermCoherences 2024-06-26 12:22:06 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f (index) Partial re-implementation of position constraints 2024-06-24 15:55:54 +02:00
Viktor Lofgren
36160988e2 (index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.

The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
a69ab311c7 (qword) Fix tests that broke due to stopword removal 2024-05-28 14:15:45 +02:00
Viktor Lofgren
6985ab762a (query) Improve handling of stopwords in queries 2024-05-23 20:50:55 +02:00
Viktor Lofgren
0b60411e5f (query) Bugfix stopword issue
Add a new rule that crates an alternative path that omits a word if it's a stopword.

In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
0a73b02a00 (query) Mark flaky test, correct assert on test 2024-04-21 12:30:14 +02:00
Viktor Lofgren
2cc74c005a (query) Always generate an ngram alternative, suppresses generation of multiple identical query branches 2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2 (ranking) Set regularMask correctly 2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0 (ranking) Cleanup 2024-04-19 14:13:12 +02:00
Viktor Lofgren
41782a0ab5 (index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp 2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82 (qs) Additional info in query debug UI 2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840 (qs) Additional info in query debug UI 2024-04-19 11:46:27 +02:00