Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.
A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.
A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.
The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine".
This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.