Mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git, synced 2025-02-23 04:58:59 +00:00
The changeset cleans up the query parsing logic in the query service. It gets rid of a lot of old and largely unmaintainable query-rewriting logic that was based on POS-tagging rules, and adds a new cleaner approach. Query parsing is also refactored, and the internal APIs are updated to remove unnecessary duplication of document-level data across each search term.

A new query segmentation model is introduced based on a dictionary of known n-grams, with tools for extracting this dictionary from Wikipedia data. The changeset introduces a new segmentation model file, which is downloaded with the usual run/setup.sh, as well as an updated term frequency model.

A new intermediate representation of the query is introduced, based on a DAG with predefined vertices initiating and terminating the graph. This is for the benefit of easily writing rules for generating alternative queries, e.g. using the new segmentation data.

The graph is converted to a basic LL(1) syntax loosely reminiscent of a regular expression, where e.g. "( wiby | marginalia | kagi ) ( search engine | searchengine )" expands to "wiby search engine", "wiby searchengine", "marginalia search engine", "marginalia searchengine", "kagi search engine" and "kagi searchengine". This compiled query is passed to the index, which parses the expression, where it is used for execution of the search and ranking of the results.
- commons-codec
- count-min-sketch
- encyclopedia-marginalia-nu
- monkey-patch-opennlp
- openzim
- parquet-floor
- porterstemmer
- rdrpostagger
- symspell
- README.md
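The compiled query syntax mentioned in the changeset description above can be illustrated with a small sketch. This is not Marginalia's parser (the project itself is written in Java); it is a hypothetical Python recursive-descent expander that assumes whitespace-separated tokens, parentheses for grouping, and `|` between alternatives, which is all the quoted example shows. The function names `expand`, `parse_sequence` and `parse_group` are invented for illustration.

```python
from itertools import product


def parse_sequence(tokens: list[str], pos: int) -> tuple[list[str], int]:
    """Parse juxtaposed factors until ')', '|' or end of input.

    Returns every phrase the sequence can expand to, plus the next position.
    """
    factors: list[list[str]] = []
    while pos < len(tokens) and tokens[pos] not in (")", "|"):
        if tokens[pos] == "(":
            alternatives, pos = parse_group(tokens, pos + 1)
            factors.append(alternatives)
        else:
            factors.append([tokens[pos]])
            pos += 1
    # Concatenation: pick one alternative from each factor (cartesian product).
    phrases = [" ".join(part for part in choice if part)
               for choice in product(*factors)]
    return phrases, pos


def parse_group(tokens: list[str], pos: int) -> tuple[list[str], int]:
    """Parse '|'-separated alternatives up to the closing ')'."""
    alternatives: list[str] = []
    while True:
        phrases, pos = parse_sequence(tokens, pos)
        alternatives.extend(phrases)
        if pos >= len(tokens):
            raise ValueError("unbalanced parentheses")
        if tokens[pos] == ")":
            return alternatives, pos + 1
        pos += 1  # skip the '|' separator


def expand(expression: str) -> list[str]:
    """Expand a compiled query expression into its plain query variants."""
    # Assumption: tokens are separated by whitespace, as in the example.
    tokens = expression.split()
    phrases, pos = parse_sequence(tokens, 0)
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return phrases


for variant in expand("( wiby | marginalia | kagi ) ( search engine | searchengine )"):
    print(variant)
```

Each step decides what to do by looking only at the current token, which is the property that makes a grammar of this shape LL(1). The real index presumably evaluates the expression rather than materializing every variant, since the number of variants grows multiplicatively with each group.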
Third Party Code
This is a mix of code from other projects that has either been aggressively modified to suit the needs of the project, lacks an artifact, or is included to override some default that is inappropriate for the type of data Marginalia throws at the library.
Sources and Licenses
Modified
- RDRPosTagger - GPL3
- PorterStemmer - LGPL3
- OpenZIM - GPL-2.0+
- Commons Codec - Apache 2.0
- encyclopedia.marginalia.nu - GPL 2.0+
Repackaged
- SymSpell - LGPL-3.0
- Count-Min-Sketch - Apache 2.0
Monkey Patched
- Apache OpenNLP - Apache-2.0