From 7f7b1ffaba9c967a85b1824ea05f4d55f514d20f Mon Sep 17 00:00:00 2001
From: Viktor
Date: Tue, 31 Dec 2024 14:40:34 +0100
Subject: [PATCH] Update ROADMAP.md

---
 ROADMAP.md | 52 +++++++++++++++++++++++++++-------------------------
 1 file changed, 27 insertions(+), 25 deletions(-)

diff --git a/ROADMAP.md b/ROADMAP.md
index bab42ed5..d41c1e1a 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -8,20 +8,10 @@ be implemented as well.
 Major goals:
 
 * Reach 1 billion pages indexed
-* Improve technical ability of indexing and search. Although this area has improved a bit, the
- search engine is still not very good at dealing with longer queries.
 
-## Proper Position Index (COMPLETED 2024-09)
 
-The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
-of being very fast to evaluate and works well for what it is, but is inaccurate and has the
-drawback of making support for quoted search terms inaccurate and largely reliant on indexing
-word n-grams known beforehand. This limits the ability to interpret longer queries.
-
-The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
-list, as is the civilized way of doing this.
-
-Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
+ search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done )
 
 ## Hybridize crawler w/ Common Crawl data
 
@@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable
 
 ## Safe Search
 
-The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
-to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
+The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) ) combined with naive bayesian filter would go a long way, or something more sophisticated...?
 
 ## Web Design Overhaul
 
@@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
 
 It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
 
-## Finalize RSS support (COMPLETED 2024-11)
-
-Marginalia has experimental RSS preview support for a few domains. This works well and
-it should be extended to all domains. It would also be interesting to offer search of the
-RSS data itself, or use the RSS set to feed a special live index that updates faster than the
-main dataset.
-
-Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
-
 ## Support for binary formats like PDF
 
 The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
@@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
 website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search filter for any
 API consumer.
 
-I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
-which is quite ad-hoc, but instead to work together to find some new common description language for this.
+I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
+
+# Completed
+
+## Proper Position Index (COMPLETED 2024-09)
+
+The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
+of being very fast to evaluate and works well for what it is, but is inaccurate and has the
+drawback of making support for quoted search terms inaccurate and largely reliant on indexing
+word n-grams known beforehand. This limits the ability to interpret longer queries.
+
+The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
+list, as is the civilized way of doing this.
+
+Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
+
+## Finalize RSS support (COMPLETED 2024-11)
+
+Marginalia has experimental RSS preview support for a few domains. This works well and
+it should be extended to all domains. It would also be interesting to offer search of the
+RSS data itself, or use the RSS set to feed a special live index that updates faster than the
+main dataset.
+
+Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)