Update ROADMAP.md

This commit is contained in:
Viktor 2024-12-31 14:40:34 +01:00 committed by GitHub
parent 0ea8092350
commit 7f7b1ffaba
No known key found for this signature in database
GPG Key ID: B5690EEEBB952194

View File

@ -8,20 +8,10 @@ be implemented as well.
Major goals:
* Reach 1 billion pages indexed
* Improve technical ability of indexing and search. Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
* Improve technical ability of indexing and search. ~~Although this area has improved a bit, the
search engine is still not very good at dealing with longer queries.~~ (As of PR [#129](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/129), this has improved significantly. There is still more work to be done )
## Hybridize crawler w/ Common Crawl data
@ -37,8 +27,7 @@ Retaining the ability to independently crawl the web is still strongly desirable
## Safe Search
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable
to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
The search engine has a bit of a problem showing spicy content mixed in with the results. It would be desirable to have a way to filter this out. It's likely something like a URL blacklist (e.g. [UT1](https://dsi.ut-capitole.fr/blacklists/index_en.php) )
combined with naive bayesian filter would go a long way, or something more sophisticated...?
## Web Design Overhaul
@ -55,15 +44,6 @@ associated with each language added, at least a models file or two, as well as s
It would be very helpful to find a speaker of a large language other than English to help in the fine tuning.
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)
## Support for binary formats like PDF
The crawler needs to be modified to retain them, and the conversion logic needs to parse them.
@ -80,5 +60,27 @@ This looks like a good idea that wouldn't just help clean up the search filters
website, but might be cheap enough we might go as far as to offer a number of ad-hoc custom search
filter for any API consumer.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language,
which is quite ad-hoc, but instead to work together to find some new common description language for this.
I've talked to the stract dev and he does not think it's a good idea to mimic their optics language, which is quite ad-hoc, but instead to work together to find some new common description language for this.
# Completed
## Proper Position Index (COMPLETED 2024-09)
The search engine uses a fixed width bit mask to indicate word positions. It has the benefit
of being very fast to evaluate and works well for what it is, but is inaccurate and has the
drawback of making support for quoted search terms inaccurate and largely reliant on indexing
word n-grams known beforehand. This limits the ability to interpret longer queries.
The positions mask should be supplemented or replaced with a more accurate (e.g.) gamma coded positions
list, as is the civilized way of doing this.
Completed with PR [#99](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/99)
## Finalize RSS support (COMPLETED 2024-11)
Marginalia has experimental RSS preview support for a few domains. This works well and
it should be extended to all domains. It would also be interesting to offer search of the
RSS data itself, or use the RSS set to feed a special live index that updates faster than the
main dataset.
Completed with PR [#122](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/122) and PR [#125](https://github.com/MarginaliaSearch/MarginaliaSearch/pull/125)