Commit Graph

72 Commits

Author SHA1 Message Date
Viktor Lofgren
3889c4bdd9 (refactor) Remove features-search and update documentation 2023-10-09 15:12:30 +02:00
Viktor Lofgren
77ccab7d80 (index) Move linkdb to index from search.
This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
2023-10-08 16:48:35 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. (#52)
* (index-reverse) Parallel construction of the reverse indexes.

* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.

* (index-reverse)  Force changes to disk on close, reduce logging.

* (index-reverse)  Clean up merging process and add back logging

* (run)  Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM

* (index-reverse)  Better logging during processing

* (array) 2GB+ compatible write() function

* (array) 2GB+ compatible write() function

* (index-reverse) We are logging like Bolsonaro and I will not have it.

* (reverse-index) Self-diagnostics

* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
40768e935b (test) Removing /tmp-guardrails as it doesn't hold in CI 2023-10-02 16:52:59 +02:00
Viktor Lofgren
d160954080 (index) Two useful debug endpoints 2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0 (index) Slightly reduce alloc churn 2023-09-24 19:36:14 +02:00
Viktor Lofgren
028b5a4f0d (minor performance) Reduce GC churn in index 2023-09-24 12:12:08 +02:00
Viktor Lofgren
1bd146fb8e (minor) Remove dead code 2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4 (index) Add close methods on the index readers so they clean up their mmaps 2023-09-24 10:54:23 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
9e185e80ce (control-service) Add timestamp to file storages. 2023-09-02 14:01:04 +02:00
Viktor Lofgren
d31d8ec5b0 (index) Log keyword ids on hex format 2023-09-01 15:40:24 +02:00
Viktor Lofgren
764e7d1315 (index) Add more comprehensive integration tests for the index service. 2023-08-30 10:37:24 +02:00
Viktor Lofgren
3f288e264b (minor) Clean up dead endpoints 2023-08-29 17:04:54 +02:00
Viktor Lofgren
39c1857c61 (heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction. 2023-08-29 13:07:55 +02:00
Viktor Lofgren
b6a92506d1 (index) Hook in missing DocIdRewriter
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
460998d512 (index) Move index construction to separate process.
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service.  It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
b911665691 (index) Clean up and optimize valuator 2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d (index) Clean up result domain deduplicator 2023-08-24 18:24:55 +02:00
Viktor Lofgren
1e6800565a (system) Remove EdgeId<T> and similar objects
They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.
2023-08-24 17:46:02 +02:00
Viktor Lofgren
9894f37412 (index) Implement new URL ID coding scheme.
Also refactor along the way.  Really needs an additional pass, these tests are very hairy.
2023-08-24 16:44:27 +02:00
Viktor Lofgren
ebc84c22fb Upgrade antique lombok plugin
This permits tests to run on JDK20 environments.
2023-08-23 14:34:32 +00:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4d75fa2908 Upgrade gradle and docker plugin to support native JDK20 environments 2023-08-23 13:30:55 +00:00
Viktor Lofgren
704de50a9b (forward-index, valuator) HTML features in valuator
Put it in the forward index for easy access during index-side valuation.
2023-08-18 11:54:56 +02:00
Viktor Lofgren
251fc63b42 (*) Fix merge gore 2023-08-09 13:33:28 +02:00
Viktor Lofgren
624b78ec3a (heartbeat) Task heartbeats 2023-08-04 14:40:06 +02:00
Viktor Lofgren
9979c9defe (search/index) Add blogosphere filter 2023-08-02 20:13:30 +02:00
Viktor Lofgren
e22e65eee4 (index) Fix bug related to debug print statements 2023-07-22 14:33:58 +02:00
Viktor Lofgren
d7ab21fe34 (*) Refactor Control Service and processes 2023-07-17 21:20:31 +02:00
Viktor Lofgren
8b74e3aa0d (*) File Storage WIP 2023-07-14 17:08:10 +02:00
Viktor Lofgren
88b9ec70c6 (control, WIP) Run reconvert-load from converter :D 2023-07-11 18:05:37 +02:00
Viktor
cbbf60a599 Better fingerprinting (#35)
* Better fingerprinting for server tech
* Many more features in FeatureExtractor
* Blog specialization
* SiteType table
2023-07-10 18:58:43 +02:00
Viktor Lofgren
96eecc6ea5 Minor: Readability. 2023-07-10 18:58:43 +02:00
Viktor Lofgren
d9e6c4f266 Trial integration of MQ-FSM into index service. 2023-07-06 18:04:16 +02:00
Viktor Lofgren
62cc9df206 Embryo of new control process
* New events and heartbeat tables in mariadb
* Refactored to a cleaner Service interface
2023-07-03 10:40:32 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
55c65f0935 Use document generator to complement the document selection.
Will let through e.g. a modern SSG in the small web filter.
2023-06-22 17:21:33 +02:00
Viktor Lofgren
fd192d2791 Fix putative overflow error with a large dictionary 2023-05-28 11:57:06 +02:00
Viktor Lofgren
df1850bd45 Fix bug in index service where tld: and links:-queries wouldn't work. 2023-04-15 18:39:16 +02:00
Viktor Lofgren
502713f7a8 Reduce memory churn 2023-04-10 16:51:17 +02:00
Viktor Lofgren
e19256a6b6 Tune settings to retrieve more results. 2023-04-10 15:39:20 +02:00
Viktor Lofgren
ccc41d1717 Clean up of the index query handling related code. 2023-04-10 14:50:57 +02:00
Viktor Lofgren
e49b1dd155 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:20:40 +02:00
Viktor Lofgren
fe419b12b4 Better handling of quote terms, fix bug in handling of longer queries.
... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java
2023-04-10 13:11:40 +02:00
Viktor Lofgren
535a51a621 Repair broken year query test. 2023-04-08 12:04:09 +02:00
Viktor
a278fc6296
Increase search result relevance (#8)
* Increase accuracy of the position bits.
* Increase their width to 56.
* Use a rolling position scheme for bits 16-56 to increase the average accuracy.
* Result ranking overhaul
* Optimized queries
* BM25 in the index service's ranking
* Make gui less jank
* Javadocs for ranking parameters.
2023-04-07 20:18:08 +02:00
Viktor Lofgren
3fb249758e Adjust result ordering. 2023-04-02 12:05:22 +02:00