Commit Graph

160 Commits

Author SHA1 Message Date
Viktor Lofgren
cf366c602f (search) Refactor SearchQueryIndexService in preparation for feature extraction.
Prefer working on DecoratedSearchResultItem in favor of UrlDetails.
2023-10-08 17:15:41 +02:00
Viktor Lofgren
77ccab7d80 (index) Move linkdb to index from search.
This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
2023-10-08 16:48:35 +02:00
Viktor Lofgren
f51ba63742 (search) Remove dead file 2023-10-07 21:05:06 +02:00
Viktor Lofgren
9044518be5 (search) Fix broken link to git repo 2023-10-07 19:43:22 +02:00
Viktor Lofgren
9e0367eef4 (search) Filter blacklisted items in API query service as well 2023-10-07 16:16:04 +02:00
Viktor Lofgren
235bb6c1b9 (control) Administrative QOL improvement, GUI for banning spam 2023-10-07 15:45:50 +02:00
Viktor Lofgren
49344d7ea8 (control) Administrative QOL improvement, GUI for banning spam 2023-10-07 15:43:18 +02:00
Viktor Lofgren
1b418d77ff (search) We got some new IP ranges to work with for the crawler 2023-10-07 13:41:55 +02:00
Viktor Lofgren
80cc302627 (search) We can't in claim to be on PC hardware anymore... 2023-10-07 11:49:29 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. (#52)
* (index-reverse) Parallel construction of the reverse indexes.

* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.

* (index-reverse)  Force changes to disk on close, reduce logging.

* (index-reverse)  Clean up merging process and add back logging

* (run)  Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM

* (index-reverse)  Better logging during processing

* (array) 2GB+ compatible write() function

* (array) 2GB+ compatible write() function

* (index-reverse) We are logging like Bolsonaro and I will not have it.

* (reverse-index) Self-diagnostics

* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
c51159672e (build) Move unit test configuration to root build.gradle 2023-10-04 12:46:22 +02:00
Viktor Lofgren
405300b4b2 (control) Fix bug where finishing one process ad hoc task would remove all other tasks from the db 2023-10-04 11:44:31 +02:00
Viktor Lofgren
40768e935b (test) Removing /tmp-guardrails as it doesn't hold in CI 2023-10-02 16:52:59 +02:00
Viktor Lofgren
d160954080 (index) Two useful debug endpoints 2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0 (index) Slightly reduce alloc churn 2023-09-24 19:36:14 +02:00
Viktor Lofgren
03bffa27ac (search) Add combined id to the search result HTML 2023-09-24 19:34:35 +02:00
Viktor Lofgren
028b5a4f0d (minor performance) Reduce GC churn in index 2023-09-24 12:12:08 +02:00
Viktor Lofgren
1bd146fb8e (minor) Remove dead code 2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4 (index) Add close methods on the index readers so they clean up their mmaps 2023-09-24 10:54:23 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d78569986b (backups) Fix bug where backup service would zero the linkdb when restoring. 2023-09-22 18:34:34 +02:00
Viktor Lofgren
95323e6caa (backups) Support restore multi-source load data 2023-09-22 18:34:17 +02:00
Viktor Lofgren
f809d22fc6 (loader) Support simultaneous loading of multiple processed data sets 2023-09-22 13:14:58 +02:00
Viktor Lofgren
70aa04c047 (converter, stackexchange-xml) Add the ability to sideload stackexchange data 2023-09-21 12:48:33 +02:00
Viktor Lofgren
f8050816ac (search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash. 2023-09-21 12:47:02 +02:00
Viktor Lofgren
9b385ec7cc (converter) Make it possible to sideload documents from a directory tree 2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46 (crawl-spec) Parquetify crawl spec
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
5e5aaf9a7e (converter, control) Re-enable sideloading encyclopedia data 2023-09-14 12:12:07 +02:00
Viktor Lofgren
07d7507ac6 (control-service) Move Actions up in storage-details
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
9e185e80ce (control-service) Add timestamp to file storages. 2023-09-02 14:01:04 +02:00
Viktor Lofgren
d31d8ec5b0 (index) Log keyword ids on hex format 2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d (process) Propagate environment JVM params to the index constructor 2023-09-01 15:39:42 +02:00
Viktor Lofgren
764e7d1315 (index) Add more comprehensive integration tests for the index service. 2023-08-30 10:37:24 +02:00
Viktor Lofgren
e4d7958379 (control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness 2023-08-29 18:19:04 +02:00
Viktor Lofgren
3f288e264b (minor) Clean up dead endpoints 2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c (loader) Minor optimizations and bugfixes.
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61 (heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction. 2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3 (control-service) Remove old index journal files when restoring a backup. 2023-08-29 11:58:01 +02:00
Viktor Lofgren
6525b16e1f (minor) Improved logging and error messages 2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1 (index) Hook in missing DocIdRewriter
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
194a6057dd (index,control) Recoverable index backups 2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2 (db) Remove EC_URL and EC_PAGE_DATA from mariadb database 2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59 (control) Simplify ConvertAndLoadActor 2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8 (control) Display progress of process tasks 2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512 (index) Move index construction to separate process.
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service.  It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417 (search) Remove endpoint flush-search-caches
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409 (converter) Update confusing state description
SWAP_LEXICON doesn't instruct the index service to do anything.  It just moves the file.
2023-08-24 18:56:49 +02:00
Viktor Lofgren
b911665691 (index) Clean up and optimize valuator 2023-08-24 18:34:06 +02:00
Viktor Lofgren
56eb83319d (index) Clean up result domain deduplicator 2023-08-24 18:24:55 +02:00