Viktor Lofgren
16e0738731
(*) Get multi-node routing working.
2023-10-15 18:38:30 +02:00
Viktor Lofgren
eacbf87979
(control) New list and form for index nodes.
2023-10-14 21:46:52 +02:00
Viktor Lofgren
108b4cb648
(service) Keep disabled multi-noded services dormant when they are configured to be disabled.
2023-10-14 20:58:55 +02:00
Viktor Lofgren
a9dff407a1
(config/db) Clean up migrations
2023-10-14 20:34:03 +02:00
Viktor Lofgren
9e26109e36
(reverse-index) Don't always POST
2023-10-14 16:48:29 +02:00
Viktor Lofgren
6308a8dfcd
(control) Node configuration
2023-10-14 16:47:52 +02:00
Viktor Lofgren
4baf9527d7
(*) WIP Control GUI redesign, executor-service, multi-node mq
...
This turned out to be very difficult to do in small isolated steps.
* Design overhaul of the control gui using bootstrap
* Move the actors out of control-service into a new executor-service, that can be run on multiple nodes
* Add node-affinity to message queue
2023-10-14 12:08:43 +02:00
Viktor Lofgren
199c459697
(*) Add node-affinity to services, processes and file storage.
2023-10-10 12:32:22 +02:00
Viktor Lofgren
61288c5e68
(service, client) First steps towards multiple nodedness
2023-10-09 22:13:27 +02:00
Viktor Lofgren
8375237de5
(converter) Add special keyword for websites with a tilde url.
2023-10-09 17:02:32 +02:00
Viktor Lofgren
6319b8ef51
(api-service) Improved testability, always set content type to application/json
2023-10-09 15:39:34 +02:00
Viktor Lofgren
397a85eaa4
(query-service) Apply blacklisting to search results
2023-10-09 15:18:53 +02:00
Viktor Lofgren
3889c4bdd9
(refactor) Remove features-search and update documentation
2023-10-09 15:12:30 +02:00
Viktor Lofgren
c899f1cb85
(docs) Update documentation to reflect new query service
2023-10-09 14:56:59 +02:00
Viktor Lofgren
d8956c51d0
(refactor) Remove api:search-api
...
Application services should not have an API, but purely act as clients
to the core services (which should always have an API).
2023-10-09 14:42:33 +02:00
Viktor Lofgren
5dd55c7cad
(refactor) Rename satellite services to application services
...
This is a better descriptor, since they now all implement different applications on top of the core services' APIs.
2023-10-09 13:45:45 +02:00
Viktor Lofgren
c0e61d4c87
(refactor) Move search service into services-satellite
2023-10-09 13:40:01 +02:00
Viktor Lofgren
97e17282ab
(query-service) Move query parsing from search-service to the new query service.
2023-10-09 13:27:44 +02:00
Viktor Lofgren
94c882af7d
(query-service) Provide delegate of IndexApi's query functionality.
...
This is an intermediate step in the process of introducing the query-service as a proxy between search and index.
2023-10-08 22:22:26 +02:00
Viktor Lofgren
89c6d85f2f
(query-service) Create new empty 'query-service' service
2023-10-08 17:31:50 +02:00
Viktor Lofgren
cf366c602f
(search) Refactor SearchQueryIndexService in preparation for feature extraction.
...
Prefer working on DecoratedSearchResultItem over UrlDetails.
2023-10-08 17:15:41 +02:00
Viktor Lofgren
77ccab7d80
(index) Move linkdb to index from search.
...
This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.
2023-10-08 16:48:35 +02:00
Viktor Lofgren
f51ba63742
(search) Remove dead file
2023-10-07 21:05:06 +02:00
Viktor Lofgren
9044518be5
(search) Fix broken link to git repo
2023-10-07 19:43:22 +02:00
Viktor Lofgren
9e0367eef4
(search) Filter blacklisted items in API query service as well
2023-10-07 16:16:04 +02:00
Viktor Lofgren
235bb6c1b9
(control) Administrative QOL improvement, GUI for banning spam
2023-10-07 15:45:50 +02:00
Viktor Lofgren
49344d7ea8
(control) Administrative QOL improvement, GUI for banning spam
2023-10-07 15:43:18 +02:00
Viktor Lofgren
1b418d77ff
(search) We got some new IP ranges to work with for the crawler
2023-10-07 13:41:55 +02:00
Viktor Lofgren
80cc302627
(search) We can't claim to be on PC hardware anymore...
2023-10-07 11:49:29 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. (#52)
...
* (index-reverse) Parallel construction of the reverse indexes.
* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.
* (index-reverse) Force changes to disk on close, reduce logging.
* (index-reverse) Clean up merging process and add back logging
* (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM
* (index-reverse) Better logging during processing
* (array) 2GB+ compatible write() function
* (array) 2GB+ compatible write() function
* (index-reverse) We are logging like Bolsonaro and I will not have it.
* (reverse-index) Self-diagnostics
* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
e498c6907a
(forward-index) Don't leak off heap memory
2023-10-05 21:22:13 +02:00
Viktor Lofgren
08e8fc6736
(index-journal) Thread safe IndexJournalReadEntry
2023-10-05 19:39:09 +02:00
Viktor Lofgren
f6e9ef6de9
(array) Fix transferFrom() so it survives larger than 2 GB transfers
2023-10-04 13:57:36 +02:00
Viktor Lofgren
c51159672e
(build) Move unit test configuration to root build.gradle
2023-10-04 12:46:22 +02:00
Viktor Lofgren
233b51e29e
(test) flag DomainTypesTest as Slow to exclude from regular CI
2023-10-04 12:23:10 +02:00
Viktor Lofgren
54c8e13a68
(term-frequency-dict) Fix memory leak in TermFrequencyDict
2023-10-04 11:55:11 +02:00
Viktor Lofgren
405300b4b2
(control) Fix bug where finishing one process ad hoc task would remove all other tasks from the db
2023-10-04 11:44:31 +02:00
Viktor Lofgren
40768e935b
(test) Removing /tmp-guardrails as it doesn't hold in CI
2023-10-02 16:52:59 +02:00
Viktor Lofgren
13ee31770a
(file storage) Make it possible to override the value returned by getFileStorage(type) with a JVM property.
2023-10-01 12:57:53 +02:00
Viktor Lofgren
93dc80000c
(bugfix) Fix NPE in KeywordExtractor due to bad SoftReference handling
2023-09-26 17:16:41 +02:00
Viktor Lofgren
e0cd3cd991
(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.
2023-09-26 12:19:43 +02:00
Viktor Lofgren
81ae501e73
(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well
2023-09-25 18:28:34 +02:00
Viktor Lofgren
9b781f8404
(keyword-extractor) Address very rare race condition in memoization logic
2023-09-25 18:28:04 +02:00
Viktor Lofgren
f797a92f87
(converter, minor) Use domain name in task heartbeat progress
2023-09-25 18:27:04 +02:00
Viktor Lofgren
ec6c9bca62
(common) Fix factual error in comments
2023-09-24 19:40:19 +02:00
Viktor Lofgren
a433bbbe45
(converter) Fix rare sentence extractor bug
...
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d
(keyword-extraction) Chasing my tail looking for a bug
2023-09-24 19:39:48 +02:00
Viktor Lofgren
d160954080
(index) Two useful debug endpoints
2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0
(index) Slightly reduce alloc churn
2023-09-24 19:36:14 +02:00
Viktor Lofgren
03bffa27ac
(search) Add combined id to the search result HTML
2023-09-24 19:34:35 +02:00
Viktor Lofgren
028b5a4f0d
(minor performance) Reduce GC churn in index
2023-09-24 12:12:08 +02:00
Viktor Lofgren
cd12f49fc0
(long-array) Return SegmentLongArray slices of itself for range() &c
2023-09-24 11:31:54 +02:00
Viktor Lofgren
1bd146fb8e
(minor) Remove dead code
2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4
(index) Add close methods on the index readers so they clean up their mmaps
2023-09-24 10:54:23 +02:00
Viktor Lofgren
d0aa754252
(long-array) Implement java.lang.foreign.Arena based lifecycle control for LongArray.
...
Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit.
This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.
2023-09-24 10:40:06 +02:00
Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
...
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d78569986b
(backups) Fix bug where backup service would zero the linkdb when restoring.
2023-09-22 18:34:34 +02:00
Viktor Lofgren
95323e6caa
(backups) Support restoring multi-source load data
2023-09-22 18:34:17 +02:00
Viktor Lofgren
f809d22fc6
(loader) Support simultaneous loading of multiple processed data sets
2023-09-22 13:14:58 +02:00
Viktor Lofgren
10cad3abb2
(dating) Implementing @samstorment's fantastic design polish
2023-09-21 15:19:50 +02:00
Viktor Lofgren
9338f35cd8
(doc) Remove confusingly outdated ER-diagrams
2023-09-21 15:08:27 +02:00
Viktor Lofgren
ad660cf420
(converter) Bugfix: Don't try to Path.of() on optional field
2023-09-21 13:27:09 +02:00
Viktor Lofgren
75f8ae2815
(file-storage) Use human-readable timestamps in the names of file storage directories
2023-09-21 13:22:53 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
4aa47e87f2
(blocking-thread-pool) Add isTerminated convenience function
2023-09-21 12:47:41 +02:00
Viktor Lofgren
f8050816ac
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
2023-09-21 12:47:02 +02:00
Viktor Lofgren
5b0a6d7ec1
(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s
2023-09-20 15:15:13 +02:00
Viktor Lofgren
3b4d08f52b
(stackexchange-integration) Add better comments
2023-09-20 14:43:06 +02:00
Viktor Lofgren
6bbf40d7d2
(stackexchange-integration) Tools for reading stackexchange xml files
2023-09-20 14:17:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
35996d0adb
(docs) Update the documentation with up-to-date information
2023-09-14 11:33:36 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
87a8593291
(work-log) Fix bug where items weren't added to the current batch on logItem
2023-09-14 10:11:04 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
064bc5ee76
(processed-data) New parquet-serializable models for converter output
2023-09-11 14:08:40 +02:00
Viktor Lofgren
a52d78c8ee
(work-log) New batching work log
2023-09-11 14:08:08 +02:00
Viktor Lofgren
07d7507ac6
(control-service) Move Actions up in storage-details
...
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
c68d17d482
(keyword-extraction) Fix bug leading to position data missing on some keywords.
...
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
2023-09-02 14:48:55 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
676e7c7947
(keywords) Add Serializable properties that went missing as the record became a class
2023-09-02 09:52:01 +02:00
Viktor Lofgren
04212b2cef
(btree) Add more consistent asserts on sortedness
2023-09-01 15:45:02 +02:00
Viktor Lofgren
bafc2a1f30
(reverse-index) Force() final docs after being written
...
Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.
2023-09-01 15:43:53 +02:00
Viktor Lofgren
563e388a45
(reverse-index) Fix parallel documents sorting bug
...
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
2023-09-01 15:42:45 +02:00
Viktor Lofgren
d31d8ec5b0
(index) Log keyword ids in hex format
2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d
(process) Propagate environment JVM params to the index constructor
2023-09-01 15:39:42 +02:00
Viktor Lofgren
5f427d2b4c
(keywords) Clean up leaky abstractions, clean up tests
2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d
(index journal; minor) Clean up
2023-09-01 11:32:24 +02:00
Viktor Lofgren
10a74f45ea
(index journal; minor) Even cleaner separation of concerns.
2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a
(index journal) Fix leaky abstraction in IndexJournalReader.
...
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
88ac72c8eb
(journal/reverse index) Working WIP fix for over-allocation of documents
2023-08-31 20:16:02 +02:00
Viktor Lofgren
f74b9df0a7
(array) Don't use paging arrays when mapping small files for writing
2023-08-31 20:15:10 +02:00
Viktor Lofgren
a6f1335375
(loader) Fix bug where the loader would omit some meta and words.
2023-08-31 17:48:43 +02:00