Viktor Lofgren
9e0367eef4
(search) Filter blacklisted items in API query service as well
2023-10-07 16:16:04 +02:00
Viktor Lofgren
235bb6c1b9
(control) Administrative QOL improvement, GUI for banning spam
2023-10-07 15:45:50 +02:00
Viktor Lofgren
49344d7ea8
(control) Administrative QOL improvement, GUI for banning spam
2023-10-07 15:43:18 +02:00
Viktor Lofgren
1b418d77ff
(search) We got some new IP ranges to work with for the crawler
2023-10-07 13:41:55 +02:00
Viktor Lofgren
80cc302627
(search) We can't claim to be on PC hardware anymore...
2023-10-07 11:49:29 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. (#52)
...
* (index-reverse) Parallel construction of the reverse indexes.
* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.
* (index-reverse) Force changes to disk on close, reduce logging.
* (index-reverse) Clean up merging process and add back logging
* (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM
* (index-reverse) Better logging during processing
* (array) 2GB+ compatible write() function
* (array) 2GB+ compatible write() function
* (index-reverse) We are logging like Bolsonaro and I will not have it.
* (reverse-index) Self-diagnostics
* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
e498c6907a
(forward-index) Don't leak off heap memory
2023-10-05 21:22:13 +02:00
Viktor Lofgren
08e8fc6736
(index-journal) Thread safe IndexJournalReadEntry
2023-10-05 19:39:09 +02:00
Viktor Lofgren
f6e9ef6de9
(array) Fix transferFrom() so it survives larger than 2 GB transfers
2023-10-04 13:57:36 +02:00
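A minimal sketch of one way a transfer can be made to survive sizes above 2 GB: split it into bounded chunks and loop until everything is copied. The commit does not say how the fix was done, so this is an assumption, shown with the standard java.nio FileChannel.transferFrom() rather than the project's own array classes. The names ChunkedTransferSketch, transferAll, roundTrip and maxChunk are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class ChunkedTransferSketch {

    // Copies up to `count` bytes from src into dest starting at destPos,
    // never asking for more than maxChunk bytes in a single call.
    // transferFrom() may move fewer bytes than requested, so we loop.
    static long transferAll(FileChannel src, FileChannel dest,
                            long destPos, long count, long maxChunk) throws IOException {
        long done = 0;
        while (done < count) {
            long want = Math.min(maxChunk, count - done);
            long n = dest.transferFrom(src, destPos + done, want);
            if (n <= 0) {
                break; // source exhausted, stop instead of spinning
            }
            done += n;
        }
        return done;
    }

    // Hypothetical helper for illustration: writes data to a temp file,
    // copies it in chunks to a second temp file, and reads the copy back.
    static byte[] roundTrip(byte[] data, long maxChunk) {
        try {
            Path in = Files.createTempFile("transfer-src", ".bin");
            Path out = Files.createTempFile("transfer-dst", ".bin");
            try {
                Files.write(in, data);
                try (FileChannel src = FileChannel.open(in, StandardOpenOption.READ);
                     FileChannel dest = FileChannel.open(out, StandardOpenOption.WRITE)) {
                    transferAll(src, dest, 0, src.size(), maxChunk);
                }
                return Files.readAllBytes(out);
            } finally {
                Files.deleteIfExists(in);
                Files.deleteIfExists(out);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        byte[] copy = roundTrip(data, 64); // tiny chunk size to exercise the loop
        System.out.println("copied " + copy.length + " bytes");
    }
}
```

In real use maxChunk would be something comfortably below 2 GB; the tiny value in main() is only there to force several loop iterations.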
Viktor Lofgren
c51159672e
(build) Move unit test configuration to root build.gradle
2023-10-04 12:46:22 +02:00
Viktor Lofgren
233b51e29e
(test) flag DomainTypesTest as Slow to exclude from regular CI
2023-10-04 12:23:10 +02:00
Viktor Lofgren
54c8e13a68
(term-frequency-dict) Fix memory leak in TermFrequencyDict
2023-10-04 11:55:11 +02:00
Viktor Lofgren
405300b4b2
(control) Fix bug where finishing one process ad hoc task would remove all other tasks from the db
2023-10-04 11:44:31 +02:00
Viktor Lofgren
40768e935b
(test) Removing /tmp-guardrails as it doesn't hold in CI
2023-10-02 16:52:59 +02:00
Viktor Lofgren
13ee31770a
(file storage) Make it possible to override the value returned by getFileStorage(type) with a JVM property.
2023-10-01 12:57:53 +02:00
Viktor Lofgren
93dc80000c
(bugfix) Fix NPE in KeywordExtractor due to bad SoftReference handling
2023-09-26 17:16:41 +02:00
Viktor Lofgren
e0cd3cd991
(converter) Alter StackexchangeSideloader's summary length to align with the rest of the system.
2023-09-26 12:19:43 +02:00
Viktor Lofgren
81ae501e73
(converter) Use ThreadLocalSentenceExtractorProvider for PlainText plugin as well
2023-09-25 18:28:34 +02:00
Viktor Lofgren
9b781f8404
(keyword-extractor) Address very rare race condition in memoization logic
2023-09-25 18:28:04 +02:00
Viktor Lofgren
f797a92f87
(converter, minor) Use domain name in task heartbeat progress
2023-09-25 18:27:04 +02:00
Viktor Lofgren
ec6c9bca62
(common) Fix factual error in comments
2023-09-24 19:40:19 +02:00
Viktor Lofgren
a433bbbe45
(converter) Fix rare sentence extractor bug
...
It was caused by non-thread safe concurrent memory access in SentenceExtractor.
2023-09-24 19:39:48 +02:00
Viktor Lofgren
8ca20f184d
(keyword-extraction) Chasing my tail looking for a bug
2023-09-24 19:39:48 +02:00
Viktor Lofgren
d160954080
(index) Two useful debug endpoints
2023-09-24 19:39:48 +02:00
Viktor Lofgren
14372e0ef0
(index) Slightly reduce alloc churn
2023-09-24 19:36:14 +02:00
Viktor Lofgren
03bffa27ac
(search) Add combined id to the search result HTML
2023-09-24 19:34:35 +02:00
Viktor Lofgren
028b5a4f0d
(minor performance) Reduce GC churn in index
2023-09-24 12:12:08 +02:00
Viktor Lofgren
cd12f49fc0
(long-array) SegmentLongArray returns slices of itself for range() &c
2023-09-24 11:31:54 +02:00
Viktor Lofgren
1bd146fb8e
(minor) Remove dead code
2023-09-24 10:55:20 +02:00
Viktor Lofgren
5f6c3da7a4
(index) Add close methods on the index readers so they clean up their mmaps
2023-09-24 10:54:23 +02:00
Viktor Lofgren
d0aa754252
(long-array) Implement java.lang.foreign.Arena based lifecycle control for LongArray.
...
Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit.
This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.
2023-09-24 10:40:06 +02:00
Viktor Lofgren
dbe9235f3a
(*) Upgrade to JDK21 with preview enabled.
...
... also move some common configuration into the root build.gradle-file.
Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
d78569986b
(backups) Fix bug where backup service would zero the linkdb when restoring.
2023-09-22 18:34:34 +02:00
Viktor Lofgren
95323e6caa
(backups) Support restoring multi-source load data
2023-09-22 18:34:17 +02:00
Viktor Lofgren
f809d22fc6
(loader) Support simultaneous loading of multiple processed data sets
2023-09-22 13:14:58 +02:00
Viktor Lofgren
10cad3abb2
(dating) Implementing @samstorment's fantastic design polish
2023-09-21 15:19:50 +02:00
Viktor Lofgren
9338f35cd8
(doc) Remove confusingly outdated ER-diagrams
2023-09-21 15:08:27 +02:00
Viktor Lofgren
ad660cf420
(converter) Bugfix: Don't try to Path.of() on optional field
2023-09-21 13:27:09 +02:00
Viktor Lofgren
75f8ae2815
(file-storage) Use human-readable timestamps in the names of file storage directories
2023-09-21 13:22:53 +02:00
Viktor Lofgren
70aa04c047
(converter, stackexchange-xml) Add the ability to sideload stackexchange data
2023-09-21 12:48:33 +02:00
Viktor Lofgren
4aa47e87f2
(blocking-thread-pool) Add isTerminated convenience function
2023-09-21 12:47:41 +02:00
Viktor Lofgren
f8050816ac
(search) Don't run LSH deduplication on details with zero lsh to support not calculating this hash.
2023-09-21 12:47:02 +02:00
Viktor Lofgren
5b0a6d7ec1
(stackexchange-converter) Create tool for converting stackexchange 7z-files to digestible sqlite db:s
2023-09-20 15:15:13 +02:00
Viktor Lofgren
3b4d08f52b
(stackexchange-integration) Add better comments
2023-09-20 14:43:06 +02:00
Viktor Lofgren
6bbf40d7d2
(stackexchange-integration) Tools for reading stackexchange xml files
2023-09-20 14:17:33 +02:00
Viktor Lofgren
d895f83520
(blocking-thread-pool) Move DumbThreadPool to its own micro-library
...
Also rename it to SimpleBlockingThreadPool.
2023-09-20 10:11:49 +02:00
Viktor Lofgren
f6b9e8c5eb
(converter) JavadocSpecialization should truncate its summary if it gets too long
2023-09-17 16:25:33 +02:00
Viktor Lofgren
98bcdf6028
(converter) DirtreeSideloader now trims /index.html from the URL if present
...
This is a crawler artifact in 9 cases out of 10, and may lead to bad URLs.
2023-09-17 16:08:16 +02:00
Viktor Lofgren
9b385ec7cc
(converter) Make it possible to sideload documents from a directory tree
2023-09-17 14:35:06 +02:00
Viktor Lofgren
5c040f7a46
(crawl-spec) Parquetify crawl spec
...
* Crawl-specs are now parquet files
* Deprecate the crawl-job-extractor tool
2023-09-17 09:41:34 +02:00
Viktor Lofgren
c67d95c00f
(converter) Write dummy processor log when sideloading
2023-09-14 14:13:03 +02:00
Viktor Lofgren
5e5aaf9a7e
(converter, control) Re-enable sideloading encyclopedia data
2023-09-14 12:12:07 +02:00
Viktor Lofgren
35996d0adb
(docs) Update the documentation with up-to-date information
2023-09-14 11:33:36 +02:00
Viktor Lofgren
eaeb23d41e
(refactor) Remove converting-model package completely
2023-09-14 11:21:44 +02:00
Viktor Lofgren
c71f6ad417
(converter) Add heartbeats to the loader processes and execute the tasks in parallel for a ~2X speedup
2023-09-14 10:11:57 +02:00
Viktor Lofgren
87a8593291
(work-log) Fix bug where items weren't added to the current batch on logItem
2023-09-14 10:11:04 +02:00
Viktor Lofgren
4799dd769e
(converting) WIP begin to remove converting-model and the old InstructionsCompiler
2023-09-13 19:18:58 +02:00
Viktor Lofgren
24b4606f96
(converter,loader) Converter outputs parquet files instead of compressed json.
2023-09-13 16:13:41 +02:00
Viktor Lofgren
064bc5ee76
(processed-data) New parquet-serializable models for converter output
2023-09-11 14:08:40 +02:00
Viktor Lofgren
a52d78c8ee
(work-log) New batching work log
2023-09-11 14:08:08 +02:00
Viktor Lofgren
07d7507ac6
(control-service) Move Actions up in storage-details
...
Papercut fix. If a file storage area has a lot of files, you have to scroll down a long way to get to the actions otherwise.
2023-09-02 15:41:55 +02:00
Viktor Lofgren
c68d17d482
(keyword-extraction) Fix bug leading to position data missing on some keywords.
...
This was due to a discrepancy between the KeywordPositionBitmask and WordsTfIdfCounts' concept of a keyword.
2023-09-02 14:48:55 +02:00
Viktor Lofgren
9e185e80ce
(control-service) Add timestamp to file storages.
2023-09-02 14:01:04 +02:00
Viktor Lofgren
676e7c7947
(keywords) Add Serializable properties that went missing as the record became a class
2023-09-02 09:52:01 +02:00
Viktor Lofgren
04212b2cef
(btree) Add more consistent asserts on sortedness
2023-09-01 15:45:02 +02:00
Viktor Lofgren
bafc2a1f30
(reverse-index) Force() final docs after being written
...
Unlikely to be a problem, but we want to ensure it's on disk before we go read it later.
2023-09-01 15:43:53 +02:00
Viktor Lofgren
563e388a45
(reverse-index) Fix parallel documents sorting bug
...
Bug was caused by parallel sorting capturing the iterator rather than the offsets to sort.
2023-09-01 15:42:45 +02:00
Viktor Lofgren
d31d8ec5b0
(index) Log keyword ids in hex format
2023-09-01 15:40:24 +02:00
Viktor Lofgren
2b00cd632d
(process) Propagate environment JVM params to the index constructor
2023-09-01 15:39:42 +02:00
Viktor Lofgren
5f427d2b4c
(keywords) Clean up leaky abstractions, clean up tests
2023-09-01 13:52:00 +02:00
Viktor Lofgren
8c0ce4fc1d
(index journal; minor) Clean up
2023-09-01 11:32:24 +02:00
Viktor Lofgren
10a74f45ea
(index journal; minor) Even cleaner separation of concerns.
2023-09-01 11:28:02 +02:00
Viktor Lofgren
320dad7f1a
(index journal) Fix leaky abstraction in IndexJournalReader.
...
The caller shouldn't be required to know the on-disk layout of the file to make use of the data in a performant way.
2023-09-01 11:18:13 +02:00
Viktor Lofgren
88ac72c8eb
(journal/reverse index) Working WIP fix for over-allocation of documents
2023-08-31 20:16:02 +02:00
Viktor Lofgren
f74b9df0a7
(array) Don't use paging arrays when mapping small files for writing
2023-08-31 20:15:10 +02:00
Viktor Lofgren
a6f1335375
(loader) Fix bug where the loader would omit some meta and words.
2023-08-31 17:48:43 +02:00
Viktor Lofgren
f321fa5ad3
(array) Override to Paging...Array$range()
...
This is a big performance boost in array.range().get().
Without an override, each access will go through pages[page].get(...) for each get()-operation. This adds up very quickly. BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
2023-08-31 13:52:29 +02:00
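The effect described in the commit above can be sketched as follows. This is an illustration under assumptions, not the project's code: LongArr, Paged and PagedRangeSketch are hypothetical names, and plain long[] pages stand in for whatever backs the real paging arrays. The generic range() forwards every get() to the parent, which repeats the page lookup each time; the override returns a view directly on one page when the range fits inside it.

```java
final class PagedRangeSketch {

    interface LongArr {
        long get(long idx);
        long size();

        // Generic fallback: every get() on the view goes back through the parent.
        default LongArr range(long start, long end) {
            LongArr parent = this;
            return new LongArr() {
                public long get(long idx) { return parent.get(start + idx); }
                public long size() { return end - start; }
            };
        }
    }

    static final class Paged implements LongArr {
        private final long[][] pages;
        private final int pageSize;
        private final long size;

        Paged(long size, int pageSize) {
            this.size = size;
            this.pageSize = pageSize;
            int pageCount = (int) ((size + pageSize - 1) / pageSize);
            this.pages = new long[pageCount][pageSize];
        }

        // Each access pays for a page lookup plus an offset calculation.
        public long get(long idx) {
            return pages[(int) (idx / pageSize)][(int) (idx % pageSize)];
        }

        void set(long idx, long value) {
            pages[(int) (idx / pageSize)][(int) (idx % pageSize)] = value;
        }

        public long size() { return size; }

        // Override: if the range lies within a single page, hand out a view
        // on that page so get() no longer repeats the page lookup.
        @Override
        public LongArr range(long start, long end) {
            int firstPage = (int) (start / pageSize);
            int lastPage = (int) ((end - 1) / pageSize);
            if (end > start && firstPage == lastPage) {
                long[] page = pages[firstPage];
                int offset = (int) (start % pageSize);
                return new LongArr() {
                    public long get(long idx) { return page[offset + (int) idx]; }
                    public long size() { return end - start; }
                };
            }
            return LongArr.super.range(start, end); // spans pages: use the fallback
        }
    }

    public static void main(String[] args) {
        Paged arr = new Paged(10, 4);
        for (long i = 0; i < 10; i++) arr.set(i, i * 10);
        LongArr fast = arr.range(5, 8); // fits in one page
        LongArr slow = arr.range(2, 7); // crosses a page boundary
        System.out.println(fast.get(0) + " " + slow.get(0));
    }
}
```

Both paths return the same values; the difference is only in how much work each get() does, which matters when a reader performs many get() calls on a range()'d array.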
Viktor Lofgren
03d999444d
(ldb) Re-add accidentally removed stmt.addBatch that breaks
2023-08-31 12:06:30 +02:00
Viktor Lofgren
763ed260c3
(ldb) Better handling of null pubYear
2023-08-30 23:08:27 +02:00
Viktor Lofgren
764e7d1315
(index) Add more comprehensive integration tests for the index service.
2023-08-30 10:37:24 +02:00
Viktor Lofgren
048f685073
(ldb) add OR IGNORE to insert status query
...
Otherwise it will sometimes fail because documents may appear more than once in error scenarios.
2023-08-30 10:34:01 +02:00
Viktor Lofgren
e4d7958379
(control) ProcessLivenessMonitorActor shouldn't reap tasks based on service instance liveness
2023-08-29 18:19:04 +02:00
Viktor Lofgren
3f288e264b
(minor) Clean up dead endpoints
2023-08-29 17:04:54 +02:00
Viktor Lofgren
dd593c292c
(loader) Minor optimizations and bugfixes.
...
* Reduce memory churn in LoaderIndexJournalWriter, fix bug with keyword mappings as well
* Remove remains of OldDomains
* Ensure LOADER_PROCESS_OPTS gets fed to the processes
* LinkdbStatusWriter won't execute batch after each added item post 100 items
2023-08-29 15:37:52 +02:00
Viktor Lofgren
39c1857c61
(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.
2023-08-29 13:07:55 +02:00
Viktor Lofgren
c57a2d0dc3
(control-service) Remove old index journal files when restoring a backup.
2023-08-29 11:58:01 +02:00
Viktor Lofgren
a2e6616100
(index-reverse) Add documentation and clean up code.
2023-08-29 11:35:54 +02:00
Viktor Lofgren
ba4513e82c
(loader) Revert accidental experimental changes that slipped by in an earlier commit
2023-08-28 19:54:56 +02:00
Viktor Lofgren
6525b16e1f
(minor) Improved logging and error messages
2023-08-28 19:53:55 +02:00
Viktor Lofgren
b6a92506d1
(index) Hook in missing DocIdRewriter
...
This enables documents to be ranked properly.
2023-08-28 19:53:43 +02:00
Viktor Lofgren
ffa0366deb
(minor) Fix typo in ActorStateMachine's logging
2023-08-28 16:11:52 +02:00
Viktor Lofgren
00c4686ef0
(reverse-index) Fix over-allocation of the count array in merging
2023-08-28 14:36:28 +02:00
Viktor Lofgren
3101b74580
(index) Move to a lexicon-free index design
...
This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader.
The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.
It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
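The idea of replacing a lexicon lookup with a hash-derived identifier, as described in the commit above, can be sketched like this. The commit only says "murmur hash"; the specific variant (a function modeled on MurmurHash64A), the seed of 0, the UTF-8 encoding, and the names WordIdSketch, murmur64 and wordId are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

final class WordIdSketch {
    private static final long M = 0xc6a4a7935bd1e995L;
    private static final int R = 47;

    // 64 bit hash modeled on MurmurHash64A (assumed variant, see lead-in).
    static long murmur64(byte[] data, long seed) {
        long h = seed ^ (data.length * M);

        // Mix in the input eight bytes at a time, little-endian.
        int blocks = data.length / 8;
        for (int i = 0; i < blocks; i++) {
            long k = 0;
            for (int b = 7; b >= 0; b--) {
                k = (k << 8) | (data[i * 8 + b] & 0xffL);
            }
            k *= M;
            k ^= k >>> R;
            k *= M;
            h ^= k;
            h *= M;
        }

        // Mix in the remaining 1..7 bytes, if any.
        int tail = blocks * 8;
        int rem = data.length - tail;
        if (rem > 0) {
            long t = 0;
            for (int b = rem - 1; b >= 0; b--) {
                t = (t << 8) | (data[tail + b] & 0xffL);
            }
            h ^= t;
            h *= M;
        }

        // Final avalanche.
        h ^= h >>> R;
        h *= M;
        h ^= h >>> R;
        return h;
    }

    // The word id is a pure function of the keyword: no shared table is
    // needed, so independent processes agree on ids without coordination.
    static long wordId(String keyword) {
        return murmur64(keyword.getBytes(StandardCharsets.UTF_8), 0L);
    }

    public static void main(String[] args) {
        for (String word : new String[] {"hello", "world", "hello"}) {
            System.out.println(word + " -> " + Long.toHexString(wordId(word)));
        }
    }
}
```

The trade-off is that a hash can in principle collide where a lexicon cannot, in exchange for not having to keep a word-to-id table in memory.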
Viktor Lofgren
194a6057dd
(index,control) Recoverable index backups
2023-08-25 14:57:43 +02:00
Viktor Lofgren
e710e057e2
(db) Remove EC_URL and EC_PAGE_DATA from mariadb database
2023-08-25 13:45:03 +02:00
Viktor Lofgren
28188a6e59
(control) Simplify ConvertAndLoadActor
2023-08-25 13:30:20 +02:00
Viktor Lofgren
70a5df96c8
(control) Display progress of process tasks
2023-08-25 13:05:21 +02:00
Viktor Lofgren
460998d512
(index) Move index construction to separate process.
...
This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D
2023-08-25 12:52:54 +02:00
Viktor Lofgren
e741301417
(search) Remove endpoint flush-search-caches
...
It's not necessary anymore with the new linkdb.
2023-08-25 09:51:06 +02:00
Viktor Lofgren
5ed5298409
(converter) Update confusing state description
...
SWAP_LEXICON doesn't instruct the index service to do anything. It just moves the file.
2023-08-24 18:56:49 +02:00