Commit Graph

45 Commits

Author SHA1 Message Date
Viktor Lofgren
ab4e2b222e (array) Fix broken benchmarks 2024-05-18 13:41:24 +02:00
Viktor Lofgren
19163fa883 (array) Clean up the Array library
IntArray gets the YAGNI axe.   The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot.   Removing the latter, as all it ever did was clutter up the codebase and add technical debt.  If we need int arrays, we fork LongArray again (or add int capabilities to it)

Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.

Finally adding sz=2 specializations to the quick- and insertion sort algorithms.  It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Viktor Lofgren
650f3843bb (array) Clean up search function jungle
Retire search functions that weren't used, including the native implementations.  Drop confusing suffixes on search function names.  Search functions no longer encode search misses as negative values.

Replaced binary search function with a branchless version that is much faster.

Cleaned up benchmark code.
2024-05-17 14:31:02 +02:00
Viktor Lofgren
9e766bc056 (array) Clean up search function jungle
Retire search functions that weren't used, including the native implementations.  Drop confusing suffixes on search function names.  Search functions no longer encode search misses as negative values.

Replaced binary search function with a branchless version that is much faster.

Cleaned up benchmark code.
2024-05-17 14:30:06 +02:00
Viktor Lofgren
48aff52e00 (array) Increase LongArray on-heap alignment to 16 bytes
This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.
2024-05-16 19:12:36 +02:00
Viktor Lofgren
9d7616317e (array) Clean up native code a bit 2024-05-16 14:47:10 +02:00
Viktor Lofgren
f48cf77c4d (array, experimental) Add benchmark results for quicksort 2024-05-14 18:15:30 +02:00
Viktor Lofgren
3549be216f (array, experimental) Documentation for native algos 2024-05-14 17:43:05 +02:00
Viktor Lofgren
55a7c1db00 (array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java 2024-05-14 12:54:14 +02:00
Viktor Lofgren
4668b1ddcb (build) Java 22 and its consequences has been a disaster for Marginalia Search
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle

The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
deaba0152d (index) Explicitly free LongQueryBuffers 2024-04-16 19:23:00 +02:00
Viktor Lofgren
ae7c760772 (index) Clean up new index query code 2024-04-05 13:30:49 +02:00
Viktor Lofgren
81815f3e0a (qs, index) New query model integrated with index service.
Seems to work, tests are green and initial testing finds no errors.  Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-04 20:17:58 +02:00
Viktor Lofgren
002afca1c5 (sys) Upgrade to JDK22
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
e696fd9e92 (docs) Begin un-fucking the docs after refactoring 2024-02-27 21:22:21 +01:00
Viktor Lofgren
67aa20ea2c (array) Attempting to debug strange errors 2024-02-27 21:22:18 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00
Viktor Lofgren
0307c55f9f (refac) Zookeeper for service-discovery, kill service-client lib (WIP)
To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added.

A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything.

The last remaining REST service, the assistant-service, has been migrated to gRPC.

This also proved a good time to clear out primordial technical debt from the root of the codebase.  The 'service-client' library has been taken behind the barn and given a last farewell.  It's replaced by a small library for managing gRPC channels.

Since it's no longer used by anything, RxJava has been removed as a dependency from the project.

Although the current state seems reasonably stable, this is a work-in-progress commit.
2024-02-20 11:41:14 +01:00
Viktor Lofgren
300b1a1b84 (index-query) Add some tests for the QueryFilter code 2024-02-15 12:03:30 +01:00
Viktor Lofgren
6c3b49417f (index-query) Improve documentation and code quality 2024-02-15 11:33:50 +01:00
Viktor Lofgren
95d1bd98e4 (array) Update documentation, make unsafe configurable
The readme for the array library was extremely out of date.  Updating it with accurate information about how the library works, and a demo that should compile.

Also added a system property for disabling the use of sun.misc.Unsafe.
2024-02-07 12:26:47 +01:00
Viktor Lofgren
d986f90074 (index) Fix consistency between RandomFileAssembler implementations
The RandomFileAssembler implementations, introduced in commit 53c575db3f were all acting subtly differently.  The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code), further the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.

A test was built to ensure the output of these implementations is equivalent.
2024-02-05 21:01:32 +01:00
Viktor Lofgren
400f4840ad (*) Fix broken code in jmh 2024-01-23 17:08:21 +01:00
Viktor Lofgren
1eb0adf6d3 (array) Add sun.misc.Unsafe variant of LongArray 2024-01-22 13:38:42 +01:00
Viktor Lofgren
6a1bfd6270 (array) Remove unused 'madvise' code and 3rd party dependency on 'uppend'
This wasn't actually hooked in anywhere.  Removing the dependency and code.  If it turns out we need madvise in the future, we'll re-introducde it.
2024-01-22 13:01:57 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentations to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
f613f4f2df (array) Fix spurious search results
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.

It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
2023-10-26 15:27:02 +02:00
Viktor
8e1abc3f10
(index-reverse) Parallel construction of the reverse indexes. (#52)
* (index-reverse) Parallel construction of the reverse indexes.

* (array) Remove wasteful calculation of numDistinct before merging two sorted arrays.

* (index-reverse)  Force changes to disk on close, reduce logging.

* (index-reverse)  Clean up merging process and add back logging

* (run)  Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM

* (index-reverse)  Better logging during processing

* (array) 2GB+ compatible write() function

* (array) 2GB+ compatible write() function

* (index-reverse) We are logging like Bolsonaro and I will not have it.

* (reverse-index) Self-diagnostics

* (btree) Fix bug in btree reader to do with large data sizes
2023-10-07 10:00:00 +02:00
Viktor Lofgren
f6e9ef6de9 (array) Fix transferFrom() so it survives larger than 2 GB transfers 2023-10-04 13:57:36 +02:00
Viktor Lofgren
40768e935b (test) Removing /tmp-guardrails as it doesn't hold in CI 2023-10-02 16:52:59 +02:00
Viktor Lofgren
cd12f49fc0 (long-array) Return slices SegmentLongArray of itself for range() &c 2023-09-24 11:31:54 +02:00
Viktor Lofgren
d0aa754252 (long-array) Implement java.lang.foreign.Arena based lifecycle control for LongArray.
Further de-ByteBuffer:ing of these classes is to be done, but this is the smallest most urgently needed benefit.

This commit is a WIP but in a fully working state, pushing due to the importance of the changes to offer lifecycle control over mmaps.
2023-09-24 10:40:06 +02:00
Viktor Lofgren
dbe9235f3a (*) Upgrade to JDK21 with preview enabled.
... also move some common configuration into the root build.gradle-file.

Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work.  This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.
2023-09-24 10:38:59 +02:00
Viktor Lofgren
f74b9df0a7 (array) Don't use paging arrays when mapping small files for writing 2023-08-31 20:15:10 +02:00
Viktor Lofgren
f321fa5ad3 (array) Override to Paging...Array$range()
This is a big performance boost in array.range().get().

Without an override, each access will go through pages[page].get(...) for each get()-operation.  This adds up very quickly.  BTreeReader does a bunch of get():s on a range()'d array during traversal in the queryData... methods.
2023-08-31 13:52:29 +02:00
Viktor Lofgren
3101b74580 (index) Move to a lexicon-free index design
This is a system-wide change.  The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table.   This made index-construction easier, but it
also added a fairly significant RAM penalty to both the index service and the loader.

The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices.

It also became necessary half-way through to upgrade guice as its error reporting wasn't *quite* compatible with JDK20.
2023-08-28 14:02:23 +02:00
Viktor Lofgren
aa0d256d6a Upgrade code to Java 20.
* Change language version
* Upgrade Lombok to a JDK20 compatible version
2023-08-23 13:37:49 +00:00
Viktor Lofgren
4e9e79454f Fix broken transformation functions in the PagingArray classes. 2023-05-28 13:31:05 +02:00
Viktor Lofgren
b0bc07b4e7 Insertion sort was *super* busted I don't even know how it worked 2023-05-28 12:17:50 +02:00
Viktor Lofgren
6814c90625 Fix N-width sorting bug 2023-05-28 11:57:06 +02:00
Viktor
96bac70b85
Tools for merging sorted lists, and merging btrees. (#14)
* Utilities for merging BTrees of entity size 1 and 2.
* Isolate and clean up sorting algorithms. 
* Functions for keeping distinct items in a LongArray
2023-04-20 15:28:09 +02:00
Viktor
1b9ae7b42d
Update readme.md 2023-03-21 16:38:39 +01:00
Viktor Lofgren
1bb1248ab0 Optimize array library, jmh benchmarks. 2023-03-21 16:02:31 +01:00
Viktor Lofgren
616effdb3c The refactoring will continue until morale improves. 2023-03-12 10:04:48 +01:00
Viktor Lofgren
ad1be7c835 Move all code to a code directory. 2023-03-07 17:14:32 +01:00