mirror of
https://github.com/MarginaliaSearch/MarginaliaSearch.git
synced 2025-02-24 05:18:58 +00:00
Viktor Lofgren
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it). Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs. Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
84 lines
3.0 KiB
Markdown
# Array Library

The array library offers easy allocation of large [memory-mapped files](https://en.wikipedia.org/wiki/Memory-mapped_file)
and off-heap memory, along with helper functions for accessing and using such memory.

Historically the library used ByteBuffers, but it has been updated to use the new [MemorySegment](https://openjdk.org/jeps/454)
API. By default, it uses sun.misc.Unsafe to access the memory, but it can be configured to use the MemorySegment access
methods instead by setting the system property `system.noSunMiscUnsafe` to true. This is quite a bit slower, but
use-after-free then results in a harmless exception rather than a SIGSEGV.
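
As a usage sketch, the property can be passed on the command line as `-Dsystem.noSunMiscUnsafe=true`, or set programmatically. The snippet below only exercises standard JDK calls; the class name `UnsafeToggleSketch` is hypothetical, and the idea that the flag has to be set before the array classes are first used is an assumption, not something this README states.

```java
final class UnsafeToggleSketch {
    // Property name taken from the paragraph above
    static final String NO_UNSAFE_PROPERTY = "system.noSunMiscUnsafe";

    /**
     * Requests the MemorySegment access path.
     * Assumption: this must run before the array library is first used.
     */
    static void preferMemorySegmentAccess() {
        System.setProperty(NO_UNSAFE_PROPERTY, "true");
    }

    /** True if the property is currently set to "true" (case-insensitive). */
    static boolean memorySegmentAccessRequested() {
        return Boolean.getBoolean(NO_UNSAFE_PROPERTY);
    }
}
```

Setting the property on the command line avoids any ordering concerns and is likely the safer choice for debugging a suspected use-after-free.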

Internally the array objects use Arena allocators to manage memory, and need to be closed to free the memory. Both
confined and shared memory can be allocated, as per the MemorySegment API.
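
The lifetime rules can be illustrated with the JDK API directly. This is a minimal sketch of plain `java.lang.foreign` usage, assuming Java 22 or later where the API is final; it does not use the array library, and `ArenaLifetimeSketch` is a hypothetical name.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

final class ArenaLifetimeSketch {
    /** Allocates n longs off-heap in a confined arena, writes 0..n-1, and returns their sum. */
    static long fillAndSum(int n) {
        // Arena.ofShared() would allow access from other threads instead
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate((long) n * Long.BYTES, Long.BYTES);
            for (int i = 0; i < n; i++) {
                segment.setAtIndex(ValueLayout.JAVA_LONG, i, (long) i);
            }
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += segment.getAtIndex(ValueLayout.JAVA_LONG, i);
            }
            return sum;
        } // closing the arena frees the memory
    }

    /** Reads a segment after its arena was closed; the MemorySegment API rejects this. */
    static boolean accessAfterCloseThrows() {
        MemorySegment leaked;
        try (Arena arena = Arena.ofConfined()) {
            leaked = arena.allocate(8L * Long.BYTES, Long.BYTES);
            leaked.setAtIndex(ValueLayout.JAVA_LONG, 0, 42L);
        }
        try {
            leaked.getAtIndex(ValueLayout.JAVA_LONG, 0);
            return false;
        } catch (IllegalStateException e) {
            return true; // an exception instead of a crash
        }
    }
}
```

The second method shows the safety property that checked MemorySegment access provides: touching freed memory fails with an `IllegalStateException`.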

Native code is used to implement some of the more performance-critical algorithms,
such as quicksort and binary search. These are available in the [cpp](cpp) subproject.
Java implementations are available as a fallback, but are somewhat slower.
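
For orientation, a range-restricted binary search in plain Java can be sketched as below. This is not the library's fallback code; `RangeSearchSketch` is a hypothetical name, it works on an on-heap `long[]` rather than off-heap memory, and the negative-on-miss return convention is borrowed from `java.util.Arrays.binarySearch` as an assumption.

```java
final class RangeSearchSketch {
    /**
     * Binary search for key in data[fromIndex, toIndex), which must be sorted ascending.
     * Returns the index of key if present, otherwise -(insertionPoint) - 1.
     */
    static int binarySearch(long[] data, int fromIndex, int toIndex, long key) {
        int low = fromIndex;
        int high = toIndex - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // unsigned shift avoids overflow on large indices
            long val = data[mid];
            if (val < key) {
                low = mid + 1;
            } else if (val > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1);
    }
}
```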

The library is implemented in a fairly unidiomatic way, using interfaces to accomplish diamond inheritance.

## Quick demo:

```java
try (var array = LongArrayFactory.mmapForWritingConfined(Path.of("/tmp/test"), 1<<16)) {
    array.transformEach(50, 1000, (pos, val) -> Long.hashCode(pos));
    array.quickSort(50, 1000);
    if (array.binarySearch(array.get(100), 50, 1000) >= 0) {
        System.out.println("Nevermind, I found it!");
    }

    array.range(50, 1000).fill(0, 950, 1);
    array.forEach(0, 100, (pos, val) -> {
        System.out.println(pos + ":" + val);
    });
}
```

## Query Buffers

The class [LongQueryBuffer](java/nu/marginalia/array/buffer/LongQueryBuffer.java) is used heavily in the search engine's query processing.

It is a dual-pointer buffer that offers tools for filtering data.

```java
LongQueryBuffer buffer = new LongQueryBuffer(1000);

// later ...

// Prepare the buffer for filling
buffer.reset();
fillBufferSomehow(buffer);

// length is updated and data is set
// read pointer and write pointer are now at 0

// A typical filtering operation may look like this:

while (buffer.hasMore()) { // read < end
    if (someCondition(buffer.currentValue())) {
        // copy the value pointed to by the read
        // pointer to the write pointer, and
        // advance both
        buffer.retainAndAdvance();
    }
    else {
        // advance the read pointer
        buffer.rejectAndAdvance();
    }
}

// set end to the write pointer, and
// reset the read and write pointers
buffer.finalizeFiltering();

// ... after this we can filter again, or
// consume the data
```
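
To make the pointer mechanics concrete, here is a self-contained sketch of a dual-pointer buffer over a plain `long[]`. It borrows the method names from the example above, but it is a hypothetical re-implementation for illustration (`DualPointerBufferSketch`, `copyData` and `filter` are invented names), not the actual `LongQueryBuffer`.

```java
import java.util.Arrays;
import java.util.function.LongPredicate;

final class DualPointerBufferSketch {
    private final long[] data;
    private int read = 0;   // next value to inspect
    private int write = 0;  // next slot for a retained value
    private int end;        // number of valid values

    DualPointerBufferSketch(long... values) {
        this.data = values.clone();
        this.end = values.length;
    }

    boolean hasMore() { return read < end; }

    long currentValue() { return data[read]; }

    /** Copies the value at the read pointer to the write pointer and advances both. */
    void retainAndAdvance() { data[write++] = data[read++]; }

    /** Skips the current value by advancing only the read pointer. */
    void rejectAndAdvance() { read++; }

    /** The retained prefix becomes the new contents; both pointers return to 0. */
    void finalizeFiltering() {
        end = write;
        read = 0;
        write = 0;
    }

    long[] copyData() { return Arrays.copyOf(data, end); }

    /** Runs one filtering pass and returns the values that were kept. */
    static long[] filter(long[] values, LongPredicate keep) {
        var buffer = new DualPointerBufferSketch(values);
        while (buffer.hasMore()) {
            if (keep.test(buffer.currentValue())) buffer.retainAndAdvance();
            else buffer.rejectAndAdvance();
        }
        buffer.finalizeFiltering();
        return buffer.copyData();
    }
}
```

Because the write pointer never overtakes the read pointer, the filtering happens in place and needs no second buffer, which is presumably why this layout suits repeated filtering passes.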

Especially noteworthy are the operations `retain()` and `reject()` in [LongArraySearch](java/nu/marginalia/array/algo/LongArraySearch.java).
They keep or remove all items in the buffer that exist in the referenced range of the array,
which must be sorted.

These are used to offer an intersection operation for the B-Tree with sub-linear run time.
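
The idea behind these two operations can be sketched with on-heap arrays and the JDK's own binary search. The class `RetainRejectSketch` and its signatures are hypothetical and do not mirror the real `LongArraySearch` API; the real implementation may also exploit ordering within the buffer, which this sketch does not.

```java
import java.util.Arrays;

final class RetainRejectSketch {
    /** Keeps buffer values found in sorted[from, to); returns the new length. */
    static int retain(long[] buffer, int length, long[] sorted, int from, int to) {
        return filter(buffer, length, sorted, from, to, true);
    }

    /** Drops buffer values found in sorted[from, to); returns the new length. */
    static int reject(long[] buffer, int length, long[] sorted, int from, int to) {
        return filter(buffer, length, sorted, from, to, false);
    }

    private static int filter(long[] buffer, int length, long[] sorted,
                              int from, int to, boolean keepIfPresent) {
        int write = 0;
        for (int read = 0; read < length; read++) {
            // Each lookup is logarithmic in the size of the sorted range
            boolean present = Arrays.binarySearch(sorted, from, to, buffer[read]) >= 0;
            if (present == keepIfPresent) {
                buffer[write++] = buffer[read]; // compact in place
            }
        }
        return write;
    }
}
```

With m buffer entries and a sorted range of n values, this does on the order of m log n comparisons, so the cost grows sub-linearly in n when the buffer is small relative to the array.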