MarginaliaSearch/code/libraries/array
Viktor Lofgren aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
cpp Merge branch 'master' into term-positions 2024-07-15 07:05:31 +02:00
java/nu/marginalia/array (index-reverse) Added compression to priority index 2024-07-11 16:13:23 +02:00
src/jmh/java/nu/marginalia/array/page (array) Fix broken benchmarks 2024-05-18 13:41:24 +02:00
test/nu/marginalia/array/algo (wip) Extract and encode spans data 2024-07-27 11:44:13 +02:00
build.gradle (wip) Extract and encode spans data 2024-07-27 11:44:13 +02:00
readme.md (array) Clean up the Array library 2024-05-18 13:23:06 +02:00

Array Library

The array library offers easy allocation of large memory mapped files and off-heap memory, along with helper functions for accessing and using such memory.

Historically this used ByteBuffers, but has been updated to use the new MemorySegment API. By default, it uses sun.misc.Unsafe to access the memory, but it can be configured to use the new MemorySegment access methods instead by setting the system property system.noSunMiscUnsafe to true. This is quite a bit slower, but use-after-free results in a harmless exception rather than a SIGSEGV.

Internally the array objects use Arena allocators to manage memory, and need to be closed to free the memory. Both confined and shared memory can be allocated, as per the MemorySegment API.
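
The lifecycle described above can be sketched with the JDK's java.lang.foreign API directly. This does not use the array library itself: the class ArenaSketch and its methods are hypothetical names made up for illustration, and the sketch assumes Java 22 or newer, where the API is final.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

// Hypothetical illustration class, not part of the array library
final class ArenaSketch {

    /** Allocates n longs off-heap in a confined arena, stores i*i at index i, and sums them. */
    static long sumOfSquares(int n) {
        // A confined arena may only be used by the thread that created it;
        // leaving the try block closes the arena and frees the memory
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate((long) n * Long.BYTES, Long.BYTES);
            for (int i = 0; i < n; i++) {
                segment.setAtIndex(ValueLayout.JAVA_LONG, i, (long) i * i);
            }
            long sum = 0;
            for (int i = 0; i < n; i++) {
                sum += segment.getAtIndex(ValueLayout.JAVA_LONG, i);
            }
            return sum;
        }
    }

    /** Returns true if reading a segment after its arena was closed throws an exception. */
    static boolean accessAfterCloseThrows() {
        MemorySegment segment;
        // A shared arena allows access from any thread
        try (Arena arena = Arena.ofShared()) {
            segment = arena.allocate(Long.BYTES, Long.BYTES);
            segment.set(ValueLayout.JAVA_LONG, 0, 42L);
        }
        try {
            // Use-after-free through the MemorySegment access methods
            segment.get(ValueLayout.JAVA_LONG, 0);
            return false;
        } catch (IllegalStateException e) {
            return true;
        }
    }
}
```

With the MemorySegment access methods, the read after close fails with an IllegalStateException, which is the harmless failure mode mentioned above. Raw access through sun.misc.Unsafe performs no such check.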

Native code is used to implement some of the more performance critical algorithms, such as quicksort and binary search. These are available in the cpp subproject. Java implementations are available as a fallback, but are somewhat slower.

The library is implemented in a fairly unidiomatic way using interfaces to accomplish diamond inheritance.
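
A minimal sketch of what diamond inheritance through interfaces can look like in Java is shown here. All type names (LongStore, Fillable, Summable, SketchArray, HeapSketchArray) are hypothetical and are not the library's actual types; the sketch only illustrates the general pattern of composing behavior from default methods.

```java
// Hypothetical base interface: the primitive operations every array must provide
interface LongStore {
    long get(long pos);
    void set(long pos, long value);
    long size();
}

// One "branch" of the diamond: adds bulk writes on top of the primitives
interface Fillable extends LongStore {
    default void fill(long start, long end, long value) {
        for (long i = start; i < end; i++) {
            set(i, value);
        }
    }
}

// The other branch: adds bulk reads on top of the same primitives
interface Summable extends LongStore {
    default long sum(long start, long end) {
        long total = 0;
        for (long i = start; i < end; i++) {
            total += get(i);
        }
        return total;
    }
}

// The diamond closes here: both parents share the LongStore base
interface SketchArray extends Fillable, Summable {}

// A concrete class implements only the primitives and inherits the rest
final class HeapSketchArray implements SketchArray {
    private final long[] data;

    HeapSketchArray(int size) {
        this.data = new long[size];
    }

    @Override public long get(long pos) { return data[(int) pos]; }
    @Override public void set(long pos, long value) { data[(int) pos] = value; }
    @Override public long size() { return data.length; }
}
```

The appeal of this arrangement is that a new backing store only has to supply get, set and size, while algorithms live in the interfaces. The cost is that it reads differently from conventional class hierarchies.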

Quick demo:

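// Memory-map /tmp/test for writing; closing the array frees the memory
try (var array = LongArrayFactory.mmapForWritingConfined(Path.of("/tmp/test"), 1<<16)) {
    // Overwrite each value in the range 50..1000 with a hash of its position
    array.transformEach(50, 1000, (pos, val) -> Long.hashCode(pos));

    // Sort the same range, then search it for a value known to be present
    array.quickSort(50, 1000);
    if (array.binarySearch(array.get(100), 50, 1000) >= 0) {
        System.out.println("Nevermind, I found it!");
    }

    // Fill positions 0..950 of the sub-range with ones
    array.range(50, 1000).fill(0, 950, 1);

    // Print position and value for positions 0..100
    array.forEach(0, 100, (pos, val) -> {
        System.out.println(pos + ":" + val);
    });
}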

Query Buffers

The class LongQueryBuffer is used heavily in the search engine's query processing.

It is a dual-pointer buffer that offers tools for filtering data.

LongQueryBuffer buffer = new LongQueryBuffer(1000);

// later ...

// Prepare the buffer for filling
buffer.reset();
fillBufferSomehow(buffer); 

// length is updated and data is set
// read pointer and write pointer are now at 0

// A typical filtering operation may look like this:
        
while (buffer.hasMore()) { // read < end
    if (someCondition(buffer.currentValue())) {
        // copy the value pointed to by the read
        // pointer to the write pointer, and
        // advance both
        buffer.retainAndAdvance();
    }
    else {
        // advance the read pointer
        buffer.rejectAndAdvance();
    }
}

// sets end to the write pointer, and
// resets the read and write pointers
buffer.finalizeFiltering();

// ... after this we can filter again, or
// consume the data
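
The pointer mechanics in the comments above can be made concrete with a small self-contained sketch. FilterBufferSketch is a hypothetical stand-in written for illustration. It borrows the method names from the example but is not the library's LongQueryBuffer, whose internals may differ.

```java
import java.util.Arrays;
import java.util.function.LongPredicate;

// Hypothetical stand-in for a dual-pointer filtering buffer
final class FilterBufferSketch {
    private final long[] data;
    private int read = 0;   // next value to inspect
    private int write = 0;  // next slot for a retained value
    private int end;        // number of valid values

    FilterBufferSketch(long... values) {
        this.data = values.clone();
        this.end = values.length;
    }

    boolean hasMore() { return read < end; }

    long currentValue() { return data[read]; }

    // Copy the value at the read pointer to the write pointer, advance both
    void retainAndAdvance() { data[write++] = data[read++]; }

    // Skip the value at the read pointer
    void rejectAndAdvance() { read++; }

    // The retained values now occupy [0, write); make that the new valid range
    void finalizeFiltering() {
        end = write;
        read = 0;
        write = 0;
    }

    // One full filtering pass, in place, without allocating
    void filter(LongPredicate keep) {
        while (hasMore()) {
            if (keep.test(currentValue())) {
                retainAndAdvance();
            } else {
                rejectAndAdvance();
            }
        }
        finalizeFiltering();
    }

    long[] contents() { return Arrays.copyOf(data, end); }
}
```

Because the write pointer never overtakes the read pointer, retained values can be compacted toward the front of the same array, so repeated filter passes reuse one allocation.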

Especially noteworthy are the operations retain() and reject() in LongArraySearch. They keep or remove, respectively, all items in the buffer that exist in the referenced range of the array, which must be sorted.

These are used to offer an intersection operation for the B-Tree with sub-linear run time.
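
As a rough sketch of the idea, retain and reject can be expressed over plain long[] arrays using the JDK's Arrays.binarySearch. The class RetainRejectSketch and its signatures are hypothetical, and the library's actual implementation may use a different search strategy. The point is only that each lookup is logarithmic in the size of the sorted range, so the range is never scanned in full.

```java
import java.util.Arrays;

// Hypothetical illustration, not the library's LongArraySearch
final class RetainRejectSketch {

    /**
     * Keeps the values among buffer[0, length) that occur in sorted[from, to).
     * Survivors are compacted to the front of buffer; returns the new length.
     */
    static int retain(long[] buffer, int length, long[] sorted, int from, int to) {
        int write = 0;
        for (int read = 0; read < length; read++) {
            if (Arrays.binarySearch(sorted, from, to, buffer[read]) >= 0) {
                buffer[write++] = buffer[read];
            }
        }
        return write;
    }

    /**
     * Removes the values among buffer[0, length) that occur in sorted[from, to).
     * Survivors are compacted to the front of buffer; returns the new length.
     */
    static int reject(long[] buffer, int length, long[] sorted, int from, int to) {
        int write = 0;
        for (int read = 0; read < length; read++) {
            if (Arrays.binarySearch(sorted, from, to, buffer[read]) < 0) {
                buffer[write++] = buffer[read];
            }
        }
        return write;
    }
}
```

For a buffer of n values and a sorted range of m values, this costs on the order of n log m comparisons, which is what makes intersecting a small buffer with a large sorted range cheap.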