The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.
Previously this was the responsibility of the caller, which lead to the possibility of passing in improperly prepared buffers and receiving bad outcome
Btree index adds overhead and disk space and doesn't fill any function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again and Charset.forName(...) can be surprisingly expensive and its built-in caching strategy, which just caches the 2 last values seen doesn't cope well with how we're hitting it with a wide array of random charsets
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.
Replacing with a fixed pointer alias that can be repositioned to the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step and move to copying the buffer directly instead.
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory.
The change also adds a JVM parameter that makes it shut up about native access.
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
Add a new rule that crates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
The reason why retain is so fast due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it)
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.
Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.
The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.