The first change restores parallel index construction, which is how it previously worked; it had been switched to run sequentially to see how that would affect performance. Performance got worse, so the change is reverted.
However, it's been noted that sorting in parallel is likely not a good idea, as it leads to a lot of I/O thrashing, so sorting is changed to run sequentially.
This lets the slop library be stand-alone without dependence on coded-sequence.
The change also gets rid of the vestigial seek() method in ColumnReader.
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and checks, when they are closed, that they are in sync.
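To illustrate, here is a minimal sketch of both the failure mode and the kind of close-time check SlopTable performs. The column API below is hypothetical and heavily simplified; it is not the library's actual signatures.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified column API for illustration only.
interface Column extends AutoCloseable {
    long position();              // current row index
    void close() throws IOException;
}

// The failure mode: a conditional read that advances one column but
// not the other leaves the readers on different rows, e.g.
//
//   if (flags.get() != 0)
//       values.get();            // read only sometimes...
//   // BUG: no values.skip() in the other branch; from here on every
//   // flag is paired with the value of some earlier row.
//
// A SlopTable-style lifecycle check catches this at close time: all
// registered columns must sit at the same row index.
class Table implements AutoCloseable {
    private final List<Column> columns = new ArrayList<>();

    <T extends Column> T register(T column) {
        columns.add(column);
        return column;
    }

    @Override
    public void close() throws IOException {
        long expected = columns.isEmpty() ? 0 : columns.get(0).position();
        for (Column col : columns) {
            if (col.position() != expected)
                throw new IllegalStateException("Slop columns out of sync");
            col.close();
        }
    }
}
```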
Refactoring keyword extraction to also extract span information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a Katamari Damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
Fix a rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent decompression errors if takeWhileZero happened at a 64-bit boundary while the underlying buffer was empty.
The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.
Finally, assertions are added checking the invariants of the gamma and delta coding logic, as well as UrlIdCodec, to detect issues earlier.
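For context, Elias gamma codes a positive integer as a run of zeros whose length gives the width of the payload that follows; counting that run is exactly what takeWhileZero does. A toy version over a BitSet (the real coder works on 64-bit words, which is where the boundary bug lived):

```java
import java.util.BitSet;

class EliasGamma {
    // Append the gamma code for n >= 1 at bit position pos; returns the new position.
    static int write(BitSet out, int pos, long n) {
        int k = 63 - Long.numberOfLeadingZeros(n);   // floor(log2 n)
        pos += k;                                    // k leading zeros (BitSet bits default to 0)
        for (int i = k; i >= 0; i--)                 // k+1 payload bits, msb (always 1) first
            out.set(pos++, ((n >>> i) & 1) != 0);
        return pos;
    }

    // Decode one value starting at pos[0], advancing pos[0] past it.
    static long read(BitSet in, int[] pos) {
        int k = 0;
        while (!in.get(pos[0])) { k++; pos[0]++; }   // the takeWhileZero step
        long n = 0;
        for (int i = 0; i <= k; i++)
            n = (n << 1) | (in.get(pos[0]++) ? 1 : 0);
        return n;
    }
}
```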
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous arrangement was driving complexity in the code that generates these objects, and the consumer side puts all this data in the same object regardless.
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.
The change also cleans out several parameters that no longer served any function.
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
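Read as a decoder, the scheme might look roughly like the sketch below. The BitReader interface, the raw field widths, and the final id packing are all assumptions for illustration; only the tag layout comes from the schema above.

```java
// Illustrative decoder shape; BitReader is a stand-in, not the project's
// actual reader.
interface BitReader {
    long get(int bits);   // next n raw bits
    long getGamma();      // one Elias gamma-coded value
    long getDelta();      // one Elias delta-coded value
}

class PrioDocIdsDecoder {
    long rank, domainId, docOrd;

    long next(BitReader r) {
        switch ((int) r.get(2)) {                 // 2-bit tag
            case 0b00 -> docOrd += r.getGamma();  // same domain: doc ordinal diff
            case 0b01 -> {                        // same rank: new domain
                domainId += r.getDelta();
                docOrd = r.getDelta() - 1;        // stored as 1 + docord
            }
            case 0b10 -> {                        // new rank, raw ids
                rank = r.getGamma();
                domainId = r.get(30);             // assumed raw widths
                docOrd = r.get(26);
            }
            case 0b11 -> {                        // size header + raw doc id
                r.get(30);
                return r.get(61);
            }
        }
        return (rank << 56) | (domainId << 26) | docOrd;  // hypothetical 5+30+26 bit packing
    }
}
```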
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad outcomes.
The btree index adds overhead and disk space usage, and doesn't serve any function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size (sketched below)
* Update the reader to read the new format
* Add a test
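A sketch of what the size-prepending copy could look like with plain NIO channels; the project's actual IO transformer interface isn't shown here, and the 8-byte header is an assumption.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

class SizePrependingCopy {
    // Copy src into dest, prefixed with the payload size so the reader
    // knows how much compressed data follows.
    static void copy(FileChannel src, FileChannel dest) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(8);
        header.putLong(src.size()).flip();
        while (header.hasRemaining())
            dest.write(header);
        long pos = 0, size = src.size();
        while (pos < size)                        // transferTo may copy partially
            pos += src.transferTo(pos, size - pos, dest);
    }
}
```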
The term data iterator is quite hot and was performing buffer slice operations that were not necessary. These are replaced with a fixed pointer alias that can be repositioned over the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately unwrapped. This unnecessary step is removed in favor of copying the buffer directly.
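The idea, sketched with a plain ByteBuffer (the actual iterator works against the index's own buffer types): slicing allocates a new buffer object per term, while a single duplicated view can be repositioned for free.

```java
import java.nio.ByteBuffer;

class TermDataCursor {
    private final ByteBuffer alias;   // one reusable view over the backing data

    TermDataCursor(ByteBuffer data) {
        this.alias = data.duplicate();
    }

    // Reposition the alias over [offset, offset + len); no allocation,
    // unlike data.slice(offset, len), which creates a new buffer per call.
    ByteBuffer at(int offset, int len) {
        alias.limit(offset + len).position(offset);
        return alias;
    }
}
```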
IntArray gets the YAGNI axe. The array library had two implementations: one for longs, which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quicksort and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
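The shape of such a specialization, as a sketch: for two elements, a single compare-and-swap replaces the general loop, which gives the JIT something trivial to compile.

```java
class Sort {
    // Sorts a[lo, hi) ascending.
    static void insertionSort(long[] a, int lo, int hi) {
        int n = hi - lo;
        if (n <= 1)
            return;
        if (n == 2) {                         // sz=2 specialization
            if (a[lo] > a[lo + 1]) {
                long tmp = a[lo];
                a[lo] = a[lo + 1];
                a[lo + 1] = tmp;
            }
            return;
        }
        for (int i = lo + 1; i < hi; i++) {   // general case
            long v = a[i];
            int j = i - 1;
            while (j >= lo && a[j] > v) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = v;
        }
    }
}
```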
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK 22, across multiple vendors, and Gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
The change set cleans up the data model for the term-level data, which used to contain a bunch of fields with document-level metadata. This data duplication meant a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
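A guess at what tree aggregation of positions could look like: AND nodes intersect their children's position sets, OR nodes union them. The node types and the BitSet representation are illustrative stand-ins, not the project's actual classes.

```java
import java.util.BitSet;
import java.util.List;

sealed interface QueryNode permits Term, And, Or {}
record Term(BitSet positions) implements QueryNode {}  // positions of one search term
record And(List<QueryNode> children) implements QueryNode {}
record Or(List<QueryNode> children) implements QueryNode {}

class PositionAggregator {
    // Fold position sets up the query tree: intersect across AND,
    // union across OR.
    static BitSet aggregate(QueryNode node) {
        return switch (node) {
            case Term t -> (BitSet) t.positions().clone();
            case And a -> {
                BitSet acc = aggregate(a.children().get(0));
                for (int i = 1; i < a.children().size(); i++)
                    acc.and(aggregate(a.children().get(i)));
                yield acc;
            }
            case Or o -> {
                BitSet acc = new BitSet();
                for (QueryNode c : o.children())
                    acc.or(aggregate(c));
                yield acc;
            }
        };
    }
}
```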
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
Netty and gRPC by default spawn an incredible number of threads on high-core-count CPUs, which amounts to a fair bit of RAM usage.
Add custom executors that throttle this behavior.
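One way to do that with grpc-netty's public builder API; the thread counts below are arbitrary examples, not the values actually used.

```java
import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.nio.NioServerSocketChannel;

import java.util.concurrent.Executors;

class GrpcServerFactory {
    static Server create(int port) {
        return NettyServerBuilder.forPort(port)
                // cap the application executor (default is an unbounded cached pool)
                .executor(Executors.newFixedThreadPool(16))
                .bossEventLoopGroup(new NioEventLoopGroup(1))
                // cap Netty workers (default is 2 * available processors)
                .workerEventLoopGroup(new NioEventLoopGroup(4))
                .channelType(NioServerSocketChannel.class)  // needed with custom groups
                .build();
    }
}
```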
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.
While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's *modul*ar. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.