Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.
Refactoring keyword extraction to extract spans information.
Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
The CompressingStorageReader would incorrectly report having data when a file was empty. Preemptively attempting to fill the backing buffer fixes the behavior.
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers.
The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
To let up the pressure on domains with lot sof subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.
The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.
Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.
The change also cleans out several parameters that no longer filled any function.
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamam code. This is incorrect in this case, as we're able to provide a negative offset. Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.
Previously this was the responsibility of the caller, which lead to the possibility of passing in improperly prepared buffers and receiving bad outcome
Btree index adds overhead and disk space and doesn't fill any function for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again and Charset.forName(...) can be surprisingly expensive and its built-in caching strategy, which just caches the 2 last values seen doesn't cope well with how we're hitting it with a wide array of random charsets
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.
Replacing with a fixed pointer alias that can be repositioned to the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step and move to copying the buffer directly instead.