Commit Graph

2458 Commits

Author SHA1 Message Date
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410 (slop) Translate nulls to empty strings when passed to the StringColumnWriters. 2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469 (slop) Handle empty compressed files correctly
The CompressingStorageReader would incorrectly report having data when a file was empty.  Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac (slop) First commit of slop library
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
60ef826e07 (loader) Add heartbeat to update domain-ids step 2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e (loader) Add heartbeat to update domain-ids step 2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
7a1edc0880 (term-freq) Reduce the number of low-relevance words in the dictionary
Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.
2024-07-19 12:23:28 +02:00
Viktor Lofgren
b812e96c6d (language-processing) Select the appropriate language filter
The incorrect filter was selected based on the provided parameter, this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
d36055a2d0 (keyword-extractor) Retire TfIdfHigh WordFlag
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
Viktor Lofgren
0d227f3543 (cleanup) Remove next-prime library only used in tests 2024-07-17 13:48:03 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To let up the pressure on domains with lot sof subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429:s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
7eb955cc42 (setup) Change mirror for opennlp
Seems like the estointernet mirror no longer works.  Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
0b31c4cfbb (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
Viktor Lofgren
5c098005cc (index) Fix broken test
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
2024-07-16 12:37:59 +02:00
Viktor Lofgren
ae87e41cec (index) Fix rare BitReader.takeWhileZero bug
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer.  This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.

The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.

Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
dfd19b5eb9 (index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor
8ed5b51a32
Merge branch 'master' into term-positions 2024-07-15 07:05:31 +02:00
Viktor Lofgren
9d0e5dee02 Fix gitignore issue .so files not to be ignored correctly. 2024-07-15 05:18:10 +02:00
Viktor Lofgren
ffd970036d (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
Viktor Lofgren
fa162698c2 (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:15:30 +02:00
Viktor Lofgren
ad3857938d (search-api, ranking) Update with new ranking parameters
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.

The change also cleans out several parameters that no longer filled any function.
2024-07-15 04:49:40 +02:00
Viktor Lofgren
179a6002c2 (coded-sequence) Add a callback for re-filling underlying buffer 2024-07-12 23:50:28 +02:00
Viktor Lofgren
d28fc86956 (index-prio) Add fuzz test for prio index 2024-07-11 19:22:36 +02:00
Viktor Lofgren
6303977e9c (index-prio) Fail louder when size is 0 in PrioDocIdsTransformer
We can't deal with this scenario and should complain very loudly
2024-07-11 19:22:05 +02:00
Viktor Lofgren
97695693f2 (index-prio) Don't increment readItems counter when the output buffer is full
This behavior was causing the reader to sometimes discard trailing entries in the list.
2024-07-11 19:21:36 +02:00
Viktor Lofgren
1ab875a75d (test) Correcting flaky tests
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
31881874a9 (coded-sequence) Correct indicator of next-value
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamam code.  This is incorrect in this case, as we're able to provide a negative offset.  Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
f090f0101b (index-construction) Gather up preindex writes
Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
9881cac2da (index-reader) Correctly handle negative offset values
When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449 (index-reverse) Added compression to priority index
The priority index documents file can be trivially compressed to a large degree.

Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d (coded-sequence) Correct implementation of Elias gamma
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
2024-07-10 14:28:28 +02:00
Viktor Lofgren
ecfe17521a (coded-sequence) Correct implementation of Elias gamma
The implementation was incorrectly using 1 bit more than it should.  The change also adds a put method for Elias delta; and cleans up the interface a bit.
2024-07-09 17:28:21 +02:00
Viktor Lofgren
0d29e2a39d (index-reverse) Entry Sources reset() their LongQueryBuffer
Previously this was the responsibility of the caller, which lead to the possibility of passing in improperly prepared buffers and receiving bad outcome
2024-07-09 01:39:40 +02:00
Viktor Lofgren
12a2ab93db (actor) Improve error messages for convert-and-load
Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.
2024-07-08 19:19:30 +02:00
Viktor Lofgren
d90bd340bb (index-reverse) Removing btree indexes from prio documents file
Btree index adds overhead and disk space and doesn't fill any function for the prio index.

* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
2024-07-08 17:20:17 +02:00
Viktor Lofgren
21afe94096 (index-reverse) Don't use 128 bit merge function for prio index 2024-07-07 21:36:10 +02:00
Viktor Lofgren
fa36689597 (index-reverse) Simplify priority index
* Do not emit a documents file
* Do not interlace metadata or offsets with doc ids
2024-07-06 18:04:08 +02:00
Viktor Lofgren
85c99ae808 (index-reverse) Split index construction into separate packages for full and priority index 2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce (minor) Fix non-compiling test due to previous refactor 2024-07-06 15:11:43 +02:00
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
a6b03a66dc (crawl) Reduce Charset.forName() object churn
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again and Charset.forName(...) can be surprisingly expensive and its built-in caching strategy, which just caches the 2 last values seen doesn't cope well with how we're hitting it with a wide array of random charsets
2024-07-04 20:49:07 +02:00
Viktor Lofgren
d023e399d2 (index) Remove unnecessary allocations in journal reader
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.

Replacing with a fixed pointer alias that can be repositioned to the relevant data.

The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.

Removed this unnecessary step and move to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00