Commit Graph

2314 Commits

Author SHA1 Message Date
Viktor Lofgren
85c99ae808 (index-reverse) Split index construction into separate packages for full and priority index 2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce (minor) Fix non-compiling test due to previous refactor 2024-07-06 15:11:43 +02:00
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
a6b03a66dc (crawl) Reduce Charset.forName() object churn
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again. Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which only caches the last 2 values seen, doesn't cope well with how we're hitting it with a wide array of random charsets.
2024-07-04 20:49:07 +02:00
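A minimal sketch of the caching idea described in this commit message, not the project's actual code. The class name `CharsetCache`, the UTF-8 fallback, and the unbounded map are all assumptions made for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper: remembers every Charset it has resolved so that
// repeated lookups skip Charset.forName() entirely.
final class CharsetCache {
    private static final ConcurrentHashMap<String, Charset> CACHE = new ConcurrentHashMap<>();

    private CharsetCache() {}

    static Charset forName(String name) {
        if (name == null) {
            return StandardCharsets.UTF_8; // assumption: missing charset means UTF-8
        }
        return CACHE.computeIfAbsent(name, CharsetCache::lookup);
    }

    private static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // assumption: unknown or malformed names fall back to UTF-8,
            // and the fallback is cached too so bad names are not retried
            return StandardCharsets.UTF_8;
        }
    }
}
```

Since charset names arrive from crawled documents, a real implementation would probably want to bound the map's size; this sketch leaves that out.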
Viktor Lofgren
d023e399d2 (index) Remove unnecessary allocations in journal reader
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.

Replaced these with a fixed pointer alias that can be repositioned to the relevant data.

The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.

Removed this unnecessary step and moved to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00
Viktor Lofgren
e8ab1e14e0 (keyword-extraction) Update upper limit to number of positions per word
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
2024-07-02 20:52:32 +02:00
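A rough sketch of what a per-word cap on positions could look like, using the limit of 512 mentioned in this commit message. The class `PositionLimiter`, its fields, and the choice to count dropped positions instead of logging them are hypothetical and not taken from the project.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collector that keeps at most MAX_POSITIONS_PER_WORD positions
// for a single word and counts how many were truncated.
final class PositionLimiter {
    static final int MAX_POSITIONS_PER_WORD = 512; // value from the commit message

    private final List<Integer> positions = new ArrayList<>();
    private int dropped = 0;

    /** Returns true if the position was stored, false if it was truncated. */
    boolean add(int position) {
        if (positions.size() >= MAX_POSITIONS_PER_WORD) {
            dropped++; // a real implementation might log this event instead
            return false;
        }
        positions.add(position);
        return true;
    }

    List<Integer> positions() { return positions; }

    int dropped() { return dropped; }
}
```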
Viktor Lofgren
a6e15cb338 (keyword-extraction) Update upper limit to number of positions per word
100 was a bit too low, let's try 256.
2024-06-30 22:46:56 +02:00
Viktor Lofgren
4fbb863a10 (keyword-extraction) Add upper limit to number of positions per word
Also adding some logging for this event to get a feel for how big these lists get with realistic data.  To be cleaned up later.
2024-06-30 22:41:38 +02:00
Viktor Lofgren
6ee4d1eb90 (keyword) Increase the work area for position encoding
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
738e0e5fed (process) Add option for automatic profiling
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns.  By default, these are put in the log directory.

The change also adds a JVM parameter that makes it shut up about native access.
2024-06-27 13:58:36 +02:00
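As an illustration of the mechanism described in this commit message, the sketch below builds extra JVM arguments for a spawned process. Only the property name 'system.profile' comes from the commit; the class `ProfilingArgs`, the reading of the property as a boolean, the `.jfr` file naming, and the guess that the native-access parameter is `--enable-native-access=ALL-UNNAMED` are assumptions.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not the actual ProcessService code.
final class ProfilingArgs {
    private ProfilingArgs() {}

    static List<String> jvmArgs(String processName, Path logDir) {
        List<String> args = new ArrayList<>();

        // assumption: this is the flag used to silence native access warnings
        args.add("--enable-native-access=ALL-UNNAMED");

        // assumption: the property is treated as a boolean ("true" enables profiling)
        if (Boolean.getBoolean("system.profile")) {
            Path recording = logDir.resolve(processName + ".jfr");
            // standard HotSpot option that starts a JFR recording at JVM startup
            args.add("-XX:StartFlightRecording=filename=" + recording + ",dumponexit=true");
        }
        return args;
    }
}
```

With `-Dsystem.profile=true` set on the parent, each spawned process would then write its recording into the log directory when it exits.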
Viktor Lofgren
0e4dd3d76d (minor) Remove accidentally committed debug printf 2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb (log) Prevent tests from trying to log to file
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9 (minor) Tidy code 2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c (test) Add query parsing to IntegrationTest 2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181 (search-query) refac: Move query factory 2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57 (search-query) Fix end-inclusion bug in QWordGraphIterator 2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521 (search-query) Tidy up QueryGRPCService and IndexClient 2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480 (query) Tidy up code 2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94 (*) Trim the stopwords list
Having an overlong stopwords list leads to quoted terms not performing well.  For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0 (index) Implement working optional TermCoherences 2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771 (index) Correct TermCoherence requirements 2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8 (gamma) Fix readCount() behavior in EGC 2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f (index) Partial re-implementation of position constraints 2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616 (doc) Add readme.md for coded-sequence library
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
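For readers unfamiliar with the Elias gamma code this readme covers, here is a small self-contained sketch of the textbook scheme: a positive integer n is written as floor(log2 n) zero bits followed by n in binary. The class `EliasGamma` and its BitSet-backed layout are illustrative assumptions and do not reflect the coded-sequence library's API or on-disk format.

```java
import java.util.BitSet;

// Hypothetical textbook Elias gamma coder backed by a BitSet.
final class EliasGamma {
    private final BitSet bits = new BitSet();
    private int writePos = 0;
    private int readPos = 0;

    /** Appends one positive integer to the bit stream. */
    void put(int value) {
        if (value < 1) {
            throw new IllegalArgumentException("Elias gamma encodes positive integers only");
        }
        int n = 31 - Integer.numberOfLeadingZeros(value); // floor(log2(value))
        writePos += n; // n zero bits; unset BitSet bits already read as 0
        for (int i = n; i >= 0; i--) {
            bits.set(writePos++, ((value >>> i) & 1) == 1); // value, MSB first
        }
    }

    /** Reads the next integer from the bit stream. */
    int get() {
        if (readPos >= writePos) {
            throw new IllegalStateException("No more values to read");
        }
        int n = 0;
        while (!bits.get(readPos)) { // count the leading zero bits
            n++;
            readPos++;
        }
        int value = 0;
        for (int i = 0; i <= n; i++) { // then read n + 1 bits of payload
            value = (value << 1) | (bits.get(readPos++) ? 1 : 0);
        }
        return value;
    }

    /** Total number of bits written so far. */
    int bitLength() { return writePos; }
}
```

Small values are cheap (1 takes a single bit, 5 takes five bits), which is why the code suits sequences stored as small gaps between ascending values.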
Viktor Lofgren
40bca93884 (gamma) Minor clean-up 2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443 (journal) Fixing journal encoding
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721 (gamma) Correctly decode zero-length sequences 2024-06-24 13:11:41 +02:00
Viktor
69f88255e9 Merge pull request #101 from MarginaliaSearch/security-scan
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e Merge branch 'master' into security-scan 2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274 (run) Update install.sh with stronger warnings against non-docker install. 2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c (control) Fix warc data http status filter default value 2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107 (docs) Amend install instructions for non-docker install 2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d (docs) Update docs with clearer references to the full guide
The commit also mentions the non-docker install.
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9 Merge branch 'master' into security-scan
# Conflicts:
#	code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7 Merge pull request #102 from jaseemabid/jabid/macos-build
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0 Specify C++ standard in build command
The default C++ language standard on macOS is gnu++98, which won't build
this module.

Full error:

```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
    [](const p64x2& fst, const p64x2& snd) {
    ^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09 Don't track build files (libcpp.so) with git 2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846 (atags) Fix duckdb SQL injection
The input comes from the config file, so this isn't a very realistic threat vector, and even if it weren't, it's a query in an empty duckdb instance. Adding a validation check anyway to provide a better error message.
2024-06-12 09:05:57 +02:00
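One way such a validation check might look, assuming the configured value is a file path that gets embedded as a string literal in the SQL text. The class `SqlPathValidator`, the method name, and the rule of rejecting single quotes are hypothetical. The commit does not say what the actual check is.

```java
// Hypothetical validator: refuses values that could break out of a
// single-quoted SQL string literal, and fails with a readable message.
final class SqlPathValidator {
    private SqlPathValidator() {}

    /** Returns the path wrapped in single quotes, or throws if it is unsafe. */
    static String quotePath(String path) {
        if (path == null || path.isBlank()) {
            throw new IllegalArgumentException("Path must not be empty");
        }
        if (path.indexOf('\'') >= 0) {
            throw new IllegalArgumentException(
                    "Path may not contain single quotes: " + path);
        }
        return "'" + path + "'";
    }
}
```

Where the database driver supports it, binding the value as a prepared-statement parameter avoids the string interpolation altogether.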
Viktor Lofgren
801cf4b5da (search) Fix bad practice usage of innerHTML to set what should be text content. 2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d (build) Upgrade parquet dependencies to 1.14.0
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243 (loader) Correctly clamp document size 2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b (loader) Correctly load the positions column in the keyword projection 2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2 (index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.

The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d (index) Integrate positions file properly 2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f (index) Fix non-compiling tests 2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93 (array/cpp) Update gitignore to properly exclude libcpp.so 2024-06-06 13:06:08 +02:00