Viktor Lofgren
a6e15cb338
(keyword-extraction) Update upper limit to number of positions per word
...
100 was a bit too low, let's try 256.
2024-06-30 22:46:56 +02:00
Viktor Lofgren
4fbb863a10
(keyword-extraction) Add upper limit to number of positions per word
...
Also adding some logging for this event to get a feel for how big these lists get with realistic data. To be cleaned up later.
2024-06-30 22:41:38 +02:00
Viktor Lofgren
6ee4d1eb90
(keyword) Increase the work area for position encoding
...
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
738e0e5fed
(process) Add option for automatic profiling
...
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory.
The change also adds a JVM parameter that makes it shut up about native access.
2024-06-27 13:58:36 +02:00
Viktor Lofgren
0e4dd3d76d
(minor) Remove accidentally committed debug printf
2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb
(log) Prevent tests from trying to log to file
...
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9
(minor) Tidy code
2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c
(test) Add query parsing to IntegrationTest
2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181
(search-query) refac: Move query factory
2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57
(search-query) Fix end-inclusion bug in QWordGraphIterator
2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521
(search-query) Tidy up QueryGRPCService and IndexClient
2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480
(query) Tidy up code
2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94
(*) Trim the stopwords list
...
Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0
(index) Implement working optional TermCoherences
2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771
(index) Correct TermCoherence requirements
2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8
(gamma) Fix readCount() behavior in EGC
2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f
(index) Partial re-implementation of position constraints
2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616
(doc) Add readme.md for coded-sequence library
...
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
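For readers unfamiliar with the Elias gamma code mentioned above, here is a minimal sketch of the coding scheme itself. It is not the coded-sequence library's API: the class and method names are hypothetical, and bits are kept in a `String` purely for readability, where a real implementation would pack them into a buffer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of Elias gamma coding; not the library's actual API.
final class EliasGammaSketch {
    /** Encode a positive integer: (bit length - 1) zeros, then the value in binary. */
    static String encode(int n) {
        if (n < 1) throw new IllegalArgumentException("Elias gamma only codes positive integers");
        String binary = Integer.toBinaryString(n);        // always starts with a 1 bit
        return "0".repeat(binary.length() - 1) + binary;  // unary length prefix + value
    }

    /** Decode a concatenation of gamma codes back into the original integers. */
    static List<Integer> decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos + zeros) == '0') zeros++;  // count the length prefix
            int end = pos + 2 * zeros + 1;                    // the value occupies zeros + 1 bits
            values.add(Integer.parseInt(bits.substring(pos + zeros, end), 2));
            pos = end;
        }
        return values;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int v : new int[] {1, 2, 5, 9}) sb.append(encode(v));
        System.out.println(sb + " -> " + decodeAll(sb.toString()));
    }
}
```

Small values get short codes, which is presumably why the scheme suits lists of small integers; zero cannot be coded directly, so a caller has to shift or delta-code its values first.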
Viktor Lofgren
40bca93884
(gamma) Minor clean-up
2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443
(journal) Fixing journal encoding
...
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721
(gamma) Correctly decode zero-length sequences
2024-06-24 13:11:41 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan
2024-06-17 13:18:25 +02:00
Viktor Lofgren
d0d6bb173c
(control) Fix warc data http status filter default value
2024-06-17 12:40:25 +02:00
Viktor Lofgren
90744433c9
Merge branch 'master' into security-scan
...
# Conflicts:
# code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Jaseem Abid
0dd14a4bd0
Specify C++ standard in build command
...
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09
Don't track build files (libcpp.so) with git
2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846
(atags) Fix duckdb SQL injection
...
The input comes from the config file, so this isn't a very realistic threat vector, and even if it weren't, it's a query against an empty duckdb instance; still, adding a validation check provides a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da
(search) Fix bad practice usage of innerHTML to set what should be text content.
2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d
(build) Upgrade parquet dependencies to 1.14.0
...
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243
(loader) Correctly clamp document size
2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b
(loader) Correctly load the positions column in the keyword projection
2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
...
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d
(index) Integrate positions file properly
2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f
(index) Fix non-compiling tests
2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93
(array/cpp) Update gitignore to properly exclude libcpp.so
2024-06-06 13:06:08 +02:00
Viktor Lofgren
4a8afa6b9f
(index, WIP) Position data partially integrated with forward and reverse indexes.
...
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Sam Storment
9c06f446fb
(search) Styling tweaks. Make the filter button near the top-right corner a bit bigger so it's easier to press on mobile
2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67
(search) move data-has-js attribute from body to html element
2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6
Handle theming when JavaScript is disabled. Hide the theme select and fall back to the dark media query instead of the data-theme attribute
2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf
Add a theme select to the header that lets users toggle their theme independent of their OS theme
2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0
Merge remote-tracking branch 'origin/master'
2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
...
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
9b922af075
(converter) Amend existing modifications to use gamma coded positions lists
...
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c
(gamma) Implement a small library for Elias gamma coding an integer sequence
2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9
(keywords) Add position information to keywords
2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68
(converter) Add position information to serialized document data
...
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
a69ab311c7
(qword) Fix tests that broke due to stopword removal
2024-05-28 14:15:45 +02:00
Viktor Lofgren
6985ab762a
(query) Improve handling of stopwords in queries
2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b
(search) Update the no result text to request bug reports.
2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f
(query) Bugfix stopword issue
...
Add a new rule that creates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
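The rule described above can be sketched roughly as follows. This is a simplification under stated assumptions: the class and method names are hypothetical, the query is treated as a flat list of terms rather than the query graph the real code operates on, and all stopwords are dropped at once in the single alternative path.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: produce the strict path plus one path without stopwords.
final class StopwordPathSketch {
    static List<List<String>> paths(List<String> terms, Set<String> stopwords) {
        List<List<String>> result = new ArrayList<>();
        result.add(List.copyOf(terms));                 // path that requires every term

        List<String> withoutStopwords = new ArrayList<>(terms);
        withoutStopwords.removeIf(stopwords::contains);

        // Only add the alternative if it differs and still has something to search for
        if (!withoutStopwords.isEmpty() && withoutStopwords.size() < terms.size()) {
            result.add(List.copyOf(withoutStopwords));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(paths(List.of("the", "republic"), Set.of("a", "the")));
    }
}
```

A document then matches if it satisfies either path, so a stopword that is missing from the index no longer rules out every result.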
Viktor Lofgren
f83f777fff
(converter) Experimental support for searching by URL
...
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava-dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab
(registry) Fix broken test
2024-05-23 14:15:01 +02:00
Sam Storment
7543e98035
Merge branch 'MarginaliaSearch:master' into search-dark-theme
2024-05-22 18:06:37 -05:00
Viktor Lofgren
59ec70eb73
(*) Clean up code related to crawl parquet inspection
2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b
(control) Improve pagination for crawl data inspector
2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee
(control) Improve pagination for crawl data inspector
2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4
(control) Add filter functionality for crawl data inspector
2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c
(control) Clean up UX for crawl data inspector
2024-05-21 18:27:24 +02:00
Sam Storment
5659df4388
(search) Set link and form field colors manually to override browser defaults with poor dark mode contrast
2024-05-21 00:03:46 -05:00
Viktor Lofgren
24bf29d369
(*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed
2024-05-20 18:03:21 +02:00
Viktor Lofgren
17dc00d05f
(control) Partial implementation of inspection utility for crawl data
...
Uses duckdb and range queries to read the parquet files directly from the index partitions.
UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
4fcd4a8197
(index) Refactor to reduce the level of indirection
2024-05-19 12:40:33 +02:00
Viktor Lofgren
daf2a8df54
(btree) Roll back optimization of queryDataWithIndex
...
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
The reason retain() is so fast is due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected. They do not hold when looking up the data associated with the already intersected documents, where the data is more sparse.
2024-05-19 11:29:28 +02:00
Sam Storment
43489c98d8
(search) Minor dark theme tweaks after the new mocked UI elements were added
2024-05-19 01:06:54 -05:00
Viktor Lofgren
88997a1c4f
(btree) Clean up code
2024-05-18 18:38:46 +02:00
Viktor Lofgren
d12c77305c
(btree) Clean up code
2024-05-18 18:03:17 +02:00
Viktor Lofgren
ab4e2b222e
(array) Fix broken benchmarks
2024-05-18 13:41:24 +02:00
Viktor Lofgren
b867eadbef
(big-string) Remove the unused bigstring library
2024-05-18 13:40:03 +02:00
Viktor Lofgren
19163fa883
(array) Clean up the Array library
...
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well; this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Sam Storment
a7c33809c4
Merge branch 'master' into search-dark-theme
2024-05-17 22:52:19 -05:00
Viktor Lofgren
650f3843bb
(array) Clean up search function jungle
...
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
2024-05-17 14:31:02 +02:00
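As an illustration of what a branchless binary search can look like, here is a minimal lower-bound sketch over a plain `long[]`. The names are hypothetical and this is not the array library's code; whether the JIT actually turns the ternary into a conditional move is also an assumption that would need checking with a benchmark.

```java
// Hypothetical sketch of a branchless lower-bound search; not the library's implementation.
final class BranchlessSearchSketch {
    /** Returns the index of the first element >= key, or data.length if there is none. */
    static int lowerBound(long[] data, long key) {
        int base = 0;
        int len = data.length;
        while (len > 1) {
            int half = len / 2;
            // No early exit and no data-dependent branch in the loop body:
            // the comparison only selects how far to advance the base.
            base += (data[base + half - 1] < key) ? half : 0;
            len -= half;
        }
        if (len == 1 && data[base] < key) base++;
        return base;
    }

    public static void main(String[] args) {
        long[] data = {1, 3, 5, 7, 9};
        System.out.println(lowerBound(data, 6));
    }
}
```

Because a miss simply yields the insertion point, the caller checks `data[i] == key` itself instead of decoding a negative return value.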
Viktor Lofgren
9e766bc056
(array) Clean up search function jungle
...
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
2024-05-17 14:30:06 +02:00
Viktor Lofgren
48aff52e00
(array) Increase LongArray on-heap alignment to 16 bytes
...
This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.
2024-05-16 19:12:36 +02:00
Viktor Lofgren
9d7616317e
(array) Clean up native code a bit
2024-05-16 14:47:10 +02:00
Viktor Lofgren
d227a09fb1
(search) Extend paperdoll service mock with site info data and screenshots
...
It's a bit of a hack job but will do, random exploration is available but only through a "browse:random"-style query
2024-05-15 12:40:55 +02:00
Viktor Lofgren
f48cf77c4d
(array, experimental) Add benchmark results for quicksort
2024-05-14 18:15:30 +02:00
Viktor Lofgren
3549be216f
(array, experimental) Documentation for native algos
2024-05-14 17:43:05 +02:00
Viktor Lofgren
c3e3a3dbc5
(search) Fix problem list in clustered search results
2024-05-14 13:05:52 +02:00
Viktor Lofgren
55a7c1db00
(array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java
2024-05-14 12:54:14 +02:00
Sam Storment
bb315221ab
(search, WIP) Make the dark theme look generally nicer. Rename CSS custom properties a bit. Switch a lot of background colors to HSL to make it easy to change colors relative to one another.
2024-05-14 01:32:40 -05:00
Sam Storment
c38766c5a6
(search, WIP) Convert SCSS variables to CSS custom properties for dynamic theming
2024-05-08 22:13:24 -05:00
Viktor Lofgren
c837321df1
(search) Provide a notification when no search results are found.
2024-05-06 20:11:39 +02:00
Viktor Lofgren
af7f6b89ec
(search) Delete vestigial stylesheet from the old design.
2024-05-06 19:52:29 +02:00
Viktor Lofgren
29a4d3df23
(search) Improve search-service paperdoll by mocking suggestions and news
2024-05-06 19:52:13 +02:00
Viktor Lofgren
7d1cafc070
(control) Add skip link for navigation in control GUI
2024-05-04 12:36:44 +02:00
Viktor Lofgren
5951c67a8b
(search) Center the search results page
2024-05-04 12:23:21 +02:00
Viktor Lofgren
c454007730
(search) Increase contrast for some UI elements
2024-05-04 12:02:52 +02:00
Viktor Lofgren
4e49cca43d
(search) Clean up SCSS code a bit
2024-05-04 11:58:54 +02:00
Viktor Lofgren
49a8c06095
(search) Improve contrast for text on random button
2024-05-04 11:51:19 +02:00
Viktor Lofgren
d01d9fa670
(search) Add screenreader-specific notification remark about when search results start.
2024-05-04 11:41:06 +02:00
Viktor Lofgren
a53a32f006
(search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation
2024-05-04 11:41:05 +02:00
Viktor Lofgren
3548d54cf6
(search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens.
2024-05-04 11:41:04 +02:00
Viktor Lofgren
01f242ac7e
(search) Add stylesheet class for screenreader-only items
2024-05-04 11:41:03 +02:00
Viktor Lofgren
2840d9d403
(search) Add screenreader-only positions count text to search results
2024-05-04 11:41:03 +02:00
Viktor Lofgren
9fecfc5025
(search) Add autocomplete attribute to search-form
2024-05-04 11:41:02 +02:00
Viktor Lofgren
1b901e01f2
(search) Add bypass link that skips navigation
2024-05-04 11:41:01 +02:00
Viktor Lofgren
974aa35558
(search) Add proper alt-text to random exploration mode
2024-05-04 11:41:00 +02:00
Viktor Lofgren
4021a0ae98
(search) Add en-US language tags to all templates
2024-05-04 11:40:59 +02:00
Viktor Lofgren
b7a95be731
(search) Create a small mocking framework for running the search service in isolation.
2024-05-04 11:40:59 +02:00
Viktor Lofgren
616649f040
(logs) Fix logdir location
2024-05-04 11:40:59 +02:00
Viktor Lofgren
6087f9635c
(qs) Move index.html out of public directory
...
It was put there to simulate the /public interface paradigm that is now deprecated.
2024-05-01 12:56:12 +02:00
Viktor Lofgren
2ad0bfda1e
(*) Fix boot orchestration for the services
...
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
08f8b6e022
(system) Log loaded properties to the console
2024-04-30 18:29:11 +02:00
Viktor Lofgren
800ed6b1e9
(zk) Terminate immediately if zookeeper isn't found
...
This makes debugging easier
2024-04-30 18:28:49 +02:00
Viktor Lofgren
908535a3a0
(single-service) Ensure single-service spawner can specify the node
2024-04-30 18:27:46 +02:00
Viktor Lofgren
7fe2ab6f39
(file-storage) Ensure file storage root location can be overridden when running outside of docker
2024-04-30 18:26:15 +02:00
Viktor Lofgren
c9ee0c909e
(download-sample) Set +x permissions on directories created during this job
2024-04-30 18:25:07 +02:00
Viktor Lofgren
38aedb50ac
(converter) Do not suppress exceptions in the converter
2024-04-30 18:24:35 +02:00
Viktor Lofgren
4772e0b59d
(service) Deprecate /public prefix on HTTP
...
Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations.
Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths.
The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.
2024-04-30 14:46:18 +02:00
Viktor Lofgren
70e2e41955
(crawler) Content type prober should not swallow exceptions
2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc
(crawler) Modify crawl set growth to grow small domains faster than larger ones
2024-04-27 17:36:27 +02:00
Viktor
2d49071e96
Merge branch 'master' into run-outside-docker
2024-04-25 18:53:26 +02:00
Viktor Lofgren
89889ecbbd
(single-service) Skip starting Prometheus if it's not explicitly enabled
2024-04-25 17:54:07 +02:00
Viktor Lofgren
c8ee354d0b
(log) Make log dir configurable via environment variable
2024-04-25 15:09:18 +02:00
Viktor Lofgren
4e5f069809
(build) Migrate ssr to the new root setting schema of java lang version
2024-04-25 15:08:56 +02:00
Viktor Lofgren
6690e9bde8
(service) Ensure the service discovery starts early
...
This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.
2024-04-25 15:08:33 +02:00
Viktor Lofgren
e4b34b6ee6
(index) Correctly detect the presence of an all-virtual path through the query
2024-04-25 14:01:46 +02:00
Viktor Lofgren
3952ef6ca5
(service) Let singleservice configure ports and bind addresses
2024-04-25 13:49:57 +02:00
Viktor Lofgren
7eb5e6aa66
(crawler) Abort recrawl if error count is too high
2024-04-24 21:46:40 +02:00
Viktor Lofgren
282022d64e
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:44:39 +02:00
Viktor Lofgren
91a98a8807
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:44:39 +02:00
Viktor Lofgren
32fe864a33
(build) Java 22 and its consequences have been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e1c9313396
(crawler) Emulate if-modified-since for domains that don't support the header
...
This will help reduce the strain on some server software, in particular Discourse.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f430a084e8
(crawler) Remove accidental log spam
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a86b596897
(crawler) Code quality
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6dd87b0378
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
...
This should drastically reduce the number of fetched documents on many domains
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c9f029c214
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
2024-04-24 14:44:39 +02:00
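In HTTP, a weak entity tag is written with a `W/` prefix in front of the quoted tag. A minimal sketch of the stripping step named in the commit above, with a hypothetical helper name rather than the crawler's actual code:

```java
// Hypothetical helper; illustrates removing the weak-validator prefix from an ETag.
final class EtagSketch {
    /** Returns the ETag without a leading W/ prefix; other values pass through unchanged. */
    static String stripWeakPrefix(String etag) {
        if (etag != null && etag.startsWith("W/")) {
            return etag.substring(2);
        }
        return etag;
    }

    public static void main(String[] args) {
        System.out.println(stripWeakPrefix("W/\"abc123\""));
    }
}
```

The result is what would be sent in the `If-None-Match` request header.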
Viktor Lofgren
6b88db10ad
(crawler) Ensure all appropriate headers are recorded on the request
2024-04-24 14:44:39 +02:00
Viktor Lofgren
8a891c2159
(crawler/converter) Remove legacy junk from parquet migration
2024-04-24 14:44:39 +02:00
Viktor Lofgren
ad2ac8eee3
(query) Mark flaky test, correct assert on test
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f46733a47a
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-24 14:44:39 +02:00
Viktor Lofgren
934167323d
(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
64baa41e64
(query) Always generate an ngram alternative, suppresses generation of multiple identical query branches
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5165cf6d15
(ranking) Set regularMask correctly
2024-04-24 14:44:39 +02:00
Viktor Lofgren
4489b21528
(ranking) Cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f623b37577
(ranking) Suppress NaN:s in ranking output
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f4a2fea451
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a748fc5448
(index, bugfix) Pass url quality to query service
2024-04-24 14:44:39 +02:00
Viktor Lofgren
0dcca0cb83
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-24 14:44:39 +02:00
Viktor Lofgren
b80a83339b
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
eb74d08f2a
(qs) Additional info in query debug UI
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e79ab0c70e
(qs) Basic query debug feature
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e419e26f3a
(proto) Improve handling of omitted parameters
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6102fd99bf
(qs) Improve logging
2024-04-24 14:44:39 +02:00
Viktor Lofgren
def36719d3
(query) Minor code cleanup
2024-04-24 14:44:39 +02:00
Viktor Lofgren
462aa9af26
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a09c84e1b8
(query) Modify tokenizer to match the behavior of the sentence extractor
...
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44b33798f3
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
2f0b648fad
(index) Add jaccard index term to boost results based on term overlap
2024-04-24 14:44:39 +02:00
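The Jaccard index of two sets is the size of their intersection divided by the size of their union. A minimal sketch, assuming the sets are represented as 64-bit masks; that representation and the names are assumptions made for illustration, not taken from the index code:

```java
// Hypothetical sketch: Jaccard index of two sets encoded as bit masks.
final class JaccardSketch {
    /** |A ∩ B| / |A ∪ B|, defined here as 0 when both sets are empty. */
    static double jaccard(long maskA, long maskB) {
        int union = Long.bitCount(maskA | maskB);
        if (union == 0) return 0.0;
        return (double) Long.bitCount(maskA & maskB) / union;
    }

    public static void main(String[] args) {
        System.out.println(jaccard(0b1100L, 0b0110L));
    }
}
```

The value ranges from 0 (no overlap) to 1 (identical sets), which makes it easy to scale into a ranking bonus.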
Viktor Lofgren
de0e56f027
(index) Remove position overlap check, coherences will do the work instead
2024-04-24 14:44:39 +02:00
Viktor Lofgren
973ced7b13
(index) Omit absent terms from coherence checks
2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb4b824a85
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c583a538b1
(search) Add implicit coherence constraints based on segmentation
2024-04-24 14:44:39 +02:00
Viktor Lofgren
e0224085b4
(index) Improve recall for small queries
...
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
44c1e1d6d9
(index) Remove dead code
...
Since the performance fix in 3359f72239 had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
c620e9c026
(index) Experimental performance regression fix
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1bb88968c5
(test) Fix broken test
2024-04-24 14:44:39 +02:00
Viktor Lofgren
df75e8f4aa
(index) Explicitly free LongQueryBuffers
2024-04-24 14:44:39 +02:00
Viktor Lofgren
adf846bfd2
(index) Fix term coherence evaluation
...
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
1748fcc5ac
(valuation) Impose stronger constraints on locality of terms
...
Clean up logic a bit
2024-04-24 14:44:39 +02:00
Viktor Lofgren
08416393e0
(valuation) Impose stronger constraints on locality of terms
2024-04-24 14:44:39 +02:00
Viktor Lofgren
fce26015c9
(encyclopedia) Index the full articles
...
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
155be1078d
(index) Fix priority search terms
...
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
6efc0f21fe
(index) Clean up data model
...
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
f3255e080d
(ngram) Grab titles separately when extracting ngrams from wiki data
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5f6a3ef9d0
(ngram) Correct |s|^|s|-normalization to use length and not count
2024-04-24 14:44:39 +02:00
Viktor Lofgren
afc4fed591
(ngram) Correct size value in ngram lexicon generation, trim the terms better
2024-04-24 14:44:39 +02:00
Viktor Lofgren
cb505f98ef
(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.
2024-04-24 14:44:39 +02:00
Viktor Lofgren
a0b3634cb6
(ngram) Only extract frequencies of title words, but use the body to increment the counters...
...
The sign of the counter is used to indicate whether a term has appeared in a title. Until it's seen in a title, it's provisionally saved as a negative count.
2024-04-24 14:44:39 +02:00
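The sign trick described above can be sketched like this. The names and the `HashMap` are hypothetical stand-ins for the real lexicon structure, and the choice that a title occurrence only flips the sign (starting at 1 for a term never seen before) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: one signed counter per term, negative until seen in a title.
final class SignedCounterSketch {
    private final Map<String, Integer> counts = new HashMap<>();

    /** A body occurrence grows the magnitude and keeps the current sign. */
    void seenInBody(String term) {
        counts.merge(term, -1, (old, ignored) -> old > 0 ? old + 1 : old - 1);
    }

    /** A title occurrence turns a provisional (negative) count into a confirmed one. */
    void seenInTitle(String term) {
        counts.merge(term, 1, (old, ignored) -> Math.abs(old));
    }

    /** Terms that never appeared in a title report a count of zero. */
    int confirmedCount(String term) {
        return Math.max(0, counts.getOrDefault(term, 0));
    }

    public static void main(String[] args) {
        SignedCounterSketch c = new SignedCounterSketch();
        c.seenInBody("new_york");
        c.seenInTitle("new_york");
        System.out.println(c.confirmedCount("new_york"));
    }
}
```

This avoids a second data structure for tracking which terms are title terms, at the cost of losing the ability to store a confirmed count of zero.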
Viktor Lofgren
e23359bae9
(query, minor) Remove debug statement
2024-04-24 14:44:39 +02:00
Viktor Lofgren
5531ed632a
(query, minor) Remove debug statement
2024-04-24 14:44:39 +02:00
Viktor Lofgren
150ee21f3c
(ngram) Clean up ngram lexicon code
...
This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
c96da0ce1e
(segmentation) Pick best segmentation using |s|^|s|-style normalization
...
This is better than doing all segmentations possible at the same time.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a0d9e66ff7
(ngram) Fix index range in NgramLexicon to avoid an exception
2024-04-24 14:44:38 +02:00
Viktor Lofgren
55f627ed4c
(index) Clean up the code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
7dd8c78c6b
(ngrams) Remove the vestigial logic for capturing permutations of n-grams
...
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8bf7d090fd
(qs) Clean up parsing code using new record matching
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6bfe04b609
(term-freq-exporter) Reduce thread count and memory usage
2024-04-24 14:44:38 +02:00
Viktor Lofgren
491d6bec46
(term-freq-exporter) Extract ngrams in term-frequency-exporter
2024-04-24 14:44:38 +02:00
Viktor Lofgren
4fb86ac692
(search) Fix outdated assumptions about the results
...
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6cba6aef3b
(minor) Remove dead code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
7e216db463
(index) Add origin trace information for index readers
...
This used to be supported by the system but got lost in refactoring at some point.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
adc90c8f1e
(sentence-extractor) Fix resource leak in sentence extractor
...
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
2024-04-24 14:44:38 +02:00
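The fix described above amounts to a null check around the initialization of a shared static field. A minimal sketch with hypothetical names; the `int[]` stands in for a large model, and the `synchronized` block is an addition for thread safety that the commit message does not mention.

```java
// Hypothetical sketch: initialize an expensive static resource only once.
final class LazyStaticSketch {
    static int loads = 0;            // counts expensive loads, for demonstration only
    private static int[] lexicon;    // stand-in for a large shared model

    LazyStaticSketch() {
        synchronized (LazyStaticSketch.class) {
            if (lexicon == null) {   // previously the field was overwritten on every construction
                lexicon = loadLexicon();
            }
        }
    }

    private static int[] loadLexicon() {
        loads++;
        return new int[1024];
    }

    public static void main(String[] args) {
        new LazyStaticSketch();
        new LazyStaticSketch();
        System.out.println("loads: " + loads);
    }
}
```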
Viktor Lofgren
e3316a3672
(index) Clean up new index query code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
a3a6d6292b
(qs, index) New query model integrated with index service.
...
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
8cb9455c32
(qs, WIP) Fix edge cases in query compilation
...
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-04-24 14:44:38 +02:00
Viktor Lofgren
dc65b2ee01
(qs, WIP) Clean up dead code
2024-04-24 14:44:38 +02:00
Viktor Lofgren
98a1adbf81
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
0bd1e15cce
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
eda926767e
(qs, WIP) Tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
cd1a18c045
(qs, WIP) Break up code and tidy it up a bit
2024-04-24 14:44:38 +02:00
Viktor Lofgren
6f567fbea8
(qs, WIP) Fix output determinism, fix tests
2024-04-24 14:44:38 +02:00
Viktor Lofgren
0ebadd03a5
(WIP) Query rendering finally beginning to look like it works
2024-04-24 14:44:38 +02:00
Viktor Lofgren
2253b556b2
WIP
2024-04-24 14:44:17 +02:00
Viktor Lofgren
6a7a7009c7
(convert) Initial integration of segmentation data into the converter's keyword extraction logic
2024-04-24 14:44:17 +02:00
Viktor Lofgren
3c75057dcd
(qs) Retire NGramBloomFilter, integrate new segmentation model instead
2024-04-24 14:44:17 +02:00
Viktor Lofgren
212d101727
(control) GUI for exporting segmentation data from a wikipedia zim
2024-04-24 14:44:17 +02:00
Viktor Lofgren
760b80659d
(WIP) Partial integration of new query expansion code into the query-service
2024-04-24 14:44:17 +02:00
Viktor Lofgren
04879c005d
(WIP) Improve data extraction from wikipedia data
2024-04-24 14:44:17 +02:00
Viktor Lofgren
cb82927756
(WIP) Implement first take of new query segmentation algorithm
2024-04-24 14:44:17 +02:00
Viktor Lofgren
8b9629f2f6
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb
(build) Java 22 and its consequences has been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad
(crawler) Emulate if-modified-since for domains that don't support the header
...
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001
(crawler) Remove accidental log spam
2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f
(crawler) Code quality
2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
...
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
2024-04-22 14:31:05 +02:00
Viktor Lofgren
d816f048f5
(crawler) Ensure all appropriate headers are recorded on the request
2024-04-22 14:14:24 +02:00
Viktor Lofgren
b09ddd0036
(crawler/converter) Remove legacy junk from parquet migration
2024-04-22 12:34:28 +02:00
Viktor Lofgren
0a73b02a00
(query) Mark flaky test, correct assert on test
2024-04-21 12:30:14 +02:00
Viktor Lofgren
8769704462
(ranking) TermCoherenceFactory should be run for size=2 queries
2024-04-21 12:29:25 +02:00
Viktor Lofgren
214551f1df
(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.
2024-04-19 20:36:01 +02:00
Viktor Lofgren
2cc74c005a
(query) Always generate an ngram alternative, suppresses generation of multiple identical query branches
2024-04-19 19:42:30 +02:00
Viktor Lofgren
ed250f57f2
(ranking) Set regularMask correctly
2024-04-19 14:31:57 +02:00
Viktor Lofgren
e92c25f7e0
(ranking) Cleanup
2024-04-19 14:13:12 +02:00
Viktor Lofgren
3ab563f314
(ranking) Suppress NaN:s in ranking output
2024-04-19 13:58:28 +02:00
Viktor Lofgren
426338cb45
(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N
2024-04-19 12:41:48 +02:00
Viktor Lofgren
5fa2375898
(index, bugfix) Pass url quality to query service
2024-04-19 12:41:26 +02:00
Viktor Lofgren
41782a0ab5
(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp
2024-04-19 12:19:26 +02:00
Viktor Lofgren
9b06433b82
(qs) Additional info in query debug UI
2024-04-19 12:18:53 +02:00
Viktor Lofgren
def607d840
(qs) Additional info in query debug UI
2024-04-19 11:46:27 +02:00
Viktor Lofgren
2b811fb422
(qs) Basic query debug feature
2024-04-19 11:00:56 +02:00
Viktor Lofgren
36cc62c10c
(proto) Improve handling of omitted parameters
2024-04-18 10:47:12 +02:00
Viktor Lofgren
975d92912c
(qs) Improve logging
2024-04-18 10:44:08 +02:00
Viktor Lofgren
8bbaf457de
(query) Minor code cleanup
2024-04-18 10:37:51 +02:00
Viktor Lofgren
7641a02f31
(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard
...
The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.
2024-04-18 10:36:15 +02:00
Viktor Lofgren
ce16239e34
(query) Modify tokenizer to match the behavior of the sentence extractor
...
This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.
2024-04-17 17:54:32 +02:00
Viktor Lofgren
d64bd227cf
(index) Clean up jaccard index term code and down-tune the parameter's importance a bit
2024-04-17 17:40:16 +02:00
Viktor Lofgren
c5ab0a9054
(index) Add jaccard index term to boost results based on term overlap
2024-04-17 16:50:26 +02:00
Viktor Lofgren
dac948973d
(index) Remove position overlap check, coherences will do the work instead
2024-04-17 14:20:01 +02:00
Viktor Lofgren
9d008d1d6f
(index) Omit absent terms from coherence checks
2024-04-17 14:12:16 +02:00
Viktor Lofgren
f52457213e
(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus
2024-04-17 14:05:02 +02:00
Viktor Lofgren
579295a673
(search) Add implicit coherence constraints based on segmentation
2024-04-17 14:03:35 +02:00
Viktor Lofgren
af8ff8ce99
(index) Improve recall for small queries
...
Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.
2024-04-16 22:51:03 +02:00
Viktor Lofgren
7fa3e86e64
(index) Remove dead code
...
Since the performance fix in 3359f72239
had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.
2024-04-16 19:59:27 +02:00
Viktor Lofgren
3359f72239
(index) Experimental performance regression fix
2024-04-16 19:48:14 +02:00
Viktor Lofgren
41fa154aa6
(test) Fix broken test
2024-04-16 19:48:14 +02:00
Viktor Lofgren
deaba0152d
(index) Explicitly free LongQueryBuffers
2024-04-16 19:23:00 +02:00
Viktor Lofgren
feaef6093e
(index) Fix term coherence evaluation
...
The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.
2024-04-16 18:07:43 +02:00
Viktor Lofgren
078fa4fdd0
(valuation) Impose stronger constraints on locality of terms
...
Clean up logic a bit
2024-04-16 17:22:58 +02:00
Viktor Lofgren
2dc77a0638
(valuation) Impose stronger constraints on locality of terms
2024-04-16 17:15:21 +02:00
Viktor Lofgren
f434a8b492
(build) Upgrade jib plugin version
2024-04-16 15:25:23 +02:00
Viktor Lofgren
d2658d6f84
(sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier.
2024-04-16 13:25:15 +02:00
Viktor Lofgren
8c559c8121
(conf) Add additional logic for discovering system root
2024-04-16 12:37:18 +02:00
Viktor Lofgren
2353c73c57
(encyclopedia) Index the full articles
...
Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.
2024-04-16 12:10:13 +02:00
Viktor Lofgren
599e719ad4
(index) Fix priority search terms
...
This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.
2024-04-15 16:44:08 +02:00
Viktor Lofgren
b6d365bacd
(index) Clean up data model
...
The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality.
The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.
2024-04-15 16:04:07 +02:00
Viktor Lofgren
52f0c0d336
(ngram) Grab titles separately when extracting ngrams from wiki data
2024-04-13 19:34:16 +02:00
Viktor Lofgren
fda1c05164
(ngram) Correct |s|^|s|-normalization to use length and not count
2024-04-13 18:05:30 +02:00
Viktor Lofgren
1329d4abd8
(ngram) Correct size value in ngram lexicon generation, trim the terms better
2024-04-13 17:51:02 +02:00
Viktor Lofgren
f064992137
(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.
2024-04-13 17:07:23 +02:00
Viktor Lofgren
8a81a480a1
(ngram) Only extract frequencies of title words, but use the body to increment the counters...
...
The sign of the counter is used to indicate whether a term has appeared as a title. Until it's seen in a title, it's provisionally saved as a negative count.
2024-04-12 18:08:31 +02:00
Viktor Lofgren
d729c400e5
(query, minor) Remove debug statement
2024-04-12 17:52:55 +02:00
Viktor Lofgren
ad4810d991
(query, minor) Remove debug statement
2024-04-12 17:45:26 +02:00
Viktor Lofgren
6a67043537
(ngram) Clean up ngram lexicon code
...
This is both an optimization that removes some GC churn and a clean-up of the code that removes references to outdated concepts.
2024-04-12 17:45:06 +02:00
Viktor Lofgren
864d6c28e7
(segmentation) Pick best segmentation using |s|^|s|-style normalization
...
This is better than doing all segmentations possible at the same time.
2024-04-12 17:44:14 +02:00
Viktor Lofgren
bb6b51ad91
(ngram) Fix index range in NgramLexicon to avoid an exception
2024-04-12 10:13:25 +02:00
Viktor Lofgren
65e3caf402
(index) Clean up the code
2024-04-11 18:50:21 +02:00
Viktor Lofgren
b7d9a7ae89
(ngrams) Remove the vestigial logic for capturing permutations of n-grams
...
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
ed73d79ec1
(qs) Clean up parsing code using new record matching
2024-04-11 17:36:08 +02:00
Viktor Lofgren
c538c25008
(term-freq-exporter) Reduce thread count and memory usage
2024-04-10 17:11:23 +02:00
Viktor Lofgren
4b47fadbab
(term-freq-exporter) Extract ngrams in term-frequency-exporter
2024-04-10 16:58:05 +02:00
Viktor Lofgren
fcdc843c15
(search) Fix outdated assumptions about the results
...
We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption.
For the API service, we'll simulate the old behavior to keep the API stable.
For the search service, we'll introduce a new way of calculating positions through tree aggregation.
2024-04-07 12:09:44 +02:00
Viktor Lofgren
dbdcf459a7
(minor) Remove dead code
2024-04-06 16:27:16 +02:00
Viktor Lofgren
ef25d60666
(index) Add origin trace information for index readers
...
This used to be supported by the system but got lost in refactoring at some point.
2024-04-06 13:28:14 +02:00
Viktor Lofgren
7f7021ce64
(sentence-extractor) Fix resource leak in sentence extractor
...
The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation.
The modified behavior checks for nullity before creating a new instance.
2024-04-05 18:52:58 +02:00
Joshua Holland
617e633d7a
Update keywords docs use of explore to browse
...
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-05 15:15:49 +02:00
Viktor Lofgren
ae7c760772
(index) Clean up new index query code
2024-04-05 13:30:49 +02:00
Viktor Lofgren
81815f3e0a
(qs, index) New query model integrated with index service.
...
Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.
2024-04-04 20:17:58 +02:00
Joshua Holland
8e02f567d7
Update keywords docs use of explore to browse
...
I can't tell when this happened, but the proper keyword now seems to be browse and not explore.
2024-04-01 00:04:12 -05:00
Viktor Lofgren
87bb93e1d4
(qs, WIP) Fix edge cases in query compilation
...
This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w | z_w) | x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.
2024-03-29 12:40:27 +01:00
Viktor Lofgren
e596c929ac
(qs, WIP) Clean up dead code
2024-03-28 16:37:23 +01:00
Viktor Lofgren
9852b0e609
(qs, WIP) Tidy it up a bit
2024-03-28 14:18:26 +01:00
Viktor Lofgren
51b0d6c0d3
(qs, WIP) Tidy it up a bit
2024-03-28 14:09:17 +01:00
Viktor Lofgren
15391c7a88
(qs, WIP) Tidy it up a bit
2024-03-28 13:54:30 +01:00
Viktor Lofgren
fe62593286
(qs, WIP) Break up code and tidy it up a bit
2024-03-28 13:26:54 +01:00
Viktor Lofgren
4cc11e183c
(qs, WIP) Fix output determinism, fix tests
2024-03-28 13:11:26 +01:00
Viktor Lofgren
f82ebd7716
(WIP) Query rendering finally beginning to look like it works
2024-03-28 13:01:21 +01:00
Viktor Lofgren
bd0704d5a4
(*) Fix JDK22 migration issues
...
A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
002afca1c5
(sys) Upgrade to JDK22
...
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:33:27 +01:00
Viktor Lofgren
a4b810f511
WIP
2024-03-21 14:33:26 +01:00
Viktor Lofgren
824765b1ee
(*) Fix JDK22 migration issues
...
A few bizarre build errors cropped up when migrating to JDK22. Not at all sure what caused them, but they were easy to mitigate.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
fe8d583fdd
(sys) Upgrade to JDK22
...
This also entails upgrading JIB to 3.4.1 and Lombok to 1.18.32.
2024-03-21 14:27:13 +01:00
Viktor Lofgren
0bd3365c24
(convert) Initial integration of segmentation data into the converter's keyword extraction logic
2024-03-19 14:28:42 +01:00
Viktor Lofgren
d8f4e7d72b
(qs) Retire NGramBloomFilter, integrate new segmentation model instead
2024-03-19 10:42:09 +01:00
Viktor Lofgren
afc047cd27
(control) GUI for exporting segmentation data from a wikipedia zim
2024-03-18 13:45:23 +01:00
Viktor Lofgren
00ef4f9803
(WIP) Partial integration of new query expansion code into the query-service
2024-03-18 13:16:49 +01:00
Viktor Lofgren
07e4d7ec6d
(WIP) Improve data extraction from wikipedia data
2024-03-18 13:16:00 +01:00
Viktor Lofgren
8ae1f08095
(WIP) Implement first take of new query segmentation algorithm
2024-03-12 13:12:50 +01:00
Viktor Lofgren
57e6a12d08
(registry) Correct registerMonitor() behavior
...
The previous behavior would listen to too many changes and, based on zookeeper rather than curator assumptions about behavior, add an additional monitor on each invocation of each monitor. Since monitors always trigger on service state changes, each monitor would re-register, effectively doubling the number of monitors whenever a service stopped or started. This in turn meant a lot of bizarre thrashing behavior, even on changes in services that don't explicitly talk to each other.
This re-registering behavior is no longer done.
2024-03-06 12:22:15 +01:00
Viktor Lofgren
46423612e3
(refac) Merge service-discovery and service modules
...
Also adds a few tests to the server/client code.
2024-03-03 10:49:23 +01:00
Viktor Lofgren
29bf473d74
(encyclopedia) Add URLencoding to path element
...
This prevents corruption of the links to the sideloaded encyclopedia data when the article path contains characters that are not valid in a URL.
2024-03-01 17:28:09 +01:00
Viktor Lofgren
9689f3faee
(domain-info) Fix incorrect array indexing
2024-02-29 18:56:09 +01:00
Viktor Lofgren
93fa58c93d
(domain-info) Fix incorrect array indexing
...
Using the id instead of idx when addressing the ranksArray caused exceptions.
2024-02-29 17:54:23 +01:00