Viktor Lofgren
3faa5bf521
(search-query) Tidy up QueryGRPCService and IndexClient
2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480
(query) Tidy up code
2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94
(*) Trim the stopwords list
...
Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0
(index) Implement working optional TermCoherences
2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771
(index) Correct TermCoherence requirements
2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8
(gamma) Fix readCount() behavior in EGC
2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f
(index) Partial re-implementation of position constraints
2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616
(doc) Add readme.md for coded-sequence library
...
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
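Editor's note on the entry above: a minimal sketch of plain Elias gamma coding, for readers unfamiliar with the scheme the readme covers. It is not the coded-sequence library's API. The class and method names are hypothetical, and the bits are kept as a string of '0'/'1' characters purely for readability, whereas a real implementation would pack them into bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the coded-sequence library's API.
class EliasGammaSketch {

    /** Appends the Elias gamma code of n (n >= 1) to out as '0'/'1' characters. */
    static void encode(int n, StringBuilder out) {
        if (n < 1) {
            throw new IllegalArgumentException("Elias gamma only codes positive integers");
        }
        int bits = 32 - Integer.numberOfLeadingZeros(n); // length of n in binary
        for (int i = 1; i < bits; i++) {
            out.append('0');                             // unary prefix: bits-1 zeros
        }
        for (int i = bits - 1; i >= 0; i--) {
            out.append(((n >>> i) & 1) == 1 ? '1' : '0'); // n itself, most significant bit first
        }
    }

    /** Encodes every value in order and returns the concatenated bit string. */
    static String encodeAll(List<Integer> values) {
        StringBuilder out = new StringBuilder();
        for (int v : values) {
            encode(v, out);
        }
        return out.toString();
    }

    /** Decodes a bit string produced by encodeAll; assumes well-formed input. */
    static List<Integer> decodeAll(String bits) {
        List<Integer> values = new ArrayList<>();
        int pos = 0;
        while (pos < bits.length()) {
            int zeros = 0;
            while (bits.charAt(pos) == '0') {            // count the unary prefix
                zeros++;
                pos++;
            }
            int n = 0;
            for (int i = 0; i <= zeros; i++) {           // read zeros+1 payload bits
                n = (n << 1) | (bits.charAt(pos++) - '0');
            }
            values.add(n);
        }
        return values;
    }
}
```

Small values get short codes (1 takes one bit, 5 takes five), which is why the scheme suits lists of small gaps between sorted positions. Whether the library stores gaps or absolute values is not stated in this log.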
Viktor Lofgren
40bca93884
(gamma) Minor clean-up
2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443
(journal) Fixing journal encoding
...
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721
(gamma) Correctly decode zero-length sequences
2024-06-24 13:11:41 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan
2024-06-17 13:18:25 +02:00
Viktor Lofgren
d0d6bb173c
(control) Fix warc data http status filter default value
2024-06-17 12:40:25 +02:00
Viktor Lofgren
90744433c9
Merge branch 'master' into security-scan
...
# Conflicts:
# code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Jaseem Abid
0dd14a4bd0
Specify C++ standard in build command
...
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09
Don't track build files (libcpp.so) with git
2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846
(atags) Fix duckdb SQL injection
...
The input comes from the config file, so this isn't a very realistic threat vector, and even if it were, the query runs in an empty duckdb instance. Adding a validation check anyway to provide a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da
(search) Fix bad practice usage of innerHTML to set what should be text content.
2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d
(build) Upgrade parquet dependencies to 1.14.0
...
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243
(loader) Correctly clamp document size
2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b
(loader) Correctly load the positions column in the keyword projection
2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
...
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d
(index) Integrate positions file properly
2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f
(index) Fix non-compiling tests
2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93
(array/cpp) Update gitignore to properly exclude libcpp.so
2024-06-06 13:06:08 +02:00
Viktor Lofgren
4a8afa6b9f
(index, WIP) Position data partially integrated with forward and reverse indexes.
...
There's no graceful way of doing this in small commits; pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Sam Storment
9c06f446fb
(search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile
2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67
(search) move data-has-js attribute from body to html element
2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6
Handle theming when javascript is disabled. Hide the theme select and fall back to dark media query instead of data-theme attribute
2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf
Add a theme select to the header that lets users toggle their theme independent of their OS theme
2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0
Merge remote-tracking branch 'origin/master'
2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
...
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
9b922af075
(converter) Amend existing modifications to use gamma coded positions lists
...
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c
(gamma) Implement a small library for Elias gamma coding an integer sequence
2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9
(keywords) Add position information to keywords
2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68
(converter) Add position information to serialized document data
...
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
a69ab311c7
(qword) Fix tests that broke due to stopword removal
2024-05-28 14:15:45 +02:00
Viktor Lofgren
6985ab762a
(query) Improve handling of stopwords in queries
2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b
(search) Update the no result text to request bug reports.
2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f
(query) Bugfix stopword issue
...
Add a new rule that creates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
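Editor's note on the entry above: a rough sketch of the idea of offering an alternative query path without stopwords. It is not the project's query-rewriting code. The class name, the stopword set, and the simplification of dropping all stopwords at once (rather than applying a rule per word) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not the project's query rewriting code.
class StopwordPathsSketch {

    // Assumed stopword set for the example.
    static final Set<String> STOPWORDS = Set.of("a", "the");

    /**
     * Returns the original term list, plus an alternative with the stopwords
     * omitted when that alternative differs and is non-empty.
     */
    static List<List<String>> paths(List<String> terms) {
        Set<List<String>> result = new LinkedHashSet<>(); // keeps order, drops duplicates
        result.add(List.copyOf(terms));

        List<String> withoutStopwords = new ArrayList<>();
        for (String term : terms) {
            if (!STOPWORDS.contains(term)) {
                withoutStopwords.add(term);
            }
        }
        if (!withoutStopwords.isEmpty()) {
            result.add(List.copyOf(withoutStopwords));
        }
        return new ArrayList<>(result);
    }
}
```

With both paths available, a document can match without the stopword being present in the index, which is the failure described in the commit body.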
Viktor Lofgren
f83f777fff
(converter) Experimental support for searching by URL
...
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava-dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab
(registry) Fix broken test
2024-05-23 14:15:01 +02:00
Sam Storment
7543e98035
Merge branch 'MarginaliaSearch:master' into search-dark-theme
2024-05-22 18:06:37 -05:00
Viktor Lofgren
59ec70eb73
(*) Clean up code related to crawl parquet inspection
2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b
(control) Improve pagination for crawl data inspector
2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee
(control) Improve pagination for crawl data inspector
2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4
(control) Add filter functionality for crawl data inspector
2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c
(control) Clean up UX for crawl data inspector
2024-05-21 18:27:24 +02:00
Sam Storment
5659df4388
(search) Set link and form field colors manually to override browser defaults with poor dark mode contrast
2024-05-21 00:03:46 -05:00
Viktor Lofgren
24bf29d369
(*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed
2024-05-20 18:03:21 +02:00
Viktor Lofgren
17dc00d05f
(control) Partial implementation of inspection utility for crawl data
...
Uses duckdb and range queries to read the parquet files directly from the index partitions.
UX is a bit rough but is in working order.
2024-05-20 18:02:46 +02:00
Viktor Lofgren
4fcd4a8197
(index) Refactor to reduce the level of indirection
2024-05-19 12:40:33 +02:00
Viktor Lofgren
daf2a8df54
(btree) Roll back optimization of queryDataWithIndex
...
It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect.
retain is so fast because of properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected. Those properties do not hold when looking up the data associated with the already intersected documents, where the data is more sparse.
2024-05-19 11:29:28 +02:00
Sam Storment
43489c98d8
(search) Minor dark theme tweaks after the new mocked UI elements were added
2024-05-19 01:06:54 -05:00
Viktor Lofgren
88997a1c4f
(btree) Clean up code
2024-05-18 18:38:46 +02:00
Viktor Lofgren
d12c77305c
(btree) Clean up code
2024-05-18 18:03:17 +02:00
Viktor Lofgren
ab4e2b222e
(array) Fix broken benchmarks
2024-05-18 13:41:24 +02:00
Viktor Lofgren
b867eadbef
(big-string) Remove the unused bigstring library
2024-05-18 13:40:03 +02:00
Viktor Lofgren
19163fa883
(array) Clean up the Array library
...
IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it).
Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs.
Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well, so this is an attempt to help it out a bit.
2024-05-18 13:23:06 +02:00
Sam Storment
a7c33809c4
Merge branch 'master' into search-dark-theme
2024-05-17 22:52:19 -05:00
Viktor Lofgren
650f3843bb
(array) Clean up search function jungle
...
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
2024-05-17 14:31:02 +02:00
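Editor's note on the entry above: a minimal sketch of what a branchless binary search over a sorted long array can look like. It is not the array library's actual implementation. The class and method names are hypothetical, and returning the lower-bound index is one assumed way of reporting a miss without a negative value.

```java
// Hypothetical illustration only; not the array library's implementation.
class BranchlessSearchSketch {

    /**
     * Returns the index of the first element >= key in the sorted array,
     * or a.length if every element is smaller than key.
     */
    static int lowerBound(long[] a, long key) {
        int base = 0;
        int len = a.length;
        while (len > 1) {
            int half = len >>> 1;
            // The loop runs a fixed number of times for a given length, and
            // the only data-dependent choice is this conditional add, which
            // a JIT may compile to a conditional move rather than a jump.
            base += (a[base + half - 1] < key) ? half : 0;
            len -= half;
        }
        // One element (or none) remains; step past it if it is still too small.
        return base + ((len == 1 && a[base] < key) ? 1 : 0);
    }
}
```

A caller tells a hit from a miss by checking whether the returned index is in range and holds the key. Whether the conditional add actually becomes branch-free machine code depends on the JIT and hardware, so any speed-up should be measured rather than assumed.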
Viktor Lofgren
9e766bc056
(array) Clean up search function jungle
...
Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values.
Replaced binary search function with a branchless version that is much faster.
Cleaned up benchmark code.
2024-05-17 14:30:06 +02:00
Viktor Lofgren
48aff52e00
(array) Increase LongArray on-heap alignment to 16 bytes
...
This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.
2024-05-16 19:12:36 +02:00
Viktor Lofgren
9d7616317e
(array) Clean up native code a bit
2024-05-16 14:47:10 +02:00
Viktor Lofgren
d227a09fb1
(search) Extend paperdoll service mock with site info data and screenshots
...
It's a bit of a hack job but will do. Random exploration is available, but only through a "browse:random"-style query.
2024-05-15 12:40:55 +02:00
Viktor Lofgren
f48cf77c4d
(array, experimental) Add benchmark results for quicksort
2024-05-14 18:15:30 +02:00
Viktor Lofgren
3549be216f
(array, experimental) Documentation for native algos
2024-05-14 17:43:05 +02:00
Viktor Lofgren
c3e3a3dbc5
(search) Fix problem list in clustered search results
2024-05-14 13:05:52 +02:00
Viktor Lofgren
55a7c1db00
(array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java
2024-05-14 12:54:14 +02:00
Sam Storment
bb315221ab
(search, WIP) Make the dark theme look generally nicer. Rename CSS custom properties a bit. Switch a lot of background colors to HSL to make it easy to change colors relative to one another.
2024-05-14 01:32:40 -05:00
Sam Storment
c38766c5a6
(search, WIP) Convert SCSS variables to CSS custom properties for dynamic theming
2024-05-08 22:13:24 -05:00
Viktor Lofgren
c837321df1
(search) Provide a notification when no search results are found.
2024-05-06 20:11:39 +02:00
Viktor Lofgren
af7f6b89ec
(search) Delete vestigial stylesheet from the old design.
2024-05-06 19:52:29 +02:00
Viktor Lofgren
29a4d3df23
(search) Improve search-service paperdoll by mocking suggestions and news
2024-05-06 19:52:13 +02:00
Viktor Lofgren
7d1cafc070
(control) Add skip link for navigation in control GUI
2024-05-04 12:36:44 +02:00
Viktor Lofgren
5951c67a8b
(search) Center the search results page
2024-05-04 12:23:21 +02:00
Viktor Lofgren
c454007730
(search) Increase contrast for some UI elements
2024-05-04 12:02:52 +02:00
Viktor Lofgren
4e49cca43d
(search) Clean up SCSS code a bit
2024-05-04 11:58:54 +02:00
Viktor Lofgren
49a8c06095
(search) Improve contrast for text on random button
2024-05-04 11:51:19 +02:00
Viktor Lofgren
d01d9fa670
(search) Add screenreader-specific notification remark about when search results start.
2024-05-04 11:41:06 +02:00
Viktor Lofgren
a53a32f006
(search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation
2024-05-04 11:41:05 +02:00
Viktor Lofgren
3548d54cf6
(search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens.
2024-05-04 11:41:04 +02:00
Viktor Lofgren
01f242ac7e
(search) Add stylesheet class for screenreader-only items
2024-05-04 11:41:03 +02:00
Viktor Lofgren
2840d9d403
(search) Add screenreader-only positions count text to search results
2024-05-04 11:41:03 +02:00
Viktor Lofgren
9fecfc5025
(search) Add autocomplete attribute to search-form
2024-05-04 11:41:02 +02:00
Viktor Lofgren
1b901e01f2
(search) Add bypass link that skips navigation
2024-05-04 11:41:01 +02:00
Viktor Lofgren
974aa35558
(search) Add proper alt-text to random exploration mode
2024-05-04 11:41:00 +02:00
Viktor Lofgren
4021a0ae98
(search) Add en-US language tags to all templates
2024-05-04 11:40:59 +02:00
Viktor Lofgren
b7a95be731
(search) Create a small mocking framework for running the search service in isolation.
2024-05-04 11:40:59 +02:00
Viktor Lofgren
616649f040
(logs) Fix logdir location
2024-05-04 11:40:59 +02:00
Viktor Lofgren
6087f9635c
(qs) Move index.html out of public directory
...
It was put there to simulate the /public interface paradigm that is now deprecated.
2024-05-01 12:56:12 +02:00
Viktor Lofgren
2ad0bfda1e
(*) Fix boot orchestration for the services
...
This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated.
A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor.
The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.
2024-05-01 12:39:48 +02:00
Viktor Lofgren
08f8b6e022
(system) Log loaded properties to the console
2024-04-30 18:29:11 +02:00
Viktor Lofgren
800ed6b1e9
(zk) Terminate immediately if zookeeper isn't found
...
This makes debugging easier
2024-04-30 18:28:49 +02:00
Viktor Lofgren
908535a3a0
(single-service) Ensure single-service spawner can specify the node
2024-04-30 18:27:46 +02:00