Viktor Lofgren
ec600b967d
(crawler) Adjust domain locking
...
Turns out that throttling to only one lock per domain means the crawler chokes hard on large hosting websites such as WordPress. These now get a slightly larger allowance.
2024-07-27 11:54:46 +02:00
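A minimal sketch of how the adjusted domain locking described above could look, assuming one semaphore per top-level domain. The class name `DomainLocks`, the host list, and the permit counts are hypothetical illustrations; the commit message gives neither the real list nor the real limits.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: one semaphore per top-level domain, with extra permits for large hosts. */
class DomainLocks {
    // Assumed list of large hosting domains; not taken from the actual code.
    private static final Set<String> LARGE_HOSTS =
            Set.of("wordpress.com", "substack.com", "neocities.org");

    private final Map<String, Semaphore> semaphores = new ConcurrentHashMap<>();

    /** Number of concurrent crawl threads allowed for a domain (illustrative values). */
    static int permitsFor(String topDomain) {
        return LARGE_HOSTS.contains(topDomain) ? 2 : 1;
    }

    /** Blocks until a permit for the domain is free and returns the semaphore to release later. */
    Semaphore acquire(String topDomain) {
        Semaphore sem = semaphores.computeIfAbsent(topDomain, d -> new Semaphore(permitsFor(d)));
        sem.acquireUninterruptibly();
        return sem;
    }
}
```

A crawl thread would call `acquire(...)` before fetching and call `release()` on the returned semaphore in a `finally` block, so that a failed crawl does not leave the domain locked.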
Viktor Lofgren
accc598967
(crawler) Add 1 second pause after probing domain to reduce request pressure
2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba
(crawler) Add a per-domain mutex for crawling
...
To ease the pressure on domains with lots of subdomains, such as substack, medium, neocities, etc., a per-domain mutex is added that limits crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
Viktor Lofgren
6665e447aa
(crawler) Add crawl delays around probe call and handle 429 responses properly during this phase
2024-07-16 15:33:24 +02:00
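As a rough illustration of the 429 handling mentioned above, the sketch below computes a wait time from a response status and an optional `Retry-After` header. The class `ProbeBackoff`, the default delay, and the cap are assumptions made for this example, not values from the crawler.

```java
import java.time.Duration;

/** Hypothetical sketch: decide how long to pause after a probe request. */
class ProbeBackoff {
    static final Duration DEFAULT_DELAY = Duration.ofSeconds(1);  // assumed default
    static final Duration MAX_DELAY = Duration.ofSeconds(60);     // assumed upper bound

    /** Returns the pause to apply before retrying; zero if the status is not 429. */
    static Duration delayFor(int status, String retryAfter) {
        if (status != 429) {
            return Duration.ZERO;
        }
        if (retryAfter == null) {
            return DEFAULT_DELAY;
        }
        try {
            long seconds = Math.max(0, Long.parseLong(retryAfter.trim()));
            Duration requested = Duration.ofSeconds(seconds);
            return requested.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : requested;
        } catch (NumberFormatException e) {
            // Retry-After may also be an HTTP date; this sketch does not parse that form.
            return DEFAULT_DELAY;
        }
    }
}
```

Capping the delay keeps a single misconfigured server from stalling a crawl thread indefinitely; whether the real crawler applies such a cap is not stated in the commit.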
Viktor Lofgren
7eb955cc42
(setup) Change mirror for opennlp
...
Seems like the estointernet mirror no longer works. Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d
(crawler) Adjust revisit logic
...
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4
(crawler) Introduce absolute upper limit to crawl depth growth
2024-07-16 14:40:45 +02:00
Viktor Lofgren
ffd970036d
(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
...
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning up multiple SentenceExtractor instances at once.
2024-07-15 05:16:17 +02:00
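The synchronization problem described above can be illustrated with a small sketch. `SharedModelHolder` is a hypothetical stand-in for a class that lazily initializes a static member; it is not the actual SentenceExtractor code.

```java
/** Hypothetical sketch: guarding lazy initialization of a shared static member. */
class SharedModelHolder {
    private static Object sharedModel;  // stands in for an expensive shared resource
    static int loadCount = 0;           // counts how often the resource was loaded

    SharedModelHolder() {
        // synchronized (this) would not protect the static field: every instance
        // has its own monitor, so two constructors could both observe null.
        // Locking on the class object gives all instances one shared monitor.
        synchronized (SharedModelHolder.class) {
            if (sharedModel == null) {
                sharedModel = new Object();
                loadCount++;
            }
        }
    }
}
```

With the class-level lock, constructing many instances concurrently loads the shared resource exactly once.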
Viktor Lofgren
6401a513d7
(crawl) Fix onsubmit confirm dialog for single-site recrawl
2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f
(crawl) Add new functionality for re-crawling a single domain
2024-07-05 15:31:55 +02:00
Viktor
69f88255e9
Merge pull request #101 from MarginaliaSearch/security-scan
...
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan
2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274
(run) Update install.sh with stronger warnings against non-docker install.
2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c
(control) Fix warc data http status filter default value
2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107
(docs) Amend install instructions for non-docker install
2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d
(docs) Update docs with clearer references to the full guide
...
The commit also mentions the non-docker install
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9
Merge branch 'master' into security-scan
...
# Conflicts:
# code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7
Merge pull request #102 from jaseemabid/jabid/macos-build
...
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0
Specify C++ standard in build command
...
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09
Don't track build files (libcpp.so) with git
2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
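For reference, a SHA-256 digest can be computed with the standard `MessageDigest` API as sketched below. The `sha256:` label and the hexadecimal encoding are assumptions for illustration; the commit does not say how the WARC builder labels or encodes its digests.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical sketch: labelled SHA-256 digest of a payload. */
class PayloadDigest {
    static String sha256(byte[] payload) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(payload);
            StringBuilder sb = new StringBuilder("sha256:");  // label format is assumed
            for (byte b : hash) {
                sb.append(String.format("%02x", b));          // two lowercase hex digits per byte
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
    }
}
```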
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
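The change above presumably amounts to requesting the `"TLS"` protocol name from `SSLContext` rather than `"SSL"`. A minimal sketch, where the class `TlsContextFactory` is a hypothetical wrapper:

```java
import java.security.GeneralSecurityException;
import javax.net.ssl.SSLContext;

/** Hypothetical sketch: obtain a TLS context instead of an SSL one. */
class TlsContextFactory {
    static SSLContext create() {
        try {
            SSLContext ctx = SSLContext.getInstance("TLS");  // previously "SSL"
            ctx.init(null, null, null);  // default key managers, trust managers, and randomness
            return ctx;
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e);
        }
    }
}
```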
Viktor Lofgren
55f3ac4846
(atags) Fix duckdb SQL injection
...
The input comes from the config file, so this isn't a very realistic threat vector, and even if it didn't, the query runs in an empty duckdb instance. Still, this adds a validation check to provide a better error message.
2024-06-12 09:05:57 +02:00
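A validation check of the kind described above might look like the following sketch. It assumes the interpolated value is a file path; the class `PathValidator`, the allowed character set, and the query text are all hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: reject suspicious input before building a query string. */
class PathValidator {
    // Assumed allow-list: letters, digits, and common path characters; no quotes or semicolons.
    private static final Pattern SAFE_PATH = Pattern.compile("[A-Za-z0-9_./-]+");

    static String requireSafePath(String path) {
        if (path == null || !SAFE_PATH.matcher(path).matches()) {
            throw new IllegalArgumentException("Unsafe characters in path: " + path);
        }
        return path;
    }

    /** Builds an illustrative query; the real query is not shown in the commit. */
    static String buildQuery(String path) {
        return "SELECT * FROM '" + requireSafePath(path) + "'";
    }
}
```

Failing early with a clear message matches the stated goal of the commit, which is a better error rather than defence against a realistic attack.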
Viktor Lofgren
801cf4b5da
(search) Fix bad practice usage of innerHTML to set what should be text content.
2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d
(build) Upgrade parquet dependencies to 1.14.0
...
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
a07cf1ba93
(array/cpp) Update gitignore to properly exclude libcpp.so
2024-06-06 13:06:08 +02:00
Viktor
bb06cc9ff3
Merge pull request #98 from samstorment/ThemeSwitcher
...
OS Independent Theme Switcher
2024-06-06 12:51:19 +02:00
Sam Storment
9c06f446fb
(search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile
2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67
(search) move data-has-js attribute from body to html element
2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6
Handle theming when javascript is disabled. Hide the theme select and fall back to dark media query instead of data-theme attribute
2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf
Add a theme select to the header that lets users toggle their theme independent of their OS theme
2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0
Merge remote-tracking branch 'origin/master'
2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
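To illustrate the headers mentioned above, the sketch below builds a request with `java.net.http`. Both the choice of HTTP client and the exact header values are assumptions for this example; the commit does not list the values the crawler sends.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical sketch: document request with Accept and Accept-Language headers. */
class CrawlRequests {
    static HttpRequest documentRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                // Illustrative values: prefer HTML, accept anything else at lower priority.
                .header("Accept", "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8")
                // Illustrative values: prefer English, accept other languages.
                .header("Accept-Language", "en,*;q=0.5")
                .GET()
                .build();
    }
}
```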
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
...
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
206a7ce6c1
Merge remote-tracking branch 'origin/master'
2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7
(qword) Fix tests that broke due to stopword removal
2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b
Update ROADMAP.md
2024-05-24 13:57:50 +02:00
Viktor Lofgren
6985ab762a
(query) Improve handling of stopwords in queries
2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b
(search) Update the no result text to request bug reports.
2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f
(query) Bugfix stopword issue
...
Add a new rule that creates an alternative path that omits a word if it's a stopword.
In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
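The idea of an alternative query path can be sketched as follows. This is a simplification under stated assumptions: the class `StopwordPaths` and the stopword list are hypothetical, and the sketch drops all stopwords in a single alternative rather than modelling the query graph the real rule operates on.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

/** Hypothetical sketch: offer a stopword-free alternative alongside the original query terms. */
class StopwordPaths {
    static final Set<String> STOPWORDS = Set.of("the", "a", "of", "to");  // assumed list

    /** Returns the original terms plus, when it differs, a variant without stopwords. */
    static List<List<String>> paths(List<String> terms) {
        List<List<String>> result = new ArrayList<>();
        result.add(List.copyOf(terms));

        List<String> filtered = terms.stream()
                .filter(t -> !STOPWORDS.contains(t.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());

        // Only add the alternative if something was removed and something remains.
        if (!filtered.isEmpty() && filtered.size() < terms.size()) {
            result.add(filtered);
        }
        return result;
    }
}
```

A query can then match if either path is satisfied, so a stopword missing from the index no longer forces an empty result.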
Viktor Lofgren
f83f777fff
(converter) Experimental support for searching by URL
...
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava-dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab
(registry) Fix broken test
2024-05-23 14:15:01 +02:00
Sam Storment
7543e98035
Merge branch 'MarginaliaSearch:master' into search-dark-theme
2024-05-22 18:06:37 -05:00
Viktor Lofgren
59ec70eb73
(*) Clean up code related to crawl parquet inspection
2024-05-22 12:55:08 +02:00
Viktor Lofgren
365229991b
(control) Improve pagination for crawl data inspector
2024-05-21 19:44:48 +02:00
Viktor Lofgren
959a8e29ee
(control) Improve pagination for crawl data inspector
2024-05-21 19:27:25 +02:00
Viktor Lofgren
197c82acd4
(control) Add filter functionality for crawl data inspector
2024-05-21 19:05:44 +02:00
Viktor Lofgren
9539fdb53c
(control) Clean up UX for crawl data inspector
2024-05-21 18:27:24 +02:00
Sam Storment
5659df4388
(search) Set link and form field colors manually to override browser defaults with poor dark mode contrast
2024-05-21 00:03:46 -05:00