Commit Graph

2142 Commits

Author SHA1 Message Date
Viktor Lofgren
b1bfe6f76e (control) New view for domains
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca (control) New view for domains
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
2f38c95886 (index) Backport bugfix from term-positions branch
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search.  This is no bueno.

This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
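The ordering problem described in 2f38c95886 can be illustrated with a minimal sketch. It assumes the fix is to keep the caller's order intact and sort only a private copy used for lookups; the class name `OrderPreservingIdList` and its methods are hypothetical and are not taken from the actual TermIdsList code.

```java
import java.util.Arrays;

/** Hypothetical illustration only; not the real TermIdsList. */
final class OrderPreservingIdList {
    private final long[] ids;       // kept in the order the caller supplied
    private final long[] sortedIds; // private sorted copy, used only for lookups

    OrderPreservingIdList(long[] ids) {
        this.ids = ids.clone();
        this.sortedIds = ids.clone();
        Arrays.sort(this.sortedIds); // sorting the copy leaves this.ids untouched
    }

    long get(int index) {
        return ids[index];
    }

    int size() {
        return ids.length;
    }

    /** Binary search over the sorted copy. */
    boolean contains(long id) {
        return Arrays.binarySearch(sortedIds, id) >= 0;
    }
}
```

The trade-off in this sketch is doubled memory for the id array in exchange for positional stability, which may or may not match what the backported fix actually does.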
Viktor Lofgren
ac67b6b5da (converter) Fix exception handling while reading crawl data 2024-08-02 10:39:49 +02:00
Viktor Lofgren
696fd8909d (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172 (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 20:21:23 +02:00
Viktor Lofgren
f19148132a (search) Restrict site-search by passing domain id along with the site:-term
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
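A per-domain mutex of the kind described in 02c4a2d4ba could be sketched roughly as follows. This is an assumption-laden illustration, not the crawler's code: the class `DomainLocks`, the `Lock` interface, and the choice of a single-permit `Semaphore` keyed by domain name are all hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical sketch: one single-permit semaphore per domain name. */
final class DomainLocks {
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** A lock handle whose close() releases the permit and throws no checked exception. */
    interface Lock extends AutoCloseable {
        @Override
        void close();
    }

    /** Blocks until no other thread is crawling the given domain. */
    Lock lock(String domain) {
        Semaphore sem = locks.computeIfAbsent(domain, d -> new Semaphore(1));
        sem.acquireUninterruptibly();
        return sem::release;
    }

    /** True if some thread currently holds the lock for this domain. */
    boolean isLocked(String domain) {
        Semaphore sem = locks.get(domain);
        return sem != null && sem.availablePermits() == 0;
    }
}
```

Used with try-with-resources, the permit is returned even if the crawl of a subdomain throws.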
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429:s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
7eb955cc42 (setup) Change mirror for opennlp
Seems like the estointernet mirror no longer works.  Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
ffd970036d (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
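The synchronization mistake mentioned in ffd970036d can be shown with a small sketch. Locking on `this` does not protect static state, because every new instance is a different monitor; locking on the class object does. The class `SharedModelHolder` and its fields are hypothetical stand-ins and do not reflect the real SentenceExtractor.

```java
/** Hypothetical sketch: guarding shared static state during construction. */
final class SharedModelHolder {
    private static Object sharedModel; // stand-in for an expensive shared resource
    private static int loadCount = 0;  // counts how many times the resource was built

    SharedModelHolder() {
        // synchronized (this) would not help here: two instances constructed
        // concurrently would each lock a different monitor.
        synchronized (SharedModelHolder.class) {
            if (sharedModel == null) {
                sharedModel = new Object();
                loadCount++;
            }
        }
    }

    static int loadCount() {
        synchronized (SharedModelHolder.class) {
            return loadCount;
        }
    }
}
```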
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor
69f88255e9
Merge pull request #101 from MarginaliaSearch/security-scan
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan 2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274 (run) Update install.sh with stronger warnings against non-docker install. 2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c (control) Fix warc data http status filter default value 2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107 (docs) Amend install instructions for non-docker install 2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d (docs) Update docs with clearer references to the full guide
The commit also mentions the non-docker install
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9 Merge branch 'master' into security-scan
# Conflicts:
#	code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7
Merge pull request #102 from jaseemabid/jabid/macos-build
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0 Specify C++ standard in build command
The default C++ language standard on macOS is gnu++98, which won't build
this module.

Full error:

```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
    [](const p64x2& fst, const p64x2& snd) {
    ^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09 Don't track build files(libcpp.so) with git 2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846 (atags) Fix duckdb SQL injection
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
2024-06-12 09:05:57 +02:00
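A validation check in the spirit of 55f3ac4846 might look like the sketch below. It assumes the configured value is a file path that ends up interpolated into a query string; that assumption, the class `SqlPathValidator`, and the allowed character set are all illustrative and not taken from the actual change.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: reject values that are unsafe to splice into SQL text. */
final class SqlPathValidator {
    // Assumed allow-list: letters, digits, underscore, dot, slash, hyphen.
    private static final Pattern SAFE_PATH = Pattern.compile("[A-Za-z0-9_./-]+");

    /** Returns the path unchanged if it is safe, otherwise throws with a clear message. */
    static String requireSafePath(String path) {
        if (path == null || !SAFE_PATH.matcher(path).matches()) {
            throw new IllegalArgumentException("Unsafe characters in configured path: " + path);
        }
        return path;
    }
}
```

An allow-list fails closed, so a misconfigured value produces an explicit error rather than a malformed query.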
Viktor Lofgren
801cf4b5da (search) Fix bad practice usage of innerHTML to set what should be text content. 2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d (build) Upgrade parquet dependencies to 1.14.0
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
a07cf1ba93 (array/cpp) Update gitignore to properly exclude libcpp.so 2024-06-06 13:06:08 +02:00
Viktor
bb06cc9ff3
Merge pull request #98 from samstorment/ThemeSwitcher
OS Independent Theme Switcher
2024-06-06 12:51:19 +02:00
Sam Storment
9c06f446fb (search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile 2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67 (search) move data-has-js attribute from body to html element 2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6 Handle theming when javascript is disabled. Hide the theme select and fall back to dark media query instead of data-theme attribute 2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf Add a theme select to the header that lets users toggle their theme independent of their OS theme 2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0 Merge remote-tracking branch 'origin/master' 2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e (crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results 2024-06-02 16:30:34 +02:00
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
206a7ce6c1 Merge remote-tracking branch 'origin/master' 2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7 (qword) Fix tests that broke due to stopword removal 2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b
Update ROADMAP.md 2024-05-24 13:57:50 +02:00
Viktor Lofgren
6985ab762a (query) Improve handling of stopwords in queries 2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b (search) Update the no result text to request bug reports. 2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f (query) Bugfix stopword issue
Add a new rule that creates an alternative path that omits a word if it's a stopword.

In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
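The rule described in 0b60411e5f can be sketched as generating a second query path with the stopwords left out, so that a stopword missing from the index no longer empties the result set. The class `StopwordPaths`, the tiny stopword set, and the list-of-lists representation are hypothetical simplifications; the real query planner is not shown in this log.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: offer an alternative query path that omits stopwords. */
final class StopwordPaths {
    // Assumed, deliberately tiny stopword set for illustration.
    static final Set<String> STOPWORDS = Set.of("the", "a", "of");

    /** Returns the original term path, plus a stopword-free path when one exists. */
    static List<List<String>> alternatives(List<String> terms) {
        List<List<String>> paths = new ArrayList<>();
        paths.add(List.copyOf(terms));

        List<String> withoutStopwords = new ArrayList<>();
        for (String term : terms) {
            if (!STOPWORDS.contains(term)) {
                withoutStopwords.add(term);
            }
        }
        // Add the alternative only if it differs from the original and is not empty.
        if (!withoutStopwords.isEmpty() && withoutStopwords.size() < terms.size()) {
            paths.add(withoutStopwords);
        }
        return paths;
    }
}
```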
Viktor Lofgren
f83f777fff (converter) Experimental support for searching by URL
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60 (*) Lift jetty and guava-dependencies 2024-05-23 14:20:01 +02:00
Viktor Lofgren
65b74f9cab (registry) Fix broken test 2024-05-23 14:15:01 +02:00