Viktor Lofgren
ab6a4b1749
(control) Correct id value for domain addition tool
2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7
(control) Add utility for adding domains from an external URL
2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5
(converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated.
2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650
(crawler) Grab favicons as part of root sniff
2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e
(control) New view for domains
...
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca
(control) New view for domains
...
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
2f38c95886
(index) Backport bugfix from term-positions branch
...
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search. This is no bueno.
This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
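Note on the TermIdsList commit above: one way to keep the caller's ordering while still supporting contains() by binary search is to sort a private copy instead of the list itself. This is a minimal sketch of that idea only; `OrderedIdList` and its methods are hypothetical names, not the actual TermIdsList code or the fix that was applied.

```java
import java.util.Arrays;

/** Hypothetical stand-in for an id list; not the actual TermIdsList class. */
final class OrderedIdList {
    private final long[] ids;        // kept in the caller's order, never reordered
    private final long[] sortedIds;  // private sorted copy, used only for lookups

    OrderedIdList(long... ids) {
        this.ids = ids.clone();
        this.sortedIds = ids.clone();
        Arrays.sort(this.sortedIds); // sorting the copy leaves the visible order untouched
    }

    long get(int index) { return ids[index]; }

    int size() { return ids.length; }

    /** Binary search over the sorted copy. */
    boolean contains(long id) { return Arrays.binarySearch(sortedIds, id) >= 0; }
}

class OrderedIdListDemo {
    public static void main(String[] args) {
        OrderedIdList list = new OrderedIdList(30L, 10L, 20L);
        System.out.println(list.get(0));         // first element as passed in
        System.out.println(list.contains(20L));  // membership test via binary search
        System.out.println(list.contains(25L));
    }
}
```

The trade-off is a second array per list; surrounding code that relies on positional order is unaffected.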
Viktor Lofgren
ac67b6b5da
(converter) Fix exception handling while reading crawl data
2024-08-02 10:39:49 +02:00
Viktor Lofgren
696fd8909d
(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones
2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172
(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones
2024-07-31 20:21:23 +02:00
Viktor Lofgren
f19148132a
(search) Restrict site-search by passing domain id along with the site:-term
...
This helps site: queries for domains that do not have a subdomain, so that they do not drag up subdomains as well, since subdomains are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
ec600b967d
(crawler) Adjust domain locking
...
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
accc598967
(crawler) Add 1 second pause after probing domain to reduce request pressure
2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba
(crawler) Add a per-domain mutex for crawling
...
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc., a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
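Note on the per-domain mutex commit above: the idea can be sketched as a map from a domain key to a semaphore, so that all subdomains of the same parent share one lock. Everything here is an assumption for illustration: `DomainLockRegistry`, `lockKey`, and the naive "last two labels" rule for finding the parent domain are hypothetical and are not taken from the crawler code.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/** Hypothetical per-domain lock registry; not the crawler's actual implementation. */
final class DomainLockRegistry {
    /** Handle that releases the permit when closed; close() throws no checked exception. */
    interface DomainLock extends AutoCloseable {
        @Override
        void close();
    }

    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();
    private final int permitsPerDomain;

    DomainLockRegistry(int permitsPerDomain) {
        this.permitsPerDomain = permitsPerDomain;
    }

    /** Naive assumption: the last two labels of the host name identify the parent domain. */
    static String lockKey(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] parts = lower.split("\\.");
        if (parts.length <= 2) {
            return lower;
        }
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    /** Blocks until a permit for the host's parent domain is available. */
    DomainLock lock(String host) throws InterruptedException {
        Semaphore semaphore = locks.computeIfAbsent(lockKey(host), k -> new Semaphore(permitsPerDomain));
        semaphore.acquire();
        return semaphore::release;
    }
}

class DomainLockDemo {
    public static void main(String[] args) throws InterruptedException {
        DomainLockRegistry registry = new DomainLockRegistry(1); // 1 permit behaves like a mutex
        try (DomainLockRegistry.DomainLock lock = registry.lock("alice.substack.com")) {
            System.out.println("crawling under key " + DomainLockRegistry.lockKey("alice.substack.com"));
        }
    }
}
```

With one permit per key this behaves like a mutex; a registry built with a larger permit count would allow a few concurrent crawls per domain instead.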
Viktor Lofgren
6665e447aa
(crawler) Add crawl delays around probe call and deal with 429:s properly during this phase
2024-07-16 15:33:24 +02:00
Viktor Lofgren
7eb955cc42
(setup) Change mirror for opennlp
...
Seems like the estointernet mirror no longer works. Use apache.org instead.
2024-07-16 15:19:13 +02:00
Viktor Lofgren
f4d79c203d
(crawler) Adjust revisit logic
...
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4
(crawler) Introduce absolute upper limit to crawl depth growth
2024-07-16 14:40:45 +02:00
Viktor Lofgren
ffd970036d
(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
...
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
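Note on the SentenceExtractor part of the commit above: synchronizing on `this` does not protect static state, because every new instance is its own monitor. A minimal sketch of the pattern, assuming a lazily initialized shared static; `SharedModelUser` and its fields are hypothetical and do not reflect the real SentenceExtractor.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical class with an expensive shared static resource; not the real SentenceExtractor. */
final class SharedModelUser {
    static final AtomicInteger loads = new AtomicInteger(); // counts how often the model is loaded
    private static Object sharedModel;                      // shared by all instances

    SharedModelUser() {
        // synchronized (this) would not help here: each instance locks a different monitor,
        // so two constructors could both see null and both load the model.
        // Locking on the class object gives all instances the same monitor.
        synchronized (SharedModelUser.class) {
            if (sharedModel == null) {
                loads.incrementAndGet();
                sharedModel = new Object(); // placeholder for an expensive load
            }
        }
    }
}

class SharedModelDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(SharedModelUser::new); // spin several instances up at once
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("model loads: " + SharedModelUser.loads.get());
    }
}
```

A static holder class or a static initializer would also work when the resource does not depend on constructor arguments.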
Viktor Lofgren
6401a513d7
(crawl) Fix onsubmit confirm dialog for single-site recrawl
2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f
(crawl) Add new functionality for re-crawling a single domain
2024-07-05 15:31:55 +02:00
Viktor
69f88255e9
Merge pull request #101 from MarginaliaSearch/security-scan
...
Address security scan findings
2024-06-17 13:18:36 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan
2024-06-17 13:18:25 +02:00
Viktor Lofgren
67703e2274
(run) Update install.sh with stronger warnings against non-docker install.
2024-06-17 13:15:15 +02:00
Viktor Lofgren
d0d6bb173c
(control) Fix warc data http status filter default value
2024-06-17 12:40:25 +02:00
Viktor Lofgren
54caf17107
(docs) Amend install instructions for non-docker install
2024-06-16 10:22:07 +02:00
Viktor Lofgren
2168b7cf7d
(docs) Update docs with clearer references to the full guide
...
The commit also mentions the non-docker install
2024-06-16 10:01:19 +02:00
Viktor Lofgren
90744433c9
Merge branch 'master' into security-scan
...
# Conflicts:
# code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Viktor
5371f078f7
Merge pull request #102 from jaseemabid/jabid/macos-build
...
Make the project buildable on macOS
2024-06-12 14:45:03 +02:00
Jaseem Abid
0dd14a4bd0
Specify C++ standard in build command
...
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09
Don't track build files (libcpp.so) with git
2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846
(atags) Fix duckdb SQL injection
...
The input comes from the config file, so this isn't a very realistic threat vector, and even if it weren't, it's a query in an empty duckdb instance; adding a validation check anyway to provide a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da
(search) Fix bad practice usage of innerHTML to set what should be text content.
2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d
(build) Upgrade parquet dependencies to 1.14.0
...
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
a07cf1ba93
(array/cpp) Update gitignore to properly exclude libcpp.so
2024-06-06 13:06:08 +02:00
Viktor
bb06cc9ff3
Merge pull request #98 from samstorment/ThemeSwitcher
...
OS Independent Theme Switcher
2024-06-06 12:51:19 +02:00
Sam Storment
9c06f446fb
(search) Styling tweaks. Make the filter button near the top right corner a bit bigger so it's easier to press on mobile
2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67
(search) move data-has-js attribute from body to html element
2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6
Handle theming when javascript is disabled. Hide the theme select and fall back to dark media query instead of data-theme attribute
2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf
Add a theme select to the header that lets users toggle their theme independent of their OS theme
2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0
Merge remote-tracking branch 'origin/master'
2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
...
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
206a7ce6c1
Merge remote-tracking branch 'origin/master'
2024-05-28 14:15:57 +02:00
Viktor Lofgren
a69ab311c7
(qword) Fix tests that broke due to stopword removal
2024-05-28 14:15:45 +02:00
Viktor
a61327fa0b
Update ROADMAP.md
2024-05-24 13:57:50 +02:00