Viktor Lofgren
ab486323f2
(converter) Increase the number of links the converter will pick up per document
2024-10-15 13:46:19 +02:00
Viktor Lofgren
fe800b3af7
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 19:04:49 +02:00
Viktor Lofgren
2a1077ff43
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:57:27 +02:00
Viktor Lofgren
01a16ff388
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:55:59 +02:00
Viktor Lofgren
eb60ddb729
(crawler) Properly enqueue links from the root document in the crawler
2024-10-05 17:49:39 +02:00
Viktor Lofgren
d84a2c183f
(*) Remove the crawl spec abstraction
...
The crawl spec abstraction was used to upload lists of domains into the system for future crawling. This was fairly clunky, and it was difficult to understand what was going to be crawled.
A while back, a new domain listing view was added to the control view that allows direct access to the domains table. This is much preferred and means the operator can directly manage domains without specs.
This commit removes the crawl spec abstraction from the code, and changes the GUI to direct to the domains list instead.
2024-10-03 13:41:17 +02:00
Viktor Lofgren
ecb5eedeae
(crawler, EXPERIMENT) Disable content type probing and use Accept header instead
...
There's reason to think this may speed up crawling quite significantly, and the benefits of the probing aren't quite there.
2024-09-30 14:53:01 +02:00
Viktor Lofgren
4565bfe359
(crawler) Make the crawler report crawling progress correctly when stopped and resumed.
2024-09-26 18:30:29 +02:00
Viktor Lofgren
e9e8580913
(converter) Fix NPE bugs in converter due to the reintroduction of CrawledDocument.headers
2024-09-25 12:18:56 +02:00
Viktor Lofgren
40512511af
(crawler) Refactor boundary between CrawlerRetreiver and HttpFetcherImpl
...
This code is still a bit too complex, but it's slowly getting better.
2024-09-24 15:08:22 +02:00
Viktor Lofgren
162fc25ebc
(minor) Fix accidental commit errors
2024-09-23 18:03:09 +02:00
Viktor Lofgren
e9854f194c
(crawler) Refactor
...
* Restructure the code to make a bit more sense
* Store full headers in crawl data
* Fix bug in Retry-After header handling that assumed the value was in milliseconds and then clamped it to a lower bound of 500ms, meaning it was almost always handled wrong
2024-09-23 17:51:07 +02:00
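The Retry-After fix in e9854f194c above can be illustrated with a minimal sketch. It assumes the header carries delta-seconds, converts that to a duration, and only then clamps it. The class name, the bounds, and the fallback value are hypothetical and not taken from the crawler code; the HTTP-date form of the header is deliberately not handled here.

```java
import java.time.Duration;

final class RetryAfterDelay {
    // Hypothetical bounds and fallback, chosen for illustration only
    static final Duration MIN_DELAY = Duration.ofMillis(500);
    static final Duration MAX_DELAY = Duration.ofSeconds(60);
    static final Duration DEFAULT_DELAY = Duration.ofSeconds(1);

    /** Interprets a Retry-After value given as delta-seconds and clamps the result. */
    static Duration parse(String headerValue) {
        if (headerValue == null) return DEFAULT_DELAY;
        try {
            // The value is in seconds; reading it as milliseconds would turn
            // "5" into 5 ms, which the lower bound then rounds up to 500 ms
            Duration delay = Duration.ofSeconds(Long.parseLong(headerValue.trim()));
            if (delay.compareTo(MIN_DELAY) < 0) return MIN_DELAY;
            if (delay.compareTo(MAX_DELAY) > 0) return MAX_DELAY;
            return delay;
        } catch (NumberFormatException e) {
            // HTTP-date values (or garbage) fall back to the default in this sketch
            return DEFAULT_DELAY;
        }
    }
}
```

With the units handled first, a server asking for a five second pause gets five seconds rather than the 500 ms floor.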
Viktor Lofgren
9c292a4f62
(doc) Fix outdated links in documentation
2024-09-22 13:56:17 +02:00
Viktor Lofgren
8047e77757
(doc) Correct dead links and stale information in the docs
2024-09-13 11:01:05 +02:00
Viktor Lofgren
2a92de29ce
(loader) Fix it so that the loader doesn't explode if it sees an invalid URL
2024-09-12 11:36:00 +02:00
Viktor Lofgren
a3b0189934
Fix build errors after merge
2024-09-08 10:22:32 +02:00
Viktor Lofgren
8f367d96f8
Merge branch 'master' into term-positions
...
# Conflicts:
# code/index/java/nu/marginalia/index/results/model/ids/TermIdList.java
# code/processes/converting-process/java/nu/marginalia/converting/ConverterMain.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/fetcher/HttpFetcherImpl.java
# code/processes/crawling-process/model/java/nu/marginalia/io/crawldata/CrawledDomainReader.java
# code/processes/crawling-process/test/nu/marginalia/crawling/HttpFetcherTest.java
# code/processes/crawling-process/test/nu/marginalia/crawling/retreival/CrawlerMockFetcherTest.java
# code/services-application/search-service/java/nu/marginalia/search/svc/SearchQueryIndexService.java
2024-09-08 10:14:43 +02:00
Viktor Lofgren
f78ef36cd4
(slop) Upgrade to 0.0.8, add encodings to string columns.
2024-09-04 15:19:00 +02:00
Viktor Lofgren
dc67c81f99
(summary) Fix a few cases where noscript tags would sometimes be used for document summary
2024-09-04 15:00:40 +02:00
Viktor Lofgren
74148c790e
(crawler) Pull additional new domains from node-affinity 0
...
Previously a bit ambiguously defined, node affinity 0 now indicates that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110
(*) Add domain parking service to ip blocklist
2024-09-01 12:53:22 +02:00
Viktor Lofgren
185b79f2a5
(converter) Fix bug where sideloaded reddit content was erroneously categorized as wiki-generated.
2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6
(crawler) Correct RSS-sitemap behavior
2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650
(crawler) Grab favicons as part of root sniff
2024-08-31 11:32:56 +02:00
Viktor Lofgren
abab5bdc8a
(index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data
2024-08-26 14:20:39 +02:00
Viktor Lofgren
b09e2dbeb7
(build) Fix dependency churn from testcontainers
...
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
266d6e4bea
(slop) Replace SlopPageRef<T> with SlopTable.Ref<T>
2024-08-21 10:13:49 +02:00
Viktor Lofgren
b0a874a842
(*) Upgrade slop library -> 0.0.5
2024-08-18 11:05:27 +02:00
Viktor Lofgren
0a383a712d
(qdebug) Accurately display positions when intersecting with spans
2024-08-15 11:44:17 +02:00
Viktor Lofgren
75b0888032
(slop) Migrate to latest Slop version
2024-08-14 11:44:35 +02:00
Viktor Lofgren
623ee5570f
(slop) Break slop out into its own repository
2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3
(keyword-extraction) Add body field for terms that are not otherwise part of a field
2024-08-13 09:49:26 +02:00
Viktor Lofgren
680ad19c7d
(keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors
2024-08-06 11:16:56 +02:00
Viktor Lofgren
2080e31616
(converter) Store link text positions
...
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends. Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
e48f52faba
(experiment) Add ad-hoc filter runner
2024-08-03 13:24:03 +02:00
Viktor Lofgren
4430a39120
(loader) Clean up
2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1
(loader) Reduce log spam
2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da
(converter) Fix exception handling while reading crawl data
2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8
(perf) Reduce DomPruningFilter hash table recalculation
2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f
(perf) Code was still spending a lot of time resolving charsets
...
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
285e657f68
Merge branch 'master' into term-positions
...
# Conflicts:
# code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
# code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
b316b55be9
(index) Experimental initial integration of document spans into index
2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7
(restructure) Clean up repo by moving stray features into converter-process and crawler-process
2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8
(converter/loader) Simplify document record writing to not require predicated reads
2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc
(converter/loader) Simplify document record writing to not require predicated reads
2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144
(slop) Support for nested array types and array-of-object types
...
Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5
(slop-models) Fix incorrect column grouping leading to errors in converter
2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f
(slop-models) Fix incorrect column grouping leading to errors in converter
2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1
(loader) Tidy up code
2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f
(loader) Move rssFeeds to a different column group to avoid errors
2024-07-28 21:30:10 +02:00
Viktor Lofgren
9685993adb
(loader) Add spans to a different column group from spanCodes, as they are not in sync
2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8
(loader) Additional tracking for the control GUI
2024-07-28 21:19:45 +02:00
Viktor Lofgren
1caad7e19e
(slop) Update existing code to use the altered Slop interfaces
2024-07-28 13:21:08 +02:00
Viktor Lofgren
6c3abff664
(slop) Move GCS Slop column to the coded-sequence package
...
This lets the slop library be stand-alone without dependence on coded-sequence.
The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308
(slop) Introduce table concept to keep track of positions and simplify closing
...
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
Viktor Lofgren
ec600b967d
(crawler) Adjust domain locking
...
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8
(wip) Extract and encode spans data
...
Refactoring keyword extraction to extract spans information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
60ef826e07
(loader) Add heartbeat to update domain-ids step
2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e
(loader) Add heartbeat to update domain-ids step
2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411
(dld) Refactor DocumentLanguageData
...
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
22b35d5d91
(sentence-extractor) Add tag information to document language data
...
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers.
The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
d36055a2d0
(keyword-extractor) Retire TfIdfHigh WordFlag
...
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
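The point of d36055a2d0 above, that eight flags fit in one byte, can be sketched as follows. The flag names are placeholders for illustration and are not the project's actual WordFlags; the only property that matters is that there are exactly eight of them, so each ordinal maps to one bit.

```java
import java.util.EnumSet;

final class FlagPacking {
    // Placeholder names (assumption); eight values means ordinals 0..7, one bit each
    enum Flag { TITLE, SUBJECT, NAME, SITE, URL_DOMAIN, URL_PATH, EXTERNAL_LINK, HEADING }

    /** Packs a set of flags into a single byte, one bit per flag ordinal. */
    static byte encode(EnumSet<Flag> flags) {
        int bits = 0;
        for (Flag f : flags) bits |= 1 << f.ordinal();
        return (byte) bits;
    }

    /** Unpacks a byte produced by encode() back into a set of flags. */
    static EnumSet<Flag> decode(byte packed) {
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        int bits = packed & 0xFF; // undo sign extension of the byte
        for (Flag f : Flag.values()) {
            if ((bits & (1 << f.ordinal())) != 0) flags.add(f);
        }
        return flags;
    }
}
```

A ninth flag would need bit 8, which no longer fits in a byte; that is why retiring one flag matters for the packed representation.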
Viktor Lofgren
accc598967
(crawler) Add 1 second pause after probing domain to reduce request pressure
2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba
(crawler) Add a per-domain mutex for crawling
...
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc., a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
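The per-domain mutex from 02c4a2d4ba above could be sketched roughly like this. It is an illustration under assumptions, not the crawler's implementation: the class and method names are hypothetical, and the "last two labels" rule for finding the shared domain is a naive stand-in that gets suffixes like co.uk wrong.

```java
import java.util.Locale;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

final class DomainLocks {
    private final ConcurrentHashMap<String, Semaphore> locks = new ConcurrentHashMap<>();

    /** Naive assumption: the shared domain is the last two labels of the host name. */
    static String topDomain(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        String[] parts = lower.split("\\.");
        if (parts.length <= 2) return lower;
        return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }

    /** All subdomains of the same top domain share one semaphore with a single permit. */
    Semaphore lockFor(String host) {
        return locks.computeIfAbsent(topDomain(host), k -> new Semaphore(1));
    }

    /** Runs the task while holding the lock for the host's top domain. */
    void crawl(String host, Runnable task) throws InterruptedException {
        Semaphore lock = lockFor(host);
        lock.acquire();
        try {
            task.run();
        } finally {
            lock.release();
        }
    }
}
```

Using a semaphore rather than a plain lock makes it straightforward to hand out more than one permit to selected domains, which is the kind of adjustment ec600b967d later describes for large hosting sites.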
Viktor Lofgren
6665e447aa
(crawler) Add crawl delays around probe call and deal with 429:s properly during this phase
2024-07-16 15:33:24 +02:00
Viktor Lofgren
f4d79c203d
(crawler) Adjust revisit logic
...
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4
(crawler) Introduce absolute upper limit to crawl depth growth
2024-07-16 14:40:45 +02:00
Viktor Lofgren
0b31c4cfbb
(coded-sequence) Replace GCS usage with an interface
2024-07-16 14:37:50 +02:00
Viktor
8ed5b51a32
Merge branch 'master' into term-positions
2024-07-15 07:05:31 +02:00
Viktor Lofgren
1ab875a75d
(test) Correcting flaky tests
...
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
85c99ae808
(index-reverse) Split index construction into separate packages for full and priority index
2024-07-06 15:44:47 +02:00
Viktor Lofgren
d86926be5f
(crawl) Add new functionality for re-crawling a single domain
2024-07-05 15:31:55 +02:00
Viktor Lofgren
6ee4d1eb90
(keyword) Increase the work area for position encoding
...
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
dae22ccbe0
(test) Integration test from crawl->query
2024-06-25 22:17:26 +02:00
Viktor Lofgren
0ffbbaf4b9
(crawler) Update WARC builder to use SHA-256 for digests
2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b
(crawler) Fetch TLS instead of SSL context
2024-06-12 09:07:54 +02:00
Viktor Lofgren
36160988e2
(index) Integrate positions data with indexes WIP
...
This change integrates the new positions data with the forward and reverse indexes.
The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d
(index) Integrate positions file properly
2024-06-06 16:45:42 +02:00
Viktor Lofgren
4a8afa6b9f
(index, WIP) Position data partially integrated with forward and reverse indexes.
...
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Viktor Lofgren
b4eac2516e
(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results
2024-06-02 16:30:34 +02:00
Viktor Lofgren
9b922af075
(converter) Amend existing modifications to use gamma coded positions lists
...
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
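For readers unfamiliar with gamma coding as mentioned in 9b922af075 above, here is a small sketch of Elias gamma coding applied to a sorted position list. It is an illustration only: the class name is hypothetical, the bits are kept in a String for readability rather than packed into bytes, and delta-coding the positions before gamma-coding them is an assumption about how such lists are typically stored, not a statement about the converter's actual format.

```java
import java.util.ArrayList;
import java.util.List;

final class GammaPositions {
    /** Encodes strictly increasing positions (>= 1) as Elias gamma codes of their gaps. */
    static String encode(int[] positions) {
        StringBuilder bits = new StringBuilder();
        int prev = 0;
        for (int pos : positions) {
            int delta = pos - prev;
            if (delta < 1) throw new IllegalArgumentException("positions must be >= 1 and strictly increasing");
            String binary = Integer.toBinaryString(delta);
            // Gamma code: (bit length - 1) zeros, followed by the binary digits themselves
            bits.append("0".repeat(binary.length() - 1)).append(binary);
            prev = pos;
        }
        return bits.toString();
    }

    /** Decodes a bit string produced by encode() back into absolute positions. */
    static List<Integer> decode(String bits) {
        List<Integer> positions = new ArrayList<>();
        int i = 0;
        int prev = 0;
        while (i < bits.length()) {
            int zeros = 0;
            while (bits.charAt(i) == '0') { zeros++; i++; }
            // The next zeros + 1 bits hold the gap, starting with its leading 1
            prev += Integer.parseInt(bits.substring(i, i + zeros + 1), 2);
            i += zeros + 1;
            positions.add(prev);
        }
        return positions;
    }
}
```

Small gaps get short codes (a gap of 1 costs a single bit), which suits term positions that tend to cluster.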
Viktor Lofgren
619392edf9
(keywords) Add position information to keywords
2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68
(converter) Add position information to serialized document data
...
This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
f83f777fff
(converter) Experimental support for searching by URL
...
Add up to 128 synthetic keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00
Viktor Lofgren
89aae93e60
(*) Lift jetty and guava-dependencies
2024-05-23 14:20:01 +02:00
Viktor Lofgren
d12c77305c
(btree) Clean up code
2024-05-18 18:03:17 +02:00
Viktor Lofgren
b867eadbef
(big-string) Remove the unused bigstring library
2024-05-18 13:40:03 +02:00
Viktor Lofgren
38aedb50ac
(converter) Do not suppress exceptions in the converter
2024-04-30 18:24:35 +02:00
Viktor Lofgren
70e2e41955
(crawler) Content type prober should not swallow exceptions
2024-04-27 18:27:23 +02:00
Viktor Lofgren
4d71c776fc
(crawler) Modify crawl set growth to grow small domains faster than larger ones
2024-04-27 17:36:27 +02:00
Viktor Lofgren
7eb5e6aa66
(crawler) Abort recrawl if error count is too high
2024-04-24 21:46:40 +02:00
Viktor Lofgren
8b9629f2f6
(crawler) Remove unnecessary double-fetch of the root document
2024-04-24 14:38:59 +02:00
Viktor Lofgren
f6db16b313
(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber
2024-04-24 14:10:03 +02:00
Viktor Lofgren
4668b1ddcb
(build) Java 22 and its consequences has been a disaster for Marginalia Search
...
Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle
The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.
2024-04-24 13:54:04 +02:00
Viktor Lofgren
dcf9d9caad
(crawler) Emulate if-modified-since for domains that don't support the header
...
This will help reduce the strain on some server software, in particular Discourse.
2024-04-22 17:26:31 +02:00
Viktor Lofgren
7a69b76001
(crawler) Remove accidental log spam
2024-04-22 15:51:37 +02:00
Viktor Lofgren
ac07ef822f
(crawler) Code quality
2024-04-22 15:37:35 +02:00
Viktor Lofgren
e7d4bcd872
(crawler) Use the probe-result to reduce the likelihood of crawling both http and https
...
This should drastically reduce the number of fetched documents on many domains
2024-04-22 15:36:43 +02:00
Viktor Lofgren
a28c6d7cfe
(crawler) Strip W/-prefix from the etag when supplied as If-None-Match
2024-04-22 14:31:05 +02:00
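The change in a28c6d7cfe above amounts to dropping the weak-validator marker before echoing a stored ETag back in If-None-Match. A minimal sketch, with a hypothetical class and method name that are not taken from the crawler code:

```java
final class ETagUtil {
    /** Removes a leading W/ weak-validator marker, leaving the quoted tag untouched. */
    static String stripWeakPrefix(String etag) {
        if (etag == null) return null;
        String trimmed = etag.trim();
        return trimmed.startsWith("W/") ? trimmed.substring(2) : trimmed;
    }
}
```

For example, a stored value of W/"abc123" would be sent as "abc123", while a strong tag passes through unchanged.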