Commit Graph

333 Commits

Author SHA1 Message Date
Viktor Lofgren
fbdedf53de Fix bug in CrawlerRetreiver
... where the root URL wasn't always added properly to the front of the crawl queue.
2023-06-27 15:50:38 +02:00
Viktor Lofgren
d167ad2017 Remove sitemap related log spam 2023-06-27 13:59:47 +02:00
Viktor Lofgren
f8f9f04158 Specialized logic for processing Lemmy-based websites. 2023-06-27 10:57:54 +02:00
Viktor Lofgren
b0c7480d06 Set default timeouts for java.net.URL-connections 2023-06-27 10:57:54 +02:00
Viktor Lofgren
e7af77e151 Tests for crawler specialization + testdata 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ec940e36d0 Sitemap support, refined crawler specialization 2023-06-27 10:57:54 +02:00
Viktor Lofgren
ed373eef61 Refactor crawler and add special logic for some platforms
* Break apart CrawlerRetreiver
* Break apart HttpFetcher into an interface and impl for testing sanity
* Add special logic for Lemmy, Mediawiki and Discourse to not waste requests on paths that aren't interesting.
2023-06-27 10:57:54 +02:00
Viktor Lofgren
bd2c3855ed Add bits and keywords for generator classes (docs, forum, wiki). 2023-06-23 21:35:28 +02:00
Viktor Lofgren
b5ef67ed28 Categorize generators by type
This is a great quality signal!
Add the type as document bitflags by category.
2023-06-22 16:04:37 +02:00
Viktor Lofgren
f140e7d7c7 Use a default tag for unset or invalid generators. 2023-06-21 17:30:14 +02:00
Viktor Lofgren
a9a2960e86 New synthetic keyword for document generator meta tag. 2023-06-20 16:25:49 +02:00
Viktor Lofgren
7326ba74fe Tweaks to pub date heuristics to make it mostly get the 'historyofphilosophy.net' case right.
Use HTML standard for plausibility checks in the more guesswork-like heuristics. Added more class names to look for date strings.
2023-06-20 14:15:05 +02:00
Viktor Lofgren
67c15a34e6 Reduce the amount of expensive operations in HtmlDocumentProcessorPlugin. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
266ad2e4de Re-introduce monkey patched GSON to make converter run better.
fixup! Re-introduce monkey patched GSON to make converter run better.

fixup! Re-introduce monkey patched GSON to make converter run better.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
e4372289a5 Use fixed buffers for BigString compression and decompression to reduce GC churn.
fixup! Use fixed buffers for BigString compression and decompression to reduce GC churn.
2023-06-19 17:58:19 +02:00
Viktor Lofgren
44b1fe0e6d Move list-conversion into getDescription method. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
88399e30e2 Consider keyword relevance signals when creating the document summary using the DOM walker. 2023-06-19 17:58:19 +02:00
Viktor Lofgren
eb2ca942d5 Up the default crawl delay to 1 second. 2023-06-07 22:02:17 +02:00
Viktor Lofgren
e332faa07e Fix test that broke when memex.marginalia.nu started redirecting to www.marginalia.nu. 2023-05-28 13:46:24 +02:00
Viktor Lofgren
a9f7b4c457 Add synthetic keywords for same-site files linked from a document (e.g. file:png). Also add category keywords, like file:image or file:document. 2023-04-30 19:29:13 +02:00
Viktor Lofgren
1e3b6934bb Reduce log noise during loading. Bad URLs don't need to be loaded, they can be grepped from the instructions. 2023-04-30 18:36:44 +02:00
Viktor
112f43b3a1
Api service response cache (#16)
* Add response caching to the API service to help SearXNG

* Clean up the code a bit.

* Add an endpoint without a terminating slash for getLicense.

* Add tests for API service.
2023-04-22 15:42:32 +02:00
Viktor Lofgren
d42ab19166 Issue 5: Fix bug where some IPv6 addresses blew up domain loading. 2023-04-15 14:11:08 +02:00
Viktor Lofgren
2ab26f37b8 Bug fix for document metadata encoding that breaks year based queries. 2023-04-14 16:56:49 +02:00
Viktor Lofgren
cc4e089a5d Consider average sentence length when selecting search results. This promotes proses over code listings, tabular data, etc. 2023-03-30 15:46:15 +02:00
Viktor Lofgren
4d05be4095 Refactor InternalLinkGraph 2023-03-30 15:44:23 +02:00
Viktor Lofgren
8f51345a1d Add experiment runner tool and got rid of experiments module in processes. 2023-03-28 16:58:46 +02:00
Viktor Lofgren
03bd892b95 Improve document processing in conversion.
* Add flags for long and short documents.
* Break out common length logic from plugins.
* Cleaning up of related code.
2023-03-28 16:38:00 +02:00
Viktor
ac1ac3ea57
Move database to a separate module
* Move database to a separate project, break apart sql file into separate entities.
* Fix front page news listing.
2023-03-25 15:26:17 +01:00
Viktor Lofgren
ca22c287a5 Make use of DocumentFlags' flags 2023-03-21 16:03:15 +01:00
Viktor Lofgren
2eb972dea1 Remove unrelated code, break tools into their own directory. 2023-03-17 16:03:11 +01:00
Viktor Lofgren
449471a076 Yet more restructuring. Improved search result ranking. 2023-03-16 21:35:54 +01:00
Viktor Lofgren
d82532b7f1 More restructuring, big bug fixes in keyword extraction. 2023-03-13 17:39:53 +01:00