Commit Graph

1390 Commits

Author SHA1 Message Date
Viktor Lofgren
b15f47d80e (db) Retire the EC_DOMAIN_LINK table
Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.
2024-02-08 15:52:30 +01:00
Viktor Lofgren
ef261cbbd7 (search) Remove stray spaces in bang commands 2024-02-08 14:46:18 +01:00
Conor Flynn
9d7df87886 (search) Fix broken !ddg handling
https://duckduckgo.com/search?q=asdf leads to running a search for the term "search" instead of "asdf".

Both https://duckduckgo.com/<query> and https://duckduckgo.com/?q=<query> are accepted, but using GET vars seemed more in keeping with the code.
2024-02-08 13:28:02 +01:00
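The fix in 9d7df87886 comes down to putting the query in the `q` GET variable. A minimal sketch of that idea follows; `DdgBangSketch` and `redirectUrl` are hypothetical names for illustration, not the project's actual bang-command code.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not the project's actual bang-command handling.
final class DdgBangSketch {
    // Builds https://duckduckgo.com/?q=<query>, form-encoding the query value.
    static String redirectUrl(String query) {
        return "https://duckduckgo.com/?q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(redirectUrl("asdf"));
    }
}
```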
Viktor Lofgren
a4b2323ca3 (search) Change default search profile to No Filter
Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.
2024-02-08 13:04:05 +01:00
Viktor
e8de468b0b Make executor API talk GRPC (#75)
* (executor-api) Make executor API talk GRPC

The executor's REST API was very fragile and annoying to work with, lacking even basic type safety.  Migrate to use GRPC instead.  GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil.  This is a fairly straightforward change, but it's also large so a solid round of testing is needed...

The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients.

ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name().

The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.
2024-02-08 13:01:12 +01:00
Viktor Lofgren
d83a3bf4e2 (search) Fix broken !w handling
Printf format error derp.
2024-02-08 12:11:33 +01:00
Viktor Lofgren
f2b39ad055 (search) Fix broken !bang handling
!bang query handling seems to have fallen victim to an overzealous refactoring effort, and broke.

It's now repaired, and a test is in place to ensure we know if it breaks again.
2024-02-08 12:05:09 +01:00
Viktor Lofgren
95d1bd98e4 (array) Update documentation, make unsafe configurable
The readme for the array library was extremely out of date.  Updating it with accurate information about how the library works, and a demo that should compile.

Also added a system property for disabling the use of sun.misc.Unsafe.
2024-02-07 12:26:47 +01:00
Viktor Lofgren
8acbc6a6b4 (index-construction) Split repartition into two actions cont'd
Continues 467ba5be20 by breaking out a constant with the name of the primary ranking set.  Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.
2024-02-06 19:54:17 +01:00
Viktor Lofgren
467ba5be20 (index-construction) Split repartition into two actions
This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets.  Since only the first is necessary before the index construction, the rest can be delayed until after...

To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one.

The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader.

Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data.

Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead.

To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.
2024-02-06 17:20:07 +01:00
Viktor Lofgren
29ddf9e61d (doc) Update docs 2024-02-06 16:29:55 +01:00
Viktor Lofgren
92e119cab3 (doc) Update docs 2024-02-06 12:43:42 +01:00
Viktor Lofgren
92049ba8e4 (doc) Update docs 2024-02-06 12:41:28 +01:00
Viktor Lofgren
54330b9921 (*) Remove dead code 2024-02-06 12:41:13 +01:00
Viktor Lofgren
d1aeb030f2 (doc) Update RandomWriteFunnel documentation 2024-02-06 12:35:24 +01:00
Viktor Lofgren
f89274d1ea (minor) Fix broken test
Fallout from changes in endianness made in d986f90074
2024-02-06 12:12:26 +01:00
Viktor Lofgren
7286596fb4 (deps) Remove monkey patched GSON
The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data.

Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.
2024-02-06 12:11:39 +01:00
Viktor Lofgren
a2fc83d94e (control) Add configurable border styling
To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.
2024-02-06 12:05:02 +01:00
Viktor Lofgren
2161799cc3 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:18:00 +01:00
Viktor Lofgren
c88f132057 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:10:03 +01:00
Viktor Lofgren
c6313a5906 (sideload) Fix filename error in dealing with stackoverflow files 2024-02-06 11:06:36 +01:00
Viktor Lofgren
eadcdb5bed (minor) Improve error handling, naming, logging in IndexResultDecorator 2024-02-05 21:05:44 +01:00
Viktor Lofgren
6e7649b5f7 (loader) Mitigate fragile paging behavior
IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written.

Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile.  The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers.

The change in 6dcc20038c was reverted, as there is really no sane reason to have this configurable in software.
2024-02-05 21:05:03 +01:00
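Paging on bytes written rather than on entry count, as described for 6e7649b5f7, might look roughly like this. It is a minimal sketch: `BytePagingSketch`, its limit parameter, and the exact rollover rule are assumptions, not taken from IndexJournalWriterPagingImpl.

```java
// Hypothetical sketch; not the actual IndexJournalWriterPagingImpl.
final class BytePagingSketch {
    private final long maxBytesPerPage; // assumed limit, chosen by the caller
    private long bytesInCurrentPage = 0;
    private int pageNumber = 0;

    BytePagingSketch(long maxBytesPerPage) {
        this.maxBytesPerPage = maxBytesPerPage;
    }

    // Records an entry of the given (equivalent uncompressed) size and returns
    // the page it belongs to. A new page is started when the entry would push
    // a non-empty page past the limit.
    int pageFor(long uncompressedEntryBytes) {
        if (bytesInCurrentPage > 0
                && bytesInCurrentPage + uncompressedEntryBytes > maxBytesPerPage) {
            pageNumber++;
            bytesInCurrentPage = 0;
        }
        bytesInCurrentPage += uncompressedEntryBytes;
        return pageNumber;
    }
}
```

Because the decision depends on accumulated bytes, a run of unusually large entries rolls over to a new file sooner instead of silently growing one file past the safe size.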
Viktor Lofgren
d986f90074 (index) Fix consistency between RandomFileAssembler implementations
The RandomFileAssembler implementations, introduced in commit 53c575db3f, were all acting subtly differently.  The RWF implementation wrote big-endian longs instead of the native endianness used by the other implementations (and expected by the index construction code).  Further, the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary.

A test was built to ensure the output of these implementations is equivalent.
2024-02-05 21:01:32 +01:00
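The endianness mismatch fixed in d986f90074 is easy to reproduce with plain `java.nio`. The sketch below is illustrative only (`EndiannessSketch` is a hypothetical name): it shows how the same long decodes to a different value when written in one byte order and read in another.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only; not code from the RandomFileAssembler implementations.
final class EndiannessSketch {
    // Serializes a long using the given byte order.
    static byte[] encode(long value, ByteOrder order) {
        return ByteBuffer.allocate(Long.BYTES).order(order).putLong(value).array();
    }

    // Deserializes a long using the given byte order.
    static long decode(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getLong();
    }

    public static void main(String[] args) {
        byte[] written = encode(1L, ByteOrder.BIG_ENDIAN);
        // Reading with the wrong order yields 1L << 56 instead of 1.
        System.out.println(decode(written, ByteOrder.LITTLE_ENDIAN));
        // ByteOrder.nativeOrder() is what native-endian readers would use.
        System.out.println(ByteOrder.nativeOrder());
    }
}
```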
Viktor Lofgren
53c575db3f (index-construction) Make random-write file strategy configurable
To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing.  By default, the data is just buffered in RAM.  This works well on a large server, but smaller systems struggle.

To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true.  RandomWriteFunnel buffers the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time.  This is relatively slow and uses more than twice the disk size.

A new interface RandomFileAssembler is introduced as an abstraction for this operation.  A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB).  In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.
2024-02-05 12:31:15 +01:00
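The three strategies from 53c575db3f could be selected along these lines. This is a sketch under assumptions: the enum, the method names, the precedence between the checks, and reading "1 GB" as 2^30 bytes are not confirmed by the commit message. Only the property name 'system.conserve-memory' comes from it.

```java
// Hypothetical selection logic; the real RandomFileAssembler factory may differ.
final class AssemblerStrategySketch {
    enum Strategy { DIRECT_MMAP, IN_MEMORY_BUFFER, RANDOM_WRITE_FUNNEL }

    // Assumption: the small-file check takes precedence over the memory flag.
    static Strategy choose(long fileSizeBytes, boolean conserveMemory) {
        final long oneGb = 1L << 30; // assumption: "1 GB" read as 2^30 bytes
        if (fileSizeBytes < oneGb) return Strategy.DIRECT_MMAP;
        if (conserveMemory) return Strategy.RANDOM_WRITE_FUNNEL;
        return Strategy.IN_MEMORY_BUFFER;
    }

    // Reads the flag from the system property named in the commit message.
    static Strategy choose(long fileSizeBytes) {
        return choose(fileSizeBytes, Boolean.getBoolean("system.conserve-memory"));
    }
}
```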
Viktor Lofgren
6dcc20038c (index-journal) Make index journal page size configurable
Adds a new system property loader.journal-page-size to configure this setting.
2024-02-05 11:26:05 +01:00
Viktor Lofgren
fa145f632b (sideload) Add special handling for sideloaded wiki documents
This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which generally tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.
2024-02-02 21:22:07 +01:00
Viktor Lofgren
785d8deadd (crawler) Improve meta-tag redirect handling, add tests for redirects.
Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file.  This works as intended.

Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier.  Added logic to handle this case, amended the test case to verify the new behavior.  Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.
2024-02-01 20:30:43 +01:00
Viktor Lofgren
93a2d5afbf (*) Fix poorly named test
Likely old refactoring gore.
2024-02-01 20:08:15 +01:00
Viktor Lofgren
d60c6b18d4 (doc) Update the readmes for the crawler, as they've grown stale. 2024-02-01 18:10:55 +01:00
Viktor Lofgren
d1e02569f4 (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:33 +01:00
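The value mapping listed for `system.languageDetectionModelVersion` can be sketched as below. The enum and method names are hypothetical, and two details are assumptions since the commit message does not specify them: the default when the property is unset (0 here) and the handling of values above 2 (rejected here).

```java
// Hypothetical mapping of the flag values; not the project's actual code.
final class LanguageModelFlagSketch {
    enum Mode { NONE, BOTH, OLD_MODEL, FASTTEXT }

    static Mode fromVersion(int version) {
        if (version < 0) return Mode.NONE;       // negative: no model
        if (version == 0) return Mode.BOTH;      // 0: both models
        if (version == 1) return Mode.OLD_MODEL; // 1: the old model
        if (version == 2) return Mode.FASTTEXT;  // 2: the new fasttext model
        // Assumption: values above 2 are not described, so reject them.
        throw new IllegalArgumentException("Unsupported model version: " + version);
    }

    // Assumption: an unset property is treated as 0.
    static Mode fromSystemProperty() {
        return fromVersion(Integer.getInteger("system.languageDetectionModelVersion", 0));
    }
}
```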
Viktor Lofgren
9ce67029ca (language-processing) Add a system property for configuring which language detection model to use
The flag is `system.languageDetectionModelVersion`.

* If negative, no model is used.
* If 0, both models are used.
* If 1, the old crappy model is used.
* If 2, the new fasttext model is used.
2024-01-31 13:02:16 +01:00
Viktor Lofgren
98f3382cea (minor) Fix test and improve error message 2024-01-31 11:53:41 +01:00
Viktor Lofgren
52a0255814 (*) Add flag for disabling ASCII flattening
The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this.  This assumption holds poorly in the wild.

Adding an **experimental** system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior.

IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.
2024-01-31 11:50:59 +01:00
Viktor Lofgren
eb59ac8535 (index-ranking) Adjust the BM25P factors a bit
Since the bleed-flags set by the anchor tag logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink.

UrlDomain and UrlPath are also more consistently rewarded only once.
2024-01-30 21:27:29 +01:00
Viktor Lofgren
6edc318597 (control) Fix typo in URL linking to new-crawl-specs 2024-01-26 10:43:10 +01:00
Viktor Lofgren
182c0cf28e (control) Add warnings about domain data contamination 2024-01-25 18:26:15 +01:00
Viktor Lofgren
0b105b5986 (converter) Update hyperlink text for new crawl spec creation.
Fix minor typo.
2024-01-25 18:05:11 +01:00
Viktor Lofgren
cae1bad274 (*) Add download-sample action, refactor file storage
This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu.

It also refactors some leaky abstractions out of FileStorageService.  allocateTemporaryStorage has been renamed allocateStorage.  The storage was never temporary in any scenario...

It also doesn't take a storage base, as there was always only one valid option for this input.  The allocateStorage method finds the appropriate base itself.
2024-01-25 13:36:30 +01:00
Viktor Lofgren
1b8b97b8ec (sample-exporter) Add some limits on sizes and lengths
Tar files will reject entries with filenames over 100 bytes, so we need a limit there.  Also added a maximum size limit to keep the file sizes reasonable.
2024-01-25 11:51:53 +01:00
Viktor Lofgren
c088c25b09 (*) Fix broken test, clean up code 2024-01-24 12:50:41 +01:00
Viktor Lofgren
958d64720e (control) Add a view for restarting aborted processes
This will avoid having to dig in the message queue to perform this relatively common task.

The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.
2024-01-24 12:47:10 +01:00
Viktor Lofgren
805afad4fe (control) New GUI for exporting crawl data samples
Not going to win any beauty pageants, but this is pretty peripheral functionality.
2024-01-23 17:08:21 +01:00
Viktor Lofgren
400f4840ad (*) Fix broken code in jmh 2024-01-23 17:08:21 +01:00
Viktor Lofgren
ee7792596d (*) Fix broken test
Probably shouldn't have tests depending on external data like this...
2024-01-23 12:03:47 +01:00
Viktor Lofgren
0081328aca (converter) Adjust which flags are set by anchor text keywords
It's a mistake to let it bleed into Title, as this is a high quality signal.  We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.
2024-01-23 11:54:00 +01:00
Viktor Lofgren
3fff7f6878 (converter) Fix issue where quality limits were no longer enforced 2024-01-23 11:42:17 +01:00
Viktor Lofgren
f15dd06473 (index) Delayed close() of SearchIndexReader
This avoids concurrent access errors.  This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file.  If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV.  Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it.

So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up.  This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers.

Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.
2024-01-23 11:08:41 +01:00
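The delayed close described for f15dd06473 can be sketched as a background thread that sleeps before releasing the resource. `DelayedCloseSketch` and `closeLater` are hypothetical names, and the delay is a parameter here (the commit message mentions a minute); this is not the actual SearchIndexReader code.

```java
import java.time.Duration;

// Hypothetical sketch of a sleep-then-close helper.
final class DelayedCloseSketch {
    // Starts a daemon thread that waits for the delay and then closes the resource,
    // giving in-flight readers time to finish without adding locks.
    static Thread closeLater(AutoCloseable resource, Duration delay) {
        Thread t = new Thread(() -> {
            try {
                Thread.sleep(delay.toMillis());
                resource.close();
            } catch (InterruptedException e) {
                // If interrupted, the close is skipped; restore the interrupt flag.
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                System.err.println("Delayed close failed: " + e);
            }
        }, "delayed-close");
        t.setDaemon(true);
        t.start();
        return t;
    }
}
```

A call such as `closeLater(reader, Duration.ofMinutes(1))` returns immediately, and a lost close only leaves mapped data around until the process exits.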
Viktor Lofgren
dd26819d66 (actor) Try to fix rare data race where a finished job is considered dead. 2024-01-22 21:22:38 +01:00
Viktor Lofgren
a6d257df5b (converter) Update Stackexchange sideload instruction
The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.
2024-01-22 18:29:20 +01:00
Viktor Lofgren
41d896ba3e (converter) Refactor content type check in PlainTextDocumentProcessorPlugin
The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate contentTypes that append a charset as well.
2024-01-22 17:52:14 +01:00
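The content type check from 41d896ba3e might look roughly like this. It is a sketch, not the plugin's actual method body: the class name is hypothetical, and the null check, trimming, and lower-casing are added assumptions.

```java
import java.util.Locale;

// Hypothetical stand-in for the check in PlainTextDocumentProcessorPlugin.
final class PlainTextContentTypeSketch {
    // Accepts "text/plain" and "text/plain;<parameters>", e.g. with a charset.
    static boolean isApplicable(String contentType) {
        if (contentType == null) return false; // assumption: null is not applicable
        String ct = contentType.trim().toLowerCase(Locale.ROOT); // assumption: case-insensitive
        return ct.equals("text/plain") || ct.startsWith("text/plain;");
    }
}
```

Matching on the `text/plain;` prefix, rather than just `text/plain`, avoids accepting unrelated types that merely share the first characters.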
Viktor Lofgren
51cdf46645 (control) Improve accessibility in search-to-ban template
This update enhances accessibility by associating labels with the corresponding checkboxes in the search-to-ban template.
2024-01-22 15:01:00 +01:00
Viktor Lofgren
1eb0adf6d3 (array) Add sun.misc.Unsafe variant of LongArray 2024-01-22 13:38:42 +01:00
Viktor Lofgren
40c9d2050f (control) Fully automatic conversion
Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine.

Removed the tool itself.

This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency.  This has been fixed, and :third-party:xz was removed.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
3a325845c7 (mq) Add better error handling in fsm and mq
java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs.  These are now caught, acted on, and re-thrown.

MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.
2024-01-22 13:03:24 +01:00
Viktor Lofgren
6a1bfd6270 (array) Remove unused 'madvise' code and 3rd party dependency on 'uppend'
This wasn't actually hooked in anywhere.  Removing the dependency and code.  If it turns out we need madvise in the future, we'll re-introduce it.
2024-01-22 13:01:57 +01:00
Viktor Lofgren
b91ea1d7ca (control) Re-add gui for sideloading dirtrees 2024-01-20 18:09:40 +01:00
Viktor Lofgren
c5760cd535 (test) Fix broken test 2024-01-20 13:39:40 +01:00
Viktor Lofgren
91c7960800 (crawler) Extract additional configuration properties
This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties.

The documentation is updated to reflect the change.

Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.
2024-01-20 10:36:04 +01:00
Viktor Lofgren
2079a5574b (control) Update heading in restore backup template
Changed the heading in the partial restore backup page from "Load" to "Restore Backup".
2024-01-19 21:46:53 +01:00
Viktor Lofgren
27ffb8fa8a (converter) Integrate zim->db conversion into automatic encyclopedia processing workflow
Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file.  This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically.

The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.
2024-01-19 13:59:03 +01:00
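A stable output name of the kind described for 27ffb8fa8a could be derived as follows. This is a sketch under assumptions: the commit message does not say what the crc32 is computed over or how the parts are joined, so hashing the full input path and using dots as separators are illustrative choices, and `DbFileNameSketch` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical naming scheme; the real converter may hash something else.
final class DbFileNameSketch {
    // Returns "<original filename>.<crc32 hex>.db".
    static String dbNameFor(String zimPath) {
        CRC32 crc = new CRC32();
        // Assumption: the hash input is the full path string.
        crc.update(zimPath.getBytes(StandardCharsets.UTF_8));
        String baseName = Path.of(zimPath).getFileName().toString();
        return baseName + "." + Long.toHexString(crc.getValue()) + ".db";
    }
}
```

Because the name is a pure function of the input, a repeat load of the same file maps to the same .db file and can reuse it.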
Viktor Lofgren
22c8fb3f59 (crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified
This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity.  This can be removed in a few months.
2024-01-18 16:02:27 +01:00
Viktor Lofgren
964419803a Fix broken test 2024-01-18 15:42:01 +01:00
Viktor Lofgren
6271d5d544 (mq) Add relation tracking between MQ messages for easier tracking and debugging.
The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID.  This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers.

The existing RELATED_ID field has too many semantics associated with it;
among other things, the FSM code uses this field in tracking state changes.

The change set also improves the consistency of inbox names.  The IndexClient was buggy and populated its outbox with a UUID.  This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.
2024-01-18 15:08:27 +01:00
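The thread-to-message bookkeeping described for 6271d5d544 can be sketched with a concurrent map. All names below are hypothetical, and returning -1 when there is no related message is an assumption; the actual inbox and outbox classes are not shown in the commit message.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of mapping thread IDs to the message being handled.
final class AuditRelationSketch {
    private final Map<Long, Long> messageIdByThread = new ConcurrentHashMap<>();

    // Called by an inbox handler: registers the message for the current thread
    // while the handler runs, then clears it again.
    void handle(long messageId, Runnable handler) {
        long threadId = Thread.currentThread().getId();
        messageIdByThread.put(threadId, messageId);
        try {
            handler.run();
        } finally {
            messageIdByThread.remove(threadId);
        }
    }

    // Called when sending a message: the value to store in AUDIT_RELATED_ID,
    // or -1 (assumption) when the thread is not handling any message.
    long auditRelatedId() {
        return messageIdByThread.getOrDefault(Thread.currentThread().getId(), -1L);
    }
}
```

Any message sent from within a handler picks up the ID of the message being handled, without the handler passing it along explicitly.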
Viktor Lofgren
175bd310f5 (control) Message queue UX improvements 2024-01-18 13:05:50 +01:00
Viktor Lofgren
67ee6f4126 (control) Clean up filtering UX in Events table 2024-01-18 12:35:39 +01:00
Viktor Lofgren
01b312f14c (*) Make new index nodes accept queries by default
It's a confusing default behavior.

This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors.  This has been fixed now, so there's no need to do this anymore!
2024-01-18 12:05:37 +01:00
Viktor Lofgren
18638c62de (control) Rephrase text 2024-01-18 11:53:10 +01:00
Viktor Lofgren
753d000788 (control) Add toggle for automatic loading of processed data 2024-01-18 11:52:58 +01:00
Viktor Lofgren
19e781b104 (control) Add basic input validation to node actions
Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.
2024-01-18 11:52:49 +01:00
Viktor Lofgren
aa2df327db (index) Prevent index from attempting to answer queries when no index data is loaded
This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.
2024-01-18 11:05:45 +01:00
Viktor Lofgren
321fa94b8f (crawler) Fix rare exception in content type handling due to improper length checking of a split() array 2024-01-18 11:05:45 +01:00
Viktor Lofgren
41cdb8f71b (control) Fix broken update button in the update-domain-ranking-set form
id property was on the wrong element.
2024-01-17 18:21:09 +01:00
Viktor Lofgren
304d4c9acf (control) Fix result ordering in the file storage listing view
In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order.  Added a sort() operation to mitigate this.
2024-01-17 10:56:30 +01:00
Viktor Lofgren
7fd4c092e3 (control) Clean up UX and accessibility for new domain ranking sets.
The change also adds basic support for error messages in the GUI.
2024-01-17 10:47:14 +01:00
Viktor Lofgren
2fe5705542 (control) GUI for ranking sets
Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.
2024-01-16 17:10:09 +01:00
Viktor Lofgren
e968365858 (index) Use new DomainRankingSets to configure ranking algos in index svc 2024-01-16 12:43:32 +01:00
Viktor Lofgren
36ad4c7466 (db) Add a new configuration object 'domain ranking set' for storing ranking parameters 2024-01-16 12:34:00 +01:00
Viktor Lofgren
5a62b3058f (query-api) Make the search set identifier a string value in the API
This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.
2024-01-16 10:55:24 +01:00
Viktor Lofgren
a1df9e886a (control) Also clean up stale 'NEW' messages 2024-01-15 16:14:02 +01:00
Viktor Lofgren
fd1eec99b5 (cleanup) Fix broken tests 2024-01-15 15:44:33 +01:00
Viktor Lofgren
e162406d40 (control) New control-side actors for cleaning up stale service heartbeats and message queue entries 2024-01-15 15:44:23 +01:00
Viktor Lofgren
c41e68aaab (control) New export actions for RSS/Atom feeds and term frequency data
This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.
2024-01-15 14:54:26 +01:00
Viktor Lofgren
4665af6c42 (control) Move export data endpoint to actions controller 2024-01-15 11:06:22 +01:00
Viktor Lofgren
c0b15427fe (control) New crawl view should use radio buttons as multiple specs aren't supported 2024-01-15 11:03:47 +01:00
Viktor Lofgren
f29a9d972d (control) Move 'new crawl spec' to /node/:id/actions, out of /node/:id/storage 2024-01-15 11:02:00 +01:00
Viktor Lofgren
b192373ae7 (control) Highlight unavailable items (creating, deleting) in node actions views 2024-01-15 10:47:54 +01:00
Viktor Lofgren
c042650382 (docs) Improve query service documentation 2024-01-13 21:16:45 +01:00
Viktor Lofgren
07a916a720 (search) Give the swipe hint on mobile a nicer finish 2024-01-13 18:51:54 +01:00
Viktor Lofgren
5134044530 (assistant) Make assistant client more robust to the service going down
This is especially important for the non-essential functions, like website similarities...
2024-01-13 18:29:30 +01:00
Viktor Lofgren
4c62065e74 (install) Add two separate templates for the install script
One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.
2024-01-13 18:27:42 +01:00
Viktor Lofgren
d28fc99119 (MainClass) ensure logging isn't loaded before service name is known
Loading it too early causes all logs to have names like ${sys:service-name}, instead of the service name...
2024-01-13 18:19:50 +01:00
Viktor Lofgren
c9fb45c85f (search) Fix control.hideMarginaliaApp handling 2024-01-13 17:24:15 +01:00
Viktor Lofgren
7c6e18f7a7 (*) Overhaul settings and properties
Use a system.properties file to configure the system.  This is loaded statically by MainClass or ProcessMainClass.  Update the property names to be more consistent, and update the documentation to reflect the changes.
2024-01-13 17:12:18 +01:00
Viktor Lofgren
176b9c9666 (convert) Add sizeHints to legacy serializable crawl data stream
This reduces the maximum memory usage when processing legacy crawl data
2024-01-13 15:50:36 +01:00
Viktor Lofgren
ecd9c35233 (control) Clean up the event log
* Generate fewer uninteresting event messages.
* Display fewer irrelevant fields in the overview table.
2024-01-13 13:28:02 +01:00
Viktor Lofgren
71e32c57d9 (control) Add better timestamps for the events and message queue views
Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.
2024-01-13 13:04:56 +01:00
Viktor Lofgren
2fefd0e4e3 (control) Add better timestamps for the events and message queue views
Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.
2024-01-13 13:03:52 +01:00
Viktor Lofgren
81eaf79a25 (control) UX polish 2024-01-13 12:31:13 +01:00
Viktor Lofgren
8dea7217a6 (control) UX fixes, node GUI doesn't break when an executor service goes offline. 2024-01-13 12:17:30 +01:00
Viktor Lofgren
c0fb9e17e8 (control) Add filter dropdown to message queue table
This makes inspecting the queues for processes much easier, as otherwise
these important messages are often drowned out by FSM chatter.
2024-01-12 18:46:17 +01:00
Viktor Lofgren
83776a8dce (control) Wean the ExportDataActor off EC_DOMAIN_LINK
The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead.

The ExportDataActor now uses the QueryClient appropriately.  The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file.

Finally the form for triggering an export was overhauled.
2024-01-12 17:09:11 +01:00
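Quoting every CSV value, as 83776a8dce describes, can be done with a small helper. This is a generic sketch with hypothetical names, not the ExportDataActor code; it wraps each value in double quotes and doubles any embedded quotes.

```java
// Hypothetical helper for writing fully quoted CSV rows.
final class CsvQuoteSketch {
    // Wraps the value in double quotes, escaping embedded quotes by doubling them.
    static String quote(String value) {
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Joins the quoted values with commas into one CSV row (no line terminator).
    static String row(String... values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) sb.append(',');
            sb.append(quote(values[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(row("source.example.com", "dest.example.com"));
    }
}
```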
Viktor Lofgren
98c0972619 (control) Add a summary table for Actors in the Node overview 2024-01-12 16:32:15 +01:00
Viktor Lofgren
56d832d661 (control) Adjust the margins of the headings to be consistent 2024-01-12 16:16:57 +01:00
Viktor Lofgren
de3a350afe (control) Disable broken actions and mark the actions view as WIP 2024-01-12 16:16:39 +01:00
Viktor Lofgren
708a741960 (test) Clean up test usage of migrations
Several tests were manually running migrations in a large copy-paste blob of code.  This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing.

A new test helper library is introduced with a TestMigrationLoader that can either run Flyway migrations, or load specific migrations in cases where a specific set of migrations needs to be loaded.   Existing tests are migrated to use the new code.
2024-01-12 15:55:50 +01:00
Viktor Lofgren
0caef1b307 (warc) Toggle for saving WARC data
Add a toggle for saving the WARC data generated by the search engine's crawler.  Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest.

The warc files are concatenated into larger archives, up to about 1 GB each.
An index is also created containing filenames, domain names, offsets and sizes
to help navigate these larger archives.

The warc data is saved in a directory warc/ under the crawl data storage.
2024-01-12 13:45:14 +01:00
Viktor Lofgren
264e2db539 (control) UX-improvements for control service
This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views.  It has many small tweaks to make the work flow better.

It also adds a new /uploads directory in each index node, from which sideloaded data can be selected.  This is a bit of a breaking change, as this directory needs to exist in each index node.
2024-01-12 12:33:05 +01:00
Viktor Lofgren
734996002c (*) install script for deploying Marginalia outside the codebase
The changeset also makes the control service responsible for flyway migrations.  This helps reduce the number of places the database configuration needs to be spread out.  These automatic migrations can be disabled with -DdisableFlyway=true.

The commit also adds curl to the docker container, to enable docker health checks and interdependencies.
2024-01-11 12:40:03 +01:00
Viktor Lofgren
a0f28a7f9b (*) Add a barebones configuration
This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills.

The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.
2024-01-10 20:23:51 +01:00
Viktor Lofgren
14b7680328 (loader) Update the size of the keyword files created by the loader
Previously these ended up being about 200 MB each, which is wastefully small.  Increasing the size of these files makes the index construction faster.
2024-01-10 17:09:19 +01:00
Viktor Lofgren
f44222ce53 (control) Add a 'cancel' button to the process list
This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.
2024-01-10 15:02:42 +01:00
Viktor Lofgren
f310ad8d98 (control) Actor terminations work better
Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.
2024-01-10 14:18:49 +01:00
Viktor Lofgren
d56b394bcc (control) GUI for loading external WARC files 2024-01-10 12:13:30 +01:00
Viktor Lofgren
55c9501e57 (search) Serve proper content type for static resources 2024-01-10 10:46:51 +01:00
Viktor
fad9575154 Merge pull request #69 from MarginaliaSearch/converter-optimizations
Refactor the DomainProcessor to take advantage of the new crawl data format
2024-01-10 09:46:54 +01:00
Viktor Lofgren
97e11e1ac9 (search) Fix acknowledgement page for domain complaints rendering as plain text
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used.  This method is removed with this change.
2024-01-10 09:37:40 +01:00
Viktor Lofgren
e6a1e164b2 (search) Swap swipe direction for more consistent experience 2024-01-10 09:37:40 +01:00
Viktor Lofgren
e4f8f81e89 (search) Mobile UX improvements.
Swipe right to show filter menu.

Fix CSS bug that caused parts of the menu to not have a background.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
176b3bb526 (search) Toggle for showing recent results
Actually persist the value of the toggle between searches too...
2024-01-10 09:37:39 +01:00
Viktor Lofgren
b07752fa9b (search) Toggle for showing recent results
Will by default show results from the last 2 years.  May need to tune this later.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
68fd0efbde (search) Clean up search results template
Rendering is very slow. Let's see if this has a measurable effect on latency.
2024-01-10 09:37:39 +01:00
Viktor Lofgren
c80d3eb812 (search) Remove dead code 2024-01-10 09:37:35 +01:00
Viktor Lofgren
f9320995d6 (search) When clicking asn-links, show results from the unfiltered view... 2024-01-10 09:37:13 +01:00
Viktor Lofgren
f592c9f04d (search) Fix acknowledgement page for domain complaints rendering as plain text
This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used.  This method is removed with this change.
2024-01-10 09:26:34 +01:00
Viktor Lofgren
bd7970fb1f (search) Swap swipe direction for more consistent experience 2024-01-09 13:38:40 +01:00
Viktor Lofgren
c47730f2cc (search) Mobile UX improvements.
Swipe right to show filter menu.

Fix CSS bug that caused parts of the menu to not have a background.
2024-01-09 13:30:30 +01:00
Viktor Lofgren
41cccfd2aa (search) Toggle for showing recent results
Actually persist the value of the toggle between searches too...
2024-01-09 11:36:49 +01:00
Viktor Lofgren
aff690f7d6 (search) Toggle for showing recent results
Will by default show results from the last 2 years.  May need to tune this later.
2024-01-09 11:28:36 +01:00
Viktor Lofgren
d4b0539d39 (search) Clean up search results template
Rendering is very slow. Let's see if this has a measurable effect on latency.
2024-01-08 20:57:40 +01:00
Viktor Lofgren
cb55273769 (search) When clicking asn-links, show results from the unfiltered view... 2024-01-08 20:02:19 +01:00
Viktor Lofgren
fbad625126 (linkdb) Add delegating implementation of DomainLinkDb
This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.
2024-01-08 19:56:33 +01:00
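The delegating approach in fbad625126 could look roughly like the sketch below. The interface, its single method and the `switchTo` name are hypothetical stand-ins, not the actual DomainLinkDb API.

```java
import java.util.Set;

// Hypothetical, simplified stand-in for the real DomainLinkDb interface.
interface LinkDbSketch {
    Set<Integer> findDestinations(int sourceDomainId);
}

// Forwards every call to whichever backing implementation is currently active.
final class DelegatingLinkDbSketch implements LinkDbSketch {
    // volatile so that a switch made by one thread is visible to query threads
    private volatile LinkDbSketch delegate;

    DelegatingLinkDbSketch(LinkDbSketch initial) {
        this.delegate = initial;
    }

    // Swap the backing implementation, e.g. from SQL-backed to file-backed.
    void switchTo(LinkDbSketch next) {
        this.delegate = next;
    }

    @Override
    public Set<Integer> findDestinations(int sourceDomainId) {
        return delegate.findDestinations(sourceDomainId);
    }
}
```

Callers hold on to the delegating instance only, so the backing store can be replaced during the migration without touching them.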
Viktor Lofgren
e49ba887e9 (crawl data) Add compatibility layer for old crawl data format
The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records.  This is true for the new parquet format, but not for the old zstd/gson format.

To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records; and presenting them in the new order.

This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be.

Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.
2024-01-08 19:16:49 +01:00
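The reordering described in e49ba887e9 can be illustrated with a small sketch. The generic record type and the predicate are assumptions made for illustration; the real reader works on crawl data streams rather than in-memory lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

final class DomainFirstReaderSketch {
    // Returns the records with the domain record first, followed by the
    // document records in their original order.
    static <T> List<T> domainFirst(List<T> records, Predicate<T> isDomainRecord) {
        List<T> ordered = new ArrayList<>(records.size());

        // First pass: scan for the domain record, wherever it is.
        for (T record : records) {
            if (isDomainRecord.test(record)) {
                ordered.add(record);
                break;
            }
        }

        // Second pass: emit everything that is not a domain record.
        for (T record : records) {
            if (!isDomainRecord.test(record)) {
                ordered.add(record);
            }
        }
        return ordered;
    }
}
```

The two passes are what make this slower than a straight read, which is why the commit lets callers opt out when the ordering does not matter.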
Viktor Lofgren
edc1acbb7e (*) Replace EC_DOMAIN_LINK table with files and in-memory caching
The EC_DOMAIN_LINK MariaDB table stores links between domains.  This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB).  This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need.

This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains.  This file is loaded in memory in each node, and can be queried via the Query Service.

A migration step is needed before this file is created in each node.   Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file.

The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.
2024-01-08 15:53:13 +01:00
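A minimal sketch of the file layout described in edc1acbb7e, assuming each link is stored as two consecutive 32 bit integers (source domain id, then destination domain id). The class, the method names and the map-based in-memory form are hypothetical and not taken from the Marginalia code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DomainLinkFileSketch {
    // Write each link as a (source, destination) pair of 32 bit integers.
    static void write(Path file, int[][] links) {
        ByteBuffer buf = ByteBuffer.allocate(links.length * 2 * Integer.BYTES);
        for (int[] link : links) {
            buf.putInt(link[0]);
            buf.putInt(link[1]);
        }
        try {
            Files.write(file, buf.array());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Load the whole file into memory as source id -> destination ids.
    static Map<Integer, List<Integer>> load(Path file) {
        ByteBuffer buf;
        try {
            buf = ByteBuffer.wrap(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<Integer, List<Integer>> outgoing = new HashMap<>();
        while (buf.remaining() >= 2 * Integer.BYTES) {
            int source = buf.getInt();
            int dest = buf.getInt();
            outgoing.computeIfAbsent(source, k -> new ArrayList<>()).add(dest);
        }
        return outgoing;
    }

    // Convenience for trying the format out: write to a temp file and read it back.
    static Map<Integer, List<Integer>> roundTrip(int[][] links) {
        try {
            Path tmp = Files.createTempFile("domain-links", ".dat");
            try {
                write(tmp, links);
                return load(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

At 8 bytes per link there is no per-row database overhead, and lookups are plain in-memory map reads.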
Viktor Lofgren
ef02b712ad (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
aca217cf9a (qs) Better metrics for QS 2024-01-05 13:22:13 +01:00
Viktor Lofgren
9e3386dbbb (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-05 13:22:13 +01:00
Viktor Lofgren
fdec565b34 (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-05 13:22:13 +01:00
Viktor Lofgren
33c2188c87 (feature) More trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
b3c8fa74cc (feature) Add another doubleclick variant to the adtech trackers 2024-01-05 13:22:13 +01:00
Viktor Lofgren
e53bb70bef (converter) Penalize chatgpt content farm spam 2024-01-05 13:22:13 +01:00
Viktor Lofgren
109bec372c (index) Adjust BM25 parameters 2024-01-05 13:21:52 +01:00
Viktor Lofgren
5c2561d05d (search) Add query strategy requiring link 2024-01-05 13:21:52 +01:00
Viktor Lofgren
0e970b8037 (valuation) Tweaking penalties a bit 2024-01-05 13:21:52 +01:00
Viktor Lofgren
1694b4d6ef (valuation) Increase the penalty for adtech a bit 2024-01-05 13:21:34 +01:00
Viktor Lofgren
396299c1db (index) Reduce the value of site and site-adjacent in BM25P calculations 2024-01-05 13:21:33 +01:00
Viktor Lofgren
71d789aab0 (index) Tweak result valuation renormalization 2024-01-05 13:21:33 +01:00
Viktor Lofgren
6d2e14a656 (build) Remove false dependency between icp and index-service
This dependency causes the executor service docker image to change when the index service docker image changes.
2024-01-05 13:17:29 +01:00
Viktor Lofgren
4078708aea (qs) Better metrics for QS 2024-01-04 13:27:14 +01:00
Viktor Lofgren
343ea9c6d8 (search) Fetch fewer results per page
This is a test to evaluate how this impacts load times.
2024-01-04 13:18:07 +01:00
Viktor Lofgren
60361f88ed (converter) Add upper 128KB limit to how much HTML we'll parse 2024-01-03 23:14:03 +01:00
Viktor Lofgren
f7560cb1d8 (feature) More trackers 2024-01-03 17:31:02 +01:00
Viktor Lofgren
1f66568d59 (feature) More trackers 2024-01-03 17:27:25 +01:00
Viktor Lofgren
7af07cef95 (feature) Add another doubleclick variant to the adtech trackers 2024-01-03 17:21:12 +01:00
Viktor Lofgren
41a540a629 (converter) Penalize chatgpt content farm spam 2024-01-03 17:04:38 +01:00
Viktor Lofgren
f599944942 (converter) Penalize chatgpt content farm spam 2024-01-03 16:51:26 +01:00
Viktor Lofgren
1e06aee6a2 (index) Adjust BM25 parameters 2024-01-03 16:30:46 +01:00
Viktor Lofgren
7bbaedef97 (search) Add query strategy requiring link 2024-01-03 16:23:00 +01:00
Viktor Lofgren
87048511fe (valuation) Tweaking penalties a bit 2024-01-03 16:02:25 +01:00
Viktor Lofgren
c770f0b68b (valuation) Tweaking penalties a bit 2024-01-03 15:59:21 +01:00
Viktor Lofgren
78c00ad512 (valuation) Tweaking penalties a bit 2024-01-03 15:52:57 +01:00
Viktor Lofgren
a19879d494 (valuation) Tweaking penalties a bit 2024-01-03 15:32:33 +01:00
Viktor Lofgren
ac1aca36b0 (valuation) Increase the penalty for adtech a bit 2024-01-03 15:20:38 +01:00
Viktor Lofgren
1f3b89cf28 (index) Reduce the value of site and site-adjacent in BM25P calculations 2024-01-03 15:20:18 +01:00
Viktor Lofgren
f732f6ae6f (index) Tweak result valuation renormalization 2024-01-03 14:53:53 +01:00
Viktor Lofgren
0b9f3d1751 (*) Remove accidental commit of debug logging 2024-01-03 14:32:00 +01:00
Viktor Lofgren
0806aa6dfe (language-processing) Add maximum length limit for text input in SentenceExtractor
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
Viktor Lofgren
32436d099c (language-processing) Add maximum length limit for text input in SentenceExtractor
Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.
2024-01-03 14:27:47 +01:00
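The truncation described for SentenceExtractor amounts to something like the sketch below. The value of `MAX_TEXT_LENGTH` is not given in the commit message, so the number used here is a placeholder, and the class and method names are hypothetical.

```java
final class TextLimitSketch {
    // Placeholder value; the real constant's value is not stated in the log.
    static final int MAX_TEXT_LENGTH = 65536;

    // Returns the input unchanged if it fits, otherwise only its first
    // MAX_TEXT_LENGTH characters.
    static String limit(String text) {
        if (text.length() > MAX_TEXT_LENGTH) {
            return text.substring(0, MAX_TEXT_LENGTH);
        }
        return text;
    }
}
```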
Viktor Lofgren
4ce692ccaf (converter) Use SimpleBlockingThreadPool in ProcessingIterator 2024-01-03 14:27:47 +01:00
Viktor Lofgren
3caa4eed75 Merge branch 'master' into converter-optimizations 2024-01-02 17:13:25 +01:00
Viktor Lofgren
c70f508ae8 (prometheus) Saner histogram buckets 2024-01-02 17:13:14 +01:00
Viktor Lofgren
9e64d7aaf9 Merge branch 'master' into converter-optimizations 2024-01-02 15:46:24 +01:00
Viktor Lofgren
72b773f06d (search) fix search metrics labeling 2024-01-02 15:46:14 +01:00
Viktor Lofgren
5f978b865b Merge branch 'master' into converter-optimizations 2024-01-02 15:41:48 +01:00
Viktor Lofgren
57a4f92722 (api) fix missing metrics label in api service 2024-01-02 15:41:38 +01:00
Viktor Lofgren
87351e89ca Merge branch 'master' into converter-optimizations 2024-01-02 15:17:02 +01:00
Viktor Lofgren
192e356169 (prometheus) Add instrumentation to the api service 2024-01-02 15:12:44 +01:00
Viktor Lofgren
31232e49fb (prometheus) Add instrumentation to the search, qs and index services. 2024-01-02 15:02:29 +01:00
Viktor Lofgren
9d93a31755 Merge branch 'master' into converter-optimizations 2024-01-02 12:36:16 +01:00
Viktor Lofgren
9f7df59945 (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:35:59 +01:00
Viktor Lofgren
d2418521a7 (index) Further ranking adjustments 2024-01-02 12:35:59 +01:00
Viktor Lofgren
9330b5b1d9 (index) Adjust rank weightings to fix bad wikipedia results
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always ranked the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
faa50bf578 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-02 12:35:44 +01:00
Viktor Lofgren
f0d9618dfc (sideload) Reduce quality assessment.
This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.
2024-01-02 12:34:58 +01:00
Viktor Lofgren
310a880fa8 (index) Further ranking adjustments 2024-01-02 12:24:52 +01:00
Viktor Lofgren
fc6e3b6da0 (index) Further ranking adjustments 2024-01-01 18:51:03 +01:00
Viktor Lofgren
50771045d0 (index) Further ranking adjustments 2024-01-01 18:43:17 +01:00
Viktor Lofgren
8f522470ed (index) Adjust rank weightings to fix bad wikipedia results
There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero.  This meant that "bad" results always ranked the same.  The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization.

Some of the weights were also re-adjusted based on what appears to produce better results.  Needs evaluation.
2024-01-01 17:16:29 +01:00
Viktor Lofgren
dc90c9ac65 (sideload) Just index based on first paragraph
This seems like it would make the wikipedia search result worse, but it drastically improves the result quality!

This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.
2024-01-01 16:19:38 +01:00
Viktor Lofgren
e46e174b59 (keyword-extractor) Add another test for Name-extractor 2024-01-01 15:21:51 +01:00
Viktor Lofgren
7f3f3f577c (backup) Add task heartbeats to the backup service 2024-01-01 15:20:57 +01:00
Viktor Lofgren
75d87c73d1 (crawler) Disable Java's infinite DNS caching 2023-12-31 16:59:08 +01:00
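One common way to bound the JVM's DNS cache is through the `networkaddress.cache.ttl` and `networkaddress.cache.negative.ttl` security properties. Whether 75d87c73d1 uses this mechanism, and with which values, is not stated in the log; the sketch below is an assumption with placeholder parameters.

```java
import java.security.Security;

final class DnsCacheSketch {
    // Bounds how long successful and failed lookups are cached, in seconds.
    // Typically this needs to run before the first name lookup in the JVM.
    static void limitDnsCaching(int positiveTtlSeconds, int negativeTtlSeconds) {
        Security.setProperty("networkaddress.cache.ttl",
                Integer.toString(positiveTtlSeconds));
        Security.setProperty("networkaddress.cache.negative.ttl",
                Integer.toString(negativeTtlSeconds));
    }
}
```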
Viktor Lofgren
0fe44c9bf2 (crawler) Fix broken test
A necessary step was accidentally deleted when cleaning up these tests previously.
2023-12-30 13:56:44 +01:00
Viktor Lofgren
7a1d20ed0a (converter) Better use of ProcessingIterator
Modify ProcessingIterator to be constructed via a factory, to enable re-use of its backing executor service.

This reduces thread churn in the converter sideloader style processing of regular crawl data.
2023-12-30 13:53:55 +01:00
Viktor Lofgren
70c83b60a1 (converter) Clean up fullProcessing()
This function made some very flimsy-looking assumptions about the order of an iterable.  These are still made, but more explicitly so.
2023-12-30 13:36:18 +01:00
Viktor Lofgren
7ba296ccdf (converter) Route sizeHint to SideloadProcessing
Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number.

This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.
2023-12-30 13:05:10 +01:00
Viktor Lofgren
0b112cb4d4 (warc) Update URL encoding in WarcProtocolReconstructor
The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.
2023-12-29 19:41:37 +01:00
Viktor Lofgren
68ac8d3e09 (search) Fetch fewer linking and similar domains.
Showing a total of 200 connected domains is not very informative.
2023-12-29 16:37:27 +01:00
Viktor Lofgren
f6fa8bd722 (search) Fetch fewer linking and similar domains.
Showing a total of 200 connected domains is not very informative.
2023-12-29 16:37:00 +01:00
Viktor Lofgren
6aee27a3f1 (*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style. 2023-12-29 16:36:01 +01:00
Viktor Lofgren
401568033c Merge branch 'master' into converter-optimizations 2023-12-29 15:55:57 +01:00
Viktor Lofgren
ea73be6831 (search) Remove the ugly placeholder screenshots from the site info view. 2023-12-29 15:55:46 +01:00
Viktor Lofgren
ba8a75c84b Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool 2023-12-29 15:10:32 +01:00
Viktor Lofgren
a1f3ccdd6d Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool 2023-12-29 14:59:39 +01:00
Viktor Lofgren
647d38007f Reduce queue polling time in ProcessingIterator
Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.
2023-12-29 14:27:58 +01:00
Viktor Lofgren
e7dd28b926 (converter) Optimize sideload-loading
Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.
2023-12-29 14:25:48 +01:00
Viktor Lofgren
b5fc9673d9 Merge branch 'master' into converter-optimizations 2023-12-29 14:04:43 +01:00
Viktor Lofgren
a065040323 (search) Don't inject arbitrary HTML into the site info view xD 2023-12-29 14:04:26 +01:00
Viktor Lofgren
dec3b1092d (converter) Fix bugs in conversion
This commit adds a safety check that the URL of the document is from the correct domain.

It also adds a sizeHint() method to SerializableCrawlDataStream which *may* provide an indication if the stream is very large and benefits from sideload-style processing (which is slow).

It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...
2023-12-29 13:58:08 +01:00
Viktor Lofgren
407915a86e (converter) Fix NPEs in converter due to the new data format 2023-12-28 22:54:53 +01:00
Viktor Lofgren
c488599879 (converter) Fix NPE in converter 2023-12-28 19:52:26 +01:00
Viktor Lofgren
bcecc93e39 (converter) Swallow errors in parquet data stream 2023-12-28 19:45:35 +01:00
Viktor Lofgren
ff7d1a250e Merge branch 'master' into converter-optimizations 2023-12-28 19:35:00 +01:00
Viktor Lofgren
70f338c3de (search) Fix NPE in layout selection 2023-12-28 19:34:46 +01:00
Viktor Lofgren
c847d83011 (converter) Add size hint to converter sideload processing 2023-12-28 19:14:16 +01:00
Viktor Lofgren
5ce46a61d4 Merge branch 'master' into converter-optimizations 2023-12-28 13:26:19 +01:00
Viktor
775974d5ec
Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info
Add RSS Feeds to site info (WIP)
2023-12-28 13:25:38 +01:00
Viktor Lofgren
c7af40c368 (search) Change layout balance when feeds/samples are present 2023-12-28 13:16:10 +01:00
Viktor Lofgren
00a974a721 (crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions
This commit also improves the test coverage for this part of the code.
2023-12-27 20:02:17 +01:00
Viktor Lofgren
7428ba2dd7 (converter) Basic test coverage for sideloading-style processing 2023-12-27 19:29:26 +01:00
Viktor Lofgren
b37223c053 (converter) Basic test coverage for sideloading-style processing 2023-12-27 18:33:16 +01:00
Viktor Lofgren
24051fec03 (converter) WIP Run sideload-style processing for large domains
The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis.   This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process.

These websites now receive a simplified treatment.  This is executed in the converter batch writer thread.  This is slower, but the documents will not be persisted in memory.
2023-12-27 18:20:03 +01:00
Viktor Lofgren
f811a29f87 (crawler) Fix resource leak in crawler
A 10 MB thread local buffer wasn't static.  Oops.
2023-12-27 16:32:17 +01:00
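The difference between a static and a non-static thread local buffer can be sketched as follows. The class and field names are hypothetical, and the exact shape of the original bug is an assumption based on the commit message.

```java
final class BufferHolderSketch {
    private static final int SIZE = 10 * 1024 * 1024; // 10 MB, as in the commit

    // static: one ThreadLocal shared by all instances, so at most one
    // buffer is allocated per thread.
    private static final ThreadLocal<byte[]> SHARED =
            ThreadLocal.withInitial(() -> new byte[SIZE]);

    // non-static: every instance brings its own ThreadLocal, so each
    // instance allocates a fresh buffer on every thread that touches it.
    private final ThreadLocal<byte[]> perInstance =
            ThreadLocal.withInitial(() -> new byte[SIZE]);

    byte[] sharedBuffer() {
        return SHARED.get();
    }

    byte[] leakyBuffer() {
        return perInstance.get();
    }
}
```

With short-lived holder objects on long-lived threads, the per-instance variant can pile up 10 MB buffers until stale thread local entries are cleaned out.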
Viktor Lofgren
acf7bcc7a6 (converter) Refactor the DomainProcessor for new format of crawl data
With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter.  This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while.

The first step is to move stuff out of the domain processor into the document processor.
2023-12-27 13:57:59 +01:00
Viktor Lofgren
9707366348 (test) Fix a few slow tests that broke due to domainCount 2023-12-27 13:29:59 +01:00
Viktor Lofgren
9e5fe71f5b (crawler) Switch hash function in crawler
Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler.   This switches to the modified Murmur hash function used throughout Marginalia.
2023-12-27 13:29:00 +01:00
Viktor Lofgren
5d1b7da728 Updated site info feed and search service
Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.
2023-12-26 22:06:01 +01:00
Viktor Lofgren
3ea1ddae22 (crawler) Roll back switch to virtual thread pool in crawler
This seems to cause a resource leak, it seems the http library uses thread locals?
2023-12-26 19:37:34 +01:00
Viktor Lofgren
1694e9c78c (search) Add RSS Feeds to site info
This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates.

The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.
2023-12-26 16:21:40 +01:00
Viktor Lofgren
4763077b76 (search/index) Add a new keyword "count"
This is for filtering results on how many times the term appears on the domain.  The intent is to be beneficial in creating e.g. a domain search feature.   It's also very helpful when tracking down spammy domains.
2023-12-25 20:38:29 +01:00
Viktor Lofgren
c0eaca220c (search) Add convenient link for AS search to the search view 2023-12-25 15:07:58 +01:00
Viktor Lofgren
25d086c4e1 (crawler) Clean up stale warc files
We should probably have an option to keep them, but not by default!
2023-12-25 15:07:36 +01:00
Viktor Lofgren
88551043cd (crawler) Even more lenient resyncing 2023-12-25 01:48:11 +01:00
Viktor Lofgren
f779f760c4 (crawler) Even more lenient resyncing 2023-12-25 01:44:18 +01:00
Viktor Lofgren
f18f82e229 (crawler) Write etags and last-modified on reference copy
This commit also fixes a test that broke with a previous change.
2023-12-25 01:40:13 +01:00
Viktor Lofgren
67ef2b45fa (crawler) Reduce logging 2023-12-25 01:10:03 +01:00
Viktor Lofgren
d72e871265 (warc) Fix resync 2023-12-25 01:03:03 +01:00
Viktor Lofgren
4c9bc13309 (warc) Reduce log spam 2023-12-25 00:58:31 +01:00
Viktor Lofgren
84563b0d46 (crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK 2023-12-25 00:55:05 +01:00
Viktor Lofgren
c5aab7e8db (warc) Fix NPE in WarcRecorder 2023-12-25 00:54:38 +01:00
Viktor Lofgren
1755b646b8 (warc) Fix NPE in WarcRecorder 2023-12-25 00:48:42 +01:00
Viktor Lofgren
85f906ea53 (executor) Fix removal of stale process heartbeats 2023-12-23 13:49:24 +01:00
Viktor Lofgren
e1a155a9c8 (crawler) Increase growth of crawl jobs
A number of crawl jobs get stuck at about 300 documents, or just under.  This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl.  GOOD_URLS is based on how many documents successfully process, which is typically fairly small.  Switching to KNOWN_URLS should let this grow faster.

The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table.

The floor is also increased to 250 from 200.
2023-12-23 13:22:10 +01:00
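The crawl limit arithmetic from e1a155a9c8 can be worked through with a short sketch. The class and method names are hypothetical, and applying the 1.5x recrawl modifier after the MAX is an assumption about the order of operations.

```java
final class CrawlLimitSketch {
    static final int FLOOR = 250;           // raised from 200 in this commit
    static final double GROWTH = 1.25;      // applied to KNOWN_URLS
    static final double RECRAWL_MODIFIER = 1.5;

    // MAX(250, 1.25 x KNOWN_URLS), with the 1.5x modifier applied on a recrawl.
    static int crawlLimit(int knownUrls, boolean isRecrawl) {
        double limit = Math.max(FLOOR, GROWTH * knownUrls);
        if (isRecrawl) {
            limit *= RECRAWL_MODIFIER;
        }
        return (int) limit;
    }
}
```

For a domain with 100 known URLs the floor applies, giving 250 (375 on a recrawl); with 400 known URLs the limit becomes 500 (750 on a recrawl).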
Viktor Lofgren
0454447e41 (executor) Implement process removal for long-absent heartbeats
Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.
2023-12-23 13:18:21 +01:00
Viktor Lofgren
7b40c0bbee (assistant) Clean up similar websites' results 2023-12-22 14:07:01 +01:00
Viktor Lofgren
dc773c5c20 (adjacencies) Clean up AdjacenciesLoader
Make JDBC batching more consistent, also adds a test case for the loader.
2023-12-21 14:14:22 +01:00
Viktor Lofgren
b6253b03c2 (adjacencies) Fix bug in AdjacenciesLoader
This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created.  This fails and does nothing.

Furthermore, added the logging that would have warned about this failure, had it been in place.
2023-12-21 13:12:31 +01:00
Viktor Lofgren
a5bc29245b (cleanup) Remove vestigial support for WARC crawl data streams 2023-12-20 15:46:21 +01:00
Viktor Lofgren
bfae478251 Refactor CrawlerRevisitor for better consistency 2023-12-20 15:21:49 +01:00
Viktor Lofgren
a7cd490593 (minor) Remove dead code. 2023-12-19 18:58:33 +01:00
Viktor Lofgren
dd8fb04886 (converter) Add sideloadSizeAdvice field to several ProcessedDomain
Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking.  This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.
2023-12-19 18:37:51 +01:00
Viktor
5bd3934d22
Merge pull request #64 from dreimolo/macos_AS_fix
Macos apple silicon fix, and slight improvements to sample downloader
2023-12-18 18:29:14 +01:00
Viktor Lofgren
3a56a06c4f (warc) Add fields for etags and last-modified headers to the new crawl data formats
Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats.  This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely.

The commit also adds a few tests to this logic.
2023-12-18 17:45:54 +01:00
Viktor Lofgren
126ac3816f (converter) Reduce queue size in ConverterWriter
The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM.  This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.
2023-12-18 13:42:40 +01:00
Viktor Lofgren
d02bed1a55 (loader) Optimize DomainLoaderService for faster startups
Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.
2023-12-18 13:15:10 +01:00
Viktor Lofgren
b7ed0ce537 (loader) Reset count after executing batch in DomainLoaderService
This should greatly speed up starting the loader process.
2023-12-18 12:43:53 +01:00
Viktor Lofgren
a742503508 (search) Add view for showing mutual links between two websites 2023-12-17 17:50:44 +01:00
Viktor Lofgren
33312ab09e (geo-ip) Update readme 2023-12-17 16:08:33 +01:00
Viktor Lofgren
c422f0b9fb (geo-ip) Tidy up error handling 2023-12-17 16:06:51 +01:00
Viktor Lofgren
c92f1b8df8 (geo-ip) Revert removal of ip2location logic
We do both ip2location and ASN data.

The change also adds some keywords based on autonomous system information, on a somewhat experimental basis.  It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.
2023-12-17 15:03:00 +01:00
Viktor Lofgren
bde68ba48b Merge branch 'master' into asn-info 2023-12-17 14:00:23 +01:00
Viktor Lofgren
bf44805e69 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 14:00:07 +01:00
Viktor Lofgren
edf9aa2c23 (*) Rename EdgeDomain$domain into topDomain
This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time.

Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.
2023-12-17 13:59:54 +01:00
Viktor Lofgren
4801c47273 (crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes
This had the knock-on effect of breaking the anchor tag loading in the processor for a lot of domains, since they'd grab domains for the wrong domain name.
2023-12-17 13:53:31 +01:00
Viktor Lofgren
bcad6492d6 (sideloader) Fix integration problems with sideloaders
In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment.

Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters.

Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.
2023-12-17 13:28:17 +01:00
Viktor Lofgren
5ab2a22e88 (search) Fix result count back down to 1 per domain 2023-12-17 13:14:23 +01:00
Viktor Lofgren
d7bd540683 (*) Replace the ip2location IP geolocation data with ASN information from apnic.net.
Doesn't really make sense to use ip2location as a middle man for information that is already freely available...
2023-12-16 21:55:04 +01:00
Viktor Lofgren
722b56c8ca (index) Fix rare bug in the index-switching logic
This is caused by a resource contention with the query code.  The proper way to fix this is to use some form of synchronization, but that will slow the code down.  So we just hammer it a few times and let the GC deal with the problem if it fails.  Not optimal, but fast.
2023-12-16 18:57:35 +01:00
Viktor Lofgren
f3f12058dc (assistant) Fix logic error in filtering related domains 2023-12-16 18:45:53 +01:00
Viktor Lofgren
3da38d0483 (assistant) Fix logic error in filtering related domains 2023-12-16 18:44:25 +01:00
Viktor Lofgren
d715b1f9ca (search) Improve error handling in search parameters parsing
The code now intercepts and deals with potential exceptions during the parsing of search parameters. This is in response to constant bad requests from bots which were cluttering the logs. A catch clause is added that suppresses these errors and redirects to the base URL.
2023-12-16 18:42:13 +01:00
Viktor Lofgren
e13fa25e11 (assistant) Clean up the site info related domains view by filtering viable domains 2023-12-16 18:37:09 +01:00
Viktor Lofgren
34d4834ff6 (assistant) Clean up the site info related domains view by filtering viable domains 2023-12-16 18:27:24 +01:00
Viktor Lofgren
117ddd17d7 (assistant) Fix bugs in IP flag emoji generation 2023-12-16 17:07:17 +01:00
Viktor Lofgren
6f2bf38f0e (index) Fix off-by-1 error in the domain count limiter 2023-12-16 16:57:05 +01:00
Viktor Lofgren
320882c34a (site-info) Try to discover the schema of the website with a site:-query
The site info view can't blindly assume that every website supports https.  To figure out which schema to use when linking to a site, execute a single-result search for site:domain.name and then grab the schema off the result.

To allow this, a count parameter is introduced to doSiteSearch() in SearchOperator.
2023-12-16 16:34:53 +01:00
Viktor Lofgren
3113b5a551 (warc) Filter WarcResponses based on X-Robots-Tags
There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tags header where that header indicates it doesn't want to be crawled by Marginalia.
2023-12-16 15:58:27 +01:00
dreimolo
c0cc05177f corrects protobuf.plugins.grpc 2023-12-16 14:24:41 +01:00
dreimolo
0b34d43804 workaround for failing mac on apple silicon deps 2023-12-16 14:22:11 +01:00
Viktor Lofgren
54ed3b86ba (minor) Remove dead code. 2023-12-15 21:49:35 +01:00
Viktor Lofgren
2001d0f707 (converter) Add @Deprecated annotation to a few fields that should no longer be used. 2023-12-15 21:42:00 +01:00
Viktor Lofgren
0f9cd9c87d (warc) More accurate filtering of advisory records
Further create records for resources that were blocked due to robots.txt; as well as tests to verify this happens.
2023-12-15 21:37:02 +01:00
Viktor Lofgren
2e7db61808 (warc) More accurate filtering of advisory records
We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes.

Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s.

The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...
2023-12-15 21:31:16 +01:00
Viktor Lofgren
5329968155 (crawler) Update CrawlingThenConvertingIntegrationTest
This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.
2023-12-15 21:04:06 +01:00
Viktor Lofgren
2e536e3141 (crawler) Add timestamp to CrawledDocument records
This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream.

The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type.  This is to avoid having to do format conversions when writing and reading the data.

This parquet field populates the timestamp field in CrawledDocument.
2023-12-15 20:23:27 +01:00
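The timestamp representation described in 2e536e3141, a plain 64 bit count of seconds since the unix epoch, maps directly onto `java.time.Instant`. The class and method names below are hypothetical.

```java
import java.time.Instant;

final class CrawlTimestampSketch {
    // Value to store in the parquet column: whole seconds since the unix epoch.
    static long toEpochSeconds(Instant timestamp) {
        return timestamp.getEpochSecond();
    }

    // Value read back from the parquet column.
    static Instant fromEpochSeconds(long epochSeconds) {
        return Instant.ofEpochSecond(epochSeconds);
    }
}
```

Sub-second precision is dropped on the way in, which is the trade-off for not needing a logical type or any format conversion.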
Viktor Lofgren
cf935a5331 (converter) Read cookie information
Add an optional new field to CrawledDocument containing information about whether the domain has cookies.  This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object.

Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.
2023-12-15 18:09:53 +01:00
Viktor Lofgren
fa81e5b8ee (warc) Use a non-standard WARC header to convey information about whether a website uses cookies
This information is then propagated to the parquet file as a boolean.

For documents that are copied from the reference, use whatever value we last saw.  This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.
2023-12-15 16:37:53 +01:00
Viktor Lofgren
9fea22b90d (warc) Further tidying
This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled.

A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics.

Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.
2023-12-15 15:38:23 +01:00
Viktor Lofgren
0889b6d247 (warc) Clean up parquet conversion
This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, and adds support for information about redirects and errors due to probe failure.

It also refactors the fetch result, body extraction and content type abstractions.
2023-12-14 20:39:40 +01:00
Viktor Lofgren
1328bc4938 (warc) Clean up parquet conversion
This commit cleans up the warc->parquet conversion.  Records with a http status other than 200 are now included.

The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body.

The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.
2023-12-14 16:05:48 +01:00
Viktor Lofgren
787a20cbaa (crawling-model) Implement a parquet format for crawl data
This is not hooked into anything yet.  The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays.  This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.
2023-12-13 16:22:19 +01:00
Viktor Lofgren
440e097d78 (crawler) WIP integration of WARC files into the crawler and converter process.
This commit is in a pretty rough state.  It refactors the crawler fairly significantly to offer better separation of concerns.  It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data.  This works, -ish.

There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either.

A problem is that the WARC files are a bit too large.  It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.
2023-12-13 15:33:42 +01:00
Viktor Lofgren
b74a3ebd85 (crawler) WIP integration of WARC files into the crawler process.
At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly.

This is a WIP commit, since the warc files are not fully incorporated into the work flow; they are deleted after the domain is crawled.

The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.
2023-12-11 19:32:58 +01:00
Viktor Lofgren
45987a1d98 Merge branch 'master' into warc 2023-12-11 14:32:35 +01:00
Viktor Lofgren
8f0950fc44 (geoip) Fix incorrect synchronization. 2023-12-11 14:01:39 +01:00
Viktor Lofgren
30bc3f9281 (converter) Use the prefix ip: instead of geoip: for country codes
This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces there can be no confusion.
2023-12-11 13:59:23 +01:00
Viktor Lofgren
f655ec5a5c (*) Refactor GeoIP-related code
In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services.

The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions.

The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server.

The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.
2023-12-10 17:30:43 +01:00
Viktor Lofgren
84b4158555 (minor) Fix broken test 2023-12-10 14:39:20 +01:00
Viktor Lofgren
91dd45cf64 (search) IP and IP geolocation in site info view
This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
37af60254f (search) Better recipe filter
Tune the recipe filter to give better results, by using the 'popular' domains set along with excluding results with heavy tracking.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
f0e736d4ea (search) Update the search profile 'Academia' to strictly filter on academic tlds
The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.
2023-12-09 20:06:55 +01:00
Viktor Lofgren
e3ebb0c5bb (*) Rename the search filter 'RETRO' into 'POPULAR'
This will make the terminology more consistent between the GUI and the code.  The rankings yaml still uses 'retro' though, to retain compatibility.
2023-12-09 20:06:54 +01:00
Viktor Lofgren
6382f779c3 (search) Revert back to using 'Popular' as the default search filter
Unfiltered is a bit too ... unfiltered, and gives a bad first impression for many queries.
2023-12-09 16:34:12 +01:00
Viktor Lofgren
8ef34883a8 (search) Move site information out of the search service and into assistant.
This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available.  It also permits exposing this information via API in the future if there is interest in this.

The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time.

Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.
2023-12-09 16:30:06 +01:00
Viktor Lofgren
5c46af0edb (converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator
Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g a database query in parallel and returning the computed results as an iterator.

The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().
2023-12-09 15:20:53 +01:00
Viktor Lofgren
b6511fbfe2 (converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing
The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance.

It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.
2023-12-09 15:20:52 +01:00
Viktor Lofgren
eccb12b366 (control) Fix spurious state detection in control-side actors
A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor!

To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.
2023-12-09 12:50:05 +01:00
Viktor Lofgren
d0982e7ba5 (converter) Add error handling and lazy load external domain links
The converter was not properly initiating the external links for each domain, causing an NPE in conversion.  This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data.

Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.
2023-12-09 12:33:39 +01:00
Viktor Lofgren
fc30da0d48 (converter) Add academia recognition to DomainProcessor
The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like *.ac.ccTld or *.edu.ccTld.

If these conditions are met, the search term "special:academia" is added to the domain.

The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well.  The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.
2023-12-08 20:31:34 +01:00
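The rule described in fc30da0d48 (the ".edu" TLD, or a name shaped like *.ac.ccTld or *.edu.ccTld) can be sketched as below. The regular expression is an assumption that treats a ccTLD as exactly two letters; the commit message does not give the actual pattern, and `AcademiaDomains` is a hypothetical name.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; the real DomainProcessor logic may differ.
final class AcademiaDomains {
    private AcademiaDomains() {}

    // Assumed shape: "<something>.ac.<cc>" or "<something>.edu.<cc>",
    // where <cc> is a two-letter country code.
    private static final Pattern ACADEMIC_CCTLD =
            Pattern.compile(".*\\.(ac|edu)\\.[a-z]{2}");

    static boolean isAcademic(String domain) {
        String d = domain.toLowerCase(Locale.ROOT);
        return d.endsWith(".edu") || ACADEMIC_CCTLD.matcher(d).matches();
    }

    public static void main(String[] args) {
        for (String d : new String[] {"mit.edu", "www.ox.ac.uk", "unimelb.edu.au", "example.com"}) {
            System.out.println(d + " academic=" + isAcademic(d));
        }
    }
}
```

A name-based check like this is cheap and deterministic, which matches the commit's stated hope that it can supplant the ranking-based filter, though it will miss academic sites hosted under other TLDs.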
Viktor Lofgren
e6a1052ba7 Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default. 2023-12-08 20:24:01 +01:00
Viktor Lofgren
968dce50fc (crawler) Refactored IpInterceptingNetworkInterceptor for clarity. 2023-12-08 17:45:46 +01:00
Viktor Lofgren
3bbffd3c22 (crawler) Refactor HttpFetcher to integrate WarcRecorder
Partially hook in the WarcRecorder into the crawler process.  So far it's not read, but should record the crawled documents.

The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.
2023-12-08 17:12:51 +01:00
Viktor Lofgren
072b5fcd12 Implement Warc-recording wrapper for OkHttp3 client
This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted.  This component is currently not hooked into anything.

The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'.

The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.
2023-12-08 13:49:16 +01:00
Viktor Lofgren
fabffa80f0 (warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader 2023-12-07 15:26:01 +01:00
Viktor Lofgren
064265b0b9 (crawler) Move content type/charset sniffing to a separate microlibrary
This functionality needs to be accessed by the WarcSideloader, which is in the converter.  The resultant microlibrary is tiny, but I think in this case it's justifiable.
2023-12-07 15:16:37 +01:00
Viktor Lofgren
2d5d11645d (warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer 2023-12-06 19:00:29 +01:00
Viktor Lofgren
cc813a5624 (convert) Add basic support for Warc file sideloading
This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.
2023-12-06 18:43:55 +01:00
Viktor Lofgren
156c067f79 (search) Fix mobile issues with browse feature 2023-12-05 21:28:50 +01:00
Viktor Lofgren
b33b013d41 (search) Fix broken script tag
Apparently it can't be called suggestions.js...?
2023-12-05 20:29:13 +01:00
Viktor Lofgren
e74e2f705f (search) Fix broken script tag
suggestions.js became something else.
2023-12-05 20:20:07 +01:00
Viktor Lofgren
2e438847fc (search) Optimize related domains queries
In the future this logic probably needs to move into a separate
service, as it's still quite slow to load.  But this fixes response
times and DOS potential of previous version.
2023-12-05 20:12:03 +01:00
Viktor Lofgren
9301c47d93 (search) Optimize related domains queries 2023-12-05 14:42:03 +01:00
Viktor Lofgren
20ec58b07f (search) Remove layout-breakingly long URLs from the similar domains view.
They're almost all .onion URLs anyway, not really the space we're looking to peer into.
2023-12-05 13:58:15 +01:00
Viktor Lofgren
98983c1015 (search) Hopefully fix race condition that leaves the response with no Content-type header 2023-12-05 13:52:36 +01:00
Viktor Lofgren
67195592c6 (search) Hopefully fix race condition that leaves the response with no Content-type header 2023-12-05 13:48:42 +01:00
Viktor Lofgren
d1e88df71e (search) Cleaning up the code a bit 2023-12-05 13:26:05 +01:00
Viktor Lofgren
f36cfe34ab (search) Hackery to get a more balanced view 2023-12-04 22:50:39 +01:00
Viktor Lofgren
8a1934008c (search) Merge similar sites results with the info view.
WIP: This commit needs to be cleaned up.
2023-12-04 22:10:24 +01:00
Viktor Lofgren
b41bb9cfcf (search) Use a &Xi; for mobile button title instead of "Filters".
Makes it easier to distinguish from the search button.
2023-12-03 16:33:25 +01:00
Viktor Lofgren
d58324bbef (search) Clean up filters menu a bit, improve accessibility. 2023-12-02 18:05:30 +01:00
Viktor Lofgren
cbbd45d3e5 (search) Clean up filters menu a bit, improve accessibility. 2023-12-02 18:01:03 +01:00
Viktor Lofgren
b89633ae4b (search) Don't render a filter button on mobile when there are no filters to be presented. 2023-12-02 17:23:45 +01:00
Viktor Lofgren
96357e9bfd (search) Fix typeahead suggestions, as well as improve mobile and desktop UX in small ways. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
d530c3096f (search) GUI tweaks to make the new interface not fall apart on mobile/chrome 2023-12-02 17:06:40 +01:00
Viktor Lofgren
ae0c1c3f2d (control) Adjust search result margins for better visual density 2023-12-02 17:06:40 +01:00
Viktor Lofgren
0cc2564380 (search) CSS tweaks 2023-12-02 17:06:40 +01:00
Viktor Lofgren
38d20022ad (search) Fix script loading for mobile support 2023-12-02 17:06:40 +01:00
Viktor Lofgren
280132dad0 (search) Fix script loading for mobile support 2023-12-02 17:06:40 +01:00
Viktor Lofgren
61de4e2789 (search) Retain filter options when performing a new search from the input field 2023-12-02 17:06:40 +01:00
Viktor Lofgren
f9d3455320 (search) Reduce visual weight of search results 2023-12-02 17:06:40 +01:00
Viktor Lofgren
2ff64c3c12 (search) New toggle for reducing tracking 2023-12-02 17:06:40 +01:00
Viktor Lofgren
902f235b5b (search) Integrate 'similar' tab in site info. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
97d43a6fa2 (search) Revamp browse results with new look. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
9bc65ff0ca (search) Desaturate search result titles according to rank 2023-12-02 17:06:40 +01:00
Viktor Lofgren
6cd6a615fd (search) Add data-filter to body as a data attribute
For future shenanigans ;D
2023-12-02 17:06:40 +01:00
Viktor Lofgren
5639f0653d (search) Rename SearchProfile.name into filterId
Avoid foot-gun caused by name clash with the Enumeration method name(), which returns the Java name of the enumeration value.
2023-12-02 17:06:40 +01:00
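The foot-gun mentioned in 5639f0653d can be illustrated with a small enum. The enum below is hypothetical (its constants and identifiers are made up for the example, not copied from SearchProfile); it only shows why a field called `name` is easy to confuse with the built-in `Enum.name()` method, and how a distinct field name avoids that.

```java
// Hypothetical enum for illustration; not the project's SearchProfile.
enum Profile {
    NO_FILTER("no-filter"),
    POPULAR("popular");

    // Called filterId rather than name, so that profile.filterId and
    // profile.name() cannot be mistaken for one another at a call site.
    public final String filterId;

    Profile(String filterId) {
        this.filterId = filterId;
    }

    // Look up a profile by its external identifier; null if unknown.
    static Profile byFilterId(String id) {
        for (Profile p : values()) {
            if (p.filterId.equals(id)) {
                return p;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // name() is the Java constant name, not the external identifier.
        System.out.println(POPULAR.name() + " vs " + POPULAR.filterId);
    }
}
```

With a field literally named `name`, `profile.name` and `profile.name()` would both compile and return different strings, which is the kind of clash the rename removes.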
Viktor Lofgren
251174c9a2 (search) Update front page with new look 2023-12-02 17:06:40 +01:00
Viktor Lofgren
42ea87d637 (search) Update conversion results, error page, and dictionary results with new CSS. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
7c8a60b8cf (search) Site info view is mostly done
Also optimize the rendering a bit to avoid having to allocate huge string buffers, writing directly to Spark's response instead.
2023-12-02 17:06:40 +01:00
Viktor Lofgren
2f4500be5a (search) New frontend look 2023-12-02 17:06:40 +01:00
Viktor Lofgren
fa7534a362 (search) Remove dead code 2023-12-02 17:06:40 +01:00
Viktor Lofgren
a258f0af7a (search) Refactor search parameters to include query 2023-12-02 17:06:40 +01:00
Viktor Lofgren
01621c6344 (renderer) Make helpers configurable on a by-service basis. 2023-12-02 17:06:40 +01:00
Viktor Lofgren
c7934342a6 (control) Automatic recrawl 2023-12-02 17:06:24 +01:00
Viktor Lofgren
f5c324c06b (minor) Fix broken test 2023-12-01 17:44:39 +01:00
Viktor Lofgren
f615cf2391 (convert) Loosen up the rules enforcement for documents that have external links. 2023-12-01 17:44:29 +01:00
Viktor Lofgren
e5d274fe1c (docs) Improve architectural documentation 2023-11-30 21:38:57 +01:00
Viktor Lofgren
166a391eae (docs) Improve architectural documentation for the crawler. 2023-11-30 21:30:57 +01:00
Viktor Lofgren
5fb24bb27f (docs) Improve architectural documentation for the converter. 2023-11-30 20:43:22 +01:00
Viktor Lofgren
5a5430b383 (convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary. 2023-11-30 20:04:46 +01:00
Viktor Lofgren
67a1e1c874 (control) GUI for triggering control-side actors 2023-11-29 15:31:14 +01:00
Viktor Lofgren
4155fbe94c (control) Reprocess-all actor 2023-11-28 17:58:48 +01:00
Viktor Lofgren
347fe6b7be (control) Reindex-all actor 2023-11-28 16:41:09 +01:00
Viktor Lofgren
ff3ceb981e (control) Button for removing a stale 'NEW' status
If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data.

To remedy this without having to dig through the database, a button was added to reset the state.  It's a band-aid, but the situation is rare enough that I think it's fine.
2023-11-28 15:18:24 +01:00
Viktor Lofgren
1dafa0c74d (mqapi/control) Repair repartition endpoint, deprecate notify endpoints.
The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId.  In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.
2023-11-27 16:01:12 +01:00
Viktor Lofgren
09917837d0 (process) Ensure construction exceptions are logged
Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs.

The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.
2023-11-22 18:32:06 +01:00
Viktor Lofgren
dd507a3808 (db) Fix migrations, bump flyway to 10.0.1
Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.
2023-11-21 20:04:35 +01:00
Viktor Lofgren
dd9406d0ac (control) Make storage type tabs consistent
This had fallen off in the Create New Specification view, it lacked Exports.
2023-11-17 11:26:45 +01:00
Viktor Lofgren
f58a9f46be (loader) Don't truncate the entire links table on load
This behavior is an old vestige from the days of only having a single loader process.  We'd truncate the links table because doing inserts/updates was too slow.  This was also important because we had 32 bit IDs, and there's a lot of links between domains to go around...

Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE.

We also update the PRIMARY KEY to a BIGINT.  We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.
2023-11-16 10:30:12 +01:00
Viktor Lofgren
1cbf23e7e7 (test) Don't fail test if atags.parquet is not in ~vlofgren 2023-11-15 09:11:38 +01:00
Viktor Lofgren
63554ba171 (explore2) Add robots.txt 2023-11-14 09:15:32 +01:00
Viktor Lofgren
5de37cb820 (converter) Set feature flags appropriately on stackexchange posts 2023-11-12 15:48:08 +01:00
Viktor Lofgren
e5cee1f46d (sideload) Fix sideloading so that it doesn't get disproportionately good rankings
Also add type flags so that e.g. wikipedia shows up in the wikis filter.
2023-11-12 14:57:57 +01:00
Viktor Lofgren
e9a01caa5c (index) Fix broken metrics 2023-11-11 12:53:47 +01:00
Viktor Lofgren
858357a246 (metrics) Get prometheus up out of disrepair
* Fix bad labels
* Add nodeId where appropriate
* Hopefully fix histogram buckets for index query times
2023-11-08 14:01:28 +01:00
Viktor Lofgren
7aa2f80117 (domain) id.au should be treated as a TLD 2023-11-06 19:07:47 +01:00
Viktor Lofgren
7617b4cbc2 (crawler) Fix NPE in crawler caused by not having fetched the domains list yet 2023-11-06 18:16:38 +01:00
Viktor Lofgren
e0c769fd19 (converter) Integrate atags.parquet with the encyclopedia sideloader
Also clean up stackexchange and dirtree a bit.
2023-11-06 18:03:01 +01:00
Viktor Lofgren
ebd10a5f28 (crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized 2023-11-06 16:14:58 +01:00
Viktor Lofgren
2b77184281 (converter) Integrate atags with the topology field 2023-11-06 13:46:44 +01:00
Viktor Lofgren
e23976f6c4 (search) Fix card title overflow 2023-11-06 13:25:39 +01:00
Viktor Lofgren
0b8dc02eba (result-ranking) Nudge up results with ngram matches a tiny bit 2023-11-06 13:14:22 +01:00
Viktor Lofgren
fde1d0677e (search) Remove unnecessary dependencies 2023-11-06 12:56:32 +01:00
Viktor Lofgren
48986574ae (result-ranking) Use a weighted calculation of priority term importance 2023-11-06 12:56:21 +01:00
Viktor Lofgren
c7a6a71d07 (result-ranking) Use a weighted calculation of priority term importance 2023-11-06 12:48:23 +01:00
Viktor Lofgren
1847845151 Revert "(loader) Optimize INSERT statements"
This reverts commit 7cb92195d1.
2023-11-04 19:32:02 +01:00
Viktor Lofgren
7cb92195d1 (loader) Optimize INSERT statements
INSERT IGNORE is too slow.
2023-11-04 17:43:55 +01:00
Viktor Lofgren
72afa0341f duckdb connection may need to be synchronized? 2023-11-04 14:30:25 +01:00
Viktor Lofgren
0152004c42 Initial Commit Anchor Tags
* Added new (optional) model file in $WMSA_HOME/data/atags.parquet
* Converter gets a component for creating a projection of its domains onto the full atags parquet file
* New WordFlag ExternalLink
* These terms are also for now flagged as title words
* Fixed a bug where Title words aliased with UrlDomain words
* Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking
2023-11-04 14:24:17 +01:00
Viktor Lofgren
8e9698c9a0 (control/search) Add ability to suggest removing a site from random exploration
This is what most complaints have been about.
2023-11-02 15:29:49 +01:00
Viktor Lofgren
3047e2dd7c (screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker 2023-11-01 16:38:55 +01:00
Viktor Lofgren
a8b9d21f2d (executor) Refine atag export logic
* Remove obviously uninteresting tags
* Omit URL schema for more sensible sorting
* Change the column order to put the source domain last
2023-11-01 13:23:14 +01:00
Viktor Lofgren
c77a5b7cb6 (control) GUI for atags export 2023-10-31 17:55:47 +01:00
Viktor Lofgren
23f2068e33 (executor) Actor for exporting anchor tag data from crawl data 2023-10-31 17:32:34 +01:00
Viktor Lofgren
ffadfb4149 (control) Use a partial template for the storage types tabs. 2023-10-31 17:12:14 +01:00
Viktor Lofgren
b7e38cfbae (control) Add exports view 2023-10-31 17:08:48 +01:00
Viktor Lofgren
659743b39c (executor) Export Data actor allocates its own storage 2023-10-31 17:04:07 +01:00
Viktor Lofgren
69758c5859 (control) Nicer redirects acknowledging actions 2023-10-31 16:26:29 +01:00
Viktor Lofgren
81bfd7e5fb (experiment) Utility for exporting atags 2023-10-31 16:10:21 +01:00
Viktor Lofgren
8f74dbdbb4 (crawler) Set more lenient parameters for recrawl 2023-10-30 11:35:30 +01:00
Viktor Lofgren
fd5a7eac87 (crawler) Exit crawler retriever on thread interrupted 2023-10-30 11:34:16 +01:00
Viktor Lofgren
6bac3c75cb (api) API documentation 2023-10-29 16:13:21 +01:00
Viktor Lofgren
5d6e0e3790 (log) Clean up logging
Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files.

Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.
2023-10-29 15:52:17 +01:00
Viktor Lofgren
2871a326e6 (ctrl/exe) Clean up UX and code 2023-10-29 14:09:39 +01:00
Viktor Lofgren
abb42f0f36 (crawler) Fix bug in SQL statement
Arguments were in the wrong order when inserting sites submitted to be crawled
2023-10-29 13:19:17 +01:00
Viktor Lofgren
f6fcb04817 (experiment) Repair the experiment runner 2023-10-27 16:16:50 +02:00
Viktor Lofgren
88f49834fd (docs) Update documentation 2023-10-27 12:45:39 +02:00
Viktor Lofgren
4415f52e18 (keyword-extraction) Fix broken test 2023-10-27 12:19:33 +02:00
Viktor Lofgren
98d742d634 (actor) Code cleanup 2023-10-27 12:19:20 +02:00
Viktor Lofgren
6c1ca10be7 (minor) code cleanup 2023-10-27 11:38:37 +02:00
Viktor Lofgren
aeaf2d546a (search) Fix broken redirect for flagging problems with websites 2023-10-27 11:20:49 +02:00
Viktor Lofgren
c7cb6664b4 (control) Indicate missing services with danger-color instead of having a distracting and constantly updating last-seen number 2023-10-26 18:05:22 +02:00
Viktor Lofgren
79adba9284 (index) Fix bug in dealing with quoted search terms 2023-10-26 16:28:23 +02:00
Viktor Lofgren
37b7f52f2c (minor) Reduce log severity for getTermMeta miss 2023-10-26 15:41:52 +02:00
Viktor Lofgren
c89e0ab255 (minor) Disable ~vlofgren specific debug test 2023-10-26 15:27:59 +02:00
Viktor Lofgren
f613f4f2df (array) Fix spurious search results
This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss.

It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.
2023-10-26 15:27:02 +02:00
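The invariant behind the fix in f613f4f2df is that a search miss must always be encoded as a negative value, so callers can tell it apart from a valid index. A minimal sketch follows, assuming the same convention as `java.util.Arrays.binarySearch` (a miss returns `-(insertionPoint) - 1`); the project's own encoding is not shown in the commit and may differ, and the class name is hypothetical.

```java
// Hypothetical sketch of a binary search that encodes misses as negatives.
final class EncodedBinarySearch {
    private EncodedBinarySearch() {}

    // Returns the index of key in the sorted array, or -(insertionPoint) - 1
    // when key is absent. The miss encoding is strictly negative, even when
    // the insertion point is 0.
    static int search(long[] sorted, long key) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = (low + high) >>> 1; // avoids int overflow of low + high
            long value = sorted[mid];
            if (value < key) {
                low = mid + 1;
            } else if (value > key) {
                high = mid - 1;
            } else {
                return mid;
            }
        }
        return -(low + 1);
    }

    // Decode the position where a missing key would be inserted.
    static int insertionPoint(int encodedMiss) {
        return -encodedMiss - 1;
    }

    public static void main(String[] args) {
        long[] data = {10, 20, 30};
        System.out.println(search(data, 20));
        System.out.println(search(data, 25));
    }
}
```

Encoding a miss as plain `-insertionPoint` would yield 0 for a key smaller than every element, which is indistinguishable from a hit at index 0; subtracting one more keeps every miss below zero.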
Viktor Lofgren
a497e4c920 (crawler) Terminate crawler after a few hours of no progress 2023-10-26 12:49:28 +02:00
Viktor Lofgren
0f637fb722 (logging) Better logging configurations 2023-10-26 12:48:10 +02:00
Viktor Lofgren
abbadc92a0 (executor) Prevent TriggerAdjacencyCalculationActor from showing up in the actions tab when it isn't running 2023-10-25 21:25:07 +02:00
Viktor Lofgren
97fcbdd6d9 (control) Move storage actions into the actions tab
* Also disable annoying CSS animations
2023-10-25 21:23:56 +02:00
Viktor Lofgren
d7686b665e Refactoring
* Encyclopedia sideloader; permit providing base URL.
* Storage base shows node id in GUI
* ProcessLivenessMonitorActor restarts automatically
* Clean-up of outbox code
2023-10-25 18:51:02 +02:00
Viktor Lofgren
5de41a3a7f (search-service) Show node affinity in site info tab 2023-10-25 12:44:48 +02:00
Viktor Lofgren
84cdac83d6 (control) Move message queue monitor to control 2023-10-24 16:44:28 +02:00
Viktor Lofgren
436a55ee1e (control) Render UUID tooltip with dashes. 2023-10-24 16:37:40 +02:00
Viktor Lofgren
313cc2965c (index-creation) Print whether full or prio is created
Previous state of saying reverse index for both was pretty confusing.
2023-10-24 16:23:10 +02:00
Viktor Lofgren
95f74c5ea7 (control) Filter out heartbeats that are stopped 2023-10-24 16:09:28 +02:00
Viktor Lofgren
8d1c3c754d Testing development flow with adding a ~tilde search filter 2023-10-24 15:35:15 +02:00
Viktor Lofgren
72152f9d80 Fix bug in handling js parameters 2023-10-24 15:10:02 +02:00
Viktor Lofgren
ebd365a128 Fix exception 2023-10-24 15:04:12 +02:00
Viktor Lofgren
0406e76889 (api) Remove logging cruft 2023-10-24 13:39:05 +02:00
Viktor Lofgren
c2b28c0f8d (api) Trial streaming API 2023-10-24 13:26:46 +02:00
Viktor Lofgren
9aa5038756 (search) Remove unnecessary filtering operation 2023-10-24 11:43:47 +02:00
Viktor Lofgren
a860f8f1a8 (index/qs) GRPC API for better query performance 2023-10-24 11:38:07 +02:00
Viktor Lofgren
487c016a32 (qs) Speed 2023-10-23 14:03:09 +02:00
Viktor Lofgren
e4bddb4993 (control) Better UUID accessibility 2023-10-23 12:53:53 +02:00
Viktor Lofgren
731afcb864 (qs) Parallel execution 2023-10-23 12:06:03 +02:00
Viktor Lofgren
efb73ff4e7 (qs) Don't blow up if an index node isn't responsive 2023-10-23 11:53:18 +02:00
Viktor Lofgren
2ed2f35a9b (actor) Rewrite of the actor prototype class using record pattern matching 2023-10-23 10:18:20 +02:00
Viktor Lofgren
119151cad3 (converter) Separation of concerns 2023-10-22 14:35:33 +02:00
Viktor Lofgren
758f9b5aa5 (converter) Get UUID pips out of the models
Rendering concerns shouldn't be in the models, it's poor separation of concerns and very difficult to follow.
2023-10-22 14:24:52 +02:00
Viktor Lofgren
e06a8c1de2 (converter) Put upper limit on number of worker threads. 2023-10-22 14:03:09 +02:00
Viktor Lofgren
29ce8ca0cf (db) Reduce db pool size
This is a temporary thing
2023-10-22 14:03:09 +02:00
Viktor Lofgren
eb4158df0b (control) Fix start/stop FSM endpoints 2023-10-22 14:03:09 +02:00
Viktor Lofgren
12fda1a36b (control) Temporarily re-writing the data balancer to get it to work in prod
Need to clean this up later.
2023-10-22 14:03:09 +02:00
Viktor Lofgren
e927f99777 (control) JSON serializes Map<Integer> to Map<Double> and Java gets confused 2023-10-21 16:24:20 +02:00
Viktor Lofgren
044bcf55bd (control) Fix SQL in rebalance actor 2023-10-21 16:13:37 +02:00
Viktor Lofgren
e475af9f49 (control) Initialize controlActorService 2023-10-21 16:06:53 +02:00
Viktor Lofgren
c6abcd91fa (control) Better use of FS states, fix bug with start/stop actors 2023-10-20 16:37:49 +02:00
Viktor Lofgren
10fc489822 (converter) More robust filename resolution 2023-10-20 14:16:03 +02:00
Viktor Lofgren
d76d926c38 (control/executor) Add new configuration options for node
It's now possible to configure prod instance to not retain processed data.
2023-10-20 14:05:19 +02:00