MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 21:29:00 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	9f1649636e	Clean up documentation and rename `domain-links` to `link-graph`	2024-02-28 11:40:39 +01:00
Viktor Lofgren	3a65fe8917	Add offload executor to GrpcChannelPoolFactory	2024-02-27 22:08:39 +01:00
Viktor Lofgren	99a6e56e99	(index-client) Increase thread count in index client This should be a fair bit larger than the number of index nodes	2024-02-27 22:00:29 +01:00
Viktor Lofgren	e696fd9e92	(docs) Begin un-fucking the docs after refactoring	2024-02-27 21:22:21 +01:00
Viktor Lofgren	c943954bb4	(domain-info) Reduce memory usage	2024-02-27 21:22:21 +01:00
Viktor Lofgren	eaf836dc66	(service/grpc) Reduce thread count Netty and GRPC by default spawns an incredible number of threads on high-core CPUs, which amount to a fair bit of RAM usage. Add custom executors that throttle this behavior.	2024-02-27 21:22:21 +01:00
Viktor Lofgren	dbf64b0987	(logs) Add the option for json logging	2024-02-27 21:22:20 +01:00
Viktor Lofgren	8d0af9548b	(search) Bot mitigation Add the ability to indicate to the search service that a request is malicious, and to poison the results by providing randomly reordered old results instead.	2024-02-27 21:22:19 +01:00
Viktor Lofgren	67aa20ea2c	(array) Attempting to debug strange errors	2024-02-27 21:22:18 +01:00
Viktor Lofgren	5604e9f531	(query) Bump query length, see what happens :P	2024-02-27 21:22:17 +01:00
Viktor Lofgren	1a51ec2d69	(index) Index optimization	2024-02-27 21:22:17 +01:00
Viktor Lofgren	3eb0800742	(index) Improve granularity of candidate queue polling	2024-02-27 21:22:17 +01:00
Viktor Lofgren	427f3e922f	(index) Retire count operation, clean up index code.	2024-02-27 21:22:17 +01:00
Viktor Lofgren	823ca73a3f	(domain-ranking) Fix a crash during ranking when the edges of the similarity graph don't quite match the vertices of the link graph.	2024-02-27 21:22:17 +01:00
Viktor Lofgren	7fc0d4d786	(index) Observability for query execution queues	2024-02-27 21:22:17 +01:00
Viktor Lofgren	b8e336e809	(index) Reduce time allocation a bit	2024-02-27 21:22:17 +01:00
Viktor Lofgren	9429bf5c45	(index) Clean up	2024-02-27 21:22:17 +01:00
Viktor Lofgren	f7f0100174	(build) Make docker image registry and tag configurable in root build.gradle	2024-02-25 11:08:49 +01:00
Viktor Lofgren	fc00701a1e	(index) Experimental refactoring of the indexing functionality	2024-02-25 11:05:10 +01:00
Viktor Lofgren	09447f2ad2	(process service) Inherit parent's assertion status	2024-02-24 18:32:37 +01:00
Viktor Lofgren	ff0ef1eebc	(cleanup) Minor cleanups	2024-02-24 15:33:56 +01:00
Viktor Lofgren	1d34224416	(refac) Remove src/main from all source code paths. Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one. While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules. Which you'll do a lot, because it's modular. The src/main/java convention makes a lot of sense for a non-modular project though. This ain't that.	2024-02-23 16:13:40 +01:00
Viktor Lofgren	56d35aa596	(refac) Move execution API out of executor service	2024-02-23 13:26:11 +01:00
Viktor Lofgren	2201b1a506	(refac) Clean up code issues	2024-02-23 11:39:19 +01:00
Viktor Lofgren	5cdb07023b	(refac) Clean up unused imports	2024-02-23 11:27:20 +01:00
Viktor Lofgren	6154e16951	(refac) Remove "distPath"	2024-02-23 11:22:02 +01:00
Viktor Lofgren	f4ff7185f0	(refac) Move process-mqapi out of api directory	2024-02-23 11:18:29 +01:00
Viktor Lofgren	6357d30ea0	Clean up docs	2024-02-22 19:53:20 +01:00
Viktor Lofgren	8d4ef982d0	Clean up docs	2024-02-22 19:37:59 +01:00
Viktor Lofgren	4740156cfa	Clean up docs	2024-02-22 18:18:58 +01:00
Viktor Lofgren	f8e7f75831	Move index to top level of code	2024-02-22 18:01:35 +01:00
Viktor Lofgren	085137ca63	* Extract the index functionality	2024-02-22 17:31:25 +01:00
Viktor Lofgren	3fd2a83184	* Extract the search-query function	2024-02-22 15:27:39 +01:00
Viktor Lofgren	66c1281301	(zk-registry) epic yak shaving WIP Cleaning out a lot of old junk from the code, and one thing led to another... * Build is improved, now constructing docker images with 'jib'. Clean build went from 3 minutes to 50 seconds. * The ProcessService's spawning is smarter. Will now just spawn a java process instead of relying on the application plugin's generated outputs. * Project is migrated to GraalVM * gRPC clients are re-written with a neat fluent/functional style. e.g. ```channelPool.call(grpcStub::method) .async(executor) /* <-- optional */ .run(argument); ``` This change is primarily to allow handling ManagedChannel errors, but it turned out to be a pretty clean API overall. * For now the project is all in on zookeeper * Service discovery is now based on APIs and not services. Theoretically means we could ship the same code as either a monolith or a service mesh. * To this end, began modularizing a few of the APIs so that they aren't strongly "living" in a service. WIP! Missing is documentation and testing, and some more breaking apart of code.	2024-02-22 14:01:23 +01:00
Viktor Lofgren	73947d9eca	(zk-registry) Filter out phantom addresses in the registry The change adds a hostname validation step to remove endpoints from the ZkServiceRegistry when they do not resolve. This is a scenario that primarily happens when running in docker, and the entire system is started and stopped.	2024-02-20 18:09:11 +01:00
Viktor Lofgren	a69c0b2718	(grpc-client) Fix warmup crash The warmup would sometimes crash during a cold start-up, because it could not get an API. Changed the warmup to just create a GrpcSingleNodeChannelPool for the node.	2024-02-20 18:03:57 +01:00
Viktor Lofgren	6c764bceeb	(doc) Update documentation for `service-discovery`	2024-02-20 16:09:49 +01:00
Viktor Lofgren	273aeb7bae	(doc) Update documentation with new gRPC service setup	2024-02-20 16:06:05 +01:00
Viktor Lofgren	d185858266	(minor) Add missing query parameter to ServiceEndpoint.toURL	2024-02-20 15:49:43 +01:00
Viktor Lofgren	453bd6064b	(minor) Add warm-up to GrpcMultiNodeChannelPool to speed up the initial messages Without doing this, connections would be created lazily, which is probably never desirable.	2024-02-20 15:45:16 +01:00
Viktor Lofgren	14172312dc	(query-client) Fix query client The query service delegates and aggregates IndexDomainLinksApiGrpc messages to the index services. The query client was accidentally also doing this, instead of talking to the query service. Fixed so it correctly talks to the query service and nothing else.	2024-02-20 15:44:07 +01:00
Viktor Lofgren	c600d7aa47	(refac) Inject ServiceRegistry into WebsiteAdjacenciesCalculator	2024-02-20 15:42:32 +01:00
Viktor Lofgren	3c9234078a	(refac) Propagate ZOOKEEPER_HOSTS to spawned processes	2024-02-20 15:42:16 +01:00
Viktor Lofgren	ee8e0497ae	(refac) Move service discovery injection to a separate guice module	2024-02-20 15:41:04 +01:00
Viktor Lofgren	30bdb4b4e9	(config) Clean up service configuration for IP addresses Adds new ways to configure the bind and external IP addresses for a service. Notably, if the environment variable WMSA_IN_DOCKER is present, the system will grab the HOSTNAME variable and announce that as the external address in the service registry. The default bind address is also changed to be 0.0.0.0 only if WMSA_IN_DOCKER is present, otherwise 127.0.0.1; as this is a more secure default.	2024-02-20 14:22:48 +01:00
Viktor Lofgren	2ee492fb74	(gRPC) Bind gRPC services to an interface By default, gRPC magically decides on an interface. The change will explicitly tell it what to use.	2024-02-20 14:22:47 +01:00
Viktor Lofgren	36a5c8b44c	(cleanup) Clean up code	2024-02-20 14:22:47 +01:00
Viktor Lofgren	07b625c58d	(query-client) Add support for fault-tolerant requests to single node services Adding a method importantCall that will retry a failing request on each route until it succeeds or the routes run out.	2024-02-20 14:16:05 +01:00
Viktor Lofgren	746a865106	(client) Fix handling of channel refreshes The previous code made an incorrect assumption that all routes refer to the same node, and would overwrite the route list on each update. This led to storms of closing and opening channels whenever an update was received. The new code is correctly aware that we may talk to multiple nodes.	2024-02-20 14:14:09 +01:00
Viktor	f85ec28a16	Merge branch 'master' into service-discovery	2024-02-20 11:44:12 +01:00
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor	d05c916491	Merge pull request #80 from MarginaliaSearch/ranking-algorithms Clean up domain ranking code	2024-02-18 09:52:34 +01:00
Viktor Lofgren	c73e43f5c9	(recrawl) Mitigate recrawl-before-load footgun In the scenario where an operator * Performs a new crawl from spec * Doesn't load the data into the index * Recrawls the data The recrawl will not find the domains in the database, and the crawl log will be overwritten with an empty file, irrecoverably losing the crawl log and making it impossible to load! To mitigate the impact of similar problems, the change saves a backup of the old crawl log, as well as complains about this happening. More specifically to this exact scenario however, the parquet-loaded domains are also preemptively inserted into the domain database at the start of the crawl. This should help the DbCrawlSpecProvider to find them regardless of loaded state. This may seem a bit redundant, but losing crawl data is arguably the worst type of disaster scenario for this software, so it's arguably merited.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	e61e7f44b9	(blacklist) Delay startup of blacklist To help services start faster, the blacklist will no longer block until it's loaded. If such a behavior is desirable, a method was added to explicitly wait for the data.	2024-02-18 09:23:20 +01:00
Viktor Lofgren	f9b6ac03c6	(api) Clean up incorrect error handling in GrpcChannelPool	2024-02-18 08:45:35 +01:00
Viktor Lofgren	296ccc5f8e	(blacklist) Clean up blacklist impl The domain blacklist blocked the start-up of each process that injected it, adding like 30 seconds to the start-up time in prod. This change moves the loading to a separate thread entirely. For threads or processes that require the blacklist to be definitely loaded, a helper method was added that blocks until that time.	2024-02-18 08:16:48 +01:00
Viktor Lofgren	8cb5825617	(search) Temporarily disable the Popular filter This filter currently does not distinguish itself very much from the unfiltered results, and lends the impression that the filters don't "do anything". It may come back in some shape or form in the future, with some additional tweaking of the rankings...	2024-02-18 08:02:01 +01:00
Viktor Lofgren	cee707abd8	(crawler) Implement domain shuffling in DbCrawlSpecProvider Modified the DbCrawlSpecProvider to shuffle domains after loading to ensure a good mix for each crawl. This change prevents overload of crawling the same server in parallel from different subdomains or crawling big domains all at once.	2024-02-17 17:47:38 +01:00
Viktor Lofgren	92717a4832	(client) Refactor GrpcStubPool to handle error states Refactored the GRPC Stub Pool for better handling of channel SHUTDOWN state. Any disconnected channels are now re-created before returning the stub. The class was also renamed to GrpcChannelPool, as we no longer pool the stubs.	2024-02-17 14:42:26 +01:00
Viktor Lofgren	37a7296759	(sideload) Clean up the sideloading code Clean up the sideloading code a bit, making the Reddit sideloader use the more sophisticated SideloaderProcessing approach to sideloading, instead of mimicking StackexchangeSideloader's cruder approach. The reddit sideloader now uses the SideloaderProcessing class. It also properly sets js-attributes for the sideloaded documents. The control GUI now also filters the upload directory items based on name, and disables the items that do not have appropriate filenames.	2024-02-17 14:32:36 +01:00
Viktor Lofgren	ebbe49d17b	(sideload) Fix sideloading of explicitly selected stackexchange files Fix a bug where sideloading stackexchange files by explicitly selecting the 7z file would fail, since the 7z file would be passed along to the converter rather than the path to the pre-converted .db file.	2024-02-17 13:24:04 +01:00
Viktor Lofgren	b7e330855f	(control) Update descriptive text in the control GUI	2024-02-16 20:32:31 +01:00
Viktor Lofgren	ac89224fb0	(domain-ranking) Remove lingering mentions of the algorithms field from the GUI	2024-02-16 20:28:37 +01:00
Viktor Lofgren	9ec262ae00	(domain-ranking) Integrate new ranking logic The change deprecates the 'algorithm' field from the domain ranking set configuration. Instead, the algorithm will be chosen based on whether influence domains are provided, and whether similarity data is present.	2024-02-16 20:22:01 +01:00
Viktor Lofgren	64acdb5f2a	(domain-ranking) Clean up domain ranking The domain ranking code was admittedly a bit of a clown fiesta; at the same time buggy, fragile and inscrutable. Migrating over to use JGraphT to store the link graph when doing rankings, and using their PageRank implementation. Also added a modified version that does PersonalizedPageRank.	2024-02-16 18:04:58 +01:00
Viktor Lofgren	a175b36382	(search) Correct accidental regression of the SmallWeb filter	2024-02-15 18:16:56 +01:00
Viktor Lofgren	16526d283c	(search) Correct accidental regression of the Vintage filter	2024-02-15 18:13:34 +01:00
Viktor Lofgren	752e677555	(search) Expose getSearchTitle in DecoratedSearchResults	2024-02-15 13:56:44 +01:00
Viktor Lofgren	f796af1ae8	(search) Fix failed refactoring	2024-02-15 13:53:19 +01:00
Viktor Lofgren	2515993536	(search) Fix issue where searchTitle setting gets lost when searching again It's important that the field names in SearchParameters match the fields referenced in search-form.hdb, otherwise they will get lost in transit.	2024-02-15 13:52:11 +01:00
Viktor Lofgren	66b3e71e56	(search) Expose more search options This change set updates the query APIs to enable the search service to add additional criteria, such as QueryStrategy and TemporalBias. The QueryStrategy makes it possible to e.g. require a match is in the title of a result, and TemporalBias enables penalizing results that are not within a particular time period. These options are added to the search interface. The old 'recent results' is modified to use TemporalBias, and a new filter 'Search In Title' is added as well. The vintage filter is modified to add a temporal bias for the past.	2024-02-15 13:39:51 +01:00
Viktor Lofgren	652d151373	(process-models) Improve documentation	2024-02-15 12:21:12 +01:00
Viktor Lofgren	300b1a1b84	(index-query) Add some tests for the QueryFilter code	2024-02-15 12:03:30 +01:00
Viktor Lofgren	6c3b49417f	(index-query) Improve documentation and code quality	2024-02-15 11:33:50 +01:00
Viktor Lofgren	dcc5cfb7c0	(index-journal) Improve documentation and code quality	2024-02-15 10:51:49 +01:00
Viktor	d970836605	Merge pull request #79 from MarginaliaSearch/reddit (converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy, and improves the sideload UX a tiny bit.	2024-02-15 09:17:56 +01:00
Viktor Lofgren	8021bd0aae	(control) Sort upload listing results Improve the UX of the sideload GUI by sorting the results in a sensible fashion, first by whether it's a directory, then by its filename. The change also changes the timestamp rendering to a more human-readable format than full ISO-8601.	2024-02-15 09:13:40 +01:00
Viktor Lofgren	8f91156d80	(control) Improve sideload UX The sideload forms didn't properly set the label 'for' property, meaning that while label tags existed, they weren't appropriately clickable. Also removed unnecessary limits on the sideload target being a directory for stackexchange and warc. It's been possible to directly load a particular file for a while, but not allowed due to GUI limits.	2024-02-14 18:38:20 +01:00
Viktor Lofgren	fab36d6e63	(converter) Loader for reddit data Adds experimental sideloading support for pushshift.io style reddit data. This dataset is limited to data older than 2023, due to licensing changes making large-scale data extraction difficult. Since the median post quality on reddit is not very good, the sideloader will only load a subset of self-texts and top-level comments that have sufficiently many upvotes. Empirically this appears to mostly return good matches, even if it probably could index more. Tests were written for this, but all require local reddit data which can't be distributed with the source code. If these cannot be found, the tests will short-circuit as OK. They're mostly there for debugging, and it's fine if they don't always run. The change also refactors the sideloading a bit since it was a bit messy.	2024-02-14 17:35:44 +01:00
Viktor Lofgren	3d54879c14	(API, minor) Clean up comments.	2024-02-14 12:09:16 +01:00
Viktor Lofgren	e17fcde865	(API, minor) Remove unnecessary inject.	2024-02-14 12:05:50 +01:00
Viktor Lofgren	6950dffcb4	(API) Fix result order in API results These results should be presented in the same order as their ranking score.	2024-02-14 11:47:14 +01:00
Viktor Lofgren	02dd5c5853	(converter) Look at properties when deciding pool size Look at whether the property 'system.conserveProperty' is enabled when deciding the default pool size for the converter. If true, a much more conservative default is used, limiting the risk of running out of memory.	2024-02-12 16:24:19 +01:00
Viktor Lofgren	5a1087dbf9	(qs-gui) Update documentation, add param for domain limit	2024-02-12 16:13:48 +01:00
Viktor Lofgren	7564dfeb7a	(minor) Correct link in documentation for app services	2024-02-12 15:55:06 +01:00
Viktor Lofgren	10bad635a8	(search) Experimental support for clustering search results Improves clustering of results.	2024-02-11 20:00:11 +01:00
Viktor Lofgren	7cc8b0fed5	(search) Experimental support for clustering search results Improves clustering of results.	2024-02-11 19:58:55 +01:00
Viktor Lofgren	a77846373b	(search) Experimental support for clustering search results Improves clustering of results.	2024-02-11 19:48:55 +01:00
Viktor Lofgren	bcd0dabb92	(search) Experimental support for clustering search results Adds experimental support for clustering search results by e.g. domain. At a first stage, this is only enabled for the wiki and forum filters. The commit also cleans up the UrlDetails class, which contained a number of vestigial entries.	2024-02-11 17:31:38 +01:00
Viktor Lofgren	9d68062553	(converter) Make processing pool size configurable	2024-02-10 20:59:08 +01:00
Viktor Lofgren	e66d0b7431	(warc) Minor code clean-up. Remove redundant String$getBytes(). This is mainly an improvement in code consistency.	2024-02-10 18:30:33 +01:00
Viktor Lofgren	ba26f6ce84	(doc) Documentation corrections	2024-02-10 14:16:01 +01:00
Viktor Lofgren	929caed0b9	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 20:07:01 +01:00
Viktor Lofgren	8340aa2b6c	(warc) Improve WARC standard adherence The WARC specification says the records should transparently remove compression. This was not done, leading to the WARC typically being a bit of a gzip-Matryoshka.	2024-02-09 17:29:21 +01:00
Viktor Lofgren	1188fe3bf0	(conf) Improve naming consistency Rename the property system.conserve-memory to system.conserveMemory in order to be consistent with other properties in the system.	2024-02-09 14:43:08 +01:00
Viktor Lofgren	b15f47d80e	(db) Retire the EC_DOMAIN_LINK table Retire the EC_DOMAIN_LINK table as the data has been migrated off into a file instead.	2024-02-08 15:52:30 +01:00
Viktor Lofgren	ef261cbbd7	(search) Remove stray spaces in bang commands	2024-02-08 14:46:18 +01:00
Conor Flynn	9d7df87886	(search) Fix broken !ddg handling https://duckduckgo.com/search?q=asdf leads to running a search for the term "search" instead of "asdf". Both https://duckduckgo.com/<query> and https://duckduckgo.com/?q=<query> are accepted, but using GET vars seemed more in-keeping with the code.	2024-02-08 13:28:02 +01:00
Viktor Lofgren	a4b2323ca3	(search) Change default search profile to No Filter Recent changes to the result ranking mean the no filter mode returns sufficiently good results for most queries that filtering by default just makes the search results more restricted.	2024-02-08 13:04:05 +01:00
Viktor	e8de468b0b	Make executor API talk GRPC (#75) * (executor-api) Make executor API talk GRPC The executor's REST API was very fragile and annoying to work with, lacking even basic type safety. Migrate to use GRPC instead. GRPC is a bit of a pain with how verbose it is, but that is probably a lesser evil. This is a fairly straightforward change, but it's also large so a solid round of testing is needed... The change set breaks out the GrpcStubPool previously residing in the QueryService, and makes it available to all clients. ServiceId.name was also renamed to avoid the very dangerous clash with Enum.name(). The boilerplate needed for grpc was also extracted into a common gradle file for inclusion into the appropriate build.gradle-files.	2024-02-08 13:01:12 +01:00
Viktor Lofgren	d83a3bf4e2	(search) Fix broken !w handling Printf format error derp.	2024-02-08 12:11:33 +01:00
Viktor Lofgren	f2b39ad055	(search) Fix broken !bang handling !bang query handling seems to have fallen victim to an overzealous refactoring effort, and broken. It's now repaired, and a test is in place to ensure we know if it breaks again.	2024-02-08 12:05:09 +01:00
Viktor Lofgren	95d1bd98e4	(array) Update documentation, make unsafe configurable The readme for the array library was extremely out of date. Updating it with accurate information about how the library works, and a demo that should compile. Also added a system property for disabling the use of sun.misc.Unsafe.	2024-02-07 12:26:47 +01:00
Viktor Lofgren	8acbc6a6b4	(index-construction) Split repartition into two actions cont'd Continues `467ba5be20` by breaking out a constant with the name of the primary ranking set. Also ensures it doesn't get spuriously logged as updated during the secondary updating pass.	2024-02-06 19:54:17 +01:00
Viktor Lofgren	467ba5be20	(index-construction) Split repartition into two actions This change splits the previous 'repartition' action into two steps, one for recalculating the domain rankings, and one for recalculating the other ranking sets. Since only the first is necessary before the index construction, the rest can be delayed until after... To avoid issues in handling the shotgun blast of MqNotifications, Service was switched over to use a synchronous message queue instead of an asynchronous one. The change also modifies the behavior so that only node 1 will push the changes to the EC_DOMAIN database table, to avoid unnecessary db locks and contention with the loader. Additionally, the change fixes a bug where the index construction code wasn't actually picking up the rankings data. Since the index construction used to be performed by the index-service, merely saving the data to memory was enough for it to be accessible within the index-construction logic, but since it's been broken out into a separate process, the new process just injected an empty DomainRankings object instead. To fix this, DomainRankings can now be persisted to disk, and a pre-loaded version of the object is injected into the index-construction process.	2024-02-06 17:20:07 +01:00
Viktor Lofgren	29ddf9e61d	(doc) Update docs	2024-02-06 16:29:55 +01:00
Viktor Lofgren	92e119cab3	(doc) Update docs	2024-02-06 12:43:42 +01:00
Viktor Lofgren	92049ba8e4	(doc) Update docs	2024-02-06 12:41:28 +01:00
Viktor Lofgren	54330b9921	(*) Remove dead code	2024-02-06 12:41:13 +01:00
Viktor Lofgren	d1aeb030f2	(doc) Update RandomWriteFunnel documentation	2024-02-06 12:35:24 +01:00
Viktor Lofgren	f89274d1ea	(minor) Fix broken test Fallout from changes in endianness made in `d986f90074`	2024-02-06 12:12:26 +01:00
Viktor Lofgren	7286596fb4	(deps) Remove monkey patched GSON The codebase used to have a monkey patched version of gson that made special optimizations for the unusually large JSON files that used to store e.g. crawl data. Since JSON is no longer used in this fashion, the GSON fork is not needed anymore.	2024-02-06 12:11:39 +01:00
Viktor Lofgren	a2fc83d94e	(control) Add configurable border styling To help distinguish between environments, a system property 'control.appBorder' is added that is injected as a body element border property in the control GUI stylesheets.	2024-02-06 12:05:02 +01:00
Viktor Lofgren	2161799cc3	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:18:00 +01:00
Viktor Lofgren	c88f132057	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:10:03 +01:00
Viktor Lofgren	c6313a5906	(sideload) Fix filename error in dealing with stackoverflow files	2024-02-06 11:06:36 +01:00
Viktor Lofgren	eadcdb5bed	(minor) Improve error handling, naming logging in IndexResultDecorator	2024-02-05 21:05:44 +01:00
Viktor Lofgren	6e7649b5f7	(loader) Mitigate fragile paging behavior IndexJournalWriterPagingImpl was modified to not page on number of entries written, but number of (equivalent uncompressed) bytes written. Since the failure mode if too much data is written per file is quiet corruption of the index, the former behavior was extremely fragile. The new behavior should consistently ensure that the data is sufficiently small to not cause any integer rollovers. The change in `6dcc20038c` was reverted, as there is really no sane reason to have this configurable in software.	2024-02-05 21:05:03 +01:00
Viktor Lofgren	d986f90074	(index) Fix consistency between RandomFileAssembler implementations The RandomFileAssembler implementations, introduced in commit `53c575db3f` were all acting subtly differently. The RWF implementation wrote BigEndian longs instead of the native endianness used by the other implementations (and expected by the index construction code), further the mmap implementation exposed a bug in LongArray.write() that caused it to create a larger file than necessary. A test was built to ensure the output of these implementations is equivalent.	2024-02-05 21:01:32 +01:00
Viktor Lofgren	53c575db3f	(index-construction) Make random-write file strategy configurable To cope with writing large files out of order, the system needs some form of strategy to avoid writing them directly to disk, as this causes insane amounts of disk thrashing. By default, the data is just buffered in RAM. This works well on a large server, but smaller systems struggle. To help systems with small RAM process large amounts of data, the old RandomWriteFunnel is brought back if the system property 'system.conserve-memory' is set to true. RandomWriteFunnel is buffering the construction by creating a series of small files that pigeonhole the writes into rough neighborhoods, and then it goes over the files one by one to construct one area of the file at a time. This is relatively slow and uses more than twice the disk size. A new interface RandomFileAssembler is introduced as an abstraction for this operation. A third strategy, direct mmaps, is also introduced if the file is very small (less than 1 GB). In this domain, disk thrashing is unlikely since it will comfortably fit in RAM.	2024-02-05 12:31:15 +01:00
Viktor Lofgren	6dcc20038c	(index-journal) Make index journal page size configurable Adds a new system property loader.journal-page-size to configure this setting.	2024-02-05 11:26:05 +01:00
Viktor Lofgren	fa145f632b	(sideload) Add special handling for sideloaded wiki documents This update enhances the SideloaderProcessing and DocumentClass modules to specially handle sideloaded wiki documents. Wiki content is generally truncated to the first paragraph, which generally tends to be too short to be included independently. An additional DocumentClass (SIDELOAD) has been introduced to suppress the length check in this case.	2024-02-02 21:22:07 +01:00
Viktor Lofgren	785d8deadd	(crawler) Improve meta-tag redirect handling, add tests for redirects. Wrote a new test to examine the redirect behavior of the crawler, ensuring that the redirect URL is the URL that is reported in the parquet file. This works as intended. Noticed in the course of this that the crawler doesn't add links from meta-tag redirects to the crawl frontier. Added logic to handle this case, amended the test case to verify the new behavior. Added the meta-redirect case to the HtmlDocumentProcessorPlugin as well, so that we consider it a link between documents in the unlikely case that a meta redirect is to another domain.	2024-02-01 20:30:43 +01:00
Viktor Lofgren	93a2d5afbf	(*) Fix poorly named test Likely old refactoring gore.	2024-02-01 20:08:15 +01:00
Viktor Lofgren	d60c6b18d4	(doc) Update the readmes for the crawler, as they've grown stale.	2024-02-01 18:10:55 +01:00
Viktor Lofgren	d1e02569f4	(language-processing) Add a system property for configuring which language detection model to use The flag is `system.languageDetectionModelVersion`. * If negative, no model is used. * If 0, both models are used. * If 1, the old crappy model is used. * If 2, the new fasttext model is used.	2024-01-31 13:02:33 +01:00
Viktor Lofgren	9ce67029ca	(language-processing) Add a system property for configuring which language detection model to use The flag is `system.languageDetectionModelVersion`. * If negative, no model is used. * If 0, both models are used. * If 1, the old crappy model is used. * If 2, the new fasttext model is used.	2024-01-31 13:02:16 +01:00
Viktor Lofgren	98f3382cea	(minor) Fix test and improve error message	2024-01-31 11:53:41 +01:00
Viktor Lofgren	52a0255814	() Add flag for disabling ASCII flattening The production configuration assumes all content of interest is 7 bit ASCII, and makes a series of optimizations based on this. This assumption holds poorly in the wild. Adding an experimental* system property 'system.noFlattenUnicode', that when set to TRUE, will disable this behavior. IMPORTANT!! The index needs to be re-constructed when this flag is changed, as different hash functions are selected for the keyword->identifier mappings.	2024-01-31 11:50:59 +01:00
Viktor Lofgren	eb59ac8535	(index-ranking) Adjust the BM25P factors a bit Since the bleed-flags set by the anchor tag logic have been changed to Site and SiteAdjacent, give them a bit more importance when set together with ExternalLink. UrlDomain and UrlPath are also more consistently only rewarded once.	2024-01-30 21:27:29 +01:00
Viktor Lofgren	6edc318597	(control) Fix typo in URL linking to new-crawl-specs	2024-01-26 10:43:10 +01:00
Viktor Lofgren	182c0cf28e	(control) Add warnings about domain data contamination	2024-01-25 18:26:15 +01:00
Viktor Lofgren	0b105b5986	(converter) Update hyperlink text for new crawl spec creation. Fix minor typo.	2024-01-25 18:05:11 +01:00
Viktor Lofgren	cae1bad274	(*) Add download-sample action, refactor file storage This changeset adds an action for downloading a set of sample data from downloads.marginalia.nu. It also refactors some leaky abstractions out of FileStorageService. allocateTemporaryStorage has been renamed allocateStorage. The storage was never temporary in any scenario... It also doesn't take a storage base, as there was always only one valid option for this input. The allocateStorage method finds the appropriate base itself.	2024-01-25 13:36:30 +01:00
Viktor Lofgren	1b8b97b8ec	(sample-exporter) Add some limits on sizes and lengths Tar files will reject entries with filenames over 100b, so we need a limit there. Also added a maximum size limit to keep the file sizes reasonable.	2024-01-25 11:51:53 +01:00
Viktor Lofgren	c088c25b09	(*) Fix broken test, clean up code	2024-01-24 12:50:41 +01:00
Viktor Lofgren	958d64720e	(control) Add a view for restarting aborted processes This will avoid having to dig in the message queue to perform this relatively common task. The control service was also refactored to extract common timestamp formatting logic out of the data objects and into the rendering.	2024-01-24 12:47:10 +01:00
Viktor Lofgren	805afad4fe	(control) New GUI for exporting crawl data samples Not going to win any beauty pageants, but this is pretty peripheral functionality.	2024-01-23 17:08:21 +01:00
Viktor Lofgren	400f4840ad	(*) Fix broken code in jmh	2024-01-23 17:08:21 +01:00
Viktor Lofgren	ee7792596d	(*) Fix broken test Probably shouldn't have tests depending on external data like this...	2024-01-23 12:03:47 +01:00
Viktor Lofgren	0081328aca	(converter) Adjust which flags are set by anchor text keywords It's a mistake to let it bleed into Title, as this is a high quality signal. We'll co-opt Site and SiteAdjacent instead to reinforce the ExternalLink when count is high.	2024-01-23 11:54:00 +01:00
Viktor Lofgren	3fff7f6878	(converter) Fix issue where quality limits were no longer enforced	2024-01-23 11:42:17 +01:00
Viktor Lofgren	f15dd06473	(index) Delayed close() of SearchIndexReader This avoids concurrent access errors. This is especially important when using Unsafe-based LongArrays, since we have concurrent access to the underlying memory-mapped file. If we pull the rug from under the caller by closing the file, we'll get a SIGSEGV. Even with a "safe" MemorySegment, we'll get ugly stacktraces if we close the file while a thread is still accessing it. So we spin up a thread that sleeps for a minute before actually unmapping the file, allowing any ongoing requests to wrap up. This is 100% a hack, but it lets us get away with doing this without adding locks to the index readers. Since this is "just" mmapped data, and this operation happens optimistically once a month, it should be safe if the call gets lost.	2024-01-23 11:08:41 +01:00
Viktor Lofgren	dd26819d66	(actor) Try to fix rare data race where a finished job is considered dead.	2024-01-22 21:22:38 +01:00
Viktor Lofgren	a6d257df5b	(converter) Update Stackexchange sideload instruction The sideload instruction in the stackexchange template was updated. The instruction now states that stackexchange data will be loaded from a directory on the server and directs users to a new documentation url for more detailed information.	2024-01-22 18:29:20 +01:00
Viktor Lofgren	41d896ba3e	(converter) Refactor content type check in PlainTextDocumentProcessorPlugin The method `isApplicable` in the `PlainTextDocumentProcessorPlugin` was refactored to handle a wider range of content types beyond merely "text/plain". It now also handles any content type that starts with "text/plain;", to accommodate contentTypes that append a charset as well.	2024-01-22 17:52:14 +01:00
Viktor Lofgren	51cdf46645	(control) Improve accessibility in search-to-ban template This update enhances accessibility by associating labels with the corresponding checkboxes in the search-to-ban template.	2024-01-22 15:01:00 +01:00
Viktor Lofgren	1eb0adf6d3	(array) Add sun.misc.Unsafe variant of LongArray	2024-01-22 13:38:42 +01:00
Viktor Lofgren	40c9d2050f	(control) Fully automatic conversion Removed the need to have to run an external tool to pre-process the data in order to load stackexchange-style data into the search engine. Removed the tool itself. This stirred up some issues with the dependencies, that were due to both third-party:ing xz and importing it as a dependency. This has been fixed, and :third-party:xz was removed.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	3a325845c7	(mq) Add better error handling in fsm and mq java.lang.Error:s were not handled properly, leading to mismatch in the bookkeeping of the FSMs. These are now caught, acted on, and re-thrown. MqSynchronousInbox also no longer assumes all exceptions are InterruptedException.	2024-01-22 13:03:24 +01:00
Viktor Lofgren	6a1bfd6270	(array) Remove unused 'madvise' code and 3rd party dependency on 'uppend' This wasn't actually hooked in anywhere. Removing the dependency and code. If it turns out we need madvise in the future, we'll re-introduce it.	2024-01-22 13:01:57 +01:00
Viktor Lofgren	b91ea1d7ca	(control) Re-add gui for sideloading dirtrees	2024-01-20 18:09:40 +01:00
Viktor Lofgren	c5760cd535	(test) Fix broken test	2024-01-20 13:39:40 +01:00
Viktor Lofgren	91c7960800	(crawler) Extract additional configuration properties This commit extracts several previously hardcoded configuration properties, and makes them available through system.properties. The documentation is updated to reflect the change. Dead code was also removed in the process. CrawlSpecGenerator is left feeling a bit over-engineered still, since it's built for a more general case, where all other implementations but the current one are removed, but we'll leave it like this for now as it's fairly readable still.	2024-01-20 10:36:04 +01:00
Viktor Lofgren	2079a5574b	(control) Update heading in restore backup template Changed the heading in the partial restore backup page from "Load" to "Restore Backup".	2024-01-19 21:46:53 +01:00
Viktor Lofgren	27ffb8fa8a	(converter) Integrate zim->db conversion into automatic encyclopedia processing workflow Previously, in order to load encyclopedia data into the search engine, it was necessary to use the encyclopedia.marginalia.nu converter to first create a .db-file. This isn't very ergonomic, so parts of that code-base were lifted in as a 3rd party library, and conversion from .zim to .db is now done automatically. The output file name is based on the original filename, plus a crc32 hash and a .db-ending, to ensure we can recycle the data on repeat loads.	2024-01-19 13:59:03 +01:00
Viktor Lofgren	22c8fb3f59	(crawler) Fix a bug where reference copies of crawl data were written without etag and last-modified This commit also adds a band-aid to ParquetSerializableCrawlDataStream to fetch this from the 304-entity. This can be removed in a few months.	2024-01-18 16:02:27 +01:00
Viktor Lofgren	964419803a	Fix broken test	2024-01-18 15:42:01 +01:00
Viktor Lofgren	6271d5d544	(mq) Add relation tracking between MQ messages for easier tracking and debugging. The change adds a new column to the MESSAGE_QUEUE table called AUDIT_RELATED_ID. This field is populated transparently, using a dictionary mapping Thread IDs to Message IDs, populated by the inbox handlers. The existing RELATED_ID field has too many semantics associated with it; among other things, the FSM code uses this field in tracking state changes. The change set also improves the consistency of inbox names. The IndexClient was buggy and populated its outbox with a UUID. This is fixed. All Service2Service outboxes are now prefixed with 'pp:' to make them even easier to differentiate.	2024-01-18 15:08:27 +01:00
Viktor Lofgren	175bd310f5	(control) Message queue UX improvements	2024-01-18 13:05:50 +01:00
Viktor Lofgren	67ee6f4126	(control) Clean up filtering UX in Events table	2024-01-18 12:35:39 +01:00
Viktor Lofgren	01b312f14c	(*) Make new index nodes accept queries by default It's a confusing default behavior. This was off for nodes n>1 before as a bandaid since querying indices with no data caused delays and errors. This has been fixed now, so there's no need to do this anymore!	2024-01-18 12:05:37 +01:00
Viktor Lofgren	18638c62de	(control) Rephrase text	2024-01-18 11:53:10 +01:00
Viktor Lofgren	753d000788	(control) Add toggle for automatic loading of processed data	2024-01-18 11:52:58 +01:00
Viktor Lofgren	19e781b104	(control) Add basic input validation to node actions Will present a simple error message when required fields aren't populated, instead of a cryptic HTTP status error.	2024-01-18 11:52:49 +01:00
Viktor Lofgren	aa2df327db	(index) Prevent index from attempting to answer queries when no index data is loaded This improves query times, and gets rid of exceptions in the logs when one of the index nodes doesn't have any data loaded, yet is configured to answer queries.	2024-01-18 11:05:45 +01:00
Viktor Lofgren	321fa94b8f	(crawler) Fix rare exception in content type handling due to improper length checking of a split() array	2024-01-18 11:05:45 +01:00
Viktor Lofgren	41cdb8f71b	(control) Fix broken update button in the update-domain-ranking-set form id property was on the wrong element.	2024-01-17 18:21:09 +01:00
Viktor Lofgren	304d4c9acf	(control) Fix result ordering in the file storage listing view In some scenarios, such as when restoring storage items from json-manifest on db failure, the file storage view would present the items in a non-chronological order. Added a sort() operation to mitigate this.	2024-01-17 10:56:30 +01:00
Viktor Lofgren	7fd4c092e3	(control) Clean up UX and accessibility for new domain ranking sets. The change also adds basic support for error messages in the GUI.	2024-01-17 10:47:14 +01:00
Viktor Lofgren	2fe5705542	(control) GUI for ranking sets Still missing is some polish, forms don't have proper labels, validation is inconsistent, no error messages, etc.	2024-01-16 17:10:09 +01:00
Viktor Lofgren	e968365858	(index) Use new DomainRankingSets to configure ranking algos in index svc	2024-01-16 12:43:32 +01:00
Viktor Lofgren	36ad4c7466	(db) Add a new configuration object 'domain ranking set' for storing ranking parameters	2024-01-16 12:34:00 +01:00
Viktor Lofgren	5a62b3058f	(query-api) Make the search set identifier a string value in the API This will free the core marginalia search engine to use arbitrary search set definitions, while the app can use its hardcoded defaults.	2024-01-16 10:55:24 +01:00
Viktor Lofgren	a1df9e886a	(control) Also clean up stale 'NEW' messages	2024-01-15 16:14:02 +01:00
Viktor Lofgren	fd1eec99b5	(cleanup) Fix broken tests	2024-01-15 15:44:33 +01:00
Viktor Lofgren	e162406d40	(control) New control-side actors for cleaning up stale service heartbeats and message queue entries	2024-01-15 15:44:23 +01:00
Viktor Lofgren	c41e68aaab	(control) New export actions for RSS/Atom feeds and term frequency data This commit also refactors the executor a bit, and introduces a new converter-feature called data-extractors for this class of jobs.	2024-01-15 14:54:26 +01:00
Viktor Lofgren	4665af6c42	(control) Move export data endpoint to actions controller	2024-01-15 11:06:22 +01:00
Viktor Lofgren	c0b15427fe	(control) New crawl view should use radio buttons as multiple specs aren't supported	2024-01-15 11:03:47 +01:00
Viktor Lofgren	f29a9d972d	(control) Move 'new crawl spec' to /node/:id/actions, out of /node/:id/storage	2024-01-15 11:02:00 +01:00
Viktor Lofgren	b192373ae7	(control) Highlight unavailable items (creating, deleting) in node actions views	2024-01-15 10:47:54 +01:00
Viktor Lofgren	c042650382	(docs) Improve query service documentation	2024-01-13 21:16:45 +01:00
Viktor Lofgren	07a916a720	(search) Give the swipe hint on mobile a nicer finish	2024-01-13 18:51:54 +01:00
Viktor Lofgren	5134044530	(assistant) Make assistant client more robust to the service going down This is especially important for the non-essential functions, like website similarities...	2024-01-13 18:29:30 +01:00
Viktor Lofgren	4c62065e74	(install) Add two separate templates for the install script One template is for the full Marginalia Search style install, and the other is for a barebones install with no Marginalia-related fluff.	2024-01-13 18:27:42 +01:00
Viktor Lofgren	d28fc99119	(MainClass) ensure logging isn't loaded before service name is known This causes all logs to have names like ${sys:service-name}, instead of the service name...	2024-01-13 18:19:50 +01:00
Viktor Lofgren	c9fb45c85f	(search) Fix control.hideMarginaliaApp handling	2024-01-13 17:24:15 +01:00
Viktor Lofgren	7c6e18f7a7	(*) Overhaul settings and properties Use a system.properties file to configure the system. This is loaded statically by MainClass or ProcessMainClass. Update the property names to be more consistent, and update the documentations to reflect the changes.	2024-01-13 17:12:18 +01:00
Viktor Lofgren	176b9c9666	(convert) Add sizeHints to legacy serializable crawl data stream This reduces the maximum memory usage when processing legacy crawl data	2024-01-13 15:50:36 +01:00
Viktor Lofgren	ecd9c35233	(control) Clean up the event log * Generate fewer uninteresting event messages. * Display fewer irrelevant fields in the overview table.	2024-01-13 13:28:02 +01:00
Viktor Lofgren	71e32c57d9	(control) Add better timestamps for the events and message queue views Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.	2024-01-13 13:04:56 +01:00
Viktor Lofgren	2fefd0e4e3	(control) Add better timestamps for the events and message queue views Adjust display precision based on distance into the past, full ms-accurate timestamps available via hover-action.	2024-01-13 13:03:52 +01:00
Viktor Lofgren	81eaf79a25	(control) UX polish	2024-01-13 12:31:13 +01:00
Viktor Lofgren	8dea7217a6	(control) UX fixes, node GUI doesn't break when an executor service goes offline.	2024-01-13 12:17:30 +01:00
Viktor Lofgren	c0fb9e17e8	(control) Add filter dropdown to message queue table This makes inspecting the queues for processes much easier, as otherwise these important messages are often drowned out by FSM chatter.	2024-01-12 18:46:17 +01:00
Viktor Lofgren	83776a8dce	(control) Wean the ExportDataActor off EC_DOMAIN_LINK The EC_DOMAIN_LINK table is deprecated and slated for removal, use QueryClient.getAllDomainLinks() instead. The ExportDataActor now uses the QueryClient appropriately. The CSV format was also changed to quote the values, to prevent e.g. Excel from interpreting the comma as a decimal separator when previewing the file. Finally the form for triggering an export was overhauled.	2024-01-12 17:09:11 +01:00
Viktor Lofgren	98c0972619	(control) Add a summary table for Actors in the Node overview	2024-01-12 16:32:15 +01:00
Viktor Lofgren	56d832d661	(control) Adjust the margins of the headings to be consistent	2024-01-12 16:16:57 +01:00
Viktor Lofgren	de3a350afe	(control) Disable broken actions and mark the actions view as WIP	2024-01-12 16:16:39 +01:00
Viktor Lofgren	708a741960	(test) Clean up test usage of migrations Several tests were manually running migrations in a large copy-paste blob of code. This makes the test less useful as it's possible to break the code while keeping the tests green by introducing a new migration that never gets run in the tests, and it's also difficult to reason about what the tests are doing. A new test helper library is introduced with a TestMigrationLoader that can both run Flyway migrations, or load specific migrations in the cases a specific set of migrations need to be loaded. Existing tests are migrated to use the new code.	2024-01-12 15:55:50 +01:00
Viktor Lofgren	0caef1b307	(warc) Toggle for saving WARC data Add a toggle for saving the WARC data generated by the search engine's crawler. Normally this is discarded, but for debugging or archival purposes, retaining it may be of interest. The warc files are concatenated into larger archives, up to about 1 GB each. An index is also created containing filenames, domain names, offsets and sizes to help navigate these larger archives. The warc data is saved in a directory warc/ under the crawl data storage.	2024-01-12 13:45:14 +01:00
Viktor Lofgren	264e2db539	(control) UX-improvements for control service This commit overhauls a lot of the UX for the control service, adding a new actions menu to the nodes views. It has many small tweaks to make the work flow better. It also adds a new /uploads directory in each index node, from which sideloaded data can be selected. This is a bit of a breaking change, as this directory needs to exist in each index node.	2024-01-12 12:33:05 +01:00
Viktor Lofgren	734996002c	(*) install script for deploying Marginalia outside the codebase The changeset also makes the control service responsible for flyway migrations. This helps reduce the number of places the database configuration needs to be spread out. These automatic migrations can be disabled with -DdisableFlyway=true. The commit also adds curl to the docker container, to enable docker health checks and interdependencies.	2024-01-11 12:40:03 +01:00
Viktor Lofgren	a0f28a7f9b	(*) Add a barebones configuration This adds a docker-compose file 'docker-compose-barebones.yml' which will only start the minimal number of services needed to run a whitelabel Marginalia Search-style search engine, with none of the surrounding frills. The change also adds a minimal search GUI to the query service, which is also available with JSON results if the appropriate Accept header is provided.	2024-01-10 20:23:51 +01:00
Viktor Lofgren	14b7680328	(loader) Update the size of the keyword files created by the loader Previously these ended up being about 200 Mb each, which is wastefully small. Increasing the size of these files makes the index construction faster.	2024-01-10 17:09:19 +01:00
Viktor Lofgren	f44222ce53	(control) Add a 'cancel' button to the process list This is a very nice QoL improvement, since it means you don't have to dig in the Actors view to terminate processes.	2024-01-10 15:02:42 +01:00
Viktor Lofgren	f310ad8d98	(control) Actor terminations work better Improves jank in the abort actor action, which would sometimes cause actors to hang or restart.	2024-01-10 14:18:49 +01:00
Viktor Lofgren	d56b394bcc	(control) GUI for loading external WARC files	2024-01-10 12:13:30 +01:00
Viktor Lofgren	55c9501e57	(search) Serve proper content type for static resources	2024-01-10 10:46:51 +01:00
Viktor	fad9575154	Merge pull request #69 from MarginaliaSearch/converter-optimizations Refactor the DomainProcessor to take advantage of the new crawl data format	2024-01-10 09:46:54 +01:00
Viktor Lofgren	97e11e1ac9	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e6a1e164b2	(search) Swap swipe direction for more consistent experience	2024-01-10 09:37:40 +01:00
Viktor Lofgren	e4f8f81e89	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	176b3bb526	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-10 09:37:39 +01:00
Viktor Lofgren	b07752fa9b	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	68fd0efbde	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-10 09:37:39 +01:00
Viktor Lofgren	c80d3eb812	(search) Remove dead code	2024-01-10 09:37:35 +01:00
Viktor Lofgren	f9320995d6	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-10 09:37:13 +01:00
Viktor Lofgren	f592c9f04d	(search) Fix acknowledgement page for domain complaints rendering as plain text This was caused by incorrect usage of the renderInto() function, which was always buggy and should never be used. This method is removed with this change.	2024-01-10 09:26:34 +01:00
Viktor Lofgren	bd7970fb1f	(search) Swap swipe direction for more consistent experience	2024-01-09 13:38:40 +01:00
Viktor Lofgren	c47730f2cc	(search) Mobile UX improvements. Swipe right to show filter menu. Fix CSS bug that caused parts of the menu to not have a background.	2024-01-09 13:30:30 +01:00
Viktor Lofgren	41cccfd2aa	(search) Toggle for showing recent results Actually persist the value of the toggle between searches too...	2024-01-09 11:36:49 +01:00
Viktor Lofgren	aff690f7d6	(search) Toggle for showing recent results Will by default show results from the last 2 years. May need to tune this later.	2024-01-09 11:28:36 +01:00
Viktor Lofgren	d4b0539d39	(search) Clean up search results template Rendering is very slow. Let's see if this has a measurable effect on latency.	2024-01-08 20:57:40 +01:00
Viktor Lofgren	cb55273769	(search) When clicking asn-links, show results from the unfiltered view...	2024-01-08 20:02:19 +01:00
Viktor Lofgren	fbad625126	(linkdb) Add delegating implementation of DomainLinkDb This facilitates switching between SQL and File-backed implementations on the fly while migrating from one to the other.	2024-01-08 19:56:33 +01:00
Viktor Lofgren	e49ba887e9	(crawl data) Add compatibility layer for old crawl data format The new converter logic assumes that the crawl data is ordered where the domain record comes first, and then a sequence of document records. This is true for the new parquet format, but not for the old zstd/gson format. To make the new converter compatible with the old format, a specialized reader is introduced that scans for the domain record before running through the sequence of document records, presenting them in the new order. This is slower than just reading the file beginning to end, so in order to retain performance when this ordering isn't necessary, a CompatibilityLevel flag is added to CrawledDomainReader, permitting the caller to decide how compatible the data needs to be. Down the line when all the old data is purged, this should be removed, as it amounts to technical debt.	2024-01-08 19:16:49 +01:00
Viktor Lofgren	edc1acbb7e	(*) Replace EC_DOMAIN_LINK table with files and in-memory caching The EC_DOMAIN_LINK MariaDB table stores links between domains. This is problematic, as both updating and querying this table is very slow in relation to how small the data is (~10 GB). This slowness is largely caused by the database enforcing ACID guarantees we don't particularly need. This changeset replaces the EC_DOMAIN_LINK table with a file in each index node containing 32 bit integer pairs corresponding to links between two domains. This file is loaded in memory in each node, and can be queried via the Query Service. A migration step is needed before this file is created in each node. Until that happens, the actual data is loaded from the EC_DOMAIN_LINK table, but accessed as though it was a file. The changeset also migrates/renames the links.db file to documents.db to avoid naming confusion between the two.	2024-01-08 15:53:13 +01:00
Viktor Lofgren	ef02b712ad	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	aca217cf9a	(qs) Better metrics for QS	2024-01-05 13:22:13 +01:00
Viktor Lofgren	9e3386dbbb	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-05 13:22:13 +01:00
Viktor Lofgren	fdec565b34	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-05 13:22:13 +01:00
Viktor Lofgren	33c2188c87	(feature) More trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	b3c8fa74cc	(feature) Add another doubleclick variant to the adtech trackers	2024-01-05 13:22:13 +01:00
Viktor Lofgren	e53bb70bef	(converter) Penalize chatgpt content farm spam	2024-01-05 13:22:13 +01:00
Viktor Lofgren	109bec372c	(index) Adjust BM25 parameters	2024-01-05 13:21:52 +01:00
Viktor Lofgren	5c2561d05d	(search) Add query strategy requiring link	2024-01-05 13:21:52 +01:00
Viktor Lofgren	0e970b8037	(valuation) Tweaking penalties a bit	2024-01-05 13:21:52 +01:00
Viktor Lofgren	1694b4d6ef	(valuation) Increase the penalty for adtech a bit	2024-01-05 13:21:34 +01:00
Viktor Lofgren	396299c1db	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-05 13:21:33 +01:00
Viktor Lofgren	71d789aab0	(index) Tweak result valuation renormalization	2024-01-05 13:21:33 +01:00
Viktor Lofgren	6d2e14a656	(build) Remove false dependency between icp and index-service This dependency causes the executor service docker image to change when the index service docker image changes.	2024-01-05 13:17:29 +01:00
Viktor Lofgren	4078708aea	(qs) Better metrics for QS	2024-01-04 13:27:14 +01:00
Viktor Lofgren	343ea9c6d8	(search) Fetch fewer results per page This is a test to evaluate how this impacts load times.	2024-01-04 13:18:07 +01:00
Viktor Lofgren	60361f88ed	(converter) Add upper 128KB limit to how much HTML we'll parse	2024-01-03 23:14:03 +01:00
Viktor Lofgren	f7560cb1d8	(feature) More trackers	2024-01-03 17:31:02 +01:00
Viktor Lofgren	1f66568d59	(feature) More trackers	2024-01-03 17:27:25 +01:00
Viktor Lofgren	7af07cef95	(feature) Add another doubleclick variant to the adtech trackers	2024-01-03 17:21:12 +01:00
Viktor Lofgren	41a540a629	(converter) Penalize chatgpt content farm spam	2024-01-03 17:04:38 +01:00
Viktor Lofgren	f599944942	(converter) Penalize chatgpt content farm spam	2024-01-03 16:51:26 +01:00
Viktor Lofgren	1e06aee6a2	(index) Adjust BM25 parameters	2024-01-03 16:30:46 +01:00
Viktor Lofgren	7bbaedef97	(search) Add query strategy requiring link	2024-01-03 16:23:00 +01:00
Viktor Lofgren	87048511fe	(valuation) Tweaking penalties a bit	2024-01-03 16:02:25 +01:00
Viktor Lofgren	c770f0b68b	(valuation) Tweaking penalties a bit	2024-01-03 15:59:21 +01:00
Viktor Lofgren	78c00ad512	(valuation) Tweaking penalties a bit	2024-01-03 15:52:57 +01:00
Viktor Lofgren	a19879d494	(valuation) Tweaking penalties a bit	2024-01-03 15:32:33 +01:00
Viktor Lofgren	ac1aca36b0	(valuation) Increase the penalty for adtech a bit	2024-01-03 15:20:38 +01:00
Viktor Lofgren	1f3b89cf28	(index) Reduce the value of site and site-adjacent in BM25P calculations	2024-01-03 15:20:18 +01:00
Viktor Lofgren	f732f6ae6f	(index) Tweak result valuation renormalization	2024-01-03 14:53:53 +01:00
Viktor Lofgren	0b9f3d1751	(*) Remove accidental commit of debug logging	2024-01-03 14:32:00 +01:00
Viktor Lofgren	0806aa6dfe	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	32436d099c	(language-processing) Add maximum length limit for text input in SentenceExtractor Added a new constant, MAX_TEXT_LENGTH, to the SentenceExtractor class. If the length of the text input exceeds this limit, the text is truncated to fit within the limit. This modification is designed to prevent excessive resource usage for unusually long text inputs.	2024-01-03 14:27:47 +01:00
Viktor Lofgren	4ce692ccaf	(converter) Use SimpleBlockingThreadPool in ProcessingIterator	2024-01-03 14:27:47 +01:00
Viktor Lofgren	3caa4eed75	Merge branch 'master' into converter-optimizations	2024-01-02 17:13:25 +01:00
Viktor Lofgren	c70f508ae8	(prometheus) Saner histogram buckets	2024-01-02 17:13:14 +01:00
Viktor Lofgren	9e64d7aaf9	Merge branch 'master' into converter-optimizations	2024-01-02 15:46:24 +01:00
Viktor Lofgren	72b773f06d	(search) fix search metrics labeling	2024-01-02 15:46:14 +01:00
Viktor Lofgren	5f978b865b	Merge branch 'master' into converter-optimizations	2024-01-02 15:41:48 +01:00
Viktor Lofgren	57a4f92722	(api) fix missing metrics label in api service	2024-01-02 15:41:38 +01:00
Viktor Lofgren	87351e89ca	Merge branch 'master' into converter-optimizations	2024-01-02 15:17:02 +01:00
Viktor Lofgren	192e356169	(prometheus) Add instrumentation to the api service	2024-01-02 15:12:44 +01:00
Viktor Lofgren	31232e49fb	(prometheus) Add instrumentation to the search, qs and index services.	2024-01-02 15:02:29 +01:00
Viktor Lofgren	9d93a31755	Merge branch 'master' into converter-optimizations	2024-01-02 12:36:16 +01:00
Viktor Lofgren	9f7df59945	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:35:59 +01:00
Viktor Lofgren	d2418521a7	(index) Further ranking adjustments	2024-01-02 12:35:59 +01:00
Viktor Lofgren	9330b5b1d9	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	faa50bf578	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-02 12:35:44 +01:00
Viktor Lofgren	f0d9618dfc	(sideload) Reduce quality assessment. This will make these sideloaded results rank much better as there is a pretty harsh penalty for large low-q websites.	2024-01-02 12:34:58 +01:00
Viktor Lofgren	310a880fa8	(index) Further ranking adjustments	2024-01-02 12:24:52 +01:00
Viktor Lofgren	fc6e3b6da0	(index) Further ranking adjustments	2024-01-01 18:51:03 +01:00
Viktor Lofgren	50771045d0	(index) Further ranking adjustments	2024-01-01 18:43:17 +01:00
Viktor Lofgren	8f522470ed	(index) Adjust rank weightings to fix bad wikipedia results There was a bug where if the input of ResultValuator.normalize() was negative, it was truncated to zero. This meant that "bad" results always rank the same. The penalty factor "overallPart" was moved outside of the function and was re-weighted to accomplish a better normalization. Some of the weights were also re-adjusted based on what appears to produce better results. Needs evaluation.	2024-01-01 17:16:29 +01:00
Viktor Lofgren	dc90c9ac65	(sideload) Just index based on first paragraph This seems like it would make the wikipedia search result worse, but it drastically improves the result quality! This is because wikipedia has a lot of articles that each talk about a lot of irrelevant concepts, and indexing the entire document means tangentially relevant results tend to displace the most relevant results.	2024-01-01 16:19:38 +01:00
Viktor Lofgren	e46e174b59	(keyword-extractor) Add another test for Name-extractor	2024-01-01 15:21:51 +01:00
Viktor Lofgren	7f3f3f577c	(backup) Add task heartbeats to the backup service	2024-01-01 15:20:57 +01:00
Viktor Lofgren	75d87c73d1	(crawler) Disable Java's infinite DNS caching	2023-12-31 16:59:08 +01:00
Viktor Lofgren	0fe44c9bf2	(crawler) Fix broken test A necessary step was accidentally deleted when cleaning up these tests previously.	2023-12-30 13:56:44 +01:00
Viktor Lofgren	7a1d20ed0a	(converter) Better use of ProcessingIterator Modify processingiterator to be constructed via a factory, to enable re-use of its backing executor service. This reduces thread churn in the converter sideloader style processing of regular crawl data.	2023-12-30 13:53:55 +01:00
Viktor Lofgren	70c83b60a1	(converter) Clean up fullProcessing() This function made some very flimsy-looking assumptions about the order of an iterable. These are still made, but more explicitly so.	2023-12-30 13:36:18 +01:00
Viktor Lofgren	7ba296ccdf	(converter) Route sizeHint to SideloadProcessing Route the sizeHint from the input parquet file to SideloadProcessing, so that it can set sideloadSizeAdvice appropriately, instead of using a fixed "large" number. This is necessary to populate the KNOWN_URL column in the domain data table, which is important as it is used in e.g. calculating how far to re-crawl the site in the future.	2023-12-30 13:05:10 +01:00
Viktor Lofgren	0b112cb4d4	(warc) Update URL encoding in WarcProtocolReconstructor The URI query string is now URL encoded in the WarcProtocolReconstructor. This change ensures proper encoding of special characters as per the standard URL encoding rules and improves URL validity during the crawling process.	2023-12-29 19:41:37 +01:00
Viktor Lofgren	68ac8d3e09	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:27 +01:00
Viktor Lofgren	f6fa8bd722	(search) Fetch fewer linking and similar domains. Showing a total of 200 connected domains is not very informative.	2023-12-29 16:37:00 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	401568033c	Merge branch 'master' into converter-optimizations	2023-12-29 15:55:57 +01:00
Viktor Lofgren	ea73be6831	(search) Remove the ugly placeholder screenshots from the site info view.	2023-12-29 15:55:46 +01:00
Viktor Lofgren	ba8a75c84b	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 15:10:32 +01:00
Viktor Lofgren	a1f3ccdd6d	Fix bug in ProcessingIterator where it would run the tasks in only one single thread instead of using the pool	2023-12-29 14:59:39 +01:00
Viktor Lofgren	647d38007f	Reduce queue polling time in ProcessingIterator Updated ProcessingIterator's queue polling from one second to 50 milliseconds for improved performance. This facilitates faster document processing across more cores, reducing bottlenecks and slow single-threaded processing.	2023-12-29 14:27:58 +01:00
Viktor Lofgren	e7dd28b926	(converter) Optimize sideload-loading Use ProcessingIterator to fan out processing of documents across more cores, instead of doing all of it in the writer thread blocking everything else with slow single-threaded processing.	2023-12-29 14:25:48 +01:00
Viktor Lofgren	b5fc9673d9	Merge branch 'master' into converter-optimizations	2023-12-29 14:04:43 +01:00
Viktor Lofgren	a065040323	(search) Don't inject arbitrary HTML into the site info view xD	2023-12-29 14:04:26 +01:00
Viktor Lofgren	dec3b1092d	(converter) Fix bugs in conversion This commit adds a safety check that the URL of the document is from the correct domain. It also adds a sizeHint() method to SerializableCrawlDataStream which may provide an indication if the stream is very large and benefits from sideload-style processing (which is slow). It furthermore addresses a bug where the ProcessedDomain.write() invoked the wrong method on ConverterBatchWriter and only wrote the domain metadata, not the rest...	2023-12-29 13:58:08 +01:00
Viktor Lofgren	407915a86e	(converter) Fix NPEs in converter due to the new data format	2023-12-28 22:54:53 +01:00
Viktor Lofgren	c488599879	(converter) Fix NPE in converter	2023-12-28 19:52:26 +01:00
Viktor Lofgren	bcecc93e39	(converter) Swallow errors in parquet data stream	2023-12-28 19:45:35 +01:00
Viktor Lofgren	ff7d1a250e	Merge branch 'master' into converter-optimizations	2023-12-28 19:35:00 +01:00
Viktor Lofgren	70f338c3de	(search) Fix NPE in layout selection	2023-12-28 19:34:46 +01:00
Viktor Lofgren	c847d83011	(converter) Add size hint to converter sideload processing	2023-12-28 19:14:16 +01:00
Viktor Lofgren	5ce46a61d4	Merge branch 'master' into converter-optimizations	2023-12-28 13:26:19 +01:00
Viktor	775974d5ec	Merge pull request #67 from MarginaliaSearch/rss-feeds-in-site-info Add RSS Feeds to site info (WIP)	2023-12-28 13:25:38 +01:00
Viktor Lofgren	c7af40c368	(search) Change layout balance when feeds/samples are present	2023-12-28 13:16:10 +01:00
Viktor Lofgren	00a974a721	(crawler) Fix bug in resynchronizer where it would fail to capture expected exceptions This commit also improves the test coverage for this part of the code.	2023-12-27 20:02:17 +01:00
Viktor Lofgren	7428ba2dd7	(converter) Basic test coverage for sideloading-style processing	2023-12-27 19:29:26 +01:00
Viktor Lofgren	b37223c053	(converter) Basic test coverage for sideloading-style processing	2023-12-27 18:33:16 +01:00
Viktor Lofgren	24051fec03	(converter) WIP Run sideload-style processing for large domains The processor normally retains the domain data in memory after processing to be able to do additional site-wide analysis. This works well, except there are a number of outlier websites that have an absurd number of documents that can rapidly fill up the heap of the process. These websites now receive a simplified treatment. This is executed in the converter batch writer thread. This is slower, but the documents will not be persisted in memory.	2023-12-27 18:20:03 +01:00
Viktor Lofgren	f811a29f87	(crawler) Fix resource leak in crawler A 10 MB thread local buffer wasn't static. Oops.	2023-12-27 16:32:17 +01:00
Viktor Lofgren	acf7bcc7a6	(converter) Refactor the DomainProcessor for new format of crawl data With the new crawler modifications, the crawl data comes in a slightly different order, and a result of this is that we can optimize the converter. This is a breaking change that will be incompatible with the old style of crawl data, hence it will linger as a branch for a while. The first step is to move stuff out of the domain processor into the document processor.	2023-12-27 13:57:59 +01:00
Viktor Lofgren	9707366348	(test) Fix a few slow tests that broke due to domainCount	2023-12-27 13:29:59 +01:00
Viktor Lofgren	9e5fe71f5b	(crawler) Switch hash function in crawler Guava's hashers are a bit allocation hungry, and a big driver of GC churn in the crawler. This switches to the modified Murmur hash function used throughout Marginalia.	2023-12-27 13:29:00 +01:00
Viktor Lofgren	5d1b7da728	Updated site info feed and search service Modified site info feed template to secure the description field against injected code. Also adjusted search service by extracting samples within the correct scope and including them in the returned site info. This improves the quality and security of the displayed information.	2023-12-26 22:06:01 +01:00
Viktor Lofgren	3ea1ddae22	(crawler) Roll back switch to virtual thread pool in crawler This seems to cause a resource leak, it seems the http library uses thread locals?	2023-12-26 19:37:34 +01:00
Viktor Lofgren	1694e9c78c	(search) Add RSS Feeds to site info This change integrates the Feedlot RSS Bot with Marginalia's site info view to offer a preview of the latest updates. The change introduces a new tiny feature that is a feedlot-client based on Java's HttpClient.	2023-12-26 16:21:40 +01:00
Viktor Lofgren	4763077b76	(search/index) Add a new keyword "count" This is for filtering results on how many times the term appears on the domain. The intent is to be beneficial in creating e.g. a domain search feature. It's also very helpful when tracking down spammy domains.	2023-12-25 20:38:29 +01:00
Viktor Lofgren	c0eaca220c	(search) Add convenient link for AS search to the search view	2023-12-25 15:07:58 +01:00
Viktor Lofgren	25d086c4e1	(crawler) Clean up stale warc files We should probably have an option to keep them, but not by default!	2023-12-25 15:07:36 +01:00
Viktor Lofgren	88551043cd	(crawler) Even more lenient resyncing	2023-12-25 01:48:11 +01:00
Viktor Lofgren	f779f760c4	(crawler) Even more lenient resyncing	2023-12-25 01:44:18 +01:00
Viktor Lofgren	f18f82e229	(crawler) Write etags and last-modified on reference copy This commit also fixes a test that broke with a previous change.	2023-12-25 01:40:13 +01:00
Viktor Lofgren	67ef2b45fa	(crawler) Reduce logging	2023-12-25 01:10:03 +01:00
Viktor Lofgren	d72e871265	(warc) Fix resync	2023-12-25 01:03:03 +01:00
Viktor Lofgren	4c9bc13309	(warc) Reduce log spam	2023-12-25 00:58:31 +01:00
Viktor Lofgren	84563b0d46	(crawler) Be a bit more conservative about pulling etags and so on if the previous fetch wasn't OK	2023-12-25 00:55:05 +01:00
Viktor Lofgren	c5aab7e8db	(warc) Fix NPE in WarcRecorder	2023-12-25 00:54:38 +01:00
Viktor Lofgren	1755b646b8	(warc) Fix NPE in WarcRecorder	2023-12-25 00:48:42 +01:00
Viktor Lofgren	85f906ea53	(executor) Fix removal of stale process heartbeats	2023-12-23 13:49:24 +01:00
Viktor Lofgren	e1a155a9c8	(crawler) Increase growth of crawl jobs A number of crawl jobs get stuck at about 300 documents, or just under. This seems to be because we fail to increase the crawl limit, which is based on MAX(200, 1.25 x GOOD_URLS) with a 1.5x modifier applied upon a recrawl. GOOD_URLS is based on how many documents successfully process, which is typically fairly small. Switching to KNOWN_URLS should let this grow faster. The SQL query in the DbCrawlSpecProvider class has been updated; 'GOOD_URLS' has been replaced with 'KNOWN_URLS'. This update ensures the correct data is selected from the DOMAIN_METADATA table. The floor is also increased to 250 from 200.	2023-12-23 13:22:10 +01:00
Viktor Lofgren	0454447e41	(executor) Implement process removal for long-absent heartbeats Added functionality to remove processes from listing that have not checked in for over a day. A 'removeProcessHeartbeat' function was created to delete the respective entry from the PROCESS_HEARTBEAT table in case heartbeats are absent for more than one day.	2023-12-23 13:18:21 +01:00
Viktor Lofgren	7b40c0bbee	(assistant) Clean up similar websites' results	2023-12-22 14:07:01 +01:00
Viktor Lofgren	dc773c5c20	(adjacencies) Clean up AdjacenciesLoader Make JDBC batching more consistent, also adds a test case for the loader.	2023-12-21 14:14:22 +01:00
Viktor Lofgren	b6253b03c2	(adjacencies) Fix bug in AdjacenciesLoader This fixes a bug where a prepared statement was created before the table it was supposed to insert into was created. This fails and does nothing. Furthermore, added the logging that would have warned about this failure, had it been in place.	2023-12-21 13:12:31 +01:00
Viktor Lofgren	a5bc29245b	(cleanup) Remove vestigial support for WARC crawl data streams	2023-12-20 15:46:21 +01:00
Viktor Lofgren	bfae478251	Refactor CrawlerRevisitor for better consistency	2023-12-20 15:21:49 +01:00
Viktor Lofgren	a7cd490593	(minor) Remove dead code.	2023-12-19 18:58:33 +01:00
Viktor Lofgren	dd8fb04886	(converter) Add sideloadSizeAdvice field to several ProcessedDomain Since the sideloaders don't populate the documents list in ProcessedDomain to keep the memory footprint manageable, the code that estimates knownUrls etc. will set them to zero, which has negative effects on their ranking. This change will populate them with a bullshit value within a sane ballpark, ensuring that these domains show up in the rankings.	2023-12-19 18:37:51 +01:00
Viktor	5bd3934d22	Merge pull request #64 from dreimolo/macos_AS_fix Macos apple silicon fix, and slight improvements to sample downloader	2023-12-18 18:29:14 +01:00
Viktor Lofgren	3a56a06c4f	(warc) Add fields for etags and last-modified headers to the new crawl data formats Make some temporary modifications to the CrawledDocument model to support both a "big string" style headers field like in the old formats, and explicit fields as in the new formats. This is a bit awkward to deal with, but it's a necessity until we migrate off the old formats entirely. The commit also adds a few tests to this logic.	2023-12-18 17:45:54 +01:00
Viktor Lofgren	126ac3816f	(converter) Reduce queue size in ConverterWriter The size of the ArrayBlockingQueue in ConverterWriter.java has been reduced from 4 to 1. This change aims to reduce the memory utilization by not having fully processed domains piling up in RAM. This may cause the writer to go idle in waiting for new data, but that may be preferable to an OOM.	2023-12-18 13:42:40 +01:00
Viktor Lofgren	d02bed1a55	(loader) Optimize DomainLoaderService for faster startups Initialization parameters in DomainLoaderService and DomainIdRegistry have been updated to improve performance. This is done by adding sane default sizes to the hash tables involved, reducing GC churn, but also by setting a sensible fetch size to the queries used, and not fetching irrelevant information such as the domain name.	2023-12-18 13:15:10 +01:00
Viktor Lofgren	b7ed0ce537	(loader) Reset count after executing batch in DomainLoaderService This should greatly speed up starting the loader process.	2023-12-18 12:43:53 +01:00
Viktor Lofgren	a742503508	(search) Add view for showing mutual links between two websites	2023-12-17 17:50:44 +01:00
Viktor Lofgren	33312ab09e	(geo-ip) Update readme	2023-12-17 16:08:33 +01:00
Viktor Lofgren	c422f0b9fb	(geo-ip) Tidy up error handling	2023-12-17 16:06:51 +01:00
Viktor Lofgren	c92f1b8df8	(geo-ip) Revert removal of ip2location logic We do both ip2location and ASN data. The change also adds some keywords based on autonomous system information, on a somewhat experimental basis. It would be neat to be able to e.g. exclude cloud services or just e.g. cloudflare from the search results.	2023-12-17 15:03:00 +01:00
Viktor Lofgren	bde68ba48b	Merge branch 'master' into asn-info	2023-12-17 14:00:23 +01:00
Viktor Lofgren	bf44805e69	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 14:00:07 +01:00
Viktor Lofgren	edf9aa2c23	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 13:59:54 +01:00
Viktor Lofgren	4801c47273	(crawling-model) Fix bug where CrawledDocument.getDomain() trimmed www-prefixes This had the knock-on effect of breaking the anchor tag loading in the processor for a lot of domains, since they'd grab domains for the wrong domain name.	2023-12-17 13:53:31 +01:00
Viktor Lofgren	bcad6492d6	(sideloader) Fix integration problems with sideloaders In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment. Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters. Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.	2023-12-17 13:28:17 +01:00
Viktor Lofgren	5ab2a22e88	(search) Fix result count back down to 1 per domain	2023-12-17 13:14:23 +01:00
Viktor Lofgren	d7bd540683	(*) Replace the ip2location IP geolocation data with ASN information from apnic.net. Doesn't really make sense to use ip2location as a middle man for information that is already freely available...	2023-12-16 21:55:04 +01:00
Viktor Lofgren	722b56c8ca	(index) Fix rare bug in the index-switching logic This is caused by a resource contention with the query code. The proper way to fix this is to use some form of synchronization, but that will slow the code down. So we just hammer it a few times and let the GC deal with the problem if it fails. Not optimal, but fast.	2023-12-16 18:57:35 +01:00
Viktor Lofgren	f3f12058dc	(assistant) Fix logic error in filtering related domains	2023-12-16 18:45:53 +01:00
Viktor Lofgren	3da38d0483	(assistant) Fix logic error in filtering related domains	2023-12-16 18:44:25 +01:00
Viktor Lofgren	d715b1f9ca	(search) Improve error handling in search parameters parsing The code now intercepts and deals with potential exceptions during the parsing of search parameters. This is in response to constant bad requests from bots which were cluttering the logs. A catch clause is added that suppresses these errors and redirects to the base URL.	2023-12-16 18:42:13 +01:00
Viktor Lofgren	e13fa25e11	(assistant) Clean up the site info related domains view by filtering viable domains	2023-12-16 18:37:09 +01:00
Viktor Lofgren	34d4834ff6	(assistant) Clean up the site info related domains view by filtering viable domains	2023-12-16 18:27:24 +01:00
Viktor Lofgren	117ddd17d7	(assistant) Fix bugs in IP flag emoji generation	2023-12-16 17:07:17 +01:00
Viktor Lofgren	6f2bf38f0e	(index) Fix off-by-1 error in the domain count limiter	2023-12-16 16:57:05 +01:00
Viktor Lofgren	320882c34a	(site-info) Try to discover the schema of the website with a site:-query The site info view can't blindly assume that every website supports https. To figure out which schema to use when linking to a site, execute a single-result search for site:domain.name and then grab the schema off the result. To allow this, a count parameter is introduced to doSiteSearch() in SearchOperator.	2023-12-16 16:34:53 +01:00
Viktor Lofgren	3113b5a551	(warc) Filter WarcResponses based on X-Robots-Tag There really is no fantastic place to put this logic, but we need to remove entries with an X-Robots-Tag header where that header indicates it doesn't want to be crawled by Marginalia.	2023-12-16 15:58:27 +01:00
dreimolo	c0cc05177f	corrects protobuf.plugins.grpc	2023-12-16 14:24:41 +01:00
dreimolo	0b34d43804	workaround for failing mac on apple silicon deps	2023-12-16 14:22:11 +01:00
Viktor Lofgren	54ed3b86ba	(minor) Remove dead code.	2023-12-15 21:49:35 +01:00
Viktor Lofgren	2001d0f707	(converter) Add @Deprecated annotation to a few fields that should no longer be used.	2023-12-15 21:42:00 +01:00
Viktor Lofgren	0f9cd9c87d	(warc) More accurate filtering of advisory records Further create records for resources that were blocked due to robots.txt; as well as tests to verify this happens.	2023-12-15 21:37:02 +01:00
Viktor Lofgren	2e7db61808	(warc) More accurate filtering of advisory records We want to mute some of these records so that they don't produce documents, but in some cases we want a document to be produced for accounting purposes. Added improved tests that reach for known resources on www.marginalia.nu to test the behavior when encountering bad content type and 404s. The commit also adds some safety try-catch:es around the charset handling, as it may sometimes explode when fed incorrect data, and we do be guessing...	2023-12-15 21:31:16 +01:00
Viktor Lofgren	5329968155	(crawler) Update CrawlingThenConvertingIntegrationTest This commit updates CrawlingThenConvertingIntegrationTest with additional tests for invalid, redirecting, and blocked domains. Improvements have also been made to filter out irrelevant entries in ParquetSerializableCrawlDataStream.	2023-12-15 21:04:06 +01:00
Viktor Lofgren	2e536e3141	(crawler) Add timestamp to CrawledDocument records This update includes the addition of timestamps to the parquet format for crawl data, as extracted from the Warc stream. The parquet format stores the timestamp as a 64 bit long, seconds since unix epoch, without a logical type. This is to avoid having to do format conversions when writing and reading the data. This parquet field populates the timestamp field in CrawledDocument.	2023-12-15 20:23:27 +01:00
Viktor Lofgren	cf935a5331	(converter) Read cookie information Add an optional new field to CrawledDocument containing information about whether the domain has cookies. This was previously on the CrawledDomain object, but since the WarcFormat requires us to write a WarcInfo object at the start of a crawl rather than at the end, this information is unobtainable when creating the CrawledDomain object. Also fix a bug in the deduplication logic in the DomainProcessor class that caused a test to break.	2023-12-15 18:09:53 +01:00
Viktor Lofgren	fa81e5b8ee	(warc) Use a non-standard WARC header to convey information about whether a website uses cookies This information is then propagated to the parquet file as a boolean. For documents that are copied from the reference, use whatever value we last saw. This isn't 100% deterministic and may result in false negatives, but permits websites that used cookies but have stopped to repent and have the change reflect in the search engine more quickly.	2023-12-15 16:37:53 +01:00
Viktor Lofgren	9fea22b90d	(warc) Further tidying This commit includes mostly exception handling, error propagation, a few bug fixes and minor changes to log formatting. The CrawlDelayTimer, HTTP 429 responses and IOException responses are now more accurately handled. A non-standard WarcXEntityRefused WARC record has also been introduced, essentially acting as a rejected 'response' with different semantics. Besides these, several existing features have been refined, such as URL encoding, crawl depth incrementing and usage of Content-Length headers.	2023-12-15 15:38:23 +01:00
Viktor Lofgren	0889b6d247	(warc) Clean up parquet conversion This commit further cleans up the warc->parquet conversion. It fixes issues with redirect handling in WarcRecorder, adds support for information about redirects and errors due to probe failure. It also refactors the fetch result, body extraction and content type abstractions.	2023-12-14 20:39:40 +01:00
Viktor Lofgren	1328bc4938	(warc) Clean up parquet conversion This commit cleans up the warc->parquet conversion. Records with a http status other than 200 are now included. The commit also fixes a bug where the robots.txt parser would be fed the full HTTP response (and choke), instead of the body. The DocumentBodyExtractor code has also been cleaned up, and now offers a way of just getting the byte[] representation for later processing, as conversion to and from strings is a bit wasteful.	2023-12-14 16:05:48 +01:00
Viktor Lofgren	787a20cbaa	(crawling-model) Implement a parquet format for crawl data This is not hooked into anything yet. The change also makes modifications to the parquet-floor library to support reading and writing of byte[] arrays. This is desirable since we may in the future want to support inputs that are not text-based, and codifying the assumption that each document is a string will definitely cause us grief down the line.	2023-12-13 16:22:19 +01:00
Viktor Lofgren	440e097d78	(crawler) WIP integration of WARC files into the crawler and converter process. This commit is in a pretty rough state. It refactors the crawler fairly significantly to offer better separation of concerns. It replaces the zstd compressed json files used to store crawl data with WARC files entirely, and the converter is modified to be able to consume this data. This works, -ish. There appears to be some bug relating to reading robots.txt, and the X-Robots-Tag header is no longer processed either. A problem is that the WARC files are a bit too large. It will probably be necessary to introduce a new format to store the crawl data long term, something like parquet; and use WARCs for intermediate storage to enable the crawler to be restarted without needing a recrawl.	2023-12-13 15:33:42 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	8f0950fc44	(geoip) Fix incorrect synchronization.	2023-12-11 14:01:39 +01:00
Viktor Lofgren	30bc3f9281	(converter) Use the prefix ip: instead of geoip: for country codes This is the same as the prefix for the IP address, but I don't think that substantially matters, as the two have such different namespaces there can be no confusion.	2023-12-11 13:59:23 +01:00
Viktor Lofgren	f655ec5a5c	(*) Refactor GeoIP-related code In this commit, GeoIP-related classes are refactored and relocated to a common library as they are shared across multiple services. The crawler is refactored to enable the GeoIpBlocklist to use the new GeoIpDictionary as the base of its decisions. The converter is modified to query this data to add a geoip:-keyword to documents to permit limiting a search to the country of the hosting server. The commit also adds due BY-SA attribution in the search engine footer for the source of the IP geolocation data.	2023-12-10 17:30:43 +01:00
Viktor Lofgren	84b4158555	(minor) Fix broken test	2023-12-10 14:39:20 +01:00
Viktor Lofgren	91dd45cf64	(search) IP and IP geolocation in site info view This commit also fixes a bug in the loader where the IP field wouldn't always populate as intended, and refactors the DomainInformationService to use significantly fewer SQL queries.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	37af60254f	(search) Better recipe filter Tune the recipe filter to give better results, by using the 'popular' domains set along with excluding results with heavy tracking.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	f0e736d4ea	(search) Update the search profile 'Academia' to strictly filter on academic tlds The previous version used a personalized pagerank centering on a few academic domains, but this didn't work very well and most results were not very academia-centric.	2023-12-09 20:06:55 +01:00
Viktor Lofgren	e3ebb0c5bb	(*) Rename the search filter 'RETRO' into 'POPULAR' This will make the terminology more consistent between the GUI and the code. The rankings yaml still uses 'retro' though, to retain compatibility.	2023-12-09 20:06:54 +01:00
Viktor Lofgren	6382f779c3	(search) Revert back to using 'Popular' as the default search filter Unfiltered is a bit too ... unfiltered, and gives a bad first impression for many queries.	2023-12-09 16:34:12 +01:00
Viktor Lofgren	8ef34883a8	(search) Move site information out of the search service and into assistant. This reduces the impact of restarting the search service, as the site information takes a few minutes to load during which it's not available. It also permits exposing this information via API in the future if there is interest in this. The assistant service was also modified to do a late load of the suggestions trie, as this is a major contributor to its start-up time. Finally, some changes were made to the client library, a new get() method was added that takes a TypeToken to allow deserialization of generics such as List<Foo>, and the scheduler was also modified to use virtual threads.	2023-12-09 16:30:06 +01:00
Viktor Lofgren	5c46af0edb	(converter) Refactor EncyclopediaMarginaliaNuSideloader to use ProcessingIterator Refactored the getDocumentsStream method in EncyclopediaMarginaliaNuSideloader to use the newly extracted ProcessingIterator class that encapsulates processing a stream of results from e.g. a database query in parallel and returning the computed results as an iterator. The iterator was also improved on to be more reliable, previous versions of the logic would sometimes deadlock due to false positives in hasMore().	2023-12-09 15:20:53 +01:00
Viktor Lofgren	b6511fbfe2	(converter) Add AnchorTextKeywords to EncyclopediaMarginaliaNuSideloader processing The commit updates EncyclopediaMarginaliaNuSideloader to include the AnchorTextKeywords in processing documents, aiding search result relevance. It also removes old test-related functionality and a large but fairly useless test previously used to debug a specific problem, to the detriment of the overall code quality.	2023-12-09 15:20:52 +01:00
Viktor Lofgren	eccb12b366	(control) Fix spurious state detection in control-side actors A race condition was found where precession actors would sometimes skip a step, because when invoking ExecutorRemoteActor.getState(), it would get the last 'OK' actor state from a previous run of the actor! To avoid this, the trigger method was changed from returning a boolean to the message ID, negative if an error occurred, to be passed to getState to select only messages that pertain to the present or future runs.	2023-12-09 12:50:05 +01:00
Viktor Lofgren	d0982e7ba5	(converter) Add error handling and lazy load external domain links The converter was not properly initiating the external links for each domain, causing an NPE in conversion. This needs to be loaded later since we don't know the domain we're processing until we've seen it in the crawl data. Also made some refactorings to make finding converter bugs easier, and finding the related domain less awkward from the SerializableCrawlData interface.	2023-12-09 12:33:39 +01:00
Viktor Lofgren	fc30da0d48	(converter) Add academia recognition to DomainProcessor The code now includes an additional function in the DomainProcessor class that checks if a domain is associated with academia. An academic domain is identified by the ".edu" TLD, or fits a specific regex pattern matching domains like .ac.ccTld or .edu.ccTld. If these conditions are met, the search term "special:academia" is added to the domain. The existing academia search filter uses personalized pagerank to select academia-adjacent domains, but it isn't working very well. The hope is that filtering on domain names will be more effective, and that it can supplant the ranking-based approach.	2023-12-08 20:31:34 +01:00
Viktor Lofgren	e6a1052ba7	Simplify CrawlerMain, removing the CrawlerLimiter and using a global HttpFetcher with a virtual thread pool dispatcher instead of the default.	2023-12-08 20:24:01 +01:00
Viktor Lofgren	968dce50fc	(crawler) Refactored IpInterceptingNetworkInterceptor for clarity.	2023-12-08 17:45:46 +01:00
Viktor Lofgren	3bbffd3c22	(crawler) Refactor HttpFetcher to integrate WarcRecorder Partially hook in the WarcRecorder into the crawler process. So far it's not read, but should record the crawled documents. The WarcRecorder and HttpFetcher classes were also refactored and broken apart to be easier to reason about.	2023-12-08 17:12:51 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	fabffa80f0	(warc) Integrate the crawler's content type parsing and charset logic into the WarcSideloader	2023-12-07 15:26:01 +01:00
Viktor Lofgren	064265b0b9	(crawler) Move content type/charset sniffing to a separate microlibrary This functionality needs to be accessed by the WarcSideloader, which is in the converter. The resultant microlibrary is tiny, but I think in this case it's justifiable.	2023-12-07 15:16:37 +01:00
Viktor Lofgren	2d5d11645d	(warc) Refactor WarcSideloaderTest to not rely on specific test files on the computer	2023-12-06 19:00:29 +01:00
Viktor Lofgren	cc813a5624	(convert) Add basic support for Warc file sideloading This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.	2023-12-06 18:43:55 +01:00
Viktor Lofgren	156c067f79	(search) Fix mobile issues with browse feature	2023-12-05 21:28:50 +01:00
Viktor Lofgren	b33b013d41	(search) Fix broken script tag Apparently it can't be called suggestions.js...?	2023-12-05 20:29:13 +01:00
Viktor Lofgren	e74e2f705f	(search) Fix broken script tag suggestions.js became something else.	2023-12-05 20:20:07 +01:00
Viktor Lofgren	2e438847fc	(search) Optimize related domains queries In the future this logic probably needs to move into a separate service, as it's still quite slow to load. But this fixes response times and DOS potential of previous version.	2023-12-05 20:12:03 +01:00
Viktor Lofgren	9301c47d93	(search) Optimize related domains queries	2023-12-05 14:42:03 +01:00
Viktor Lofgren	20ec58b07f	(search) Remove layout-breakingly long URLs from the similar domains view. They're almost all .onion URLs anyway, not really the space we're looking to peer into.	2023-12-05 13:58:15 +01:00
Viktor Lofgren	98983c1015	(search) Hopefully fix race condition that leaves the response with no Content-type header	2023-12-05 13:52:36 +01:00
Viktor Lofgren	67195592c6	(search) Hopefully fix race condition that leaves the response with no Content-type header	2023-12-05 13:48:42 +01:00
Viktor Lofgren	d1e88df71e	(search) Cleaning up the code a bit	2023-12-05 13:26:05 +01:00
Viktor Lofgren	f36cfe34ab	(search) Hackery to get a more balanced view	2023-12-04 22:50:39 +01:00
Viktor Lofgren	8a1934008c	(search) Merge similar sites results with the info view. WIP: This commit needs to be cleaned up.	2023-12-04 22:10:24 +01:00
Viktor Lofgren	b41bb9cfcf	(search) Use a Ξ for mobile button title instead of "Filters". Makes it easier to distinguish from the search button.	2023-12-03 16:33:25 +01:00
Viktor Lofgren	d58324bbef	(search) Clean up filters menu a bit, improve accessibility.	2023-12-02 18:05:30 +01:00
Viktor Lofgren	cbbd45d3e5	(search) Clean up filters menu a bit, improve accessibility.	2023-12-02 18:01:03 +01:00
Viktor Lofgren	b89633ae4b	(search) Don't render a filter button on mobile when there are no filters to be presented.	2023-12-02 17:23:45 +01:00
Viktor Lofgren	96357e9bfd	(search) Fix typeahead suggestions, as well as improve mobile and desktop UX in small ways.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	d530c3096f	(search) GUI tweaks to make the new interface not fall apart on mobile/chrome	2023-12-02 17:06:40 +01:00
Viktor Lofgren	ae0c1c3f2d	(control) Adjust search result margins for better visual density	2023-12-02 17:06:40 +01:00
Viktor Lofgren	0cc2564380	(search) CSS tweaks	2023-12-02 17:06:40 +01:00
Viktor Lofgren	38d20022ad	(search) Fix script loading for mobile support	2023-12-02 17:06:40 +01:00
Viktor Lofgren	280132dad0	(search) Fix script loading for mobile support	2023-12-02 17:06:40 +01:00
Viktor Lofgren	61de4e2789	(search) Retain filter options when performing a new search from the input field	2023-12-02 17:06:40 +01:00
Viktor Lofgren	f9d3455320	(search) Reduce visual weight of search results	2023-12-02 17:06:40 +01:00
Viktor Lofgren	2ff64c3c12	(search) New toggle for reducing tracking	2023-12-02 17:06:40 +01:00
Viktor Lofgren	902f235b5b	(search) Integrate 'similar' tab in site info.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	97d43a6fa2	(search) Revamp browse results with new look.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	9bc65ff0ca	(search) Desaturate search result titles according to rank	2023-12-02 17:06:40 +01:00
Viktor Lofgren	6cd6a615fd	(search) Add data-filter to body as a data attribute For future shenanigans ;D	2023-12-02 17:06:40 +01:00
Viktor Lofgren	5639f0653d	(search) Rename SearchProfile.name into filterId Avoid foot-gun caused by name clash with the Enumeration method name(), which returns the Java name of the enumeration value.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	251174c9a2	(search) Update front page with new look	2023-12-02 17:06:40 +01:00
Viktor Lofgren	42ea87d637	(search) Update conversion results, error page, and dictionary results with new CSS.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	7c8a60b8cf	(search) Site info view is mostly done Also optimize the rendering a bit to avoid having to allocate huge string buffers, writing directly to Spark's response instead.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	2f4500be5a	(search) New frontend look	2023-12-02 17:06:40 +01:00
Viktor Lofgren	fa7534a362	(search) Remove dead code	2023-12-02 17:06:40 +01:00
Viktor Lofgren	a258f0af7a	(search) Refactor search parameters to include query	2023-12-02 17:06:40 +01:00
Viktor Lofgren	01621c6344	(renderer) Make helpers configurable on a by-service basis.	2023-12-02 17:06:40 +01:00
Viktor Lofgren	c7934342a6	(control) Automatic recrawl	2023-12-02 17:06:24 +01:00
Viktor Lofgren	f5c324c06b	(minor) Fix broken test	2023-12-01 17:44:39 +01:00
Viktor Lofgren	f615cf2391	(convert) Loosen up the rules enforcement for documents that have external links.	2023-12-01 17:44:29 +01:00
Viktor Lofgren	e5d274fe1c	(docs) Improve architectural documentation	2023-11-30 21:38:57 +01:00
Viktor Lofgren	166a391eae	(docs) Improve architectural documentation for the crawler.	2023-11-30 21:30:57 +01:00
Viktor Lofgren	5fb24bb27f	(docs) Improve architectural documentation for the converter.	2023-11-30 20:43:22 +01:00
Viktor Lofgren	5a5430b383	(convert) Wiki specialization that should do a better job at removing junk keywords and providing a useful summary.	2023-11-30 20:04:46 +01:00
Viktor Lofgren	67a1e1c874	(control) GUI for triggering control-side actors	2023-11-29 15:31:14 +01:00
Viktor Lofgren	4155fbe94c	(control) Reprocess-all actor	2023-11-28 17:58:48 +01:00
Viktor Lofgren	347fe6b7be	(control) Reindex-all actor	2023-11-28 16:41:09 +01:00
Viktor Lofgren	ff3ceb981e	(control) Button for removing a stale 'NEW' status If a process is violently terminated, the associated file storage may get stuck in the ephemeral 'NEW' state, preventing future operations on the associated data. To remedy this without having to dig through the database, a button was added to reset the state. It's a band-aid, but the situation is rare enough that I think it's fine.	2023-11-28 15:18:24 +01:00
Viktor Lofgren	1dafa0c74d	(mqapi/control) Repair repartition endpoint, deprecate notify endpoints. The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.	2023-11-27 16:01:12 +01:00
Viktor Lofgren	09917837d0	(process) Ensure construction exceptions are logged Wrapping these exceptions in a try-catch and logging them with slf4j will ensure they end up in the process logs. The way it worked using the default exception handler, they'd print on console (which nothing captures!), leading to a very annoying debugging experience.	2023-11-22 18:32:06 +01:00
Viktor Lofgren	dd507a3808	(db) Fix migrations, bump flyway to 10.0.1 Tricky problem, creating a procedure apparently needs delimiter shenanigans in Flyway, otherwise it will truncate the END statement and mariadb will be sad.	2023-11-21 20:04:35 +01:00
Viktor Lofgren	dd9406d0ac	(control) Make storage type tabs consistent This had fallen off in the Create New Specification view, it lacked Exports.	2023-11-17 11:26:45 +01:00
Viktor Lofgren	f58a9f46be	(loader) Don't truncate the entire links table on load This behavior is an old vestige from the days of only having a single loader process. We'd truncate the links table because doing inserts/updates was too slow. This was also important because we had 32-bit IDs, and there are a lot of links between domains to go around... Instead we delete the rows associated with the current node with a stored procedure PURGE_LINKS_TABLE. We also update the PRIMARY KEY to a BIGINT. We'll need to load the data in excess of a billion times to hit an ID rollover, so it'll be fine.	2023-11-16 10:30:12 +01:00
Viktor Lofgren	1cbf23e7e7	(test) Don't fail test if atags.parquet is not in ~vlofgren	2023-11-15 09:11:38 +01:00
Viktor Lofgren	63554ba171	(explore2) Add robots.txt	2023-11-14 09:15:32 +01:00
Viktor Lofgren	5de37cb820	(converter) Set feature flags appropriately on stackexchange posts	2023-11-12 15:48:08 +01:00
Viktor Lofgren	e5cee1f46d	(sideload) Fix sideloading so that it doesn't get disproportionately good rankings Also add type flags so that e.g. wikipedia shows up in the wikis filter.	2023-11-12 14:57:57 +01:00
Viktor Lofgren	e9a01caa5c	(index) Fix broken metrics	2023-11-11 12:53:47 +01:00
Viktor Lofgren	858357a246	(metrics) Get prometheus up out of disrepair * Fix bad labels * Add nodeId where appropriate * Hopefully fix histogram buckets for index query times	2023-11-08 14:01:28 +01:00
Viktor Lofgren	7aa2f80117	(domain) id.au should be treated as a TLD	2023-11-06 19:07:47 +01:00
Viktor Lofgren	7617b4cbc2	(crawler) Fix NPE in crawler caused by not having fetched the domains list yet	2023-11-06 18:16:38 +01:00
Viktor Lofgren	e0c769fd19	(converter) Integrate atags.parquet with the encyclopedia sideloader Also clean up stackexchange and dirtree a bit.	2023-11-06 18:03:01 +01:00
Viktor Lofgren	ebd10a5f28	(crawler) Integrate atags.parquet with the crawler so that "important" URLs are prioritized	2023-11-06 16:14:58 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	e23976f6c4	(search) Fix card title overflow	2023-11-06 13:25:39 +01:00
Viktor Lofgren	0b8dc02eba	(result-ranking) Nudge up results with ngram matches a tiny bit	2023-11-06 13:14:22 +01:00
Viktor Lofgren	fde1d0677e	(search) Remove unnecessary dependencies	2023-11-06 12:56:32 +01:00
Viktor Lofgren	48986574ae	(result-ranking) Use a weighted calculation of priority term importance	2023-11-06 12:56:21 +01:00
Viktor Lofgren	c7a6a71d07	(result-ranking) Use a weighted calculation of priority term importance	2023-11-06 12:48:23 +01:00
Viktor Lofgren	1847845151	Revert "(loader) Optimize INSERT statements" This reverts commit `7cb92195d1`.	2023-11-04 19:32:02 +01:00
Viktor Lofgren	7cb92195d1	(loader) Optimize INSERT statements INSERT IGNORE is too slow.	2023-11-04 17:43:55 +01:00
Viktor Lofgren	72afa0341f	duckdb connection may need to be synchronized?	2023-11-04 14:30:25 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	8e9698c9a0	(control/search) Add ability to suggest removing a site from random exploration This is what most complaints have been about.	2023-11-02 15:29:49 +01:00
Viktor Lofgren	3047e2dd7c	(screenshot-capture-tool) Make screenshot-capture-tool cooperate with docker	2023-11-01 16:38:55 +01:00
Viktor Lofgren	a8b9d21f2d	(executor) Refine atag export logic * Remove obviously uninteresting tags * Omit URL schema for more sensible sorting * Change the column order to put the source domain last	2023-11-01 13:23:14 +01:00
Viktor Lofgren	c77a5b7cb6	(control) GUI for atags export	2023-10-31 17:55:47 +01:00
Viktor Lofgren	23f2068e33	(executor) Actor for exporting anchor tag data from crawl data	2023-10-31 17:32:34 +01:00
Viktor Lofgren	ffadfb4149	(control) Use a partial template for the storage types tabs.	2023-10-31 17:12:14 +01:00
Viktor Lofgren	b7e38cfbae	(control) Add exports view	2023-10-31 17:08:48 +01:00
Viktor Lofgren	659743b39c	(executor) Export Data actor allocates its own storage	2023-10-31 17:04:07 +01:00
Viktor Lofgren	69758c5859	(control) Nicer redirects acknowledging actions	2023-10-31 16:26:29 +01:00
Viktor Lofgren	81bfd7e5fb	(experiment) Utility for exporting atags	2023-10-31 16:10:21 +01:00
Viktor Lofgren	8f74dbdbb4	(crawler) Set more lenient parameters for recrawl	2023-10-30 11:35:30 +01:00
Viktor Lofgren	fd5a7eac87	(crawler) Exit crawler retriever on thread interrupted	2023-10-30 11:34:16 +01:00
Viktor Lofgren	6bac3c75cb	(api) API documentation	2023-10-29 16:13:21 +01:00
Viktor Lofgren	5d6e0e3790	(log) Clean up logging Don't log the PROCESS stream to executor's logs, as it will also be logged in the spawned process' log files. Also tell the spawned process which "service" it is so that it gets a log file with a name that makes sense.	2023-10-29 15:52:17 +01:00
Viktor Lofgren	2871a326e6	(ctrl/exe) Clean up UX and code	2023-10-29 14:09:39 +01:00
Viktor Lofgren	abb42f0f36	(crawler) Fix bug in SQL statement Arguments were in the wrong order in inserting fetching sites submitted to be crawled	2023-10-29 13:19:17 +01:00


1535 Commits