MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	0307c55f9f	(refac) Zookeeper for service-discovery, kill service-client lib (WIP) To avoid having to either hard-code or manually configure service addresses (possibly several dozen), and to reduce the project's dependency on docker to deal with routing and discovery, the option to use [Zookeeper](https://zookeeper.apache.org/) to manage services and discovery has been added. A service registry interface was added, with a Zookeeper implementation and a basic implementation that only works on docker and hard-codes everything. The last remaining REST service, the assistant-service, has been migrated to gRPC. This also proved a good time to clear out primordial technical debt from the root of the codebase. The 'service-client' library has been taken behind the barn and given a last farewell. It's replaced by a small library for managing gRPC channels. Since it's no longer used by anything, RxJava has been removed as a dependency from the project. Although the current state seems reasonably stable, this is a work-in-progress commit.	2024-02-20 11:41:14 +01:00
Viktor Lofgren	6aee27a3f1	(*) Fix bug in EdgeDomain where it would permit domains with a trailing period, DNS style.	2023-12-29 16:36:01 +01:00
Viktor Lofgren	bf44805e69	(*) Rename EdgeDomain$domain into topDomain This variable had a very confusing name, and was dangerously easy to use in the wrong place with the result of getting something that only works as expected half the time. Ideally this class needs an overhaul, the assumptions it makes about domain names aren't great.	2023-12-17 14:00:07 +01:00
Viktor Lofgren	bcad6492d6	(sideloader) Fix integration problems with sideloaders In encyclopedia, add a class "mw-content-text" that the WikiSpecialization class is looking for during pruning to give the articles a more fair treatment. Also add generator keywords based on the generator type provided, to ensure that these documents show up in appropriate filters. Further, add a new document flag value 'Sideloaded' to be able to distinguish these entries.	2023-12-17 13:28:17 +01:00
Viktor Lofgren	b74a3ebd85	(crawler) WIP integration of WARC files into the crawler process. At this stage, the crawler will use the WARCs to resume a crawl if it terminates incorrectly. This is a WIP commit, since the warc files are not fully incorporated into the work flow, they are deleted after the domain is crawled. The commit also includes fairly invasive refactoring of the crawler classes, to accomplish better separation of concerns.	2023-12-11 19:32:58 +01:00
Viktor Lofgren	072b5fcd12	Implement Warc-recording wrapper for OkHttp3 client This is a first step of using WARC as an intermediate flight recorder style step in the crawler, ultimately aimed at being able to resume crawls if the crawler is restarted. This component is currently not hooked into anything. The OkHttp3 client wrapper class 'WarcRecordingFetcherClient' was implemented for web archiving. This allows for the recording of HTTP requests and responses. New classes were introduced, 'WarcDigestBuilder', 'IpInterceptingNetworkInterceptor', and 'WarcProtocolReconstructor'. The JWarc dependency was added to the build.gradle file, and relevant unit tests were also introduced. Some HttpFetcher-adjacent structural changes were also done for better organization.	2023-12-08 13:49:16 +01:00
Viktor Lofgren	7aa2f80117	(domain) id.au should be treated as a TLD	2023-11-06 19:07:47 +01:00
Viktor Lofgren	2b77184281	(converter) Integrate atags with the topology field	2023-11-06 13:46:44 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	ec6c9bca62	(common) Fix factual error in comments	2023-09-24 19:40:19 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	eaeb23d41e	(refactor) Remove converting-model package completely	2023-09-14 11:21:44 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	c909120ae1	(search) Basic working integration of linkdb in search service	2023-08-24 17:24:56 +02:00
Viktor Lofgren	9894f37412	(index) Implement new URL ID coding scheme. Also refactor along the way. Really needs an additional pass, these tests are very hairy.	2023-08-24 16:44:27 +02:00
Viktor Lofgren	c70670bacb	(common) New UrlIdCodec class Have a single class responsible for encoding and decoding URL ids, as it's a bit finicky and used all over.	2023-08-24 11:41:07 +02:00
Viktor Lofgren	7bb3e44a76	(common) Deprecate EdgeId and similar	2023-08-24 11:16:28 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	15912f31d0	(control-service) Basic GUI for deleting bad links from exploration mode	2023-08-21 18:35:26 +02:00
Viktor Lofgren	6cb784df75	(minor) Improve comment	2023-08-18 11:25:36 +02:00
Viktor Lofgren	c019a029ec	(flags) Documentation and preventative bugfix	2023-08-17 17:42:31 +02:00
Viktor Lofgren	4598c7f40f	(valuation) Penalize wordpress style kebab case urls	2023-08-16 13:11:24 +02:00
Viktor Lofgren	ce293029c7	(converter) Treat adtech tracking as advertisement.	2023-08-09 14:23:53 +02:00
Viktor Lofgren	5411950b87	(minor) Tidy up EdgeDomain class a bit, no functional difference	2023-07-31 10:31:29 +02:00
Viktor Lofgren	77d5e39fe0	Make processed data Serializable	2023-07-28 18:11:19 +02:00
Viktor Lofgren	7470c170b1	(minor) EdgeUrl.parse() should deal with null	2023-07-24 15:06:57 +02:00
Viktor	cbbf60a599	Better fingerprinting (#35) * Better fingerprinting for server tech * Many more features in FeatureExtractor * Blog specialization * SiteType table	2023-07-10 18:58:43 +02:00
Viktor Lofgren	f92d8a0975	EdgeUrl conversion to/from java.net.URL	2023-06-27 10:57:54 +02:00
Viktor Lofgren	5abaf13192	Fix serialization bug with CompressedBigString	2023-06-27 10:57:54 +02:00
Viktor Lofgren	bd2c3855ed	Add bits and keywords for generator classes (docs, forum, wiki).	2023-06-23 21:35:28 +02:00
Viktor Lofgren	54c2be893b	TRIVIAL: Remove unused import.	2023-06-22 17:21:47 +02:00
Viktor Lofgren	b5ef67ed28	Categorize generators by type This is a great quality signal! Add the type as document bitflags by category.	2023-06-22 16:04:37 +02:00
Viktor Lofgren	d1a004bea6	(minor) Clean up StringPool	2023-06-19 17:58:19 +02:00
Viktor Lofgren	2cda57355a	More word metadata tests	2023-05-28 11:57:06 +02:00
Viktor Lofgren	2ab26f37b8	Bug fix for document metadata encoding that breaks year based queries.	2023-04-14 16:56:49 +02:00
Viktor	a278fc6296	Increase search result relevance (#8) * Increase accuracy of the position bits. * Increase their width to 56. * Use a rolling position scheme for bits 16-56 to increase the average accuracy. * Result ranking overhaul * Optimized queries * BM25 in the index service's ranking * Make gui less jank * Javadocs for ranking parameters.	2023-04-07 20:18:08 +02:00
Viktor Lofgren	716ab35b4e	Search ranking debuggability improvements.	2023-04-02 13:43:24 +02:00
Viktor Lofgren	cc4e089a5d	Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.	2023-03-30 15:46:15 +02:00
Viktor Lofgren	03bd892b95	Improve document processing in conversion. * Add flags for long and short documents. * Break out common length logic from plugins. * Cleaning up of related code.	2023-03-28 16:38:00 +02:00
Viktor	ac1ac3ea57	Move database to a separate module * Move database to a separate project, break apart sql file into separate entities. * Fix front page news listing.	2023-03-25 15:26:17 +01:00
Viktor Lofgren	46f81aca2f	Break apart reverse index into a separate full index and priority index. It did this before using the same code. This will make the priority index about half as big since it no longer needs to keep metadata.	2023-03-21 16:12:31 +01:00
Viktor Lofgren	ca22c287a5	Make use of DocumentFlags' flags	2023-03-21 16:03:15 +01:00
Viktor Lofgren	72115e490f	Put news into a database table instead of keeping them hardcoded, request counter on front page.	2023-03-19 12:54:58 +01:00
Viktor Lofgren	bdd2b4a43e	Put news into a database table instead of keeping them hardcoded.	2023-03-19 11:46:13 +01:00
Viktor Lofgren	2eb972dea1	Remove unrelated code, break tools into their own directory.	2023-03-17 16:03:11 +01:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	4cec89da91	Fix bug where results would sometimes be presented solely based on the fact that the document is important on the site in general, regardless of whether it's important to the document.	2023-03-11 14:20:32 +01:00


58 Commits