MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	722b56c8ca	(index) Fix rare bug in the index-switching logic This is caused by a resource contention with the query code. The proper way to fix this is to use some form of synchronization, but that will slow the code down. So we just hammer it a few times and let the GC deal with the problem if it fails. Not optimal, but fast.	2023-12-16 18:57:35 +01:00
Viktor Lofgren	45987a1d98	Merge branch 'master' into warc	2023-12-11 14:32:35 +01:00
Viktor Lofgren	e3ebb0c5bb	(*) Rename the search filter 'RETRO' into 'POPULAR' This will make the terminology more consistent between the GUI and the code. The rankings yaml still uses 'retro' though, to retain compatibility.	2023-12-09 20:06:54 +01:00
Viktor Lofgren	cc813a5624	(convert) Add basic support for Warc file sideloading This update includes the integration of the jwarc library and implements support for Warc file sideloading, as a first trial integration with this library.	2023-12-06 18:43:55 +01:00
Viktor Lofgren	1dafa0c74d	(mqapi/control) Repair repartition endpoint, deprecate notify endpoints. The repartition endpoint was mis-addressing its mqapi notifications, omitting the proper nodeId. In fixing this, it became apparent that having both @MqRequest and @MqNotification is a serious footgun, and the two should be unified into a single API where the caller isn't burdened with knowledge of the remote end's implementation specifics.	2023-11-27 16:01:12 +01:00
Viktor Lofgren	e9a01caa5c	(index) Fix broken metrics	2023-11-11 12:53:47 +01:00
Viktor Lofgren	858357a246	(metrics) Get prometheus up out of disrepair * Fix bad labels * Add nodeId where appropriate * Hopefully fix histogram buckets for index query times	2023-11-08 14:01:28 +01:00
Viktor Lofgren	0152004c42	Initial Commit Anchor Tags * Added new (optional) model file in $WMSA_HOME/data/atags.parquet * Converter gets a component for creating a projection of its domains onto the full atags parquet file * New WordFlag ExternalLink * These terms are also for now flagged as title words * Fixed a bug where Title words aliased with UrlDomain words * Fixed a bug in the encyclopedia sideloader that gave everything too high topology ranking	2023-11-04 14:24:17 +01:00
Viktor Lofgren	88f49834fd	(docs) Update documentation	2023-10-27 12:45:39 +02:00
Viktor Lofgren	79adba9284	(index) Fix bug in dealing with quoted search terms	2023-10-26 16:28:23 +02:00
Viktor Lofgren	f613f4f2df	(array) Fix spurious search results This was caused by a bug in the binary search algorithm causing it to sometimes return positive values when encoding a search miss. It was also necessary to get rid of the vestiges of the old LongArray and IntArray classes to make this fix doable.	2023-10-26 15:27:02 +02:00
Viktor Lofgren	313cc2965c	(index-creation) Print whether full or prio is created Previous state of saying reverse index for both was pretty confusing.	2023-10-24 16:23:10 +02:00
Viktor Lofgren	0406e76889	(api) Remove logging cruft	2023-10-24 13:39:05 +02:00
Viktor Lofgren	c2b28c0f8d	(api) Trial streaming API	2023-10-24 13:26:46 +02:00
Viktor Lofgren	a860f8f1a8	(index/qs) GRPC API for better query performance	2023-10-24 11:38:07 +02:00
Viktor Lofgren	108b4cb648	(service) Keep disabled multi-noded services dormant when they are configured to be disabled.	2023-10-14 20:58:55 +02:00
Viktor Lofgren	4baf9527d7	() WIP Control GUI redesign, executor-service, multi-node mq This turned out to be very difficult to do in small isolated steps. Design overhaul of the control gui using bootstrap * Move the actors out of control-service into a new executor-service, that can be run on multiple nodes * Add node-affinity to message queue	2023-10-14 12:08:43 +02:00
Viktor Lofgren	199c459697	(*) Add node-affinity to services, processes and file storage.	2023-10-10 12:32:22 +02:00
Viktor Lofgren	3889c4bdd9	(refactor) Remove features-search and update documentation	2023-10-09 15:12:30 +02:00
Viktor Lofgren	77ccab7d80	(index) Move linkdb to index from search. This makes index complete in the sense that you can deploy an index instance and build a complete separate application on top of it, without having to go through the Marginalia-laden search service.	2023-10-08 16:48:35 +02:00
Viktor	8e1abc3f10	(index-reverse) Parallel construction of the reverse indexes. (#52) * (index-reverse) Parallel construction of the reverse indexes. * (array) Remove wasteful calculation of numDistinct before merging two sorted arrays. * (index-reverse) Force changes to disk on close, reduce logging. * (index-reverse) Clean up merging process and add back logging * (run) Add a conservative default for INDEX_CONSTRUCTION_PROCESS_OPTS's parallelism as it eats a lot of RAM * (index-reverse) Better logging during processing * (array) 2GB+ compatible write() function * (array) 2GB+ compatible write() function * (index-reverse) We are logging like Bolsonaro and I will not have it. * (reverse-index) Self-diagnostics * (btree) Fix bug in btree reader to do with large data sizes	2023-10-07 10:00:00 +02:00
Viktor Lofgren	c51159672e	(build) Move unit test configuration to root build.gradle	2023-10-04 12:46:22 +02:00
Viktor Lofgren	40768e935b	(test) Removing /tmp-guardrails as it doesn't hold in CI	2023-10-02 16:52:59 +02:00
Viktor Lofgren	d160954080	(index) Two useful debug endpoints	2023-09-24 19:39:48 +02:00
Viktor Lofgren	14372e0ef0	(index) Slightly reduce alloc churn	2023-09-24 19:36:14 +02:00
Viktor Lofgren	028b5a4f0d	(minor performance) Reduce GC churn in index	2023-09-24 12:12:08 +02:00
Viktor Lofgren	1bd146fb8e	(minor) Remove dead code	2023-09-24 10:55:20 +02:00
Viktor Lofgren	5f6c3da7a4	(index) Add close methods on the index readers so they clean up their mmaps	2023-09-24 10:54:23 +02:00
Viktor Lofgren	dbe9235f3a	(*) Upgrade to JDK21 with preview enabled. ... also move some common configuration into the root build.gradle-file. Support for JDK21 in lombok is a bit sketchy at the moment, but it seems to work. This upgrade is kind of important as the new index construction really benefits from Arena based lifecycle control over off-heap memory.	2023-09-24 10:38:59 +02:00
Viktor Lofgren	9e185e80ce	(control-service) Add timestamp to file storages.	2023-09-02 14:01:04 +02:00
Viktor Lofgren	d31d8ec5b0	(index) Log keyword ids on hex format	2023-09-01 15:40:24 +02:00
Viktor Lofgren	764e7d1315	(index) Add more comprehensive integration tests for the index service.	2023-08-30 10:37:24 +02:00
Viktor Lofgren	3f288e264b	(minor) Clean up dead endpoints	2023-08-29 17:04:54 +02:00
Viktor Lofgren	39c1857c61	(heartbeat, reverse-index) Better heartbeat mocking, improved heartbeats for reverse index construction.	2023-08-29 13:07:55 +02:00
Viktor Lofgren	b6a92506d1	(index) Hook in missing DocIdRewriter This enables documents to be ranked properly.	2023-08-28 19:53:43 +02:00
Viktor Lofgren	3101b74580	(index) Move to a lexicon-free index design This is a system-wide change. The index used to have a lexicon, mapping words to wordIds using a large in-memory hash table. This made index-construction easier, but it also added a fairly significant RAM penalty to both the index service and the loader. The new design moves to 64 bit word identifiers calculated using the murmur hash of the keyword, and an index construction based on merging smaller indices. It also became necessary half-way through to upgrade guice as its error reporting wasn't quite compatible with JDK20.	2023-08-28 14:02:23 +02:00
Viktor Lofgren	460998d512	(index) Move index construction to separate process. This provides a much cleaner separation of concerns, and makes it possible to get rid of a lot of the gunkier parts of the index service. It will also permit lowering the Xmx on the index service a fair bit, so we can get CompressedOOps again :D	2023-08-25 12:52:54 +02:00
Viktor Lofgren	b911665691	(index) Clean up and optimize valuator	2023-08-24 18:34:06 +02:00
Viktor Lofgren	56eb83319d	(index) Clean up result domain deduplicator	2023-08-24 18:24:55 +02:00
Viktor Lofgren	1e6800565a	(system) Remove EdgeId<T> and similar objects They seemed like a good idea at the time, but in practice they're wasting resources and not really providing the clarity I had hoped.	2023-08-24 17:46:02 +02:00
Viktor Lofgren	9894f37412	(index) Implement new URL ID coding scheme. Also refactor along the way. Really needs an additional pass, these tests are very hairy.	2023-08-24 16:44:27 +02:00
Viktor Lofgren	ebc84c22fb	Upgrade antique lombok plugin This permits tests to run on JDK20 environments.	2023-08-23 14:34:32 +00:00
Viktor Lofgren	aa0d256d6a	Upgrade code to Java 20. * Change language version * Upgrade Lombok to a JDK20 compatible version	2023-08-23 13:37:49 +00:00
Viktor Lofgren	4d75fa2908	Upgrade gradle and docker plugin to support native JDK20 environments	2023-08-23 13:30:55 +00:00
Viktor Lofgren	704de50a9b	(forward-index, valuator) HTML features in valuator Put it in the forward index for easy access during index-side valuation.	2023-08-18 11:54:56 +02:00
Viktor Lofgren	251fc63b42	(*) Fix merge gore	2023-08-09 13:33:28 +02:00
Viktor Lofgren	624b78ec3a	(heartbeat) Task heartbeats	2023-08-04 14:40:06 +02:00
Viktor Lofgren	9979c9defe	(search/index) Add blogosphere filter	2023-08-02 20:13:30 +02:00
Viktor Lofgren	e22e65eee4	(index) Fix bug related to debug print statements	2023-07-22 14:33:58 +02:00
Viktor Lofgren	d7ab21fe34	(*) Refactor Control Service and processes	2023-07-17 21:20:31 +02:00
Viktor Lofgren	8b74e3aa0d	(*) File Storage WIP	2023-07-14 17:08:10 +02:00
Viktor Lofgren	88b9ec70c6	(control, WIP) Run reconvert-load from converter :D	2023-07-11 18:05:37 +02:00
Viktor	cbbf60a599	Better fingerprinting (#35) * Better fingerprinting for server tech * Many more features in FeatureExtractor * Blog specialization * SiteType table	2023-07-10 18:58:43 +02:00
Viktor Lofgren	96eecc6ea5	Minor: Readability.	2023-07-10 18:58:43 +02:00
Viktor Lofgren	d9e6c4f266	Trial integration of MQ-FSM into index service.	2023-07-06 18:04:16 +02:00
Viktor Lofgren	62cc9df206	Embryo of new control process * New events and heartbeat tables in mariadb * Refactored to a cleaner Service interface	2023-07-03 10:40:32 +02:00
Viktor Lofgren	bd2c3855ed	Add bits and keywords for generator classes (docs, forum, wiki).	2023-06-23 21:35:28 +02:00
Viktor Lofgren	55c65f0935	Use document generator to complement the document selection. Will let through e.g. a modern SSG in the small web filter.	2023-06-22 17:21:33 +02:00
Viktor Lofgren	fd192d2791	Fix putative overflow error with a large dictionary	2023-05-28 11:57:06 +02:00
Viktor Lofgren	df1850bd45	Fix bug in index service where tld: and links:-queries wouldn't work.	2023-04-15 18:39:16 +02:00
Viktor Lofgren	502713f7a8	Reduce memory churn	2023-04-10 16:51:17 +02:00
Viktor Lofgren	e19256a6b6	Tune settings to retrieve more results.	2023-04-10 15:39:20 +02:00
Viktor Lofgren	ccc41d1717	Clean up of the index query handling related code.	2023-04-10 14:50:57 +02:00
Viktor Lofgren	e49b1dd155	Better handling of quote terms, fix bug in handling of longer queries. ... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java	2023-04-10 13:20:40 +02:00
Viktor Lofgren	fe419b12b4	Better handling of quote terms, fix bug in handling of longer queries. ... where some terms may previously have been ignored. The latter bug was due to the handling of QueryHeads with AnyOf-style predicates interacting poorly with alreadyConsideredTerms in SearchIndex.java	2023-04-10 13:11:40 +02:00
Viktor Lofgren	535a51a621	Repair broken year query test.	2023-04-08 12:04:09 +02:00
Viktor	a278fc6296	Increase search result relevance (#8) * Increase accuracy of the position bits. * Increase their width to 56. * Use a rolling position scheme for bits 16-56 to increase the average accuracy. * Result ranking overhaul * Optimized queries * BM25 in the index service's ranking * Make gui less jank * Javadocs for ranking parameters.	2023-04-07 20:18:08 +02:00
Viktor Lofgren	3fb249758e	Adjust result ordering.	2023-04-02 12:05:22 +02:00
Viktor Lofgren	f7a6ef2179	Smarter queries, better logging.	2023-04-02 12:05:09 +02:00
Viktor Lofgren	105d93cd85	Index query builder automatically ignores redundant predicates.	2023-04-02 12:04:26 +02:00
Viktor Lofgren	1e4157017d	More helpful descriptions of index queries.	2023-04-02 12:03:58 +02:00
Viktor Lofgren	cc4e089a5d	Consider average sentence length when selecting search results. This promotes prose over code listings, tabular data, etc.	2023-03-30 15:46:15 +02:00
Viktor Lofgren	dcf6218cdb	Fix bugs related to search result selection in the case with multiple search terms. * A deduplication filter step ran too early, and removed many good results on the basis that they partially, but did not fully fit another set of search terms. * Altered the query creation process to prefer documents where multiple terms appear in the priority index.	2023-03-29 15:18:52 +02:00
Viktor Lofgren	17ca4f9eea	Permit search results that are all synthetic to pass relevancy check.	2023-03-27 17:27:35 +02:00
Viktor Lofgren	862e925d7c	"-Dsmall-ram=TRUE" no longer does anything. Remove references to the flag, which previously reduced the memory footprint of the loader and index service.	2023-03-26 21:37:11 +02:00
Viktor Lofgren	a0027ad32b	Fix broken diagram links after doc/ restructuring.	2023-03-25 16:32:10 +01:00
Viktor	c4a6bf7672	Update readme.md	2023-03-22 17:01:34 +01:00
Viktor Lofgren	46f81aca2f	Break apart reverse index into a separate full index and priority index. It did this before using the same code. This will make the priority index about half as big since it no longer needs to keep metadata.	2023-03-21 16:12:31 +01:00
Viktor Lofgren	6a20b2b678	Trivial reformatting of code.	2023-03-17 22:11:14 +01:00
Viktor Lofgren	449471a076	Yet more restructuring. Improved search result ranking.	2023-03-16 21:35:54 +01:00
Viktor Lofgren	73eaa0865d	The refactoring will continue until morale improves.	2023-03-12 10:50:31 +01:00
Viktor Lofgren	616effdb3c	The refactoring will continue until morale improves.	2023-03-12 10:04:48 +01:00
Viktor Lofgren	6d939175b1	Additional code restructuring to get rid of util and misc-style packages.	2023-03-11 13:48:40 +01:00
Viktor Lofgren	73e412ea5b	Clean up search-service and index-api	2023-03-11 12:26:12 +01:00
Viktor Lofgren	919b80b9ab	Gradle shouldn't generate dist zips, zipping jar files is slow and also just ridiculous when you realize jar files are zip files and you can't compress a file twice using the same algo.	2023-03-11 11:34:51 +01:00
Viktor Lofgren	a62015d5f3	Fix broken test, compiler warning.	2023-03-10 17:12:12 +01:00
Viktor Lofgren	722ff3bffb	Word feature bit for words that appear in the URL, new search profile for plain text files, better plain text titles.	2023-03-10 16:46:56 +01:00
Viktor Lofgren	efb46cc703	Remove count from WordMetadata entirely.	2023-03-09 18:14:14 +01:00
Viktor Lofgren	1252f95da5	Fix for valuation bug in index code that wouldn't sort bad-ish items properly.	2023-03-07 21:26:04 +01:00
Viktor Lofgren	ad1be7c835	Move all code to a code directory.	2023-03-07 17:14:32 +01:00

140 Commits