MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-24 05:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	92522e8d97	(index) Attenuate bm25 score based on query length	2024-08-15 08:41:38 +02:00
Viktor Lofgren	049d94ce31	(index) Add body position match to qdebug fields	2024-08-15 08:39:37 +02:00
Viktor Lofgren	dbc6a95276	(index) Consume the new 'body' span in index to make it used in ranking	2024-08-15 08:33:43 +02:00
Viktor Lofgren	75b0888032	(slop) Migrate to latest Slop version	2024-08-14 11:44:35 +02:00
Viktor Lofgren	623ee5570f	(slop) Break slop out into its own repository	2024-08-13 09:50:05 +02:00
Viktor Lofgren	e6c8a6febe	(index) Add index-side deduplication in selectBestResults	2024-08-10 10:51:59 +02:00
Viktor Lofgren	4ece5f847b	(index) Add more qdebug factors	2024-08-10 10:45:30 +02:00
Viktor Lofgren	e4f04af044	(index) Give BODY matches a verbatim match value	2024-08-10 10:22:19 +02:00
Viktor Lofgren	b730b17f52	(index) Correct handling of firstPosition to avoid d/z	2024-08-10 10:21:59 +02:00
Viktor Lofgren	98c40958ab	(index) Simplify verbatim match calculation	2024-08-10 09:54:56 +02:00
Viktor Lofgren	41b52f5bcd	(index) Simplify verbatim match calculation	2024-08-10 09:51:03 +02:00
Viktor Lofgren	016a4c62e1	(index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords	2024-08-10 09:51:03 +02:00
Viktor Lofgren	df89661ed2	(index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin	2024-08-09 16:32:32 +02:00
Viktor Lofgren	2e89b55593	(wip) Repair qdebug utility and show new ranking details	2024-08-09 12:57:25 +02:00
Viktor Lofgren	7babdb87d5	(index) Remove intermediate models	2024-08-07 10:10:44 +02:00
Viktor Lofgren	f01267bc6b	(index) Don't load fwd index offsets into a hash table at start. This makes the service take forever to start up. Memory map the data instead and binary search. This is a bit slower, but not by much.	2024-08-06 11:16:28 +02:00
Viktor Lofgren	df6a05b9a7	(index) Avoid hypothetical divide-by-zero in tcfAvgDist	2024-08-06 10:55:57 +02:00
Viktor Lofgren	8569bb8e11	(index) Avoid divide-by-zero when minDist returns 0	2024-08-06 10:34:05 +02:00
Viktor Lofgren	ca6e2db2b9	(index) Include external link texts in verbatim score	2024-08-06 10:23:23 +02:00
Viktor Lofgren	2080e31616	(converter) Store link text positions To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends. Integrating this information with the ranking is not performed here.	2024-08-04 12:00:29 +02:00
Viktor Lofgren	ee49c01d86	(index) Tune ranking for verbatim matches in the title, rewarding shorter titles	2024-08-03 14:47:23 +02:00
Viktor Lofgren	b21f8538a8	(index) Tune ranking for verbatim matches in the title, rewarding shorter titles	2024-08-03 14:41:38 +02:00
Viktor Lofgren	dd15676d33	(index) Tune ranking for verbatim matches in the title, rewarding shorter titles	2024-08-03 14:18:04 +02:00
Viktor Lofgren	ec5a17ad13	(index) Tune ranking for verbatim matches in the title, rewarding shorter titles	2024-08-03 14:07:02 +02:00
Viktor Lofgren	8462e88b8f	(index) Add min-dist factor and adjust rankings	2024-08-03 13:07:00 +02:00
Viktor Lofgren	bf26ead010	(index) Remove hasPrioTerm check as we should sort this out in ranking	2024-08-03 13:06:50 +02:00
Viktor Lofgren	c2cedfa83c	(index) Experimental ranking signals	2024-08-03 10:33:41 +02:00
Viktor Lofgren	c6c8b059bf	(index) Return some variant of the previously removed 'Bm25PrioGraphVisitor'	2024-08-03 10:10:12 +02:00
Viktor Lofgren	d8a99784e5	(index) Adding a few experimental relevance signals	2024-08-02 20:26:07 +02:00
Viktor Lofgren	e2107901ec	(index) Add span information for anchor tags, tweak ranking params	2024-08-01 11:46:30 +02:00
Viktor Lofgren	15745b692e	(index) Coherences need to be able to deal with null values among positions	2024-07-31 22:00:14 +02:00
Viktor Lofgren	dc5c668940	(index) Re-enable parallelization of index construction, disable parallel sorting during construction The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance. It got worse, so the change is reverted. Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.	2024-07-31 10:06:53 +02:00
Viktor Lofgren	b316b55be9	(index) Experimental initial integration of document spans into index	2024-07-30 12:01:53 +02:00
Viktor Lofgren	34703da144	(slop) Support for nested array types and array-of-object types Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.	2024-07-29 14:00:43 +02:00
Viktor Lofgren	1282f78bc5	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 11:01:18 +02:00
Viktor Lofgren	2d5d965f7f	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 10:34:33 +02:00
Viktor Lofgren	e585116dab	(slop) Add 32 bit read method for Varint along with the old 64 bit version	2024-07-28 13:20:18 +02:00
Viktor Lofgren	d05a2e57e9	(index-forward) Spans Writer should not be in the index page loop context	2024-07-27 15:17:04 +02:00
Viktor Lofgren	6c3abff664	(slop) Move GCS Slop column to the coded-sequence package This lets the slop library be stand-alone without dependence on coded-sequence. The change also gets rid of the vestigial seek() method in ColumnReader.	2024-07-27 13:58:45 +02:00
Viktor Lofgren	dcb43a3308	(slop) Introduce table concept to keep track of positions and simplify closing The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip. The second most common error is forgetting to close one of the columns in a reader or writer. To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.	2024-07-27 13:47:47 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	0b31c4cfbb	(coded-sequence) Replace GCS usage with an interface	2024-07-16 14:37:50 +02:00
Viktor Lofgren	5c098005cc	(index) Fix broken test Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.	2024-07-16 12:37:59 +02:00
Viktor Lofgren	ae87e41cec	(index) Fix rare BitReader.takeWhileZero bug Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty. The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte. Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.	2024-07-16 11:03:56 +02:00
Viktor Lofgren	dfd19b5eb9	(index) Reduce the number of abstractions around result ranking The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.	2024-07-16 08:18:54 +02:00
Viktor Lofgren	ad3857938d	(search-api, ranking) Update with new ranking parameters Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm. The change also cleans out several parameters that no longer filled any function.	2024-07-15 04:49:40 +02:00
Viktor Lofgren	179a6002c2	(coded-sequence) Add a callback for re-filling underlying buffer	2024-07-12 23:50:28 +02:00
Viktor Lofgren	d28fc86956	(index-prio) Add fuzz test for prio index	2024-07-11 19:22:36 +02:00
Viktor Lofgren	6303977e9c	(index-prio) Fail louder when size is 0 in PrioDocIdsTransformer We can't deal with this scenario and should complain very loudly	2024-07-11 19:22:05 +02:00
Viktor Lofgren	97695693f2	(index-prio) Don't increment readItems counter when the output buffer is full This behavior was causing the reader to sometimes discard trailing entries in the list.	2024-07-11 19:21:36 +02:00
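
Note on commit f01267bc6b: the message describes replacing a hash table built at startup with a memory-mapped file that is binary searched. The following is a minimal sketch of that general technique, not the project's actual code. The class name `MappedOffsetLookup`, the on-disk layout (sorted pairs of 8-byte document id and offset), and the `-1` miss value are all assumptions made for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.LongBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch; names and file layout are assumptions, not the real index format. */
public class MappedOffsetLookup {

    /** Writes (docId, offset) pairs as consecutive big-endian longs; docIds must be sorted ascending. */
    public static void write(Path file, long[] sortedDocIds, long[] offsets) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(sortedDocIds.length * 16);
        for (int i = 0; i < sortedDocIds.length; i++) {
            buf.putLong(sortedDocIds[i]).putLong(offsets[i]);
        }
        Files.write(file, buf.array());
    }

    /** Memory maps the file; nothing is loaded eagerly, so this returns quickly regardless of size.
     *  A single mapping is limited to 2 GB; a larger file would need several mappings. */
    public static LongBuffer map(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // The mapping stays valid after the channel is closed.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()).asLongBuffer();
        }
    }

    /** Binary searches the mapped pairs for docId; returns its offset, or -1 if absent. */
    public static long findOffset(LongBuffer data, long docId) {
        int lo = 0;
        int hi = data.limit() / 2 - 1; // index of the last pair
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long key = data.get(2 * mid); // docId of the middle pair
            if (key < docId) {
                lo = mid + 1;
            } else if (key > docId) {
                hi = mid - 1;
            } else {
                return data.get(2 * mid + 1); // offset stored next to the docId
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("fwd-offsets", ".dat");
        write(file, new long[] {3, 8, 21, 55}, new long[] {0, 128, 512, 900});
        LongBuffer data = map(file);
        System.out.println(findOffset(data, 21)); // prints 512
        System.out.println(findOffset(data, 4));  // prints -1
    }
}
```

Each lookup costs O(log n) reads from the mapping instead of one hash probe, which matches the trade-off the commit message states: somewhat slower lookups in exchange for a much faster service start.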

174 Commits