MarginaliaSearch

mirror of https://github.com/MarginaliaSearch/MarginaliaSearch.git synced 2025-02-23 21:18:58 +00:00

Author	SHA1	Message	Date
Viktor Lofgren	e48f52faba	(experiment) Add add-hoc filter runner	2024-08-03 13:24:03 +02:00
Viktor Lofgren	8462e88b8f	(index) Add min-dist factor and adjust rankings	2024-08-03 13:07:00 +02:00
Viktor Lofgren	bf26ead010	(index) Remove hasPrioTerm check as we should sort this out in ranking	2024-08-03 13:06:50 +02:00
Viktor Lofgren	c2cedfa83c	(index) Experimental ranking signals	2024-08-03 10:33:41 +02:00
Viktor Lofgren	eba2844361	(index) Experimental ranking signals	2024-08-03 10:32:46 +02:00
Viktor Lofgren	c6c8b059bf	(index) Return some variant of the previously removed 'Bm25PrioGraphVisitor'	2024-08-03 10:10:12 +02:00
Viktor Lofgren	d8a99784e5	(index) Adding a few experimental relevance signals	2024-08-02 20:26:07 +02:00
Viktor Lofgren	57929ff242	(coded-sequence) Varint sequence	2024-08-02 20:22:56 +02:00
Viktor Lofgren	4430a39120	(loader) Clean up	2024-08-02 12:32:47 +02:00
Viktor Lofgren	6228f46af1	(loader) Reduce log spam	2024-08-02 12:21:03 +02:00
Viktor Lofgren	ac67b6b5da	(converter) Fix exception handling while reading crawl data	2024-08-02 10:39:49 +02:00
Viktor Lofgren	1a268c24c8	(perf) Reduce DomPruningFilter hash table recalculation	2024-08-01 12:04:55 +02:00
Viktor Lofgren	38e2089c3f	(perf) Code was still spending a lot of time resolving charsets ... in the failure case which wasn't captured by memoization.	2024-08-01 11:58:59 +02:00
Viktor Lofgren	e2107901ec	(index) Add span information for anchor tags, tweak ranking params	2024-08-01 11:46:30 +02:00
Viktor Lofgren	15745b692e	(index) Coherences need to be able to deal with null values among positions	2024-07-31 22:00:14 +02:00
Viktor Lofgren	696fd8909d	(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones	2024-07-31 21:44:10 +02:00
Viktor Lofgren	02b1c4b172	(screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones	2024-07-31 20:21:23 +02:00
Viktor Lofgren	285e657f68	Merge branch 'master' into term-positions # Conflicts: # code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java # code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java	2024-07-31 10:44:01 +02:00
Viktor Lofgren	046ffc7752	(build) Upgrade jib to 3.4.3	2024-07-31 10:39:50 +02:00
Viktor Lofgren	2ef66ce0ca	(actor) Reset NEW flag earlier when auto-deletion is disabled Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.	2024-07-31 10:31:03 +02:00
Viktor Lofgren	dc5c668940	(index) Re-enable parallelization of index construction, disable parallel sorting during construction The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance. It got worse, so the change is reverted. Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.	2024-07-31 10:06:53 +02:00
Viktor Lofgren	f19148132a	(search) Restrict site-search by passing domain id along with the site:-term This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.	2024-07-30 21:41:07 +02:00
Viktor Lofgren	6d7b886aaa	(converter) Correct sort order of files in control storage GUI Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today was typically shown first, followed by the rest of the files with the oldest date first.	2024-07-30 19:43:27 +02:00
Viktor Lofgren	b316b55be9	(index) Experimental initial integration of document spans into index	2024-07-30 12:01:53 +02:00
Viktor Lofgren	80900107f7	(restructure) Clean up repo by moving stray features into converter-process and crawler-process	2024-07-30 10:14:00 +02:00
Viktor Lofgren	7e4efa45b8	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:21:21 +02:00
Viktor Lofgren	86ea28d6bc	(converter/loader) Simplify document record writing to not require predicated reads	2024-07-29 14:18:52 +02:00
Viktor Lofgren	34703da144	(slop) Support for nested array types and array-of-object types Also adding very basic support for filtered reads via SlopTable. This is probably not a final design.	2024-07-29 14:00:43 +02:00
Viktor Lofgren	1282f78bc5	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 11:01:18 +02:00
Viktor Lofgren	2d5d965f7f	(slop-models) Fix incorrect column grouping leading to errors in converter	2024-07-29 10:34:33 +02:00
Viktor Lofgren	afe56c7cf1	(loader) Tidy up code	2024-07-28 21:36:42 +02:00
Viktor Lofgren	7d51cf882f	(loader) Move rssFeeds to a different column group to avoid errors	2024-07-28 21:30:10 +02:00
Viktor Lofgren	499deac2ef	(slop) Fix test that broke when we split get into int get() and long getLong()	2024-07-28 21:20:37 +02:00
Viktor Lofgren	9685993adb	(loader) Add spans to a different column group from spanCodes, as they are not in sync	2024-07-28 21:20:09 +02:00
Viktor Lofgren	261dcdadc8	(loader) Additional tracking for the control GUI	2024-07-28 21:19:45 +02:00
Viktor Lofgren	314a901bf0	(slop) Clean up build.gradle from unnecessary copy-paste garbage	2024-07-28 13:22:20 +02:00
Viktor Lofgren	1caad7e19e	(slop) Update existing code to use the altered Slop interfaces	2024-07-28 13:21:08 +02:00
Viktor Lofgren	e585116dab	(slop) Add 32 bit read method for Varint along with the old 64 bit version	2024-07-28 13:20:18 +02:00
Viktor Lofgren	40f42bf654	(slop) Add signed 16 bit column type "short"	2024-07-28 13:19:44 +02:00
Viktor Lofgren	eaf7fbb9e9	(slop) Improve Conveniences for Enum * New fixed width 8 bit version of Enum * Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn	2024-07-28 13:19:15 +02:00
Viktor Lofgren	d05a2e57e9	(index-forward) Spans Writer should not be in the index page loop context	2024-07-27 15:17:04 +02:00
Viktor Lofgren	f8684118f3	(slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations Added a test that should find any additional broken implementations, as it's very important that this function is correct.	2024-07-27 14:35:30 +02:00
Viktor Lofgren	2e1f669aea	(slop) Remove additional vestigial seek() implementations	2024-07-27 14:35:30 +02:00
Viktor Lofgren	6c3abff664	(slop) Move GCS Slop column to the coded-sequence package This lets the slop library be stand-alone without dependence on coded-sequence. The change also gets rid of the vestigial seek() method in ColumnReader.	2024-07-27 13:58:45 +02:00
Viktor Lofgren	dcb43a3308	(slop) Introduce table concept to keep track of positions and simplify closing The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip. The second most common error is forgetting to close one of the columns in a reader or writer. To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.	2024-07-27 13:47:47 +02:00
Viktor Lofgren	ec600b967d	(crawler) Adjust domain locking Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.	2024-07-27 11:54:46 +02:00
Viktor Lofgren	aebb2652e8	(wip) Extract and encode spans data Refactoring keyword extraction to extract spans information. Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions. This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.	2024-07-27 11:44:13 +02:00
Viktor Lofgren	52a9a0d410	(slop) Translate nulls to empty strings when passed to the StringColumnWriters.	2024-07-25 18:26:41 +02:00
Viktor Lofgren	4123e99469	(slop) Handle empty compressed files correctly The CompressingStorageReader would incorrectly report having data when a file was empty. Preemptively attempting to fill the backing buffer fixes the behavior.	2024-07-25 18:26:13 +02:00
Viktor Lofgren	51a8a242ac	(slop) First commit of slop library Slop is a low-abstraction data storage convention for column based storage of complex data.	2024-07-25 15:08:41 +02:00
Viktor Lofgren	60ef826e07	(loader) Add heartbeat to update domain-ids step	2024-07-25 15:08:41 +02:00
Viktor Lofgren	2ad564404e	(loader) Add heartbeat to update domain-ids step	2024-07-23 15:28:52 +02:00
Viktor Lofgren	2bb9f18411	(dld) Refactor DocumentLanguageData Reduce the usage of raw arrays	2024-07-19 12:24:55 +02:00
Viktor Lofgren	7a1edc0880	(term-freq) Reduce the number of low-relevance words in the dictionary Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.	2024-07-19 12:23:28 +02:00
Viktor Lofgren	b812e96c6d	(language-processing) Select the appropriate language filter The incorrect filter was selected based on the provided parameter, this has been corrected.	2024-07-19 12:22:32 +02:00
Viktor Lofgren	22b35d5d91	(sentence-extractor) Add tag information to document language data Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers. The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.	2024-07-18 15:57:48 +02:00
Viktor Lofgren	d36055a2d0	(keyword-extractor) Retire TfIdfHigh WordFlag This will bring the word flags count down to 8, and let us pack every value in a byte.	2024-07-17 13:54:39 +02:00
Viktor Lofgren	0d227f3543	(cleanup) Remove next-prime library only used in tests	2024-07-17 13:48:03 +02:00
Viktor Lofgren	accc598967	(crawler) Add 1 second pause after probing domain to reduce request pressure	2024-07-16 16:55:07 +02:00
Viktor Lofgren	02c4a2d4ba	(crawler) Add a per-domain mutex for crawling To let up the pressure on domains with lot sof subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.	2024-07-16 16:44:59 +02:00
Viktor Lofgren	6665e447aa	(crawler) Add crawl delays around probe call and deal with 429:s properly during this phase	2024-07-16 15:33:24 +02:00
Viktor Lofgren	f4d79c203d	(crawler) Adjust revisit logic The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed. Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.	2024-07-16 15:12:38 +02:00
Viktor Lofgren	4d29581ea4	(crawler) Introduce absolute upper limit to crawl depth growth	2024-07-16 14:40:45 +02:00
Viktor Lofgren	0b31c4cfbb	(coded-sequence) Replace GCS usage with an interface	2024-07-16 14:37:50 +02:00
Viktor Lofgren	5c098005cc	(index) Fix broken test Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.	2024-07-16 12:37:59 +02:00
Viktor Lofgren	ae87e41cec	(index) Fix rare BitReader.takeWhileZero bug Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty. The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte. Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.	2024-07-16 11:03:56 +02:00
Viktor Lofgren	dfd19b5eb9	(index) Reduce the number of abstractions around result ranking The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.	2024-07-16 08:18:54 +02:00
Viktor	8ed5b51a32	Merge branch 'master' into term-positions	2024-07-15 07:05:31 +02:00
Viktor Lofgren	9d0e5dee02	Fix gitignore issue .so files not to be ignored correctly.	2024-07-15 05:18:10 +02:00
Viktor Lofgren	ffd970036d	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.	2024-07-15 05:16:17 +02:00
Viktor Lofgren	fa162698c2	(term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter How'd This Ever Work? (tm) TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.	2024-07-15 05:15:30 +02:00
Viktor Lofgren	ad3857938d	(search-api, ranking) Update with new ranking parameters Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm. The change also cleans out several parameters that no longer filled any function.	2024-07-15 04:49:40 +02:00
Viktor Lofgren	179a6002c2	(coded-sequence) Add a callback for re-filling underlying buffer	2024-07-12 23:50:28 +02:00
Viktor Lofgren	d28fc86956	(index-prio) Add fuzz test for prio index	2024-07-11 19:22:36 +02:00
Viktor Lofgren	6303977e9c	(index-prio) Fail louder when size is 0 in PrioDocIdsTransformer We can't deal with this scenario and should complain very loudly	2024-07-11 19:22:05 +02:00
Viktor Lofgren	97695693f2	(index-prio) Don't increment readItems counter when the output buffer is full This behavior was causing the reader to sometimes discard trailing entries in the list.	2024-07-11 19:21:36 +02:00
Viktor Lofgren	1ab875a75d	(test) Correcting flaky tests Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	31881874a9	(coded-sequence) Correct indicator of next-value It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamam code. This is incorrect in this case, as we're able to provide a negative offset. Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	f090f0101b	(index-construction) Gather up preindex writes Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	9881cac2da	(index-reader) Correctly handle negative offset values When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.	2024-07-11 16:13:23 +02:00
Viktor Lofgren	12590d3449	(index-reverse) Added compression to priority index The priority index documents file can be trivially compressed to a large degree. Compression schema: ``` 00b -> diff docord (E gamma) 01b -> diff domainid (E delta) + (1 + docord) (E delta) 10b -> rank (E gamma) + domainid,docord (raw) 11b -> 30 bit size header, followed by 1 raw doc id (61 bits) ```	2024-07-11 16:13:23 +02:00
Viktor Lofgren	abf7a8d78d	(coded-sequence) Correct implementation of Elias gamma Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.	2024-07-10 14:28:28 +02:00
Viktor Lofgren	ecfe17521a	(coded-sequence) Correct implementation of Elias gamma The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta; and cleans up the interface a bit.	2024-07-09 17:28:21 +02:00
Viktor Lofgren	0d29e2a39d	(index-reverse) Entry Sources reset() their LongQueryBuffer Previously this was the responsibility of the caller, which lead to the possibility of passing in improperly prepared buffers and receiving bad outcome	2024-07-09 01:39:40 +02:00
Viktor Lofgren	12a2ab93db	(actor) Improve error messages for convert-and-load Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.	2024-07-08 19:19:30 +02:00
Viktor Lofgren	d90bd340bb	(index-reverse) Removing btree indexes from prio documents file Btree index adds overhead and disk space and doesn't fill any function for the prio index. * Update finalize logic with a new IO transformer that copies the data and prepends a size * Update the reader to read the new format * Added a test	2024-07-08 17:20:17 +02:00
Viktor Lofgren	21afe94096	(index-reverse) Don't use 128 bit merge function for prio index	2024-07-07 21:36:10 +02:00
Viktor Lofgren	fa36689597	(index-reverse) Simplify priority index * Do not emit a documents file * Do not interlace metadata or offsets with doc ids	2024-07-06 18:04:08 +02:00
Viktor Lofgren	85c99ae808	(index-reverse) Split index construction into separate packages for full and priority index	2024-07-06 15:44:47 +02:00
Viktor Lofgren	a4ecd5f4ce	(minor) Fix non-compiling test due to previous refactor	2024-07-06 15:11:43 +02:00
Viktor Lofgren	6401a513d7	(crawl) Fix onsubmit confirm dialog for single-site recrawl	2024-07-05 17:21:03 +02:00
Viktor Lofgren	d86926be5f	(crawl) Add new functionality for re-crawling a single domain	2024-07-05 15:31:55 +02:00
Viktor Lofgren	a6b03a66dc	(crawl) Reduce Charset.forName() object churn Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again and Charset.forName(...) can be surprisingly expensive and its built-in caching strategy, which just caches the 2 last values seen doesn't cope well with how we're hitting it with a wide array of random charsets	2024-07-04 20:49:07 +02:00
Viktor Lofgren	d023e399d2	(index) Remove unnecessary allocations in journal reader The term data iterator is quite hot and was performing buffer slice operations that were not necessary. Replacing with a fixed pointer alias that can be repositioned to the relevant data. The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped. Removed this unnecessary step and move to copying the buffer directly instead.	2024-07-04 15:38:22 +02:00
Viktor Lofgren	e8ab1e14e0	(keyword-extraction) Update upper limit to number of positions per word After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.	2024-07-02 20:52:32 +02:00
Viktor Lofgren	a6e15cb338	(keyword-extraction) Update upper limit to number of positions per word 100 was a bit too low, let's try 256.	2024-06-30 22:46:56 +02:00
Viktor Lofgren	4fbb863a10	(keyword-extraction) Add upper limit to number of positions per word Also adding some logging for this event to get a feel for how big these lists get with realistic data. To be cleaned up later.	2024-06-30 22:41:38 +02:00
Viktor Lofgren	6ee4d1eb90	(keyword) Increase the work area for position encoding The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.	2024-06-28 16:42:39 +02:00
Viktor Lofgren	738e0e5fed	(process) Add option for automatic profiling The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory. The change also adds a JVM parameter that makes it shut up about native access.	2024-06-27 13:58:36 +02:00
Viktor Lofgren	0e4dd3d76d	(minor) Remove accidentally committed debug printf	2024-06-27 13:40:53 +02:00
Viktor Lofgren	10fe5a78cb	(log) Prevent tests from trying to log to file They would never have succeeded, but it adds an annoying preamble of error spam in the console window.	2024-06-27 13:19:48 +02:00
Viktor Lofgren	975b8ae2e9	(minor) Tidy code	2024-06-27 13:15:31 +02:00
Viktor Lofgren	935234939c	(test) Add query parsing to IntegrationTest	2024-06-27 13:15:20 +02:00
Viktor Lofgren	87e38e6181	(search-query) refac: Move query factory	2024-06-27 13:14:47 +02:00
Viktor Lofgren	f73fc8dd57	(search-query) Fix end-inclusion bug in QWordGraphIterator	2024-06-27 13:13:42 +02:00
Viktor Lofgren	3faa5bf521	(search-query) Tidy up QueryGRPCService and IndexClient	2024-06-26 14:03:30 +02:00
Viktor Lofgren	6973712480	(query) Tidy up code	2024-06-26 13:40:06 +02:00
Viktor Lofgren	02df421c94	(*) Trim the stopwords list Having an overlong stopwords list leads to quoted terms not performing well. For now we'll slash it to just "a" and "the".	2024-06-26 12:22:57 +02:00
Viktor Lofgren	95b9af92a0	(index) Implement working optional TermCoherences	2024-06-26 12:22:06 +02:00
Viktor Lofgren	8ee64c0771	(index) Correct TermCoherence requirements	2024-06-25 22:18:10 +02:00
Viktor Lofgren	b805f6daa8	(gamma) Fix readCount() behavior in EGC	2024-06-25 22:17:54 +02:00
Viktor Lofgren	dae22ccbe0	(test) Integration test from crawl->query	2024-06-25 22:17:26 +02:00
Viktor Lofgren	9d00243d7f	(index) Partial re-implementation of position constraints	2024-06-24 15:55:54 +02:00
Viktor Lofgren	5461634616	(doc) Add readme.md for coded-sequence library This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.	2024-06-24 14:28:51 +02:00
Viktor Lofgren	40bca93884	(gamma) Minor clean-up	2024-06-24 13:56:43 +02:00
Viktor Lofgren	b798f28443	(journal) Fixing journal encoding Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.	2024-06-24 13:56:27 +02:00
Viktor Lofgren	fff2ce5721	(gamma) Correctly decode zero-length sequences	2024-06-24 13:11:41 +02:00
Viktor	08ff79827e	Merge branch 'master' into security-scan	2024-06-17 13:18:25 +02:00
Viktor Lofgren	d0d6bb173c	(control) Fix warc data http status filter default value	2024-06-17 12:40:25 +02:00
Viktor Lofgren	90744433c9	Merge branch 'master' into security-scan # Conflicts: # code/libraries/array/cpp/resources/libcpp.so	2024-06-13 13:14:47 +02:00
Jaseem Abid	0dd14a4bd0	Specify C++ standard in build command The default C++ language standard on macOS is gnu++98, which won't build this module. Full error: ``` > Task :code:libraries:array:cpp:compileCpp FAILED src/main/cpp/cpphelpers.cpp:28:5: error: expected expression [](const p64x2& fst, const p64x2& snd) { ^ ```	2024-06-12 12:47:10 +01:00
Jaseem Abid	9974b31a09	Don't track build files(libcpp.so) with git	2024-06-12 12:45:49 +01:00
Viktor Lofgren	0ffbbaf4b9	(crawler) Update WARC builder to use SHA-256 for digests	2024-06-12 09:14:12 +02:00
Viktor Lofgren	6839415a0b	(crawler) Fetch TLS instead of SSL context	2024-06-12 09:07:54 +02:00
Viktor Lofgren	55f3ac4846	(atags) Fix duckdb SQL injection The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.	2024-06-12 09:05:57 +02:00
Viktor Lofgren	801cf4b5da	(search) Fix bad practice usage of innerHTML to set what should be text content.	2024-06-12 08:59:40 +02:00
Viktor Lofgren	e0459d0c0d	(build) Upgrade parquet dependencies to 1.14.0 This gets rid of a vulnerable transitive dependency.	2024-06-12 08:57:22 +02:00
Viktor Lofgren	23759a7243	(loader) Correctly clamp document size	2024-06-10 18:29:14 +02:00
Viktor Lofgren	55b2b7636b	(loader) Correctly load the positions column in the keyword projection	2024-06-10 18:27:15 +02:00
Viktor Lofgren	36160988e2	(index) Integrate positions data with indexes WIP This change integrates the new positions data with the forward and reverse indexes. The ranking code is still only partially re-written.	2024-06-10 15:09:06 +02:00
Viktor Lofgren	9f982a0c3d	(index) Integrate positions file properly	2024-06-06 16:45:42 +02:00
Viktor Lofgren	dcbec9414f	(index) Fix non-compiling tests	2024-06-06 16:35:09 +02:00
Viktor Lofgren	a07cf1ba93	(array/cpp) Update gitignore to properly exclude libcpp.so	2024-06-06 13:06:08 +02:00
Viktor Lofgren	4a8afa6b9f	(index, WIP) Position data partially integrated with forward and reverse indexes. There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.	2024-06-06 12:54:52 +02:00
Sam Storment	9c06f446fb	(search) Styling tweaks. Make the filter button near the top right corener a bit bigger so it's easier to press on mobile	2024-06-05 19:55:17 -05:00
Sam Storment	2d076cbd67	(search) move data-has-js attribute from body to html element	2024-06-05 18:20:33 -05:00
Sam Storment	fb2eef24d6	Handle themeing when javascript is disabled. Hide the theme select and fallback to dark media query instead of data-theme attribute	2024-06-03 14:15:35 -05:00
Sam Storment	e2f68d9ccf	Add a theme select to the header that lets users toggle their theme independent of their OS theme	2024-06-02 21:02:52 -05:00
Viktor Lofgren	d4f4d751c0	Merge remote-tracking branch 'origin/master'	2024-06-02 16:30:41 +02:00
Viktor Lofgren	b4eac2516e	(crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results	2024-06-02 16:30:34 +02:00
Viktor	4435f6245c	Merge pull request #94 from samstorment/search-dark-theme Search Dark Theme	2024-06-02 16:21:52 +02:00
Viktor Lofgren	9b922af075	(converter) Amend existing modifications to use gamma coded positions lists ... instead of serialized RoaringBitmaps as was the initial take on the problem.	2024-05-30 14:20:36 +02:00
Viktor Lofgren	0112ae725c	(gamma) Implement a small library for Elias gamma coding an integer sequence	2024-05-30 14:19:13 +02:00
Viktor Lofgren	619392edf9	(keywords) Add position information to keywords	2024-05-28 16:54:53 +02:00
Viktor Lofgren	0894822b68	(converter) Add position information to serialized document data This is not hooked in yet, and the term metadata is still left intact. It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.	2024-05-28 14:18:03 +02:00
Viktor Lofgren	a69ab311c7	(qword) Fix tests that broke due to stopword removal	2024-05-28 14:15:45 +02:00
Viktor Lofgren	6985ab762a	(query) Improve handling of stopwords in queries	2024-05-23 20:50:55 +02:00
Viktor Lofgren	0e8300979b	(search) Update the no result text to request bug reports.	2024-05-23 20:18:16 +02:00
Viktor Lofgren	0b60411e5f	(query) Bugfix stopword issue Add a new rule that crates an alternative path that omits a word if it's a stopword. In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.	2024-05-23 20:15:14 +02:00
Viktor Lofgren	f83f777fff	(converter) Experimental support for searching by URL Add up to synthetic 128 keywords per document, corresponding to links to other websites.	2024-05-23 17:10:57 +02:00
Viktor Lofgren	89aae93e60	(*) Lift jetty and guava-dependencies	2024-05-23 14:20:01 +02:00
Viktor Lofgren	65b74f9cab	(registry) Fix broken test	2024-05-23 14:15:01 +02:00
Sam Storment	7543e98035	Merge branch 'MarginaliaSearch:master' into search-dark-theme	2024-05-22 18:06:37 -05:00
Viktor Lofgren	59ec70eb73	(*) Clean up code related to crawl parquet inspection	2024-05-22 12:55:08 +02:00
Viktor Lofgren	365229991b	(control) Improve pagination for crawl data inspector	2024-05-21 19:44:48 +02:00
Viktor Lofgren	959a8e29ee	(control) Improve pagination for crawl data inspector	2024-05-21 19:27:25 +02:00
Viktor Lofgren	197c82acd4	(control) Add filter functionality for crawl data inspector	2024-05-21 19:05:44 +02:00
Viktor Lofgren	9539fdb53c	(control) Clean up UX for crawl data inspector	2024-05-21 18:27:24 +02:00
Sam Storment	5659df4388	(search) Set link and form field colors manually to override browser defaults with poor dark mode contrast	2024-05-21 00:03:46 -05:00
Viktor Lofgren	24bf29d369	(*) Upgrade opennlp and deprecate the monkey patched version of the code as it's no longer needed	2024-05-20 18:03:21 +02:00
Viktor Lofgren	17dc00d05f	(control) Partial implementation of inspection utility for crawl data Uses duckdb and range queries to read the parquet files directly from the index partitions. UX is a bit rough but is in working order.	2024-05-20 18:02:46 +02:00
Viktor Lofgren	4fcd4a8197	(index) Refactor to reduce the level of indirection	2024-05-19 12:40:33 +02:00
Viktor Lofgren	daf2a8df54	(btree) Roll back optimization of queryDataWithIndex It had been previously assumed that re-writing this function in the style of retain() would make it faster, but it had the opposite effect. The reason why retain is so fast due to properties of the data that hold true when intersecting document lists, where long runs of adjacent documents are expected, but not when looking up the data associated with the already intersected documents, where the data is more sparse.	2024-05-19 11:29:28 +02:00
Sam Storment	43489c98d8	(search) Minor dark theme tweaks after the new mocked UI elements were added	2024-05-19 01:06:54 -05:00
Viktor Lofgren	88997a1c4f	(btree) Clean up code	2024-05-18 18:38:46 +02:00
Viktor Lofgren	d12c77305c	(btree) Clean up code	2024-05-18 18:03:17 +02:00
Viktor Lofgren	ab4e2b222e	(array) Fix broken benchmarks	2024-05-18 13:41:24 +02:00
Viktor Lofgren	b867eadbef	(big-string) Remove the unused bigstring library	2024-05-18 13:40:03 +02:00
Viktor Lofgren	19163fa883	(array) Clean up the Array library IntArray gets the YAGNI axe. The array library had two implementations, one for longs which was used, and one for ints, which only ever saw bit rot. Removing the latter, as all it ever did was clutter up the codebase and add technical debt. If we need int arrays, we fork LongArray again (or add int capabilities to it) Also cleaning up the interfaces, removing layers of redundant abstractions and adding javadocs. Finally adding sz=2 specializations to the quick- and insertion sort algorithms. It seems the JIT isn't optimizing these particularly well, this is an attempt to help it out a bit.	2024-05-18 13:23:06 +02:00
Sam Storment	a7c33809c4	Merge branch 'master' into search-dark-theme	2024-05-17 22:52:19 -05:00
Viktor Lofgren	650f3843bb	(array) Clean up search function jungle Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values. Replaced binary search function with a branchless version that is much faster. Cleaned up benchmark code.	2024-05-17 14:31:02 +02:00
Viktor Lofgren	9e766bc056	(array) Clean up search function jungle Retire search functions that weren't used, including the native implementations. Drop confusing suffixes on search function names. Search functions no longer encode search misses as negative values. Replaced binary search function with a branchless version that is much faster. Cleaned up benchmark code.	2024-05-17 14:30:06 +02:00
Viktor Lofgren	48aff52e00	(array) Increase LongArray on-heap alignment to 16 bytes This primarily affects benchmarks, making performance more consistent for the 128 bit operations, as the system mostly works with memory mapped data.	2024-05-16 19:12:36 +02:00
Viktor Lofgren	9d7616317e	(array) Clean up native code a bit	2024-05-16 14:47:10 +02:00
Viktor Lofgren	d227a09fb1	(search) Extend paperdoll service mock with site info data and screenshots It's a bit of a hack job but will do, random exploration is available but only through a "browse:random"-style query	2024-05-15 12:40:55 +02:00
Viktor Lofgren	f48cf77c4d	(array, experimental) Add benchmark results for quicksort	2024-05-14 18:15:30 +02:00
Viktor Lofgren	3549be216f	(array, experimental) Documentation for native algos	2024-05-14 17:43:05 +02:00
Viktor Lofgren	c3e3a3dbc5	(search) Fix problem list in clustered search results	2024-05-14 13:05:52 +02:00
Viktor Lofgren	55a7c1db00	(array, experimental) Call C++ helper methods to do some low level stuff a bit faster than is possible with Java	2024-05-14 12:54:14 +02:00
Sam Storment	bb315221ab	(search, WIP) Make the dark theme look generally nicer. Rename CSS custom properties a bit. Switch a lot of background colors to HSL to make it easy to change colors relative to one another.	2024-05-14 01:32:40 -05:00
Sam Storment	c38766c5a6	(search, WIP) Convert SCSS variables to CSS custom properties for dynamic theming	2024-05-08 22:13:24 -05:00
Viktor Lofgren	c837321df1	(search) Provide a notification when no search results are found.	2024-05-06 20:11:39 +02:00
Viktor Lofgren	af7f6b89ec	(search) Delete vestigial stylesheet from the old design.	2024-05-06 19:52:29 +02:00
Viktor Lofgren	29a4d3df23	(search) Imrpove search-service paperdoll by mocking suggestions and news	2024-05-06 19:52:13 +02:00
Viktor Lofgren	7d1cafc070	(control) Add skip link for navigation in control GUI	2024-05-04 12:36:44 +02:00
Viktor Lofgren	5951c67a8b	(search) Center the search results page	2024-05-04 12:23:21 +02:00
Viktor Lofgren	c454007730	(search) Increase contrast for some UI elements	2024-05-04 12:02:52 +02:00
Viktor Lofgren	4e49cca43d	(search) Clean up SCSS code a bit	2024-05-04 11:58:54 +02:00
Viktor Lofgren	49a8c06095	(search) Improve contrast for text on random button	2024-05-04 11:51:19 +02:00
Viktor Lofgren	d01d9fa670	(search) Add screenreader-specific notification remark about when search results start.	2024-05-04 11:41:06 +02:00
Viktor Lofgren	a53a32f006	(search) Spell out website problems with "atomic elements" instead of having a hover that's inaccessible with keyboard navigation	2024-05-04 11:41:05 +02:00
Viktor Lofgren	3548d54cf6	(search) Add a screenreader-only alert when the search filters are updated to make it easier to understand what happens.	2024-05-04 11:41:04 +02:00
Viktor Lofgren	01f242ac7e	(search) Add stylesheet class for screenreader-only items	2024-05-04 11:41:03 +02:00
Viktor Lofgren	2840d9d403	(search) Add screenreader-only positions count text to search results	2024-05-04 11:41:03 +02:00
Viktor Lofgren	9fecfc5025	(search) Add autocomplete attribute to search-form	2024-05-04 11:41:02 +02:00
Viktor Lofgren	1b901e01f2	(search) Add bypass link that skips navigation	2024-05-04 11:41:01 +02:00
Viktor Lofgren	974aa35558	(search) Add proper alt-text to random exploration mode	2024-05-04 11:41:00 +02:00
Viktor Lofgren	4021a0ae98	(search) Add en-US language tags to all templates	2024-05-04 11:40:59 +02:00
Viktor Lofgren	b7a95be731	(search) Create a small mocking framework for running the search service in isolation.	2024-05-04 11:40:59 +02:00
Viktor Lofgren	616649f040	(logs) Fix logdir location	2024-05-04 11:40:59 +02:00
Viktor Lofgren	6087f9635c	(qs) Move index.html out of public directory It was put there to simulate the /public interface paradigm that is now deprecated.	2024-05-01 12:56:12 +02:00
Viktor Lofgren	2ad0bfda1e	(*) Fix boot orchestration for the services This corrects an annoying bug that had the system crash and burn on first start-up due to a race condition in service initialization, where the services were attempting to access the database before it was properly migrated. A fix was in principle already in place, but it was running too late and did not prevent attempts to access the as-yet uninitialized database. Move the first boot check into the MainClass instead of the Service constructor. The change also adds more appropriate docker dependencies to the services to fix rare errors resolving the hostname of the database.	2024-05-01 12:39:48 +02:00
Viktor Lofgren	08f8b6e022	(system) Log loaded properties to the console	2024-04-30 18:29:11 +02:00
Viktor Lofgren	800ed6b1e9	(zk) Terminately immediately if zookeeper isn't found This makes debugging easier	2024-04-30 18:28:49 +02:00
Viktor Lofgren	908535a3a0	(single-service) Ensure single-service spawner can specify the node	2024-04-30 18:27:46 +02:00
Viktor Lofgren	7fe2ab6f39	(file-storage) Ensure file storage root location can be overridden when running outside of docker	2024-04-30 18:26:15 +02:00
Viktor Lofgren	c9ee0c909e	(download-sample) Set +x permissions on directories created during this job	2024-04-30 18:25:07 +02:00
Viktor Lofgren	38aedb50ac	(converter) Do not suppress exceptions in the converter	2024-04-30 18:24:35 +02:00
Viktor Lofgren	4772e0b59d	(service) Deprecate /public prefix on HTTP Before the gRPC migration, the system would serve both public and internal requests over HTTP, but distinguish the two using path prefixes and a few HTTP Headers (X-Public, X-Context) added by the reverse proxy to prevent misconfigurations. Since internal requests meaningfully no longer use HTTP, this convention is just an obstacle now, adding the need to always run the system behind a reverse proxy that rewrites the paths. The change removes the path prefix, and updates the docker templates to reflect the change. This will require a migration for existing systems.	2024-04-30 14:46:18 +02:00
Viktor Lofgren	70e2e41955	(crawler) Content type prober should not swallow exceptions	2024-04-27 18:27:23 +02:00
Viktor Lofgren	4d71c776fc	(crawler) Modify crawl set growth to grow small domains faster than larger ones	2024-04-27 17:36:27 +02:00
Viktor	2d49071e96	Merge branch 'master' into run-outside-docker	2024-04-25 18:53:26 +02:00
Viktor Lofgren	89889ecbbd	(single-service) Skip starting Prometheus if it's not explicitly enabled	2024-04-25 17:54:07 +02:00
Viktor Lofgren	c8ee354d0b	(log) Make log dir configurable via environment variable	2024-04-25 15:09:18 +02:00
Viktor Lofgren	4e5f069809	(build) Migrate ssr to the new root setting schema of java lang version	2024-04-25 15:08:56 +02:00
Viktor Lofgren	6690e9bde8	(service) Ensure the service discovery starts early This is necessary as we use zookeeper to orchestrate first-time startup of the services, to ensure that the database is properly migrated by the control service before anything else is permitted to start.	2024-04-25 15:08:33 +02:00
Viktor Lofgren	e4b34b6ee6	(index) Correctly detect the presence of an all-virtual path through the query	2024-04-25 14:01:46 +02:00
Viktor Lofgren	3952ef6ca5	(service) Let singleservice configure ports and bind addresses	2024-04-25 13:49:57 +02:00
Viktor Lofgren	7eb5e6aa66	(crawler) Abort recrawl if error count is too high	2024-04-24 21:46:40 +02:00
Viktor Lofgren	282022d64e	(crawler) Remove unnecessary double-fetch of the root document	2024-04-24 14:44:39 +02:00
Viktor Lofgren	91a98a8807	(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber	2024-04-24 14:44:39 +02:00
Viktor Lofgren	32fe864a33	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e1c9313396	(crawler) Emulate if-modified-since for domains that don't support the header This will help reduce the strain on some server software, in particular Discourse.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f430a084e8	(crawler) Remove accidental log spam	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a86b596897	(crawler) Code quality	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6dd87b0378	(crawler) Use the probe-result to reduce the likelihood of crawling both http and https This should drastically reduce the number of fetched documents on many domains	2024-04-24 14:44:39 +02:00
Viktor Lofgren	c9f029c214	(crawler) Strip W/-prefix from the etag when supplied as If-None-Match	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6b88db10ad	(crawler) Ensure all appropriate headers are recorded on the request	2024-04-24 14:44:39 +02:00
Viktor Lofgren	8a891c2159	(crawler/converter) Remove legacy junk from parquet migration	2024-04-24 14:44:39 +02:00
Viktor Lofgren	ad2ac8eee3	(query) Mark flaky test, correct assert on test	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f46733a47a	(ranking) TermCoherenceFactory should be run for size=2 queries	2024-04-24 14:44:39 +02:00
Viktor Lofgren	934167323d	(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	64baa41e64	(query) Always generate an ngram alternative, suppresses generation of multiple identical query branches	2024-04-24 14:44:39 +02:00
Viktor Lofgren	5165cf6d15	(ranking) Set regularMask correctly	2024-04-24 14:44:39 +02:00
Viktor Lofgren	4489b21528	(ranking) Cleanup	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f623b37577	(ranking) Suppress NaN:s in ranking output	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f4a2fea451	(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a748fc5448	(index, bugfix) Pass url quality to query service	2024-04-24 14:44:39 +02:00
Viktor Lofgren	0dcca0cb83	(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp	2024-04-24 14:44:39 +02:00
Viktor Lofgren	b80a83339b	(qs) Additional info in query debug UI	2024-04-24 14:44:39 +02:00
Viktor Lofgren	eb74d08f2a	(qs) Additional info in query debug UI	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e79ab0c70e	(qs) Basic query debug feature	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e419e26f3a	(proto) Improve handling of omitted parameters	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6102fd99bf	(qs) Improve logging	2024-04-24 14:44:39 +02:00
Viktor Lofgren	def36719d3	(query) Minor code cleanup	2024-04-24 14:44:39 +02:00
Viktor Lofgren	462aa9af26	(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a09c84e1b8	(query) Modify tokenizer to match the behavior of the sentence extractor This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	44b33798f3	(index) Clean up jaccard index term code and down-tune the parameter's importance a bit	2024-04-24 14:44:39 +02:00
Viktor Lofgren	2f0b648fad	(index) Add jaccard index term to boost results based on term overlap	2024-04-24 14:44:39 +02:00
Viktor Lofgren	de0e56f027	(index) Remove position overlap check, coherences will do the work instead	2024-04-24 14:44:39 +02:00
Viktor Lofgren	973ced7b13	(index) Omit absent terms from coherence checks	2024-04-24 14:44:39 +02:00
Viktor Lofgren	cb4b824a85	(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus	2024-04-24 14:44:39 +02:00
Viktor Lofgren	c583a538b1	(search) Add implicit coherence constraints based on segmentation	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e0224085b4	(index) Improve recall for small queries Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	44c1e1d6d9	(index) Remove dead code Since the performance fix in `3359f72239` had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	c620e9c026	(index) Experimental performance regression fix	2024-04-24 14:44:39 +02:00
Viktor Lofgren	1bb88968c5	(test) Fix broken test	2024-04-24 14:44:39 +02:00
Viktor Lofgren	df75e8f4aa	(index) Explicitly free LongQueryBuffers	2024-04-24 14:44:39 +02:00
Viktor Lofgren	adf846bfd2	(index) Fix term coherence evaluation The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	1748fcc5ac	(valuation) Impose stronger constraints on locality of terms Clean up logic a bit	2024-04-24 14:44:39 +02:00
Viktor Lofgren	08416393e0	(valuation) Impose stronger constraints on locality of terms	2024-04-24 14:44:39 +02:00
Viktor Lofgren	fce26015c9	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	155be1078d	(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	6efc0f21fe	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	f3255e080d	(ngram) Grab titles separately when extracting ngrams from wiki data	2024-04-24 14:44:39 +02:00
Viktor Lofgren	5f6a3ef9d0	(ngram) Correct \|s\|^\|s\|-normalization to use length and not count	2024-04-24 14:44:39 +02:00
Viktor Lofgren	afc4fed591	(ngram) Correct size value in ngram lexicon generation, trim the terms better	2024-04-24 14:44:39 +02:00
Viktor Lofgren	cb505f98ef	(ngram) Use simple blocking pool instead of FJP; split on underscores in article names.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	a0b3634cb6	(ngram) Only extract frequencies of title words, but use the body to increment the counters... The sign of the counter is used to indicate whether a term has appeared as title. Until it's seen in the title, it's provisionally saved as a negative count.	2024-04-24 14:44:39 +02:00
Viktor Lofgren	e23359bae9	(query, minor) Remove debug statement	2024-04-24 14:44:39 +02:00
Viktor Lofgren	5531ed632a	(query, minor) Remove debug statement	2024-04-24 14:44:39 +02:00
Viktor Lofgren	150ee21f3c	(ngram) Clean up ngram lexicon code This is both an optimization that removes some GC churn, as well as a clean-up of the code that removes references to outdated concepts.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	c96da0ce1e	(segmentation) Pick best segmentation using \|s\|^\|s\|-style normalization This is better than doing all segmentations possible at the same time.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	a0d9e66ff7	(ngram) Fix index range in NgramLexicon to an avoid exception	2024-04-24 14:44:38 +02:00
Viktor Lofgren	55f627ed4c	(index) Clean up the code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	7dd8c78c6b	(ngrams) Remove the vestigial logic for capturing permutations of n-grams The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	8bf7d090fd	(qs) Clean up parsing code using new record matching	2024-04-24 14:44:38 +02:00
Viktor Lofgren	6bfe04b609	(term-freq-exporter) Reduce thread count and memory usage	2024-04-24 14:44:38 +02:00
Viktor Lofgren	491d6bec46	(term-freq-exporter) Extract ngrams in term-frequency-exporter	2024-04-24 14:44:38 +02:00
Viktor Lofgren	4fb86ac692	(search) Fix outdated assumptions about the results We no longer break the query into "sets" of search terms and need to adapt the code to not use this assumption. For the API service, we'll simulate the old behavior to keep the API stable. For the search service, we'll introduce a new way of calculating positions through tree aggregation.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	6cba6aef3b	(minor) Remove dead code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	7e216db463	(index) Add origin trace information for index readers This used to be supported by the system but got lost in refactoring at some point.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	adc90c8f1e	(sentence-extractor) Fix resource leak in sentence extractor The code would always re-initialize the static ngramLexicon and rdrposTagger fields with new instances even if they were already instantiated, leading to a ton of unnecessary RAM allocation. The modified behavior checks for nullity before creating a new instance.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	e3316a3672	(index) Clean up new index query code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	a3a6d6292b	(qs, index) New query model integrated with index service. Seems to work, tests are green and initial testing finds no errors. Still a bit untested, committing WIP as-is because it would suck to lose weeks of work due to a drive failure or something.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	8cb9455c32	(qs, WIP) Fix edge cases in query compilation This addresses the relatively common case where the graph consists of two segments, such as x y, z w; in this case we want an output like (x_y) (z w \| z_w) \| x y (z_w). The generated output does somewhat pessimize a few other cases, but this one is arguably more important.	2024-04-24 14:44:38 +02:00
Viktor Lofgren	dc65b2ee01	(qs, WIP) Clean up dead code	2024-04-24 14:44:38 +02:00
Viktor Lofgren	98a1adbf81	(qs, WIP) Tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	0bd1e15cce	(qs, WIP) Tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	eda926767e	(qs, WIP) Tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	cd1a18c045	(qs, WIP) Break up code and tidy it up a bit	2024-04-24 14:44:38 +02:00
Viktor Lofgren	6f567fbea8	(qs, WIP) Fix output determinism, fix tests	2024-04-24 14:44:38 +02:00
Viktor Lofgren	0ebadd03a5	(WIP) Query rendering finally beginning to look like it works	2024-04-24 14:44:38 +02:00
Viktor Lofgren	2253b556b2	WIP	2024-04-24 14:44:17 +02:00
Viktor Lofgren	6a7a7009c7	(convert) Initial integration of segmentation data into the converter's keyword extraction logic	2024-04-24 14:44:17 +02:00
Viktor Lofgren	3c75057dcd	(qs) Retire NGramBloomFilter, integrate new segmentation model instead	2024-04-24 14:44:17 +02:00
Viktor Lofgren	212d101727	(control) GUI for exporting segmentation data from a wikipedia zim	2024-04-24 14:44:17 +02:00
Viktor Lofgren	760b80659d	(WIP) Partial integration of new query expansion code into the query-serivice	2024-04-24 14:44:17 +02:00
Viktor Lofgren	04879c005d	(WIP) Improve data extraction from wikipedia data	2024-04-24 14:44:17 +02:00
Viktor Lofgren	cb82927756	(WIP) Implement first take of new query segmentation algorithm	2024-04-24 14:44:17 +02:00
Viktor Lofgren	8b9629f2f6	(crawler) Remove unnecessary double-fetch of the root document	2024-04-24 14:38:59 +02:00
Viktor Lofgren	f6db16b313	(crawler) Reduce log noise from timeouts in SoftIfModifiedSinceProber	2024-04-24 14:10:03 +02:00
Viktor Lofgren	4668b1ddcb	(build) Java 22 and its consequences has been a disaster for Marginalia Search Roll back to JDK 21 for now, and make Java version configurable in the root build.gradle The project has run into no less than three distinct show-stopping bugs in JDK22, across multiple vendors, and gradle still doesn't fully support it, meaning you need multiple JDK versions installed.	2024-04-24 13:54:04 +02:00
Viktor Lofgren	dcf9d9caad	(crawler) Emulate if-modified-since for domains that don't support the header This will help reduce the strain on some server software, in particular Discourse.	2024-04-22 17:26:31 +02:00
Viktor Lofgren	7a69b76001	(crawler) Remove accidental log spam	2024-04-22 15:51:37 +02:00
Viktor Lofgren	ac07ef822f	(crawler) Code quality	2024-04-22 15:37:35 +02:00
Viktor Lofgren	e7d4bcd872	(crawler) Use the probe-result to reduce the likelihood of crawling both http and https This should drastically reduce the number of fetched documents on many domains	2024-04-22 15:36:43 +02:00
Viktor Lofgren	a28c6d7cfe	(crawler) Strip W/-prefix from the etag when supplied as If-None-Match	2024-04-22 14:31:05 +02:00
Viktor Lofgren	d816f048f5	(crawler) Ensure all appropriate headers are recorded on the request	2024-04-22 14:14:24 +02:00
Viktor Lofgren	b09ddd0036	(crawler/converter) Remove legacy junk from parquet migration	2024-04-22 12:34:28 +02:00
Viktor Lofgren	0a73b02a00	(query) Mark flaky test, correct assert on test	2024-04-21 12:30:14 +02:00
Viktor Lofgren	8769704462	(ranking) TermCoherenceFactory should be run for size=2 queries	2024-04-21 12:29:25 +02:00
Viktor Lofgren	214551f1df	(converter) Stopgap fix for some cases of lost crawl data due to HTTP 304. The root cause needs further investigation.	2024-04-19 20:36:01 +02:00
Viktor Lofgren	2cc74c005a	(query) Always generate an ngram alternative, suppresses generation of multiple identical query branches	2024-04-19 19:42:30 +02:00
Viktor Lofgren	ed250f57f2	(ranking) Set regularMask correctly	2024-04-19 14:31:57 +02:00
Viktor Lofgren	e92c25f7e0	(ranking) Cleanup	2024-04-19 14:13:12 +02:00
Viktor Lofgren	3ab563f314	(ranking) Suppress NaN:s in ranking output	2024-04-19 13:58:28 +02:00
Viktor Lofgren	426338cb45	(ranking, bugfix) Use bm25NgramWeight and not full weight for bM25N	2024-04-19 12:41:48 +02:00
Viktor Lofgren	5fa2375898	(index, bugfix) Pass url quality to query service	2024-04-19 12:41:26 +02:00
Viktor Lofgren	41782a0ab5	(index) Fix TCF bug where the ngram terms would be considered instead of the regular ones due to a logical derp	2024-04-19 12:19:26 +02:00
Viktor Lofgren	9b06433b82	(qs) Additional info in query debug UI	2024-04-19 12:18:53 +02:00
Viktor Lofgren	def607d840	(qs) Additional info in query debug UI	2024-04-19 11:46:27 +02:00
Viktor Lofgren	2b811fb422	(qs) Basic query debug feature	2024-04-19 11:00:56 +02:00
Viktor Lofgren	36cc62c10c	(proto) Improve handling of omitted parameters	2024-04-18 10:47:12 +02:00
Viktor Lofgren	975d92912c	(qs) Improve logging	2024-04-18 10:44:08 +02:00
Viktor Lofgren	8bbaf457de	(query) Minor code cleanup	2024-04-18 10:37:51 +02:00
Viktor Lofgren	7641a02f31	(query) Update ranking parameters with new variables for bm25 ngrams and tcf mutual jaccard The change also makes it so that as long as the values are defaults, they don't need to be sent over the wire and decoded.	2024-04-18 10:36:15 +02:00
Viktor Lofgren	ce16239e34	(query) Modify tokenizer to match the behavior of the sentence extractor This must match, otherwise a query like "plato's republic" won't match the indexed keywords, since they would strip the possessive.	2024-04-17 17:54:32 +02:00
Viktor Lofgren	d64bd227cf	(index) Clean up jaccard index term code and down-tune the parameter's importance a bit	2024-04-17 17:40:16 +02:00
Viktor Lofgren	c5ab0a9054	(index) Add jaccard index term to boost results based on term overlap	2024-04-17 16:50:26 +02:00
Viktor Lofgren	dac948973d	(index) Remove position overlap check, coherences will do the work instead	2024-04-17 14:20:01 +02:00
Viktor Lofgren	9d008d1d6f	(index) Omit absent terms from coherence checks	2024-04-17 14:12:16 +02:00
Viktor Lofgren	f52457213e	(index) Split ngram and regular keyword bm25 calculation and add ngram score as a bonus	2024-04-17 14:05:02 +02:00
Viktor Lofgren	579295a673	(search) Add implicit coherence constraints based on segmentation	2024-04-17 14:03:35 +02:00
Viktor Lofgren	af8ff8ce99	(index) Improve recall for small queries Partially reverse the previous commit and add a query head for the priority index when there are few query interpretations.	2024-04-16 22:51:03 +02:00
Viktor Lofgren	7fa3e86e64	(index) Remove dead code Since the performance fix in `3359f72239` had a huge positive impact without reducing result quality, it's possible to remove the QueryBranchWalker and associated code.	2024-04-16 19:59:27 +02:00
Viktor Lofgren	3359f72239	(index) Experimental performance regression fix	2024-04-16 19:48:14 +02:00
Viktor Lofgren	41fa154aa6	(test) Fix broken test	2024-04-16 19:48:14 +02:00
Viktor Lofgren	deaba0152d	(index) Explicitly free LongQueryBuffers	2024-04-16 19:23:00 +02:00
Viktor Lofgren	feaef6093e	(index) Fix term coherence evaluation The code was incorrectly using the documentId instead of the combined id, resulting in almost all result sets being incorrectly seen as zero.	2024-04-16 18:07:43 +02:00
Viktor Lofgren	078fa4fdd0	(valuation) Impose stronger constraints on locality of terms Clean up logic a bit	2024-04-16 17:22:58 +02:00
Viktor Lofgren	2dc77a0638	(valuation) Impose stronger constraints on locality of terms	2024-04-16 17:15:21 +02:00
Viktor Lofgren	f434a8b492	(build) Upgrade jib plugin version	2024-04-16 15:25:23 +02:00
Viktor Lofgren	d2658d6f84	(sys) Add springboard service that can spawn multiple different marginalia services to make distribution easier.	2024-04-16 13:25:15 +02:00
Viktor Lofgren	8c559c8121	(conf) Add additional logic for discovering system root	2024-04-16 12:37:18 +02:00
Viktor Lofgren	2353c73c57	(encyclopedia) Index the full articles Previously, in an experimental change, only the first paragraph was indexed, intended to reduce the amount of noisy tangential hits. This was not a good idea, so the change is reverted.	2024-04-16 12:10:13 +02:00
Viktor Lofgren	599e719ad4	(index) Fix priority search terms This functionality fell into disrepair some while ago. It's supposed to allow non-mandatory search terms that boost the ranking if they are present in the document.	2024-04-15 16:44:08 +02:00
Viktor Lofgren	b6d365bacd	(index) Clean up data model The change set cleans up the data model for the term-level data. This used to contain a bunch of fields with document-level metadata. This data-duplication means a larger memory footprint and worse memory locality. The ranking code is also modified to not accept SearchResultKeywordScores, but rather CompiledQueryLong and CqDataInts containing only the term metadata and the frequency information needed for ranking. This is again an effort to improve memory locality.	2024-04-15 16:04:07 +02:00
Viktor Lofgren	52f0c0d336	(ngram) Grab titles separately when extracting ngrams from wiki data	2024-04-13 19:34:16 +02:00
Viktor Lofgren	fda1c05164	(ngram) Correct \|s\|^\|s\|-normalization to use length and not count	2024-04-13 18:05:30 +02:00

... 5 6 7 8 9 ...

1796 Commits