The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search. This is no bueno.
This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
To help offer verbatim matches for external link texts, these texts are assigned positions a bit after the point where the actual document ends. Integrating this information with the ranking is not performed here.
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
The first change runs index construction in parallel, which is how it was previously done; it had been changed to run sequentially to see how that would affect performance. Performance got worse, so that change is reverted.
It has been noted, though, that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so that part is changed to run sequentially.
This helps queries for domains that do not have a subdomain avoid dragging up subdomains as well, since subdomains are also given the special site:-keyword for their corresponding parent domain.
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today were typically shown first, followed by the rest of the files with the oldest date first.
This lets the slop library be stand-alone without dependence on coded-sequence.
The change also gets rid of the vestigial seek() method in ColumnReader.
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.
The second most common error is forgetting to close one of the columns in a reader or writer.
To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
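A minimal sketch of the idea, with hypothetical names (the real SlopTable API is not shown here): columns register themselves with a parent object, which verifies on close that they have all advanced by the same number of rows.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only -- the real SlopTable API likely differs.
// The idea: every column registers itself, and on close the registry
// verifies that all columns have advanced by the same number of rows.
class ColumnRegistry implements AutoCloseable {

    interface TrackedColumn extends AutoCloseable {
        long position();               // rows read or written so far
        @Override void close() throws IOException;
    }

    private final List<TrackedColumn> columns = new ArrayList<>();

    <T extends TrackedColumn> T register(T column) {
        columns.add(column);
        return column;
    }

    @Override
    public void close() throws IOException {
        long expected = columns.isEmpty() ? 0 : columns.get(0).position();
        boolean inSync = columns.stream().allMatch(c -> c.position() == expected);

        for (TrackedColumn column : columns) {
            column.close();            // close everything regardless
        }
        if (!inSync) {
            throw new IllegalStateException("Slop columns are out of sync");
        }
    }
}
```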
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress. Giving these a slightly larger allowance.
Refactoring keyword extraction to extract spans information.
Modifying the intermediate storage of converted data to use the new slop library, which allows for easier storage of ad-hoc binary data like spans and positions.
This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact. Will push as-is to get back to being able to do more isolated work.
The CompressingStorageReader would incorrectly report having data when a file was empty. Preemptively attempting to fill the backing buffer fixes the behavior.
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object. Separator information is encoded as a bit set instead of an array of integers.
The change also cleans up the SentenceExtractor class a fair bit. It no longer extracts ngrams, and a significant amount of redundant operations were removed as well. This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
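For the separator change, the gist is roughly the sketch below; the class and field names are made up for illustration and do not reflect the actual DocumentSentence layout. The point is one bit per token gap instead of a 32-bit int per gap.

```java
import java.util.BitSet;

// Hypothetical illustration: when the only distinction needed is
// "separated by whitespace" vs. "separated by something else",
// a BitSet (one bit per gap) replaces an int[] (32 bits per gap).
class SeparatorFlags {
    private final BitSet whitespaceSeparated;

    SeparatorFlags(int tokenCount) {
        whitespaceSeparated = new BitSet(tokenCount);
    }

    void markWhitespace(int gapIdx) {
        whitespaceSeparated.set(gapIdx);
    }

    boolean isWhitespaceSeparated(int gapIdx) {
        return whitespaceSeparated.get(gapIdx);
    }
}
```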
To let up the pressure on domains with lots of subdomains such as substack, medium, neocities, etc., a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
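In rough outline, the per-domain mutex could look like the sketch below; the class and method names are assumptions, but the point is that all crawl tasks for the same domain contend for the same lock object.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

// Hypothetical sketch: one semaphore per domain, so e.g. all
// *.substack.com crawl tasks contend for the same permit.
class DomainLocks {
    private final Map<String, Semaphore> locks = new ConcurrentHashMap<>();

    Semaphore lockFor(String topDomain) {
        // permits could be raised above 1 for large hosting providers
        return locks.computeIfAbsent(topDomain, k -> new Semaphore(1));
    }

    void withLock(String topDomain, Runnable crawlTask) throws InterruptedException {
        Semaphore lock = lockFor(topDomain);
        lock.acquire();
        try {
            crawlTask.run();
        } finally {
            lock.release();
        }
    }
}
```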
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.
Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
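In spirit, something like the sketch below, where every constant and threshold is made up for illustration rather than taken from the actual crawler: the smaller the fraction of changed documents on the last visit, the longer the next revisit interval, clamped between bounds that depend on the crawl set size.

```java
import java.time.Duration;

// Hypothetical illustration of change-reactive revisit intervals.
// None of these constants come from the actual crawler.
class RevisitPolicy {
    Duration nextInterval(Duration previous, int documentsFetched, int documentsChanged, int crawlSetSize) {
        double changeRatio = documentsFetched == 0
                ? 1.0
                : (double) documentsChanged / documentsFetched;

        // Largely unchanged sites back off harder; frequently changing sites revisit sooner
        double factor = changeRatio < 0.05 ? 2.0
                      : changeRatio < 0.25 ? 1.25
                      : 0.75;

        Duration candidate = Duration.ofSeconds((long) (previous.toSeconds() * factor));

        // Upper and lower limits scaled by how big the crawl set is
        Duration floor = crawlSetSize > 10_000 ? Duration.ofDays(7) : Duration.ofDays(1);
        Duration ceiling = Duration.ofDays(60);

        if (candidate.compareTo(floor) < 0) return floor;
        if (candidate.compareTo(ceiling) > 0) return ceiling;
        return candidate;
    }
}
```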
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer. This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.
The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.
Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
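For context, the failure mode is easiest to see in an illustrative reimplementation (this is not the actual buffer code): the zero-bit count has to carry across 64-bit word boundaries, and the refill has to happen even when a word is exhausted exactly at such a boundary.

```java
import java.nio.LongBuffer;

// Illustrative sketch only: count consecutive zero bits starting at the
// current bit position, refilling the 64-bit word buffer as needed.
// The actual codec and buffer classes differ in detail.
class ZeroRunReader {
    private final LongBuffer words;   // backing store of 64-bit words
    private long current;             // bits of the current word, MSB first
    private int bitsLeft;             // unread bits remaining in 'current'

    ZeroRunReader(LongBuffer words) {
        this.words = words;
        this.bitsLeft = 0;
    }

    int takeWhileZero() {
        int zeros = 0;
        for (;;) {
            if (bitsLeft == 0) {
                // Refill must also happen when the previous word was consumed
                // exactly at a 64-bit boundary; per the commit above, the real
                // code could intermittently fail to repopulate here.
                if (!words.hasRemaining()) return zeros;
                current = words.get();
                bitsLeft = 64;
            }
            int leading = Long.numberOfLeadingZeros(current);
            if (leading >= bitsLeft) {
                // the rest of this word is all zeros; consume it and continue
                zeros += bitsLeft;
                bitsLeft = 0;
            } else {
                zeros += leading;
                current <<= leading;   // shift so the next 1 bit is at the MSB
                bitsLeft -= leading;
                return zeros;
            }
        }
    }
}
```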
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
How'd This Ever Work? (tm)
TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning up multiple SEs at once.
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.
The change also cleans out several parameters that no longer served any purpose.
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamma code. This is incorrect in this case, as we're able to provide a negative offset. Changing to use Integer.MIN_VALUE as an indicator that a value is absent instead, as this value will never be used.
The priority index documents file can be trivially compressed to a large degree.
Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
The implementation was incorrectly using 1 bit more than it should. The change also adds a put method for Elias delta and cleans up the interface a bit.
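For reference, Elias gamma writes a value n >= 1 as floor(log2 n) zero bits followed by the binary representation of n, and Elias delta instead gamma-codes the bit length of n and then writes n without its leading 1 bit. A minimal sketch of put methods, against a hypothetical bit sink rather than the actual writer interface:

```java
// Sketch of Elias gamma/delta encoding against a hypothetical bit sink.
// putBits(value, width) is assumed to write the 'width' low bits of 'value',
// most significant first; it is not the actual writer interface.
interface BitSink {
    void putBits(long value, int width);
}

class EliasCodes {
    // Elias gamma: floor(log2 n) zero bits, then n in binary (n >= 1)
    static void putGamma(BitSink sink, long n) {
        int bits = 64 - Long.numberOfLeadingZeros(n);   // bit length of n
        sink.putBits(0, bits - 1);                      // unary length prefix
        sink.putBits(n, bits);                          // n itself, incl. leading 1
    }

    // Elias delta: gamma-code the bit length of n, then n without its leading 1
    static void putDelta(BitSink sink, long n) {
        int bits = 64 - Long.numberOfLeadingZeros(n);
        putGamma(sink, bits);
        sink.putBits(n & ~(1L << (bits - 1)), bits - 1); // strip the leading 1
    }
}
```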
Previously this was the responsibility of the caller, which led to the possibility of passing in improperly prepared buffers and receiving bad outcomes.
The Btree index adds overhead and disk usage, and serves no purpose for the prio index.
* Update finalize logic with a new IO transformer that copies the data and prepends a size (sketched after this list)
* Update the reader to read the new format
* Added a test
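A rough sketch of the size-prepending copy in the first item, using plain FileChannels rather than the index's own writer abstractions:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of a size-prepending copy step; the real transformer
// operates on the index's own IO abstractions, not raw FileChannels.
class SizePrependingCopy {
    static void copyWithSizeHeader(Path source, Path dest) throws IOException {
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dest,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING))
        {
            ByteBuffer header = ByteBuffer.allocate(Long.BYTES);
            header.putLong(in.size()).flip();
            out.write(header);                       // size header first

            long pos = 0;
            long size = in.size();
            while (pos < size) {                     // then the data verbatim
                pos += in.transferTo(pos, size - pos, out);
            }
        }
    }
}
```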
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again. Charset.forName(...) can be surprisingly expensive, and its built-in caching strategy, which just caches the 2 last values seen, doesn't cope well with how we're hitting it with a wide array of random charsets.
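Roughly along these lines (the fallback behavior here is an assumption, not necessarily what the converter does):

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of memoizing Charset lookups, so each distinct charset name
// only pays the Charset.forName() cost once.
class CharsetCache {
    private static final Map<String, Charset> CACHE = new ConcurrentHashMap<>();

    static Charset lookup(String name) {
        return CACHE.computeIfAbsent(name, n -> {
            try {
                return Charset.forName(n);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                return StandardCharsets.UTF_8;   // fallback choice is an assumption
            }
        });
    }
}
```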
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.
Replacing with a fixed pointer alias that can be repositioned to the relevant data.
The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.
Removed this unnecessary step and moved to copying the buffer directly instead.
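The "fixed pointer alias" idea, sketched with a plain ByteBuffer (the actual reader classes are not shown here): one duplicate of the backing buffer is created up front and repositioned per term, instead of allocating a new slice each time.

```java
import java.nio.ByteBuffer;

// Illustration only: rather than allocating a new slice per term, one
// duplicate of the backing buffer is created up front and its
// position/limit are moved to point at the relevant range.
class TermDataView {
    private final ByteBuffer alias;

    TermDataView(ByteBuffer backing) {
        this.alias = backing.duplicate();   // shares the data, no copy
    }

    ByteBuffer positionAt(int offset, int length) {
        alias.limit(offset + length);       // set limit first when moving forward
        alias.position(offset);
        return alias;                       // same object every call, repositioned
    }
}
```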
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns. By default, these are put in the log directory.
The change also adds a JVM parameter that makes it shut up about native access.
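The flags involved are standard JVM options; in spirit, the spawn logic might add something like the sketch below when the property is set. The 'system.profile' property name comes from the description above, the rest is illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of conditionally adding JFR profiling flags when spawning a child
// JVM.  The exact flags and log directory layout here are illustrative.
class ProfilingArgs {
    static List<String> jvmArgs(String processName, String logDir) {
        List<String> args = new ArrayList<>();

        if (Boolean.getBoolean("system.profile")) {
            args.add("-XX:StartFlightRecording=filename="
                    + logDir + "/" + processName + ".jfr,dumponexit=true");
        }

        // quiet the JDK's warnings about restricted native access
        args.add("--enable-native-access=ALL-UNNAMED");

        return args;
    }
}
```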
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
The default C++ language standard on macOS is gnu++98, which won't build
this module.
Full error:
```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
[](const p64x2& fst, const p64x2& snd) {
^
```