Commit Graph

1796 Commits

Author SHA1 Message Date
Viktor Lofgren
50ba8fd099 (query-parsing) Correct handling of trailing parentheses 2024-09-03 11:45:14 +02:00
Viktor Lofgren
99b3b00b68 (query-parsing) Merge QueryTokenizer into QueryParser and add escaping of query grammar 2024-09-03 11:35:32 +02:00
Viktor Lofgren
f6d981761d (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:24:05 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
7a69dff6cf (search) Correct handling of languages on fandom 2024-09-01 13:46:01 +02:00
Viktor Lofgren
bfb7ed2c99 (search) Translate cursed medium URLs to scribe.rip links via the search application 2024-09-01 13:32:14 +02:00
Viktor Lofgren
e19dc9b13e (search) Translate cursed fandom URLs to breezewiki links via the search application 2024-09-01 13:23:35 +02:00
Viktor Lofgren
74148c790e (crawler) Pull additional new domains from node-affinity 0
Previously a bit ambiguously defined, node affinity 0 is now indicative that a domain is up for grabs for the next crawler
2024-09-01 13:00:36 +02:00
Viktor Lofgren
3d77456110 (*) Add domain parking service to ip blocklist 2024-09-01 12:53:22 +02:00
Viktor Lofgren
ab6a4b1749 (control) Correct id value for domain addition tool 2024-09-01 12:25:15 +02:00
Viktor Lofgren
aeeb1d0cb7 (control) Add utility for adding domains from an external URL 2024-09-01 12:14:21 +02:00
Viktor Lofgren
185b79f2a5 (converter) Fix bug where sideloaded reddit content was errouneously categoriszed as wiki-generated. 2024-09-01 11:30:25 +02:00
Viktor Lofgren
8d0f9652c7 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:38:34 +02:00
Viktor Lofgren
5353805cc6 (crawler) Correct RSS-sitemap behavior 2024-08-31 11:37:09 +02:00
Viktor Lofgren
5407da5650 (crawler) Grab favicons as part of root sniff 2024-08-31 11:32:56 +02:00
Viktor Lofgren
b1bfe6f76e (control) New view for domains
Add capability to assign domains, and bulk-add new domains.
2024-08-30 17:06:48 +02:00
Viktor Lofgren
74e25370ca (control) New view for domains
Still a work in progress, but at this point it's possible to use for viewing domains
2024-08-29 15:40:40 +02:00
Viktor Lofgren
bb5d946c26 (index, EXPERIMENTAL) Clean up ranking code 2024-08-29 11:34:23 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
77efce0673 (paper-doll) Fix compilation 2024-08-26 12:51:29 +02:00
Viktor Lofgren
67a98fb0b0 (coded-sequence) Handle weird legacy HTML that puts everything in a heading 2024-08-26 12:49:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
fdf05cedae (index) Optimize DocumentSpan.countIntersections 2024-08-25 14:12:30 +02:00
Viktor Lofgren
9c5f463775 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:59:11 +02:00
Viktor Lofgren
893fae6d59 (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:51:43 +02:00
Viktor Lofgren
5660f291af (index) Optimize DocumentSpan.countIntersections 2024-08-25 13:43:29 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
Viktor Lofgren
d94373f4b1 (index) Optimize calculatePositionsMask 2024-08-25 13:24:37 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
982b03382b (index) Optimize DocumentSpan 2024-08-25 12:31:15 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6ce029b317 (index) Remove vestigial parameter 2024-08-25 12:14:12 +02:00
Viktor Lofgren
63e5b0ab18 (index) Correct weightedCounts calculations 2024-08-25 12:06:56 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
3fb3c0b92e (index) Optimize ranking calculations 2024-08-25 11:56:11 +02:00
Viktor Lofgren
aa2c960b74 (index) Optimize ranking calculations 2024-08-25 11:53:44 +02:00
Viktor Lofgren
4fbcc02f96 (index) Adjust sensible defaults for ranking parameters 2024-08-25 11:24:16 +02:00
Viktor Lofgren
9aa8f13731 (index) Remove tcfAvgDist ranking parameter
This is captured by tcfProximity already
2024-08-25 11:20:19 +02:00
Viktor Lofgren
65bee366dc (index) Try harmonic mean for avgMinDist 2024-08-25 11:11:52 +02:00
Viktor Lofgren
53700e6667 (index) Try harmonic mean for avgMinDist 2024-08-25 11:08:41 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
6eb0f13411 (index) Adjust handling of full phrase matches to prioritize full query matches over large partial matches 2024-08-25 10:54:04 +02:00
Viktor Lofgren
773377fe84 (index) Correct handling of full phrase match group 2024-08-25 10:48:34 +02:00
Viktor Lofgren
4372c8c835 (index) Give ranking components more consistent names 2024-08-25 10:44:27 +02:00
Viktor Lofgren
099133bdbc (index) Fix verbatim match score after moving full phrase group to a separate entity 2024-08-25 10:43:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
96bcf03ad5 (index) Address broken tests
They are still broken, but less so.
2024-08-25 10:34:36 +02:00
Viktor Lofgren
0999f07320 (search-query) Add new ranking parameters for proximity and verbatim matches 2024-08-25 10:34:12 +02:00
Viktor Lofgren
5d2b455572 (search) Clean up inconsistent usage of MathClient in SearchOperator
Also clean up SearchOperator and adjacent code
2024-08-24 10:39:31 +02:00
Viktor Lofgren
ea75ddc0e0 (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:50:52 +02:00
Viktor Lofgren
2db0e446cb (search) Absorb SearchQueryIndexService into SearchOperator, and clean up SearchOperator 2024-08-22 11:49:29 +02:00
Viktor Lofgren
557bdaa694 (search) Clean up SearchQueryIndexService and surrounding code 2024-08-22 11:45:28 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
266d6e4bea (slop) Replace SlopPageRef<T> with SlopTable.Ref<T> 2024-08-21 10:13:49 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
b0a874a842 (*) Upgrade slop library -> 0.0.5 2024-08-18 11:05:27 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
93652e0937 (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:55:48 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
03d5dec24c (*) Refactor termCoherences and rename them to phrase constraints. 2024-08-15 11:02:19 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, stopword handling is moved to the index where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
92522e8d97 (index) Attenuate bm25 score based on query length 2024-08-15 08:41:38 +02:00
Viktor Lofgren
049d94ce31 (index) Add body position match to qdebug fields 2024-08-15 08:39:37 +02:00
Viktor Lofgren
dbc6a95276 (index) Consume the new 'body' span in index to make it used in ranking 2024-08-15 08:33:43 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
2ad93ad41a (*) Clean up 2024-08-14 11:43:45 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
e6c8a6febe (index) Add index-side deduplication in selectBestResults 2024-08-10 10:51:59 +02:00
Viktor Lofgren
4ece5f847b (index) Add more qdebug factors 2024-08-10 10:45:30 +02:00
Viktor Lofgren
e4f04af044 (index) Give BODY matches a verbatim match value 2024-08-10 10:22:19 +02:00
Viktor Lofgren
b730b17f52 (index) Correct handling of firstPosition to avoid d/z 2024-08-10 10:21:59 +02:00
Viktor Lofgren
98c40958ab (index) Simplify verbatim match calculation 2024-08-10 09:54:56 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
4264fb9f49 (query-service) Clean up qdebug UI a bit 2024-08-10 09:51:03 +02:00
Viktor Lofgren
016a4c62e1 (index) Bugs and error fixes, chasing and fixing mystery results that did not contain all relevant keywords 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2f38c95886 (index) Backport bugfix from term-positions branch
The ordering of TermIdsList is assumed to be unchanged by the surrounding code, but the constructor sorts the dang list to be able to do contains() by binary search.  This is no bueno.

This is gonna be a merge conflict in the future, but it's too big of a bug to leave for another month.
2024-08-09 21:17:02 +02:00
Viktor Lofgren
df89661ed2 (index) In SearchResultItem, populate combinedId with combinedId and not its ranking-removed documentId cousin 2024-08-09 16:32:32 +02:00
Viktor Lofgren
41da4f422d (search-query) Always generate the "all"-segmentation 2024-08-09 13:20:00 +02:00
Viktor Lofgren
2e89b55593 (wip) Repair qdebug utility and show new ranking details 2024-08-09 12:57:25 +02:00
Viktor Lofgren
7babdb87d5 (index) Remove intermediate models 2024-08-07 10:10:44 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
f01267bc6b (index) Don't load fwd index offsets into a hash table at start.
This makes the service take forever to start up.  Memory map the data instead and binary search.  This is a bit slower, but not by much.
2024-08-06 11:16:28 +02:00
Viktor Lofgren
df6a05b9a7 (index) Avoid hypothetical divide-by-zero in tcfAvgDist 2024-08-06 10:55:57 +02:00
Viktor Lofgren
8569bb8e11 (index) Avoid divide-by-zero when minDist returns 0 2024-08-06 10:34:05 +02:00
Viktor Lofgren
ca6e2db2b9 (index) Include external link texts in verbatim score 2024-08-06 10:23:23 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c (slop) Update readme 2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b (slop) VarintLE implementation, correct enum8 column 2024-08-04 10:57:52 +02:00
Viktor Lofgren
ee49c01d86 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:47:23 +02:00
Viktor Lofgren
b21f8538a8 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:41:38 +02:00
Viktor Lofgren
dd15676d33 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:18:04 +02:00
Viktor Lofgren
ec5a17ad13 (index) Tune ranking for verbatim matches in the title, rewarding shorter titles 2024-08-03 14:07:02 +02:00
Viktor Lofgren
e48f52faba (experiment) Add add-hoc filter runner 2024-08-03 13:24:03 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
bf26ead010 (index) Remove hasPrioTerm check as we should sort this out in ranking 2024-08-03 13:06:50 +02:00
Viktor Lofgren
c2cedfa83c (index) Experimental ranking signals 2024-08-03 10:33:41 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
c6c8b059bf (index) Return some variant of the previously removed 'Bm25PrioGraphVisitor' 2024-08-03 10:10:12 +02:00
Viktor Lofgren
d8a99784e5 (index) Adding a few experimental relevance signals 2024-08-02 20:26:07 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
4430a39120 (loader) Clean up 2024-08-02 12:32:47 +02:00
Viktor Lofgren
6228f46af1 (loader) Reduce log spam 2024-08-02 12:21:03 +02:00
Viktor Lofgren
ac67b6b5da (converter) Fix exception handling while reading crawl data 2024-08-02 10:39:49 +02:00
Viktor Lofgren
1a268c24c8 (perf) Reduce DomPruningFilter hash table recalculation 2024-08-01 12:04:55 +02:00
Viktor Lofgren
38e2089c3f (perf) Code was still spending a lot of time resolving charsets
... in the failure case which wasn't captured by memoization.
2024-08-01 11:58:59 +02:00
Viktor Lofgren
e2107901ec (index) Add span information for anchor tags, tweak ranking params 2024-08-01 11:46:30 +02:00
Viktor Lofgren
15745b692e (index) Coherences need to be able to deal with null values among positions 2024-07-31 22:00:14 +02:00
Viktor Lofgren
696fd8909d (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 21:44:10 +02:00
Viktor Lofgren
02b1c4b172 (screenshot-capture-tool) Make screenshot bot spend time refreshing old screenshots instead of always capturing new ones 2024-07-31 20:21:23 +02:00
Viktor Lofgren
285e657f68 Merge branch 'master' into term-positions
# Conflicts:
#	code/processes/crawling-process/java/nu/marginalia/crawl/CrawlerMain.java
#	code/processes/crawling-process/java/nu/marginalia/crawl/retreival/CrawlerRetreiver.java
2024-07-31 10:44:01 +02:00
Viktor Lofgren
046ffc7752 (build) Upgrade jib to 3.4.3 2024-07-31 10:39:50 +02:00
Viktor Lofgren
2ef66ce0ca (actor) Reset NEW flag earlier when auto-deletion is disabled
Don't wait until the loader step is finished to reset the NEW flag, as this leaves manually processed (but not yet loaded) crawl data stuck in "CREATING" in the GUI.
2024-07-31 10:31:03 +02:00
Viktor Lofgren
dc5c668940 (index) Re-enable parallelization of index construction, disable parallel sorting during construction
The first change, running index construction in parallel, was previously how it was done, but it was changed to run sequentially to see how it would affect performance.  It got worse, so the change is reverted.

Though it's been noted that sorting in parallel is likely not a good idea as it leads to a lot of I/O thrashing, so this is changed to be done sequentially.
2024-07-31 10:06:53 +02:00
Viktor Lofgren
f19148132a (search) Restrict site-search by passing domain id along with the site:-term
This will help these queries deal with domains that do not have a subdomain so that they do not drag up subdomains as well, as they are also given the special site:-keyword for their corresponding parent domain.
2024-07-30 21:41:07 +02:00
Viktor Lofgren
6d7b886aaa (converter) Correct sort order of files in control storage GUI
Previously it was sorted on a field that would switch to just showing the time whenever the date was the same as the day's date, leading to a bizarre sort order where files created today was typically shown first, followed by the rest of the files with the oldest date first.
2024-07-30 19:43:27 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
7e4efa45b8 (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:21:21 +02:00
Viktor Lofgren
86ea28d6bc (converter/loader) Simplify document record writing to not require predicated reads 2024-07-29 14:18:52 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
1282f78bc5 (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 11:01:18 +02:00
Viktor Lofgren
2d5d965f7f (slop-models) Fix incorrect column grouping leading to errors in converter 2024-07-29 10:34:33 +02:00
Viktor Lofgren
afe56c7cf1 (loader) Tidy up code 2024-07-28 21:36:42 +02:00
Viktor Lofgren
7d51cf882f (loader) Move rssFeeds to a different column group to avoid errors 2024-07-28 21:30:10 +02:00
Viktor Lofgren
499deac2ef (slop) Fix test that broke when we split get into int get() and long getLong() 2024-07-28 21:20:37 +02:00
Viktor Lofgren
9685993adb (loader) Add spans to a different column group from spanCodes, as they are not in sync 2024-07-28 21:20:09 +02:00
Viktor Lofgren
261dcdadc8 (loader) Additional tracking for the control GUI 2024-07-28 21:19:45 +02:00
Viktor Lofgren
314a901bf0 (slop) Clean up build.gradle from unnecessary copy-paste garbage 2024-07-28 13:22:20 +02:00
Viktor Lofgren
1caad7e19e (slop) Update existing code to use the altered Slop interfaces 2024-07-28 13:21:08 +02:00
Viktor Lofgren
e585116dab (slop) Add 32 bit read method for Varint along with the old 64 bit version 2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654 (slop) Add signed 16 bit column type "short" 2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9 (slop) Improve Conveniences for Enum
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
d05a2e57e9 (index-forward) Spans Writer should not be in the index page loop context 2024-07-27 15:17:04 +02:00
Viktor Lofgren
f8684118f3 (slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
2e1f669aea (slop) Remove additional vestigial seek() implementations 2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308 (slop) Introduce table concept to keep track of positions and simplify closing
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.

The second most common error is forgetting to close one of the columns in a reader or writer.

To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
Viktor Lofgren
ec600b967d (crawler) Adjust domain locking
Turns out throttling to only 1 lock per domain means the crawler chokes hard on large hosting websites such as wordpress.  Giving these a slightly larger allowance.
2024-07-27 11:54:46 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410 (slop) Translate nulls to empty strings when passed to the StringColumnWriters. 2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469 (slop) Handle empty compressed files correctly
The CompressingStorageReader would incorrectly report having data when a file was empty.  Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac (slop) First commit of slop library
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
60ef826e07 (loader) Add heartbeat to update domain-ids step 2024-07-25 15:08:41 +02:00
Viktor Lofgren
2ad564404e (loader) Add heartbeat to update domain-ids step 2024-07-23 15:28:52 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
7a1edc0880 (term-freq) Reduce the number of low-relevance words in the dictionary
Using a statistical trick to reduce the number of low-frequency words in the dictionary, as they are numerous and not very informative.
2024-07-19 12:23:28 +02:00
Viktor Lofgren
b812e96c6d (language-processing) Select the appropriate language filter
The incorrect filter was selected based on the provided parameter, this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00
Viktor Lofgren
d36055a2d0 (keyword-extractor) Retire TfIdfHigh WordFlag
This will bring the word flags count down to 8, and let us pack every value in a byte.
2024-07-17 13:54:39 +02:00
Viktor Lofgren
0d227f3543 (cleanup) Remove next-prime library only used in tests 2024-07-17 13:48:03 +02:00
Viktor Lofgren
accc598967 (crawler) Add 1 second pause after probing domain to reduce request pressure 2024-07-16 16:55:07 +02:00
Viktor Lofgren
02c4a2d4ba (crawler) Add a per-domain mutex for crawling
To let up the pressure on domains with lot sof subdomains such as substack, medium, neocities, etc. a per-domain mutex is added that will limit crawling of these domains to one thread at a time.
2024-07-16 16:44:59 +02:00
Viktor Lofgren
6665e447aa (crawler) Add crawl delays around probe call and deal with 429:s properly during this phase 2024-07-16 15:33:24 +02:00
Viktor Lofgren
f4d79c203d (crawler) Adjust revisit logic
The revisit logic wasn't sufficiently dampening the recrawl rate for websites that largely have not changed.

Modified it to be more reactive to the degree to which the content has changed, while applying upper and lower limits depending on the size of the crawl set.
2024-07-16 15:12:38 +02:00
Viktor Lofgren
4d29581ea4 (crawler) Introduce absolute upper limit to crawl depth growth 2024-07-16 14:40:45 +02:00
Viktor Lofgren
0b31c4cfbb (coded-sequence) Replace GCS usage with an interface 2024-07-16 14:37:50 +02:00
Viktor Lofgren
5c098005cc (index) Fix broken test
Expected behavior changed since the ranking algorithm now takes into account the number of positions of the keyword, and the test loader was previously modified to generate positions based on prime factors of the document id.
2024-07-16 12:37:59 +02:00
Viktor Lofgren
ae87e41cec (index) Fix rare BitReader.takeWhileZero bug
Fix rare bug where the takeWhileZero method would fail to repopulate the underlying buffer.  This caused intermittent de-compression errors if takeWhileZero happened at a 64 bit boundary while the underlying buffer was empty.

The change also alters how sequence-lengths are encoded, to more consistently use the getGamma method instead of adding special significance to a zero first byte.

Finally, assertions are added checking the invariants of the gamma and delta coding logic as well as UrlIdCodec to earlier detect issues.
2024-07-16 11:03:56 +02:00
Viktor Lofgren
dfd19b5eb9 (index) Reduce the number of abstractions around result ranking
The change also restructures the internal API a bit, moving resultsFromDomain from RpcRawResultItem into RpcDecoratedResultItem, as the previous order was driving complexity in the code that generates these objects, and the consumer side of things puts all this data in the same object regardless.
2024-07-16 08:18:54 +02:00
Viktor
8ed5b51a32
Merge branch 'master' into term-positions 2024-07-15 07:05:31 +02:00
Viktor Lofgren
9d0e5dee02 Fix gitignore issue .so files not to be ignored correctly. 2024-07-15 05:18:10 +02:00
Viktor Lofgren
ffd970036d (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:16:17 +02:00
Viktor Lofgren
fa162698c2 (term-frequency) Fix concurrency issues in SentenceExtractor and TermFrequencyExporter
How'd This Ever Work? (tm)

TermFrequencyExporter was using Math.clamp() incorrectly, and SentenceExtractor was synchronizing on its own instance when initializing shared static members, causing rare issues when spinning multiple SE:s up at once.
2024-07-15 05:15:30 +02:00
Viktor Lofgren
ad3857938d (search-api, ranking) Update with new ranking parameters
Adding new ranking parameters to the API and routing them through the system, in order to permit integration of the new position data with the ranking algorithm.

The change also cleans out several parameters that no longer filled any function.
2024-07-15 04:49:40 +02:00
Viktor Lofgren
179a6002c2 (coded-sequence) Add a callback for re-filling underlying buffer 2024-07-12 23:50:28 +02:00
Viktor Lofgren
d28fc86956 (index-prio) Add fuzz test for prio index 2024-07-11 19:22:36 +02:00
Viktor Lofgren
6303977e9c (index-prio) Fail louder when size is 0 in PrioDocIdsTransformer
We can't deal with this scenario and should complain very loudly
2024-07-11 19:22:05 +02:00
Viktor Lofgren
97695693f2 (index-prio) Don't increment readItems counter when the output buffer is full
This behavior was causing the reader to sometimes discard trailing entries in the list.
2024-07-11 19:21:36 +02:00
Viktor Lofgren
1ab875a75d (test) Correcting flaky tests
Also changing the inappropriate usage of ReverseIndexPrioFileNames for the full index in test code.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
31881874a9 (coded-sequence) Correct indicator of next-value
It was incorrectly assumed that a "next" value could not be zero or negative, as this is not representable via the Gamam code.  This is incorrect in this case, as we're able to provide a negative offset.  Changing to using Integer.MIN_VALUE as indicator that a value is absent instead, as this will never be used.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
f090f0101b (index-construction) Gather up preindex writes
Use fewer writes when finalizing the preindex documents.dat file, as this was getting too slow.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
9881cac2da (index-reader) Correctly handle negative offset values
When wordOffset(...) returns a negative value, it means the word isn't present in the index, and we should abort.
2024-07-11 16:13:23 +02:00
Viktor Lofgren
12590d3449 (index-reverse) Added compression to priority index
The priority index documents file can be trivially compressed to a large degree.

Compression schema:
```
00b -> diff docord (E gamma)
01b -> diff domainid (E delta) + (1 + docord) (E delta)
10b -> rank (E gamma) + domainid,docord (raw)
11b -> 30 bit size header, followed by 1 raw doc id (61 bits)
```
2024-07-11 16:13:23 +02:00
Viktor Lofgren
abf7a8d78d (coded-sequence) Correct implementation of Elias gamma
Also clean up the code a bit as the EliasGammaCodec class was an iterator, and it was leaking abstraction details.
2024-07-10 14:28:28 +02:00
Viktor Lofgren
ecfe17521a (coded-sequence) Correct implementation of Elias gamma
The implementation was incorrectly using 1 bit more than it should.  The change also adds a put method for Elias delta; and cleans up the interface a bit.
2024-07-09 17:28:21 +02:00
Viktor Lofgren
0d29e2a39d (index-reverse) Entry Sources reset() their LongQueryBuffer
Previously this was the responsibility of the caller, which lead to the possibility of passing in improperly prepared buffers and receiving bad outcome
2024-07-09 01:39:40 +02:00
Viktor Lofgren
12a2ab93db (actor) Improve error messages for convert-and-load
Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.
2024-07-08 19:19:30 +02:00
Viktor Lofgren
d90bd340bb (index-reverse) Removing btree indexes from prio documents file
Btree index adds overhead and disk space and doesn't fill any function for the prio index.

* Update finalize logic with a new IO transformer that copies the data and prepends a size
* Update the reader to read the new format
* Added a test
2024-07-08 17:20:17 +02:00
Viktor Lofgren
21afe94096 (index-reverse) Don't use 128 bit merge function for prio index 2024-07-07 21:36:10 +02:00
Viktor Lofgren
fa36689597 (index-reverse) Simplify priority index
* Do not emit a documents file
* Do not interlace metadata or offsets with doc ids
2024-07-06 18:04:08 +02:00
Viktor Lofgren
85c99ae808 (index-reverse) Split index construction into separate packages for full and priority index 2024-07-06 15:44:47 +02:00
Viktor Lofgren
a4ecd5f4ce (minor) Fix non-compiling test due to previous refactor 2024-07-06 15:11:43 +02:00
Viktor Lofgren
6401a513d7 (crawl) Fix onsubmit confirm dialog for single-site recrawl 2024-07-05 17:21:03 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
a6b03a66dc (crawl) Reduce Charset.forName() object churn
Cache the Charset object returned from Charset.forName() for future use, since we're likely to see the same charset again and Charset.forName(...) can be surprisingly expensive and its built-in caching strategy, which just caches the 2 last values seen doesn't cope well with how we're hitting it with a wide array of random charsets
2024-07-04 20:49:07 +02:00
Viktor Lofgren
d023e399d2 (index) Remove unnecessary allocations in journal reader
The term data iterator is quite hot and was performing buffer slice operations that were not necessary.

Replacing with a fixed pointer alias that can be repositioned to the relevant data.

The positions data was also being wrapped in a GammaCodedSequence only to be immediately un-wrapped.

Removed this unnecessary step and move to copying the buffer directly instead.
2024-07-04 15:38:22 +02:00
Viktor Lofgren
e8ab1e14e0 (keyword-extraction) Update upper limit to number of positions per word
After real-world testing, it was determined that 256 was still a bit too low, but 512 seems like it will only truncate outlier cases like assembly code and certain tabulations.
2024-07-02 20:52:32 +02:00
Viktor Lofgren
a6e15cb338 (keyword-extraction) Update upper limit to number of positions per word
100 was a bit too low, let's try 256.
2024-06-30 22:46:56 +02:00
Viktor Lofgren
4fbb863a10 (keyword-extraction) Add upper limit to number of positions per word
Also adding some logging for this event to get a feel for how big these lists get with realistic data.  To be cleaned up later.
2024-06-30 22:41:38 +02:00
Viktor Lofgren
6ee4d1eb90 (keyword) Increase the work area for position encoding
The change also moves the allocation outside of the build()-method to allow re-use of this rather large temporary buffer.
2024-06-28 16:42:39 +02:00
Viktor Lofgren
738e0e5fed (process) Add option for automatic profiling
The change adds a new system property 'system.profile' that makes ProcessService automatically trigger JFR profiling of the processes it spawns.  By default, these are put in the log directory.

The change also adds a JVM parameter that makes it shut up about native access.
2024-06-27 13:58:36 +02:00
Viktor Lofgren
0e4dd3d76d (minor) Remove accidentally committed debug printf 2024-06-27 13:40:53 +02:00
Viktor Lofgren
10fe5a78cb (log) Prevent tests from trying to log to file
They would never have succeeded, but it adds an annoying preamble of error spam in the console window.
2024-06-27 13:19:48 +02:00
Viktor Lofgren
975b8ae2e9 (minor) Tidy code 2024-06-27 13:15:31 +02:00
Viktor Lofgren
935234939c (test) Add query parsing to IntegrationTest 2024-06-27 13:15:20 +02:00
Viktor Lofgren
87e38e6181 (search-query) refac: Move query factory 2024-06-27 13:14:47 +02:00
Viktor Lofgren
f73fc8dd57 (search-query) Fix end-inclusion bug in QWordGraphIterator 2024-06-27 13:13:42 +02:00
Viktor Lofgren
3faa5bf521 (search-query) Tidy up QueryGRPCService and IndexClient 2024-06-26 14:03:30 +02:00
Viktor Lofgren
6973712480 (query) Tidy up code 2024-06-26 13:40:06 +02:00
Viktor Lofgren
02df421c94 (*) Trim the stopwords list
Having an overlong stopwords list leads to quoted terms not performing well.  For now we'll slash it to just "a" and "the".
2024-06-26 12:22:57 +02:00
Viktor Lofgren
95b9af92a0 (index) Implement working optional TermCoherences 2024-06-26 12:22:06 +02:00
Viktor Lofgren
8ee64c0771 (index) Correct TermCoherence requirements 2024-06-25 22:18:10 +02:00
Viktor Lofgren
b805f6daa8 (gamma) Fix readCount() behavior in EGC 2024-06-25 22:17:54 +02:00
Viktor Lofgren
dae22ccbe0 (test) Integration test from crawl->query 2024-06-25 22:17:26 +02:00
Viktor Lofgren
9d00243d7f (index) Partial re-implementation of position constraints 2024-06-24 15:55:54 +02:00
Viktor Lofgren
5461634616 (doc) Add readme.md for coded-sequence library
This commit introduces a readme.md file to document the functionality and usage of the coded-sequence library. It covers the Elias Gamma code support, how sequences are encoded, and methods the library offers to query sequences, iterate over values, access data, and decode sequences.
2024-06-24 14:28:51 +02:00
Viktor Lofgren
40bca93884 (gamma) Minor clean-up 2024-06-24 13:56:43 +02:00
Viktor Lofgren
b798f28443 (journal) Fixing journal encoding
Adjusting some bit widths for entry and record sizes to ensure these don't overflow, as this would corrupt the written journal.
2024-06-24 13:56:27 +02:00
Viktor Lofgren
fff2ce5721 (gamma) Correctly decode zero-length sequences 2024-06-24 13:11:41 +02:00
Viktor
08ff79827e
Merge branch 'master' into security-scan 2024-06-17 13:18:25 +02:00
Viktor Lofgren
d0d6bb173c (control) Fix warc data http status filter default value 2024-06-17 12:40:25 +02:00
Viktor Lofgren
90744433c9 Merge branch 'master' into security-scan
# Conflicts:
#	code/libraries/array/cpp/resources/libcpp.so
2024-06-13 13:14:47 +02:00
Jaseem Abid
0dd14a4bd0 Specify C++ standard in build command
The default C++ language standard on macOS is gnu++98, which won't build
this module.

Full error:

```
> Task :code:libraries:array:cpp:compileCpp FAILED
src/main/cpp/cpphelpers.cpp:28:5: error: expected expression
    [](const p64x2& fst, const p64x2& snd) {
    ^
```
2024-06-12 12:47:10 +01:00
Jaseem Abid
9974b31a09 Don't track build files(libcpp.so) with git 2024-06-12 12:45:49 +01:00
Viktor Lofgren
0ffbbaf4b9 (crawler) Update WARC builder to use SHA-256 for digests 2024-06-12 09:14:12 +02:00
Viktor Lofgren
6839415a0b (crawler) Fetch TLS instead of SSL context 2024-06-12 09:07:54 +02:00
Viktor Lofgren
55f3ac4846 (atags) Fix duckdb SQL injection
The input comes from the config file so this isn't a very realistic threat vector, and even if it wasn't it's a query in an empty duckdb instance; but adding a validation check to provide a better error message.
2024-06-12 09:05:57 +02:00
Viktor Lofgren
801cf4b5da (search) Fix bad practice usage of innerHTML to set what should be text content. 2024-06-12 08:59:40 +02:00
Viktor Lofgren
e0459d0c0d (build) Upgrade parquet dependencies to 1.14.0
This gets rid of a vulnerable transitive dependency.
2024-06-12 08:57:22 +02:00
Viktor Lofgren
23759a7243 (loader) Correctly clamp document size 2024-06-10 18:29:14 +02:00
Viktor Lofgren
55b2b7636b (loader) Correctly load the positions column in the keyword projection 2024-06-10 18:27:15 +02:00
Viktor Lofgren
36160988e2 (index) Integrate positions data with indexes WIP
This change integrates the new positions data with the forward and reverse indexes.

The ranking code is still only partially re-written.
2024-06-10 15:09:06 +02:00
Viktor Lofgren
9f982a0c3d (index) Integrate positions file properly 2024-06-06 16:45:42 +02:00
Viktor Lofgren
dcbec9414f (index) Fix non-compiling tests 2024-06-06 16:35:09 +02:00
Viktor Lofgren
a07cf1ba93 (array/cpp) Update gitignore to properly exclude libcpp.so 2024-06-06 13:06:08 +02:00
Viktor Lofgren
4a8afa6b9f (index, WIP) Position data partially integrated with forward and reverse indexes.
There's no graceful way of doing this in small commits, pushing to avoid the risk of data loss.
2024-06-06 12:54:52 +02:00
Sam Storment
9c06f446fb (search) Styling tweaks. Make the filter button near the top right corener a bit bigger so it's easier to press on mobile 2024-06-05 19:55:17 -05:00
Sam Storment
2d076cbd67 (search) move data-has-js attribute from body to html element 2024-06-05 18:20:33 -05:00
Sam Storment
fb2eef24d6 Handle themeing when javascript is disabled. Hide the theme select and fallback to dark media query instead of data-theme attribute 2024-06-03 14:15:35 -05:00
Sam Storment
e2f68d9ccf Add a theme select to the header that lets users toggle their theme independent of their OS theme 2024-06-02 21:02:52 -05:00
Viktor Lofgren
d4f4d751c0 Merge remote-tracking branch 'origin/master' 2024-06-02 16:30:41 +02:00
Viktor Lofgren
b4eac2516e (crawler) Send "Accept"-headers when fetching documents, also indicate we prefer English results 2024-06-02 16:30:34 +02:00
Viktor
4435f6245c
Merge pull request #94 from samstorment/search-dark-theme
Search Dark Theme
2024-06-02 16:21:52 +02:00
Viktor Lofgren
9b922af075 (converter) Amend existing modifications to use gamma coded positions lists
... instead of serialized RoaringBitmaps as was the initial take on the problem.
2024-05-30 14:20:36 +02:00
Viktor Lofgren
0112ae725c (gamma) Implement a small library for Elias gamma coding an integer sequence 2024-05-30 14:19:13 +02:00
Viktor Lofgren
619392edf9 (keywords) Add position information to keywords 2024-05-28 16:54:53 +02:00
Viktor Lofgren
0894822b68 (converter) Add position information to serialized document data
This is not hooked in yet, and the term metadata is still left intact.  It should probably shrink to a smaller representation (byte?) with the upcoming removal of the position mask.
2024-05-28 14:18:03 +02:00
Viktor Lofgren
a69ab311c7 (qword) Fix tests that broke due to stopword removal 2024-05-28 14:15:45 +02:00
Viktor Lofgren
6985ab762a (query) Improve handling of stopwords in queries 2024-05-23 20:50:55 +02:00
Viktor Lofgren
0e8300979b (search) Update the no result text to request bug reports. 2024-05-23 20:18:16 +02:00
Viktor Lofgren
0b60411e5f (query) Bugfix stopword issue
Add a new rule that crates an alternative path that omits a word if it's a stopword.

In queries where a stopword is present, and no query ngram expansion is possible, the query should not require the stopword to be present in the index, as this results in no search results being found.
2024-05-23 20:15:14 +02:00
Viktor Lofgren
f83f777fff (converter) Experimental support for searching by URL
Add up to synthetic 128 keywords per document, corresponding to links to other websites.
2024-05-23 17:10:57 +02:00