Commit Graph

220 Commits

Author SHA1 Message Date
Viktor Lofgren
35f49bbb60 (coded-sequence) Add equals and hashCode to VCS 2024-09-10 10:33:56 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, stopword handling is moved to the index where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c (slop) Update readme 2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b (slop) VarintLE implementation, correct enum8 column 2024-08-04 10:57:52 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00
Viktor Lofgren
eba2844361 (index) Experimental ranking signals 2024-08-03 10:32:46 +02:00
Viktor Lofgren
57929ff242 (coded-sequence) Varint sequence 2024-08-02 20:22:56 +02:00
Viktor Lofgren
e2107901ec (index) Add span information for anchor tags, tweak ranking params 2024-08-01 11:46:30 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00
Viktor Lofgren
34703da144 (slop) Support for nested array types and array-of-object types
Also adding very basic support for filtered reads via SlopTable.  This is probably not a final design.
2024-07-29 14:00:43 +02:00
Viktor Lofgren
499deac2ef (slop) Fix test that broke when we split get into int get() and long getLong() 2024-07-28 21:20:37 +02:00
Viktor Lofgren
314a901bf0 (slop) Clean up build.gradle from unnecessary copy-paste garbage 2024-07-28 13:22:20 +02:00
Viktor Lofgren
e585116dab (slop) Add 32 bit read method for Varint along with the old 64 bit version 2024-07-28 13:20:18 +02:00
Viktor Lofgren
40f42bf654 (slop) Add signed 16 bit column type "short" 2024-07-28 13:19:44 +02:00
Viktor Lofgren
eaf7fbb9e9 (slop) Improve Conveniences for Enum
* New fixed width 8 bit version of Enum
* Access to the enum's dictionary, and a method for reading the ordinal directly to reduce GC churn
2024-07-28 13:19:15 +02:00
Viktor Lofgren
f8684118f3 (slop) Add columnDesc information to the column readers and writers, and correct a few broken position() implementations
Added a test that should find any additional broken implementations, as it's very important that this function is correct.
2024-07-27 14:35:30 +02:00
Viktor Lofgren
2e1f669aea (slop) Remove additional vestigial seek() implementations 2024-07-27 14:35:30 +02:00
Viktor Lofgren
6c3abff664 (slop) Move GCS Slop column to the coded-sequence package
This lets the slop library be stand-alone without dependence on coded-sequence.

The change also gets rid of the vestigial seek() method in ColumnReader.
2024-07-27 13:58:45 +02:00
Viktor Lofgren
dcb43a3308 (slop) Introduce table concept to keep track of positions and simplify closing
The most common error when dealing with Slop columns is that they can fall out of sync with each other if the programmer accidentally does a conditional read and forgets to skip.

The second most common error is forgetting to close one of the columns in a reader or writer.

To deal with both cases, a new class SlopTable is added that keeps track of the lifecycle of all slop columns and performs a check when closing them that they are in sync.
2024-07-27 13:47:47 +02:00
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor Lofgren
52a9a0d410 (slop) Translate nulls to empty strings when passed to the StringColumnWriters. 2024-07-25 18:26:41 +02:00
Viktor Lofgren
4123e99469 (slop) Handle empty compressed files correctly
The CompressingStorageReader would incorrectly report having data when a file was empty.  Preemptively attempting to fill the backing buffer fixes the behavior.
2024-07-25 18:26:13 +02:00
Viktor Lofgren
51a8a242ac (slop) First commit of slop library
Slop is a low-abstraction data storage convention for column based storage of complex data.
2024-07-25 15:08:41 +02:00
Viktor Lofgren
2bb9f18411 (dld) Refactor DocumentLanguageData
Reduce the usage of raw arrays
2024-07-19 12:24:55 +02:00
Viktor Lofgren
b812e96c6d (language-processing) Select the appropriate language filter
The incorrect filter was selected based on the provided parameter, this has been corrected.
2024-07-19 12:22:32 +02:00
Viktor Lofgren
22b35d5d91 (sentence-extractor) Add tag information to document language data
Decorates DocumentSentences with information about which HTML tags they are nested in, and removes some redundant data on this rather memory hungry object.  Separator information is encoded as a bit set instead of an array of integers.

The change also cleans up the SentenceExtractor class a fair bit.  It no longer extracts ngrams, and a significant amount of redundant operations were removed as well.  This is still a pretty unpleasant class to work in, but this is the first step in making it a little bit better.
2024-07-18 15:57:48 +02:00