Commit Graph

242 Commits

Author SHA1 Message Date
Viktor Lofgren
18ca926c7f (converter) Truncate excessively long strings in SentenceExtractor, malformed data was effectively DOS:ing the converter 2025-01-26 12:52:54 +01:00
Viktor Lofgren
55d6ab933f Merge branch 'master' into slop-crawl-data-spike 2025-01-21 13:32:58 +01:00
Viktor Lofgren
dcad0d7863 (search) Tweak token formation. 2025-01-05 21:01:09 +01:00
Viktor Lofgren
94e1aa0baf (search) Tweak token formation to still break apart emails in brackets. 2025-01-05 20:55:44 +01:00
Viktor Lofgren
b62f043910 (search) Adjust token formation rules to be more lenient to C++ and PHP code.
This addresses Issue #142
2025-01-05 20:50:27 +01:00
Viktor Lofgren
895cee7004 (crawler) Improved feed discovery, new domain state db per crawlset
Feed discover is improved with by probing a few likely endpoints when no feed link tag is provided.  To store the feed URLs, a sqlite database is added to each crawlset that stores a simple summary of the crawl job, including any feed URLs that have been discovered.

Solves issue #135
2024-12-26 15:05:52 +01:00
Viktor Lofgren
b510b7feb8 Spike for storing crawl data in slop instead of parquet
This seems to reduce RAM overhead to 100s of MB (from ~2 GB), as well as roughly double the read speeds.  On disk size is virtually identical.
2024-12-15 15:49:47 +01:00
Viktor Lofgren
fdc3efa250 (setup) Remove OpenNLP tokenization model
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
a5b4951f23 (chore) Remove use of deprecated STR.-style string templates 2024-11-11 18:02:28 +01:00
Viktor Lofgren
a456ec9599 (feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished 2024-11-10 18:30:28 +01:00
Viktor Lofgren
a2bc9a98c0 (feed) Use the message queue to permit the feeds service to tell the calling actor when it's finished 2024-11-10 17:45:20 +01:00
Viktor Lofgren
2b8ab97ec1 (bit-writer) Do not clear buffer when creating a bit writer 2024-09-29 17:52:43 +02:00
Viktor Lofgren
43ca9c8a12 (sequence) Return Integer.MAX_VALUE for empty position lists.
Updated the method to return Integer.MAX_VALUE if any of the position lists are empty, instead of returning 0. This ensures that empty lists are handled consistently and address edge cases where an empty list is encountered.
2024-09-29 17:21:17 +02:00
Viktor Lofgren
95cde242ca (assistant) Fix NPE when IP information is absent 2024-09-25 20:19:17 +02:00
Viktor Lofgren
9c292a4f62 (doc) Fix outdated links in documentation 2024-09-22 13:56:17 +02:00
Viktor Lofgren
edb42836da (vcs) Fix shared state issues with VarintCodedSequence's iterators.
Also cleans up the code a bit.
2024-09-21 16:09:15 +02:00
Viktor Lofgren
1ff88ff0bc (vcs) Stopgap fix for quoted queries with the same term appearinc multiple times
There are reentrance issues with VarintCodedSequence, this hides the symptom but these need to be corrected properly.
2024-09-21 14:07:59 +02:00
Viktor Lofgren
60ad4786bc (index) Use MemorySegment.copy for LongArray->LongArray transfers 2024-09-17 13:56:31 +02:00
Viktor Lofgren
9f9c6736ab (index) Use MemorySegment.copy for LongArray->LongArray transfers 2024-09-17 13:49:02 +02:00
Viktor Lofgren
a8bec13ed9 (index) Evaluate using mmap reads during index construction in favor of filechannel reads
It's likely that this will be faster, as the reads are on average small and sequential, and can't be buffered easily.
2024-09-13 16:14:56 +02:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
35f49bbb60 (coded-sequence) Add equals and hashCode to VCS 2024-09-10 10:33:56 +02:00
Viktor Lofgren
8290c19e24 (query-parsing) Drop search term elements that aren't indexed by the search engine 2024-09-03 11:21:01 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
30bf845c81 (index) Speed up minDist calculations by excluding large lists 2024-08-26 13:04:15 +02:00
Viktor Lofgren
7d471ec30d (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:45:11 +02:00
Viktor Lofgren
f3182a9264 (coded-sequence) Evaluate new minDist implementation 2024-08-26 12:02:37 +02:00
Viktor Lofgren
805cb5ad58 (coded-sequence) Correct behavior of findIntersections 2024-08-25 14:54:17 +02:00
Viktor Lofgren
efd56efc63 (index) Optimize SequenceOperations.minDistance 2024-08-25 13:28:06 +02:00
Viktor Lofgren
0d01a48260 (index) Optimize SequenceOperations 2024-08-25 13:19:37 +02:00
Viktor Lofgren
00ab2684fa (index) Optimize SequenceOperations 2024-08-25 13:17:38 +02:00
Viktor Lofgren
a5585110a6 (index) Optimize SequenceOperations 2024-08-25 13:16:31 +02:00
Viktor Lofgren
965c89798e (index) Optimize DocumentSpan 2024-08-25 12:44:33 +02:00
Viktor Lofgren
24b805472a (index) Evaluate performance implication of decoding gcs early 2024-08-25 12:23:09 +02:00
Viktor Lofgren
6dda2c2d83 (coded-sequence) Reduce allocations in GCS.values() 2024-08-25 12:06:31 +02:00
Viktor Lofgren
7f498e10b7 (index) Adjust proximity score 2024-08-25 11:01:35 +02:00
Viktor Lofgren
b09e2dbeb7 (build) Fix dependency churn from testcontainers
Apparently you need to pull in commons-codec now in order to run testcontainers, through spooky action at a distance.
2024-08-25 10:35:48 +02:00
Viktor Lofgren
9eb1f120fc (index) Repair positions bitmask for search result presentation 2024-08-22 11:28:23 +02:00
Viktor Lofgren
e4c97a91d8 (*) Comment clarity 2024-08-21 10:12:00 +02:00
Viktor Lofgren
bca40de107 (*) Upgrade slop library 2024-08-18 10:43:41 +02:00
Viktor Lofgren
b2a3cac351 (*) Remove broken imports 2024-08-15 11:01:34 +02:00
Viktor Lofgren
a18edad04c (index) Remove stopword list from converter
We want to index all words in the document, stopword handling is moved to the index where we change the semantics to elide inclusion checks in query construction for a very short list of words tentatively hard-coded in SearchTerms.
2024-08-15 09:36:50 +02:00
Viktor Lofgren
75b0888032 (slop) Migrate to latest Slop version 2024-08-14 11:44:35 +02:00
Viktor Lofgren
623ee5570f (slop) Break slop out into its own repository 2024-08-13 09:50:05 +02:00
Viktor Lofgren
41b52f5bcd (index) Simplify verbatim match calculation 2024-08-10 09:51:03 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
c379be846c (slop) Update readme 2024-08-04 10:58:23 +02:00
Viktor Lofgren
9bc665628b (slop) VarintLE implementation, correct enum8 column 2024-08-04 10:57:52 +02:00
Viktor Lofgren
8462e88b8f (index) Add min-dist factor and adjust rankings 2024-08-03 13:07:00 +02:00