Commit Graph

14 Commits

Author SHA1 Message Date
Viktor Lofgren
e0c0ed27bc (keyword-extraction) Clean up code and add tests for position and spans calculation
This code has been a bit of a mess and historically significantly flaky, so some test coverage is more than overdue.
2024-12-08 14:14:52 +01:00
Viktor Lofgren
291ca8daf1 (converter/index) Improve atag sentence matching by taking into consideration how many times a sentence appears in the links
This change breaks the format of the atags.parquet file.
2024-12-08 00:27:11 +01:00
Viktor Lofgren
fdc3efa250 (setup) Remove OpenNLP tokenization model
This update eliminates all occurrences of the OpenNLP token model from the setup script, configuration, and test files, as this model file is no longer used.
2024-11-28 16:03:05 +01:00
Viktor Lofgren
9ec41e27c6 (keyword-extractor) Fix bug where external link keywords weren't generating document spans as intended 2024-11-25 18:30:22 +01:00
Viktor Lofgren
200743c84f (minor) Remove delomobok debris 2024-11-25 18:29:21 +01:00
Viktor Lofgren
9f47ce8d15 (chore) Remove lombok
There are likely some instances of delombok gore with this commit.
2024-11-11 21:14:38 +01:00
Viktor Lofgren
8047e77757 (doc) Correct dead links and stale information in the docs 2024-09-13 11:01:05 +02:00
Viktor Lofgren
abab5bdc8a (index, EXPERIMENTAL) Evaluate using Varint instead of GCS for position data 2024-08-26 14:20:39 +02:00
Viktor Lofgren
0a383a712d (qdebug) Accurately display positions when intersecting with spans 2024-08-15 11:44:17 +02:00
Viktor Lofgren
fd2bad39f3 (keyword-extraction) Add body field for terms that are not otherwise part of a field 2024-08-13 09:49:26 +02:00
Viktor Lofgren
680ad19c7d (keyword-extraction) Correct behavior when loading spans so that they are not double-loaded causing errors 2024-08-06 11:16:56 +02:00
Viktor Lofgren
2080e31616 (converter) Store link text positions
To help offer verbatim matches for external link texts, we assign these positions in the document a bit after the actual document ends.  Integrating this information with the ranking is not performed here.
2024-08-04 12:00:29 +02:00
Viktor Lofgren
b316b55be9 (index) Experimental initial integration of document spans into index 2024-07-30 12:01:53 +02:00
Viktor Lofgren
80900107f7 (restructure) Clean up repo by moving stray features into converter-process and crawler-process 2024-07-30 10:14:00 +02:00