Commit Graph

10 Commits

Author SHA1 Message Date
Viktor Lofgren
aebb2652e8 (wip) Extract and encode spans data
Refactoring keyword extraction to extract spans information.

Modifying the intermediate storage of converted data to use the new slop library, which is allows for easier storage of ad-hoc binary data like spans and positions.

This is a bit of a katamari damacy commit that ended up dragging along a bunch of other fairly tangentially related changes that are hard to break out into separate commits after the fact.  Will push as-is to get back to being able to do more isolated work.
2024-07-27 11:44:13 +02:00
Viktor
8ed5b51a32
Merge branch 'master' into term-positions 2024-07-15 07:05:31 +02:00
Viktor Lofgren
12a2ab93db (actor) Improve error messages for convert-and-load
Some copy-and-paste errors had snuck in and every index construction error was reported as "repartitioned failed"; updated with more useful messages.
2024-07-08 19:19:30 +02:00
Viktor Lofgren
d86926be5f (crawl) Add new functionality for re-crawling a single domain 2024-07-05 15:31:55 +02:00
Viktor Lofgren
c9ee0c909e (download-sample) Set +x permissions on directories created during this job 2024-04-30 18:25:07 +02:00
Viktor Lofgren
b7d9a7ae89 (ngrams) Remove the vestigial logic for capturing permutations of n-grams
The change also reduces the object churn in NGramLexicon, as this is a very hot method in the converter.
2024-04-11 18:12:01 +02:00
Viktor Lofgren
0bd3365c24 (convert) Initial integration of segmentation data into the converter's keyword extraction logic 2024-03-19 14:28:42 +01:00
Viktor Lofgren
afc047cd27 (control) GUI for exporting segmentation data from a wikipedia zim 2024-03-18 13:45:23 +01:00
Viktor Lofgren
9f1649636e Clean up documentation and rename domain-links to link-graph 2024-02-28 11:40:39 +01:00
Viktor Lofgren
1d34224416 (refac) Remove src/main from all source code paths.
Look, this will make the git history look funny, but trimming unnecessary depth from the source tree is a very necessary sanity-preserving measure when dealing with a super-modularized codebase like this one.

While it makes the project configuration a bit less conventional, it will save you several clicks every time you jump between modules.  Which you'll do a lot, because it's *modul*ar.  The src/main/java convention makes a lot of sense for a non-modular project though.  This ain't that.
2024-02-23 16:13:40 +01:00