MarginaliaSearch/code/tools/term-frequency-extractor

Term Frequency Extractor

Generates a term frequency dictionary file from a batch of crawl data.

Usage:

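# Directory holding the sample crawl data
PATH_TO_SAMPLES=run/samples/crawl-s
# Rewrite the /crawl root directory to the local sample directory
export JAVA_OPTS="-Dcrawl.rootDirRewrite=/crawl:${PATH_TO_SAMPLES}"

# Arguments: the crawl plan, then the output dictionary file
term-frequency-extractor "${PATH_TO_SAMPLES}/plan.yaml" out.dat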
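
For readers unfamiliar with the term, the idea behind a term frequency dictionary can be sketched as follows. This is not the tool's actual implementation: the class name `TermFrequencySketch`, the method `documentFrequencies`, the tokenization rule, and the choice to count each term once per document are all assumptions made for illustration. The real tool reads crawl data and writes a binary file, neither of which is shown here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the Marginalia code base.
class TermFrequencySketch {

    /** Counts, for each term, the number of documents in which it appears. */
    static Map<String, Integer> documentFrequencies(List<String> documents) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : documents) {
            // Track terms already seen in this document so each counts once.
            Set<String> seen = new HashSet<>();
            // Assumed tokenization: lowercase, split on non-letter/non-digit runs.
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!token.isEmpty() && seen.add(token)) {
                    counts.merge(token, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("The quick brown fox", "the lazy dog", "Quick quick dog");
        // Print one "term count" pair per line.
        documentFrequencies(docs).forEach((term, n) -> System.out.println(term + " " + n));
    }
}
```

Counting once per document rather than once per occurrence is a design choice of this sketch; a dictionary built this way is the usual input for weighting schemes such as TF-IDF.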

See Also