MarginaliaSearch/code/libraries/easy-lsh

Easy LSH

This is a simple locality-sensitive hash (LSH) for document deduplication. Hashes are compared using their Hamming distance: the fewer bits that differ, the more similar the documents.
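As a rough illustration of the comparison step, the Hamming distance between two 64-bit hashes is the number of set bits in their XOR. The class `HammingSketch` below is hypothetical and not part of the library; it only assumes the hashes are plain `long` values, as returned by `get()` in the demo.

```java
// Hypothetical helper for illustration; not part of easy-lsh.
final class HammingSketch {
    /** Number of bit positions in which a and b differ. */
    static int hammingDistance(long a, long b) {
        // XOR leaves a 1 in every position where the inputs disagree.
        return Long.bitCount(a ^ b);
    }

    public static void main(String[] args) {
        long a = 0b1011L;
        long b = 0b0010L;
        // The values differ in bit 0 and bit 3.
        System.out.println(hammingDistance(a, b)); // prints 2
    }
}
```

A deduplication pass would then treat two documents as near-duplicates when this distance falls below some threshold chosen for the corpus.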

Central Classes

EasyLSH

Demo

Consider statistical distribution only

var lsh1 = new EasyLSH();
lsh1.addUnordered("lorem");
lsh1.addUnordered("ipsum");
lsh1.addUnordered("dolor");
lsh1.addUnordered("sit");
lsh1.addUnordered("amet");

long hash1 = lsh1.get();

var lsh2 = new EasyLSH();
lsh2.addUnordered("amet");
lsh2.addUnordered("ipsum");
lsh2.addUnordered("lorem");
lsh2.addUnordered("dolor");
lsh2.addUnordered("SEAT");

long hash2 = lsh2.get();

System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
// 1 -- these are similar
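To give an intuition for why an unordered hash ignores word order, here is a minimal bit-voting sketch in the style of SimHash. It is an assumption made for explanation only and is not taken from the EasyLSH source; the class `BitVoteSketch`, its `mix` function, and its `hash` method are all hypothetical names.

```java
import java.util.List;

// Illustrative bit-voting hash; an assumption, not the EasyLSH implementation.
final class BitVoteSketch {
    // SplitMix64 finalizer, used here to spread a 32-bit hashCode over 64 bits.
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    static long hash(List<String> tokens) {
        int[] votes = new int[64];
        for (String token : tokens) {
            long h = mix(token.hashCode());
            for (int bit = 0; bit < 64; bit++) {
                // Each token votes +1 or -1 on every bit position.
                votes[bit] += ((h >>> bit) & 1L) == 1L ? 1 : -1;
            }
        }
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            // A bit is set when more tokens voted for it than against it.
            if (votes[bit] > 0) result |= 1L << bit;
        }
        return result;
    }

    public static void main(String[] args) {
        long a = hash(List.of("lorem", "ipsum", "dolor", "sit", "amet"));
        long b = hash(List.of("amet", "sit", "dolor", "ipsum", "lorem"));
        // The votes are plain sums, so reordering the tokens cannot change them.
        System.out.println(a == b); // prints true
    }
}
```

Because each token only nudges the per-bit tallies, swapping a single word tends to flip few bits, which is the property the demo above relies on.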

Consider order as well as distribution

var lsh1 = new EasyLSH();
lsh1.addOrdered("lorem");
lsh1.addOrdered("ipsum");
lsh1.addOrdered("dolor");
lsh1.addOrdered("sit");
lsh1.addOrdered("amet");

long hash1 = lsh1.get();

var lsh2 = new EasyLSH();
lsh2.addOrdered("amet");
lsh2.addOrdered("ipsum");
lsh2.addOrdered("lorem");
lsh2.addOrdered("dolor");
lsh2.addOrdered("SEAT");


long hash2 = lsh2.get();

System.out.println(EasyLSH.hammingDistance(lsh1, lsh2));
// 5 -- these are not very similar

// Note: the value is relatively low because there are so few words that
// there simply can't be very many differences.
// It will approach 32 (half of the 64 bits in the hash) as documents grow larger.
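The figure of 32 can be checked with a small simulation. The sketch below assumes that the hashes of unrelated documents behave like independent, uniformly random 64-bit values, which is a modelling assumption rather than something the library guarantees. `RandomDistanceSketch` and `averageDistance` are hypothetical names.

```java
import java.util.Random;

// Illustrative simulation; models unrelated hashes as random 64-bit values.
final class RandomDistanceSketch {
    /** Mean Hamming distance over the given number of random pairs. */
    static double averageDistance(int samples, long seed) {
        Random rng = new Random(seed);
        long total = 0;
        for (int i = 0; i < samples; i++) {
            // Each of the 64 bits differs with probability 1/2.
            total += Long.bitCount(rng.nextLong() ^ rng.nextLong());
        }
        return (double) total / samples;
    }

    public static void main(String[] args) {
        // Expected to print a value close to 32.
        System.out.println(averageDistance(100_000, 42L));
    }
}
```

Under that assumption, a distance near 32 carries no evidence of similarity, while distances well below it suggest shared content.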