
Java example code of common similarity algorithms, used in data mining

By rmhiajd, on 28 June 2010

I am currently researching record linkage techniques and so had to remind myself of some of the algorithms used in and around the area to help evaluate the products I have at my disposal. This work has been undertaken as part of the DHICE project at UCL, funded by the Wellcome Trust.

My typical approach to understanding algorithms is to code them up so I am then able to “play” with them. In the process of researching each technique I noticed that, in most cases, only pseudocode was available. So I have decided to share Java examples of the following algorithms:

Levenshtein
Damerau Levenshtein
Hamming
Jaro Winkler
Smith Waterman
N-Gram
Markov Chain

For each algorithm I will provide a very basic description and then the Java code, which you are free to play with. Further to this, I will provide links to the Wikipedia entry for the algorithm and any other sites I have sourced. Each piece of code is based around the explanation in Wikipedia and has been tested against the examples provided by each Wikipedia article and/or the other sites referenced. There is no warranty provided with this code; any changes, enhancements, corrections or use of this code should be attributed to CHIME, UCL and shared with the community.

Levenshtein

The purpose of this algorithm is to measure the difference between two sequences/strings. It is based on the minimum number of single-character changes (insertions, deletions or substitutions) required to make one string equal to the other.

It is aimed at short strings; typical uses are spell checkers, optical character recognition, etc.

The Wikipedia entry can be found here:

[java]

public class Levenshtein
{
private String compOne;
private String compTwo;
private int[][] matrix;
private Boolean calculated = false;

public Levenshtein(String one, String two)
{
compOne = one;
compTwo = two;
}

public int getSimilarity()
{
if (!calculated)
{
setupMatrix();
}
return matrix[compOne.length()][compTwo.length()];
}

public int[][] getMatrix()
{
setupMatrix();
return matrix;
}

private void setupMatrix()
{
matrix = new int[compOne.length()+1][compTwo.length()+1];

for (int i = 0; i <= compOne.length(); i++)
{
matrix[i][0] = i;
}

for (int j = 0; j <= compTwo.length(); j++)
{
matrix[0][j] = j;
}

for (int i = 1; i < matrix.length; i++)
{
for (int j = 1; j < matrix[i].length; j++)
{
if (compOne.charAt(i-1) == compTwo.charAt(j-1))
{
matrix[i][j] = matrix[i-1][j-1];
}
else
{
int minimum = Integer.MAX_VALUE;
if ((matrix[i-1][j])+1 < minimum)
{
minimum = (matrix[i-1][j])+1;
}

if ((matrix[i][j-1])+1 < minimum)
{
minimum = (matrix[i][j-1])+1;
}

if ((matrix[i-1][j-1])+1 < minimum)
{
minimum = (matrix[i-1][j-1])+1;
}

matrix[i][j] = minimum;
}
}
}
calculated = true;
displayMatrix();
}

private void displayMatrix()
{
System.out.println(" " + compOne);
for (int y = 0; y <= compTwo.length(); y++)
{
if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
for (int x = 0; x <= compOne.length(); x++)
{
System.out.print(matrix[x][y]);
}
System.out.println();
}
}
}
[/java]
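
As a quick sanity check, here is a small driver along the following lines (the LevenshteinExample class is only an illustration of mine, not part of the original code). It should report a distance of 3 for the classic "kitten"/"sitting" pair; note that getSimilarity() also prints the working matrix via displayMatrix().

[java]

public class LevenshteinExample
{
    public static void main(String[] args)
    {
        // "kitten" -> "sitting" needs three edits: k->s, e->i and an inserted g
        Levenshtein lev = new Levenshtein("kitten", "sitting");
        System.out.println("Levenshtein distance: " + lev.getSimilarity());
    }
}
[/java]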

Damerau Levenshtein

Similar to Levenshtein, Damerau-Levenshtein also calculates the distance between two strings. It is based around comparing the two strings and counting the number of insertions, deletions and substitutions of single characters, plus transpositions of two adjacent characters.

This was originally aimed at spell checkers; it is also used for DNA sequences.

The Wikipedia entry can be found here:

[java]

public class DamerauLevenshtein
{
private String compOne;
private String compTwo;
private int[][] matrix;
private Boolean calculated = false;

public DamerauLevenshtein(String a, String b)
{
compOne = a;
compTwo = b;
}

public int[][] getMatrix()
{
setupMatrix();
return matrix;
}

public int getSimilarity()
{
if (!calculated) setupMatrix();

return matrix[compOne.length()][compTwo.length()];
}

private void setupMatrix()
{
int cost = -1;
int del, sub, ins;

matrix = new int[compOne.length()+1][compTwo.length()+1];

for (int i = 0; i <= compOne.length(); i++)
{
matrix[i][0] = i;
}

for (int i = 0; i <= compTwo.length(); i++)
{
matrix[0][i] = i;
}

for (int i = 1; i <= compOne.length(); i++)
{
for (int j = 1; j <= compTwo.length(); j++)
{
if (compOne.charAt(i-1) == compTwo.charAt(j-1))
{
cost = 0;
}
else
{
cost = 1;
}

del = matrix[i-1][j]+1;
ins = matrix[i][j-1]+1;
sub = matrix[i-1][j-1]+cost;

matrix[i][j] = minimum(del,ins,sub);

if ((i > 1) && (j > 1) && (compOne.charAt(i-1) == compTwo.charAt(j-2)) && (compOne.charAt(i-2) == compTwo.charAt(j-1)))
{
matrix[i][j] = minimum(matrix[i][j], matrix[i-2][j-2]+cost);
}
}
}

calculated = true;
displayMatrix();
}

private void displayMatrix()
{
System.out.println(" " + compOne);
for (int y = 0; y <= compTwo.length(); y++)
{
if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
for (int x = 0; x <= compOne.length(); x++)
{
System.out.print(matrix[x][y]);
}
System.out.println();
}
}

private int minimum(int d, int i, int s)
{
int m = Integer.MAX_VALUE;

if (d < m) m = d;
if (i < m) m = i;
if (s < m) m = s;

return m;
}

private int minimum(int d, int t)
{
int m = Integer.MAX_VALUE;

if (d < m) m = d;
if (t < m) m = t;

return m;
}
}
[/java]
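
Again as an illustration (the DamerauLevenshteinExample class is mine, not part of the original post), swapping two adjacent characters should cost a single edit here, whereas plain Levenshtein would count two:

[java]

public class DamerauLevenshteinExample
{
    public static void main(String[] args)
    {
        // "abcd" -> "acbd" is a single transposition of b and c, so the distance should be 1
        DamerauLevenshtein dl = new DamerauLevenshtein("abcd", "acbd");
        System.out.println("Damerau-Levenshtein distance: " + dl.getSimilarity());
    }
}
[/java]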

Hamming

This algorithm calculates the distance between two strings; however, they have to be of equal length.

It measures the minimum number of substitutions required to make the two strings equal.

It is used in telecommunications (where it is also known as the signal distance) and in systematics as a measure of genetic distance.

The Wikipedia entry can be found here:

[java]

public class Hamming
{
private String compOne;
private String compTwo;

public Hamming(String one, String two)
{
compOne = one;
compTwo = two;
}

public int getHammingDistance()
{
if (compOne.length() != compTwo.length())
{
return -1;
}

int counter = 0;

for (int i = 0; i < compOne.length(); i++)
{
if (compOne.charAt(i) != compTwo.charAt(i)) counter++;
}

return counter;
}

///
// Hamming distance works best with binary comparisons; this function takes a string array of binary
// values and returns the minimum distance between consecutive values
///
public int minDistance(String[] numbers)
{
int minDistance = Integer.MAX_VALUE;

if (checkConstraints(numbers))
{
for (int i = 1; i < numbers.length; i++)
{
int counter = 0;
for (int j = 1; j <= numbers[i].length(); j++)
{
if (numbers[i-1].charAt(j-1) != numbers[i].charAt(j-1))
{
counter++;
}
}

if (counter == 0) return counter;
if (counter < minDistance) minDistance = counter;
}
}
else
{
return -1;
}

return minDistance;
}

private Boolean checkConstraints(String[] numbers)
{
if (numbers.length > 1 && numbers.length <=50)
{
int prevLength = -1;
for (int i = 0; i < numbers.length; i++)
{
if (numbers[i].length() > 0 && numbers[i].length() <= 50)
{
if (prevLength == -1)
{
prevLength = numbers[i].length();
}
else
{
if (prevLength != numbers[i].length())
{
return false;
}
}

for (int j = 0; j < numbers[i].length(); j++)
{
if (numbers[i].charAt(j) != '0' && numbers[i].charAt(j) != '1')
{
return false;
}
}
}
else
{
return false;
}
}
}
else
{
return false;
}

return true;
}
}
[/java]
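
A short, illustrative driver (HammingExample is my own name, not part of the original code) using the well-known "karolin"/"kathrin" pair, which differ in three positions, plus the binary minDistance helper:

[java]

public class HammingExample
{
    public static void main(String[] args)
    {
        Hamming h = new Hamming("karolin", "kathrin");
        System.out.println("Hamming distance: " + h.getHammingDistance()); // expected 3

        // minDistance compares each binary string with the previous one in the array
        String[] binary = {"0110", "0100", "1111"};
        System.out.println("Minimum distance: " + h.minDistance(binary)); // expected 1
    }
}
[/java]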

Jaro Winkler

This algorithm was purposely designed for record linkage and for comparing short strings. It calculates a normalised similarity score between two strings.

The calculation is based on the number of matching characters (m) held within the strings and the number of transpositions (t); the Jaro score averages m/|s1|, m/|s2| and (m-t)/m, and the Winkler adjustment then adds a bonus for a shared prefix.

The Wikipedia entry can be found here; the other source was LingPipe:

[java]

public class JaroWinkler
{
private String compOne;
private String compTwo;

private String theMatchA = "";
private String theMatchB = "";
private int mRange = -1;

public JaroWinkler()
{
}

public JaroWinkler(String s1, String s2)
{
compOne = s1;
compTwo = s2;
}

public double getSimilarity(String s1, String s2)
{
compOne = s1;
compTwo = s2;

mRange = Math.max(compOne.length(), compTwo.length()) / 2 - 1;

double res = -1;

int m = getMatch();
int t = 0;
if (getMissMatch(compTwo,compOne) > 0)
{
t = (getMissMatch(compOne,compTwo) / getMissMatch(compTwo,compOne));
}

int l1 = compOne.length();
int l2 = compTwo.length();

double f = 1.0 / 3.0; // each of the three Jaro components carries equal weight
double mt = (double)(m-t)/m;
double jw = f * ((double)m/l1+(double)m/l2+(double)mt);
res = jw + getCommonPrefix(compOne,compTwo) * (0.1*(1.0 - jw));

return res;
}

private int getMatch()
{

theMatchA = “”;
theMatchB = “”;

int matches = 0;

for (int i = 0; i < compOne.length(); i++)
{
//Look backward
int counter = 0;
while(counter <= mRange && counter <= i)
{
// guard the index so a longer first string cannot read past the end of the second
if ((i - counter) < compTwo.length() && compOne.charAt(i) == compTwo.charAt(i - counter))
{
matches++;
theMatchA = theMatchA + compOne.charAt(i);
theMatchB = theMatchB + compTwo.charAt(i);
}
counter++;
}

//Look forward
counter = 1;
while(counter <= mRange && i < compTwo.length() && counter + i < compTwo.length())
{
if (compOne.charAt(i) == compTwo.charAt(i + counter))
{
matches++;
theMatchA = theMatchA + compOne.charAt(i);
theMatchB = theMatchB + compTwo.charAt(i);
}
counter++;
}
}
return matches;
}

private int getMissMatch(String s1, String s2)
{
int transPositions = 0;

for (int i = 0; i < theMatchA.length(); i++)
{
//Look Backward
int counter = 0;
while(counter <= mRange && i >= 0 && counter <= i)
{
if (theMatchA.charAt(i) == theMatchB.charAt(i – counter) && counter > 0)
{
transPositions++;
}
counter++;
}

//Look forward
counter = 1;
while(counter <= mRange && i < theMatchB.length() && (counter + i) < theMatchB.length())
{
if (theMatchA.charAt(i) == theMatchB.charAt(i + counter) && counter > 0)
{
transPositions++;
}
counter++;
}
}
return transPositions;
}

private int getCommonPrefix(String compOne, String compTwo)
{
int cp = 0;
// Winkler's prefix bonus looks at the first four characters at most and stops at the first mismatch
int limit = Math.min(4, Math.min(compOne.length(), compTwo.length()));
for (int i = 0; i < limit; i++)
{
if (compOne.charAt(i) == compTwo.charAt(i)) cp++; else break;
}
return cp;
}
}
[/java]
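
To try it out, a driver such as the following (JaroWinklerExample is just an illustration) can be used with the worked example from Wikipedia, "MARTHA" and "MARHTA", which has a Jaro-Winkler score of roughly 0.961. Given the approximations in this class (for instance how transpositions are counted), treat the output as a rough check rather than a reference value:

[java]

public class JaroWinklerExample
{
    public static void main(String[] args)
    {
        JaroWinkler jw = new JaroWinkler();
        // expected to be close to 0.961 for this pair
        System.out.println("Similarity: " + jw.getSimilarity("MARTHA", "MARHTA"));
    }
}
[/java]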

N-Gram

An n-gram model calculates the probability of the next term based on the previous n-1 terms; n-gram overlap can also be used as a simple string-similarity measure, which is what the code below does. N-grams are used in speech recognition, phoneme modelling, language identification, etc.

The Wikipedia entry can be found here. Wikipedia, in this case, is a little limited on pseudocode and algorithmic detail, so I referenced the following presentation (slides 27/28) from CRULP discussing different algorithms used in spell checking. Further reading can be found in Chapter 6 of Christopher D. Manning and Hinrich Schütze's book Foundations of Statistical Natural Language Processing. N-grams are very similar to Markov models.

Here is the code:

[java]

import java.util.ArrayList;
import java.util.List;

public class Ngram
{
private class result
{
private String theWord;
private int theCount;

public result(String w, int c)
{
theWord = w;
theCount = c;
}

public void setTheCount(int c)
{
theCount = c;
}

public String getTheWord()
{
return theWord;
}

public int getTheCount()
{
return theCount;
}
}

private List<result> results;

public Ngram()
{
results = new ArrayList<result>();
}
public Ngram(String str, int n)
{

}

public double getSimilarity(String wordOne, String wordTwo, int n)
{
List<result> res1 = processString(wordOne, n);
//displayResult(res1);
List<result> res2 = processString(wordTwo, n);
//displayResult(res2);
int c = common(res1,res2);
int u = union(res1,res2);
double sim = (double)c/(double)u;

return sim;
}

private int common(List<result> One, List<result> Two)
{
int res = 0;

for (int i = 0; i < One.size(); i++)
{
for (int j = 0; j < Two.size(); j++)
{
if (One.get(i).theWord.equalsIgnoreCase(Two.get(j).theWord)) res++;
}
}

return res;
}

private int union(List<result> One, List<result> Two)
{
// copy the first list so the caller's list is not modified when the grams unique to Two are added
List<result> t = new ArrayList<result>(One);

for (int i = 0; i < Two.size(); i++)
{
boolean found = false;
for (int j = 0; j < t.size() && !found; j++)
{
if (Two.get(i).theWord.equalsIgnoreCase(t.get(j).theWord))
{
found = true;
}
}

if (!found)
{
result r = Two.get(i);
t.add(r);
}
}

return t.size();
}

private List<result> processString(String c, int n)
{
List<result> t = new ArrayList<result>();

// pad the string with '%' so that leading and trailing characters still appear in full n-grams
String spacer = "";
for (int i = 0; i < n-1; i++)
{
spacer = spacer + "%";
}
c = spacer + c + spacer;

// slide a window of length n across the padded string, counting each distinct n-gram
for (int i = 0; i <= c.length() - n; i++)
{
String gram = c.substring(i, i+n);
int pos = contains(t, gram);
if (pos >= 0)
{
t.get(pos).setTheCount(t.get(pos).getTheCount()+1);
}
else
{
t.add(new result(gram, 1));
}
}
return t;
}

private int contains(List<result> list, String c)
{
for (int i = 0; i < list.size(); i++)
{
if (list.get(i).theWord.equalsIgnoreCase(c))
return i;
}
return -1;
}

private void displayResult(List<result> d)
{
for (int i = 0; i < d.size(); i++)
{
System.out.println(d.get(i).theWord + " occurred " + d.get(i).theCount + " times");
}
}
}
[/java]
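
As an illustration (NgramExample is my own class, not part of the original), compare "night" and "nacht" with bigrams: with the '%' padding used above, the two words produce the bigrams %n, ni, ig, gh, ht, t% and %n, na, ac, ch, ht, t% respectively, so three of the nine distinct bigrams are shared and the similarity should come out at roughly 0.33:

[java]

public class NgramExample
{
    public static void main(String[] args)
    {
        Ngram ngram = new Ngram();
        double similarity = ngram.getSimilarity("night", "nacht", 2);
        System.out.println("Bigram similarity: " + similarity);
    }
}
[/java]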

Markov Chain

The Markov chain model calculates the probability of the next letter occurring in a sequence based on the preceding characters; the code below builds a first-order model, where only the immediately previous character is used.

They are used in a multitude of areas, including data compression, entropy encoding, and pattern recognition.

The Wikipedia entry can be found here. I based my algorithm on an excellent explanation found here, provided by Rutgers as part of their course Discrete and Probabilistic Models in Biology.

A warning: I think this document has a couple of mistakes in the matrix on page 10 (the counting of A, C, G, T), unless I am mistaken.

The final part still needed for this code is a method which gives the probability of a sequence; that is, you pass in a sequence and it returns the probability of that sequence based on the transition matrix the class has calculated. A sketch of such a method is included after the class below.

The code:

[java]
import java.util.ArrayList;
import java.util.List;

public class Markov
{

private class similarities
{
private char c;
private int count;

public similarities(char a, int b)
{
c = a;
count = b;
}

public char getC()
{
return c;
}

public int getCount()
{
return count;
}

public void setCount(int a)
{
count = a;
}

public void increaseCount()
{
count++;
}
}

private List<similarities> entries;
private List<String> axies;
private double[][] matrix;

public Markov()
{
entries = new ArrayList<similarities>();
axies = new ArrayList<String>();
}

public void getSimilarity(List<String> s)
{
String temp = new String();
for (int i = 0; i < s.size(); i++)
{
if (temp.length() > 0)
{
temp = temp + " " + s.get(i);
}
else temp = s.get(i);

}
setupAxis(temp);
compareCharsAgainstAxis(temp);

display();

for (int i = 0; i < matrix.length; i++)
{
double line = getLineValue(i);
for (int j = 0; j < matrix[i].length; j++)
{
double t = matrix[i][j];

matrix[i][j] = t / line;
}
}

display();
}

private void compareCharsAgainstAxis(String s)
{
matrix = new double[axies.size()][axies.size()];
String comparison = "";
String test = s;

for (int h = 0; h < test.length(); h++)
{
int end = test.indexOf(" ");
if (end > 0)
{
comparison = test.substring(0,end);
}
else
{
comparison = test.substring(0,test.length());
end = test.length()-1;
}
test = test.substring(end+1,test.length());

for (int i = 1; i < comparison.length(); i++)
{
for (int j = 0; j < axies.size(); j++)
{
for (int k = 0; k < axies.size(); k++)
{
if (String.valueOf(comparison.charAt(i-1)).equalsIgnoreCase(axies.get(j)) && String.valueOf(comparison.charAt(i)).equalsIgnoreCase(axies.get(k)))
{
matrix[j][k]++;
}
}
}
}
}

}
private void setupAxis(String temp)
{
for (int j = 0; j < temp.length(); j++)
{
boolean found = false;

for (int k = 0; k < axies.size() && !found; k++)
{
if (String.valueOf(temp.charAt(j)).equalsIgnoreCase(axies.get(k)))
{
found = true;
}
}

if (!found)
{
if (temp.charAt(j) > 32)
{
axies.add(String.valueOf(temp.charAt(j)));
}
}
}
}

private double getTotalValue()
{
double sum = 0;
for (int i = 0; i < matrix.length; i++)
{
for (int j = 0; j < matrix[i].length; j++)
{
double t = matrix[i][j];
sum = sum + t;
}
}
return sum;
}

private double getLineValue(int line)
{
double sum = 0;
for (int i = 0; i < matrix[line].length; i++)
{
double t = matrix[line][i];
sum = sum + t;
}
return sum;
}

private void display()
{
System.out.println("Sum of matrix is " + getTotalValue());
System.out.println("Sum of line 3 = " + getLineValue(3));

System.out.print(" ");
for (int j = 0; j < axies.size(); j++)
{
System.out.print(axies.get(j) + " ");
}
System.out.println();
for (int i = 0; i < matrix.length; i++)
{
System.out.print(axies.get(i) + " ");
for (int j = 0; j < matrix[i].length; j++)
{
System.out.print(matrix[i][j] + " ");
}
System.out.println();
}
}
}
[/java]
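
Finally, here is a sketch of the sequence-probability method mentioned above, written as an extra method you could add to the Markov class. It is my own addition, not part of the original code, and it assumes getSimilarity() has already been called so that the axies list and the normalised matrix are populated. It simply multiplies the transition probabilities along the sequence and ignores the probability of the initial character, which is a simplification:

[java]

// Possible addition to the Markov class (a sketch, not part of the original code)
public double getSequenceProbability(String sequence)
{
    double probability = 1.0;
    for (int i = 1; i < sequence.length(); i++)
    {
        // look up the row (previous character) and column (current character) in the axis list
        int from = axies.indexOf(String.valueOf(sequence.charAt(i - 1)));
        int to = axies.indexOf(String.valueOf(sequence.charAt(i)));

        if (from < 0 || to < 0) return 0.0; // character never seen while building the matrix

        probability = probability * matrix[from][to];
    }
    return probability;
}
[/java]

For example, after calling getSimilarity() with a list of training strings such as Arrays.asList("ACGTACGT", "ACGGTCA"), calling getSequenceProbability("ACG") would multiply the A-to-C and C-to-G transition probabilities from the matrix.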

I hope this proves useful!