UCL CHIME

Archive for the 'Informatics' Category

CHIME in the news

By Henry W W Potts, on 10 March 2011

Carl and Jeremy’s paper on open source approaches in health was covered by TechEye and some other blogs.

Our BMJ paper on the ethnic attainment gap in academic performance in medicine attracted a certain amount of coverage, with the story being covered by the Press Association and GP.

And a talk of mine on the adoption of m-health at Health Innovation Expo 2011 resulted in this GP article. Based on research using the North Central London Anticoagulation and Stroke Prevention Service as a case study, we had concluded that PCTs lacked the necessary skills to make the best judgments when it came to technology adoption. In questions after the talk, I developed the argument that smaller GP consortia will be even more poorly placed in this regard.

Another big government computer programme, another damning report

By Henry W W Potts, on 1 February 2011

The Commons Public Accounts Committee released today a damning report of HM Revenue and Customs’ new National Insurance and PAYE Service (NPS) system. It describes huge backlogs and budget over-runs. The story is all too familiar, as Margaret Hodge, Chair of the Committee, acknowledged. Implementation proves slower than expected, problems with data quality… all of which we’ve heard before with Connecting for Health.

So, what can the government do in future? Hodge suggests better training for civil servants and more job stability so that someone is in charge and takes responsibility. In CHIME, we’ve published before on the dangers of techno-utopianism and the need to recognise the difficulties in implementation. After so many examples, there should be no excuse for a government department going into a big IT contract with its eyes shut to the challenges ahead.

Recent and forthcoming CHIME publications

By Henry W W Potts, on 10 January 2011

To welcome in 2011, here’s a brief overview of some recent or forthcoming publications by CHIME staff.

Taylor P, Potts H, Wilkinson L, Given-Wilson R (2010). Impact of CAD with full field digital mammography on workflow and cost. Lecture Notes in Computer Science, 6136, 1-8.

Taylor P (2010). Modeling the impact of CAD on the outcomes of the UK Breast Screening Programme. Lecture Notes in Computer Science, 6136, 129-36.

… are two papers from Paul Taylor continuing to explore CAD, computer-aided diagnosis, in mammography.

Michie S, Rubin J, Potts H (2010). The role of media reporting in determining public worry during the swine flu outbreak. Psychology and Health, 25(S1), 122-3.

… is a published abstract continuing our work on the psychology of the swine flu pandemic. We hope to have a paper accepted in Vaccine soon too.

Tingle J, Bark P (eds.) (2011). Patient Safety, Law Policy and Practice. Routledge. ISBN: 0415557313. 220 pages.

… is due in March. This book, co-edited and with chapters by Pippa Bark, explores the impact of legal systems on patient safety initiatives. More on this as we get closer to publication.

The Digested Read: Every PhD in Health Informatics

By Paul M Taylor, on 10 November 2010

No one has time to read other people’s PhD theses. So, inspired by the Guardian’s Digested Read column, we’ve decided to publish succinct – very succinct – summaries of the major categories of Health Informatics PhD.

Today, in issue 1 of the series, we deal with classification problems.

Using a Fashionable New Computer Science Paradigm to Classify Cases of an Important Disease

Background
This disease is a major health problem and one where diagnostic or management errors can occur. Previous attempts to apply computers to classify cases of this disease have achieved promising results (classification accuracy of over 70% is commonly claimed) but haven’t yet led to clinical take-up. A fashionable new computer science paradigm seems to offer an exciting and novel approach to this problem.

Method
An innovative algorithm based on the fashionable new computer science paradigm was implemented by the candidate. Applying the algorithm to a publicly available dataset with questionable ground truth achieved a classification accuracy of barely more than 60%. Successive, increasingly desperate, and, in the end, almost random modifications allow a classification accuracy of nearly 70% to be claimed.

Results
A new, but disappointingly small, dataset was collected in collaboration with the local clinical team. On this dataset, the modified algorithm achieved an accuracy of just less than 60% when tested against the consensus opinion of clinicians. Closer analysis identified a subset of interesting cases where junior clinicians only achieved agreement levels of 58% when assessed against each other.

Conclusions
The application of the fashionable new computer science paradigm to this important disease seems highly promising.

Java example code for common similarity algorithms used in data mining

By , on 28 June 2010

I am currently researching record linkage techniques and so had to remind myself of some of the algorithms used in and around the area, to help evaluate the products I have at my disposal. This work has been undertaken as part of the DHICE project at UCL, funded by the Wellcome Trust.

My typical approach to understanding algorithms is to code them up so that I can “play” with them. In the process of researching each technique, I noticed that in most cases only pseudocode was available. So I have decided to share Java examples of the following algorithms:

Levenshtein
Damerau Levenshtein
Hamming
Jaro Winkler
Smith Waterman
N-Gram
Markov Chain

For each algorithm, I will provide a very basic description, the Java code (which you are free to play with), and links to the Wikipedia entry and any other sites I have used. Each piece of code is based on the explanation in Wikipedia and has been tested against the examples provided in each Wikipedia article and/or the other sites referenced. There is no warranty provided with this code; any changes, enhancements, corrections, or use of this code should be attributed to CHIME, UCL and shared with the community.

Levenshtein

The purpose of this algorithm is to measure the difference between two sequences/strings. It is based on the minimum number of single-character changes required to make one string equal to the other.

It is aimed at short strings; typical uses include spell checkers and optical character recognition.

Wikipedia entry can be found here:

[java]

public class Levenshtein
{
private String compOne;
private String compTwo;
private int[][] matrix;
private Boolean calculated = false;

public Levenshtein(String one, String two)
{
compOne = one;
compTwo = two;
}

public int getSimilarity()
{
if (!calculated)
{
setupMatrix();
}
return matrix[compOne.length()][compTwo.length()];
}

public int[][] getMatrix()
{
setupMatrix();
return matrix;
}

private void setupMatrix()
{
matrix = new int[compOne.length()+1][compTwo.length()+1];

for (int i = 0; i <= compOne.length(); i++)
{
matrix[i][0] = i;
}

for (int j = 0; j <= compTwo.length(); j++)
{
matrix[0][j] = j;
}

for (int i = 1; i < matrix.length; i++)
{
for (int j = 1; j < matrix[i].length; j++)
{
if (compOne.charAt(i-1) == compTwo.charAt(j-1))
{
matrix[i][j] = matrix[i-1][j-1];
}
else
{
int minimum = Integer.MAX_VALUE;
if ((matrix[i-1][j])+1 < minimum)
{
minimum = (matrix[i-1][j])+1;
}

if ((matrix[i][j-1])+1 < minimum)
{
minimum = (matrix[i][j-1])+1;
}

if ((matrix[i-1][j-1])+1 < minimum)
{
minimum = (matrix[i-1][j-1])+1;
}

matrix[i][j] = minimum;
}
}
}
calculated = true;
displayMatrix();
}

private void displayMatrix()
{
System.out.println(" " + compOne);
for (int y = 0; y <= compTwo.length(); y++)
{
if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
for (int x = 0; x <= compOne.length(); x++)
{
System.out.print(matrix[x][y]);
}
System.out.println();
}
}
}
[/java]
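
To see it in action, here is a compact, self-contained restatement of the same recurrence; the class and method names are mine, purely for illustration:

[java]
public class LevenshteinDemo
{
    // Minimum number of single-character insertions, deletions and
    // substitutions needed to turn a into b
    public static int distance(String a, String b)
    {
        int[][] d = new int[a.length()+1][b.length()+1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++)
        {
            for (int j = 1; j <= b.length(); j++)
            {
                int cost = (a.charAt(i-1) == b.charAt(j-1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i-1][j] + 1, d[i][j-1] + 1),
                                   d[i-1][j-1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args)
    {
        // The classic example: kitten -> sitting takes three edits
        System.out.println(distance("kitten", "sitting")); // prints 3
    }
}
[/java]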

Damerau Levenshtein

Similar to Levenshtein, Damerau-Levenshtein also calculates the distance between two strings. It is based on comparing two strings and counting the number of insertions, deletions, and substitutions of single characters, plus transpositions of two adjacent characters.

It was originally aimed at spell checkers; it is also used for DNA sequences.

The Wikipedia entry can be found here:

[java]

public class DamerauLevenshtein
{
private String compOne;
private String compTwo;
private int[][] matrix;
private Boolean calculated = false;

public DamerauLevenshtein(String a, String b)
{
compOne = a;
compTwo = b;
}

public int[][] getMatrix()
{
setupMatrix();
return matrix;
}

public int getSimilarity()
{
if (!calculated) setupMatrix();

return matrix[compOne.length()][compTwo.length()];
}

private void setupMatrix()
{
int cost = -1;
int del, sub, ins;

matrix = new int[compOne.length()+1][compTwo.length()+1];

for (int i = 0; i <= compOne.length(); i++)
{
matrix[i][0] = i;
}

for (int i = 0; i <= compTwo.length(); i++)
{
matrix[0][i] = i;
}

for (int i = 1; i <= compOne.length(); i++)
{
for (int j = 1; j <= compTwo.length(); j++)
{
if (compOne.charAt(i-1) == compTwo.charAt(j-1))
{
cost = 0;
}
else
{
cost = 1;
}

del = matrix[i-1][j]+1;
ins = matrix[i][j-1]+1;
sub = matrix[i-1][j-1]+cost;

matrix[i][j] = minimum(del,ins,sub);

if ((i > 1) && (j > 1) && (compOne.charAt(i-1) == compTwo.charAt(j-2)) && (compOne.charAt(i-2) == compTwo.charAt(j-1)))
{
matrix[i][j] = minimum(matrix[i][j], matrix[i-2][j-2]+cost);
}
}
}

calculated = true;
displayMatrix();
}

private void displayMatrix()
{
System.out.println(" " + compOne);
for (int y = 0; y <= compTwo.length(); y++)
{
if (y-1 < 0) System.out.print(" "); else System.out.print(compTwo.charAt(y-1));
for (int x = 0; x <= compOne.length(); x++)
{
System.out.print(matrix[x][y]);
}
System.out.println();
}
}

private int minimum(int d, int i, int s)
{
int m = Integer.MAX_VALUE;

if (d < m) m = d;
if (i < m) m = i;
if (s < m) m = s;

return m;
}

private int minimum(int d, int t)
{
int m = Integer.MAX_VALUE;

if (d < m) m = d;
if (t < m) m = t;

return m;
}
}
[/java]
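
As with Levenshtein, a compact, self-contained version of the same recurrence (the optimal string alignment variant implemented above) is handy for experimenting; names are mine:

[java]
public class DamerauDemo
{
    // Optimal string alignment distance: Levenshtein plus transpositions
    public static int distance(String a, String b)
    {
        int[][] d = new int[a.length()+1][b.length()+1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;

        for (int i = 1; i <= a.length(); i++)
        {
            for (int j = 1; j <= b.length(); j++)
            {
                int cost = (a.charAt(i-1) == b.charAt(j-1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i-1][j] + 1, d[i][j-1] + 1),
                                   d[i-1][j-1] + cost);
                if (i > 1 && j > 1 && a.charAt(i-1) == b.charAt(j-2)
                        && a.charAt(i-2) == b.charAt(j-1))
                {
                    // A swap of two adjacent characters counts as one edit
                    d[i][j] = Math.min(d[i][j], d[i-2][j-2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args)
    {
        System.out.println(distance("ab", "ba"));          // 1 (one transposition)
        System.out.println(distance("kitten", "sitting")); // 3, same as plain Levenshtein
    }
}
[/java]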

Hamming

This algorithm calculates the distance between two strings; however, they have to be of equal length.

It measures the minimum number of substitutions required for the two strings to be equal.

It is used in telecommunications (where it is also known as signal distance), and in systematics as a measure of genetic distance.

Wikipedia entry can be found here:

[java]

public class Hamming
{
private String compOne;
private String compTwo;

public Hamming(String one, String two)
{
compOne = one;
compTwo = two;
}

public int getHammingDistance()
{
if (compOne.length() != compTwo.length())
{
return -1;
}

int counter = 0;

for (int i = 0; i < compOne.length(); i++)
{
if (compOne.charAt(i) != compTwo.charAt(i)) counter++;
}

return counter;
}

///
// Hamming distance works best with binary comparisons; this function takes a string array of binary
// values and returns the minimum distance value
///
public int minDistance(String[] numbers)
{
int minDistance = Integer.MAX_VALUE;

if (checkConstraints(numbers))
{
for (int i = 1; i < numbers.length; i++)
{
int counter = 0;
for (int j = 1; j <= numbers[i].length(); j++)
{
if (numbers[i-1].charAt(j-1) != numbers[i].charAt(j-1))
{
counter++;
}
}

if (counter == 0) return counter;
if (counter < minDistance) minDistance = counter;
}
}
else
{
return -1;
}

return minDistance;
}

private Boolean checkConstraints(String[] numbers)
{
if (numbers.length > 1 && numbers.length <=50)
{
int prevLength = -1;
for (int i = 0; i < numbers.length; i++)
{
if (numbers[i].length() > 0 && numbers[i].length() <= 50)
{
if (prevLength == -1)
{
prevLength = numbers[i].length();
}
else
{
if (prevLength != numbers[i].length())
{
return false;
}
}

for (int j = 0; j < numbers[i].length(); j++)
{
if (numbers[i].charAt(j) != '0' && numbers[i].charAt(j) != '1')
{
return false;
}
}
}
else
{
return false;
}
}
}
else
{
return false;
}

return true;
}
}
[/java]
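
A minimal, self-contained sketch of the basic distance for experimenting (names mine; like the class above, it returns -1 for strings of unequal length):

[java]
public class HammingDemo
{
    // Number of positions at which the two equal-length strings differ
    public static int distance(String a, String b)
    {
        if (a.length() != b.length()) return -1;

        int count = 0;
        for (int i = 0; i < a.length(); i++)
        {
            if (a.charAt(i) != b.charAt(i)) count++;
        }
        return count;
    }

    public static void main(String[] args)
    {
        System.out.println(distance("karolin", "kathrin")); // 3
        System.out.println(distance("1011101", "1001001")); // 2
    }
}
[/java]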

Jaro Winkler

This algorithm was designed specifically for record linkage, particularly for linking short strings. It calculates a normalised score for the similarity between two strings.

The calculation is based on the number of matching characters held within the strings and the number of transpositions.

The Wikipedia entry can be found here; the other source was LingPipe:

[java]

public class JaroWinkler
{
private String compOne;
private String compTwo;

private String theMatchA = "";
private String theMatchB = "";
private int mRange = -1;

public JaroWinkler()
{
}

public JaroWinkler(String s1, String s2)
{
compOne = s1;
compTwo = s2;
}

public double getSimilarity(String s1, String s2)
{
compOne = s1;
compTwo = s2;

mRange = Math.max(compOne.length(), compTwo.length()) / 2 - 1;

double res = -1;

int m = getMatch();
if (m == 0) return 0.0; // no characters match at all

// Jaro uses half the number of transpositions
int t = getMissMatch(compOne, compTwo) / 2;

int l1 = compOne.length();
int l2 = compTwo.length();

double f = 1.0 / 3.0; // the three terms are weighted equally
double mt = (double)(m-t)/m;
double jw = f * ((double)m/l1+(double)m/l2+(double)mt);
res = jw + getCommonPrefix(compOne,compTwo) * (0.1*(1.0 - jw));

return res;
}

private int getMatch()
{

theMatchA = "";
theMatchB = "";

int matches = 0;

for (int i = 0; i < compOne.length(); i++)
{
//Look backward
int counter = 0;
while(counter <= mRange && counter <= i && (i - counter) < compTwo.length())
{
if (compOne.charAt(i) == compTwo.charAt(i - counter))
{
matches++;
theMatchA = theMatchA + compOne.charAt(i);
theMatchB = theMatchB + compTwo.charAt(i);
}
counter++;
}

//Look forward
counter = 1;
while(counter <= mRange && i < compTwo.length() && counter + i < compTwo.length())
{
if (compOne.charAt(i) == compTwo.charAt(i + counter))
{
matches++;
theMatchA = theMatchA + compOne.charAt(i);
theMatchB = theMatchB + compTwo.charAt(i);
}
counter++;
}
}
return matches;
}

private int getMissMatch(String s1, String s2)
{
int transPositions = 0;

for (int i = 0; i < theMatchA.length(); i++)
{
//Look Backward
int counter = 0;
while(counter <= mRange && counter <= i)
{
if (theMatchA.charAt(i) == theMatchB.charAt(i - counter) && counter > 0)
{
transPositions++;
}
counter++;
}

//Look forward
counter = 1;
while(counter <= mRange && i < theMatchB.length() && (counter + i) < theMatchB.length())
{
if (theMatchA.charAt(i) == theMatchB.charAt(i + counter) && counter > 0)
{
transPositions++;
}
counter++;
}
}
return transPositions;
}

private int getCommonPrefix(String compOne, String compTwo)
{
int cp = 0;
// Winkler caps the common prefix at four characters; stop at the first mismatch
int limit = Math.min(4, Math.min(compOne.length(), compTwo.length()));
for (int i = 0; i < limit; i++)
{
if (compOne.charAt(i) == compTwo.charAt(i)) cp++; else break;
}
return cp;
}
}
[/java]
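
Because my implementation above follows the sources fairly loosely, here is a self-contained sketch of the textbook formulation, useful for sanity-checking its output. Names are mine; the 0.1 prefix weight and four-character prefix cap are the standard Winkler choices:

[java]
public class JaroWinklerDemo
{
    // Textbook Jaro similarity
    public static double jaro(String s1, String s2)
    {
        if (s1.equals(s2)) return 1.0;
        int range = Math.max(s1.length(), s2.length()) / 2 - 1;
        boolean[] m1 = new boolean[s1.length()];
        boolean[] m2 = new boolean[s2.length()];
        int matches = 0;
        for (int i = 0; i < s1.length(); i++)
        {
            int lo = Math.max(0, i - range);
            int hi = Math.min(s2.length() - 1, i + range);
            for (int j = lo; j <= hi; j++)
            {
                // Each character in s2 may be matched at most once
                if (!m2[j] && s1.charAt(i) == s2.charAt(j))
                {
                    m1[i] = true;
                    m2[j] = true;
                    matches++;
                    break;
                }
            }
        }
        if (matches == 0) return 0.0;

        // Count transpositions among the matched characters
        int t = 0, k = 0;
        for (int i = 0; i < s1.length(); i++)
        {
            if (m1[i])
            {
                while (!m2[k]) k++;
                if (s1.charAt(i) != s2.charAt(k)) t++;
                k++;
            }
        }
        double m = matches;
        return (m / s1.length() + m / s2.length() + (m - t / 2.0) / m) / 3.0;
    }

    // Winkler's prefix boost
    public static double similarity(String s1, String s2)
    {
        double j = jaro(s1, s2);
        int prefix = 0;
        int limit = Math.min(4, Math.min(s1.length(), s2.length()));
        for (int i = 0; i < limit && s1.charAt(i) == s2.charAt(i); i++) prefix++;
        return j + prefix * 0.1 * (1.0 - j);
    }

    public static void main(String[] args)
    {
        System.out.println(similarity("MARTHA", "MARHTA")); // about 0.961
    }
}
[/java]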

N-Gram

An n-gram model calculates the probability of the next term based on the previous n-1 terms. It is used in speech recognition, language identification, spell checking, etc.

The Wikipedia entry can be found here. Wikipedia, in this case, is a little limited on either pseudocode or an algorithm, so I referenced the following presentation (slides 27/28) from CRULP on different algorithms used in spell checking. Further reading can be found in Chapter 6 of Christopher D Manning and Hinrich Schütze's book Foundations of Statistical Natural Language Processing. N-gram models are very similar to Markov models.

Here is the code:

[java]

import java.util.ArrayList;
import java.util.List;

public class Ngram
{
private class result
{
private String theWord;
private int theCount;

public result(String w, int c)
{
theWord = w;
theCount = c;
}

public void setTheCount(int c)
{
theCount = c;
}

public String getTheWord()
{
return theWord;
}

public int getTheCount()
{
return theCount;
}
}

private List<result> results;

public Ngram()
{
results = new ArrayList<result>();
}
public Ngram(String str, int n)
{

}

public double getSimilarity(String wordOne, String wordTwo, int n)
{
List<result> res1 = processString(wordOne, n);
//displayResult(res1);
List<result> res2 = processString(wordTwo, n);
//displayResult(res2);
int c = common(res1,res2);
int u = union(res1,res2);
double sim = (double)c/(double)u;

return sim;
}

private int common(List<result> One, List<result> Two)
{
int res = 0;

for (int i = 0; i < One.size(); i++)
{
for (int j = 0; j < Two.size(); j++)
{
if (One.get(i).theWord.equalsIgnoreCase(Two.get(j).theWord)) res++;
}
}

return res;
}

private int union(List<result> One, List<result> Two)
{
// Copy the first list so the caller's list is not modified
List<result> t = new ArrayList<result>(One);

for (int i = 0; i < Two.size(); i++)
{
boolean found = false;
for (int j = 0; j < t.size() && !found; j++)
{
if (Two.get(i).theWord.equalsIgnoreCase(t.get(j).theWord))
{
found = true;
}
}

if (!found)
{
t.add(Two.get(i));
}
}

return t.size();
}

private List<result> processString(String c, int n)
{
List<result> t = new ArrayList<result>();

// Pad the string so leading and trailing characters form full n-grams
String spacer = "";
for (int i = 0; i < n-1; i++)
{
spacer = spacer + "%";
}
c = spacer + c + spacer;

for (int i = 0; i <= c.length() - n; i++)
{
String gram = c.substring(i, n+i);
int pos = contains(t, gram);
if (pos >= 0)
{
t.get(pos).setTheCount(t.get(pos).getTheCount()+1);
}
else
{
t.add(new result(gram, 1));
}
}
return t;
}

// Returns the index of the n-gram in the list, or -1 if it is not present
private int contains(List<result> list, String c)
{
for (int i = 0; i < list.size(); i++)
{
if (list.get(i).theWord.equalsIgnoreCase(c))
return i;
}
return -1;
}

private void displayResult(List<result> d)
{
for (int i = 0; i < d.size(); i++)
{
System.out.println(d.get(i).theWord+" occurred "+d.get(i).theCount+" times");
}
}
}
[/java]
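
For string comparison (as opposed to language modelling), the n-gram similarity above boils down to: collect the n-grams of each string, then divide the shared n-grams by the total distinct n-grams. A simplified, set-based sketch (names mine):

[java]
import java.util.HashSet;
import java.util.Set;

public class NgramDemo
{
    // Character n-grams of s, padded with '%' as in the class above
    public static Set<String> ngrams(String s, int n)
    {
        String pad = "%".repeat(n - 1);
        s = pad + s + pad;
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + n <= s.length(); i++) grams.add(s.substring(i, i + n));
        return grams;
    }

    // Jaccard-style similarity: shared n-grams over all distinct n-grams
    public static double similarity(String a, String b, int n)
    {
        Set<String> ga = ngrams(a, n), gb = ngrams(b, n);
        Set<String> inter = new HashSet<String>(ga);
        inter.retainAll(gb);
        Set<String> union = new HashSet<String>(ga);
        union.addAll(gb);
        return (double) inter.size() / union.size();
    }

    public static void main(String[] args)
    {
        // "night" and "nacht" share the bigrams %n, ht and t% out of 9 in total
        System.out.println(similarity("night", "nacht", 2));
    }
}
[/java]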

Markov Chain

The Markov Chain model calculates the probability of the next letter occurring in a sequence based on the previous n characters.

They are used in a multitude of areas, including data compression, entropy encoding, and pattern recognition.

The Wikipedia entry can be found here. I based my algorithm on an excellent explanation found here, provided by Rutgers as part of their course Discrete and Probabilistic Models in Biology.

A warning: I think this document has a couple of mistakes in the matrix on page 10 (the counting of A, C, G, T), unless I am mistaken.

The final part this code still needs is a method that gives the probability of a sequence: you pass in a sequence and it returns the probability of that sequence based on the matrix that has been calculated.

The code:

[java]
import java.util.ArrayList;
import java.util.List;

public class Markov
{

private class similarities
{
private char c;
private int count;

public similarities(char a, int b)
{
c = a;
count = b;
}

public char getC()
{
return c;
}

public int getCount()
{
return count;
}

public void setCount(int a)
{
count = a;
}

public void increaseCount()
{
count++;
}
}

private List<similarities> entries;
private List<String> axies;
private double[][] matrix;

public Markov()
{
entries = new ArrayList<similarities>();
axies = new ArrayList<String>();
}

public void getSimilarity(List<String> s)
{
String temp = new String();
for (int i = 0; i < s.size(); i++)
{
if (temp.length() > 0)
{
temp = temp + " " + s.get(i);
}
else temp = s.get(i);

}
setupAxis(temp);
compareCharsAgainstAxis(temp);

display();

for (int i = 0; i < matrix.length; i++)
{
double line = getLineValue(i);
for (int j = 0; j < matrix[i].length; j++)
{
double t = matrix[i][j];

matrix[i][j] = t / line;
}
}

display();
}

private void compareCharsAgainstAxis(String s)
{
matrix = new double[axies.size()][axies.size()];
String comparison = "";
String test = s;

for (int h = 0; h < test.length(); h++)
{
int end = test.indexOf(" ");
if (end > 0)
{
comparison = test.substring(0,end);
}
else
{
comparison = test.substring(0,test.length());
end = test.length()-1;
}
test = test.substring(end+1,test.length());

for (int i = 1; i < comparison.length(); i++)
{
for (int j = 0; j < axies.size(); j++)
{
for (int k = 0; k < axies.size(); k++)
{
if (String.valueOf(comparison.charAt(i-1)).equalsIgnoreCase(axies.get(j)) && String.valueOf(comparison.charAt(i)).equalsIgnoreCase(axies.get(k)))
{
matrix[j][k]++;
}
}
}
}
}

}
private void setupAxis(String temp)
{
for (int j = 0; j < temp.length(); j++)
{
boolean found = false;

for (int k = 0; k < axies.size() && !found; k++)
{
if (String.valueOf(temp.charAt(j)).equalsIgnoreCase(axies.get(k)))
{
found = true;
}
}

if (!found)
{
if (temp.charAt(j) > 32)
{
axies.add(String.valueOf(temp.charAt(j)));
}
}
}
}

private double getTotalValue()
{
double sum = 0;
for (int i = 0; i < matrix.length; i++)
{
for (int j = 0; j < matrix[i].length; j++)
{
double t = matrix[i][j];
sum = sum + t;
}
}
return sum;
}

private double getLineValue(int line)
{
double sum = 0;
for (int i = 0; i < matrix[line].length; i++)
{
double t = matrix[line][i];
sum = sum + t;
}
return sum;
}

private void display()
{
System.out.println("Sum of matrix is " + getTotalValue());
System.out.println("Sum of line 0 = " + getLineValue(0));

System.out.print("  ");
for (int j = 0; j < axies.size(); j++)
{
System.out.print(axies.get(j) + " ");
}
System.out.println();
for (int i = 0; i < matrix.length; i++)
{
System.out.print(axies.get(i) + " ");
for (int j = 0; j < matrix[i].length; j++)
{
System.out.print(matrix[i][j] + " ");
}
System.out.println();
}
}
}
[/java]
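
As a sketch of that missing sequence-probability method: once the transition matrix has been normalised, the probability of a sequence is just the product of the transition probabilities along it. The two-letter alphabet and the numbers below are made up for illustration:

[java]
import java.util.Map;

public class MarkovDemo
{
    // Probability of a sequence = product of the transition probabilities
    // p[from][to] along consecutive pairs of characters
    public static double sequenceProbability(String seq, double[][] p, Map<Character, Integer> index)
    {
        double prob = 1.0;
        for (int i = 1; i < seq.length(); i++)
        {
            prob *= p[index.get(seq.charAt(i - 1))][index.get(seq.charAt(i))];
        }
        return prob;
    }

    public static void main(String[] args)
    {
        // Hypothetical normalised transition matrix over the alphabet {A, B};
        // row = current character, column = next character
        Map<Character, Integer> index = Map.of('A', 0, 'B', 1);
        double[][] p = { {0.9, 0.1},
                         {0.5, 0.5} };
        System.out.println(sequenceProbability("AAB", p, index)); // 0.9 * 0.1
    }
}
[/java]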

I hope this proves useful.