## Data-mining: it’s all about the data

By Paul M Taylor, on 4 May 2010

Marooned in NY last week, I made the trek uptown for a meeting at Columbia. It was about data-mining, which isn’t really my field, and I might not have gone had circumstances been different. It was interesting, though, because I think of data-mining as an applied field of statistics, and so assume that the questions are primarily about approaches: complex mathematical arguments about the applicability or power of different algorithms or techniques. In this meeting, however, the conversation never reached a mathematical concept more complex than a percentage or a standard test of significance. Instead, the discussion was all about the data. What does it mean? How was it gathered? Does it really mean what we think it means?

The project was looking at data about patients who had had a heart attack in hospital. Specifically, the group was examining the observations and comments made in the two days before the event, to see if there was some signature or pattern that could be used as a warning of the event.

Three interesting observations:

(1) One idea is to look not at the content of observations but at their timing and frequency. In an earlier project the group had assessed the number of comments or observations made about a patient. The first pass at that analysis revealed something seemingly odd: a number of patients died without there being anything in the notes to suggest that the patient was ill at all, never mind critically ill. No signs, no symptoms, no tests, nothing. A quick review of the notes for these patients revealed that they were patients who – if not actually dead on arrival – died within minutes. So the absence of information was not a sign of an absence of concern; rather, the speed of the crisis altered the requirement for documenting the case.
(2) A previous attempt to identify a predictive signature from the record found that the information most predictive of the outcome wasn’t useful, because it didn’t tell you anything you didn’t already know. So, if a doctor orders a test for TB, and this is – to an extent – predictive of TB, well, no surprise. The things that would have been useful – that had been somehow hidden in the data – were only weakly predictive. And how do you use that information clinically, if the test has an AUC (area under the ROC curve) that is significantly greater than 0.5 but is still only 0.6?
(3) One analysis had involved looking at the comments associated with observations. Unsurprisingly, most comments were made about observations that were outside the normal range. Except for oxygen: most comments about oxygen were made about patients who were in the normal range! So what does that tell us? Well, one thing might be that ‘normal’ is a context-dependent term, so that for these patients to be in that range was not normal, or at least was an event that required some documentation.
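To make point (2) concrete, here is a minimal sketch (not the group’s actual analysis) of what an AUC of 0.6 means in practice. AUC is the probability that a randomly chosen positive case scores higher than a randomly chosen negative case; 0.5 is chance. The simulated data below is entirely hypothetical – the positives are drawn from a distribution shifted only slightly from the negatives, a shift chosen so the resulting AUC lands near 0.6 – which is why such a marker is so hard to act on clinically.

```python
import random

def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outranks a
    random negative (ties count as half a win)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

random.seed(0)
# Hypothetical 'weakly predictive' marker: positive cases are drawn
# from a distribution shifted by only 0.36 standard deviations,
# which corresponds to an AUC of about 0.6.
negatives = [random.gauss(0.0, 1.0) for _ in range(1000)]
positives = [random.gauss(0.36, 1.0) for _ in range(1000)]

print(auc(positives, negatives))
```

Run this and the AUC comes out close to 0.6: the marker genuinely carries signal, yet in four cases out of ten a sick patient will still score below a well one, which is exactly the clinical dilemma described above.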

All in all, it reinforced the impression that health informatics is all about the data. And it doesn’t always mean what it might be thought to mean.