Somewhat unsurprisingly, some would say, over the last two weeks I have been preoccupied with data.
More specifically, the notion of data having a life of its own.
This was the key theme of Prof. Charlotte Roueché’s talk at the Science & Engineering South event The Data Dialogue – At War with Data at King’s College on the 7th December. Citing a number of examples of data reuse, such as archaeological maps by British Armed Forces and aerial photographs of Aleppo taken in 1916 for military use now being used as an archaeological record, she argued that data develop a life of their own. This means that we need to make sure that the data we collect is of the best possible quality and well curated. It should meet the FAIR principles: Findable, Accessible, Interoperable and Re-usable. However, once we have released our data into the wild, we will never truly know how it will be used and by whom. Unfortunately, history has shown that not all re-use is benign.
This then raises the question: how open should our open data be? Is there a case for not disclosing some data if you know it could do harm? One example, in the current political climate, is the exact location of archaeological sites of religious significance.
This ties in to the two main themes, respect and value, of the National Statistician John Pullinger’s talk at The Turing Institute event Stats, Decision Making and Privacy on 5th December.
The key thing about respect is that data is about people and entities, and this should never be forgotten. People’s relationships with, and perceptions of, organisations who collect and process their data vary; as data analysts/scientists we should understand and respect this. This means being alive to what privacy means to individuals and entities, and to the context in which it is being discussed. It means caring about the security of the data and demonstrating this through good practices. It also means thinking about what we should do, not just what we could do, with the data available to us. This is very pertinent given the rise in the use of machine learning tools and techniques within data science.
This last point links into the second theme of value. Data is valuable. It enables us to make better, more informed decisions and is a critical resource. However, a balance needs to be drawn between extracting value from the data and respect. So, is there a need to change the way in which we think about our data analysis processes?
Dr Cynthia Dwork in her talk on Privacy-Preserving Data Analysis (The Turing Institute event Stats, Decision Making and Privacy) noted that statistics are inherently not private, with aggregate statistics destroying privacy. Echoing the talk of John Pullinger, Dr Dwork raised the question ‘What is the nature of the protection we wish to provide?’. It is also important to understand who is a threat to our data and why. A move towards differential privacy (https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf) was proposed. When an analysis is undertaken in this way the outcome of the analysis is essentially equally likely, irrespective of whether any individuals join, or refrain from joining, the data set. However, this would require a completely different way of working.
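To make the "essentially equally likely" idea a little more concrete, here is a minimal sketch of one standard differentially private technique, the Laplace mechanism, applied to a simple count. This is my own illustration rather than anything shown in the talk: the function names, the `epsilon` value and the tiny list of ages are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential draws with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    # Hypothetical helper: returns a noisy count of records matching `predicate`.
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    # Adding or removing one person changes a count by at most 1,
    # so the noise scale is 1 / epsilon (smaller epsilon = more noise).
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical data set, with and without one individual.
ages_with = [34, 51, 29, 62, 45, 38]
ages_without = ages_with[:-1]


def over_40(age):
    return age > 40


rng = random.Random(0)
print(private_count(ages_with, over_40, epsilon=0.5, rng=rng))
print(private_count(ages_without, over_40, epsilon=0.5, rng=rng))
```

Because noise is added to every answer, the released figure looks much the same whether or not any one person is in the data set, and `epsilon` controls how much accuracy is traded for that protection. It also hints at why this is a different way of working: the analyst only ever sees noisy answers, not the raw records.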
We’ve all heard the old adage of ‘lies, damned lies and statistics’; a key factor in making sure this is not the case is the presentation of the data. We need to ensure that the data is correctly understood and correctly interpreted. Start from where your audience is, and think carefully about your choice of words and visualisations. We also need to help our audiences to be more data literate. But to underpin good analysis and communication we need to invest in skills and develop a good data infrastructure.
Support the Royal Statistical Society’s Data Manifesto: http://www.rss.org.uk/Images/PDF/influencing-change/2016/RSS_Data%20Manifesto_2016_Online.pdf
And, in the words of John Pullinger, ‘step up, step forward and step on the gas’!