By Oliver W Duke-Williams, on 27 March 2014
The Office for National Statistics (ONS) has announced today that Jil Matheson, the National Statistician, has recommended a predominantly online census in 2021 supplemented by the further use of administrative and survey data. A similar announcement has been made by National Records of Scotland, with regard to the 2021 census in Scotland.
This announcement comes after a lengthy consultation process, which looked at the future of census-taking in the UK. In their consultation, ONS considered two options:
- a decennial census similar to the current approach, but largely conducted online
- replacement with data captured from administrative sources, accompanied by a rolling annual sample survey.
CeLSIUS were one of many organisations and individuals who contributed to the consultation, and ONS quote from our response in their consultation report. Our full response is available – along with other responses – at the ONS website alongside the announcement details. (The file available via ONS has lost some of its original formatting but the text remains the same; for ease of reading, a formatted copy of our response is also available: futureofcensus_response-final). The response drew some of its material from a fairly long post written for the UCL Department for Information Studies blog – “I didn’t have time to write a short letter, so I wrote a long one instead”, as various people are claimed to have said (the quote is often attributed to Mark Twain; the earliest use seems to be by Blaise Pascal, albeit less pithily, and in French).
In our response, we discussed the advantages and disadvantages of both approaches, but concluded that asking people to view the two as a dichotomous choice was of course a trick question:
“The question ‘do we want a census with a strong online element, or do we want to exploit administrative sources?’ is a loaded one. The obvious response is that we want both: we want a robust census core on which projections and estimates can be based, and we want to strengthen and enrich that using administrative data.”
ONS’ consultation report is interesting: there were a large number of responses, both from organisations and from individuals. The report shows the uses of census data to which people responding to the consultation referred.
Almost 70% of individuals making responses referred to use for family history research (amongst other uses – multiple uses could be identified for each response); in contrast, this was the least common use identified by organisations, each of which tends to represent a large body of users. However, we should not under-emphasise the importance of family history: the commercial genealogy industry in the UK (and globally) is big business, and relies on census and other historic data sources. The Office of Fair Trading’s report on the commercial use of public information in 2006 identified commercial genealogy (p31) as a case example of a rapidly growing use of public data, even if the claim that “genealogy is now one of the top uses of the World Wide Web” seems to be something of an exaggeration.
The report notes that “Around 90 per cent of respondents made comments supportive of an online census.” That is perhaps not surprising – it reflects a general preference amongst (a self-selecting set of) respondents to maintain the census, while acknowledging that internet-based collection is clearly a feasible, and possibly preferred, mode for a growing number of people.
When we responded to the consultation, we noted a number of potential problems with the use of administrative data. These include the issues that:
- Not all census questions have obvious administrative data counterparts
- Household (or family) level information is much more sparse in administrative data than in census data
- The census is used to test whether survey data are representative of a larger population
- Linkage is hard and introduces ambiguity
The first three of these issues relate to specific problems that arise if we envisage the census as being entirely replaced with admin data. However, the problems disappear to a large extent if we see the census as being enhanced with administrative data. There is considerable interest in the use of admin data to explore a wide variety of questions in a broad range of social science and epidemiological fields. This is reflected by heavy investment in the ‘big data’ and ‘administrative data’ areas: this investment includes the newly established ESRC Administrative Data Research Network, which is one part of a set of ‘big data’ investments.
The fourth problem we mentioned – issues to do with linkage being hard – remains valid, yet these are amongst the issues that the ADRN will investigate.
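To illustrate why linkage introduces ambiguity, here is a minimal sketch in Python. It is a toy example, not a description of how ONS or the ADRN link records: the field names (`surname`, `birth_year`), the records, and the helper functions `link_key` and `link` are all invented for illustration. It matches census records to administrative records on a crude key and shows that some records find several candidates while others find none.

```python
from collections import defaultdict

def link_key(record):
    """Build a crude matching key: normalised surname plus year of birth.

    Hypothetical key for illustration; real linkage uses richer identifiers.
    """
    return (record["surname"].strip().lower(), record["birth_year"])

def link(census, admin):
    """Map each census id to the list of admin ids that share its key."""
    index = defaultdict(list)
    for rec in admin:
        index[link_key(rec)].append(rec["id"])
    return {rec["id"]: index.get(link_key(rec), []) for rec in census}

# Invented toy records.
census = [
    {"id": "c1", "surname": "Smith", "birth_year": 1970},
    {"id": "c2", "surname": "Jones", "birth_year": 1985},
    {"id": "c3", "surname": "Patel", "birth_year": 1990},
]
admin = [
    {"id": "a1", "surname": "smith", "birth_year": 1970},
    {"id": "a2", "surname": "SMITH ", "birth_year": 1970},
    {"id": "a3", "surname": "Jones", "birth_year": 1985},
]

matches = link(census, admin)
# Census records with more than one candidate cannot be linked with certainty.
ambiguous = [cid for cid, hits in matches.items() if len(hits) > 1]
# Census records with no candidate are missing from the admin source.
unlinked = [cid for cid, hits in matches.items() if not hits]
```

In this toy data one census record has two plausible administrative counterparts and another has none; adding more matching variables reduces the first problem but tends to worsen the second, since any recording error then breaks the match.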
Since the consultation closed, issues to do with data linkage have gained a high profile in the media, in association with the care.data project. It is clear that there is considerable public concern about data linkage, and in particular about commercial exploitation of linked data sets. A 2013 report by the Wellcome Trust on public attitudes to personal data and linking personal data highlighted health data as an area of distinctly divided attitudes: people liked the idea of sharing their personal data within the NHS, but not the idea of sharing personal data with the insurance or pharmaceutical industries. There is now much greater awareness that anonymisation of data (such that the data cannot be de-anonymised) is very hard to do.
The ONS LS, together with the sister studies NILS and SLS, are useful exemplars of good practice here: users can conduct analysis on anonymised individual level records within carefully set out security arrangements, but results cannot be published at an individual level.
The census agencies now have an opportunity to demonstrate that administrative data can be used in conjunction with census data, and to do so in a manner which is demonstrably responsible and safe. Where administrative data are used in conjunction with aggregate counts, outputs should be designed such that the risk of attribute disclosure is minimal. Controls such as record swapping and noise introduction via imputation of missing values can help to protect such data. There is also a new opportunity for individual level data sets such as the LS and the SARs to be coupled in the future with additional data or rolled-forward estimates from administrative data. Such data inevitably have higher disclosure risks, so they must continue to be used by trusted researchers in a way in which they are protected by a combination of technical, legal and practical safeguards.
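As a rough illustration of the record swapping idea mentioned above, the following Python sketch exchanges the area codes of paired records that share a control variable. It is a simplified toy and should not be read as the method used by any census agency: the function `swap_areas`, the fields `area`, `household_size` and `occupation`, and the records themselves are assumptions made up for this example.

```python
import random
from collections import Counter, defaultdict

def swap_areas(records, rate, seed=0):
    """Return a copy of records with some area codes swapped.

    Records are paired within groups sharing the same household size
    (a hypothetical control variable). Each pair located in different
    areas has its area codes exchanged with probability `rate`.
    """
    rng = random.Random(seed)
    out = [dict(r) for r in records]  # copy, so the input is left untouched
    groups = defaultdict(list)
    for i, rec in enumerate(out):
        groups[rec["household_size"]].append(i)
    for indices in groups.values():
        rng.shuffle(indices)
        # Walk through the shuffled indices two at a time.
        for i, j in zip(indices[0::2], indices[1::2]):
            if out[i]["area"] != out[j]["area"] and rng.random() < rate:
                out[i]["area"], out[j]["area"] = out[j]["area"], out[i]["area"]
    return out

# Invented toy records.
records = [
    {"id": 1, "area": "E01", "household_size": 1, "occupation": "teacher"},
    {"id": 2, "area": "E02", "household_size": 1, "occupation": "nurse"},
    {"id": 3, "area": "E01", "household_size": 4, "occupation": "farmer"},
    {"id": 4, "area": "E03", "household_size": 4, "occupation": "clerk"},
]

swapped = swap_areas(records, rate=1.0)

def area_by_size(recs):
    """Count records for each (area, household size) combination."""
    return Counter((r["area"], r["household_size"]) for r in recs)
```

Because swaps only happen between records with the same household size, the count of households of each size in each area is unchanged, while other attributes (here, occupation) can no longer be attributed with certainty to a particular area. That is the trade-off such controls aim for: aggregate counts on the control variables stay intact, and attribute disclosure becomes harder.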