Wrong number?
By Oliver W Duke-Williams, on 7 July 2014
The Labour Party has recently launched a new website which tells you your supposed ‘NHS Baby Number’ – that is, if all the babies born under the NHS were placed in order, which one you would be, from the very first, born on July 5th 1948, to the very latest.
It is an interesting piece of viral marketing / campaigning, but one which deserves a little more critical attention.
[EDIT 18/05/18: A revised version of this site has appeared in the lead up to the 70th anniversary of the NHS, so I’ve updated the data used for my estimate. Labour have improved their retention policy, in that they ask you first, but have extended the data gathering to potentially record family structure as well. My other criticisms about the reference to the census remain as they were.]
The baby number site can be criticised on two grounds: firstly, for its personal data handling, and secondly for the validity of the ‘baby number’ it generates.
In order to find out your baby number, it is necessary to fill in a form giving your name and email address. The website small print informs you that the Labour Party might subsequently contact you, and says that “You may unsubscribe at any time”. However, this is poor practice for data gathering – generally, if you register for a service or buy something online, you are given an opportunity to opt out of receiving further communications before you proceed any further. The baby number site, however, doesn’t follow this practice: it makes submission of your email address an unavoidable part of the procedure. Of course, you can unsubscribe afterwards, and I don’t doubt that this will be handled appropriately; but there is no need for the Labour Party to gather addresses in this way in the first place.
In practice, of course, the site only checks that you have submitted something that looks like an email address: using a bogus address will generate a value in the same way as using a genuine address. Many people will not realise this, however, and will use their real names and email addresses. This is a form of bad practice which we would encourage our students to avoid.
The approach has already gathered criticism in the press, but has still gained a degree of exposure in social media networks.
The second set of criticisms relates to the ‘baby number’ itself. Whilst few will actually take this value particularly seriously, we encourage our students to use data properly, and to consider closely the assumptions that they make.
Clearly, the estimate is based on your day of birth. You are not asked about your time of birth, should you know it, so the baby number must necessarily be a guess as to which of the many babies born on your birthday you happened to be. If you fill in the same details twice, you’ll presumably get a different number. Ignoring that though, the task is simple: count all the births from July 5th 1948 up to the date in question, and then add on a random amount between 1 and the number of babies born on that day. Simple. All we need to worry about is “all the births from July 5th 1948 etc…”. Which is where the problems begin. How exactly do you do that?
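That recipe can be sketched in a few lines of Python. This is only an illustration of the arithmetic just described, not the code behind the Labour site; the function name and both arguments are hypothetical, and the counts themselves would have to come from a real data source.

```python
import random
from typing import Optional


def baby_number(births_before_day: int, births_on_day: int,
                rng: Optional[random.Random] = None) -> int:
    """Guess a position in the sequence of births.

    births_before_day: births from the start date up to, but not
        including, the birthday (hypothetical input).
    births_on_day: births on the birthday itself (hypothetical input).
    """
    rng = rng or random.Random()
    # The time of birth is unknown, so pick a random position
    # among the babies born that day (1 .. births_on_day inclusive).
    return births_before_day + rng.randint(1, births_on_day)
```

Calling it twice with the same inputs will usually give two different numbers, because the position within the day is drawn at random each time.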
The text of the page states that it uses ‘census data’, which is interesting, because it’s not the obvious way to start (by which, of course, I mean “it’s obviously not the right way to do it”). A census tells you how many people were present in a place at a fixed point in time, the most recent UK census being in April 2011. If we look at census data, we can get a figure fairly easily for the total population. We can also get information for the population by age, and can then determine when people were born, or – looking at it from a different point of view – how many people were born in each year. That’s good – we’re almost at the answer, aren’t we?
At this point, if I were discussing this as an exercise with my students, I’d hope that they’d spot the problems with the approach. A census, by nature of its design, only counts people who are alive on the census day, and, generally, people who are present in the country. This misses two groups of people, if we’re interested in estimating births: those people who were born after the start of the NHS, but died before the census, and also those who were born after the start of the NHS, but have subsequently emigrated, and are thus not included in the UK census. Both are obviously sizeable groups of people.
Raising the problem of emigration in this context might point the student to another problem: people born outside the UK who have subsequently entered the UK are included in the census, but would not have been ‘born on the NHS’. Here, the census might provide some help: all recent censuses have included questions about country of birth, allowing us to remove the non-UK born from our calculations. The first two groups – those who have died, and those who have emigrated – remain problematic. Using ‘census data’ alone, it is difficult to address these. We might perhaps look at older censuses, and see whether we can glean any information there.
For those people who died or emigrated during the ten years prior to the last census, well – we should get some information from the 2001 census, in which we assume they were recorded.
At this stage, however, our estimation methodology is starting to look very messy. Fortunately, there is a far more obvious approach that doesn’t involve census data at all. Instead, we can use administrative records on births. The Office for National Statistics publish a variety of statistics on births, including annual information on the number of live births in the UK.
By counting these, we can remove any need to worry about how to account for the dead or the emigrants – all we need to know is that someone was born.
Using the annual birth figures, we can get an accurate count of the number of people born in the UK. Suppose that you were born in the year 2000: we can add up all the births up to and including 1999, and then make a guess about how many of the births in the year 2000 happened before you were born. As with the Labour website, we use your date of birth to make this estimate. In a similar way to allocating some fraction of the year 2000 births, we can also make an estimate of how many births in 1948 occurred before July 5, and remove them from our calculations.
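A minimal sketch of that calculation follows, assuming births are spread evenly within each year. Here annual_births is a hypothetical mapping from year to number of live births, to be filled in from published figures; no real counts are included, and the function name is my own.

```python
from datetime import date
from typing import Mapping

NHS_START = date(1948, 7, 5)


def fraction_of_year_elapsed(d: date) -> float:
    """Share of the year that has passed before day d begins."""
    days_in_year = (date(d.year + 1, 1, 1) - date(d.year, 1, 1)).days
    return (d - date(d.year, 1, 1)).days / days_in_year


def births_before(dob: date, annual_births: Mapping[int, int],
                  start: date = NHS_START) -> float:
    """Estimated births from `start` up to, but not including, `dob`."""
    # Whole years from the start year up to the year before birth.
    total = float(sum(annual_births[y] for y in range(start.year, dob.year)))
    # Add the part of the birth year that falls before the birthday.
    total += annual_births[dob.year] * fraction_of_year_elapsed(dob)
    # Remove the part of the start year that falls before the start date.
    total -= annual_births[start.year] * fraction_of_year_elapsed(start)
    return total
```

The same fraction logic handles both ends: it adds the part of your birth year before your birthday, and removes the part of 1948 before July 5.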
A final element we might wonder about is whether births are evenly distributed throughout the year. If they are, then estimating which baby number you were within your birth year is fairly easy; if not, then we need to adjust our methods if we are trying to be accurate.
As well as annual birth figures, ONS also provide monthly figures for recent years, allowing us to look at the monthly pattern and make a decision about whether or not to worry about this in our estimate. These figures are for England and Wales, rather than the UK, but give a reasonable estimate for the whole of the UK.
The month with the lowest mean number of births (the blue line) is February – but this is unsurprising, as February has fewer days in which to be born. The red line shows the mean number of births per day in each month – and here we see that September has the highest rate, and December the lowest. Perhaps the most important thing to note however, is that the amount of variation is not huge, and thus we can decide not to worry about month-of-year effects when estimating a baby number.
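The step from monthly totals to a daily rate is simple enough to sketch. The function below divides each month’s total by the number of days in that month; its name and the shape of its input are hypothetical, and it handles a single year rather than a mean over several years.

```python
import calendar
from typing import Dict, Mapping


def births_per_day(monthly_births: Mapping[int, int], year: int) -> Dict[int, float]:
    """Convert monthly birth totals (keyed 1..12) into births per day."""
    return {
        # monthrange returns (weekday of first day, number of days in month).
        month: total / calendar.monthrange(year, month)[1]
        for month, total in monthly_births.items()
    }
```

With made-up totals of 310 births in January and 280 in February 2011, both months come out at 10 births per day, even though February has the lower total.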
So, to estimate a ‘baby number’, we can use the cumulative number of births in the years before you were born, a fraction of the number of births in the year that you were born, corresponding to your day of birth, and then a random amount based on an estimate of the number of births that day. Again, this will be re-calculated each time, so you will get a different number each time the model is run.
A final adjustment that can be made is to allow for the fact that not all UK births are under the NHS. The Labour website quotes a figure of 97%. Having been unable to find a source for this, I’ve used the same value, although it seems unlikely that this has remained constant over the lifetime of the NHS.
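One plausible way to apply that share is sketched below. How the 97% should enter the calculation is an assumption on my part: here it simply scales both the births before the day and the births on the day. The function and parameter names are hypothetical.

```python
import random
from typing import Optional

NHS_SHARE = 0.97  # figure quoted by the Labour website; source not found


def nhs_baby_number(births_before_day: float, births_on_day: float,
                    nhs_share: float = NHS_SHARE,
                    rng: Optional[random.Random] = None) -> int:
    """Estimate a baby number counting only the NHS share of births."""
    rng = rng or random.Random()
    # Assumption: the same share applies in every year since 1948.
    nhs_before = round(births_before_day * nhs_share)
    # Keep at least one birth on the day so a position can be drawn.
    nhs_on_day = max(1, round(births_on_day * nhs_share))
    return nhs_before + rng.randint(1, nhs_on_day)
```

If the NHS share has in fact varied over time, a per-year share would be needed in place of the single constant.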
Perhaps the reference to ‘census data’ is just a mistake, and they really meant ‘vital statistics data’. Maybe. But, what is interesting about this approach – and why I’ve bothered describing the methods – is that the birth statistics approach gives a significantly different value to that used by the Labour website. The calculator linked above uses the birth data to estimate a baby number. The most recent annual birth data are for 2012, so I’ve simply assumed an identical number of births occurred in 2013, and will occur in 2014.
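The assumption for 2013 and 2014 amounts to repeating the last known annual figure. A minimal sketch, with a hypothetical function name and no real birth counts:

```python
from typing import Dict, Mapping


def carry_forward(annual_births: Mapping[int, int], through_year: int) -> Dict[int, int]:
    """Extend a year -> births mapping by repeating the latest known value."""
    extended = dict(annual_births)
    last_year = max(extended)
    for year in range(last_year + 1, through_year + 1):
        # Assume each missing year matches the most recent published year.
        extended[year] = extended[last_year]
    return extended
```

This is a crude placeholder rather than a forecast, and the filled-in years should be replaced once real figures are published.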
Both sets of problems with the NHS baby number site – the clumsy collection of personal data, and the poor methods (or descriptions of methods) – are things that we would encourage students in Information Studies to see as bad practice. Firstly, personal data should be collected in a manner which does not require you to opt out afterwards, and perhaps should not be gathered at all in this sort of viral campaign. Secondly, if we are to encourage people to use administrative data (and the UK is making massive investment in encouraging the sound re-use of administrative data for research), then we should be encouraging best practice in that arena as well.