In my last post, I promised to go a bit deeper into dealing with missing data. Although it might sound a bit paradoxical, it is pivotal to consider how to deal with what is not there in epidemiological studies. Missing data has the potential to skew the results of a study in unexpected directions if it is overlooked. As I tried to show in the example of the three general practices, the same group of people can give very different results if their information goes missing through one of the three mechanisms of data missingness. However, through some sleight of hand and a bit of cold reading…
In the first example, a practice hit by a computer virus that randomly deletes some of the information recorded by the GP, data goes missing completely at random (MCAR). Luckily, the information that is left is still representative of the patient group as a whole. Dealing with this type of disappearing data is easy, as you can just go about your analysis without taking any special precautions. The results might be a bit less precise than initially hoped for, but they will be accurate.
The opposite is true when data has gone missing not at random (MNAR). In the example, a GP only recorded systolic blood pressure if it was over the threshold of 140 mmHg. As a result, it is nigh on impossible to predict what the blood pressure is of the group who don’t show up in this GP’s records. The only thing we know is that the patients’ blood pressure is probably lower than 140, but nothing else.
Of course, a lot of other studies will have used blood pressure before, so it would be possible to make an educated guess as to what the blood pressures of the other patients would be, based on their age and gender. But this would require an external source of data, which, if you’re doing something a bit more complicated than measuring blood pressure, might not be available.
Things get a bit more complicated if data is missing at random (MAR). In this case, some information has gone missing, but whether it is there or not is related to something you have measured. In our case, the GP was more likely to subject older patients to a quick measurement. What is essential in this case is that although the information is more likely to be unavailable for younger patients, there is still some information there. Using the right type of imputation method – imputation is the substitution of some value for a missing data point, or ‘filling in the blanks’ – you can look beyond the veil and find the information on your missing persons.
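For readers who like to see this in numbers, here is a minimal sketch of the three practices as a simulation. Everything in it is an assumption made for illustration: the made-up patients are aged 20 to 80, their systolic blood pressure rises by half a point per year of age, the virus deletes 30% of records, and the chance of being measured in the MAR practice is simply age divided by 100. It only shows what happens to the average blood pressure if you analyse whatever is left.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible


def simulate_patients(n=20000):
    """Hypothetical patients: systolic BP rises with age, plus random noise."""
    patients = []
    for _ in range(n):
        age = random.uniform(20, 80)
        sbp = 100 + 0.5 * age + random.gauss(0, 10)
        patients.append((age, sbp))
    return patients


def observed_mcar(patients, p_missing=0.3):
    # The computer virus: every record has the same chance of being deleted.
    return [sbp for _, sbp in patients if random.random() > p_missing]


def observed_mar(patients):
    # The GP measures older patients more often; age itself is always recorded.
    return [sbp for age, sbp in patients if random.random() < age / 100]


def observed_mnar(patients, threshold=140):
    # The GP only writes the value down if it is over the threshold.
    return [sbp for _, sbp in patients if sbp > threshold]


patients = simulate_patients()
true_mean = statistics.fmean(sbp for _, sbp in patients)
mcar_mean = statistics.fmean(observed_mcar(patients))
mar_mean = statistics.fmean(observed_mar(patients))
mnar_mean = statistics.fmean(observed_mnar(patients))

print(f"true mean: {true_mean:.1f}")
print(f"MCAR mean: {mcar_mean:.1f}")
print(f"MAR mean:  {mar_mean:.1f}")
print(f"MNAR mean: {mnar_mean:.1f}")
```

Under these assumptions the MCAR average should land close to the truth, the MAR average should drift upwards because older patients are over-represented, and the MNAR average can only ever be above 140.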
I’m getting a ‘J’…
Imputing missing data is a bit like a psychic reading, with the statistician in the role of the psychic. Like a psychic using cold reading, the simplest method starts with the most general option available: using the mean value to fill in the blanks. A psychic might open with a very general statement, naming just one letter of the name of someone he or she is contacting. Using a mean value (mean imputation, as statisticians like to call it), a systolic blood pressure of 120 for instance, works the same way: it will ring true for a lot of people. However, because you are predicting the same thing for everyone, a lot of people will be left out. In other words, there isn’t enough variation in your prediction to take the varying nature of a measure like blood pressure into account.
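The loss of variation is easy to demonstrate. The sketch below uses made-up numbers (1,000 hypothetical blood pressures around 120 with a standard deviation of 15, of which roughly 40% go missing completely at random) and fills every blank with the mean of the values that are left.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical systolic blood pressures; the parameters are illustrative only.
true_values = [random.gauss(120, 15) for _ in range(1000)]

# Knock out roughly 40% of the values completely at random.
observed = [v if random.random() > 0.4 else None for v in true_values]
present = [v for v in observed if v is not None]

# Mean imputation: every blank gets the same value.
mean_value = statistics.fmean(present)
imputed = [mean_value if v is None else v for v in observed]

sd_observed = statistics.stdev(present)
sd_imputed = statistics.stdev(imputed)

print(f"mean of observed values: {mean_value:.1f}")
print(f"SD of observed values:   {sd_observed:.1f}")
print(f"SD after mean imputation: {sd_imputed:.1f}")
```

The average does not move, but the spread shrinks, because all the imputed patients have been given exactly the same blood pressure.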
A ‘J’… John? I’m hearing from a John…
In order to introduce a bit of variation, statisticians can use something called regression imputation. Rather than just using the mean value of the whole group of people that was measured, you take your other variables into account as well. For instance, when a psychic sees someone responding to the ‘J’, they look for other clues. Maybe there is an elderly woman in the audience, who is likely to be there to contact her husband who has passed away. ‘John’ is a common name, so guessing that she is probably there to contact a male, the psychic has used a bit of extra information to predict the missing information. Likewise, to predict the missing measurements on blood pressure, you can take account of the age or gender of the person you are measuring.
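As a rough sketch of what regression imputation looks like in practice, the example below predicts missing blood pressures from age using a straight line fitted to the patients who were measured. The data are simulated under assumptions of my own choosing (blood pressure rises by half a point per year of age, and the chance of being measured equals age divided by 100), and `fit_line` is a small helper written for this illustration, not a function from any library.

```python
import random
import statistics

random.seed(7)  # fixed seed so the sketch is reproducible


def fit_line(xs, ys):
    """Ordinary least-squares intercept and slope for y ~ x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical practice where older patients are more likely to be measured.
records, true_values = [], []
for _ in range(20000):
    age = random.uniform(20, 80)
    sbp = 100 + 0.5 * age + random.gauss(0, 10)
    true_values.append(sbp)
    measured = random.random() < age / 100
    records.append((age, sbp if measured else None))

# Fit the line on the patients who do have a measurement.
complete = [(age, sbp) for age, sbp in records if sbp is not None]
intercept, slope = fit_line([a for a, _ in complete], [s for _, s in complete])

# Fill each blank with the value the line predicts for that patient's age.
filled = [sbp if sbp is not None else intercept + slope * age
          for age, sbp in records]

true_mean = statistics.fmean(true_values)
complete_case_mean = statistics.fmean(s for _, s in complete)
imputed_mean = statistics.fmean(filled)

print(f"true mean:            {true_mean:.1f}")
print(f"measured patients:    {complete_case_mean:.1f}")
print(f"after regression fill: {imputed_mean:.1f}")
```

Note that every imputed value sits exactly on the fitted line, so although the average improves, the filled-in patients still look more predictable than real patients would be.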
Does the name John mean anything to you? John. Or Jonah, Jonathan, Jack, Jake…
Although regression imputation is a big step forward from the monotony of mean imputation, there are still some issues. As with the psychic guessing the name ‘John’, a single guess might still be off. Therefore, a psychic often repeats the trick, going through a list of potential names till they hit the jackpot. Statisticians can use similar techniques when using multiple imputation. As in regression imputation, a value is predicted using information that is already there, but rather than going with the first attempt, multiple predictions are made. Unfortunately, statisticians don’t have a willing audience telling them when their prediction is right, so we use a set of rules, called Rubin’s rules (after Donald Rubin, a professor of statistics at Harvard), to combine the results into a single, accurate and precise estimate.
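To make the idea of "multiple guesses, then combine" a little more concrete, here is a simplified sketch. It is not a full multiple imputation procedure as implemented in statistical packages: as a stand-in, each round refits the regression on a bootstrap resample of the measured patients and adds random noise to every prediction. The simulated data, the 20 rounds, and the helper names `fit_line`, `impute_once` and `rubin_pool` are all assumptions for illustration. The pooling step follows Rubin's rules: the pooled estimate is the average of the per-round estimates, and the total variance is the average within-round variance plus (1 + 1/m) times the between-round variance.

```python
import math
import random
import statistics

random.seed(2024)  # fixed seed so the sketch is reproducible


def fit_line(xs, ys):
    """Least-squares intercept, slope and residual SD for y ~ x."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    resid_sd = math.sqrt(sum(r * r for r in residuals) / (len(xs) - 2))
    return intercept, slope, resid_sd


def impute_once(records):
    """One 'guess': refit on a bootstrap sample, then predict with noise."""
    complete = [(a, s) for a, s in records if s is not None]
    boot = random.choices(complete, k=len(complete))
    intercept, slope, resid_sd = fit_line([a for a, _ in boot],
                                          [s for _, s in boot])
    return [s if s is not None
            else intercept + slope * a + random.gauss(0, resid_sd)
            for a, s in records]


def rubin_pool(estimates, variances):
    """Combine m estimates and their variances using Rubin's rules."""
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    return pooled, within + (1 + 1 / m) * between


# Hypothetical practice where older patients are more likely to be measured.
records, true_values = [], []
for _ in range(2000):
    age = random.uniform(20, 80)
    sbp = 100 + 0.5 * age + random.gauss(0, 10)
    true_values.append(sbp)
    records.append((age, sbp if random.random() < age / 100 else None))

estimates, variances = [], []
for _ in range(20):  # 20 imputed datasets
    filled = impute_once(records)
    estimates.append(statistics.fmean(filled))
    variances.append(statistics.variance(filled) / len(filled))

pooled_mean, pooled_var = rubin_pool(estimates, variances)
print(f"true mean:   {statistics.fmean(true_values):.1f}")
print(f"pooled mean: {pooled_mean:.1f} (SE {math.sqrt(pooled_var):.2f})")
```

The between-round term is what makes the final standard error honest: it adds back the uncertainty that a single round of guessing would have hidden.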
The most difficult part of taking missing data into account is deciding on the mechanism of missingness. There is no test that can confirm the blanks are completely random; even if the computer virus has a slight preference for larger numbers, this assumption will be invalid. Nevertheless, many research studies, especially clinical trials, like to assume that their data is missing completely at random, and use this as a justification to completely ignore the problem.
In reality, this is very rare. Often the people who quit trials will have their reasons: they are the ones experiencing side effects, or the pill the trial is testing is not having the effect they were expecting and they’ve gone back to usual care. Ignoring these reasons for missing information can have big effects. If the people for whom your treatment wasn’t working disappear, you’re only left with the ‘responders’: patients for whom the medication works. Basing the analysis on this group might give some overly optimistic results. Therefore, it is important to consider the problem of missing data when reading journal articles claiming to have found a new wonder-drug, or when designing your own research. Although I’ve only touched upon some of the methods for dealing with missing data, there are lots of options available (missingdata.org.uk is an excellent place to start). So get that crystal ball out and start filling in the blanks!