Trying to make sense of nonsense

Category Archives: Methodology

Care.data: giant leap or embarrassing stumble?

Going by what’s been playing out over the last couple of months, care.data definitely seems to be in the ’embarrassing stumble’ category. However, that other giant leap started out that way as well. The first US attempt to launch a satellite into space, on a Vanguard rocket, ended after four feet of ‘flight’ when the rocket caught fire and the satellite, blasted off the top, rolled behind some bushes. Moscow sent their condolences.

Pale blue dot – taken on Apollo 11 on the way to the moon. Picture courtesy of NASA

Despite that bumpy start, we’re celebrating the 45th anniversary of Neil Armstrong’s small step this week. It took a while for the US to really get into the space race, but they did manage to fly a man to the moon. The big data race has been going on for a while now. While companies have realized it can be very lucrative to monetize information that was previously considered boring, governments have been late to the data party. Perhaps rightly so, seeing the current debate on whether government should be allowed to sell its citizens’ data, or use it for purposes other than those it was collected for.

Care.data was to be England’s giant leap, not only to catch up with what has been going on in the Scandinavian countries since the 70s, but to take the lead. Unfortunately, the communication on care.data has been abysmal. Much has been written about it already, so I won’t add to that ever growing canon. The debate about how to move forward, if at all, is only getting started though. Having an information source such as care.data would be an amazing impulse for the next couple of years of public health science, epidemiology, health informatics and a whole host of other disciplines, but we should look further than that.

To come back to the space race, NASA chose to make compromises in getting to the moon. As Kennedy had given them a deadline, they chose options that would get them there within that time, rather than options that could keep them going into deep space for longer. The result: a very successful Apollo programme, but a rather disappointing 45 years of flying around Earth to follow (despite the very, very cool space plane used for those flights).

The Health and Social Care Information Centre (HSCIC) have that same choice to make now: go for quick fixes (pseudonymisation at source* for instance) and get care.data rolled out soon, or think long term and build the trust and support systems that science can build on for decades to come. The problems surrounding care.data are already making it difficult for researchers to do their job: HSCIC is not giving out any data, pending an internal review of how they have been working and how they should work. This means studies that were funded and approved by ethics committees (and assigned deadlines) have been put on hold because there is no data to work with. This is particularly sour for people on short term contracts (like me at the moment) or students who are suddenly left without a project.

It would be great to have care.data as a data source for research. But I’m also just starting out as a researcher, and although I would like to fly to the moon, I also want to go beyond and have an academic career rather than one shining moment. As Michael Collins**, the man who circled the moon while Aldrin and Armstrong landed on it, said: “Man has always gone where he has been able to go, it is a basic satisfaction of his inquisitive nature, and I think we all lose a little bit if we choose to turn our back on further exploration.” So let’s get working on making that giant leap, but make sure we don’t lose sight of where we may want to land in the future.

*Pseudonymisation at source would turn any identifiable data (e.g. a date of birth) into a string of letters and numbers that looks like nonsense (this string is called a ‘hash’). There’s no way to get back to the original date of birth. Different organisations (GP practices, hospitals etc) would use the same programme and same key to create the same hash, so HSCIC can still link records together. Problem solved? Not so much. Although no identifiable data would leave the practice, the records can still be linked longitudinally. This means that if you know that a male of a certain age was admitted on a specific date to a specific hospital with a heart condition, you’d still have a good chance of finding him. A bigger problem is that everyone involved in administrative data would have to use the same programme and the same key, have data in the exact same format (dates of birth saved as 09-01-87 and 09/01/1987 would turn into different hashes, and you wouldn’t be able to recognise them as the same person) and not make any typos. This would severely limit the chances of linking the data to other sources.
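The format problem is easy to demonstrate. A toy sketch (the key and function names are made up; a real scheme would use an agreed keyed-hashing standard):

```python
import hashlib
import hmac

# Hypothetical shared key: every organisation would have to agree on this.
KEY = b"national-shared-secret"

def pseudonymise(value: str) -> str:
    """Turn an identifiable value into a nonsense-looking string (a keyed
    hash, so outsiders can't simply try every possible date of birth)."""
    return hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()

# The same date of birth stored in two different formats produces two
# completely different hashes, so the records can no longer be linked.
gp_hash = pseudonymise("09-01-87")
hospital_hash = pseudonymise("09/01/1987")
print(gp_hash == hospital_hash)  # False: the formats must match exactly
```

Only when every organisation uses an identical key, programme and data format do the hashes line up.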

**His autobiography, Carrying the Fire, is amazing. I’d recommend it for some summer reading.

Meta-epidemiology: the science of taking a step back

So last week a pretty interesting looking study appeared in the BMJ. With a title like ‘Comparison of treatment effect sizes associated with surrogate and final patient relevant outcomes in randomised controlled trials: meta-epidemiological study’ (and breathe…) I wouldn’t be too surprised if many people just skipped over it. Nevertheless, it has some pretty interesting results.

But first we’ll go on a journey back in time to 1997. That year, the BMJ dedicated an entire issue to the topic of meta-epidemiology. Specifically, it looked at meta-analyses, the branch of epidemiology that combines the results from all relevant studies to try to come to some form of agreement on a particular question. Meta-analyses are regarded as the highest form of evidence, being able to pool all available evidence into a final answer.

However, it turned out that this form of analysis wasn’t as infallible as some liked to believe. There was a problem we had been trying to ignore: publication bias. Studies with interesting results and large effect sizes were more likely to be published than studies that didn’t find anything. While these ‘negative trials’ gathered dust in researchers’ drawers, the people meta-analysing studies were lulled into thinking that the treatments they were evaluating were more effective than they actually were.

These results had a big impact on the way meta-analyses were viewed and performed, bringing publication bias and the importance of unpublished studies to the fore. This new study tries to shine a similar light on how we try to assess whether a new treatment works.

As the title of the study suggests, it’s looking at the difference between surrogate and final patient relevant outcomes. While patient relevant outcomes (such as, does this pill I’m taking for heart disease actually make me live longer, or does it lower my chance of a heart attack?) are what we’re really interested in, often trials will look at surrogate outcomes. For instance, while statins are prescribed to lower the chance of heart disease (which could require years of following very large groups of patients), trials often measure whether they lower cholesterol (which requires a couple of months) as we know this is related to future heart disease.

Looking at surrogate or intermediate outcomes makes trials shorter, smaller, and importantly, a lot cheaper. Instead of having to wait ten years to find out whether a drug has an effect, we can find out in a year. With the budget for health research getting ever smaller, it would be great if we could exchange patient relevant outcomes for equally valid surrogate outcomes. Whether that is possible is exactly what this new study is researching.

The researchers compared 84 trials using surrogate outcomes with 101 patient relevant trials published in six of the highest rated medical journals in 2005 and 2006. They found that trials using surrogate outcomes tend to find larger treatment effects: the drugs tested in these trials appeared to be about 47% more effective than in trials using patient relevant outcomes. This was true across all the fields of epidemiological research they included, and couldn’t be explained by any of the factors explored, such as the size of the trial or whether it was funded by Big Pharma.

So why does this matter? Although trials using either type of outcome found different effect sizes, they still came to the same overall conclusion: either the drug worked or it didn’t. Other studies have found the same for other drugs that got licensed based on (mainly) data on surrogate outcomes. Unfortunately, the opposite has also happened. A drug for non-small cell lung cancer (a particular type of lung cancer), Gefitinib, was licensed by the FDA based on surrogate outcomes. When the data on patient relevant outcomes became available (whether the drug makes people live longer in this case), it turned out that it didn’t work.

As the paper concludes, policy makers and regulators should be cautious when the only data available on a new drug is on surrogate outcomes, as it could turn out that the drug they’re trying to evaluate is a lot less effective than the research seems to imply. And in rare cases, it might even not work at all.

Is Nate Silver a witch?

Tentative evidence of how Nate Silver was able to make a perfect prediction (image via TechCrunch)

By predicting the outcome of the US elections correctly in 50 out of 50 states (after an already impressive 49/50 in the 2008 elections), Nate Silver of the NY Times’ FiveThirtyEight blog has managed to convince even the most sceptical data deniers of his prediction models. So much so that his perfect prediction started a twitter trend (#natesilverfacts) and led to him being labelled a witch. So how impressive was this feat really? Is Nate Silver really a wizard from the future aiming for world domination through the power of numbers? Let’s use some stats to assess his stats!

Let’s start by toning down Silver’s amazing feat of predicting the election outcomes in 50 separate states. In most US states, the outcome didn’t need complex prediction models to come to a reliable estimate: some results, such as in the District of Columbia where over 90% of the population voted Obama, were uncontested. The same goes for other blue Obama-voting states such as California (59%), Hawaii (71%), Maryland (62%) or New York (63%), or red Romney states such as Oklahoma (67% voted GOP), Utah (73%), Alabama (61%) or Kansas (60%).

Only in the swing states, which could go either way, would Nate Silver have needed his number crunching to decide on a future winner. If we go by the NY Times’ numbers, only 7 states were a toss-up between the Democrats and Republicans: Colorado, Florida, Iowa, New Hampshire, Ohio, Virginia and Wisconsin. Treating those 7 states as coin tosses – each outcome has an equal 50% probability – we can test the hypothesis that Nate Silver is a witch, Hwitch, against the competing hypothesis that he is a completely non-magical human being, Hmuggle. If Nate is a witch, we assume he predicts each state’s election result correctly, witches having perfect knowledge of all future events. The probability of this happening is expressed in a fancy maths equation like this: p(7 right|Hwitch) – read the equation as: the probability of Nate getting 7 right, given that he is a witch. The probability in this case is 100% or 1. But even if Nate is devoid of magical abilities, there is still a small chance he would guess all 7 election results correctly. We can calculate this probability: p(7 right|Hmuggle) = 1/2^7 = 1/128. If we take the ratio of the two, 1/(1/128), it seems it is about 128 times more likely that Nate is a witch than a muggle.
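The coin-toss arithmetic is short enough to check directly. A sketch, treating the 7 toss-up states as independent fair coins:

```python
# Probability of calling all 7 toss-up states correctly under each hypothesis.
p_given_witch = 1.0        # a witch foresees every result
p_given_muggle = 0.5 ** 7  # 7 independent coin flips: 1/128

# The ratio of the two likelihoods: how much better the 'witch' hypothesis
# explains a perfect prediction than the 'muggle' hypothesis does.
likelihood_ratio = p_given_witch / p_given_muggle
print(likelihood_ratio)  # 128.0
```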

Whatever the truth is about Nate Silver, it appears he’s pulled off something pretty extraordinary. Unfortunately for him, he’s still one step removed from being the world’s best predictor as Paul the psychic octopus managed to correctly predict the outcomes of 8 football matches at the 2010 World Cup. World’s best human predictor will have to do for now then.

However, as with Paul, Nate wasn’t the only person making predictions. Paul only gained the street cred necessary to be taken seriously as a clairvoyant cephalopod after a bout of predicting Eurocup results (and getting one wrong), and the same could be said for Nate Silver. If he hadn’t pulled off a similar feat in the previous elections, no one would have paid much attention to his blog this time round. His 2008 prediction was perhaps even more impressive than his latest one: he might have missed Indiana, but got the results for the remaining 10 swing states right.

As polls get about the same amount of coverage (if not more) as the actual elections, there are a lot of people who try to pitch in. Let’s take a guess and say there were 50 people trying to predict the state-by-state 2008 election outcomes. The chance that at least someone would get at least 8 of the 11 swing states correct (assuming this would be the threshold to attract the attention of witch hunters) is 1 − (255/256)^50 = 0.18 (for the reasoning behind this calculation, read David Spiegelhalter’s blog on the numbers behind Paul being a completely normal, if slightly lucky, octopus). So there was about a 1 in 5 chance of at least someone coming up with some remarkably correct predictions.
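The ‘at least one lucky pundit’ figure can be reproduced the same way, using the 1/256 per-pundit probability from Spiegelhalter’s calculation for Paul:

```python
p_one = 1 / 256   # chance a single pundit clears the threshold by pure luck
n_pundits = 50    # our guess at the number of people making predictions

# Chance that at least one of the 50 gets lucky:
# 1 minus the chance that every single one of them misses.
p_at_least_one = 1 - (1 - p_one) ** n_pundits
print(round(p_at_least_one, 2))  # 0.18
```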

XKCD: Frequentists vs Bayesians

XKCD endorsement for Bayesian stats

So we now know that frequentist statisticians would label Silver a witch, but what about the much cooler Bayesians? (no bias at all here…) Bayesian statistics differs from frequentist statistics in that it takes prior knowledge into account when putting a probability on an event. Or: Bayesian statistics is probably a cool branch of stats, but if you know XKCD thinks so too, it’s suddenly a lot more probable to be true (the coolness of a specific branch of statistics is conditional on XKCD endorsement).

To calculate the posterior probability of Nate Silver being a witch, we need to know a few things:

  • p(W), or the prior probability that Nate Silver is a witch, regardless of any other information. This will depend on the prevalence of witches in Silver’s hometown, New York. According to this NY Meetup page, there are 3023 witches in NY. Considering the population of the whole city (8,244,910 according to the US census), the prior probability of a random person in NY being a witch is 0.0004.
  • p(W’), or the probability that Nate is a muggle regardless of any other information, and that’s 1 – 0.0004 or 0.9996 in this case.
  • p(P|W), the probability of Nate making a perfect prediction, given that he’s a witch: 100%, or 1.
  • p(P|W’), the probability of Nate making a perfect prediction as a muggle, which we put at 1/128, or about 0.008, earlier.
  • p(P), the probability of making a perfect prediction, regardless of any other information. Using the law of total probability – all probabilities have to add up to 1 or 100% – this is 1×0.0004 + 0.008×0.9996 = 0.0084

Now that we know all this we can fill out the formula for calculating posterior probability:

p(W|P) = p(P|W) × p(W) / p(P) = 1 × 0.0004 / 0.0084 ≈ 0.048. That’s pretty slim, though at 5%, we can’t be sure he isn’t a witch. However, going back to the 2008 elections, there were already some suspicions of Nate Silver’s potential Wiccan background. If we start with the 0.18 probability we arrived at earlier, the posterior probability of Nate Silver being a witch rises to 0.96 or 96%. So yes, Nate Silver is probably a witch. Alternatively, you could of course exchange ‘witch’ with ‘statistician’ and conclude with 96% confidence that he’s just very good at his job.
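The whole Bayesian update fits in a few lines. A sketch using the rounded 0.008 muggle probability from the list above (the function name is mine):

```python
def posterior_witch(prior, p_given_witch=1.0, p_given_muggle=0.008):
    """Bayes' theorem: P(witch | perfect prediction)."""
    # Law of total probability: P(perfect prediction) over both hypotheses.
    p_prediction = p_given_witch * prior + p_given_muggle * (1 - prior)
    return p_given_witch * prior / p_prediction

print(round(posterior_witch(0.0004), 3))  # 0.048: about 5%
print(round(posterior_witch(0.18), 2))    # 0.96 with the 2008-informed prior
```

Swapping in a more informed prior is a one-argument change, which is exactly the appeal of the Bayesian approach.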

Missing data: looking for information from beyond the veil

Image by forklift

In my last post, I promised to go a bit deeper into dealing with missing data. Although it might sound a bit paradoxical, it is pivotal to consider how to deal with what is not there in epidemiological studies. Missing data has the potential to skew the results of a study in unexpected directions if it is overlooked. As I tried to show in the example of the three general practices, the same group of people can give very different results if their information goes missing through one of the three mechanisms of data missingness. However, with some sleights of hand and a bit of cold reading, a statistician can often still recover the information that seems lost.

In the first example, a practice hit by a computer virus that randomly deletes some of the information recorded by the GP, data goes missing completely at random (MCAR). Luckily, the information that is left is still representative of the patient group as a whole. Dealing with this type of disappearing data is easy, as you can just go about your analysis without taking any special precautions. The results might be a bit less precise than initially hoped for, but they will be accurate.

The opposite is true when data has gone missing not at random (MNAR). In the example, a GP only recorded systolic blood pressure if it was over the threshold of 140 mmHg. As a result, it is nigh on impossible to predict what the blood pressure is of the group who don’t show up in this GP’s records. The only thing we know is that the patients’ blood pressure is probably lower than 140, but nothing else.

Of course, a lot of other studies will have used blood pressure before, so it would be possible to make an educated guess as to what the blood pressures of the other patients would be, based on their age and gender. But this would require an external source of data, which, if you’re doing something a bit more complicated than measuring blood pressure, might not be available.

Things get a bit more complicated if data is missing at random (MAR). In this case, some information has gone missing, but whether it is there or not, is related to something you have measured. In our case, the GP was more likely to subject older patients to a quick measurement. What is essential in this case is that although the information is more likely to be unavailable for younger patients, there is still some information there. Using the right type of imputation method – imputation is the substitution of some value for a missing data point, or ‘filling in the blanks’ – you can look beyond the veil and find the information on your missing persons.

I’m getting a ‘J’…
Imputing missing data is a bit like a psychic reading, with the statistician in the role of the psychic. Like a psychic using cold reading, the simplest method starts with the most general option available: using the mean value to fill in the blanks. A psychic might start contacting the other side with a very general statement, naming just one letter of the name of someone he or she is contacting. As with using a mean value, or mean imputation as statisticians like to call it – a systolic blood pressure of 120 for instance – this will ring true for a lot of people. However, because you are predicting the same thing for everyone, a lot of people will be left out. In other words, there isn’t enough variation in your prediction to take the varying nature of a measure like blood pressure into account.
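A sketch of mean imputation on some made-up blood pressure readings, with None marking a missing measurement:

```python
import statistics

# Made-up systolic blood pressure readings; None marks a missing value.
readings = [152, None, 138, None, 145, 160, None, 120]

observed = [r for r in readings if r is not None]
mean_bp = statistics.mean(observed)  # 143

# Every blank gets the same value: the 'everyone is average' prediction.
imputed = [r if r is not None else mean_bp for r in readings]
print(imputed)  # the three blanks all become 143
```

The imputed values carry no variation at all, which is exactly the problem described above.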

A ‘J’… John? I’m hearing from a John…
In order to introduce a bit of variation, statisticians can use something called regression imputation. Rather than just using the mean value of the whole group of people that was measured, you take your other variables into account as well. For instance, when a psychic sees someone responding to the ‘J’, they look for other clues. Maybe there is an elderly woman in the audience, who is likely to be there to contact her husband who has passed away. ‘John’ is a common name, so guessing that she is probably there to contact a male, the psychic has used a bit of extra information to predict the missing information. Likewise, to predict the missing measurements of blood pressure, you can take account of the age or gender of the person you are measuring.
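A minimal regression imputation sketch along those lines (made-up numbers), fitting a least-squares line of blood pressure on age using only the complete cases:

```python
# Made-up data: use age to predict the missing blood pressures.
ages      = [25, 40, 55, 60, 70, 35, 65]
pressures = [115, 125, 140, None, 155, None, 150]

# Fit a simple least-squares line on the complete cases only.
complete = [(a, p) for a, p in zip(ages, pressures) if p is not None]
n = len(complete)
mean_a = sum(a for a, _ in complete) / n
mean_p = sum(p for _, p in complete) / n
slope = (sum((a - mean_a) * (p - mean_p) for a, p in complete)
         / sum((a - mean_a) ** 2 for a, _ in complete))
intercept = mean_p - slope * mean_a

# Fill each blank with the value the fitted line predicts for that age.
imputed = [p if p is not None else intercept + slope * a
           for a, p in zip(ages, pressures)]
print(imputed)
```

Now a 60-year-old and a 35-year-old with missing readings get different predictions, instead of the same mean for everyone.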

Does the name John mean anything to you? John. Or Jonah, Jonathan, Jack, Jake…
Although regression imputation is a big step forward from the monotony of mean imputation, there are still some issues. As with the psychic guessing the name ‘John’, a single guess might still be off. Therefore, a psychic often repeats the trick, going through a list of potential names till they hit the jackpot. Statisticians can use similar techniques when using multiple imputation. Similar to regression imputation, a value is predicted using information that is already there, but rather than going with the first attempt, multiple predictions are made. Unfortunately, statisticians don’t have a willing audience telling them when their prediction is right, so we use a set of rules, called Rubin’s rules (after Donald Rubin, a professor of statistics at Harvard), to combine the results into a single, accurate and precise estimate.
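A rough sketch of the repeat-and-pool idea (made-up data; real multiple imputation draws from a proper predictive model, and Rubin’s rules also pool the variances to get honest standard errors, while here we only pool the point estimates):

```python
import random
import statistics

random.seed(42)  # for a reproducible sketch

readings = [152, None, 138, None, 145, 160, None, 120]
observed = [r for r in readings if r is not None]
mu = statistics.mean(observed)
sd = statistics.stdev(observed)

m = 20  # number of imputed datasets
estimates = []
for _ in range(m):
    # Each imputation adds random noise, so the m completed datasets differ.
    completed = [r if r is not None else random.gauss(mu, sd)
                 for r in readings]
    estimates.append(statistics.mean(completed))

# Pooled point estimate: the average of the per-dataset estimates.
pooled = statistics.mean(estimates)
print(round(pooled, 1))
```

The noise is what mean imputation was missing: the m datasets disagree with each other by roughly the right amount, so the pooled answer reflects how uncertain we really are.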

The most difficult part of taking missing data into account is deciding on the mechanism of missingness. There is no test to see whether the blanks are completely random; even if the computer virus has a slight preference for larger numbers, this assumption will be invalid. Nevertheless, many research studies, especially clinical trials, like to assume that their data is missing completely at random, and use this as a justification to completely ignore the problem.

In reality, this is very rare. Often the people who quit trials will have their reasons: they are the ones experiencing side effects, or the pill the trial is testing is not having the effect they were expecting and they’ve gone back to usual care. Ignoring these reasons for missing information can have big effects. If the people on whom your treatment wasn’t working disappear, you’re only left with the ‘responders’: patients for whom the medication works. Basing the analysis on this group might give some overly optimistic results. Therefore, it is important to consider the problem of missing data when reading journal articles claiming to have found a new wonder-drug, or when designing your own research. Although I’ve only touched upon some of the methods for dealing with missing data, there are lots of options available (missingdata.org.uk is an excellent place to start). So get that crystal ball out and start filling in the blanks!

Dude, where’s my data?

In almost all research, data goes missing. Maybe the dog ate your lab book, or you’ve got some office mates with a score to settle. Luckily there is a whole field of missing data research that can come to your rescue. However, in order to use any methods to deal with missing data, you first have to try and figure out what the mechanism behind your data missingness is. And to help you find out what is going on in your data: my first attempt at data visualisation (or more concept visualisation in this case).

To conclude: there are three mechanisms of missingness, each with its own catchy name. As you might guess at this point, the missingness mechanism determines how you should go about analysing your data. But more on that in another post that’s coming up soon.