John Ioannidis has been getting a lot of press lately. He made the cover of the latest issue of The Atlantic Monthly, and David Dobbs wrote about him here (and, a few years ago, here). This is the doctor known for his claim that around half of medical studies are false: roughly 80% of non-randomized trials and even 25% of randomized trials. These are not just dinky findings published in dinky journals; of the 49 most highly-regarded papers published over a 13-year period, 34 of the 45 that claimed to have found effective treatments had been retested, and 41% of those retests failed to replicate the original result.
Surprised?
Quoting the Atlantic Monthly:
Ioannidis initially thought the community might come out fighting. Instead, it seemed relieved, as if it had been guiltily waiting for someone to blow the whistle...

Well, it's not surprising. The appropriate analog in psychology is the randomized trial, 25% of which (in medicine) turn out to be false according to this research (which hopefully isn't itself false). As Ioannidis has detailed, the system is set up to reward false positives. Journals -- particularly glamour mags like Science -- preferentially accept surprising results, and the best way to have a surprising result is to have one that is wrong. Incorrect results happen: "statistically significant" means "has only a 5% probability of happening by random chance." This means (in theory) that 5% of all experiments published in journals should reach the wrong conclusions. If journals are biased in favor of accepting exactly those 5%, then the proportion should be higher.
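To make that last point concrete, here is a toy simulation of a field in which only significant results get published. The base rate of true effects and the statistical power below are made-up numbers chosen purely for illustration, not figures from Ioannidis or the Atlantic piece:

```python
# Toy simulation: if journals only publish "significant" results, what fraction
# of published findings are wrong? The base rate of true effects (20%) and the
# power (80%) are arbitrary illustrative assumptions.
import random

random.seed(0)

N_STUDIES = 100_000
P_TRUE_EFFECT = 0.2   # assumed fraction of tested hypotheses that are actually true
ALPHA = 0.05          # false-positive rate when there is no real effect
POWER = 0.80          # chance of detecting a real effect when there is one

published = wrong = 0
for _ in range(N_STUDIES):
    effect_is_real = random.random() < P_TRUE_EFFECT
    significant = random.random() < (POWER if effect_is_real else ALPHA)
    if significant:          # only significant results get "published"
        published += 1
        wrong += not effect_is_real

print(f"Wrong findings among published results: {wrong / published:.1%}")
# Prints roughly 20% -- well above the naive 5% -- because publication
# filters on significance.
```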
There are other factors at work. Some scientists are sloppier than others, and many of the ways in which one can be sloppy lead to significant and/or surprising results. For instance, when there is no real effect, 5% of experiments will still turn up a false positive. There are labs that will run the same experiment 6 times with minor tweaks. There is a (1-.95^6) * 100 = 26.5% chance that at least one of those runs will come out significant. The lab may then publish only that final experiment and not report the others. If sloppy results lead to high-impact publications, survival of the fittest dictates that sloppy labs will reap the accolades, get the grant money, tenure, etc.
Keep in mind that often many different labs are trying to do the same thing. For instance, in developmental psychology, one of the deep questions is: what is innate? So many labs are testing younger and younger infants, trying to find evidence that these younger infants can do X, Y, or Z. If 10 labs all run the same experiment, there's a (1-.95^10) * 100 = 40.1% chance of at least one of the labs finding a significant result.
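Both percentages come from the same formula: with a 5% false-positive rate per test, the chance that at least one of n independent tests comes up significant by chance is 1 - 0.95^n. A quick check of the arithmetic (nothing here beyond the two numbers quoted above):

```python
# Chance that at least one of n independent tests is "significant" by chance
# alone, given a 5% per-test false-positive rate.
def chance_of_spurious_hit(n_tests: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** n_tests

for n in (6, 10):
    print(f"{n} tests: {chance_of_spurious_hit(n):.1%}")
# 6 tests: 26.5%
# 10 tests: 40.1%
```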
Countervailing Forces
Thus, there are many incentives to publish incorrect data. Meanwhile, there are very few disincentives against doing so. If you publish something that turns out not to replicate, it is very unlikely that anyone will publish the failure to replicate -- simply because it is very difficult to publish a failure to replicate. If someone does manage to publish such a paper, it will almost certainly be in a lower-profile journal (which is, incidentally, a disincentive to publishing such work to begin with).
Similarly, consider what happens when you run a study and get a surprising result. You could replicate it yourself to make sure you trust the result. That takes time, and there's a decent chance it won't replicate. If you do replicate it, you can't publish the replication (I tried to in a recent paper submission, and a reviewer insisted that I remove reference to the replication on account of it being "unnecessary"). If the replication works, you'll gain nothing. If it fails, you won't get to publish the paper. Either way, you'll have spent valuable time you could have spent working on a different study leading to a different paper.
In short, there are good reasons to expect that 25% of studies -- particularly in the high-profile journals -- are un-replicable.
What to do?
Typically, solutions proposed involve changing attitudes. The Atlantic Monthly suggests:
We could solve much of the wrongness problem, Ioannidis says, if the world simply stopped expecting scientists to be right. That's because being wrong in science is fine, and even necessary ... But as long as careers remain contingent on producing a stream of research that's dressed up to seem more right than it is, scientists will keep delivering exactly that.

I've heard this idea expressed elsewhere. In the aftermath of Hausergate, a number of people suggested that a factor was the pressure-cooker that is the Harvard tenure process, and that Harvard needs to stop putting so much pressure on people to publish exciting results.
So the idea is that we should stop rewarding scientists for having interesting results, and instead reward the ones who have uninteresting results? Journals should publish only the most staid research, and universities should award tenure based not on the number of highly-cited papers you have written, but on how many papers you've written that have never been cited? I like that idea. I can run a boring study in a few hours and write it up in the afternoon: "Language Abilities in Cambridge Toddlers are Unaffected by Presence or Absence of Snow in Patagonia." That's boring and almost certainly true. And no one will ever cite it.
Seriously, though, public awareness campaigns telling people to be more responsible are great, and sometimes they even help, but I don't know how much can be done without changing the incentive structure itself.
Reputation
I don't have a solution, but I think Ioannidis again points us towards one. He found that papers continue to be cited long after they have been convincingly and publicly refuted. I was discussing this issue with a colleague some time back and mentioned a well-known memory paper that nobody can replicate. Multiple failures-to-replicate have been published. Yet I still see it cited all the time. The colleague said, "Wow! I wish you had told me earlier. We just had a student spend two years trying to follow up that paper, and the student just couldn't get the method to work."
Never mind that researchers rarely bother to replicate published work -- even if they did, we have no mechanism for tracking which papers have been successfully replicated and which papers can't be replicated.
Tenure is awarded partly based on how often your work has been cited, and we have many nice, accessible databases that will tell you how often a paper has been cited. Journals are ranked by how often their papers are cited. What if we rewarded researchers and journals based on how well their papers hold up to replication? Maybe it would help, maybe it wouldn't, but without a mechanism for tracking this information, the idea remains, at best, a thought experiment.
Even if such a database wasn't ultimately useful in decreasing the number of wrong papers, at least we'd know which papers were wrong.
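For what it's worth, here's a minimal sketch of what such a replication-tracking database might record and report. The field names and the example entries are hypothetical; a real system would need far more (study details, effect sizes, links to the replication reports):

```python
# Hypothetical sketch of a replication-tracking record; the field names and the
# example entries below are invented purely for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaperRecord:
    title: str
    successes: int = 0   # published replication attempts that succeeded
    failures: int = 0    # published replication attempts that failed

    def replication_rate(self) -> Optional[float]:
        attempts = self.successes + self.failures
        return self.successes / attempts if attempts else None  # None = never retested

# Invented example entries, keyed by (fake) DOIs.
db = {
    "doi:10.0000/famous-memory-paper": PaperRecord("Famous memory effect", 0, 3),
    "doi:10.0000/sturdy-paper": PaperRecord("Well-replicated effect", 4, 1),
    "doi:10.0000/untested-paper": PaperRecord("Never retested"),
}

for doi, record in db.items():
    rate = record.replication_rate()
    status = "never retested" if rate is None else f"{rate:.0%} of attempts replicated"
    print(f"{record.title}: {status}")
```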
12 comments:
Nice post! I built a replication failure database of the kind you allude to a couple of years ago (failuretoreplicate.com), and wrote about it in this post.
It's fully functional, taggable and searchable, and only takes a minute or two to add an entry; but like everything else, the hard part is getting enough momentum to get the thing off the ground -- no one wants to be the first person to add a null result, and I haven't had the time to seriously promote it. But if anyone who's reading this feels a burning need to let the world know about their replication failures, they're invited to do so!
For instance, 5% of experiments have false positives. There are labs that will run the same experiment 6 times with minor tweaks. There is a 5% * 6 = 30% chance that one of those will have a significant result.
False. There is a 1-(.95)^6 x 100% = about 26.5% chance.
If 10 labs all run the same experiment, there's a 5% * 10 = 50% chance of one of the labs finding a significant result.
False again. There is a 1-(.95)^10 x 100% = about 40.1% chance.
While your overarching point is salient, I cannot help but note the irony of erroneous probability calculations appearing in this particular blog post.
@Tal -- I like the idea and am glad to know about it. There were a few features that I didn't see but might have overlooked. Is it possible to add additional failures to the same entry, or do you need a separate entry for each failure? Also, it looks like failures are anonymous, which might not be ideal. Have you been adding information about retracted papers?
@Davis -- Ironically, while I studied math in college, the one part of math I've always found non-intuitive was probability. You're of course right about this; I'm going to fix the text.
I've left it pretty much up to the discretion of the user... you can either lump a bunch of failures under one entry, or make one for each. If the site ever picks up to the point where non-standardization becomes an issue, I'll come up with clearer guidelines, but right now I'm hesitant to impose any more rules than necessary.
The entries aren't anonymous, in the sense that if you're logged in, you can see who the contributing user is, and view their contact information (which is required in order to submit anything). I just didn't want that info visible to the world. I'll add that to the FAQ next time I update it, thanks!
@Tal -- I meant more that if I have failed to replicate a study and find that that study has already been listed on your site, it'd be nice if I could add an addendum to the existing entry rather than a separate entry...unless this is all aggregated together automatically in some way I overlooked (very likely).
What about including *replications* in addition to failures to replicate? It's also useful to know that a paper replicates, right?
One answer is to base the decision to publish on just reading the Introduction and Methods: has an interesting question been identified, and are the methods appropriate to answer it?
If yes, publish, regardless of outcome.
""statistically significant" means "has only a 5% probability of happening by random chance." This means (in theory) that 5% of all experiments published in journals should reach the wrong conclusions. If journals are biased in favor of accepting exactly those 5%, then the proportion should be higher."
No, it means that if there wasn't a result, there's a 5% chance that we'll think there was one -- assuming the assumptions we've made are correct. Thus, if studies are done on things that are likely to work going in, we'll have less than a 5% failure rate.
That said, publication bias is a genuine problem.
@deevybee: it sounds like a nice idea, and it's what PLoS does, but as I pointed out in the post, it's easy to do a boring experiment well. Many journals only have room to print 1/10 of the papers they receive, and most researchers don't have time to read anywhere near that many. You still have to decide what to print, and choosing based on whether anyone would care about the outcome of the experiment seems like a good idea.
@Mr K: I guess it's an empirical question. My experience is that in advance of an experiment, nobody has any idea what hypotheses are true, and our guesses approximate a binomial distribution, in which case my claim is correct. If I had to bet that we were biased one direction or another, I would bet researchers are biased in favor of experiments that *won't* work, simply because it's very hard to publish an experiment that everyone thought was going to work (it's boring).
I did not recommend going just on the methods, but also on the introduction. One could avoid boring studies by making a decision about whether the *question* was worth investigating.
@deevybee: Fair enough. Including the introduction helps with the *really* boring studies. But I think it's the case that many experiments are only interesting if their results come out one way and not the other. For instance, a failure to replicate is a lot more interesting than a replication. It's hard to see what can be done about that.
I agree. The journals are encouraging a bias toward the unusual, though good scientists and good medical doctors are aware that any result must be replicated and tested against all kinds of possible confounding factors.
So essentially, the journals are duping the public, who like to see the odd and unusual. As a result, there are a lot of false reports, and this leads the leaders of this country not to trust scientists even when they have a general consensus based on real data.
Essentially, once again, media is controlling the world in ways we never consider or bother to change.
The reason is that we want it that way. We want entertainment. Not reality.
No surprise. All involved -- journal editors, peer-reviewers, researchers, department heads, deans -- belong to the same self-named species, Homo sapiens.