In this week's New Yorker, Jonah Lehrer shows once again just how hard it is to do good science journalism if you are not yourself a scientist.

His target is the strange phenomenon that many high profile papers are failing to replicate. This has been very much a cause celebre lately, and Lehrer follows a series of scientific papers on the topic as well as an excellent *Atlantic* article by David Freedman. At this point, many of the basic facts are well-known: anecdotally, many scientists report repeated failures to replicate published findings. The higher-profile the paper, the less likely it is to replicate, with around 50% of the highest-impact papers in medicine failing to replicate. As Lehrer points out, this isn't just scientists failing to replicate each other's work, but scientists failing to replicate their *own* work: a thread running through the article is the story of Jonathan Schooler, a professor at UC-Santa Barbara who has been unable to replicate his own seminal graduate student work on memory.

Lehrer's focus in this article is shrinking effects. Some experimental effects seem to shrink steadily over time:

> In 2001, Michael Jennions, a biologist at the Australian National University, set out to analyze "temporal trends" across a wide range of subjects in ecology and evolutionary biology. He looked at hundreds of papers and forty-four meta-analyses (that is, statistical syntheses of related studies), and discovered a consistent decline effect over time, as many of the theories seemed to fade into irrelevance.

As described, that's weird. But there is a good explanation for such effects, and Lehrer brings it up. Some results are spurious. It's just one of those things. Unfortunately, spurious results are also likely to be exciting. Let's say I run a study looking for a relationship between fruit-eating habits and IQ. I look at the effects of 20 different fruits. By chance, one of them will likely show a significant -- but spurious -- effect. So let's say I find that eating an apple every day leads to a 5-point increase in IQ. That's really exciting because it's surprising -- and the fact that it's not true is integral to what makes it surprising. So I get it published in a top journal (top journals prefer surprising results).

Now, other people try replicating my finding. Many, many people. Most will fail to replicate, but some -- again by chance -- will replicate. It is extremely difficult to get a failure to replicate published, so only the replications get published. After time, the "genius apple hypothesis" becomes part of established dogma. Remember that anything that challenges established dogma is exciting and surprising and thus easier to publish. So now failures to replicate are surprising and exciting and get published. When you look at effect-sizes in published papers over time, you will see a gradual but steady decrease in the "effect" of apples -- from 5 points to 4 points down to 0.

**Where I get off the Bus**

So far so good, except here's Lehrer again:

> While the publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the initial prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts.

Huh? Lehrer seems to be suggesting that it is *publication* that makes a result spurious. But that can't be right. Rather, there are just lots of spurious results out there. It happens that journals preferentially publish spurious results, leading to biases in the published record, and eventually the decline effect.

Some years ago, I had a bad habit of getting excited about my labmate's results and trying to follow them up. Just like a journal, I was interested in the most exciting results. Not surprisingly, most of these failed to replicate. The result was that none of them got published. Again, this was just a factor of some results being spurious -- disproportionately, the best ones. (Surprisingly, this labmate is still a friend of mine; personally, I'd hate me.)

**The Magic of Point O Five**

Some readers at this point might be wondering: wait -- people do statistics on their data and only accept a result that is extremely unlikely to have happened by chance. The cut-off is usually 0.05 -- a 5% chance of having a false positive. And many studies that turn out later to have been wrong pass even stricter statistical tests. Notes Lehrer:

> And yet Schooler has noticed that many of the data sets that end up declining seem statistically solid -- that is, they contain enough data that any regression to the mean shouldn't be dramatic. "These are the results that pass all the tests," he says. "The odds of them being random are typically quite remote, like one in a million. This means that the decline effect should almost never happen. But it happens all the time!"

So there's got to be something making these results look more unlikely than they really are. Lehrer suspects unconscious bias:

> Theodore Sterling, in 1959 ... noticed that ninety-seven percent of all published psychological studies with statistically significant data found the effect they were looking for ... Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments

and again:

> The problem seems to be one of subtle omissions and unconscious misperceptions, as researchers struggle to make sense of their results.

I expect that unconscious bias is a serious problem (I illustrate some reasons below), but this is pretty unsatisfactory, as he doesn't explain *how* unconscious bias would affect results, and the Schooler effect is a complete red herring.

I wasn't around in 1959, so I can't speak to that time, but I suspect that the numbers are similar today ... but in fact Sterling was measuring the wrong thing. Nobody cares what our hypotheses were. They don't care what order the experiments were actually run in. They care about the truth, and they have very limited time to read papers (most papers are never read, only skimmed). Good scientific writing is clear and concise. The mantra is: Tell them what you're going to tell them. Tell them. And then tell them what you told them. No fishing excursions, no detours. When we write scientific papers, we're writing *science*, not history.

And this means we usually claim to have expected to find whatever it is that we found. It just makes for a more readable paper. So when a scientist reads the line, "We predicted X," we know that really means "We *found* X" -- what the author actually predicted is beside the point.

**Messing with that Point O Five**

So where do all the false positives come from, if they should be less than 5% of conducted studies? There seem to be a number of issues.

First, it should be pointed out that the purpose of statistical tests (and the magic .05 threshold for significance) is to make a prediction as to how likely it is that a particular result will replicate. A p-value of .05 means roughly that there is a 95% chance that the basic result will replicate (sort of; this is not technically true but is a good approximation for present purposes).

But statistics are *estimates*, not facts. They are based on a large number of idealizations (for instance, many require that measurement error is distributed normally, meaning that the bulk of measurements are very close to the true measurement and a measurement is as likely to be larger than the true number as it is likely to be smaller). In fact, most data is heavily skewed, with measurements more likely to be too large than too small (or vice versa).

For instance, give someone an IQ test. IQ tests have some measurement error -- people will score higher or lower than their "true" score due to random factors such as guessing answers correctly (or incorrectly), being sleepy (or not), etc. But it's a lot harder to get an IQ score higher than your true score than lower, because getting a higher score requires a lot of good luck (unlikely) whereas there are all sorts of ways to get a low score (brain freeze, etc.).

Most statistical tests make a number of assumptions (like normally distributed error) that are not true of actual data. That leads to incorrect estimates of how likely a particular result is to replicate. The truth is most scientists -- at the very least, most psychologists -- aren't experts in statistics, and so statistical tests are misapplied all the time.

I don't actually think that issues like the ones I just discussed lead to most of the difficulties (though I admit I have no data one way or another). I bring these issues up mainly to point out that statistical tests are tools that are either used or misused according to the skill of the experimenter. And there are lots of nasty ways to misuse statistical tests. I discuss a few of them below:

**Run enough experiments and...**

Let's go back to my genius fruit experiment. I ask a group of people to eat an apple and then give them an IQ test. I compare their IQ scores with scores from a control group that didn't eat an apple. Now let's say in fact eating apples doesn't affect IQ scores. Assuming I do my statistics correctly and all the assumptions of the statistical tests are met, I should have only a 5% chance of finding a "significant" effect of apple-eating.

Now let's say I'm disappointed in my result. So I try the same experiment with kiwis. Again, I have only a 5% chance of getting a significant result for kiwis. So that's not very likely to happen either.

Next I try oranges....

Hopefully you see where this is going. If I try only one fruit, I have a 5% chance of getting a significant result. If I try 2 fruits, I have a 1 - .95*.95 = 9.8% chance of getting a significant result for at least one of the fruits. If I try 4 fruits, now I'm up to a 1 - .95*.95*.95*.95 = 18.5% chance that I'll "discover" that one of these fruits significantly affects IQ. By the time I've tried 14 fruits, I've got a better than 50% chance of an amazing discovery. But my p-value for that one experiment -- that is, my estimate that these results won't replicate -- is less than 5%, suggesting there is only a 5% chance the results were due to chance.
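The arithmetic in the paragraph above can be checked with a few lines of Python. This is a minimal sketch that assumes every fruit is tested independently and that none of them has a real effect; the function name `chance_of_false_discovery` is made up for illustration.

```python
def chance_of_false_discovery(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent null tests comes out 'significant'."""
    return 1.0 - (1.0 - alpha) ** num_tests


for fruits in (1, 2, 4, 14):
    print(f"{fruits:2d} fruits: {chance_of_false_discovery(fruits):.1%}")

# Smallest number of fruits at which a spurious "discovery" becomes more likely than not.
fruits_needed = next(k for k in range(1, 100) if chance_of_false_discovery(k) > 0.5)
print("fruits needed to pass 50%:", fruits_needed)
```

The only design choice here is the independence assumption: if the fruit experiments shared subjects or measurements, the true chance would differ from `1 - (1 - alpha) ** num_tests`.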

While there are ways of statistically correcting for this increased likelihood of false positives, my experience suggests that it's relatively rare for anyone to do so. And it's not always possible. Consider the fact that there may be 14 *different labs* all testing the genius fruit hypothesis (it's suddenly very fashionable for some reason). There's a better than 50% chance that one of these labs will get a significant result, even though from the perspective of an individual lab, they only ran one experiment.

**Data peeking**

Many researchers peek at their data. There are good reasons for doing this. One is curiosity (we do experiments because we *really* want to know the outcome). Another is to make sure all your equipment is working (don't want to waste time collecting useless data). Another reason -- and this is the problematic one -- is to see if you can stop collecting data.

Time is finite. Nobody wants to spend longer on an experiment than necessary. Let's say you have a study where you expect to need -- based on intuition and past experience -- around 20 subjects. You might check your data after you've run 12, just in case that's enough. What usually happens is that if the results are significant, you stop running the study and move on. If they aren't, you run more subjects. Now maybe after you've got 20 subjects, you check your data. If it's significant, you stop the study; if not, you run some more. And you keep on doing this until either you get a significant result or you give up.

It's a little harder to do back-of-the-envelope calculations on the importance of this effect, but it should be clear that this habit has the unfortunate result of increasing the relative likelihood of a false positive, since false positives lead you to declare victory and end the experiment, whereas false negatives are likely to be corrected (since you keep on collecting more subjects until the false negative is overcome). I read a nice paper on this issue that actually crunched the numbers a while back (for some reason I can't find it at the moment), and I remember the result was a pretty substantial increase in the expected number of false positives.
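A rough simulation can stand in for the missing back-of-the-envelope calculation. The sketch below is not a reconstruction of the paper mentioned above; it is a toy model under stated assumptions: there is no real effect, scores are normal with a known standard deviation of 1, a simple two-sided z-test is used, and the peeking schedule (12, 20, 28, 36, 44 subjects) is invented for illustration. All function names are hypothetical.

```python
import random
from statistics import NormalDist

Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided 5% cutoff, roughly 1.96


def is_significant(scores):
    """Two-sided z-test of 'mean is 0', assuming a known standard deviation of 1."""
    n = len(scores)
    z = (sum(scores) / n) * n ** 0.5
    return abs(z) > Z_CRIT


def run_study(rng, looks):
    """Collect null data, testing at each sample size in `looks`.

    Stops at the first significant look; returns True if the study
    ends with a (necessarily false) positive.
    """
    scores = []
    for n in looks:
        while len(scores) < n:
            scores.append(rng.gauss(0.0, 1.0))
        if is_significant(scores):
            return True
    return False


def false_positive_rate(looks, num_studies=5000, seed=0):
    rng = random.Random(seed)
    return sum(run_study(rng, looks) for _ in range(num_studies)) / num_studies


fixed_rate = false_positive_rate(looks=[20])
peeking_rate = false_positive_rate(looks=[12, 20, 28, 36, 44])
print(f"test once at n=20:        {fixed_rate:.3f}")
print(f"peek at 12, 20, ..., 44:  {peeking_rate:.3f}")
```

Testing once should give a rate near the nominal 5%, while the peeking schedule should come out noticeably higher. The exact figures depend on the schedule and the random seed, so treat them as illustrative only.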

**Data massaging**

The issues I've discussed so far are real problems, but they are pretty common and not generally regarded as ethical violations. Data massaging is at the borderline.

Any dataset can be analyzed in a number of ways. Once again, if people get the result they were expecting with the first analysis they run, they're generally going to declare victory and start writing the paper. If you don't get the results you expect, you try different analysis methods. There are different statistical tests that can be used. There are different covariates that could be factored out. You can throw out "bad" subjects or items. This is going to significantly increase the rate of false positives.

It should be pointed out that interrogating your statistical model is a good thing. Ideally, researchers *should* check to see if there are bad subjects or items, check whether there are covariates to be controlled for, check whether different analysis techniques give different results. But doing this affects the interpretation of your p-value (the estimate of how likely it is that your results will replicate), and most people don't know how to appropriately control for that. And some are frankly more concerned with getting the results they want than doing the statistics properly (this is where the "borderline" comes in).

**Better estimates**

The problem, at least from where I stand, is one of statistics. We want our statistical tests to tell us how likely it is that our results will replicate. We have statistical tests which, if used properly, will give us just such an estimate. However, there are lots and lots of ways to use them incorrectly.

So what should we do? One possibility is to train people to use statistics better. And there are occasional revisions in standard practice that do result in better use of statistics.

Another possibility is to lower the p-value that is considered significant. The choice of p=0.05 as a cutoff was, as Lehrer notes, arbitrary. Picking a smaller number would decrease the number of false positives. Unfortunately, it also decreases the number of real positives by a lot. People who don't math can skip this next section.

Let's assume we're running studies with a single dependent variable and one manipulation, and that we're going to test for significance with a t-test. Let's say the manipulation really should work -- that is, it really does have an effect on our dependent measure. Let's say that the effect size is large-ish (Cohen's d of .8, which is large by psychology standards) and that we run 50 subjects. The chance of actually finding a significant effect at the p=.05 level is 79%. For people who haven't done power analyses before, this might seem low, but actually an 80% chance of finding an effect is pretty good. Dropping our significance threshold to p=.01 drops the chance of finding the effect to 56%. To put this in perspective, if we ran 20 such studies, we'd find 16 significant effects at the p=.05 level but only 11 at the p=.01 level. (If you want to play around with these numbers yourself, try this free statistical power calculator.)

Now consider what happens if we're running studies where the manipulation shouldn't have an effect. If we run 20 such studies, 1 of them will nonetheless give us a false positive at the p=.05 level, whereas we probably won't get any at the p=.01 level. So we've eliminated one false positive, but at the cost of nearly 1/3 of our true positives.
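For readers without a power calculator handy, the trade-off can be approximated by simulation. The sketch below makes assumptions the text does not spell out: the 50 subjects are split into two groups of 25, scores are normal with equal variance, and the test is a two-sided pooled-variance t-test. The critical values are hard-coded from a standard t table for 48 degrees of freedom, and the function names are hypothetical.

```python
import random
from statistics import mean, variance

# Two-sided critical t values for df = 48 (two groups of 25), taken from a t table.
T_CRIT = {0.05: 2.011, 0.01: 2.682}


def t_statistic(a, b):
    """Student's two-sample t statistic with pooled variance (equal group sizes)."""
    n = len(a)
    pooled_var = (variance(a) + variance(b)) / 2
    return (mean(a) - mean(b)) / (2 * pooled_var / n) ** 0.5


def significant_fraction(effect_size, n_per_group=25, num_studies=2000, seed=1):
    """Fraction of simulated studies that reach significance at each threshold."""
    rng = random.Random(seed)
    hits = {alpha: 0 for alpha in T_CRIT}
    for _ in range(num_studies):
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        t = abs(t_statistic(treated, control))
        for alpha, cutoff in T_CRIT.items():
            hits[alpha] += t > cutoff
    return {alpha: count / num_studies for alpha, count in hits.items()}


power = significant_fraction(effect_size=0.8)         # the manipulation really works
false_alarms = significant_fraction(effect_size=0.0)  # the manipulation does nothing
for alpha in T_CRIT:
    print(f"alpha={alpha}: power ~ {power[alpha]:.2f}, false positives ~ {false_alarms[alpha]:.3f}")
```

Under these assumptions the simulated power should land close to the 79% and 56% figures quoted above, though a simulation with 2,000 studies will wobble by a percentage point or so from run to run with different seeds.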

**No better prediction of replication than replication**

Perhaps the easiest method is to just replicate studies before publishing them. The chances of getting the same spurious result twice in a row are vanishingly small. Many of the issues I outlined above -- other than data massaging -- won't increase your replication rate. Test 14 different fruits to see if any of them increase IQ scores, and you have over a 50% chance that one of them will spuriously do so. Test that *same fruit* again, and you've only got a 5% chance of repeating the effect. So replication decreases your false positive rate 20-fold. Similarly, data massaging may get you that coveted p<.05, but the chances of the same massages producing the same result again are very, very low.

True positives aren't nearly so affected. Again, a typical power level is 1 - β = 0.80 -- 80% of the time that an effect is really there, you'll be able to find it. So when you try to replicate a *true* positive, you'll succeed 80% of the time. So replication decreases your true positives by only 20%.

So let's say the literature has a 30% false positive rate (which, based on current estimates, seems quite reasonable). Attempting to replicate every positive result prior to publication -- and note that it's extremely rare to publish a null result (no effect), so almost all published results are positive results -- should decrease the false positives 20-fold and the true positives by 20%, leaving us with a 2.6% false positive rate. That's a huge improvement.
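The 2.6% figure follows directly from the numbers in the paragraph above. Here is a minimal sketch of that calculation, assuming a single replication attempt per result; the function name `after_replication` and its parameter names are made up for illustration.

```python
def after_replication(false_share=0.30, alpha=0.05, power=0.80):
    """Return (fraction of results that survive one replication,
    share of false positives among the survivors)."""
    false_surviving = false_share * alpha        # a false positive must get lucky again
    true_surviving = (1 - false_share) * power   # a true effect replicates with prob = power
    surviving = false_surviving + true_surviving
    return surviving, false_surviving / surviving


surviving, false_rate = after_replication()
print(f"results surviving replication: {surviving:.1%}")
print(f"false positives among them:    {false_rate:.1%}")
```

The first return value matters as much as the second: it is the fraction of originally publishable results that are still publishable once everything has to replicate.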

**So why not replicate more?**

So why don't people replicate before publishing? If 30% of your own publishable results are false positives, and you eliminate them, you've just lost 30% of your potential publications. You've also lost 20% of your *true* positives as well, btw, which means overall you've decreased your productivity by 43%. And that's without counting the time it takes to *run* the replication. Yes, it's nice that you've eliminated your false positives, but you also may have eliminated your own career!

When scientists are ranked, they're largely ranked on (a) number of publications, (b) number of times a publication is cited, and (c) quality of journal that the publications are in. Notice that you can improve your score on all of these metrics by publishing more false positives. Taking the time to replicate decreases your number of publications and eliminates many of the most exciting and surprising results (decreasing both citations and quality of journal). Perversely, even if someone publishes a failure to replicate your false positive, that's a citation and another feather in your cap.

I'm not saying that people are cynically increasing their numbers of bogus results. Most of us got into science because we actually want to know the answers to stuff. We care about science. But there is limited time in the day, and all the methods of eliminating false positives take time. And we're always under incredible pressure to pick up the pace of research, not slow it down.

I'm not sure how to solve this problem, but any solution I can think of involves some way of tracking not just how often a researcher publishes or how many citations those publications get, but how often those publications are replicated. Without having a way of tracking which publications replicate and which don't, there is no way to reward meticulous researchers or hold sloppy researchers to account.

Also, I think a lot of people just don't believe that false positives are that big a problem. If you think that only 2-3% of published papers contain bogus results, there's not a lot of incentive to put in a lot of hard work learning better statistical techniques, replicating everything, etc. If you think the rate is closer to 100%, you'd question the meaning of your own existence. As long as we aren't keeping track of replication rates, nobody really knows for sure where we are on this continuum.

That's my conclusion. Here's Lehrer's:

The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that's often not the case. Just because an idea is true doesn't mean it can be proved. And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe.

I say it again: huh?

## 18 comments:

Great post! I really enjoy your blog. You are absolutely right about the replication before publishing --- in fact, this is exactly what is known as a test set in machine learning. And you can't just publish the "training set" results to show that your methods work. I cannot understand why it is not the standard practice in scientific publications in general.

@ Michael B -- Interesting point. I hadn't considered that analogy. We typically think of our statistical tests as predictors of replicability. But there is also a sense in which we are fitting a model, and you're right: the test of a model is not its ability to model the data it was fitted to but its ability to model a new set of data.

Excellent post! I'm just a lowly undergraduate, but I feel as though the three biggest issues in the scientific method are a) biases, b) the incentive to publish positive results, and c) the incentive to make money.

The combination of all three, I think, accurately describes the problems we are facing when referencing the failure to replicate and the major flaws of the scientific process.

The question remains: how do we remove these incentives from the scientific process?

In physics, the rule is "An experiment doesn't prove a hypothesis unless it was formulated before the experiment." This means that experimental results like the one with apples must be replicated.

Of course, physicists have it easier, as in many cases they know what they expect. However, when someone says that they expected to get what they actually got, they are misleading the reader into thinking that the hypothesis is proved according to this rule. OTOH, if this is standard practice, it isn't dishonest in the same way that "Pleased to meet you" can't really be a lie.

Nice post! I don't have access to the New Yorker article, but I think these are some of the most serious problems we face as scientists, and it's nice to see them discussed in a mainstream outlet. I've written about these issues in several papers and blog posts--e.g., here, here, here, and here. This post crunches some of the numbers you're talking about (plotting false positives as a function of data peeking).

Also, a couple of minor corrections:

The probability of replicating a result significant at p < .05 isn't approximately 95%, it's more like 50%. But that's only true if you assume the effect in question is unbiased. And precisely because unexpected findings will tend to capitalize on chance somewhat, the actual probability of replicating an unexpected finding that was significant at around p < .05 will typically be somewhat lower. In cases where there's reason to think effects are rampantly inflated (e.g., in genetic or fMRI studies, where we typically correct for thousands of comparisons), replication power in an identically-sized sample may be pretty close to zero. So the reality is that if you're serious about wanting to replicate every effect (which I agree with in principle), you not only need to do double the work, but actually *more* than double the work, because you'll generally want a larger sample size the second time around (other things being equal).

Related to that, I think the idea that power is normally around 80% is wishful thinking. That's the conventional level for 'adequate' power, but for the above reasons, the reality is that the vast majority of studies have far less power than that. So while direct replication should still be our ideal, the unfortunate reality is that the *failure* of a direct replication is very often not actually a good indicator that the effect wasn't 'real' (whatever that means)--just that it may be smaller than initially estimated (which, again, is just a manifestation of the same 'decline' effect).

@Tal -- I was being succinct when saying "95% chance of replicating". I'll stick with my t-test example. Significance at p=.05 level means there's less than a 5% probability the two samples were drawn from the same population. So conversely there's a 95% chance the samples are from different populations. If the samples are from different populations, you should be able to replicate the effect, given sufficient power. Whether you will actually replicate the effect depends a great deal on the statistical power of the replication.

As far as whether we normally have .8 power ... I don't know what it ends up being in practice, but let's point out that .8 power is called "adequate" but failing to find real effects 20% of the time is actually kinda lousy. So I *hope* people have at least .8 power. I expect the fact that it is hard to publish null results also pushes people in the direction of collecting enough data to have sufficient power.

What makes you suspect otherwise?

I guess I don't see why saying there's approximately a 95% probability of replication is more succinct than saying there's approximately a 50% probability of replication... and the latter has the benefit of being true (or at least, much closer to the truth).

I think in your example you might be confusing the probability of the data given the hypothesis (P(D|H)) with the probability of the hypothesis given the data (P(H|D)). Observing that p < .05 means that P(D|H0) is < 0.05 (where H0 is the null). When you talk about the probability of the samples being from the same or different populations, you're talking about P(H|D)--the probability that the null is true (or that the alternative hypothesis H1 is true, which is the complement) *given the data*. You can't actually calculate that just from the observed p value, because you don't know the prior probabilities P(H0) and P(D). I think what you're thinking of as the complement, which really is 0.95, is P(~D|H0)--the probability that you *wouldn't* observe the data if the null were true. But that's generally not an interesting quantity.

On the power thing, there have been many reviews in many different domains, and they tend to converge pretty strongly on the conclusion that most studies are underpowered. The classic paper is Cohen's (1962) review of social and abnormal psychology articles, and then Sedlmeier and Gigerenzer followed up about 20 years later and showed that power hadn't increased at all. More recent reviews all basically show the same thing--power hasn't budged (though there are some domains where people have been much better about this--e.g., population genetics, where people now routinely use samples in the tens of thousands). I report some power analyses for fMRI studies in this paper, and the results are not encouraging. Actually, I think if anything, most of the power reviews may even be too optimistic, because they tend to assume medium or large effect sizes, when in fact meta-analyses suggest that most effects are small. So while it certainly varies by domain, as a general rule of thumb, I think it's safe to assume that the average study is underpowered.

It would be nice if the prevalence of null results pushed people to routinely collect much larger samples, as you suggest; but as I discuss in the same paper, and as people like Ioannidis have pointed out, that's counteracted by the fact that smaller samples give biased effect size estimates, leading people to think that their effects are bigger than they are (and hence, to think they need smaller samples than they do).
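The point about small samples giving inflated effect size estimates can be illustrated with a short simulation. This is only a sketch under stated assumptions: the true effect (`true_d = 0.2`), the sample size (`n = 20`), and the use of a z-test with known unit variance are all hypothetical choices made for illustration, not values taken from any study discussed here.

```python
import random
from statistics import NormalDist, mean

def simulate_significant_estimates(true_d=0.2, n=20, alpha=0.05,
                                   n_studies=10_000, seed=0):
    """Simulate many small studies and keep only the 'publishable' ones.

    Assumption: each study's estimate is the mean of n unit-variance
    observations, tested against zero with a known-variance z-test.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = 1 / n ** 0.5  # standard error of the estimated effect
    all_estimates, significant = [], []
    for _ in range(n_studies):
        estimate = rng.gauss(true_d, se)  # observed effect in one study
        all_estimates.append(estimate)
        if estimate / se > z_crit:  # significant in the predicted direction
            significant.append(estimate)
    return mean(all_estimates), mean(significant), len(significant) / n_studies

avg_all, avg_sig, share_sig = simulate_significant_estimates()
print(f"mean estimate, all studies:         {avg_all:.2f}")
print(f"mean estimate, significant studies: {avg_sig:.2f}")
print(f"share of studies significant:       {share_sig:.2f}")
```

Under these assumptions, the studies that clear the significance bar necessarily report an effect larger than the true one, because with n = 20 only estimates well above 0.2 can reach significance. A researcher who plans a follow-up sample size from such an estimate would tend to plan too small a sample.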

The other problem is that people rarely actually attribute null results to low power; it's much more common to see invisible moderating variables invoked--e.g., "the fact that we didn't get the effect when the stimuli were faces rather than words may mean there are dissociable systems involved in processing words and faces." Power is not an intuitive concept, and it's hard to remember that a failure to obtain meaningful results often says more about what you *didn't* do (collect enough subjects) than what you *did* do (vary some experimental variable).

There were several typos/misspellings/words missing that would have been helpful. Try using spell-check next time.

@Tal: You are right to point out that what we are estimating is the probability of the data given the hypothesis, whereas what we want to estimate is the probability of the hypothesis. My understanding was that there is no way to estimate the latter, and so we use the former as a proxy.

I'm still not sure I know what you mean by "50% chance of replication". If you mean "50% chance the null hypothesis is in fact false" -- that is, that the effect is real and should be found again -- then that seems low for the following reason: I think it's a reasonable assumption that any given tested hypothesis has around a 50% chance of being true (and thus the null hypothesis has a 50% chance of being false).

If you mean that p=.05 is equivalent to a 50% chance that the null hypothesis is false, then what you're saying is that being significant at that level carries *no information whatsoever*. I believe there's a real problem, but that strikes me as overly pessimistic.

I should say that whenever somebody has argued with my hypothesis that 50% of tested null hypotheses should be false, the argument is that the number should be much higher, since researchers single out null hypotheses particularly likely to be false (that is, positive hypotheses particularly likely to be true). On this account, then, a p-value of .05 carries *less* information than the fact that the researcher ran the experiment to begin with.

Possibly I've completely misunderstood what you're saying. Is that about right?
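The quantities in this exchange can be tied together with Bayes' rule. The sketch below is illustrative only: it assumes the simplified framing in which a hypothesis is either exactly null or real, takes the 50% prior proposed above as an input, and treats the power values as hypothetical. The function name `prob_effect_is_real` is made up for this example.

```python
def prob_effect_is_real(prior, power, alpha=0.05):
    """P(effect is real | significant result), by Bayes' rule.

    prior: assumed P(H1) before running the study (hypothetical input)
    power: assumed P(significant | H1)
    alpha: P(significant | H0), the significance threshold
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# Same 50% prior, three hypothetical levels of power.
for power in (0.8, 0.5, 0.2):
    posterior = prob_effect_is_real(prior=0.5, power=power)
    print(f"power={power:.1f}  P(real | significant)={posterior:.3f}")
```

The posterior depends on the prior and on power, not on the p value alone, which is why neither 95% nor 50% follows from p < .05 by itself. It is also a different quantity from the chance that a repeat of the study reaches significance.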

*I'm still not sure I know what you mean by "50% chance of replication". If you mean "50% chance the null hypothesis is in fact false" -- that is, that the effect is real and should be found again -- then that seems low for the following reason: I think it's a reasonable assumption that any given tested hypothesis has around a 50% chance of being true (and thus the null hypothesis has a 50% chance of being false).*

By "chance of replication" I just mean the probability of obtaining a statistically significant effect in the same direction given an identical study (i.e., same design, number of subjects, etc.). In a world in which all hypothesis tests are unbiased, obtaining a result significant at p < .05 implies a roughly 50% chance of obtaining a second statistically significant result if you were to redo the study without changing anything.

In the real world, hypothesis tests aren't unbiased, of course; that's what this entire debate is largely about. There's a tendency to selectively report and pursue effects that attained statistical significance, so the reality is that most of the time, the true probability of replication (in the same sense as above) is going to be lower than 50%. And again, there are domains where we can say pretty confidently that it's going to be *much* lower.

*If you mean that p=.05 is equivalent to a 50% chance that the null hypothesis is false, then what you're saying is that being significant at that level carries no information whatsoever. I believe there's a real problem, but that strikes me as overly pessimistic.*

That's not the implication I'd draw... I'm not sure it's meaningful to talk about the probability of the null being true or false. Strictly speaking, the null is (for practical intents) *always* false. What effect could you possibly name where you really believe that if you sampled the entire population, the magnitude of the effect would literally be exactly zero? In a dense causal system, that's inconceivable. Everything has an influence on everything, however small.

(continued from above comment)

If you think about it in terms of effect sizes, then this problem goes away. You have some prior belief about how big the effect is, then you look at the data, and you update your belief accordingly. The standard practice in psychology is to effectively assume a completely uniform prior distribution (and this is one of the things Bayesians rail against, because if you really had no reason to think any value was likelier than zero, why would you ever do the study in the first place?). So, if we're being strict about it, then achieving a result that's significant at p < .05 *is* giving you information, because you started out with the null (which is effectively a prior in this case) that there wasn't any effect. If you intuitively feel like that's wrong, then I think what you're tacitly saying is that it's silly to use uniform priors, and we should build at least *some* sense of the expected effect size into the test--which I'd agree with.

Notice that once you do that, it does become entirely possible that you would obtain a result statistically significant at p < .05 yet walk away feeling *less* confident about your prediction. For instance, if you think the effect you're testing is one standard deviation in magnitude, and you're conducting a one-sample t-test on 50 subjects, you could get an effect of a quarter of a standard deviation and it would still be significant at p < .05. But you wouldn't want to walk away concluding that your hypothesis was borne out--you would in fact conclude the evidence was weaker than you thought.

*I'm not sure it's meaningful to talk about the probability of the null being true or false. Strictly speaking, the null is (for practical intents) always false.*

Do you think so? I think it's pretty simple to design experiments where we expect the null hypothesis to be true. Tests of ESP, for instance. Since there are an infinite number of ESP manipulations one could try, there is necessarily an infinite number of experiments where we expect the null hypothesis to be true. (If you believe in ESP, I'm sure you can work out your own class of examples.)

What's left is to decide in practice, for a given field, how often the null hypothesis is likely to be true. I've already put my money on 50% for psych -- at least, the areas of psych I'm familiar with.

I don't know what area of psych you're referring to in your post, but in my subfield (language) and most subfields I read, we mostly *do not care about effect size*. We're studying underlying structure, so any effect of any size is meaningful -- and, in fact, effect size has no theoretical meaning in most cases.

So if it were actually the case that we knew the null hypothesis had to be false, we'd never bother to run the experiment -- except in the case where we need to know the direction of an effect, though that only comes up every so often.

In such cases, I don't think knowing you have a 50% chance of being able to repeat an effect carries much value, if any. Would progress even be possible, if this were true? I think not. Yet there has been progress, which is why I'm skeptical of your claim. I'm not saying you've done the math wrong, but perhaps some of the assumptions are incorrect, at least for the branches of cognitive and social psychology/neuroscience that I follow.

But for fields where effect size matters, these are interesting ideas.

*Do you think so? I think it's pretty simple to design experiments where we expect the null hypothesis to be true. Tests of ESP, for instance.*

ESP is an excellent example, because that's about as clean a case as you could make, and even there, I don't think it's plausible to expect that you'd ever get an effect of exactly zero if you sampled the entire population. Remember, the null hypothesis isn't something that lives in construct land (where you can say things like "ESP doesn't exist, so it can't be associated with anything"); it has to be operationalized, otherwise we can't test it. I'd argue that for *any* operationalization of ESP, there are going to be many potential confounds and loose ends that are necessarily going to make your effect non-zero. Take precognition experiments. A very basic requirement is that you have to have a truly random number generator. Well, there isn't really any such thing. Not in the sense you'd require in order to ensure an effect of *literally* zero. Remember, any systematic blip, no matter how small, renders the null false. If the body heat generated by larger subjects systematically throws off the computation by one bit in eight trillion, that's still a non-zero association. And this is for a contrived example; for an average hypothesis that most psychologists would actually be willing to entertain, you could easily reel off dozens if not hundreds of factors that would ensure you have a non-zero association...

*What's left is to decide in practice, for a given field, how often the null hypothesis is likely to be true. I've already put my money on 50% for psych -- at least, the areas of psych I'm familiar with.*

I guess I don't understand what this means, or where that number comes from. The probability of rejecting the null hypothesis depends in large part on your sample size. So that 50% number can't be referring to the probability of rejecting the null hypothesis in actual experiments, because if that were the case, so long as the effect wasn't exactly zero, you'd be able to turn 50% into almost any other probability just by collecting more or fewer subjects. Which would render the number meaningless.

The only way I can make sense of this is if you really believe that for 50% of all hypotheses that people suggest, the effect in the entire population is *literally* zero. Not small; not close to zero; not a correlation of 0.0001 (which is still statistically significant in a sample of 7 billion people!); but *exactly* zero. If that's really what you believe, then we're at an impasse, but frankly the idea that any association between two meaningful psychological variables in a dense causal system like our world would ever be *exactly* nil seems inconceivable to me.

continued from above...

*I don't know what area of psych you're referring to in your post, but in my subfield (language) and most subfields I read, we mostly do not care about effect size. We're studying underlying structure, so any effect of any size is meaningful -- and, in fact, effect size has no theoretical meaning in most cases.*

I don't see how it's possible not to care about effect size. I'll grant that most psychologists may not stop to think about what constitutes a meaningful effect size very often, but that doesn't mean they're not making implicit assumptions about effect size every time they run a test. To put it in perspective, consider that if you routinely conducted your studies with 1,000,000 subjects each, *all* of your tests would produce statistically significant results (the critical effect size for p < .05 with that sample size is around 2/1000th of a standard deviation--good luck getting effects smaller than that!). So how is it possible not to care about effect size and only about rejecting the null, if all it takes in order to reject the null is collecting a large enough sample? For that matter, I assume you tend to treat a finding significant at p < .05 differently from one significant at p < .00001--and the only difference between the two is... effect size.

*So if it were actually the case that we knew the null hypothesis had to be false, we'd never bother to run the experiment -- except in the case where we need to know the direction of an effect, though that only comes up every so often.*

That's exactly why null hypothesis testing is kind of absurd. That's not to say it isn't a *useful* fiction, but it's still a fiction--there are few if any situations in which a null of zero is at all meaningful. The only reason the framework actually works is because we tend to run samples small enough that we don't run into the problem of having *everything* be statistically significant, so we rarely have to think about how absurd it is. In other words, what's happening is that p values end up being proxies for effect size in virtue of the kind of sample sizes we use.

*In such cases, I don't think knowing you have a 50% chance of being able to repeat an effect carries much value, if any. Would progress even be possible, if this were true? I think not. Yet there has been progress, which is why I'm skeptical of your claim. I'm not saying you've done the math wrong, but perhaps some of the assumptions are incorrect, at least for the branches of cognitive and social psychology/neuroscience that I follow.*

I'm not sure what you're objecting to here... If you take the significance testing framework at face value, it's simply a fact that a finding that's significant at p = .05 will, on average, have a 50% chance of replicating if you repeat the study. I'm not expressing my opinion or building in any extra assumptions beyond what you already assume when you run a t-test; that's just the reality. If you doubt it, just take the critical effect size that corresponds to p = .05 for a given sample size and run a power calculation for the same effect size, alpha, and sample size. You will get 50%. If you don't like it, your problem is with the hypothesis testing framework.
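The power calculation described in this comment can be sketched in a few lines. To stay within the Python standard library, the sketch assumes a two-sided one-sample z-test with known unit variance as a large-sample stand-in for the t-test; the function name `critical_effect_and_power` and the sample sizes are illustrative choices, not anything specified in the thread.

```python
from statistics import NormalDist

def critical_effect_and_power(n, alpha=0.05):
    """Critical effect size, and the power to detect exactly that effect.

    Assumption: two-sided one-sample z-test on n subjects with known unit
    variance (a large-sample approximation to the t-test).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    d_crit = z_crit / n ** 0.5      # observed effect that lands exactly on p = alpha
    expected_z = d_crit * n ** 0.5  # mean of the test statistic if the true effect is d_crit
    # Probability that a replication's statistic falls beyond either critical value.
    power = (1 - z.cdf(z_crit - expected_z)) + z.cdf(-z_crit - expected_z)
    return d_crit, power

for n in (15, 50, 1_000_000):
    d_crit, power = critical_effect_and_power(n)
    print(f"n={n:>9,}  critical d={d_crit:.4f}  replication power={power:.3f}")
```

Under this approximation the power comes out at essentially one half regardless of n, because a true effect equal to the critical effect centers the replication's test statistic on the critical value. The same function also reproduces the figure quoted for a million subjects: a critical effect of roughly 0.002 standard deviations.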

(and again, sorry, last bit...)

I'm also still not sure why you think a 50% chance of replication doesn't allow any progress. When you test against a null of zero, your expectation isn't that the null hypothesis is false 50% of the time (if it is, you're testing against the wrong null!), it's that there is *no effect whatsoever*. Learning the direction of the effect, and that it can be reliably distinguished from zero 50% of the time, is hugely informative! Now, it's true that the same result might be completely uninformative if your prior exactly matched the posterior, and that could certainly happen. But that's not an argument *for* hypothesis testing, it's precisely the kind of argument Bayesians use for why you should use a Bayesian approach.

*But for fields where effect size matters, these are interesting ideas.*

There aren't any fields of psychology where effect size doesn't matter, for the reasons articulated above. The fact that you're rarely explicit about effect sizes in your studies doesn't mean you aren't making implicit assumptions about effect size all the time. You're just using p values as a proxy, because your sample sizes aren't big enough that everything comes out statistically significant. If you doubt this to be true, run a few experiments with 10,000 subjects each and then tell me how you're going to decide which effects you care about and which you don't.

Anyway, this comment thread is getting kind of long (sorry, I know I'm not being very succinct), so the last word is yours.

@tal. Don't worry about the long thread. I know what your negative argument is, but what is your positive argument? Even if we accept that in practice there are always confounds and that, with sufficient power, we'll always reject the null hypothesis, my claim that approximately half the time we care about whether our *intended* manipulation has an effect (intended manipulation, not anything we manipulated on accident) still stands.

If you deny that's the case, then the place to write about that is in the post I just put up. But if you agree that we often do care, how do we go about testing the existence or absence of effects?

---

BTW I worked through the calculations you suggested, and you seem to be overly optimistic. A 2-sample t-test with 15 subjects per condition is significant at the p=.05 level with t=1.7. That's an effect size (in Cohen's terms) of .44. An effect size of .44 with the same alpha has observed power of .21. So even though this effect is real, we expect to see it only 1/5 of the time we test for it.

Is that right?
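One way to cross-check arithmetic like this without special software is to simulate it. The sketch below takes n = 15 per group and t = 1.7 from the comment above, converts t to Cohen's d with the textbook equal-groups formula d = t * sqrt(2/n), and then estimates power by simulation. The normal data, the seed, the number of simulations, and the helper names are all assumptions made for illustration, and the figures it prints may differ from those quoted above depending on which conversion and which definition of power were used.

```python
import random

def sample_var(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def two_sample_t(a, b):
    """Pooled-variance t statistic for two equal-sized groups."""
    n = len(a)
    pooled_var = (sample_var(a) + sample_var(b)) / 2
    return (sum(a) / n - sum(b) / n) / (pooled_var * 2 / n) ** 0.5

def simulated_power(d, n=15, t_crit=1.7, n_sims=5_000, seed=1):
    """Share of simulated replications with t > t_crit when the true effect is d."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        treated = [rng.gauss(d, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        if two_sample_t(treated, control) > t_crit:
            hits += 1
    return hits / n_sims

n, t_crit = 15, 1.7
d = t_crit * (2 / n) ** 0.5  # Cohen's d implied by t for two groups of size n
power = simulated_power(d, n, t_crit)
print(f"implied d = {d:.2f}")
print(f"simulated power = {power:.2f}")
```

Changing `d` or `n` in the call to `simulated_power` shows how quickly the chance of a successful replication falls when the true effect is smaller than the one originally observed.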

You may have already seen this, but if not it is definitely worth a look. Dance of the p-values is a short YouTube video explaining why p < .05 does not mean there is a 95% chance of replication, amongst other things. http://www.youtube.com/watch?v=ez4DgdurRPg
