
When should an effect be called significant?

Note: This post originally contained an error in the mathematics, which Tal of Citation Needed kindly pointed out. The error has been corrected.

In the thread following my earlier post on false positives, Tal made the observation that a typical study that is significant at the p=.05 level has about a 50% chance of being replicated.

It turns out that this depends heavily on what you mean by replicate. I'm going to work through some numbers below. Again, stats isn't my specialty, so please anyone jump in to correct errors. But I think I've got the general shape of the issues correct.

I got a significant result! Can I get it again?

Let's say you ran an experiment comparing the IQ scores of 15 people who prefer kiwi with the IQ scores of 15 people who prefer rambutan. You find that people who prefer rambutan have IQs 11.2 points higher than those who prefer kiwi. Assuming the standard deviation is 15 (which is how IQ tests are normalized), that should give you a t-value of 11.2 / (15 * (2/15)^.5) = 2.04 and a p-value of about .05.
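
If you want to check that arithmetic yourself, here is a minimal sketch in Python; the numbers are just the ones assumed above (an 11.2-point difference, a standard deviation of 15, and 15 people per group), not real data.

    # A quick check of the arithmetic above (assumed numbers from the example:
    # an 11.2-point difference, sd = 15, n = 15 per group).
    from scipy import stats

    diff, sd, n = 11.2, 15.0, 15
    t = diff / (sd * (2 / n) ** 0.5)   # two-sample t with equal ns and equal sds
    df = 2 * n - 2
    p = 2 * stats.t.sf(t, df)          # two-tailed p-value
    print(round(t, 2), round(p, 3))    # roughly 2.04 and 0.05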

So you've now got a significant result! You tell all your friends, and they try running the same experiment. What are the chances they'll get the same results, significant at the p=.05 level?

The chances are not great. Even assuming that the underlying effect is real (rambutan-eaters really are smarter), your friends will only replicate your result about 51% of the time, assuming they use exactly the same methods (according to a nifty power calculator found online here).
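
If you'd rather not rely on a web calculator, here is roughly the same power calculation sketched in Python with statsmodels, assuming the true effect really is the 11.2 points you observed:

    # A rough replication-power calculation: treat the observed 11.2 points as
    # the true effect (d = 11.2/15) and rerun the same design
    # (15 per group, two-tailed alpha = .05).
    from statsmodels.stats.power import TTestIndPower

    d = 11.2 / 15
    power = TTestIndPower().solve_power(effect_size=d, nobs1=15, alpha=0.05,
                                        alternative='two-sided')
    print(round(power, 2))   # roughly 0.51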

Define "get it"

Of course, we were assuming above that rambutan-eaters really are 11.2 IQ points smarter than kiwi-eaters (BTW I like both kiwi and rambutan, so nothing is meant by this example). In that case, your friends might not have gotten results significant at the p=.05 level, but they very likely found higher average IQs for their samples of rambutan-eaters relative to kiwi-eaters. And of course, what we really care about is how easy it will be to replicate the rambutan/kiwi difference, not how easy it will be to get the significant p-value again. The point of science is not to be able to predict statistically-significant differences but simply to predict differences. It's well beyond my statistical abilities to say how often a replication would at least find a difference in the same direction, but hopefully someone will step up in the comments and let us know.

In practice, though, other people are only going to follow up on your effect if they can replicate it at the standard p=.05 level. What can we do to improve the chances of replicability?

Lower alphas

Let's suppose your effect had been significant at the p=.01 level. We can manage that while keeping the effect size the same (11.2 IQ points) if we increase our sample to 26 kiwi-eaters and 26 rambutan-eaters (t = 11.2/(15 * (2/26)^.5) = 2.69). Now our chance of getting another significant result at the p=.01 level is ... 52%.

But we don't really care about getting p=.01 again; we want to get the result again at the p=.05 level, which should happen around 76% of the time. Now, what if we had a result significant at the p=.001 level the first time around? We'd have needed about 42 subjects per condition. The chance of replicating that at the p=.05 level is 92%.

p-value    # subjects/condition    Likelihood of repeating at p=.05 level
.05        15                      51%
.01        26                      76%
.001       42                      92%
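
As a sanity check, here is a short Python sketch (again assuming the observed 11.2-point effect is the true one) that reproduces the right-hand column for the three sample sizes:

    # Power of a same-sized replication at alpha = .05, assuming the true
    # effect is the observed 11.2 points (d = 11.2/15).
    from statsmodels.stats.power import TTestIndPower

    d = 11.2 / 15
    for n in (15, 26, 42):
        power = TTestIndPower().solve_power(effect_size=d, nobs1=n, alpha=0.05,
                                            alternative='two-sided')
        print(n, round(power, 2))   # roughly 0.51, 0.76, 0.92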

Replication

Of course, there are things that I'm not considering here, like the possibility that your original experiment underestimated the effect size. For instance, let's say that the true effect size is 15 IQ points (which is a lot!). Now, your chance of finding an effect significant at the p=.05 level with only 15 participants per condition is 75%. That's a lot better than what we started with, though not perfect. For the effect to be large enough that you'd see it 95% of the time at the p=.05 level, it would need to be over 20 IQ points, which is a monstrously large effect. Incidentally, if you ran this study with 15 rambutan-eaters and 15 kiwi-eaters and found a 20 IQ point effect, that would be significant at roughly the p=.001 level.
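
Here is a sketch of those last two numbers in Python, again with statsmodels; the 15-point effect and the 95%-power target are the hypothetical values from the paragraph above, not data.

    # Power if the true effect were 15 points (d = 1), and the effect size
    # you'd need for 95% power with 15 people per group.
    from statsmodels.stats.power import TTestIndPower

    analysis = TTestIndPower()
    power_15pts = analysis.solve_power(effect_size=1.0, nobs1=15, alpha=0.05,
                                       alternative='two-sided')
    d_needed = analysis.solve_power(nobs1=15, alpha=0.05, power=0.95,
                                    alternative='two-sided')
    print(round(power_15pts, 2))      # roughly 0.75
    print(round(d_needed * 15, 1))    # roughly 20 IQ points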

What I get from all this is that if you want a result that you and others will be able to replicate, you're going to want the p-value in your original experiment to have been lower than p=.05.

4 comments:

tal said...

A few quick comments:

Assuming the standard deviations is 15 (which is how IQ tests are normalized), then that should give you a t-value of 8 / (15 * 1/15^.5) = 2.07 and a p-value just under .05.

I think you're confusing one-sample and two-sample t-tests. The t value above is correct for a one-sample t-test; you haven't accounted for the variance pooling for independent samples. But you're then providing the replication power for a two-sample t-test, which is why you get 29%. Re-run with the right numbers and you get 50%.

In practice, though, other people are only going to follow up on your effect if they can replicate it at the standard p=.05 level. What can we do to improve the chances of replicability?

Well, again, if all you care about is rejecting the null once again, the answer is very simple: keep collecting more subjects, and you're 100% guaranteed to replicate it eventually (assuming it's a 'real' effect, of course--and if it's not, you don't want to replicate it).

What I get from all this is that if you want a result that you and others will be able to replicate, you're going want the p-value in your original experiment to have been lower than p.05.

I don't really understand the point of talking about it this way. It doesn't make sense to think of replication likelihood as depending on the original p value, because you could have made the original p value anything you wanted just by collecting more subjects. So all you're really saying here is "if you want to increase the likelihood of rejecting the null, collect more subjects." And if you want to know by how much you should increase the sample, take the effect size and plug it into a power calculation.

Of course, this just gets us back to my original point: if all you care about is rejecting the null, then you can replicate any non-zero effect with 100% probability if you just keep collecting subjects. It's just not an interesting problem. If you really think that the goal of empirical studies is just to reject a null of zero effect, then you can save yourself a lot of work by not doing anything at all, because for any question that you'd actually care about, you're guaranteed that the null is false a priori.

GamesWithWords said...

@Tal -- For some reason, your comment went to spam again. This is annoying.

You haven't accounted for the variance pooling for independent samples.

As I said, I'm assuming that the standard deviation is 15 for both samples. Thus the pooled standard deviation is
sqrt((15*15 + 15*15)/2) = sqrt(2*15*15/2) = 15

Is that not correct?

you can replicate any non-zero effect with 100% probability if you just keep collecting subjects.

So let's say I measure the heights of men and women. For some reason, I get a spurious effect where the women are taller than the men (funny sample). You're saying I'm guaranteed to replicate this if I test enough subjects?

Direction is going to matter.

you could have made the original p value anything you wanted just by collecting more subjects

Unless you run out of subjects and can't test any more. Again, we're not conducting a census. We're not interested -- at least, I'm not interested -- in knowing the state of the world at the present moment. We're interested in understanding the mechanics that generated the world and the present moment.

tal said...

Is that not correct?

The pooled sd is multiplied by root 2/n for the two sample test, so the denominator is larger and t is smaller.

So let's say I measure the heights of men and women. For some reason, I get a spurious effect where the women are taller than the men (funny sample). You're saying I'm guaranteed to replicate this if I test enough subjects?

Direction is going to matter.


Sorry, that was careless wording. I did add the appropriate caveat one paragraph up: "you're 100% guaranteed to replicate it eventually (assuming it's a 'real' effect, of course--and if it's not, you don't want to replicate it)."

Unless you run out of subjects and can't test any more. Again, we're not conducting a census. We're not interested -- at least, I'm not interested -- in knowing the state of the world at the present moment. We're interested in understanding the mechanics that generated the world and the present moment.

But you're defining 'understanding the mechanics' as 'being able to reject the null'. What I'm saying is that if you keep collecting data, you will eventually reject the null, so the implication is that if you want to understand the mechanics, you just need to collect more data. But then if it's just about rejecting the null, why do the study anyway? You're 100% certain to end up rejecting it.

Now, if what you're saying is that you care about being able to reject the null in a reasonably-sized sample, that's fine with me, but that's basically an effect size claim. You're saying that you don't care about your hypothesis if it takes a million people to reject the null, but you do care about it if you can do it in 20. I agree with that, and that's exactly why I think we should be explicit about effect sizes.

Anyway, I'm spending too much time on this discussion again (though I've found it interesting and useful!), so the last word is yours.

GamesWithWords said...

Hah. I missed that 2 in the numerator. OK, those numbers will have to be fixed.

What I'm saying is that if you keep collecting data, you will eventually reject the null.

From reading your previous comments and your blog, I think you've confused observed effect size with expected effect size. Our hypotheses are really about expected effect size. Certainly, whenever you actually measure an effect, it's only very rarely going to be exactly 0 (though this can and does happen). But that doesn't mean that the correct prediction isn't 0.

If it helps, think of a curve approaching an asymptote. At any finite point, you won't have reached the asymptote, but at infinity, you do. Infinity works differently, and our hypotheses are really about samples that are infinitely large.