Not enough published studies replicate. There are a number of reasons for this, but one of the most difficult to address is statistical power.

**Why is low power a problem?**

Suppose you want to test the effect of some training on IQ scores. You test 50 subjects: 25 in your experimental condition and 25 in the control condition. That's a fairly typical size for a psychology study. And you get a significant result. You might be tempted to conclude that your manipulation worked, but it might actually be *more* likely that your results are due to chance or experimenter error.

It depends on how large an effect your manipulation ought to have. If the training typically raises people's IQs by 7.5 points, your study would only have had a 41% chance of detecting it (given the usual assumption that scores are normally distributed). A more modest 5-point effect could be detected only 20% of the time. You'd need a 14-point effect to have a 90% chance of detecting it.
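These power figures are easy to check with R's built-in `power.t.test` (a quick sketch, assuming a two-sided, two-sample t-test at alpha = .05 and an IQ standard deviation of 15):

```r
# Power of a two-sample t-test with 25 subjects per condition,
# assuming IQ scores are normal with a standard deviation of 15
power.t.test(n = 25, delta = 7.5, sd = 15)$power  # ~0.41
power.t.test(n = 25, delta = 5,   sd = 15)$power  # ~0.20
power.t.test(n = 25, delta = 14,  sd = 15)$power  # ~0.90
```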

For reference, a 14-point effect is enough to move someone of average intelligence into the top fifth of humanity. We are in miracle-drug territory. More realistically, your manipulation is unlikely to have nudged IQ by more than a point or two, in which case there's essentially no chance your study could have detected it. So if you did detect something, it's probably an error.
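A one-line sanity check on that claim, assuming IQ is normally distributed with mean 100 and standard deviation 15:

```r
# Proportion of people scoring above 114 (average + 14 points)
1 - pnorm(114, mean = 100, sd = 15)  # about 0.18, i.e. roughly the top fifth
```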

**Well, how much power do studies have?**

Concerns about statistical power (among other things) have led some researchers to declare that more than half of all published findings are false. Other researchers are more sanguine. In short, if you think that the effects we are studying tend to be pretty large (a standard deviation or more), there is little to worry about. If you think they tend to be small, the situation is dire.

Unfortunately, the only way to accurately determine an effect's size is to measure it multiple times. Obviously, you can't have done that in advance when running a brand-new study. You might be able to guesstimate based on the typical effect size in your field. Unfortunately, many fields primarily publish significant results. This introduces a bias, because effect size and significance are correlated.

Suppose we run the above experiment when the true effect size is 7.5 IQ points. On average, that is what we will find. But of course sometimes we'll run the experiment and the effect will be larger, and sometimes it will be smaller, simply due to random chance. By paying attention only to the significant results, we're selectively ignoring those experiments that happened, through no fault of their own, to underestimate the effect. This skews our results: on average we would report an effect of 11.8 IQ points -- much higher than the truth.
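This selection effect is easy to reproduce. Here is a minimal sketch (separate from the full simulation code at the end of the post): run many 25-vs-25 experiments with a true effect of 7.5 IQ points, but average only the significant ones.

```r
set.seed(1)  # for reproducibility
reported <- replicate(10000, {
  control      <- rnorm(25, mean = 100,   sd = 15)
  experimental <- rnorm(25, mean = 107.5, sd = 15)
  p <- t.test(experimental, control)$p.value
  if (p < .05) mean(experimental) - mean(control) else NA  # "publish" only if significant
})
mean(reported, na.rm = TRUE)  # around 11 to 12 points, well above the true 7.5
```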

**Effects aren't as big as you think**.

The typical effect reported in the psychology literature is about half a standard deviation, which is why I've been using the example of 7.5 IQ points above. However, because of the bias against publishing null results or replications, this is inflated. That means that the expectations of psychologists are inflated. We are used to seeing effects of half a standard deviation or more. As a result, we are going to chronically underestimate the number of subjects we need to run.

Unfortunately, without widespread publication of null results and replications, we cannot say how badly our perception is distorted, because the degree of distortion depends on how large effects really are. I ran a series of simulations involving a two-condition, between-subjects design to see how bad the distortion might be. In these simulations, I assumed that null results are never reported, which is only a slight exaggeration of the psychology literature.

In the graph below, the true effect size (measured in standard deviations) is on the X-axis, and the average reported effect size is on the Y-axis. The different lines reflect different numbers of subjects per condition.

Even with 1,000 subjects per condition, there is some distortion for effects smaller than one quarter standard deviation.

Note that these simulations assume that the researcher is correcting for multiple comparisons, isn't p-hacking, etc. The situation gets worse if we relax those assumptions.

In the simulation on the left, I assumed the researcher is engaging in contingent stopping. After every 5 subjects, the researcher checks her data. If the effect is significant, she stops and reports the result. If the effect is in the "wrong direction", she decides there's something wrong with her stimuli, revamps the experiment, and tries again.

While this is a little extreme, most researchers engage in some amount of contingent stopping. As you can see, this badly distorts the results. Even with 1,000 subjects, we end up distorting even large effects.

Failure to correct for multiple comparisons will have similar effects.

What this means is that if you are basing your expected effect size on prior experience, the effect you are trying to detect is probably not anywhere near as large as you think, and you may need many more subjects to detect it than you were planning.

**But I usually get significant results. Doesn't that mean I have enough subjects?**

Maybe. You might be testing an abnormally large effect. Such effects do exist. Alternatively, you may be engaging in contingent stopping, failing to adequately correct for multiple comparisons, or simply making a mistake somewhere in your analysis. It's probably worth checking.

To adapt a metaphor from Uri Simonsohn, you can search for exoplanets with binoculars. But you should keep in mind that it is so unlikely you could see an exoplanet with your binoculars that, if you do see one, you are probably mistaken in some way.

**I don't use t-tests. How does this apply to me?**

My examples above use t-tests because they are simple and widely known. But the same basic issues apply no matter what kind of analysis you do. If you are looking for some kind of effect, and if that effect is small, you'll need a lot of data to detect it. And it's probably smaller than you think.

If you are model-fitting, the noise in your data puts an upper limit on how well you can fit the underlying phenomenon. If your data are fairly noisy and yet your model fits really well, you are probably fitting the noise, not the signal. And your data are probably noisier than you think.
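Here is a toy illustration of that trap: with enough free parameters, a model can fit pure noise perfectly.

```r
set.seed(2)
x <- 1:10
y <- rnorm(10)             # pure noise: there is no signal to find
fit <- lm(y ~ poly(x, 9))  # a 9th-degree polynomial has 10 free parameters
summary(fit)$r.squared     # essentially 1: a "perfect" fit to nothing but noise
```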

**Recommendations**

I can't tell you how many subjects to run. I don't know. Nobody knows. Until there is a good mechanism for reporting null results, nobody will know.

In the meantime, I recommend running more than you think you need. If there are published studies looking at a similar phenomenon, look at their effect size and assume the true effect size is significantly smaller, then do a power analysis. If you run your study and get a positive result, it's probably a good idea to replicate it. If you get a null result, you might not wish to read too much into that.
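For instance, suppose the literature around your phenomenon reports effects of about half a standard deviation. If you plan instead for a quarter of a standard deviation (a hypothetical discount, but in the right spirit), the power analysis looks like this:

```r
# Subjects per condition needed to detect d = 0.25 with 90% power
power.t.test(delta = 0.25, sd = 1, power = 0.9)$n  # roughly 337 per condition
```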

If you don't have the means to test a lot of subjects, you have a few options. Your ability to detect an effect depends on both the signal-to-noise ratio and the amount of data you collect. If you can't collect more data, you can try to decrease the noise (e.g., by using a better, more accurate instrument). You can't necessarily increase the signal, because you can't change the laws of nature. But you can decide which laws of nature to study, and you might be better off studying one that has powerful effects.

**R code**

See below for the R code to run the simulations above. If anyone knows how to convince Blogspot to allow indentation, please let me know.

runsim <- function(N, d, iter, inter = N, verbose = FALSE) {
    # N = number of subjects/condition
    # d = effect size in Cohen's d
    # iter = number of simulations to run
    # inter = after this many subjects, check results. Stop if significant
    #         *or* in the wrong direction. Defaults to running all subjects.
    # verbose = should we print interim results?
    ES <- c()  # vector of effect sizes
    for (i in 1:iter) {
        # do iter simulations
        if (verbose) { print(paste("Starting simulation", i)) }
        con1 <- con2 <- c()  # initialize scores for the two conditions
        for (j in 1:ceiling(N / inter)) {
            # on each round, test an additional inter subjects
            n <- inter  # how many subjects do we need for this round?
            if (j * inter > N) { n <- N - (j - 1) * inter }  # adjust if the last round would exceed N subjects
            con1 <- c(con1, rnorm(n, 0, 1))
            con2 <- c(con2, rnorm(n, d, 1))
            if (verbose) {
                print(paste("pval:", round(t.test(con1, con2)$p.value, 2),
                            " diff in means:", round(mean(con2) - mean(con1), 2)))
            }
            if (t.test(con1, con2)$p.value < .05) {
                # reached a significant result!
                break
            }
            if (mean(con1) > mean(con2)) {
                # preliminary results are in the wrong direction.
                # Revamp the experiment and try again.
                break
            }
        }
        if (t.test(con1, con2)$p.value < .05) {
            ES <- c(ES, round((mean(con2) - mean(con1)) / ((sd(con1) + sd(con2)) / 2), 2))
        } else {
            # wasn't significant, so don't report
        }
    }
    return(ES)
}

## graph observed effect size against actual effect size for different N
results <- data.frame(N = c(rep(10, 10), rep(25, 10), rep(50, 10), rep(100, 10), rep(1000, 10)),
                      GT = rep(seq(.1, 1, .1), 5),
                      ES = 0)
for (i in 1:length(results$N)) {
    results$ES[i] <- mean(runsim(results$N[i], results$GT[i], 1000))
}
results$N <- as.factor(results$N)
results$expected <- results$GT

library(ggplot2)
quartz(width = 5, height = 4)  # macOS graphics window
p <- ggplot(data = results, aes(x = GT, y = ES, fill = N, linetype = N, colour = N))
p <- p + geom_line() + scale_color_manual(values = c("red", "blue", "dark green", "purple", "black"))
p <- p + scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .25)) +
    scale_y_continuous(limits = c(0, 1.5), breaks = seq(0, 1.5, .25))
p