Games with Words: replication

Showing posts with label replication. Show all posts

Unrealistic Scientific Optimism

Posted by GamesWithWords on Thursday, June 04, 2015

Not enough published studies replicate. There are a number of reasons for this, but one of the most difficult to address is statistical power.

Why is low power a problem?

Suppose you want to test the effect of some training on IQ scores. You test 50 subjects: 25 in your experimental condition and 25 in the control condition. That's a fairly typical size for a psychology study. And you get a significant result. You might be tempted to conclude that your manipulation worked, but it might actually be more likely that your results are due to chance or experimenter error.

It depends on how large of an effect your manipulation ought to have. If the training typically raises people's IQs by 7.5 points, your study would only have had a 41% chance of detecting it (given normal assumptions about normal distributions). A more modest 5 point effect could be detected 20% of the time. You'd need a 14 point effect to have a 90% chance of detecting it.

For reference, a 14 point effect is enough to move someone of average intelligence into the top fifth of humanity. We are in miracle drug territory. More realistically, your manipulation is unlikely to have nudged IQ by more than a point or two, in which case there's essentially no chance your study could have detected it. So if you did detect something, it's probably an error.

Well, how much power do studies have?

Concerns about statistical power (among things) have led some researchers to declare that more than half of all published findings are false. Other researchers are more sanguine. In short, if you think that the effects we are studying tend to be pretty large (a standard deviation or more), there is little to worry about. If you think they tend to be small, the situation is dire.

Unfortunately, the only way to accurately determine how large an effect is is to measure it multiple times. Obviously, you can't have done that in advance when running a brand new study. You might be able to guestimate based on the typical effect size in your field. Unfortunately, many fields primarily publish significant results. This introduces a bias, because effect size and significance are correlated.

Suppose we ran the above experiment and the true effect size is 7.5 IQ points. On average, that is what we would find. But of course sometimes we'll run the experiment and the effect will be larger and sometimes it will be smaller, simply due to random chance. By paying attention only to the significant results, we're selectively ignoring those experiments that happened, through no fault of their own, to underestimate the effect. This skews our results, and on average we would report an effect of 11.8 IQ points -- much higher than the truth.

Effects aren't as big as you think.

The typical effect reported in the psychology literature is about half a standard deviation, which is why I've been using the example of 7.5 IQ points above. However, because of the bias against publishing null results or replications, this is inflated. That means that the expectations of psychologists are inflated. We are used to seeing effects of half a standard deviation or more. As a result, we are going to chronically underestimate the number of subjects we need to run.

Unfortunately, without widespread publication of null results and replications, we cannot say how badly our perception is distorted, because the degree of distortion depends on how large effects really are. I ran a series of simulations involving a two-condition, between-subjects design to see how bad the distortion might be. In these simulations, I assumed that null results are never reported, which is only a slight exaggeration of the psychology literature.

In the graph below, the true effect size (measured in standard deviations) is on the X-axis, and the average reported effect size is in the Y-axis. The different lines reflect different numbers of subjects per condition.

As you can see, if you have 50 or fewer subjects per condition, you'll hardly ever report an effect size smaller than half a standard deviation, even when the true effect size is one tenth of a standard deviation. This is because reliably detecting an effect of one tenth of a standard deviation requires about 2,000 subjects per condition.

Even with 1,000 subjects per condition, there is some distortion for effects smaller than one quarter standard deviation.

Note that these simulations assume that the researcher is correcting for multiple comparisons, isn't p-hacking, etc. The situation gets worse if we relax those assumptions.

In the simulation on the left, I assumed the researcher is engaging in contingent stopping. After every 5 subjects, the researcher checks her data. If the effect is significant, she stops and reports the result. If the effect is in the "wrong direction", she decides there's something wrong with her stimuli, revamps the experiment, and tries again.

While this is a little extreme, most researchers engage in some amount of contingent stopping. As you can see, this badly distorts the results. Even with 1,000 subjects, we end up distorting even large effects.

Failure to correct for multiple comparisons will have similar effects.

What this means is that if you are basing your expected effect size on prior experience, the effect you are trying to detect is probably not anywhere near as large as you think, and you may need many more subjects to detect it than you were planning.

But I usually get significant results. Doesn't that mean I have enough subjects?

Maybe. You might be testing an abnormally large effect. Such effects do exist. Alternatively, you may be engaging in contingent stopping, failing to adequately correct for multiple comparisons, or simply making a mistake somewhere in your analysis. It's probably worth checking.

To adapt a metaphor from Uri Simonsohn, you can search for exoplanets with binoculars. But you
should keep in mind that it is so unlikely you could see an exoplanet with your binoculars, that if you do see one, you are probably mistaken in some way.

I don't use t-tests. How does this apply to me?

My examples above use t-tests because they are simple and widely known. But the same basic issues apply no matter what kind of analysis you do. If you are looking for some kind of effect, and if that effect is small, you'll need a lot of data to detect it. And it's probably smaller than you think.

If you are model-fitting, the noise in your data puts an upper limit on how well you can fit the underlying phenomenon. If your data are fairly noisy and yet your model fits really well, you are probably fitting the noise, not the signal. And your data are probably noisier than you think.

Recommendations

I can't tell you how many subjects to run. I don't know. Nobody knows. Until there is a good mechanism for reporting null results, nobody will know.

In the meantime, I recommend running more than you think you need. If there are published studies looking at a similar phenomenon, look at their effect size and assume the true effect size is significantly smaller, then do a power analysis. If you run your study and get a positive result, it's probably a good idea to replicate it. If you get a null result, you might not wish to read too much into that.

If you don't have the means to test a lot of subjects, you have a few options. Effect size is a combination of the signal-to-noise ratio and the amount of data you collected. If you can't collect more data, you can try to decrease the noise (e.g., by using a better, more accurate instrument). You can't necessarily increase the signal, because you can't change the laws of nature. But you can decide which laws of nature to study, you and you might be better off studying one that has powerful effects.

R code

See below for the R code to run the simulations above. If anyone knows how to convince blogspot to allow indention, please lemmeno.

runsim <- span="">function (N,d,iter,inter=N,verbose=FALSE){

#N = number of subjects/condition

#d = effect size in Cohen's d

#iter = number of simulations to run

#inter = after this number of subjects, check results. Stop if significant *or* in the wrong direction. Default to running all subjects.

#verbose = should we print interim results?

ES<-c span="">#vector of effect sizes

for (i in 1:iter){

#do iter simulations

if (verbose){print(paste("Starting simulation",i))}

con1<- span="">con2<-c span="">#initialize scores for the two conditions

for (j in 1:ceiling(N/inter)){

#on each round, test an addition inter subjects.

n<- span="">inter #how many subjects do we need for this round?

if (j*inter>N){n<- span="">N-j*inter} #adjust if last round would result in more than N subjects

con1<-c class="s3" span="">con1

,rnorm(n,0,1))

con2<-c class="s3" span="">con2

,rnorm(n,d,1))

if (verbose){print(paste("pval:",round(t.test(con1,con2)$p.value,2)," diff in means:",round(mean(con2)-mean(con1),2)))}

if (t.test(con1,con2)$p.value<.05){

#reached a significant result!

break

}

if (mean(con1)>mean(con2)){

#preliminary results are in the wrong direction. Revamp experiment and try again.

break

}

if (t.test(con1,con2)$p.value<.05){

ES<-c class="s3" span="">ES

,round((mean(con2)-mean(con1))/((sd(con1)+sd(con2))/2),2))

}else{

#wasn't significant, so don't report

}

return(ES)

}

## graph observed effect size against actual effect size for different N

results<-data .frame="" class="s3" span="">N

=c(rep(10,10),rep(25,10),rep(50,10),rep(100,10),rep(1000,10)),GT=c(rep(seq(.1,1,.1),5)),ES=0)

for (i in 1:length(results$N)){

results$ES[i]<-mean runsim="" span="">results$N[i],results$GT[i],1000))

}

results$N<-as .factor="" span="">results$N)

results$expected<- span="">results$GT

library(ggplot2)

quartz(width=5,height=4)

p <- ggplot="" span="">data=results, aes(x=GT, y=ES, fill=N, linetype=N, colour=N))

p <- class="s3" span="">p

+ geom_line()+ scale_color_manual(values=c("red","blue","dark green","purple","black"))

p <- class="s3" span="">p

+ scale_x_continuous(limits=c(0,1),breaks=seq(0,1,.25)) + scale_y_continuous(limits=c(0,1.5),breaks=seq(0,1.5,.25))

Posted by GamesWithWords on Tuesday, January 22, 2013

Chemistry has its own problems with replication, according to Nature:

Scrounging chemicals and equipment in their spare time, a team of chemistry bloggers is trying to replicated published protocols for making molecules. The researchers want to check how easy it is to repeat the recipes that scientists report in papers ... Among the frustrations [chemists] have experienced with the chemical literature ... are claims that reactions yield products in greater amounts than seem reasonable, and scanty detail about specific conditions in which to run reactions. In some cases, reactions are reported which seem to good to be true - such as a 2009 paper which was corrected within 24 hours by web-savvy chemists live-blogging the experiment.

It's hard to tell from the article how common it is for a reaction simply not to be possible at all as opposed to simply produce less product than reported. Presumably either is problematic, but the causes would be different.

Given the recent excitement about (non-)replication, one has to wonder if this problem is more or less common than in the past. While my gut instinct is that replication was probably less of a problem in the earlier, smaller days of science, it's also quite possible that it's like many forms of violent crime: extremely rare today by historical standards, but we care much more about it.

Eighteen papers on replication

Posted by GamesWithWords on Monday, December 17, 2012

Last month, Perspectives in Psychological Science published a special section on replicability in psychological science. With 18 papers, it took me a while to get through all of them, so I am only blogging about them now.

The issue contains many useful articles, particularly if you have not been following the replicability crisis/discussion carefully. I have been following the discussion pretty closely, and I still found a lot worth reading (a particularly enjoyable surprise was Giner-Sorolla's historical review of replicability problems, which is absolutely fascinating).

So I hope you check the issue out. It is currently open access. But if you'd rather have someone else read it first and summarize the main points ... then lucky for you, I've done just that. Below, I describe and discuss all 18 papers, though it should be said that these descriptions are through the lens of what I happen to find particularly interesting or novel. There are likely many other things in these papers that would be of interest to you. So check them out.

Bookends

The issue is bookended by two overviews. Harold Pashler and Eric-Jan Wagenmakers get the issue started with a brief review of recent scandals, failures to replicate, and other reports -- both in psychology and in other fields -- that replicability may be an issue. John Ioannidis, Legend of Replicability, has a contribution at the end of the issue ("Why science is not necessarily self-correcting") which similarly summarizes many of the main arguments of the issue. If you haven't been following along, these are good primers. Neuroskeptic ("The nine circles of scientific hell") makes many of the same arguments, but in the satirical style of Dante Alighieri (I only wish it was in rhyme).

Is there a replicability crisis?

Pashler and Harris ("Is the replicability crisis overblown?") address three arguments sometimes given that "the replicability crisis is overblown". The first is that the low alpha level (5%) ensures that, of published results, most are right. They show that this is most likely incorrect, given current publishing practices. The second even if not many direct replications are published, many "conceptual" replications are. They argue that the notion of "conceptual replication" is so fuzzy that it actually compounds the problem (one problem: there is no such thing as a failure to conceptually replicate):

We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some fo the famous and puzzling "pathological science" cases that embarrassed the natural sciences... What many observers found peculiar in these cases [like cold fusion] was that it took many years for a complete consensus to emerge that the phenomena lacked any reality... It appears that many exact replication attempts of the initial [cold fusion studies] were undertaken soon after the first dramatic reports of cold fusion. Such attempts produced generally negative results (Taubes, 1993). However, what kept faith in cold fusion alive for some time ... was a trickle of positive results achieved using very different designs that the originals (i.e., what psychologists would call conceptual replications).

Finally, they take issue with the claim that over the long run, science is self-correcting, given the lack of evidence that this is true (as far as I can tell, it's not even clear how you would show that science is self-correcting without rigorous, systematic studies of replicability).

While probably not everyone agrees with all the specifics of Pashler and Harris's argument, there does appear to be general consensus that there is a real replicability problem. 1,292 psychologists around the world surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") believed, on average, that 53% of the results in psychology could be replicated in direct replications. (One word of caution: it may be that psychologists most concerned about replication were also disproportionately likely to agree to participate in the survey; still, 1300 psychologists is a significant fraction of the field.)

What's the best way of getting a significant result?

Ideally, the answer would be "test for a real effect", but that of course requires that you know whether your hypothesis is true or not, which presumably you don't (otherwise, why run the study?). Bakker, van Dijk and Wicherts ("The rules of the game called psychological science") show that, as a general rule, you are better off running a few underpowered studies than one large, sufficiently-powered study (particularly if any real effect sizes are small). That is, the more likely it is that your results aren't true, the more likely you are to have a publishable finding. This is bad news when coupled with the fact that humans are good at finding efficient strategies. (Note that this doesn't require bad actors: One can simply find that certain procedures -- like testing around X subjects -- have generally produced "good", publishable research in the past.)

The bottleneck

Giner-Sorolla ("Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science") argues that a certain amount of bad science is expected when there are too few publication outlets:

Articles cannot pass through just by showing theoretical meaning and methodological rigor; thei results must appear to support the hypothesis perfectly... Imagine that two colleagues in psychology each show you a manuscript. One has two studies with roughly equal numbers of participants. Both studies support the hypothesis, each with a significant key result at p=.04. The other paper has three studies, also supporting the hypothesis, but the last two studies' individual results are only near significant: p=.02, p=.07, and that most annoying figure, p=.11.

Statistically, the second paper actually presents stronger evidence, it should be obvious that the the first one is more likely to be published.

There are two ways of having perfect-looking results, and unfortunately one is a lot easier than the other. Again, this doesn't require bad actors -- just competent ones. Giner-Sorolla's solution is not to make publication easier -- various reasons are given for why that is unlikely to work -- but rather to develop an alternative method for distributing less "pretty" results to those who might be interested (such as those doing meta-analysis).

The summary here does not give this fascinating article justice, which also includes (among things) a fascinating history of hypothesis testing, including a previous crisis of confidence in Psychology in the 1970s.)

Too good to be true

Francis ("The psychology of replication and the replication of psychology") takes the point a bit further: our love of "pretty" data has led us to expect data that are too good to be true. Most psychology studies have low power to detect an effect; thus, presenting several studies in a single paper all of which show the same significant effect is highly suspicious. He goes through eight experiments recently published in Galak & Meyvis (2011), seven of which show significant results despite their apparent power ranging from about 0.25 to 0.8 (that is, between a 25% and 80% chance of detecting the effect). The likelihood of this happening even if the hypothesis was true was about 0.08. (This is actually an underestimate: the one "non-signifcant" experiment had a p-value of 0.056, which many authors would count as significant -- which, again, makes the reported findings even less likely.)

Francis goes out of his way to say that this does not necessarily represent malicious intent:

Publication bias could occur in a set of experiments because authors closely follow the standards of the field. Indeed, the experiments reported in Galak and Meyvis (2011) appear to meet (and often exceed) the standards of experimental psychology.

Francis argues that changing the aesthetic can be accomplished by changing what results we report. We should report effect sizes and worry less about p-values. I've always been somewhat skeptical of effect size cheerleaders, but Francis put the argument in a new way that (at least for the moment) makes sense to me:

Given that almost every empirical study in psychology currently uses hypothesis testing, it may seem bizarre to claim that effect sizes are more important than the outcome of hypothesis tests... [But] surely, the theory of Galak and Meyvis (2011) does not predict that every experiment testing the theory will reject the null (e.g., even with sample sizes of, say, n = 3). The authors probably meant that if the theory were true, then an experiment with a large enough sample would reject the null hypothesis. But the definition of "large enough sample" is determined by the magnitude of the effect size ... you still have to focus on effect sizes in order to predict the outcome of experiments.

Measuring and reporting effect sizes would highlight when data are "too pretty", making such data actually less attractive, mitigating some of the factors encouraging publication bias. It would also lead to better research. Francis then discusses some ways to easily detect publication bias.

This paper generated a lot of attention and two commentaries were published alongside it. Simonsohn ("It does not follow") notes that Francis had to look at many papers to find any that have evidence of publication bias but does not correct for multiple comparisons. That is, Francis's test for the file-drawer effect must itself take into account the file-drawer effect; you'd want to see that, in a sample of papers, more show evidence of a file-drawer effect than you would have expected by chance. In other words, it is possible that Galak and Mayvis (2011) failed the test through pure random chance alone. This argument is somewhat weakened by Galak and Mayvis's admission ("You could have just asked") that they had indeed excluded several experiments that had not reached significance. However, the larger point stands.

Both Simonsohn and Galak & Mayvis push back against Francis's argument that if a paper has been shown to suffer from the file-drawer effect (unpublished null results), the paper should be ignored altogether. Just because the file-drawer effect has inflated the observed effect size, that doesn't mean the effect size is 0. Simonsohn points out that there are meta-analytic ways to correct for the file-drawer effect. Galak and Mayvis argue that not publishing null results is inconsequential, because having several positive results is simply too unlikely. Whether this is true, they acknowledge, depends on whether false-positive rates are inflated overall (as many of the other authors in this issue argue).

Can meta-analyses save us?

Ferguson and Heene ("A vast graveyard of undead theories") focus on the problems for meta-analyses posed by publication bias. Obviously, meta-analyses of biased data will be biased meta-analyses (junk in - junk out). They take a look at some ways researchers have tried to de-bias meta-analyses, such as including unpublished data. The problem is that the selection of unpublished data itself is often biased, and they show that in some cases this actually makes the problem worse.

They also take on an apparently common argument, which is that if several published studies have shown an effect, you'd need there to be a huge number of studies showing no effect in the proverbial file drawer in order to wash out the significant studies.

Now let us assume we assembled 50 studies in a meta-analysis with a mean [effect size] of 2.0. In this case, the FSN [a statistic indicating the number of null results needed to cancel out the significant ones] would yield a value of 6,854.

However, this statistic makes the incorrect assumption that if all the published studies were false positives, the mean effect size of the remainder would be 0.

In fact, if only 5% of studies that show Type I errors were published, the mean [effect size] of the remaining unpublished studies cannot be zero but must be negative... The FSN turns out to be a gross overestimate...

Despite the math, the best evidence of their claims comes from a plausibility argument:

It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing met-analysis of their own [long list of citations].

Participants as passive receptacles of stimuli

Klein and colleagues ("Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments") write:

Though in general there may be many reasons why a study fails to replicate (e.g., lack of statistical power...), we suggest that the central issue is a widespread failure to acknowledge, recognize, and document the circumstances and social context in which an experiment takes place.

In theory, if aspects of social context (e.g., whether the experimenter was male or female) are relevant for the effect but not documented, that part of the method cannot be replicated. But more generally, the failure to document important aspects of the design may suggest -- to put it bluntly -- incompetent experimenters, in which case a failure to replicate is no surprise. For instance: a significant percentage of studies (published in premier journals) that use deception did not include an "awareness probe" (basically, asking the participant if they were aware of the deception). This is problematic because sometimes the deception is too obvious and participants figure it out, invalidating the results of the study.

Their proposed solutions mostly amount to "do good studies". For instance, if experimenter bias is a concern, use double-blind procedures. If you are worried about demand characteristics, include relevant probes and controls. This is something that should be taught in every intro methods class, but -- according to the authors' research -- are often neglected even in prestigious journals.

Who should be doing replications?

Given that researchers are already over-loaded with responsibilities, if more replications are to be conducted, who is going to conduct them? Frank and Saxe ("Teaching replication") argue that undergraduate and graduate research methods classes should. Students in these classes already have to conduct experiments, but usually they are replications of classic experiments (like Stroop) or novel experiments thought up by the students. Either kind is usually boring: The first kind because the answer is already known, and the second kind because the students don't know enough to think up an interesting, novel experiment that makes contact with cutting-edge research. Replicating recent, cutting-edge studies solves both problems. While they acknowledge that sometimes the experiments aren't done well enough to be informative, around half of the time (in their experience, both having taught such classes), they are.

Grahe and colleagues ("Harnessing the undiscovered resource of student research projects") likewise argue that undergraduates are capable of doing useful research, and given that there are many undergraduates out there taking many methods courses, quite a lot of useful work could be collectively accomplished by them.

How often are replications published in psychology?

Makel, Plucker, and Hegarty ("Replications in psychology research: How often do they really occur?") find -- probably to nobody's surprise -- that replications are rarely published. Looking at all the articles published by 100 psych journals since 1900, just under 2% of the articles even mention "replication" in the text. Based on further investigation, they estimate that only 1% of articles are actual replications, most of which were successful replications. The good news is that the rate has been rising steadily in recent decades. The bad news is that, the vast majority were conceptual replications, not direct replications. Worse news: replications, whether successful or not, were cited on average 17 times, compared with an average of 64.5 times for the articles replicated. (Interestingly, they find that some papers do include failures to conceptually replicate -- despite what most people would expect -- though these are nonetheless fairly rare.)

Incentivizing replication and replicability

Nosek, Spies and Motyl ("Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability") note, as have others, that there isn't much incentive to engage in activities that would improve the replicability rate. In a succinct article, they list a number of strategies that they think are unlikely to word -- conceptual replication, waiting for science to "self-correct", establishing journals devoted to publishing replications or negative results, education campaigns, increasing expectations of reviewers, and raising the barrier for publication. The reasons why are straight-forward and can be found in the article.

They also offer several strategies they think will work. For instance, they recommend simple check-lists, which "ensure disclosure of obvious items that are sometimes forgotten: sample sizes, statistical tests, effect sizes, and covariates." While this might seem similar to their discarded strategy of "raising expectations" and "educational campaigns", they argue that checklists are an exception, and point to research showing their efficacy (as opposed to raised expectations and educational campaigns) in medicine.

Other proposals include establishing metrics to identify what is worth replicating, thereby making replications of these papers seem important enough (to journal editors) to publish, crowd sourcing replication efforts, and establishing journals (like PLoS ONE) with review standards focused on the soundness, not importance, of the research. While this latter approach may seem like a journal signing its own death warrant, they point to data suggesting that journals aren't actually very good at figuring out what papers will or won't be cited in the future. They also note that making data, materials, and the workflow itself more open and transparent will make it easier to re-use data and materials, with additional replications as an inevitable, positive side effect.

Koole and Lakens ("Rewarding replications") focus more squarely on the lack of incentive to conduct replications because they are rarely published and rarely cited when published. Their proposed solution is that when journals publish an original finding, they agree to publish future replications as an online supplement. They then could create a meta-reference that refers to the original paper and subsequent replications, which researchers could then cite, increasing the citations received by the replications (since they are cited automatically). This has the nice advantage of working within the existing infrastructure (always easier than starting from scratch). The only remaining issue is how to convince the existing infrastructure (i.e., the journals) to participate, given that it will cost them a lot and there isn't any clear advantage to participation (some would argue that it's actually against the journal's interest to participate). There is also the first-mover disadvantage: the system only becomes particularly useful once many journals participate.

Confirmatory research

Wagenmakers, Wetzels, Borsboon, van der Maas and Kievit ("An agenda for purely confirmatory research") argue that researchers should be required to pre-register their methods and intended analyses prior to running a study. The idea is to make it clear which results are and are not exploratory. The suggestion is that exploratory analyses are often circular (you look at the data, develop a hypothesis, and then test the hypothesis on those data), which increases the likelihood of false positives.

Pre-registration obviously requires extra work, both on the part of the scientists and on the part of the people who maintain the registry, so one can reasonably ask whether the gain is worth the cost. First, we need to know if the problem itself is common (no point in investing in this infrastructure in order to combat a non-existant problem). Wagenmakers and colleagues argue that it is, but unconvincingly:

Almost no psychological research is conducted in a purely confirmatory fashion (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001).

As far as I can tell, Wagenmakers et al., (2011) speculates that confirmatory research is rare, but offers no data (that I noticed when skimming). Kerr (1998) did conduct a survey of 156 social psychologists, clinical/community psychologists, and sociologists, finding that confirmatory research was reported as occurring most frequently, though not significantly more common than some forms of circular research; even if we assume that these results generalize to all of psychology (not just social and clinical), it is is a far cry from "almost nothing".

But let's even suppose for the moment that non-confirmatory (that is, exploratory) research is very common. Is that a problem? As I already mentioned, Wagenmakers and colleagues are worried about circular analyses. What they suggest is tracking exploratory analyses, which are not necessarily the same thing: Suppose I run a study, and in review, a reviewer asks, "Was there a difference between the male and female participants?" (This is not a made-up example). Well, I hadn't ever thought about that question, so I never checked. Now I check, and either there is or there isn't. This is not a circular analysis; retroactively deciding this is an interesting question should not affect the results (unless you believe in ESP, which Wagenmarkers and colleagues clearly do not).

In the absence of additional information, then, the fact that an analysis was exploratory tells you little about whether it is circular, which was the real issue. Whether an analysis was exploratory or planned only suggests that it is circular or not if large percentages of exploratory analyses are circular. I don't know; there doesn't appear to be any data. But even if there were, that would just tell you likelihood -- it wouldn't settle the issue for any particular case.

Enough smart people think that pre-registration would be a good idea that I'm willing to suppose that it is. I just don't see anything in this particular article to suggest that the price is worth the benefit. One last thought: Wagenmakers and colleagues argue that researchers don't fully appreciate the distinction between planned and circular analyses. If so, would marking certain analyses as planned and certain analyses as circular help?

What should we do?

The 1,292 psychologists surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") were also asked about specific suggested reforms (proposed in a recent paper, not by them). I summarize the results below:

Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article:
should be good practice: 60%
should be condition of publication: 46%

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification:
should be good practice: 47%
should be condition of publication: 30% [presumably none of these were psychophysicists]

Authors must list all variables collected in a study.
should be good practice: 58%
should be condition of publication: 46%

Authors must report all experimental conditions, including failed manipulations:
should be good practice: 80%
should be condition of publication: 66%

If observations are eliminated, authors must also report what the statistical results are if those observations are included:
should be good practice: 69%
should be condition of publication: 52%

If analysis includes a covariate, authors must report the statistical results of the analysis without the covariate:
should be good practice: 67%
should be condition of publication: 50%

The survey also asked about four suggested guidelines; for space reasons, I have left them out. You can find them in the paper.

Nobody asked my opinion, but here it is anyway. I think having and reporting a termination rule is relatively costless and helpful in certain situations, so I support it as good practice and wouldn't mind if it were a condition of publication. The question about 20 observations per cell ... do we mean 20 items or 20 subjects or both? I suspect the authors were thinking of studies with a single item per condition per subject (which are common in some areas of psychology but essentially non-existant in others; note that Fuchs, Jenny & Fiedler also note the problem of finding one set of standards that is appropriate to all subdisciplines of psychology), so it's hard to know how this is meant to be applied. In any case, I would prefer a standard based on expected power, which is more important (though harder to define).

I strongly support listing all variables collected and all experimental conditions, though it quickly runs up quickly against the word limits in many high-impact journals (which I would be happy to do away with; to paraphrase Einstein, papers should be as short as possible, but not shorter). One issue: what counts as an experimental condition is not always clear, particularly if you are doing between-subject manipulations (is this data set a "condition" or a separate experiment?). So I don't know how to enforce these as conditions of publication.

As far as covariates, certain kinds of measures make no sense without the covariate, so in some cases you'd be adding irrelevant, misleading information. Leaving this up to reviewers (the current system) seems perfectly reasonable to me.

Estimating replication rates in psychology

Posted by GamesWithWords on Sunday, September 02, 2012

The Open Science Collaboration's interim report, which will come out shortly in Perspectives in Psychological Science, is available. We nearly pulled off the physics trick of having a paper where the author list is longer than the paper itself. I think there are nearly 70 of us (if you scroll down, you'll find me in the H's).

The abstract says it all:

Reproducibility is a defining feature of science. However, because of strong incentives for innovation and weak incentives for confirmation, direct replication is rarely practiced or published. The Reproducibility Project is an open, large-scale, collaborative effort to systematically examine the rate and predictors of reproducibility in psychological science. So far, 72 volunteer researchers from 41 institutions have organized to openly and transparently replicate studies published in three prominent psychological journals from 2008. Multiple methods will be used to evaluate the findings, calculate an empirical rate of replication, and investigate factors that predict reproducibility. Whatever the result, a better understanding of reproducibility will ultimately improve confidence in scientific methodology and findings.

If you are interested in participating, there is still time. Go to the website for more information.

Making up your data

Posted by GamesWithWords on Saturday, July 21, 2012

Having finished reading the Simonsohn paper on detecting fraud, I have come to two conclusions:

1. Making up high-quality data is really hard. Part of the problem with making up data is that you have to introduce some randomness into it. If your study involves asking people how much they are willing to pay for a black t-shirt, you can't just write down that they all were willing to pay the average (say $12). You have to write down some variation ($12, $14, $7, $9, etc.).

The problem is that humans are notoriously bad at generating random number sequences. Simonsohn discusses this in terms of Tversky and Kahneman's famous, tongue-in-cheek paper "Belief in the law of small numbers." People think that random sequences should look roughly "average", even if the sample is small: Flip a coin 4 times, you should get 2 heads and 2 tails, when in fact getting 4 heads isn't all that improbable.

So your best bet, if you are making up data, is to use a computer program to generate it from your favorite distribution (the normal distribution would be a good choice in most cases). The problem is that data can have funny idiosyncrasies. One of the problems with the string of numbers I suggested above ($12, $14, $7, $9, etc.) is that humans like round numbers. So when people say what they are willing to pay for a t-shirt, what you should see is a lot of $10s, $20s and maybe some $5s and $15s. The numbers in my list are relatively unlikely.

The paper goes on to describe other problems as well. What I get from this is that making up data in a way that is undetectable is a lot of work, and you might as well actually run the study. So even leaving aside other reasons you might want to not commit fraud (ethics, desire for / belief in importance of knowledge, etc.), it seems sheer laziness alone should steer you the other direction.

2. The Dark Knight Rises is awesome. Seriously. Technically there was nothing about that in the paper, but I was thinking about the movie while reading the paper. Since I saw the show this morning, it's been hard to think of anything else. The most negative thing I can say about it is that it wasn't better than the last one, which is grading on a pretty steep curve.

The Psychologist on Replication

Posted by GamesWithWords on Thursday, April 19, 2012

The Psychologist solicited opinions on the importance of replication from a number of researchers, including yours truly. See a preview here.

Results: Replication in Psychology

Posted by GamesWithWords on Wednesday, March 07, 2012

My paper with Adena Schachner on replication in psychology is now published. The paper contains 3 main sections: a reasonably thorough literature review on replication rates in psychology, a proposal as to how to improve replication rates (primarily through tracking replication rates), and the results of a survey of psychologists on replication practices (many thanks to all who participated). The results of the survey was that while not nearly enough replications are attempted, there are actually more being attempted than we had guessed (or than many of our colleagues that we discussed this project with had guessed).
This paper is part of a larger collection of papers on reimagining the publication and review process, and is more of those papers are printed, I plan to discuss at least some of them.

New editor at Cognition (eventually)

Posted by GamesWithWords on Sunday, May 15, 2011

There are no doubt many psychologists who don't count Cognition as their favorite journal. I just don't happen to know very many of them. Whenever the topic of favorite journal comes up, Cognition it is. One would think that would argue in favor of continuity; whatever they're doing is working.

That's not apparently how the for-profit publishers of Cognition (Elsevier) feel, as they've decided to find a new editor, apparently without consulting anyone in the field about it. I hope they know what they are doing.

Survey on Replication

Posted by GamesWithWords on Saturday, April 23, 2011

Are you a researcher working in psychology or related domains (neuroscience, linguistics, etc.)? A colleague and I are conducting a survey on replication in these fields, for inclusion in an upcoming special issue of Frontiers in Computational Neuroscience. You can fill out the survey here.

Photo credit here.

Field of Science