Field of Science

Eighteen papers on replication

Last month, Perspectives on Psychological Science published a special section on replicability in psychological science. With 18 papers, it took me a while to get through all of them, so I am only blogging about them now.

The issue contains many useful articles, particularly if you have not been following the replicability crisis/discussion carefully. I have been following the discussion pretty closely, and I still found a lot worth reading (a particularly enjoyable surprise was Giner-Sorolla's historical review of replicability problems, which is absolutely fascinating).

So I hope you check the issue out. It is currently open access. But if you'd rather have someone else read it first and summarize the main points ... then lucky for you, I've done just that. Below, I describe and discuss all 18 papers, though it should be said that these descriptions are through the lens of what I happen to find particularly interesting or novel. There are likely many other things in these papers that would be of interest to you. So check them out.

Bookends

The issue is bookended by two overviews. Harold Pashler and Eric-Jan Wagenmakers get the issue started with a brief review of recent scandals, failures to replicate, and other reports -- both in psychology and in other fields -- that replicability may be an issue. John Ioannidis, Legend of Replicability, has a contribution at the end of the issue ("Why science is not necessarily self-correcting") which similarly summarizes many of the main arguments of the issue. If you haven't been following along, these are good primers. Neuroskeptic ("The nine circles of scientific hell") makes many of the same arguments, but in the satirical style of Dante Alighieri (I only wish it was in rhyme).

Is there a replicability crisis?

Pashler and Harris ("Is the replicability crisis overblown?") address three arguments sometimes given that "the replicability crisis is overblown". The first is that the low alpha level (5%) ensures that, of published results, most are right. They show that this is most likely incorrect, given current publishing practices. The second is that even if not many direct replications are published, many "conceptual" replications are. They argue that the notion of "conceptual replication" is so fuzzy that it actually compounds the problem (one problem: there is no such thing as a failure to conceptually replicate):
We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling "pathological science" cases that embarrassed the natural sciences... What many observers found peculiar in these cases [like cold fusion] was that it took many years for a complete consensus to emerge that the phenomena lacked any reality... It appears that many exact replication attempts of the initial [cold fusion studies] were undertaken soon after the first dramatic reports of cold fusion. Such attempts produced generally negative results (Taubes, 1993). However, what kept faith in cold fusion alive for some time ... was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications).
Finally, they take issue with the claim that over the long run, science is self-correcting, given the lack of evidence that this is true (as far as I can tell, it's not even clear how you would show that science is self-correcting without rigorous, systematic studies of replicability).

While probably not everyone agrees with all the specifics of Pashler and Harris's argument, there does appear to be general consensus that there is a real replicability problem. 1,292 psychologists around the world surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") believed, on average, that 53% of the results in psychology could be replicated in direct replications. (One word of caution: it may be that psychologists most concerned about replication were also disproportionately likely to agree to participate in the survey; still, 1300 psychologists is a significant fraction of the field.)

What's the best way of getting a significant result?

Ideally, the answer would be "test for a real effect", but that of course requires that you know whether your hypothesis is true or not, which presumably you don't (otherwise, why run the study?). Bakker, van Dijk and Wicherts ("The rules of the game called psychological science") show that, as a general rule, you are better off running a few underpowered studies than one large, sufficiently powered study (particularly if any real effect sizes are small). That is, the more likely it is that your results aren't true, the more likely you are to have a publishable finding. This is bad news when coupled with the fact that humans are good at finding efficient strategies. (Note that this doesn't require bad actors: One can simply find that certain procedures -- like testing around X subjects -- have generally produced "good", publishable research in the past.)
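To see why this works, here is a minimal simulation sketch (my own illustration, not Bakker, van Dijk, and Wicherts's actual analysis). It assumes a researcher with a fixed pool of participants who either runs one large study or splits the pool into five small ones, and who counts a "win" as getting at least one p < .05. All of the specific numbers are assumptions chosen for illustration.

```python
# A rough simulation of the incentive problem (my sketch, not the paper's model).
# Fixed budget of 200 participants per group in total, spent either on one
# two-group study of 100 per group or on five small studies of 20 per group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def p_win(d, n_per_group, n_studies, sims=5000):
    """Probability of getting at least one p < .05 across n_studies t-tests."""
    wins = 0
    for _ in range(sims):
        for _ in range(n_studies):
            a = rng.normal(d, 1, n_per_group)
            b = rng.normal(0, 1, n_per_group)
            if stats.ttest_ind(a, b).pvalue < .05:
                wins += 1
                break
    return wins / sims

for d in (0.2, 0.0):  # a small real effect, then no effect at all
    small = p_win(d, n_per_group=20, n_studies=5)
    large = p_win(d, n_per_group=100, n_studies=1)
    print(f"d = {d}: P(publishable result) five small studies = {small:.2f}, "
          f"one large study = {large:.2f}")
```

With a small true effect, the five-small-studies strategy wins modestly; with no true effect at all, it wins by a lot -- which is exactly the perverse incentive described above.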

The bottleneck

Giner-Sorolla ("Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science") argues that a certain amount of bad science is expected when there are too few publication outlets:
Articles cannot pass through just by showing theoretical meaning and methodological rigor; their results must appear to support the hypothesis perfectly... Imagine that two colleagues in psychology each show you a manuscript. One has two studies with roughly equal numbers of participants. Both studies support the hypothesis, each with a significant key result at p=.04. The other paper has three studies, also supporting the hypothesis, but the last two studies' individual results are only near significant: p=.02, p=.07, and that most annoying figure, p=.11.
Although, statistically, the second paper presents stronger evidence, it should be obvious that the first one is more likely to be published.

There are two ways of having perfect-looking results, and unfortunately one is a lot easier than the other. Again, this doesn't require bad actors -- just competent ones. Giner-Sorolla's solution is not to make publication easier -- various reasons are given for why that is unlikely to work -- but rather to develop an alternative method for distributing less "pretty" results to those who might be interested (such as those doing meta-analysis). 

This summary does not do the article justice; among other things, it also includes a fascinating history of hypothesis testing, including a previous crisis of confidence in psychology in the 1970s.

Too good to be true

Francis ("The psychology of replication and the replication of psychology") takes the point a bit further:  our love of "pretty" data has led us to expect data that are too good to be true. Most psychology studies have low power to detect an effect; thus, presenting several studies in a single paper all of which show the same significant effect is highly suspicious. He goes through eight experiments recently published in Galak & Meyvis (2011), seven of which show significant results despite their apparent power ranging from about 0.25 to 0.8 (that is, between a 25% and 80% chance of detecting the effect). The likelihood of this happening even if the hypothesis was true was about 0.08. (This is actually an underestimate: the one "non-signifcant" experiment had a p-value of 0.056, which many authors would count as significant -- which, again, makes the reported findings even less likely.)

Francis goes out of his way to say that this does not necessarily represent malicious intent:
Publication bias could occur in a set of experiments because authors closely follow the standards of the field. Indeed, the experiments reported in Galak and Meyvis (2011) appear to meet (and often exceed) the standards of experimental psychology.
Francis argues that changing the aesthetic can be accomplished by changing what results we report. We should report effect sizes and worry less about p-values. I've always been somewhat skeptical of effect size cheerleaders, but Francis put the argument in a new way that (at least for the moment) makes sense to me:
Given that almost every empirical study in psychology currently uses hypothesis testing, it may seem bizarre to claim that effect sizes are more important than the outcome of hypothesis tests... [But] surely, the theory of Galak and Meyvis (2011) does not predict that every experiment testing the theory will reject the null (e.g., even with sample sizes of, say, n = 3). The authors probably meant that if the theory were true, then an experiment with a large enough sample would reject the null hypothesis. But the definition of "large enough sample" is determined by the magnitude of the effect size ... you still have to focus on effect sizes in order to predict the outcome of experiments.
Measuring and reporting effect sizes would highlight when data are "too pretty", making such data actually less attractive, mitigating some of the factors encouraging publication bias. It would also lead to better research. Francis then discusses some ways to easily detect publication bias.

This paper generated a lot of attention, and two commentaries were published alongside it. Simonsohn ("It does not follow") notes that Francis had to look at many papers to find any that show evidence of publication bias, but does not correct for these multiple comparisons. That is, Francis's test for the file-drawer effect must itself take into account the file-drawer effect; you'd want to see that, in a sample of papers, more show evidence of a file-drawer effect than you would have expected by chance. In other words, it is possible that Galak and Meyvis (2011) failed the test through random chance alone. This argument is somewhat weakened by Galak and Meyvis's admission ("You could have just asked") that they had indeed excluded several experiments that had not reached significance. However, the larger point stands.
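Simonsohn's point is easiest to see with some made-up numbers: even a well-behaved "too good to be true" test will flag some perfectly honest papers if you run it over enough of them. The false-alarm rate and number of papers below are hypothetical.

```python
# Sketch of the multiple-comparisons worry (all numbers hypothetical).
alpha = 0.10    # assumed false-alarm rate of the excess-significance test per paper
n_papers = 20   # assumed number of papers screened before one gets flagged
p_at_least_one_flag = 1 - (1 - alpha) ** n_papers
print(f"P(at least one honest paper gets flagged) = {p_at_least_one_flag:.2f}")  # ~0.88
# So a single flagged paper, found by screening many, is weak evidence of
# publication bias in that particular paper; you would want to show that more
# papers fail the test than chance predicts.
```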

Both Simonsohn and Galak & Meyvis push back against Francis's argument that if a paper has been shown to suffer from the file-drawer effect (unpublished null results), the paper should be ignored altogether. Just because the file-drawer effect has inflated the observed effect size, that doesn't mean the true effect size is 0. Simonsohn points out that there are meta-analytic ways to correct for the file-drawer effect. Galak and Meyvis argue that not publishing their null results was inconsequential, because obtaining that many positive results by chance alone is simply too unlikely. Whether this is true, they acknowledge, depends on whether false-positive rates are inflated overall (as many of the other authors in this issue argue).

Can meta-analyses save us?

Ferguson and Heene ("A vast graveyard of undead theories") focus on the problems for meta-analyses posed by publication bias. Obviously, meta-analyses of biased data will be biased meta-analyses (junk in - junk out). They take a look at some ways researchers have tried to de-bias meta-analyses, such as including unpublished data. The problem is that the selection of unpublished data itself is often biased, and they show that in some cases this actually makes the problem worse.

They also take on an apparently common argument, which is that if several published studies have shown an effect, you'd need there to be a huge number of studies showing no effect in the proverbial file drawer in order to wash out the significant studies.
Now let us assume we assembled 50 studies in a meta-analysis with a mean [effect size] of 2.0. In this case, the FSN [a statistic indicating the number of null results needed to cancel out the significant ones] would yield a value of 6,854.
However, this statistic makes the incorrect assumption that if all the published studies were false positives, the mean effect size of the remainder would be 0.
In fact, if only 5% of studies that show Type I errors were published, the mean [effect size] of the remaining unpublished studies cannot be zero but must be negative... The FSN turns out to be a gross overestimate...
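For concreteness, here is a sketch of the classic Rosenthal (1979) fail-safe N. I am not certain this is the exact variant Ferguson and Heene compute (so do not expect it to reproduce their 6,854), but it makes the assumption they criticize explicit: the unpublished studies are treated as averaging an effect of exactly zero.

```python
# Classic Rosenthal fail-safe N (a sketch; possibly not the exact variant
# used by Ferguson and Heene). It asks how many unpublished studies with an
# average z of zero would be needed to drag the combined (Stouffer) z-score
# below the one-tailed .05 criterion.

def failsafe_n(z_scores, z_alpha=1.645):
    k = len(z_scores)
    return (sum(z_scores) ** 2) / (z_alpha ** 2) - k

# 50 published studies, each with z = 2.0 (illustrative values only)
print(round(failsafe_n([2.0] * 50)))  # several thousand "missing" studies
# The catch, per Ferguson and Heene: if only significant results get
# published, the file drawer is full of studies whose average effect is
# negative, not zero, so far fewer of them would be needed to cancel the
# published ones. The fail-safe N is therefore a gross overestimate.
```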
Despite the math, the best evidence for their claim comes from a plausibility argument:
It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing meta-analysis of their own [long list of citations].
Participants as passive receptacles of stimuli

Klein and colleagues ("Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments") write:
Though in general there may be many reasons why a study fails to replicate (e.g., lack of statistical power...), we suggest that the central issue is a widespread failure to acknowledge, recognize, and document the circumstances and social context in which an experiment takes place.
In theory, if aspects of social context (e.g., whether the experimenter was male or female) are relevant for the effect but not documented, that part of the method cannot be replicated. But more generally, the failure to document important aspects of the design may suggest -- to put it bluntly -- incompetent experimenters, in which case a failure to replicate is no surprise. For instance, a significant percentage of studies published in premier journals that used deception did not include an "awareness probe" (basically, asking participants whether they were aware of the deception). This is problematic because sometimes the deception is too obvious and participants figure it out, invalidating the results of the study.

Their proposed solutions mostly amount to "do good studies". For instance, if experimenter bias is a concern, use double-blind procedures. If you are worried about demand characteristics, include relevant probes and controls. This is something that should be taught in every intro methods class but -- according to the authors' research -- is often neglected even in work published in prestigious journals.

Who should be doing replications?

Given that researchers are already overloaded with responsibilities, if more replications are to be conducted, who is going to conduct them? Frank and Saxe ("Teaching replication") argue that undergraduate and graduate research methods classes should. Students in these classes already have to conduct experiments, but typically they are replications of classic experiments (like Stroop) or novel experiments thought up by the students. Either kind is usually boring: The first kind because the answer is already known, and the second kind because the students don't know enough to think up an interesting, novel experiment that makes contact with cutting-edge research. Replicating recent, cutting-edge studies solves both problems. While they acknowledge that sometimes the experiments aren't done well enough to be informative, around half of the time (in their experience, both having taught such classes), they are.

Grahe and colleagues ("Harnessing the undiscovered resource of student research projects") likewise argue that undergraduates are capable of doing useful research, and given that there are many undergraduates out there taking many methods courses, quite a lot of useful work could be collectively accomplished by them.


How often are replications published in psychology?

Makel, Plucker, and Hegarty ("Replications in psychology research: How often do they really occur?") find -- probably to nobody's surprise -- that replications are rarely published. Looking at all the articles published by 100 psychology journals since 1900, just under 2% even mention "replication" in the text. Based on further investigation, they estimate that only 1% of articles are actual replications, most of which were successful. The good news is that the rate has been rising steadily in recent decades. The bad news is that the vast majority were conceptual replications, not direct replications. Worse news: replications, whether successful or not, were cited on average 17 times, compared with an average of 64.5 times for the articles they replicated. (Interestingly, they find that some papers do include failures to conceptually replicate -- despite what most people would expect -- though these are nonetheless fairly rare.)

Incentivizing replication and replicability

Nosek, Spies, and Motyl ("Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability") note, as have others, that there isn't much incentive to engage in activities that would improve the replicability rate. In a succinct article, they list a number of strategies that they think are unlikely to work -- conceptual replication, waiting for science to "self-correct", establishing journals devoted to publishing replications or negative results, education campaigns, increasing expectations of reviewers, and raising the barrier for publication. The reasons why are straightforward and can be found in the article.

They also offer several strategies they think will work. For instance, they recommend simple checklists, which "ensure disclosure of obvious items that are sometimes forgotten: sample sizes, statistical tests, effect sizes, and covariates." While this might seem similar to the discarded strategies of raising expectations and running education campaigns, they argue that checklists are an exception, pointing to research in medicine showing that checklists work where raised expectations and education campaigns do not.

Other proposals include establishing metrics to identify what is worth replicating, thereby making replications of these papers seem important enough (to journal editors) to publish, crowdsourcing replication efforts, and establishing journals (like PLoS ONE) with review standards focused on the soundness, not importance, of the research. While this latter approach may seem like a journal signing its own death warrant, they point to data suggesting that journals aren't actually very good at figuring out which papers will or won't be cited in the future. They also note that making data, materials, and the workflow itself more open and transparent will make it easier to re-use data and materials, with additional replications as an inevitable, positive side effect.

Koole and Lakens ("Rewarding replications") focus more squarely on the lack of incentive to conduct replications because they are rarely published and rarely cited when published. Their proposed solution is that when journals publish an original finding, they agree to publish future replications as an online supplement. They then could create a meta-reference that refers to the original paper and subsequent replications, which researchers could then cite, increasing the citations received by the replications (since they are cited automatically).  This has the nice advantage of working within the existing infrastructure (always easier than starting from scratch). The only remaining issue is how to convince the existing infrastructure (i.e., the journals) to participate, given that it will cost them a lot and there isn't any clear advantage to participation (some would argue that it's actually against the journal's interest to participate). There is also the first-mover disadvantage: the system only becomes particularly useful once many journals participate.

Confirmatory research

Wagenmakers, Wetzels, Borsboom, van der Maas and Kievit ("An agenda for purely confirmatory research") argue that researchers should be required to pre-register their methods and intended analyses prior to running a study. The idea is to make it clear which results are and are not exploratory. The suggestion is that exploratory analyses are often circular (you look at the data, develop a hypothesis, and then test the hypothesis on those data), which increases the likelihood of false positives.

Pre-registration obviously requires extra work, both on the part of the scientists and on the part of the people who maintain the registry, so one can reasonably ask whether the gain is worth the cost. First, we need to know if the problem itself is common (no point in investing in this infrastructure in order to combat a nonexistent problem). Wagenmakers and colleagues argue that it is, but unconvincingly:
Almost no psychological research is conducted in a purely confirmatory fashion (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001).
As far as I can tell, Wagenmakers et al. (2011) speculate that confirmatory research is rare but offer no data (at least none that I noticed when skimming). Kerr (1998) did conduct a survey of 156 social psychologists, clinical/community psychologists, and sociologists, finding that confirmatory research was reported as occurring most frequently, though not significantly more often than some forms of circular research. Even if we assume that these results generalize to all of psychology (not just social and clinical), that is a far cry from "almost no" confirmatory research.

But let's even suppose for the moment that non-confirmatory (that is, exploratory) research is very common. Is that a problem? As I already mentioned, Wagenmakers and colleagues are worried about circular analyses. What they suggest is tracking exploratory analyses, which are not necessarily the same thing: Suppose I run a study, and in review, a reviewer asks, "Was there a difference between the male and female participants?" (This is not a made-up example.) Well, I hadn't ever thought about that question, so I never checked. Now I check, and either there is or there isn't. This is not a circular analysis; retroactively deciding this is an interesting question should not affect the results (unless you believe in ESP, which Wagenmakers and colleagues clearly do not).

In the absence of additional information, then, the fact that an analysis was exploratory tells you little about whether it is circular, which is the real issue. Knowing that an analysis was exploratory rather than planned is informative about circularity only if a large percentage of exploratory analyses are in fact circular. I don't know whether they are; there doesn't appear to be any data. But even if there were, that would just tell you the likelihood -- it wouldn't settle the issue for any particular case.

Enough smart people think that pre-registration would be a good idea that I'm willing to suppose that it is. I just don't see anything in this particular article to suggest that the price is worth the benefit. One last thought: Wagenmakers and colleagues argue that researchers don't fully appreciate the distinction between planned and circular analyses. If so, would marking certain analyses as planned and certain analyses as circular help?

What should we do? 

The 1,292 psychologists surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") were also asked about specific suggested reforms (proposed in a recent paper, not by them). I summarize the results below:

Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article:
should be good practice: 60%
should be condition of publication: 46%

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification:
should be good practice: 47%
should be condition of publication: 30%  [presumably none of these were psychophysicists]

Authors must list all variables collected in a study:
should be good practice: 58%
should be condition of publication: 46%

Authors must report all experimental conditions, including failed manipulations:
should be good practice: 80%
should be condition of publication: 66%

If observations are eliminated, authors must also report what the statistical results are if those observations are included:
should be good practice: 69%
should be condition of publication: 52%

If analysis includes a covariate, authors must report the statistical results of the analysis without the covariate:
should be good practice: 67%
should be condition of publication: 50%

The survey also asked about four suggested guidelines; for space reasons, I have left them out. You can find them in the paper.

Nobody asked my opinion, but here it is anyway. I think having and reporting a termination rule is relatively costless and helpful in certain situations, so I support it as good practice and wouldn't mind if it were a condition of publication. The question about 20 observations per cell ... do we mean 20 items or 20 subjects or both? I suspect the authors were thinking of studies with a single item per condition per subject (which are common in some areas of psychology but essentially nonexistent in others; Fuchs, Jenny, and Fiedler themselves note the difficulty of finding one set of standards appropriate to all subdisciplines of psychology), so it's hard to know how this rule is meant to be applied. In any case, I would prefer a standard based on expected power, which is more important (though harder to define).
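To make the power-based alternative concrete, here is a small sketch using statsmodels' power calculator; the effect sizes are Cohen's conventional benchmarks, used purely for illustration.

```python
# A sketch of what a power-based standard could look like in practice,
# instead of a flat "20 observations per cell" rule.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # small, medium, large effects (Cohen's benchmarks)
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                             alternative='two-sided')
    print(f"d = {d}: ~{n:.0f} participants per group for 80% power")
# d = 0.2 needs roughly 393 per group, d = 0.5 about 64, d = 0.8 about 26 --
# which is why a one-size-fits-all "20 per cell" rule can be either far too
# lax or needlessly strict, depending on the expected effect size.
```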

I strongly support listing all variables collected and all experimental conditions, though this quickly runs up against the word limits in many high-impact journals (which I would be happy to do away with; to paraphrase Einstein, papers should be as short as possible, but not shorter). One issue: what counts as an experimental condition is not always clear, particularly if you are doing between-subject manipulations (is this data set a "condition" or a separate experiment?). So I don't know how to enforce these as conditions of publication.

As far as covariates, certain kinds of measures make no sense without the covariate, so in some cases you'd be adding irrelevant, misleading information. Leaving this up to reviewers (the current system) seems perfectly reasonable to me.

In praise of experiments

Today, the excellent Neuroskeptic writes about a new study investigating which US states are most suicidal. The interesting twist was the form of the data: Google searches. It's an interesting study and an interesting use of Google searches, but what struck me was Neuroskeptic's closing thoughts.
Over the past couple of years there's been a flurry of studies based on analyzing Google and Twitter trends. What's interesting to me is that we're really in the early days of this, when you think about likely future technologies. What will happen when everyone's wearing a computer 24/7 that records their every word and move, and even what they see? 
Eventually, psychology and sociology might evolve (or degenerate) into no more than the analysis of such data...
It's always dangerous to predict the future, but here's my prediction: Not a chance. It gets down to a distinction between observational studies and experiments. Observational studies (where you record what happens in the course of normal events) are useful, particularly when you care about questions like what is the state of the world? They are much less useful when you want to know why is the world the way it is?

There are a couple reasons.

Reason #1: The correlation fallacy

First, observational studies are really about studying correlations. To have much power to analyze interesting correlations, you need a lot of data. This is what makes Google and Twitter powerful: they provide a lot of data. But correlation, famously, doesn't always tell you much about causation.

For instance, it is now well-known that you can use the number of pirates active in the world's oceans and seas to reasonably predict average global temperature (there's a strong correlation):



I did not know until recently that Google search data has now definitively shown a correlation between the amount of movie piracy and global warming as well:

In the case of real pirates vs. the temperature, the correlation runs the other way (temperature affects weather affects seafaring activities). I have no idea what causes the correlation between searches for free movies and searches about global warming; perhaps some third factor. To give another silly example, there is a lot more traffic on the roads during daylight than at night, but this isn't because cars are solar-powered!
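If you want to convince yourself how easily this happens, here is a minimal sketch: two synthetic time series that share nothing but a trend over time end up strongly correlated. The data are made up purely for illustration.

```python
# Two unrelated series that both trend over time will correlate strongly.
# The numbers below are synthetic, generated purely for illustration.
import numpy as np

rng = np.random.default_rng(1)
years = np.arange(1950, 2021)
temperature = 0.02 * (years - 1950) + rng.normal(0, 0.1, len(years))      # slow rise
pirates = 5000 - 60 * (years - 1950) + rng.normal(0, 500, len(years))     # steady decline

r = np.corrcoef(temperature, pirates)[0, 1]
print(f"correlation between 'temperature' and 'pirates': r = {r:.2f}")  # strongly negative
# By construction, neither series has anything to do with the other; they
# just both happen to change over time. Observational data are full of
# correlations like this.
```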

The point is that experiments don't have this problem: you go out and manipulate the world to see what happens. Change the number of pirates and see if global temperatures change. Nobody has tried this (to my knowledge), but I'm willing to bet it won't work.

(Of course, there are natural experiments, which are a hybrid of observational studies and experiments: the experimenter doesn't manipulate the world herself but rather waits until somebody else, in the course of normal events, does it for her. A good example is comparing head-injury statistics across states as they adopt bicycle helmet laws at different times. These studies are rarely as well-controlled as an actual experiment but have the advantage of ecological validity.)

Reason #2: Life's too short

The second is that observational studies are limited by what actually happens in the world. You won't, from an observational study, find out what the effect on US politics would be if every US senator took up crack while every US representative took up meth. (I hope not, anyway.)

That was an absurd example, but the problem is real. Language gives lots of great examples. Suppose you want to find out what sentences in any given language are grammatical and what sentences are not. You could do an observational study and see what sentences people say. Those are grammatical; sentences you haven't heard probably aren't.

The problem with this is that people are boring and repetitive. A small number of words (heck, a small number of sentence fragments) accounts for most of what people say and write. The vast majority of grammatical sentences will never appear in your observational sample no matter how long you wait, because there are actually an infinite number of grammatical English sentences. (In his impressive "Who's afraid of George Kingsley Zipf?", Charles Yang shows how a number of prominent language researchers went astray by paying too much attention to this kind of observational study.)
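Yang's point about Zipf's law is easy to demonstrate with simulated data: draw word tokens from a Zipf-like distribution over a large vocabulary, and a handful of word types ends up covering a huge share of everything "said," while plenty of perfectly ordinary types never show up at all. The vocabulary size, exponent, and sample size below are assumptions chosen for illustration, not estimates from any real corpus.

```python
# Simulated illustration of the Zipfian bottleneck in observational samples.
# Vocabulary size, exponent, and sample size are assumed, not corpus estimates.
import numpy as np

rng = np.random.default_rng(2)
V = 50_000                                   # assumed vocabulary size
ranks = np.arange(1, V + 1)
probs = (1.0 / ranks) / np.sum(1.0 / ranks)  # Zipf's law: p(word) proportional to 1/rank
tokens = rng.choice(ranks, size=1_000_000, p=probs)

counts = np.bincount(tokens, minlength=V + 1)[1:]   # frequency of each word type
top100_share = np.sort(counts)[::-1][:100].sum() / counts.sum()
never_seen = int(np.sum(counts == 0))
print(f"share of the million tokens covered by the top 100 types: {top100_share:.0%}")
print(f"word types never observed even once: {never_seen} of {V}")
```

Even with a million tokens, a large share of the vocabulary shows up only a handful of times (and some of it not at all), which is why rare-but-grammatical sentences are essentially invisible to purely observational methods.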

The basic feature of the problem is that for building theories -- explaining why things are the way they are -- very often what you care about are the border cases. Human behavior is largely repetitive, and the  border cases are quite rare. Experiments turn this around: by deliberately choosing the situations we put our participants in, we can focus on the informative test cases.

The experimental method: Here to stay

None of this should be taken as meaning that I don't think observational studies are useful. I conduct them myself. A prerequisite to asking the question Why are things the way they are is knowing, in fact, what way things are. There is also the question of ecological validity. When we conduct laboratory experiments, we construct artificial situations and then try to generalize the results to real life. It's good to know something about real life in order to inform those generalizations.

But just as I can't imagine observational studies disappearing, I can't imagine them replacing experimentation, either.



Lab Notebook: You know you are writing a paper when

Your "recently added" list in Mendeley is growing at an exponential rate:

(Screenshot; note the time added.)

Every new paper you read results in downloading at least two more (not unlike the Hounds of Resurrection. Coincidence? I think not).


I don't think I've ever actually finished my reading list for a paper. At some point, I shut down the process before it overwhelms my hard drive.

Findings: That Kind of Person

That Kind of Person is now complete. Many thanks to all who answered the call to participate.


For some time now, I have been studying the effect of context on pronoun interpretation. If words and sentences always meant what they meant regardless of context, linguistics and psycholinguistics would be much easier, and we would have much better computer translation, speech recognition, etc. Unfortunately, the same word (bank) can often mean different things in different contexts (he paddled over to the bank versus he cashed a check at the bank).

Pronouns are a great guinea pig for studying the role of context, because they derive almost all their meaning from context (try to define “she” or “he” and compare it to your definition of “Martha Washington” or “George Washington”).

Great Expectations

Recently, a picture has started to emerge, at least in the case of pronouns. The basic idea, due mostly to the work of Andrew Kehler at UCSD*, is that our initial interpretation of a pronoun is driven by what we think is likely to be talked about next. If this seems obvious, note that the dominant theory at the time Kehler started working (Centering Theory and variants) argued that our initial interpretation of the pronoun is that it refers to whatever person or thing is currently most "salient" (what counts as "salient" depends on the version of the theory) -- a hypothesis that also usually strikes folks as obvious.

Kehler's big contribution was articulating a theory of discourse structure -- that is, how sentences relate to one another -- that can be used to fairly accurately predict what people expect to be mentioned next. (If you are interested in these issues and have a little background in linguistics, Kehler's book, Coherence, Reference, and the Theory of Grammar, is fantastic.) For instance, sometimes one sentence introduces the consequence of another sentence:

(1) John frightened Bill, so he ran away.

Here, the second sentence (or, if you prefer, second clause) describes a consequence of the first sentence. Most likely "he" refers to Bill, because Bill running away would be a reasonable consequence of John frightening him. In contrast, other sentences explain the previous sentence:

(2) John frightened Bill because he is scary.

Here, "he" probably refers to John, since John being scary would be a good explanation of his frightening of Bill.

There are many other types of relationships between sentences, and they have predictable effects on pronoun interpretation. Although Kehler's theory explains a lot, it does not explain, for example, why we think Bill running away is a more likely consequence of John frightening Bill than John running away is.

The role of verbs

In two recent papers, which I discussed on this blog, my colleagues and I argued that verbs play a major role. Verbs -- specifically, the relationship between a verb and its subject and object -- provide a lot of information about events. We drew in particular on one line of theoretical work (usually called "predicate decomposition theory"), which tries to explain how verb meaning can be built out of a few constituent parts. The details aren't important here. What is important is that this theory argues that some  verbs specify who the cause of the event was. What we showed was that usually, in sentences like (2), people think the pronoun refers to the person that the verb specifies as the cause. In this case, "frighten" means something like "John caused Bill to be afraid". Remember that "he is scary" is an explanation of "John frightened Bill." Explanations usually refer to causes.

In short, by drawing on independent theories of discourse structure and verb meaning, we were able to predict very well how people will interpret pronouns in various contexts. At least, we could do so in the ones we tried -- there's a lot of work left to be done to fully flesh out this work.

The problem

I have been presenting this work for a while, and I often get the following objection: We already know that verbs can't be doing all (or even much) of the work. The real story, it was argued, is much more complex. Thinking just about explanation sentences like (2), Pickering and Majid (2007) noted that multiple factors "affect the construction of the event representation, and it is this event representation that is used to infer the cause..." They cite experimental findings argued to show that pronoun interpretation in sentences like (2) depends in complex ways not just on the verb but on what you know about the subject and the object:
In addition, properties of the participants affect implicit causality. Changing the gender (Lafrance, Brownell, & Hahn, 1997), animacy (Corrigan, 1988, 1992), or typicality (Corrigan, 1992; Garvey et al., 1976) of the participants changes the [pronoun interpretation].
After hearing this enough times, I started what I thought would be a series of studies to look at how information about the subject and object interact with the verb in real time during sentence comprehension. This project never got off the ground because I couldn't find any such effects. That is, I have now run a number of studies where I manipulate the gender or typicality, etc., of the subject and object, and they have no effect on pronoun interpretation.

It turns out that there was some confusion in the literature. The studies that Pickering and Majid cite in the quote above mostly don't look at pronoun interpretation at all. Most look at a different task:

(3) John frightened Bill.
a. How likely is this because John is the kind of person who frightens people? 1 2 3 4 5 6 7 8 9
b. How likely is this because Bill is the kind of person people frighten? 1 2 3 4 5 6 7 8 9

Researchers look at whether the answer to (a) is greater or less than the answer to (b) to decide who people think caused the event: John or Bill? Much of the literature has assumed that the answer to this question should predict what happens in pronoun sentences like (2), even though this has never been rigorously shown. (Why it hasn't been carefully tested is a bit of a mystery. It is so widely believed to be true that I suspect many folks don't realize that it hasn't been tested. It actually took me several years to pick up on this fact myself.)

I now have a long line of studies showing that there is little relationship between the two tasks. Also, although manipulating who the subject and object are affects responses in the task in (3), I find very little evidence that it affects pronoun interpretation in (2). For instance, compare the following:

(4) a. The king frightened the page because he....
     b. The page frightened the king because he....

Everybody agrees that, in general, it is more likely that kings frighten pages than that pages frighten kings, and so if you use these sentences in (3), you get a nice effect of who the subject is. But it doesn't affect pronoun interpretation at all.

This is a serious blow to Pickering and Majid's argument. They argued that pronoun interpretation cannot be all (or mostly) about discourse structure and verb meaning because these interact in complex ways with knowledge about the subject and object (I should add: non-linguistic knowledge. It presumably is not part of the definition of king and page that kings frighten pages but not vice versa, but rather something you learn about the world). If it turns out that this is not the case, then discourse structure + verb meaning may well explain much or all of the phenomenon at hand.

That Kind of Person

That was my argument, anyway, in a paper that I have been shopping around for a couple years now. The difficulty with publishing this paper is that it makes a null argument: you can't find effects of knowledge about the subject and object on pronoun interpretation. In fact, all I can show is that the manipulations I have tried haven't worked, not that no manipulation works (you can't try everything!). So much of the review process has been reviewers suggesting additional experiments and me running them. The latest -- and I hope last -- one was That Kind of Person.

A reviewer very smartly noted that a big difference between (2) and (3) is that (3) asks about the kind of person the subject is and the kind of person the object is, whereas (2) does not. What we are manipulating in our king/page manipulation is, of course, the kind of person the subject is and the kind of person that the object is. So the reviewer suggested the following pronoun task:

(5) a. The king frightened the page because he is the kind of person that...
     b. The page frightened the king because he is the kind of person that...

The specific manipulation was one of status. It was argued in the literature that people are more likely to think that high-status folk (kings) caused the event than low-status folk (pages). This does turn out to be true if you use the task in (3), but yet again I found no effect on pronouns, either using sentences like (4) or like (5). (Sorry -- I was going to include a graph, but the results aren't formatted for graphing yet, and it's time for lunch! Maybe when the paper is published...)

Conclusions

I think this work suggests that we really are closing in on "the" theory of pronoun interpretation (though there is a lot of work left): a theory in which most of the work is done by discourse structure and verb meaning. This is pretty exciting, because it would be one of the rare cases where we have a reasonably complete theory of how context affects word meaning. It does leave open the question of what the task in (3) is measuring, and why it doesn't match what the pronoun tasks measure. That's still the sticking point in the review. I have a few new ideas, and we'll see what the reviewers say this time around.

----
*Editors at newspapers and magazines usually request that, whenever you introduce a scientist in an article, you state name, institution, and scientific field. The first two are easy, but the last one is hard, particularly when you frequently write about interdisciplinary research (which I do). I wrote about Kehler in an article for Scientific American Mind a while back, and introducing him caused a long debate. His degree is in computer science, he works in a linguistics department, but his work is probably best described as psychology. So what is he?

Just another reason I prefer blogging.

Small World of Words

A group of researchers in Belgium is putting together a very large word association network by asking volunteers to say which words are related to which other words. They are hoping to recruit around 300,000 participants, which makes it my kind of study! (Technically, I've never tried 300,000 participants -- I think we've never gone beyond about 50,000, though we have some new things in the pipeline...)

It looks interesting. To participate, go to www.smallworldofwords.com. You can read more about the project here.