Field of Science

Eighteen papers on replication

Last month, Perspectives on Psychological Science published a special section on replicability in psychological science. With 18 papers, it took me a while to get through all of them, so I am only blogging about them now.

The issue contains many useful articles, particularly if you have not been following the replicability crisis/discussion carefully. I have been following the discussion pretty closely, and I still found a lot worth reading (a particularly enjoyable surprise was Giner-Sorolla's historical review of replicability problems, which is absolutely fascinating).

So I hope you check the issue out. It is currently open access. But if you'd rather have someone else read it first and summarize the main points ... then lucky for you, I've done just that. Below, I describe and discuss all 18 papers, though it should be said that these descriptions are through the lens of what I happen to find particularly interesting or novel. There are likely many other things in these papers that would be of interest to you. So check them out.


The issue is bookended by two overviews. Harold Pashler and Eric-Jan Wagenmakers get the issue started with a brief review of recent scandals, failures to replicate, and other reports -- both in psychology and in other fields -- that replicability may be an issue. John Ioannidis, Legend of Replicability, has a contribution at the end of the issue ("Why science is not necessarily self-correcting") which similarly summarizes many of the main arguments of the issue. If you haven't been following along, these are good primers. Neuroskeptic ("The nine circles of scientific hell") makes many of the same arguments, but in the satirical style of Dante Alighieri (I only wish it was in rhyme).

Is there a replicability crisis?

Pashler and Harris ("Is the replicability crisis overblown?") address three arguments sometimes given that "the replicability crisis is overblown". The first is that the low alpha level (5%) ensures that, of published results, most are right. They show that this is most likely incorrect, given current publishing practices. The second is that even if not many direct replications are published, many "conceptual" replications are. They argue that the notion of "conceptual replication" is so fuzzy that it actually compounds the problem (one problem: there is no such thing as a failure to conceptually replicate):
We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling "pathological science" cases that embarrassed the natural sciences... What many observers found peculiar in these cases [like cold fusion] was that it took many years for a complete consensus to emerge that the phenomena lacked any reality... It appears that many exact replication attempts of the initial [cold fusion studies] were undertaken soon after the first dramatic reports of cold fusion. Such attempts produced generally negative results (Taubes, 1993). However, what kept faith in cold fusion alive for some time ... was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications).
Finally, they take issue with the claim that over the long run, science is self-correcting, given the lack of evidence that this is true (as far as I can tell, it's not even clear how you would show that science is self-correcting without rigorous, systematic studies of replicability).
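The first argument, about the 5% alpha level, comes down to a little arithmetic. Here is a minimal sketch of it. The prior probabilities and power values are hypothetical inputs chosen for illustration, not figures taken from the paper, and the function name is made up for this example.

```python
def share_of_true_positives(prior: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of significant results that reflect a real effect.

    prior: assumed proportion of tested hypotheses that are true
    power: chance a study detects a real effect
    alpha: chance a study "detects" an effect that is not there
    """
    true_hits = prior * power           # real effects that reach significance
    false_hits = (1 - prior) * alpha    # null effects that reach significance by chance
    return true_hits / (true_hits + false_hits)


# Hypothetical scenarios (assumptions, not data from the article)
for prior in (0.5, 0.1):
    for power in (0.8, 0.35):
        share = share_of_true_positives(prior, power)
        print(f"prior={prior:.2f} power={power:.2f} -> {share:.0%} of significant results are real")
```

Under these assumptions, alpha = 5% only guarantees that most significant results are right when a decent share of the tested hypotheses are true and power is high. The sketch also assumes every significant result gets published and no null result does; it says nothing about other questionable practices.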

While probably not everyone agrees with all the specifics of Pashler and Harris's argument, there does appear to be general consensus that there is a real replicability problem. 1,292 psychologists around the world surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") believed, on average, that 53% of the results in psychology could be replicated in direct replications. (One word of caution: it may be that psychologists most concerned about replication were also disproportionately likely to agree to participate in the survey; still, 1300 psychologists is a significant fraction of the field.)

What's the best way of getting a significant result?

Ideally, the answer would be "test for a real effect", but that of course requires that you know whether your hypothesis is true or not, which presumably you don't (otherwise, why run the study?). Bakker, van Dijk and Wicherts ("The rules of the game called psychological science") show that, as a general rule, you are better off running a few underpowered studies than one large, sufficiently-powered study (particularly if any real effect sizes are small). That is, the more likely it is that your results aren't true, the more likely you are to have a publishable finding. This is bad news when coupled with the fact that humans are good at finding efficient strategies. (Note that this doesn't require bad actors: One can simply find that certain procedures -- like testing around X subjects -- have generally produced "good", publishable research in the past.)
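The trade-off between one large study and several small ones can be sketched with a normal approximation to the power of a two-group comparison. Everything here is an assumption made for illustration: the effect size (d = 0.2), the sample sizes, and the simplification that only results in the predicted direction count. It is not the simulation Bakker and colleagues ran.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Normal approximation to the power of a two-group comparison.

    Counts only significant results in the predicted direction.
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(d * sqrt(n_per_group / 2) - z_crit)


def chance_of_any_hit(d: float, n_per_group: int, n_studies: int) -> float:
    """Chance that at least one of n_studies independent studies is significant."""
    return 1 - (1 - approx_power(d, n_per_group)) ** n_studies


d = 0.2  # hypothetical small true effect
one_large = chance_of_any_hit(d, n_per_group=100, n_studies=1)
five_small = chance_of_any_hit(d, n_per_group=20, n_studies=5)
print(f"one study, 100 per group:   {one_large:.2f}")
print(f"five studies, 20 per group: {five_small:.2f}")
```

With the same total number of subjects, splitting them across five small studies gives a better chance of at least one publishable result than pooling them into one study (roughly 0.38 versus 0.29 under these assumptions). Setting d to 0 shows the same pattern when there is no effect at all.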

The bottleneck

Giner-Sorolla ("Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science") argues that a certain amount of bad science is expected when there are too few publication outlets:
Articles cannot pass through just by showing theoretical meaning and methodological rigor; their results must appear to support the hypothesis perfectly... Imagine that two colleagues in psychology each show you a manuscript. One has two studies with roughly equal numbers of participants. Both studies support the hypothesis, each with a significant key result at p=.04. The other paper has three studies, also supporting the hypothesis, but the last two studies' individual results are only near significant: p=.02, p=.07, and that most annoying figure, p=.11.
Although, statistically, the second paper actually presents stronger evidence, it should be obvious that the first one is more likely to be published.
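One way to check the claim that the three-study paper is the stronger one is to combine the p-values within each paper. The sketch below uses Fisher's method, which is a choice made here for illustration and not necessarily the calculation Giner-Sorolla had in mind; it also assumes the studies are independent.

```python
from math import exp, factorial, log


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom, where k is the number of p-values.
    """
    k = len(p_values)
    half_stat = -sum(log(p) for p in p_values)  # the chi-square statistic divided by 2
    # Closed-form chi-square survival function, valid for even degrees of freedom
    return exp(-half_stat) * sum(half_stat ** i / factorial(i) for i in range(k))


paper_one = fisher_combined_p([0.04, 0.04])
paper_two = fisher_combined_p([0.02, 0.07, 0.11])
print(f"two studies at p=.04:    combined p = {paper_one:.4f}")
print(f"three 'messier' studies: combined p = {paper_two:.4f}")
```

Under these assumptions the messier-looking paper has the smaller combined p-value, even though two of its three studies miss the .05 cutoff.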

There are two ways of having perfect-looking results, and unfortunately one is a lot easier than the other. Again, this doesn't require bad actors -- just competent ones. Giner-Sorolla's solution is not to make publication easier -- various reasons are given for why that is unlikely to work -- but rather to develop an alternative method for distributing less "pretty" results to those who might be interested (such as those doing meta-analysis). 

(The summary here does not do this fascinating article justice. It also includes, among other things, a fascinating history of hypothesis testing, including a previous crisis of confidence in psychology in the 1970s.)

Too good to be true

Francis ("The psychology of replication and the replication of psychology") takes the point a bit further: our love of "pretty" data has led us to expect data that are too good to be true. Most psychology studies have low power to detect an effect; thus, presenting several studies in a single paper all of which show the same significant effect is highly suspicious. He goes through eight experiments recently published in Galak & Meyvis (2011), seven of which show significant results despite their apparent power ranging from about 0.25 to 0.8 (that is, between a 25% and 80% chance of detecting the effect). The likelihood of this happening even if the hypothesis were true was about 0.08. (If anything, this understates the problem: the one "non-significant" experiment had a p-value of 0.056, which many authors would count as significant -- which, again, makes the reported findings even less likely.)
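The logic of a "too good to be true" check can be sketched as follows: given each experiment's power, how likely is it that at least seven of eight come out significant? The individual power values below are hypothetical placeholders spread over the 0.25 to 0.8 range mentioned above; they are not the values Francis estimated, so the output will not reproduce his 0.08, and this is not necessarily his exact procedure.

```python
def prob_at_least(powers, k):
    """P(at least k of the independent studies reach significance).

    powers: per-study probability of a significant result
    """
    dist = [1.0]  # dist[j] = P(exactly j significant results so far)
    for power in powers:
        nxt = [0.0] * (len(dist) + 1)
        for j, p in enumerate(dist):
            nxt[j] += p * (1 - power)   # this study misses
            nxt[j + 1] += p * power     # this study hits
        dist = nxt
    return sum(dist[k:])


# Hypothetical per-experiment power values (assumptions for illustration only)
hypothetical_powers = [0.25, 0.4, 0.5, 0.55, 0.6, 0.7, 0.75, 0.8]
print(f"P(7 or more of 8 significant) = {prob_at_least(hypothetical_powers, 7):.3f}")
```

The point of the exercise is that a long run of significant results is itself improbable when power is modest, even if the underlying effect is real.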

Francis goes out of his way to say that this does not necessarily represent malicious intent:
Publication bias could occur in a set of experiments because authors closely follow the standards of the field. Indeed, the experiments reported in Galak and Meyvis (2011) appear to meet (and often exceed) the standards of experimental psychology.
Francis argues that changing the aesthetic can be accomplished by changing what results we report. We should report effect sizes and worry less about p-values. I've always been somewhat skeptical of effect size cheerleaders, but Francis put the argument in a new way that (at least for the moment) makes sense to me:
Given that almost every empirical study in psychology currently uses hypothesis testing, it may seem bizarre to claim that effect sizes are more important than the outcome of hypothesis tests... [But] surely, the theory of Galak and Meyvis (2011) does not predict that every experiment testing the theory will reject the null (e.g., even with sample sizes of, say, n = 3). The authors probably meant that if the theory were true, then an experiment with a large enough sample would reject the null hypothesis. But the definition of "large enough sample" is determined by the magnitude of the effect size ... you still have to focus on effect sizes in order to predict the outcome of experiments.
Measuring and reporting effect sizes would highlight when data are "too pretty", making such data actually less attractive, mitigating some of the factors encouraging publication bias. It would also lead to better research. Francis then discusses some ways to easily detect publication bias.

This paper generated a lot of attention and two commentaries were published alongside it. Simonsohn ("It does not follow") notes that Francis had to look at many papers to find any that have evidence of publication bias but does not correct for multiple comparisons. That is, Francis's test for the file-drawer effect must itself take into account the file-drawer effect; you'd want to see that, in a sample of papers, more show evidence of a file-drawer effect than you would have expected by chance. In other words, it is possible that Galak and Meyvis (2011) failed the test through pure random chance alone. This argument is somewhat weakened by Galak and Meyvis's admission ("You could have just asked") that they had indeed excluded several experiments that had not reached significance. However, the larger point stands.

Both Simonsohn and Galak & Meyvis push back against Francis's argument that if a paper has been shown to suffer from the file-drawer effect (unpublished null results), the paper should be ignored altogether. Just because the file-drawer effect has inflated the observed effect size, that doesn't mean the effect size is 0. Simonsohn points out that there are meta-analytic ways to correct for the file-drawer effect. Galak and Meyvis argue that not publishing null results is inconsequential, because having several positive results is simply too unlikely. Whether this is true, they acknowledge, depends on whether false-positive rates are inflated overall (as many of the other authors in this issue argue).

Can meta-analyses save us?

Ferguson and Heene ("A vast graveyard of undead theories") focus on the problems for meta-analyses posed by publication bias. Obviously, meta-analyses of biased data will be biased meta-analyses (junk in - junk out). They take a look at some ways researchers have tried to de-bias meta-analyses, such as including unpublished data. The problem is that the selection of unpublished data itself is often biased, and they show that in some cases this actually makes the problem worse.

They also take on an apparently common argument, which is that if several published studies have shown an effect, you'd need there to be a huge number of studies showing no effect in the proverbial file drawer in order to wash out the significant studies.
Now let us assume we assembled 50 studies in a meta-analysis with a mean [effect size] of 2.0. In this case, the FSN [a statistic indicating the number of null results needed to cancel out the significant ones] would yield a value of 6,854.
However, this statistic makes the incorrect assumption that if all the published studies were false positives, the mean effect size of the remainder would be 0.
In fact, if only 5% of studies that show Type I errors were published, the mean [effect size] of the remaining unpublished studies cannot be zero but must be negative... The FSN turns out to be a gross overestimate...
Despite the math, the best evidence of their claims comes from a plausibility argument:
It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing meta-analysis of their own [long list of citations].

Participants as passive receptacles of stimuli

Klein and colleagues ("Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments") write:
Though in general there may be many reasons why a study fails to replicate (e.g., lack of statistical power...), we suggest that the central issue is a widespread failure to acknowledge, recognize, and document the circumstances and social context in which an experiment takes place.
In theory, if aspects of social context (e.g., whether the experimenter was male or female) are relevant for the effect but not documented, that part of the method cannot be replicated. But more generally, the failure to document important aspects of the design may suggest -- to put it bluntly -- incompetent experimenters, in which case a failure to replicate is no surprise. For instance: a significant percentage of studies (published in premier journals) that use deception did not include an "awareness probe" (basically, asking the participant if they were aware of the deception). This is problematic because sometimes the deception is too obvious and participants figure it out, invalidating the results of the study.

Their proposed solutions mostly amount to "do good studies". For instance, if experimenter bias is a concern, use double-blind procedures. If you are worried about demand characteristics, include relevant probes and controls. These are things that should be taught in every intro methods class, but -- according to the authors' research -- are often neglected even in prestigious journals.

Who should be doing replications?

Given that researchers are already over-loaded with responsibilities, if more replications are to be conducted, who is going to conduct them? Frank and Saxe ("Teaching replication") argue that undergraduate and graduate research methods classes should. Students in these classes already have to conduct experiments, but usually they are replications of classic experiments (like Stroop) or novel experiments thought up by the students. Either kind is usually boring: The first kind because the answer is already known, and the second kind because the students don't know enough to think up an interesting, novel experiment that makes contact with cutting-edge research. Replicating recent, cutting-edge studies solves both problems. While they acknowledge that sometimes the experiments aren't done well enough to be informative, around half of the time (in their experience, both having taught such classes), they are.

Grahe and colleagues ("Harnessing the undiscovered resource of student research projects") likewise argue that undergraduates are capable of doing useful research, and given that there are many undergraduates out there taking many methods courses, quite a lot of useful work could be collectively accomplished by them.

How often are replications published in psychology?

Makel, Plucker, and Hegarty ("Replications in psychology research: How often do they really occur?") find -- probably to nobody's surprise -- that replications are rarely published. Looking at all the articles published by 100 psych journals since 1900, just under 2% of the articles even mention "replication" in the text. Based on further investigation, they estimate that only 1% of articles are actual replications, most of which were successful replications. The good news is that the rate has been rising steadily in recent decades. The bad news is that the vast majority were conceptual replications, not direct replications. Worse news: replications, whether successful or not, were cited on average 17 times, compared with an average of 64.5 times for the articles replicated. (Interestingly, they find that some papers do include failures to conceptually replicate -- despite what most people would expect -- though these are nonetheless fairly rare.)

Incentivizing replication and replicability

Nosek, Spies and Motyl ("Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability") note, as have others, that there isn't much incentive to engage in activities that would improve the replicability rate. In a succinct article, they list a number of strategies that they think are unlikely to work -- conceptual replication, waiting for science to "self-correct", establishing journals devoted to publishing replications or negative results, education campaigns, increasing expectations of reviewers, and raising the barrier for publication. The reasons why are straightforward and can be found in the article.

They also offer several strategies they think will work. For instance, they recommend simple check-lists, which "ensure disclosure of obvious items that are sometimes forgotten: sample sizes, statistical tests, effect sizes, and covariates." While this might seem similar to their discarded strategy of "raising expectations" and "educational campaigns", they argue that checklists are an exception, and point to research showing their efficacy (as opposed to raised expectations and educational campaigns) in medicine.

Other proposals include establishing metrics to identify what is worth replicating, thereby making replications of these papers seem important enough (to journal editors) to publish, crowd sourcing replication efforts, and establishing journals (like PLoS ONE) with review standards focused on the soundness, not importance, of the research. While this latter approach may seem like a journal signing its own death warrant, they point to data suggesting that journals aren't actually very good at figuring out what papers will or won't be cited in the future. They also note that making data, materials, and the workflow itself more open and transparent will make it easier to re-use data and materials, with additional replications as an inevitable, positive side effect.

Koole and Lakens ("Rewarding replications") focus more squarely on the lack of incentive to conduct replications because they are rarely published and rarely cited when published. Their proposed solution is that when journals publish an original finding, they agree to publish future replications as an online supplement. They then could create a meta-reference that refers to the original paper and subsequent replications, which researchers could then cite, increasing the citations received by the replications (since they are cited automatically).  This has the nice advantage of working within the existing infrastructure (always easier than starting from scratch). The only remaining issue is how to convince the existing infrastructure (i.e., the journals) to participate, given that it will cost them a lot and there isn't any clear advantage to participation (some would argue that it's actually against the journal's interest to participate). There is also the first-mover disadvantage: the system only becomes particularly useful once many journals participate.

Confirmatory research

Wagenmakers, Wetzels, Borsboom, van der Maas and Kievit ("An agenda for purely confirmatory research") argue that researchers should be required to pre-register their methods and intended analyses prior to running a study. The idea is to make it clear which results are and are not exploratory. The suggestion is that exploratory analyses are often circular (you look at the data, develop a hypothesis, and then test the hypothesis on those data), which increases the likelihood of false positives.

Pre-registration obviously requires extra work, both on the part of the scientists and on the part of the people who maintain the registry, so one can reasonably ask whether the gain is worth the cost. First, we need to know if the problem itself is common (no point in investing in this infrastructure in order to combat a nonexistent problem). Wagenmakers and colleagues argue that it is, but unconvincingly:
Almost no psychological research is conducted in a purely confirmatory fashion (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001).
As far as I can tell, Wagenmakers et al. (2011) speculate that confirmatory research is rare, but offer no data (that I noticed when skimming). Kerr (1998) did conduct a survey of 156 social psychologists, clinical/community psychologists, and sociologists, finding that confirmatory research was reported as occurring most frequently, though not significantly more frequently than some forms of circular research; even if we assume that these results generalize to all of psychology (not just social and clinical), it is a far cry from "almost nothing".

But let's even suppose for the moment that non-confirmatory (that is, exploratory) research is very common. Is that a problem? As I already mentioned, Wagenmakers and colleagues are worried about circular analyses. What they suggest is tracking exploratory analyses, which are not necessarily the same thing: Suppose I run a study, and in review, a reviewer asks, "Was there a difference between the male and female participants?" (This is not a made-up example). Well, I hadn't ever thought about that question, so I never checked. Now I check, and either there is or there isn't. This is not a circular analysis; retroactively deciding this is an interesting question should not affect the results (unless you believe in ESP, which Wagenmakers and colleagues clearly do not).

In the absence of additional information, then, the fact that an analysis was exploratory tells you little about whether it is circular, which was the real issue. Knowing that an analysis was exploratory suggests that it is circular only if a large percentage of exploratory analyses are circular. I don't know whether they are; there don't appear to be any data. But even if there were, that would just tell you the likelihood -- it wouldn't settle the issue for any particular case.

Enough smart people think that pre-registration would be a good idea that I'm willing to suppose that it is. I just don't see anything in this particular article to suggest that the benefit is worth the price. One last thought: Wagenmakers and colleagues argue that researchers don't fully appreciate the distinction between planned and circular analyses. If so, would marking certain analyses as planned and certain analyses as circular help?

What should we do? 

The 1,292 psychologists surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") were also asked about specific suggested reforms (proposed in a recent paper, not by them). I summarize the results below:

Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article:
should be good practice: 60%
should be condition of publication: 46%

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification:
should be good practice: 47%
should be condition of publication: 30%  [presumably none of these were psychophysicists]

Authors must list all variables collected in a study:
should be good practice: 58%
should be condition of publication: 46%

Authors must report all experimental conditions, including failed manipulations:
should be good practice: 80%
should be condition of publication: 66%

If observations are eliminated, authors must also report what the statistical results are if those observations are included:
should be good practice: 69%
should be condition of publication: 52%

If analysis includes a covariate, authors must report the statistical results of the analysis without the covariate:
should be good practice: 67%
should be condition of publication: 50%

The survey also asked about four suggested guidelines; for space reasons, I have left them out. You can find them in the paper.

Nobody asked my opinion, but here it is anyway. I think having and reporting a termination rule is relatively costless and helpful in certain situations, so I support it as good practice and wouldn't mind if it were a condition of publication. The question about 20 observations per cell ... do we mean 20 items or 20 subjects or both? I suspect the authors were thinking of studies with a single item per condition per subject (which are common in some areas of psychology but essentially nonexistent in others; note that Fuchs, Jenny & Fiedler also note the problem of finding one set of standards that is appropriate to all subdisciplines of psychology), so it's hard to know how this is meant to be applied. In any case, I would prefer a standard based on expected power, which is more important (though harder to define).

I strongly support listing all variables collected and all experimental conditions, though it quickly runs up against the word limits in many high-impact journals (which I would be happy to do away with; to paraphrase Einstein, papers should be as short as possible, but not shorter). One issue: what counts as an experimental condition is not always clear, particularly if you are doing between-subject manipulations (is this data set a "condition" or a separate experiment?). So I don't know how to enforce these as conditions of publication.

As for covariates, certain kinds of measures make no sense without the covariate, so in some cases you'd be adding irrelevant, misleading information. Leaving this up to reviewers (the current system) seems perfectly reasonable to me.

In praise of experiments

Today, the excellent Neuroskeptic writes about a new study investigating which US states are most suicidal. The interesting twist was the form of the data: Google searches. It's an interesting study and an interesting use of Google searches, but what struck me was Neuroskeptic's closing thoughts.
Over the past couple of years there's been a flurry of studies based on analyzing Google and Twitter trends. What's interesting to me is that we're really in the early days of this, when you think about likely future technologies. What will happen when everyone's wearing a computer 24/7 that records their every word and move, and even what they see? 
Eventually, psychology and sociology might evolve (or degenerate) into no more than the analysis of such data...
It's always dangerous to predict the future, but here's my prediction: Not a chance. It comes down to a distinction between observational studies and experiments. Observational studies (where you record what happens in the course of normal events) are useful, particularly when you care about questions like what is the state of the world? They are much less useful when you want to know why is the world the way it is?

There are a couple of reasons.

Reason #1: The correlation fallacy

First, observational studies are really about studying correlations. To have much power to analyze interesting correlations, you need a lot of data. This is what makes Google and Twitter powerful: they provide a lot of data. But correlation, famously, doesn't always tell you much about causation.

For instance, it is now well-known that you can use the number of pirates active in the world's oceans and seas to reasonably predict average global temperature (there's a strong correlation):

I did not know until recently that Google search data has now definitively shown a correlation between the amount of movie piracy and global warming as well:

In the case of real pirates vs. the temperature, the correlation runs the other way (temperature affects weather affects seafaring activities). I have no idea what causes the correlation between searches for free movies and searches about global warming; perhaps some third factor. To give another silly example, there is a lot more traffic on the roads during daylight than at night, but this isn't because cars are solar-powered!

The point is that experiments don't have this problem: you go out and manipulate the world to see what happens. Change the number of pirates and see if global temperatures change. Nobody has tried this (to my knowledge), but I'm willing to bet it won't work.
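The contrast between observing and manipulating can be illustrated with a toy simulation. The data are entirely made up: two variables that have no effect on each other are both driven by a hidden third factor, and the "experiment" consists of setting one of them at random. The variable names and noise levels are assumptions chosen for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000
third_factor = [rng.gauss(0, 1) for _ in range(n)]

# Observational data: x and y each depend on the third factor, not on each other
x_observed = [z + 0.5 * rng.gauss(0, 1) for z in third_factor]
y = [z + 0.5 * rng.gauss(0, 1) for z in third_factor]

# Experiment: the experimenter sets x at random, ignoring the third factor
x_assigned = [rng.gauss(0, 1) for _ in range(n)]

observational_r = pearson(x_observed, y)
experimental_r = pearson(x_assigned, y)
print(f"observed correlation:      {observational_r:.2f}")
print(f"correlation when x is set: {experimental_r:.2f}")
```

In the observational data the two variables are strongly correlated; once x is assigned by the experimenter, the correlation with y disappears, because x never had any influence on y in the first place.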

(Of course, there are natural experiments, which are a hybrid of observational studies and experiments: the experimenter doesn't manipulate the world herself but rather waits until somebody else, in the course of normal events, does it for her. Good examples are comparing different states as they adopt bicycle helmet laws at different times and comparing that against head injury stats in the various states. These are rarely as well-controlled as an actual experiment, but have the advantage of ecological validity.)

Reason #2: Life's too short

The second is that observational studies are limited by what actually happens in the world. You won't, from an observational study, find out what the effect on US politics would be of every US senator taking up crack while every US representative takes up meth. (I hope not, anyway.)

That was an absurd example, but the problem is real. Language gives lots of great examples. Suppose you want to find out what sentences in any given language are grammatical and what sentences are not. You could do an observational study and see what sentences people say. Those are grammatical; sentences you haven't heard probably aren't.

The problem with this is that people are boring and repetitive. A small number of words (heck, a small number of sentence fragments) accounts for most of what people say and write. The vast majority of grammatical sentences will never appear in your observational sample no matter how long you wait, because there are actually an infinite number of grammatical English sentences. (In his impressive "Who's afraid of George Kingsley Zipf?", Charles Yang shows how a number of prominent language researchers went astray by paying too much attention to this kind of observational study.)
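To get a feel for how lopsided word usage is, here is a rough sketch under a textbook Zipf assumption: the frequency of a word is proportional to 1 divided by its rank. The exponent of 1 and the vocabulary size of 50,000 are assumptions for illustration, not measurements from any corpus.

```python
def zipf_coverage(top_k: int, vocab_size: int, exponent: float = 1.0) -> float:
    """Share of all word tokens covered by the top_k most frequent words,
    assuming frequency is proportional to 1 / rank**exponent."""
    weights = [1 / rank ** exponent for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)


vocab_size = 50_000  # hypothetical vocabulary size
for top_k in (10, 100, 1000):
    share = zipf_coverage(top_k, vocab_size)
    print(f"top {top_k:>4} words cover {share:.0%} of tokens")
```

Under these assumptions, a tiny fraction of the vocabulary accounts for a large share of everything said, which leaves very little of the sample for the rare constructions a theory needs to be tested on.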

The basic feature of the problem is that for building theories -- explaining why things are the way they are -- very often what you care about are the border cases. Human behavior is largely repetitive, and the  border cases are quite rare. Experiments turn this around: by deliberately choosing the situations we put our participants in, we can focus on the informative test cases.

The experimental method: Here to stay

None of this should be taken as meaning that I don't think observational studies are useful. I conduct them myself. A prerequisite to asking the question Why are things the way they are is knowing, in fact, what way things are. There is also the question of ecological validity. When we conduct laboratory experiments, we construct artificial situations and then try to generalize the results to real life. It's good to know something about real life in order to inform those generalizations.

But just as I can't imagine observational studies disappearing, I can't imagine them replacing experimentation, either.

Lab Notebook: You know you are writing a paper when

Your "recently added" list in Mendeley is growing at an exponential rate:

(Image: Mendeley "recently added" list. Note the times added.)

Every new paper you read results in downloading at least two more (not unlike the Hounds of Resurrection. Coincidence? I think not).

I don't think I've ever actually finished my reading list for a paper. At some point, I shut down the process before it overwhelms my hard drive.

Findings: That Kind of Person

That Kind of Person is now complete. Many thanks to all who answered the call to participate.

For some time now, I have been studying the effect of context on pronoun interpretation. If words and sentences always meant what they meant regardless of context, linguistics and psycholinguistics would be much easier, and we would have much better computer translation, speech recognition, etc. Unfortunately, the same word (bank) can often mean different things in different contexts (he paddled over to the bank versus he cashed a check at the bank).

Pronouns are a great guinea pig for studying the role of context, because they derive almost all their meaning from context (try to define “she” or “he” and compare it to your definition of “Martha Washington” or “George Washington”).

Great Expectations

Recently, a picture has started to emerge, at least in the case of pronouns. The basic idea, due mostly to the work of Andrew Kehler at UCSD*, is that our initial interpretation of a pronoun is driven by what we think is likely to be talked about next. If this seems obvious, the dominant theory at the time Kehler started working (Centering Theory and variants) argued that our initial interpretation of the pronoun is that it refers to whatever person or thing is currently most “salient” (what counts as "salient" depends on the version of the theory) -- a hypothesis that also usually strikes folks as obvious.

Kehler's big contribution was articulating a theory of discourse structure – that is, how sentences relate to one another – that can be used to fairly accurately predict what people expect to be mentioned next. (If you are interested in these issues and have a little background in linguistics, Kehler's book, Coherence, Reference, and the Theory of Grammar is fantastic.) For instance, sometimes one sentence introduces the consequence of another sentence:

(1) John frightened Bill, so he ran away.

Here, the second sentence (or, if you prefer, second clause) describes a consequence of the first sentence. Most likely "he" refers to Bill, because Bill running away would be a reasonable consequence of John frightening him. In contrast, other sentences explain the previous sentence:

(2) John frightened Bill because he is scary.

Here, "he" probably refers to John, since John being scary would be a good explanation of his frightening of Bill.

There are many other types of relationships between sentences, and they have predictable effects on pronoun interpretation. Although Kehler's theory explains a lot, it does not explain, for example, why we think Bill running away is a more likely effect of John frightening Bill than John running away.

The role of verbs

In two recent papers, which I discussed on this blog, my colleagues and I argued that verbs play a major role. Verbs -- specifically, the relationship between a verb and its subject and object -- provide a lot of information about events. We drew in particular on one line of theoretical work (usually called "predicate decomposition theory"), which tries to explain how verb meaning can be built out of a few constituent parts. The details aren't important here. What is important is that this theory argues that some verbs specify who the cause of the event was. What we showed was that usually, in sentences like (2), people think the pronoun refers to the person that the verb specifies as the cause. In this case, "frighten" means something like "John caused Bill to be afraid". Remember that "he is scary" is an explanation of "John frightened Bill." Explanations usually refer to causes.

In short, by drawing on independent theories of discourse structure and verb meaning, we were able to predict very well how people will interpret pronouns in various contexts. At least, we could do so in the ones we tried -- there's a lot of work left to be done to fully flesh out this work.

The problem

I have been presenting this work for a while, and I often get the following objection: We already know that verbs can't be doing all (or even much) of the work. The real story, it was argued, is much more complex. Thinking just about explanation sentences like (2), Pickering and Majid (2007) noted that multiple factors "affect the construction of the event representation, and it is this event representation that is used to infer the cause..." They cite experimental findings argued to show that pronoun interpretation in sentences like (2) depends in complex ways not just on the verb but on what you know about the subject and the object:
In addition, properties of the participants affect implicit causality. Changing the gender (Lafrance, Brownell, & Hahn, 1997), animacy (Corrigan, 1988, 1992), or typicality (Corrigan, 1992; Garvey et al., 1976) of the participants changes the [pronoun interpretation].
After hearing this enough times, I started what I thought would be a series of studies to look at how information about the subject and object interact with the verb in real time during sentence comprehension. This project never got off the ground because I couldn't find any such effects. That is, I have now run a number of studies where I manipulate the gender or typicality, etc., of the subject and object, and they have no effect on pronoun interpretation.

It turns out that there was some confusion in the literature. The studies that Pickering and Majid cite in the quote above mostly don't look at pronoun interpretation at all. Most look at a different task:

(3) John frightened Bill.
a. How likely is this because John is the kind of person who frightens people? 1 2 3 4 5 6 7 8 9
b. How likely is this because Bill is the kind of person people frighten? 1 2 3 4 5 6 7 8 9

Researchers look at whether the answer to (a) is greater or less than the answer to (b) to decide who people think caused the event: John or Bill? Much of the literature has assumed that the answer to this question should predict what happens in pronoun sentences like (2), even though this has never been rigorously shown. (Why it hasn't been carefully tested is a bit of a mystery. It is so widely believed to be true that I suspect many folks don't realize that it hasn't been tested. It actually took me several years to pick up on this fact myself.)

I now have a long line of studies showing that there is little relationship between the two tasks. Also, although manipulating who the subject and object are affects the task in (3), I find very little evidence that it affects pronoun interpretation in (2). For instance, compare the following:

(4) a. The king frightened the page because he....
     b. The page frightened the king because he....

Everybody agrees that, in general, it is more likely that kings frighten pages than that pages frighten kings, and so if you use these sentences in (3), you get a nice effect of who the subject is. But it doesn't affect pronoun interpretation at all.

This is a serious blow to Pickering and Majid's argument. They argued that pronoun interpretation cannot be all (or mostly) about discourse structure and verb meaning because these interact in complex ways with knowledge about the subject and object (I should add: non-linguistic knowledge. It presumably is not part of the definition of king and page that kings frighten pages but not vice versa, but rather something you learn about the world). If it turns out that this is not the case, then discourse structure + verb meaning may well explain much or all of the phenomenon at hand.

That Kind of Person

That was my argument, anyway, in a paper that I have been shopping around for a couple years now. The difficulty with publishing this paper is that it makes a null argument: you can't find effects of knowledge about the subject and object on pronoun interpretation. In fact, all I can show is that the manipulations I have tried haven't worked, not that no manipulation works (you can't try everything!). So much of the review process has been reviewers suggesting additional experiments and me running them. The latest -- and I hope last -- one was That Kind of Person.

A reviewer very smartly noted that a big difference between (2) and (3) is that (3) asks about the kind of person the subject is and the kind of person the object is, whereas (2) does not. What we are manipulating in our king/page manipulation is, of course, the kind of person the subject is and the kind of person that the object is. So the reviewer suggested the following pronoun task:

(5) a. The king frightened the page because he is the kind of person that...
     b. The page frightened the king because he is the kind of person that...

The specific manipulation was one of status. It was argued in the literature that people are more likely to think that high-status folk (kings) caused the event than low-status folk (pages). This does turn out to be true if you use the task in (3), but yet again I found no effect on pronouns, either using sentences like (4) or like (5). (Sorry -- I was going to include a graph, but the results aren't formatted for graphing yet, and it's time for lunch! Maybe when the paper is published...)


I think the result of this work is that it suggests that we really are narrowing in on "the" theory of pronoun interpretation (though there is a lot of work left), a theory in which most of the work is done by discourse structure and verb meaning. This is pretty exciting, because it would be one of the rare cases where we have a reasonably complete theory of how context affects word meaning. It does leave open the question of what the task in (3) is measuring, and why it doesn't match what the pronoun tasks measure. That's still the sticking point in the review. I have a few new ideas, and we'll see what the reviewers say this time around.

*Editors at newspapers and magazines usually request that, whenever you introduce a scientist in an article, you state name, institution, and scientific field. The first two are easy, but the last one is hard, particularly when you frequently write about interdisciplinary research (which I do). I wrote about Kehler in an article for Scientific American Mind a while back, and introducing him caused a long debate. His degree is in computer science, he works in a linguistics department, but his work is probably best described as psychology. So what is he?

Just another reason I prefer blogging.

Small World of Words

A group of researchers in Belgium is putting together a very large word association network by asking volunteers to say which words are related to which other words. They are hoping to recruit around 300,000 participants, which makes it my kind of study! (Technically, I've never tried 300,000 participants -- I think we've never gone beyond about 50,000, though we have some new things in the pipeline...)

It looks interesting. To participate, go to You can read more about the project here.

I say "uncle", you say "DaJiu"

Kinship terms (mother, uncle, niece, etc.) are socially important and generally learned early in acquisition. Interestingly, different languages have different sets of terms. Mandarin, for instance, divides "uncle" into "father's older brother", "father's younger brother", and "mother's brother".
Stranger things (to an anglophone, anyway) happen, too: In Northern Paiute, the kin terms for grandparents and grandchildren are self-reciprocal: you would use the same word to refer to your grandmother (if you are female) that she uses to refer to you. (See my previous post on "mommy" across languages.)

Kinship terms in English and Northern Paiute. Ignore all the logical terms for now.
(Figure taken from Kemp & Regier, 2012)

Even so, there are a lot of similarities across languages. Disjunctions are relatively rare; that is, it's unusual to see a word that means "father or cousin". Usually there are more words to distinguish varieties of closely-related relatives (sister, brother) than distant relatives (cousin). How come? One obvious answer is that maybe the kinship systems we have are just better than the alternatives (ones with words like "facousin" = "father or cousin"), but it would be nice to show this.

Optimal Kinship Terms

In a paper earlier this year, Charles Kemp and Terry Regier did just that.
We show that major aspects of kin classification follow directly from two general principles: Categories tend to be simple, which minimizes cognitive load, and to be informative, which maximizes communicative efficiency ... The principles of simplicity and informativeness trade off against each other... A system with a single category that includes all possible relatives would be simple but uninformative because this category does not help to pick out specific relatives. A system with a different name for each relative would be complex but highly informative because it picks out individual relatives perfectly. 
That seems intuitively reasonable, but these are computational folk, so they formalized this with math. The details are in the paper, but roughly: They formalize the notion of complexity by using minimum description length in a representational language based on primitives like FEMALE and PARENT. The descriptions of the various terms in English and Northern Paiute are shown in parts C and D of the figure above. Communicativeness is formalized by measuring how ambiguous each term is (how many people it could potentially refer to).

A language is considered "better" than another if it out-scores the other on one dimension (e.g., simplicity) and is no worse on the other (informativeness). A language is near-optimal if there is hardly any possible language that is better. They looked at a number of different existing kinship systems (English, Northern Paiute, and a bunch of others) and found that all of them were near-optimal.
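The "better than" comparison is what is usually called Pareto dominance, and it is easy to sketch in code. The snippet below is only an illustration: the class and function names are mine, the scores are made up, and it takes the two scores as given rather than computing the description-length and ambiguity measures from the paper.

```python
from typing import NamedTuple, Sequence


class KinSystem(NamedTuple):
    name: str
    complexity: float  # hypothetical score; lower means simpler
    ambiguity: float   # hypothetical score; lower means more informative


def is_better(a: KinSystem, b: KinSystem) -> bool:
    """True if a beats b on at least one dimension and is no worse on the other."""
    no_worse = a.complexity <= b.complexity and a.ambiguity <= b.ambiguity
    strictly_better = a.complexity < b.complexity or a.ambiguity < b.ambiguity
    return no_worse and strictly_better


def fraction_better(target: KinSystem, alternatives: Sequence[KinSystem]) -> float:
    """Share of alternatives that are better than target.
    A value close to 0 corresponds to 'near-optimal' in the sense used above."""
    if not alternatives:
        return 0.0
    return sum(is_better(alt, target) for alt in alternatives) / len(alternatives)


if __name__ == "__main__":
    # Made-up scores, purely for illustration.
    attested = KinSystem("attested", complexity=10, ambiguity=2.0)
    alternatives = [
        KinSystem("a", 9, 2.0),   # simpler, equally informative: better
        KinSystem("b", 12, 1.0),  # more informative but more complex: a trade-off
        KinSystem("c", 10, 2.0),  # identical scores: not better
        KinSystem("d", 11, 3.0),  # worse on both dimensions
    ]
    print(fraction_better(attested, alternatives))
```

Note that a system which wins on one dimension but loses on the other (like "b") does not count as better; it is just a different trade-off between simplicity and informativeness.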

Evolution, Culture, or Development?

There are generally three ways of explaining any given behavior: evolution (we evolved to behave that way), culture (culture -- possibly through cultural evolution -- made us that way), or development (we learned to behave that way). For instance, it's rare to find people who chiefly eat arsenic. This could be because of evolution (we evolved to avoid arsenic because the arsenic-eaters don't have children and pass on their genes), cultural evolution (cultures that prized arsenic-eating all died out, leaving the non-arsenic cultures as the only game in town), or development (we learned as children, through trial and error, that eating arsenic is a bad idea). If I remember my Psych 101, food preferences actually involve all three.

What about kinship terms? If they are optimal, who do we credit with their optimality? Probably not development (we don't each individually create optimal kinship terms in childhood). Kemp and Regier seem to favor cultural evolution: over time, more useful kinship terms stuck in the lexicon of a given language and useless ones like "facousin" died out. It would be nice to show, however, that it is not actually genetic. This wouldn't have to be genes for kinship terms, but it could be genes that bias you to learn naming systems that are near-optimal (kinship naming systems or otherwise). One would need to show that these arose for language and not just cognition in general.

------ Kemp, C., and Regier, T. (2012). Kinship Categories Across Languages Reflect General Communicative Principles Science, 336 (6084), 1049-1054 DOI: 10.1126/science.1218811

Still testing...

I was hoping to post the results of That Kind of Person today. When I announced the study two weeks ago, I estimated that it would take about two weeks to get enough data. For some reason, traffic on the site plummeted late last week.

So maybe one more week. As soon as I know the results, you will, and since this is (please let it be) the last experiment (#8!) for a paper, I am checking the numbers constantly. Many thanks to those who have already participated (those who haven't, you can find the experiment here; it shouldn't take more than 5 minutes).

Findings: Linguistic Universals in Pronoun Resolution - Episode II

A new paper, based on data collected through, is now in press (click here for the accepted draft). Below is an overview of the paper.

Many of the experiments at have to do with pronouns. I find pronouns interesting because, unlike many other words, the meaning of a pronoun is almost entirely dependent on context. So while "Jane Austen" refers to Jane Austen no matter who says it or when, "I" refers to a different person, depending mostly on who says it (but not entirely: an actor playing a part uses "I" to refer not to himself but to the character he's playing). Things get even hairier when we start looking at other pronouns like "he" and "she". This means that pronouns are a good laboratory animal for investigating how people use context to help interpret language.

Mice make lousy laboratory animals for studying the role of context in language.
Pronouns are better.

I have spent a lot of time looking at one particular contextual effect, originally discovered by Garvey and Caramazza in the mid-70s:

(1) Sally frightens Mary because she...
(2) Sally loves Mary because she...

Although the pronoun is ambiguous, most people guess that she refers to Sally in (1) but Mary in (2). That is, the verb used (frightens, loves) seems to affect pronoun resolution. Replace "frightens" and "loves" with other verbs, and what happens to the pronoun depends on the verb: some verbs lead to subject resolutions like frightens, some to object resolutions like loves, and some leave people unsure (that is, they think that either interpretation of the pronoun is equally reasonable).

The question is why. One possibility is that this is some idiosyncratic fact about the verb. Just as you learn that the past tense of walk is walked but the past tense of run is ran, you learn that some verbs lead you to resolve pronouns to the verbs' subject and some the verbs' object (and some verbs have no preference). This was what was tentatively suggested in the original Garvey and Caramazza paper.

Does the meaning of the verb matter?

One of the predictions of this account is that there's nothing necessary about the fact that frightens leads to subject resolutions whereas loves leads to object resolutions, just as there is no deep reason that run's past tense is ran. English could have been different.

Many researchers have suspected that the pronoun effects we see are not accidental; the pronoun effects arise from some fundamental aspect of the meanings of frightens and loves. Even Garvey & Caramazza suspected this, but all the hypotheses they considered they were able to rule out. Recently, using data from, we presented some evidence that this is right. Interestingly, while researchers studying pronouns were busy trying to come up with some theory of verb meaning that would explain the pronoun effects, many semanticists were independently busy trying to explain verb meaning for entirely different reasons. Usually, they are interested in explaining things like verb alternations. So, for instance, they might notice that verbs for which the subject experiences an emotion about the object:

(3) Mary likes/loves/hates/fears John.

can take "that" complements:

(4) Mary likes/loves/hates/fears that John climbs mountains.

However, verbs for which the object experiences an emotion caused by the subject do not:

(5) Mary pleases/delights/angers/frightens John.
(6) *Mary pleases/delights/angers/frightens that John climbs mountains.

[The asterisk means that the sentence is ill-formed in English.]

Linguists working on these problems have put together lists of verbs, all of which have similar meanings and which can be used in the same way. (VerbNet is the most comprehensive of these.) Notice that in this particular work, "please" and "frighten" end up in the same group as each other, while "like" and "fear" are in a different one: Even though "frighten" and "fear" are similar in terms of the emotion they describe, they have a very different structure in terms of who -- the subject or the object -- feels the emotion.

We took one such list of verb classes and showed that it explained the pronoun effect quite well: Verbs that were in the same meaning class had the same pronoun effect. This suggests that meaning is what is driving the pronoun effect.

Or does it?

If the pronoun effect is driven by the meaning of a verb, then it shouldn't matter what language that verb is in. If you have two verbs in two languages with the same meaning, they should both show the same pronoun effect.

We aren't the first people to have thought of this. As early as 1983, Brown and Fish compared English and Mandarin. The most comprehensive study so far is probably Goikoetxea, Pascual and Acha's mammoth study of Spanish verbs. The problem was identifying cross-linguistic synonyms. Does the Spanish word asustar mean frighten, scare, or terrify?
Is this orangutan scared, frightened or terrified? Does it matter?

Once we showed that frighten, scare and terrify all have the same pronoun effect in English, the problem disappeared. It no longer mattered what the exact translation of asustar or any other word was: Given that entire classes of verbs in English have the same pronoun effect, all we needed to do was find verbs in other languages that fit into the same class.

We focused on transitive verbs of emotion. These are the two classes already introduced: those where the subject experiences the emotion (like/love/hate/fear) and those where the object does (please/delight/anger/frighten) (note that there are quite a few of both types of verbs). We collected new data in Japanese, Mandarin and Russian (the Japanese and Russian studies were run at and/or its predecessor), and re-analyzed published data from English, Dutch, Italian, Spanish, and Finnish.

Results for English verbs (above). "Experiencer-Subject" verbs are the ones like "fear" and "Experiencer-Object" are the ones like "frighten". You can see that people were consistently more likely to think that the pronoun in sentences like (1-2) referred to the subject of Experiencer-Object verbs than Experiencer-Subject verbs.

The results are the same for Mandarin (above). There aren't as many dots because we didn't test as many of the verbs in Mandarin, but the pattern is striking.

The Dutch results (above). The pattern is again the same. Again, Dutch has more of these verbs, but the study we re-analyzed had only tested a few of them.

You can read the paper and see the rest of the graphs here. In the future, we would like to test more different kinds of verbs and more languages, but the results so far are striking, and suggest that the pronoun effect is caused by what verbs mean, not some idiosyncratic grammatical feature of the language. There is still a lot to be worked out, though. For instance, we're now pretty sure that some component of meaning is relevant to the pronoun effect, but which component and why?

Hartshorne, J., and Snedeker, J. (2012). Verb argument structure predicts implicit causality: The advantages of finer-grained semantics Language and Cognitive Processes, 1-35 DOI: 10.1080/01690965.2012.689305

Goikoetxea, E., Pascual, G., and Acha, J. (2008). Normative study of the implicit causality of 100 interpersonal verbs in Spanish Behavior Research Methods, 40 (3), 760-772 DOI: 10.3758/BRM.40.3.760

Garvey, C., and Caramazza, A. (1974). Implicit causality in verbs Linguistic Inquiry, 5 (3), 459-464

Brown, R., and Fish, D. (1983). Are there universal schemas of psychological causality? Archives de Psychologie, 51, 145-153

New Experiment: That Kind of Person

I just got back reviews on one of the pronoun papers. Although the paper already had seven experiments, they want two more. The worst part about it is that they are right.

Luckily, the experiment they asked for can be done online. It takes about 5 minutes. Native English speakers preferred (though I look at all data).

That Kind of Person (takes about 5 minutes)

My target is to post the results for this and the seven previous experiments in 2 weeks ... if I get enough participants quickly. Thank you in advance to everyone who participates.

Is Psychology a science?: Redux

The third-most read post on this blog is "Is Psychology a science?". I was a few years younger then and still had strong memories of one of my friends complaining, when we were both undergraduates, that he had to take a psychology course as part of his science distributional requirements. "Psychology isn't a science," he said, "because they don't do experiments." Since he was telling me this over AIM as I was sitting in my psychology laboratory, analyzing an experiment, it didn't go over well.

It's been a popular post, but I haven't written about the subject much since, in part because I started to suspect that the "psychology isn't a science" bias might actually be a thing of ignorant undergraduates and a few cranks. It's something I've rarely heard in the last few years, and there's no need to write diatribes against a non-existent prejudice.

In retrospect, maybe I haven't come across these opinions because I mostly hang out with other psychologists. A colleague recently forwarded me this blog post ("Keep Psychology out of the science club"), which links to a few other similar pieces on blogs and in newspapers. So it seems the issue is alive and well.

Some articles one comes across are of the "psychologists don't do experiments" variety; these are easily explained by ignorance and an inability to use Google. But some folks raise some real concerns which, while I think they are misplaced, really are worth thinking about.

Psychology is too hard

One common theme that I came across is that psychology is simply too difficult. We'll never understand human behavior very well, so maybe we shouldn't even try. For instance, Gary Gutting, writing in the Opinionator at the New York Times, said:
Social sciences may be surrounded by the "paraphernalia" of the natural sciences, such as technical terminology, mathematical equations, empirical data and even carefully designed experiments. But when it comes to generating reliable scientific knowledge, there is nothing more important than frequent and detailed predictions of future events ... while the physical sciences produce many detailed and precise predictions, the social sciences do not ... Because of the many interrelated causes at work in social systems, many questions are simply "impervious to experimentation" ... even when we can get reliable experimental results, the causal complexity restricts us...
In a Washington Post editorial, Charles Lane wrote:
The NSF shouldn't fund any social science. Federal funding for mathematics, engineering and other "hard" sciences is appropriate. In these fields, researchers can test their hypotheses under controlled conditions; then those experiments can be repeated by others. Though quantitative methods may rule economics, political science and psychology, these disciplines can never achieve the objectivity of the natural sciences. Those who study social behavior -- or fund studies of it -- are inevitably influenced by value judgments, left, right, and center. And unlike hypotheses in the hard sciences, hypotheses about society usually can't be proven or disproven by experimentation. Society is not a laboratory.
Alex Berezow at the Newton Blog agrees:
Making useful predictions is a vital part of the scientific process, but psychology has a dismal record in this regard.
Is that a fair critique?

These writers don't entirely miss the mark. It really is true that psychology does not make as precise or as accurate predictions as, say, physics. That is not the same thing as saying that we can't make any predictions. Berezow complains about happiness research:
Happiness research is a great example of why psychology isn't a science. How exactly should "happiness" be defined? The meaning of the word differs from person to person, and especially between cultures. What makes Americans happy doesn't necessarily make Chinese people happy. How does one measure happiness? Psychologists can't use a ruler or a microscope, so they invent an arbitrary scale. Today, personally, I'm feeling about a 3.7 out of 5. How about you? ...  How can an experiment be consistently reproducible or provide any useful predictions if the basic terms are vague and unquantifiable?
That's a great question! Let's start with the facts. It is true that we don't know exactly what it means to be a 3.7 on a scale of 1-5. But we do know a few interesting things.

People's predictions of how happy they will rate themselves in the future are systematically biased. People will say that good things (like getting tenure) will make them very happy (a 5 out of 5) and that bad things (like not getting tenure) will make them very sad (a 1 out of 5), but when you ask those same people to rate their happiness a little while after the event, they generally rate themselves as not nearly so happy or unhappy as they predicted. (Similarly, people who lose a limb usually rate themselves as about as happy afterwards as before, provided you give them a little time to adjust.) People who have children normally see a drop in how happy they rate themselves. They only start to recover after their children leave the nest. There is also the "future anhedonia" effect: people think that good things (e.g., an ice cream sundae) will make them happier now (on our 1-5 scale) than those same good things would make them in the future, and conversely for bad things (e.g., doing my homework won't feel so bad if I do it tomorrow rather than today). And so on. (These and many other examples can be found in Dan Gilbert's Stumbling on Happiness.)

These and other findings are highly reliable, despite the fact that we don't have a direct, objective measurement of happiness. In fact, as Dan Gilbert has pointed out, we would only consider that "direct" measurement to be a measurement of happiness if it correlated really well with how happy people said they were. To the extent it diverged from how happy people claim to be, we would start to distrust the "direct" measurement.

I personally am glad that we know what we know about happiness, though I wish we knew more. I picked happiness to defend because I've noticed that even those who defend psychology in comments sections give up happiness research as a lost cause. I think it's pretty interesting, useful work. It would be even easier to defend, for instance, low-level vision research, which makes remarkably precise predictions, has clear theories of the relationship between the psychological phenomena and the neural implementations, etc. (See also this post for some psychology success stories.)

Just how good do you need your predictions to be?

Still, it is true that we can't always make the precise predictions that can be made in some other fields. Of course, other fields can't always make the precise predictions, either. While physicists are great at telling you what will happen to rigid objects moving through vacuums, predicting the motions of real objects in the real world has been traditionally a lot harder, and understanding fluid dynamics has been deeply problematic (though I understand this has been getting a lot better in recent years). And that's without pulling out the Heisenberg Uncertainty Principle, which should cause anyone who wants precise, deterministic predictions to declare physics a non-science.

Also, some parts of psychology are able to make much more precise predictions than others do. Anything amenable to psychophysics tends to be much more precise, and vision researchers, as already noted, have remarkably well worked-out theories of low- and mid-level vision.

This line of discussion also raises an interesting question: when exactly did physics become a science? Was it a science in Newton's day, when we still knew squat about electromagnetism -- much less elementary particles -- and couldn't make even rough predictions about turbulent air or fluid systems? And to people 350 years from now, will the physics of today seem like a "real" science? (My guess: no.)


Berezow ends his post with the following caution:
To claim [psychology] is a "science" is inaccurate. Actually, it's worse than that. It's an attempt to redefine science. Science, redefined, is no longer the empirical analysis of the natural world; instead, it is any topic that sprinkles a few numbers around. This is dangerous, because, under such a loose definition, anything can qualify as science. And when anything qualifies as science, science can no longer claim to have a unique grasp on secular truth.
I have a different worry. My worry is that someone gets ahold of a time machine, goes back in time to 1661 and convinces Newton to lay off that non-scientific "physics" crap. Pre-Newtonian physics was a hodgepodge of knowledge, little resembling what we think of as science today. Making precise predictions about the messy, physical world we live in no doubt seemed an impossible pipe-dream to many. Luckily, folks like Newton kept plugging away, and three and a half centuries later, here we are.

We should keep in mind that the serious study of the mind only began in the mid-1800s; physics has a significant head-start. And, as the anti-psychology commentators are happy to point out, psychology is much, much harder than physics or chemistry. But the only reason I can see to pull the plug is if we are sure that (a) we have learned nothing in the last 150 years, and (b) we will never make any further progress. These are empirical claims and so subject to test (I think the first one has already been falsified). So here's a proposed experiment: psychologists keep on doing psychology, and people who don't want to don't have to. And we'll wait a few decades and see who knows more about the human mind.

What you missed lately on the Web: 11/12/2012

I've switched the title of these posts from "last week" to "lately", since apparently posting every week is too ambitious (last Monday my excuse was the hurricane + BUCLD, but that wasn't the first time I missed a week).

An elegant defense of prescriptivism
Quoted by Harm*less Drudg*ery, with some additional discussion at Language Log.

Are differences in brain connectivity in autism actually a motion artifact?
Neuroskeptic considers the possibilities.

What is the purpose of a university?
Boston Magazine argues that it is to produce tech start-ups, not lawyers and doctors. Funny, I thought it was to gather, create, and disseminate knowledge, but apparently that is old-school thinking.

Neuroskeptic publishes under own name
Future archaeologists are going to be very confused.

Language Log tries to figure out whether Biden is proud to be vice president.

Open Science Collaboration's first paper

Perspectives on Psychological Science is making history this issue by publishing a paper by a blogger under the blogger's pseudonym (reportedly the first such paper), as well as the first paper by the Open Science Collaboration (to which I am a contributor, so here's to many more!).

The issue, which is currently open access, is focused on issues of replicability. The Open Science Collaboration has a number of goals with respect to changing research practices in psychology. The main project so far has been the Reproducibility Project, which
is an open, large-scale collaborative effort to systematically examine the rate and predictors of reproducibility in psychological science. So far, 72 volunteer researchers from 41 institutions have organized to openly and transparently replicate studies published in three prominent psychological journals in 2008.
This is something pretty close to my heart, which is why I am involved. As my co-author and I pointed out in "Tracking replicability as a method of post-publication open evaluation," for all the concern about replicability in psychology and other sciences, there is remarkably little systematic evidence one way or another (we did our best to thoroughly review the literature; you can check out our findings in the paper). What kinds of reforms we should put in place depend on how bad the problem is. If the problem isn't that bad -- and for all we know, it isn't -- then there is no reason to implement costly, unpleasant reforms.

You can read more about the project in the paper. There is still time for interested researchers to join the project. Just sayin'.

Open Science Collaboration. (2012). An Open, Large-Scale, Collaborative Effort to Estimate the Reproducibility of Psychological Science. Perspectives on Psychological Science, 7 (6), 657-660. DOI: 10.1177/1745691612462588

New Experiment: Ignore That!

**UPDATE: Apparently the examples below didn't display correctly on some computers. I think this is now fixed.**

It can be very hard to ignore irrelevant information. I personally can't work when there is music with English lyrics playing (overheard conversations are difficult, too, so I don't often work in cafes, at least not without ear plugs).

There are a number of classic studies in psychology looking at our ability to ignore distracting information. For instance, suppose that you are asked to identify which direction the arrow in the middle of the sequence below is pointing:

<--  <--  <--

You will typically do that faster and more accurately than you would for the sequence below:

<--  -->  <--

Even though the first and last arrows are irrelevant, they distract you and lead to incorrect responses. The original study (to my knowledge) to demonstrate this effect -- using a slightly different method involving letters rather than arrows -- was Eriksen & Eriksen's 1974 paper cited at the end of this post.
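
To make "faster and more accurately" concrete, here is a minimal sketch of how the cost of the distracting arrows is usually summarized: compare the average response time and accuracy when the outer arrows match the middle one ("congruent") against when they don't ("incongruent"). The trial numbers below are invented purely for illustration and are not data from Eriksen & Eriksen or from any other study; the function and variable names are likewise just placeholders.

```python
from statistics import mean

# Hypothetical trials: (condition, reaction time in ms, answered correctly?)
# These values are made up for illustration only.
trials = [
    ("congruent", 420, True),
    ("congruent", 435, True),
    ("congruent", 410, True),
    ("incongruent", 480, True),
    ("incongruent", 510, False),
    ("incongruent", 495, True),
]


def summarize(trials, condition):
    """Return (mean RT over correct trials, proportion correct) for one condition."""
    subset = [t for t in trials if t[0] == condition]
    correct_rts = [rt for _, rt, ok in subset if ok]
    accuracy = sum(ok for _, _, ok in subset) / len(subset)
    return mean(correct_rts), accuracy


congruent_rt, congruent_acc = summarize(trials, "congruent")
incongruent_rt, incongruent_acc = summarize(trials, "incongruent")

# The distraction cost is the extra time the mismatching arrows add.
flanker_effect_ms = incongruent_rt - congruent_rt

print(f"Slowdown from distracting arrows: {flanker_effect_ms:.1f} ms")
print(f"Accuracy: {congruent_acc:.0%} (matching) vs {incongruent_acc:.0%} (mismatching)")
```

Averaging response times over correct trials only is a common convention, but it is a choice; other analyses handle errors differently.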

White Bears

Another classic study is the "White Bear" study from Daniel Wegner and colleagues. Do the following: For the next five seconds, try not to think about a white bear. 

This turns out to be very difficult to do. Although in general you probably rarely think about white bears, when asked not to do so, it becomes nearly impossible.

A new experiment

I recently posted a new experiment -- Ignore That! -- which investigates another classic "mental control" phenomenon. In it, you will try to answer, as quickly and accurately as possible, which color a word is written in. For instance "hello" is in red. This seems simple enough, but add some distracting information, and it becomes quite difficult. (There are actually two parts to the experiment -- one part uses color and the other uses the direction of an arrow -- but both get at the same phenomenon.)

The experiment takes 5 minutes or less. At the end, you will be able to see your own results and find out just how distracted you were by the distracting information.

Try the experiment here:


Wegner, D., Schneider, D., Carter, S., and White, T. (1987). Paradoxical effects of thought suppression. Journal of Personality and Social Psychology, 53 (1), 5-13 DOI: 10.1037//0022-3514.53.1.5

Eriksen, B., and Eriksen, C. (1974). Effects of noise letters upon the identification of a target letter in a nonsearch task. Perception & Psychophysics, 16 (1), 143-149 DOI: 10.3758/BF03203267

Boston University Conference on Language Development: Day 2

This year marks my 7th straight BUCLD, the major yearly language acquisition conference. See previous posts for my notes on Day 1 and Day 3.

Verbing nouns

Many if not all English nouns can be turned into verbs. The verb's meaning is related to the noun, but not always in the same way. Consider "John milked the cow" and "John watered the garden". In the first case, John extracts a liquid from the cow; in the second, he adds liquid to the garden.

Maybe this is just something we have to learn in each case, but people seem to have strong intuitions about new verbs. Let's say that there is a substance called "dax" that comes from the dax tree. If I were to dax a tree, am I taking dax out of the tree or adding dax to the tree? Most people think the first definition is right. Now let's say there is something called "blick" which is a seasoning that people often add to soup. If I blick some soup, most people think I'm adding blick to the soup, not taking blick out of the soup. (There are other types of noun-derived verbs as well, but they are a topic for another time.)

These examples suggest a hypothesis: if a noun refers to a substance that usually comes from a specific source, then the derived verb probably refers to the action of extracting that substance. If the noun refers to something that doesn't come from any particular source but is often added to things, then the derived verb refers to that process of adding the substance to something.

Mahesh Srinivasan of UCSD presented joint work with David Barner in which they tested this hypothesis. Probably the most informative of the experiments was one with made-up nouns, much like my "dax" and "blick" examples above. Interestingly, while children were pretty sure that "to blick" meant "put blick on something" (the experiment involved several such nouns, and the children had strong intuitions about all of them), they were much less sure what "to dax" (and similar verbs) meant. Other experiments also showed that young children have more difficulty understanding existing substance-extraction noun-derived verbs (to milk/dust/weed/etc.) than substance-adding noun-derived verbs (to water/paint/butter). And interestingly, English has many more of the latter type of verb than the former.

So, as usual, answering one question leads to another. While they found strong support for their hypothesis about why certain noun-derived verbs have the meanings they do, they also found that children find one kind of verb easier to learn than the other, which demands an explanation. They explored a few hypotheses. One has to do with the "goal" bias described in previous work by Laura Lakusta and colleagues: generally, when infants watch a video in which an object goes from one location to another, they pay more attention to and remember better the location the object ended up at than the location it came from. Whatever the answer, learning biases -- particularly in young children -- are interesting because they provide clues as to the structure of the mind.

Verb biases in structure priming

One of the talks most-mentioned among the folks I talked to at BUCLD was one on structural priming by Michelle Peter (with Ryan Blything, Caroline Rowland, and Franklin Chang, all of the University of Liverpool). The idea behind structural priming is that using a particular syntactic structure once tends to lead to using it more again in the future (priming). The structure under consideration here was the so-called dative alternation:

(1) Mary gave a book to John.
(2) Mary gave John a book.

Although the two sentences mean the same thing (maybe -- that's a long post in itself), notice the difference in word order between (1) and (2). The former is called the "prepositional object" structure, and the latter is called the "double object" structure. Some time ago, it was discovered that if people use a given verb (e.g., give) in the prepositional object form once, they are more likely to use that verb in the same form the next time they have to use it (and vice versa for the double object form). More recently, it was discovered that using one verb (e.g., give) in the prepositional object form makes people more likely to use another verb (e.g., send) in that same form (and again vice versa for the double object form). This suggests that the syntactic form itself is represented in some way that is (at least partially) independent of the verb in question, which is consistent with theories involving relatively abstract grammar.

Or maybe not. This has been highly controversial over the last number of years, with groups of researchers (including the Rowland group) showing evidence of what they call a "lexical boost" -- priming is stronger from the same verb to the same verb, which they take as evidence that grammar is at least partly word-specific. Interestingly, they have now found that children do *not* show the same lexical boost (which, if I remember correctly, has been found by other researchers from the "abstract grammar" camp before, but not by those in the "lexically-specific grammar" camp).

This seems consistent with a theory of grammar on which children start out with relatively general grammatical structures, but as you get older you tend to memorize particularly frequent constructions -- thus, as far as processing goes, grammar becomes increasingly lexically-specific as you get older (though the abstract structures are still around in order to allow for productivity). This is the opposite of the speakers' favored theory, on which grammar becomes more abstract as you get older. They did find some aspects of their data that they thought reflected lexically-specific processing in children; it's complex so I won't discuss it here (I didn't have time to get it all down in my notes and don't want to make a mistake).

There was also a talk by Kyae-Sung Park (collaborator: Bonnie D. Schwartz, both of the University of Hawai'i) on the Korean version of the dative alternation, finding that the more common form is learned earlier by second-language learners of Korean. I was interested in finding out more about the structure of Korean, but I don't know the second-language acquisition research well enough to integrate their main findings into the larger literature.

Other studies

There were many other good talks. The ones I saw included a study by Wang & Mintz, arguing that previous studies that looked at the overlap in the contexts in which different determiners occur in child speech -- which had been used to suggest that young children don't have an abstract grammatical category "determiner" -- were confounded by the small size of the corpora used. If you use a similarly small corpus of adult speech, you'd come to the same conclusion. [The analyses were much cooler and more detailed than this quick overview can get across.]
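
The corpus-size point is easy to see in a toy simulation. The sketch below is not Wang & Mintz's analysis; it is a minimal illustration under assumptions of my own choosing: a simulated speaker who can freely use either "a" or "the" with every one of 500 nouns (that is, who has a fully abstract determiner category), noun frequencies that fall off as 1/rank, and a simple overlap measure (the share of observed nouns that show up with both determiners). The function name and all parameter values are hypothetical.

```python
import random


def determiner_overlap(n_tokens, n_nouns=500, seed=0):
    """Share of observed noun types that occur with BOTH 'a' and 'the'.

    The simulated speaker picks a noun (frequent nouns more often) and
    then picks a determiner at random, so every noun is fully compatible
    with both determiners. Any shortfall in overlap is a sampling artifact.
    """
    rng = random.Random(seed)
    nouns = list(range(n_nouns))
    # Assumed frequency distribution: weight proportional to 1 / rank.
    weights = [1 / (rank + 1) for rank in nouns]

    seen = {"a": set(), "the": set()}
    for noun in rng.choices(nouns, weights=weights, k=n_tokens):
        seen[rng.choice(["a", "the"])].add(noun)

    both = seen["a"] & seen["the"]
    either = seen["a"] | seen["the"]
    return len(both) / len(either)


small = determiner_overlap(n_tokens=200)      # a small sample of speech
large = determiner_overlap(n_tokens=50_000)   # a much larger sample

print(f"Overlap in small corpus: {small:.2f}")
print(f"Overlap in large corpus: {large:.2f}")
```

Because the same "speaker" produces both samples, the lower overlap in the small sample can only come from sample size, which is the shape of the confound described above.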

Lakusta, L., Wagner, L., O'Hearn, K., and Landau, B. (2007). Conceptual Foundations of Spatial Language: Evidence for a Goal Bias in Infants. Language Learning and Development, 3 (3), 179-197 DOI: 10.1080/15475440701360168