Field of Science


Eighteen papers on replication

Last month, Perspectives on Psychological Science published a special section on replicability in psychological science. With 18 papers, it took me a while to get through all of them, so I am only blogging about them now.

The issue contains many useful articles, particularly if you have not been following the replicability crisis/discussion carefully. I have been following the discussion pretty closely, and I still found a lot worth reading (a particularly enjoyable surprise was Giner-Sorolla's historical review of replicability problems, which is absolutely fascinating).

So I hope you check the issue out. It is currently open access. But if you'd rather have someone else read it first and summarize the main points ... then lucky for you, I've done just that. Below, I describe and discuss all 18 papers, though it should be said that these descriptions are through the lens of what I happen to find particularly interesting or novel. There are likely many other things in these papers that would be of interest to you. So check them out.

Bookends

The issue is bookended by two overviews. Harold Pashler and Eric-Jan Wagenmakers get the issue started with a brief review of recent scandals, failures to replicate, and other reports -- both in psychology and in other fields -- that replicability may be an issue. John Ioannidis, Legend of Replicability, has a contribution at the end of the issue ("Why science is not necessarily self-correcting") which similarly summarizes many of the main arguments of the issue. If you haven't been following along, these are good primers. Neuroskeptic ("The nine circles of scientific hell") makes many of the same arguments, but in the satirical style of Dante Alighieri (I only wish it was in rhyme).

Is there a replicability crisis?

Pashler and Harris ("Is the replicability crisis overblown?") address three arguments sometimes given that "the replicability crisis is overblown". The first is that the low alpha level (5%) ensures that, of published results, most are right. They show that this is most likely incorrect, given current publishing practices. The second is that even if not many direct replications are published, many "conceptual" replications are. They argue that the notion of "conceptual replication" is so fuzzy that it actually compounds the problem (one problem: there is no such thing as a failure to conceptually replicate):
We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling "pathological science" cases that embarrassed the natural sciences... What many observers found peculiar in these cases [like cold fusion] was that it took many years for a complete consensus to emerge that the phenomena lacked any reality... It appears that many exact replication attempts of the initial [cold fusion studies] were undertaken soon after the first dramatic reports of cold fusion. Such attempts produced generally negative results (Taubes, 1993). However, what kept faith in cold fusion alive for some time ... was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications).
Finally, they take issue with the claim that over the long run, science is self-correcting, given the lack of evidence that this is true (as far as I can tell, it's not even clear how you would show that science is self-correcting without rigorous, systematic studies of replicability).
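To make their first point concrete, here is a back-of-the-envelope sketch of the kind of calculation involved. The prior, power, and alpha values are illustrative assumptions of mine, not numbers from Pashler and Harris:

```python
# Sketch of the "alpha = .05 guarantees most published results are right" argument.
# The prior and power values below are illustrative assumptions, not figures
# from Pashler and Harris.

def positive_predictive_value(prior, power, alpha=0.05):
    """Proportion of significant results that reflect true effects,
    assuming only significant results get published."""
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.25, 0.1):      # fraction of tested hypotheses that are true
    for power in (0.8, 0.35):       # well-powered vs. typically powered studies
        ppv = positive_predictive_value(prior, power)
        print(f"prior={prior:.2f}, power={power:.2f}: "
              f"{ppv:.0%} of significant results reflect true effects")
```

With optimistic assumptions the published literature looks fine; with more pessimistic but arguably realistic ones (few tested hypotheses true, modest power), a substantial fraction of significant results are false positives -- and that is before adding selective reporting.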

While probably not everyone agrees with all the specifics of Pashler and Harris's argument, there does appear to be general consensus that there is a real replicability problem. 1,292 psychologists around the world surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") believed, on average, that 53% of the results in psychology could be replicated in direct replications. (One word of caution: it may be that psychologists most concerned about replication were also disproportionately likely to agree to participate in the survey; still, 1300 psychologists is a significant fraction of the field.)

What's the best way of getting a significant result?

Ideally, the answer would be "test for a real effect", but that of course requires that you know whether your hypothesis is true or not, which presumably you don't (otherwise, why run the study?). Bakker, van Dijk and Wicherts ("The rules of the game called psychological science") show that, as a general rule, you are better off running a few underpowered studies than one large, sufficiently-powered study (particularly if any real effect sizes are small). That is, the more likely it is that your results aren't true, the more likely you are to have a publishable finding. This is bad news when coupled with the fact that humans are good at finding efficient strategies. (Note that this doesn't require bad actors: One can simply find that certain procedures -- like testing around X subjects -- have generally produced "good", publishable research in the past.)
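Here is a rough sketch of the incentive they describe. The effect size and sample sizes are assumptions of mine, chosen so that one large study and five small ones cost the same total number of participants:

```python
from scipy import stats

def power_two_sample(n_per_group, d, alpha=0.05):
    """Approximate power of a two-sided, two-sample t-test (normal approximation)."""
    se = (2 / n_per_group) ** 0.5
    z_crit = stats.norm.ppf(1 - alpha / 2)
    return stats.norm.sf(z_crit - d / se)

d = 0.2  # a small true effect
one_big = power_two_sample(n_per_group=125, d=d)     # one study, 250 participants total
one_small = power_two_sample(n_per_group=25, d=d)    # each of five studies, 50 participants
at_least_one = 1 - (1 - one_small) ** 5

print(f"One large study: P(significant result)          = {one_big:.2f}")
print(f"Five small studies: P(at least one significant) = {at_least_one:.2f}")
print(f"Five small studies, effect actually zero: "
      f"P(at least one false positive) = {1 - 0.95 ** 5:.2f}")
```

Under these assumptions, the five cheap, underpowered studies are more likely to yield something publishable than the single well-designed one -- and they produce at least one false positive about a quarter of the time even when there is nothing to find.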

The bottleneck

Giner-Sorolla ("Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science") argues that a certain amount of bad science is expected when there are too few publication outlets:
Articles cannot pass through just by showing theoretical meaning and methodological rigor; their results must appear to support the hypothesis perfectly... Imagine that two colleagues in psychology each show you a manuscript. One has two studies with roughly equal numbers of participants. Both studies support the hypothesis, each with a significant key result at p=.04. The other paper has three studies, also supporting the hypothesis, but the last two studies' individual results are only near significant: p=.02, p=.07, and that most annoying figure, p=.11.
Although the second paper actually presents stronger statistical evidence, it should be obvious that the first one is more likely to be published.
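To put a number on that intuition (my own illustration, not an analysis from Giner-Sorolla's article), we can pool each paper's p-values with Fisher's method:

```python
# Pooling the hypothetical papers' p-values with Fisher's method.
from scipy.stats import combine_pvalues

papers = {
    "Paper 1": [0.04, 0.04],          # two studies, both just under .05
    "Paper 2": [0.02, 0.07, 0.11],    # three studies, two "near misses"
}

for name, pvals in papers.items():
    stat, p_combined = combine_pvalues(pvals, method="fisher")
    print(f"{name}: combined p = {p_combined:.4f}")

# Paper 2's combined p (about .007) is smaller than Paper 1's (about .012),
# even though Paper 1 "looks" cleaner, with every study significant on its own.
```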

There are two ways of having perfect-looking results, and unfortunately one is a lot easier than the other. Again, this doesn't require bad actors -- just competent ones. Giner-Sorolla's solution is not to make publication easier -- various reasons are given for why that is unlikely to work -- but rather to develop an alternative method for distributing less "pretty" results to those who might be interested (such as those doing meta-analysis). 

This summary does not do the article justice; among other things, it also includes a fascinating history of hypothesis testing, touching on a previous crisis of confidence in psychology in the 1970s.

Too good to be true

Francis ("The psychology of replication and the replication of psychology") takes the point a bit further:  our love of "pretty" data has led us to expect data that are too good to be true. Most psychology studies have low power to detect an effect; thus, presenting several studies in a single paper all of which show the same significant effect is highly suspicious. He goes through eight experiments recently published in Galak & Meyvis (2011), seven of which show significant results despite their apparent power ranging from about 0.25 to 0.8 (that is, between a 25% and 80% chance of detecting the effect). The likelihood of this happening even if the hypothesis was true was about 0.08. (This is actually an underestimate: the one "non-signifcant" experiment had a p-value of 0.056, which many authors would count as significant -- which, again, makes the reported findings even less likely.)

Francis goes out of his way to say that this does not necessarily represent malicious intent:
Publication bias could occur in a set of experiments because authors closely follow the standards of the field. Indeed, the experiments reported in Galak and Meyvis (2011) appear to meet (and often exceed) the standards of experimental psychology.
Francis argues that changing the aesthetic can be accomplished by changing what results we report. We should report effect sizes and worry less about p-values. I've always been somewhat skeptical of effect size cheerleaders, but Francis put the argument in a new way that (at least for the moment) makes sense to me:
Given that almost every empirical study in psychology currently uses hypothesis testing, it may seem bizarre to claim that effect sizes are more important than the outcome of hypothesis tests... [But] surely, the theory of Galak and Meyvis (2011) does not predict that every experiment testing the theory will reject the null (e.g., even with sample sizes of, say, n = 3). The authors probably meant that if the theory were true, then an experiment with a large enough sample would reject the null hypothesis. But the definition of "large enough sample" is determined by the magnitude of the effect size ... you still have to focus on effect sizes in order to predict the outcome of experiments.
Measuring and reporting effect sizes would highlight when data are "too pretty", making such data actually less attractive, mitigating some of the factors encouraging publication bias. It would also lead to better research. Francis then discusses some ways to easily detect publication bias.

This paper generated a lot of attention, and two commentaries were published alongside it. Simonsohn ("It does not follow") notes that Francis had to look at many papers to find any that show evidence of publication bias but does not correct for multiple comparisons. That is, Francis's test for the file-drawer effect must itself take into account the file-drawer effect; you'd want to see that, in a sample of papers, more show evidence of a file-drawer effect than you would expect by chance. In other words, it is possible that Galak and Meyvis (2011) failed the test through pure random chance alone. This argument is somewhat weakened by Galak and Meyvis's admission ("You could have just asked") that they had indeed excluded several experiments that had not reached significance. However, the larger point stands.
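Simonsohn's point is a standard multiple-comparisons worry, and a little arithmetic shows why it matters. The numbers below are assumptions of mine for illustration -- I don't know how many papers Francis screened or the exact false-positive rate of his test:

```python
from scipy.stats import binom

flag_rate = 0.10   # assumed chance the bias test flags a perfectly unbiased paper
n_papers = 20      # assumed number of papers screened

print("Expected papers flagged by chance alone:", flag_rate * n_papers)
print("P(at least one unbiased paper gets flagged):",
      round(1 - binom.pmf(0, n_papers, flag_rate), 2))
```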

Both Simonsohn and Galak & Meyvis push back against Francis's argument that if a paper has been shown to suffer from the file-drawer effect (unpublished null results), the paper should be ignored altogether. Just because the file-drawer effect has inflated the observed effect size, that doesn't mean the true effect size is 0. Simonsohn points out that there are meta-analytic ways to correct for the file-drawer effect. Galak and Meyvis argue that not publishing their null results is inconsequential, because obtaining several positive results by chance alone is simply too unlikely. Whether this is true, they acknowledge, depends on whether false-positive rates are inflated overall (as many of the other authors in this issue argue).

Can meta-analyses save us?

Ferguson and Heene ("A vast graveyard of undead theories") focus on the problems for meta-analyses posed by publication bias. Obviously, meta-analyses of biased data will be biased meta-analyses (junk in - junk out). They take a look at some ways researchers have tried to de-bias meta-analyses, such as including unpublished data. The problem is that the selection of unpublished data itself is often biased, and they show that in some cases this actually makes the problem worse.

They also take on an apparently common argument, which is that if several published studies have shown an effect, you'd need there to be a huge number of studies showing no effect in the proverbial file drawer in order to wash out the significant studies.
Now let us assume we assembled 50 studies in a meta-analysis with a mean [effect size] of 2.0. In this case, the FSN [a statistic indicating the number of null results needed to cancel out the significant ones] would yield a value of 6,854.
However, this statistic makes the incorrect assumption that if all the published studies were false positives, the mean effect size of the remainder would be 0.
In fact, if only 5% of studies that show Type I errors were published, the mean [effect size] of the remaining unpublished studies cannot be zero but must be negative... The FSN turns out to be a gross overestimate...
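A quick simulation makes the point (my own sketch, not Ferguson and Heene's analysis): assume the true effect is zero, many labs run the study, and only results landing in the significant positive tail get published.

```python
import numpy as np

rng = np.random.default_rng(0)
# 100,000 studies of a truly null effect; observed effect sizes are pure sampling noise
observed_d = rng.normal(loc=0.0, scale=0.3, size=100_000)
published = observed_d > 1.645 * 0.3       # roughly the one-tailed .05 cutoff here
unpublished = observed_d[~published]

print(f"Mean effect among published studies:   {observed_d[published].mean():.3f}")
print(f"Mean effect among unpublished studies: {unpublished.mean():.3f}  (negative, not zero)")
```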
Despite the math, the best evidence for their claims comes from a plausibility argument:
It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing meta-analysis of their own [long list of citations].
Participants as passive receptacles of stimuli

Klein and colleagues ("Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments") write:
Though in general there may be many reasons why a study fails to replicate (e.g., lack of statistical power...), we suggest that the central issue is a widespread failure to acknowledge, recognize, and document the circumstances and social context in which an experiment takes place.
In theory, if aspects of social context (e.g., whether the experimenter was male or female) are relevant for the effect but not documented, that part of the method cannot be replicated. But more generally, the failure to document important aspects of the design may suggest -- to put it bluntly -- incompetent experimenters, in which case a failure to replicate is no surprise. For instance, a significant percentage of studies using deception (published in premier journals) did not include an "awareness probe" (basically, asking participants whether they were aware of the deception). This is problematic because sometimes the deception is too obvious and participants figure it out, invalidating the results of the study.

Their proposed solutions mostly amount to "do good studies". For instance, if experimenter bias is a concern, use double-blind procedures. If you are worried about demand characteristics, include relevant probes and controls. These are things that should be taught in every intro methods class but -- according to the authors' research -- are often neglected even in prestigious journals.

Who should be doing replications?

Given that researchers are already over-loaded with responsibilities, if more replications are to be conducted, who is going to conduct them? Frank and Saxe ("Teaching replication") argue that undergraduate and graduate research methods classes should. Students in these classes already have to conduct experiments, but usually they are replications of classic experiments (like Stroop) or novel experiments thought up by the students. Either kind is usually boring: the first because the answer is already known, and the second because the students don't know enough to think up an interesting, novel experiment that makes contact with cutting-edge research. Replicating recent, cutting-edge studies solves both problems. Frank and Saxe acknowledge that student replications are sometimes not done well enough to be informative, but in their experience (both have taught such classes) they are informative around half of the time.

Grahe and colleagues ("Harnessing the undiscovered resource of student research projects") likewise argue that undergraduates are capable of doing useful research, and given that there are many undergraduates out there taking many methods courses, quite a lot of useful work could be collectively accomplished by them.


How often are replications published in psychology?

Makel, Plucker, and Hegarty ("Replications in psychology research: How often do they really occur?") find -- probably to nobody's surprise -- that replications are rarely published. Looking at all the articles published by 100 psychology journals since 1900, just under 2% of the articles even mention "replication" in the text. Based on further investigation, they estimate that only 1% of articles are actual replications, most of them successful ones. The good news is that the rate has been rising steadily in recent decades. The bad news is that the vast majority were conceptual replications, not direct replications. Worse news: replications, whether successful or not, were cited on average 17 times, compared with an average of 64.5 times for the articles replicated. (Interestingly, they find that some papers do include failures to conceptually replicate -- despite what most people would expect -- though these are nonetheless fairly rare.)

Incentivizing replication and replicability

Nosek, Spies and Motyl ("Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability") note, as have others, that there isn't much incentive to engage in activities that would improve the replicability rate. In a succinct article, they list a number of strategies that they think are unlikely to work -- conceptual replication, waiting for science to "self-correct", establishing journals devoted to publishing replications or negative results, education campaigns, increasing expectations of reviewers, and raising the barrier for publication. The reasons why are straightforward and can be found in the article.

They also offer several strategies they think will work. For instance, they recommend simple checklists, which "ensure disclosure of obvious items that are sometimes forgotten: sample sizes, statistical tests, effect sizes, and covariates." While this might seem similar to the discarded strategies of "raising expectations" and "educational campaigns", they argue that checklists are an exception, pointing to research in medicine showing that checklists (unlike raised expectations and educational campaigns) are effective.

Other proposals include establishing metrics to identify what is worth replicating, thereby making replications of these papers seem important enough (to journal editors) to publish, crowd sourcing replication efforts, and establishing journals (like PLoS ONE) with review standards focused on the soundness, not importance, of the research. While this latter approach may seem like a journal signing its own death warrant, they point to data suggesting that journals aren't actually very good at figuring out what papers will or won't be cited in the future. They also note that making data, materials, and the workflow itself more open and transparent will make it easier to re-use data and materials, with additional replications as an inevitable, positive side effect.

Koole and Lakens ("Rewarding replications") focus more squarely on the lack of incentive to conduct replications because they are rarely published and rarely cited when published. Their proposed solution is that when journals publish an original finding, they agree to publish future replications as an online supplement. They then could create a meta-reference that refers to the original paper and subsequent replications, which researchers could then cite, increasing the citations received by the replications (since they are cited automatically).  This has the nice advantage of working within the existing infrastructure (always easier than starting from scratch). The only remaining issue is how to convince the existing infrastructure (i.e., the journals) to participate, given that it will cost them a lot and there isn't any clear advantage to participation (some would argue that it's actually against the journal's interest to participate). There is also the first-mover disadvantage: the system only becomes particularly useful once many journals participate.

Confirmatory research

Wagenmakers, Wetzels, Borsboom, van der Maas and Kievit ("An agenda for purely confirmatory research") argue that researchers should be required to pre-register their methods and intended analyses prior to running a study. The idea is to make it clear which results are and are not exploratory. The concern is that exploratory analyses are often circular (you look at the data, develop a hypothesis, and then test the hypothesis on those same data), which increases the likelihood of false positives.

Pre-registration obviously requires extra work, both on the part of the scientists and on the part of the people who maintain the registry, so one can reasonably ask whether the gain is worth the cost. First, we need to know whether the problem itself is common (there is no point in investing in this infrastructure to combat a non-existent problem). Wagenmakers and colleagues argue that it is, but unconvincingly:
Almost no psychological research is conducted in a purely confirmatory fashion (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001).
As far as I can tell, Wagenmakers et al. (2011) speculate that confirmatory research is rare but offer no data (at least none that I noticed when skimming). Kerr (1998) did conduct a survey of 156 social psychologists, clinical/community psychologists, and sociologists, finding that confirmatory research was reported as occurring most frequently, though not significantly more often than some forms of circular research. Even if we assume that these results generalize to all of psychology (not just social and clinical), that is a far cry from "almost no" confirmatory research.

But let's suppose for the moment that non-confirmatory (that is, exploratory) research is very common. Is that a problem? As I already mentioned, Wagenmakers and colleagues are worried about circular analyses. What they propose tracking is exploratory analyses, which are not necessarily the same thing. Suppose I run a study, and in review, a reviewer asks, "Was there a difference between the male and female participants?" (This is not a made-up example.) Well, I hadn't ever thought about that question, so I never checked. Now I check, and either there is or there isn't. This is not a circular analysis; retroactively deciding that this is an interesting question should not affect the results (unless you believe in ESP, which Wagenmakers and colleagues clearly do not).

In the absence of additional information, then, the fact that an analysis was exploratory tells you little about whether it is circular, which is the real issue. Knowing that an analysis was exploratory rather than planned is only evidence of circularity if a large percentage of exploratory analyses are in fact circular. I don't know whether that is the case; there doesn't appear to be any data. But even if there were, that would only give you a base rate -- it wouldn't settle the issue for any particular analysis.

Enough smart people think that pre-registration would be a good idea that I'm willing to suppose that it is. I just don't see anything in this particular article to suggest that the price is worth the benefit. One last thought: Wagenmakers and colleagues argue that researchers don't fully appreciate the distinction between planned and circular analyses. If so, would marking certain analyses as planned and certain analyses as circular help?

What should we do? 

The 1,292 psychologists surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") were also asked about specific suggested reforms (proposed in a recent paper, not by them). I summarize the results below:

Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article:
should be good practice: 60%
should be condition of publication: 46%

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification:
should be good practice: 47%
should be condition of publication: 30%  [presumably none of these were psychophysicists]

Authors must list all variables collected in a study:
should be good practice: 58%
should be condition of publication: 46%

Authors must report all experimental conditions, including failed manipulations:
should be good practice: 80%
should be condition of publication: 66%

If observations are eliminated, authors must also report what the statistical results are if those observations are included:
should be good practice: 69%
should be condition of publication: 52%

If analysis includes a covariate, authors must report the statistical results of the analysis without the covariate:
should be good practice: 67%
should be condition of publication: 50%

The survey also asked about four suggested guidelines; for space reasons, I have left them out. You can find them in the paper.

Nobody asked my opinion, but here it is anyway. I think having and reporting a termination rule is relatively costless and helpful in certain situations, so I support it as good practice and wouldn't mind if it were a condition of publication. The question about 20 observations per cell ... do we mean 20 items or 20 subjects or both? I suspect the authors were thinking of studies with a single item per condition per subject (which are common in some areas of psychology but essentially non-existent in others; Fuchs, Jenny & Fiedler themselves note the difficulty of finding one set of standards appropriate to all subdisciplines of psychology), so it's hard to know how this is meant to be applied. In any case, I would prefer a standard based on expected power, which is more important (though harder to define).
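For what it's worth, here is a sketch of what a power-based standard might look like in place of "20 per cell" (the 80% power target and the conventional effect-size benchmarks are my own choices, not anything from the survey):

```python
from scipy.stats import norm

def n_per_group(d, power=0.8, alpha=0.05):
    """Approximate per-group sample size for a two-sided, two-sample t-test
    (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return 2 * ((z_alpha + z_beta) / d) ** 2

for d in (0.8, 0.5, 0.2):   # conventional "large", "medium", and "small" effects
    print(f"d = {d}: about {n_per_group(d):.0f} participants per group")
```

The point is simply that a sensible minimum depends on the expected effect size: roughly 25 per group for a large effect, but closer to 400 per group for a small one.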

I strongly support listing all variables collected and all experimental conditions, though this quickly runs up against the word limits of many high-impact journals (which I would be happy to do away with; to paraphrase Einstein, papers should be as short as possible, but not shorter). One issue: what counts as an experimental condition is not always clear, particularly if you are doing between-subject manipulations (is this data set a "condition" or a separate experiment?). So I don't know how to enforce these as conditions of publication.

As far as covariates, certain kinds of measures make no sense without the covariate, so in some cases you'd be adding irrelevant, misleading information. Leaving this up to reviewers (the current system) seems perfectly reasonable to me.

Learning the passive

If Microsoft Word had its way, passive verbs would be excised from the language. That would spare children some trouble, because passive verbs are more difficult to learn than one might think: not all verbs passivize. Consider:

*The bicycle was resembled by John.
*Three bicycles are had by John.
*Many people are escaped by the argument.


The bicycle was resembled by John: A how-to guide.

So children must learn which verbs have passives and which don't. I recently sat down to read Pinker, Lebeaux and Frost (1987), a landmark study of how children learn to passivize verbs. This is not a work undertaken lightly. At 73 pages, Pinker et al. (1987) is not Steve Pinker's longest paper -- that honor goes to his 120-page take-down of Connectionist theories of language, Pinker and Prince (1988) -- but it is long, even for psycholinguistics. It's worth the read, both for the data and because it lays out the core of what became Learnability and Cognition, one of the books that has had the most influence on my own work and thinking.

The Data


The authors were primarily interested in testing the following claim: that children are conservative learners and only passivize verbs that they have previously heard in the passive. This would prevent them from over-generating passives that don't exist in the adult language.

First, the authors looked at a database of transcriptions of child speech. A large percentage of the passive verbs they found were passives the children couldn't possibly have heard before because they aren't legal passives in the adult language:

It's broked? (i.e., is it broken?)
When I get hurts, I put dose one of does bandage on.
He all tieded up, Mommy.

Of course, when we say that the child couldn't have heard such passives before, we can't really be sure what the child heard; it just seems unlikely. To more carefully control what the child had heard, the authors taught children of various ages (the youngest group was 4 years old) made-up verbs. For instance, they might demonstrate a stuffed frog jumping on top of a stuffed elephant and say, "Look, the frog gorped the elephant." Then they would show the elephant jumping on top of a mouse and ask the child, "What happened to the mouse?"

If you think "gorp" has a passive form, the natural thing to do would be to say "The mouse was gorped by the elephant." But a child who only uses passive verbs she has heard before would refuse to utter such a sentence. However, across a range of different made-up verbs and across four different experiments, the authors found that children were willing -- at least some of the time -- to produce these new passive verbs. (In addition to production tests, there were also comprehension tests where the children had to interpret a passivization of an already-learned verb.)

Some Considerations

These data show conclusively that children are not completely conservative, at least not by 4 years of age (there has been a lot of debate more recently about younger children). Given what we know now, the conservative-child theory had to be wrong -- again, at least for 4-year-olds -- but it's worth remembering that at the time, this was a serious hypothesis.

There is a lot of other data in the paper. Children are more likely to produce new passive forms as they get older (higher rates for 5 year-olds than 4 year-olds). They taught children verbs where the agent is the object and the patient is the subject (that is, where The frog gorped the elephant means "the elephant jumped on top of the frog"). Children had more difficulty passivizing those verbs. However, a lot of these additional analyses are difficult to interpret because of the small sample sizes (16 children and only a handful of verbs per experiment or sub-experiment).

Theory 

Fair warning: the rest of this post is pretty technical.

What excites me about this paper is the theoretical work. For instance, the authors propose a theory of linking rules that is strongly constrained innately yet still allows some language-by-language variation.
The linkages between individual thematic roles in thematic cores and individual grammatical functions in predicate-argument structures is in turn mediated by a set of unmarked universal linking rules: agents are mapped onto subjects; patients are mapped onto objects; locations and paths are mapped onto oblique objects. Themes are mapped onto any unique grammatical function but can be expressed as oblique, object or subject; specifically, as the 'highest' function on that list that has not already been claimed by some other argument of the verb.
With respect to passivization, what is important is that only verbs which have agents as subjects are going to be easily passivized. The trick is that what counts as an 'agent' can vary from language to language.
It is common for languages to restrict passivized subjects to patients affected by an action ... The English verbal passive, of course, is far more permissive; most classes of transitive verbs, even those that do not involve physical actions, have the privilege of passivizability assigned to them. We suggest this latitude is possible because what counts as the patient of an action is not self-evident ... Languages have the option of defining classes in which thematic labels are assigned to arguments whose roles abstractly resemble those of physical thematic relations...
This last passage sets up the core of the theory to be developed in Learnability and Cognition. Children are born knowing that certain canonical verbs -- ones that very clearly have agents and patients, like break -- must passivize, and that a much larger group of verbs in theory might passivize, because they could be conceived of as metaphorically having agents and patients. What they have to learn is which verbs from that broader set actually do passivize. Importantly, verbs come in classes of verbs with similar meanings. If any verb from that set passivizes, they all will.

This last prediction is the one I am particularly interested in. A later paper (Gropen, Pinker, Hollander, Goldberg & Wilson, 1989) explored this hypothesis with regard to the dative alternation, but I don't know of much other work. In general, Learnability and Cognition got less attention than it should have, perhaps because by the time it was published, the Great Past Tense Debate had already begun. I've often thought of continuing this work, but teaching novel verbs to children in the course of an experiment is damn hard. Ben Ambridge has recently run a number of great studies on the acquisition of verb alternations (like the passive), so perhaps he will eventually tackle this hypothesis directly.

----
Pinker, S., Lebeaux, D.S., & Frost, L.A. (1987). Productivity and constraints in the acquisition of the passive. Cognition, 26(3), 195-267. PMID: 3677572

Universal Grammar is dead. Long live Universal Grammar.

Last year, in a commentary on Evans and Levinson's "The myth of language universals: Language diversity and its importance for cognitive science" in Behavioral and Brain Sciences (a journal that publishes one target paper and dozens of commentaries in each issue), Michael Tomasello wrote:
I am told that a number of supporters of universal grammar will be writing commentaries on this article. Though I have not seen them, here is what is certain. You will not be seeing arguments of the following type: I have systematically looked at a well-chosen sample of the world's languages, and I have discerned the following universals ... And you will not even be seeing specific hypotheses about what we might find in universal grammar if we followed such a procedure.
Hmmm. There are no specific proposals about what might be in UG... Clearly Tomasello doesn't read this blog much. Granted, for that he should probably be forgiven. But he also clearly hasn't read Chomsky lately. Here's the abstract of the well-known Hauser, Chomsky & Fitch (2002):
We submit that a distinction should be made between the faculty of language in the broad sense (FLB) and in the narrow sense (FLN). FLB includes a sensory-motor system, a conceptual-intentional system, and the computational mechanisms for recursion, providing the capacity to generate an infinite range of expressions from a finite set of elements. We hypothesize that FLN only includes recursion and is the only uniquely human component of the faculty of language.
Later on, HCF make it clear that FLN is another way of thinking about what is elsewhere called "universal grammar" -- that is, the constraints that make language learning possible.

Tomasello's claim about the other commentaries (that they won't make specific claims about what is in UG) is also quickly falsified, and by the usual suspects. For instance, Steve Pinker and Ray Jackendoff devote much of their commentary to describing grammatical principles that could be -- but aren't -- instantiated in any language.

Tomasello's thinking is perhaps made clearer by a comment later in his commentary:
For sure, all of the world's languages have things in common, and [Evans and Levinson] document a number of them. But these commonalities come not from any universal grammar, but rather from universal aspects of human cognition, social interaction, and information processing...
Thus, it seems he agrees that there are constraints on language learning that shape what languages exist. This, for instance, is the usual counter-argument to Pinker and Jackendoff's nonexistent languages: those languages don't exist because they're really stupid languages to have. I doubt Pinker or Jackendoff is particularly fazed by that critique, since they are interested in constraints on language learning, and this proposed Stupidity Constraint is still a constraint. Even Hauser, Chomsky and Fitch (2002) allow for constraints on language that are not specific to language (that's their FLB).

So perhaps Tomasello fundamentally agrees with people who argue for Universal Grammar, and this is just a terminology war. They call fundamental cognitive constraints on language learning "Universal Grammar", while he uses the term to refer to something else: for instance, proposals about specific grammatical rules that we are born knowing. His claim, then, is that nobody has any proposals about such rules.

If that is what he is claiming, that is also quickly falsified (if it hasn't already been falsified by HCF's claims about recursion). Mark C. Baker, by the third paragraph of his commentary, is already quoting one of his well-known suggested language universals:
(1) The Verb-Object Constraint (VOC): A nominal that expresses the theme/patient of an event combines with the event-denoting verb before a nominal that expresses the agent/cause does.
And I could keep on picking examples. For those outside of the field, it's important to point out that there wasn't anything surprising in the Baker commentary or the Pinker and Jackendoff commentary. They were simply repeating well-known arguments they (and others) have made many times before. And these are not obscure arguments. Writing an article about Universal Grammar that fails to mention Chomsky, Pinker, Jackendoff or Baker would be like writing an article about major American cities without mentioning New York, Boston, San Francisco or Los Angeles.

Don't get me wrong. Tomasello has produced absurd numbers of high-quality studies and I am a big admirer of his work. But if he is going to make blanket statements about an entire literature, he might want to read one or two of the papers in that literature first.

-------
Tomasello, M. (2009). Universal grammar is dead. Behavioral and Brain Sciences, 32(5). DOI: 10.1017/S0140525X09990744

Evans, N., & Levinson, S. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32(5). DOI: 10.1017/S0140525X0999094X

Hauser, M.D., Chomsky, N., & Fitch, W.T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598), 1569-1579. PMID: 12446899

Baker, M. (2009). Language universals: Abstract but not mythological. Behavioral and Brain Sciences, 32(5). DOI: 10.1017/S0140525X09990604

Pinker, S., & Jackendoff, R. (2009). The reality of a universal language faculty. Behavioral and Brain Sciences, 32(5). DOI: 10.1017/S0140525X09990720

The verbal memory hegemony

One fact about the world is that the most famous memory researchers did most of their work on verbal memory. Alan Baddeley and George Miller both come to mind -- and I doubt anybody can think of more famous memory researchers in the last 50 years.

Another fact about the world is that many researchers -- not necessarily Baddeley or Miller -- have assumed that anything discovered using memory tests involving words should apply to other forms of memory as well. To pick unfairly on one person, Cowan notes in his masterful paper "The magical number 4 in short-term memory" that out of several related experiments, one has results that diverge from the others. Cowan attempts an explanation but basically throws up his hands. He doesn't notice that of all the experiments discussed in that section, the divergent one was the only one to use visual rather than verbal stimuli.

Similarly, a reviewer of my paper which just came out complained that the results reported in the paper only "told us things we already knew." As evidence, the reviewer cited a number of other papers, all of which had investigated verbal rather than visual short-term memory.

As it happens, the results in this case were very similar to what had been reported previously for verbal memory. But it could have come out differently. That was the point of doing the experiment.

Partly because of this bias in favor of verbal materials, not enough is known about visual memory, though this has been changing in recent years, thanks in part to folks like Steve Luck, George Alvarez, Yuhong Jiang, Edward Vogel and several others.

----------
Cowan, N. (2001). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87-185.

Hartshorne, J.K. (2008). Visual working memory capacity and proactive interference. PLoS ONE.

Miller, G.A. (1956). The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63, 81-97.