Field of Science

Chemistry has its own problems with replication, according to Nature:
Scrounging chemicals and equipment in their spare time, a team of chemistry bloggers is trying to replicate published protocols for making molecules. The researchers want to check how easy it is to repeat the recipes that scientists report in papers ... Among the frustrations [chemists] have experienced with the chemical literature ... are claims that reactions yield products in greater amounts than seem reasonable, and scanty detail about specific conditions in which to run reactions. In some cases, reactions are reported which seem too good to be true - such as a 2009 paper which was corrected within 24 hours by web-savvy chemists live-blogging the experiment.
It's hard to tell from the article how common it is for a reaction simply not to be possible at all, as opposed to simply producing less product than reported. Presumably either is problematic, but the causes would be different.

Given the recent excitement about (non-)replication, one has to wonder if this problem is more or less common than in the past. While my gut instinct is that replication was probably less of a problem in the earlier, smaller days of science, it's also quite possible that it's like many forms of violent crime: extremely rare today by historical standards, but we care much more about it.

What makes psychology and neuroscience hard

Explained by today's XKCD:



Ambrose Bierce pointed out the same problem in his 1911 satirical dictionary (The Devil's Dictionary):
Mind, n. A mysterious form of matter secreted by the brain. Its chief activity consists in the endeavor to ascertain its own nature, the futility of the attempt being due to the fact that it has nothing but itself to know itself with. 

Fractionating IQ

Near the dawn of the modern study of the mind, the great psychological pioneer Charles Spearman noticed that people who are good at one kind of mental activity tend to be good at most other mental activities. Thus, the notion of g (for "general intelligence") was born: the notion that there is some underlying factor that determines -- all else equal -- how good someone is at any particular intelligent task. This of course fits folk psychology quite well: g is just another word for "smarts".

The whole idea has always been controversial, and many people have argued that there is more than one kind of smarts out there (verbal vs. numeric, logical vs. creative, etc.). Enter a recent paper by Hampshire and colleagues (Hampshire, Highfield, Parkin & Owen, 2012) which tries to bring both neuroimaging and large-scale Web-based testing to bear on the question.

In the neuroimaging component, they asked sixteen participants to carry out twelve difficult cognitive tasks while their brains were scanned, and then applied principal components analysis (PCA) to the results. PCA is a sophisticated statistical method for grouping things.

A side note on PCA

If you already know what PCA is, skip to the next section. Basically, PCA is a very sophisticated way of sorting things. Imagine you are sorting dogs. The simplest thing you could do is have a list of dog breeds and go through each dog and sort it according to its breed.

What if you didn't already have a dog breed manual? Well, German shepherds are more similar to one another than any given German shepherd is to a poodle. So by looking through the range of dogs you see, you could probably find a reasonable way of sorting them, "rediscovering" the various dog breeds in the process. (In more difficult cases, there are algorithms you could use to help out.)

That works great if you have purebreds. What if you have mutts? This is where PCA comes in. PCA assumes that there are some number of breeds and that each dog you see is a mixture of those breeds. So a given dog may be 25% German Shepherd, 25% border collie, and 50% poodle. PCA tries to "learn" how many breeds there are, the characteristics of those breeds, and the mixture of breeds that makes up each dog -- all at the same time. It's a very powerful technique (though not without its flaws).
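To make the analogy a little more concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the two hidden trait profiles, the six measured traits, and the noise level are assumptions, not data from any study. Each simulated dog is a blend of two hidden profiles plus a little noise, and a bare-bones PCA (power iteration on the covariance matrix, chosen only to avoid needing a library) reports how much of the variation each component explains. Strictly speaking, PCA finds uncorrelated directions of variation rather than literal breed percentages, so the analogy is loose.

```python
import random

def explained_variance_shares(rows, n_components, iters=500):
    """Share of total variance captured by each leading principal component.

    Uses power iteration with deflation on the sample covariance matrix.
    Fine for a toy example; a real analysis would use a linear-algebra library.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    total = sum(cov[i][i] for i in range(d))
    shares = []
    for _ in range(n_components):
        v = [1.0 / d ** 0.5] * d  # arbitrary unit-length starting vector
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        cv = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        eigenvalue = sum(v[i] * cv[i] for i in range(d))
        shares.append(eigenvalue / total)
        # Deflate: remove the component just found so the next pass finds a new one.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return shares

# Hypothetical data: six measured traits driven by two hidden profiles.
random.seed(0)
PROFILE_A = [1.0, 0.9, 0.8, 0.0, 0.1, 0.0]
PROFILE_B = [0.0, 0.1, 0.0, 1.0, 0.9, 0.8]
dogs = []
for _ in range(300):
    a, b = random.gauss(0, 2), random.gauss(0, 1)  # how much of each profile
    dogs.append([a * pa + b * pb + random.gauss(0, 0.1)
                 for pa, pb in zip(PROFILE_A, PROFILE_B)])

shares = explained_variance_shares(dogs, 3)
print([round(s, 3) for s in shares])
```

With data built this way, the first two components should soak up nearly all of the variance and the third almost none, which is the sense in which PCA "learns" that two underlying factors are enough.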

Neuroimaging intelligence

Analysis focused only on the "multiple demands" network previously identified as being related to IQ and shown in red in part A of the graph below. PCA discovered two underlying components that accounted for about 90% of the variance in the brain scans across the twelve tasks. One was particularly important for working memory tasks, so the authors called it MDwm (see part B of the graph below), and it involved mostly the IFO, SFS and ventral ACC/preSMA (see part A below for locations). The other was mostly involved in various reasoning tasks and involved more IFS, IPC and dorsal ACC/preSMA.


Notice that all tasks involved both factors, and some tasks (like the paired associates memory task) involved a roughly equal portion of each.

Sixteen subjects isn't very many

The authors put versions of those same twelve tasks on the Internet. They were able to get data from 44,600 people, which makes it one of the larger Internet studies I've seen. The authors then applied PCA to those data. This time they got three components, two of which were quite similar to the two components found in the neuroimaging study (they correlated at around r=.7, which is a very strong correlation in psychology). The third component seemed to be particularly involved in tasks requiring language. Most likely that did not show up in the neuroimaging study because the neuroimaging study focused on the "multiple demands" network, whereas language primarily involves other parts of the brain.

The factors dissociated in other ways as well. Whereas people's working memory and reasoning abilities start to decline about the time people reach the legal drinking age in the US (coincidence?) verbal skills remain largely undiminished until around age 50. People who suffer from anxiety had lower than average working memory abilities, but average reasoning and verbal abilities. Several other demographic factors similarly had differing effects on working memory, reasoning, and verbal abilities.

Conclusions

The data in this paper are very pretty, and it was a particularly nice demonstration of converging behavioral and neuropsychological methods. I am curious what the impact will be. The authors are clearly arguing against a view on which there is some unitary notion of IQ/g. It occurred to me as I wrote this that while I've read many papers lately discussing the different components of IQ, I haven't read anything recent that endorses the idea of a unitary g. I wonder if there is anyone, and, if so, how they account for this kind of data. If I come across anything, I will post it here.


------
Hampshire, A., Highfield, R., Parkin, B., & Owen, A. (2012). Fractionating Human Intelligence. Neuron, 76 (6), 1225-1237. DOI: 10.1016/j.neuron.2012.06.022

Professor -- The Easiest Job in the World

There has been a small kerfuffle over Susan Adams's article at Forbes, titled "The least stressful jobs of 2013":
University professors have a lot less stress than most of us. Unless they teach summer school, they are off between May and September and they enjoy long breaks during the school year, including a month over Christmas and New Year's and another chunk of time in the spring. Even when school is in session they don't spend too many hours in the classroom ... Working conditions tend to be cozy and civilized and there are minimal travel demands...
She also mentions the great job prospects ("Universities are expected to add 305,700 adjunct and tenure-track professors by 2020").

To her credit, Adams has added a sizable addendum to her article, correcting -- but not apologizing for -- her mistakes. Unfortunately, this is far from the first time this kind of article has appeared in a major publication. Some time back, a columnist for the New York Times wrote an article suggesting that the solution to rising costs of higher education was to make professors work more than a few hours a week. An article in the New Yorker casually noted that the new head of a particular company was concerned that his employees worked "the hours of college professors" (I initially assumed they meant "way too hard" and that the boss wanted them to take a break!). What gives?

Scicurious suggests it's the curse of half-knowledge:
The vast majority of us aren't teachers or professors, but we've all been students, right? ... We thought that, because of what we saw of them in our classes, we knew what they did ... Because of this half-knowledge, people make assumptions about our jobs, assumptions that can really affect how we are perceived as people...
 That is no doubt part of it, but it also requires that people not think very hard. If I heard that someone made a pretty good living working only a few hours a week, it would immediately set off my implausibility alarm. I mean, what are the chances? And you'd only have to think for a moment to realize this can't be true.

Adams got hundreds of comments and letters pointing out that professors, in addition to giving a few lectures a week, also grade papers, advise students, write papers and books, go to conferences, give invited talks, etc. Adams presents this as if this came as a surprise, but that seems equally implausible. I'm going to assume she's read one or two articles about medicine or science, in which case the people discussed are inevitably professors. In fact, articles about politics occasionally cite professors as well. If she went to college, she knows that professors have office hours and grade papers. Many of the books on science and politics in the bookstore are written by faculty, as are essentially all college textbooks.

Even if she had never attended college, never interacted with a professor, and didn't read articles about higher education, a few minutes of Googling prior to writing her article would have corrected that mistake. My guess is that she didn't really think about her article before writing it and didn't consult either her own memory or Google because she -- and the others who write similar articles -- wanted this crazy claim about the lazy professor to be true. The interesting question is why she wanted it to be true. Anti-intellectualism? A desire to believe that such cushy jobs really exist? Or is this just an example of one of those ideas that are crazy enough that they inspire belief (like one of those many apocryphal "weird facts")?

*I do realize that some professors do very little work. Some people in all professions do very little work.

Transferring Consciousness

My brother was just in town, and we had our usual argument about Old Man's War, which he loves and about which I'm less enthusiastic (it was a fun read, but...). Perhaps one issue that keeps me from enjoying it fully is that whenever I think about it I think about an early scene, in which a character's consciousness was transferred from an old body to a new body. This is presented in the book as just one more futuristic miracle, but I can't stop thinking about the deeper questions it raises.

What does it mean to transfer consciousness from one body to another? Our current scientific understanding is that there is no consciousness separate from the underlying physical machinery, so such a transfer could not happen. But you might be able to create the illusion of a consciousness transfer, which I explain below. So we can make sense of Old Man's War if we assume that the doctors are deliberately lying about what is going on, covering up the murder that lies at the heart of the procedure.

Here's what might be going on (yes, I realize this is fiction, but good science fiction almost always has a thought experiment at its heart): It should be possible, at least in principle, to create a new body that has identical machinery to an existing body. This would be a new person who is a twin not only physically but mentally, down to having the same memories (by definition, since they have the same brains down to the microcircuitry). From the new person's perspective, he finds himself suddenly in a "new" body. (This is much like the old philosophical puzzle: what if the world was created yesterday, all of us with artificial memories?)

So now we've got a consciousness that believes itself to have transferred into a new body from an old body. What happened to the consciousness in the old body? The doctors in Old Man's War claim that it is now a vegetable, with no consciousness inside, because that consciousness has transferred. Since that can't happen, they are lying: either the process of creating the new copy of the old brain destroys the old brain, or the doctors deliberately destroy the old brain to preserve the illusion of the transfer (after all, if transfer is impossible, why go through this procedure? It's very nice for your twin to have a new body, but it's not going to do you any good at all!).

Here's the question: does this matter? If John undergoes this procedure on a Wednesday, then the world on Thursday is much the same as the world on Tuesday: on both days, there is a consciousness calling itself "John" with roughly the same memories. It only gets tricky when you think too much about Wednesday. You might be tempted to say you have John 1 on Tuesday and John 2 on Thursday, who are duplicates but nonetheless not the same because they have different bodies. But, of course, John had a different body when he was 5yo than when he was 75yo, down even to being made up of different atoms. So if we're willing to call 5yo John and 75yo John the same person, why aren't John 1 and John 2?

This confuses the heck out of me, which is why I have difficulty paying attention to the novel itself.

Eighteen papers on replication

Last month, Perspectives on Psychological Science published a special section on replicability in psychological science. With 18 papers, it took me a while to get through all of them, so I am only blogging about them now.

The issue contains many useful articles, particularly if you have not been following the replicability crisis/discussion carefully. I have been following the discussion pretty closely, and I still found a lot worth reading (a particularly enjoyable surprise was Giner-Sorolla's historical review of replicability problems, which is absolutely fascinating).

So I hope you check the issue out. It is currently open access. But if you'd rather have someone else read it first and summarize the main points ... then lucky for you, I've done just that. Below, I describe and discuss all 18 papers, though it should be said that these descriptions are through the lens of what I happen to find particularly interesting or novel. There are likely many other things in these papers that would be of interest to you. So check them out.

Bookends

The issue is bookended by two overviews. Harold Pashler and Eric-Jan Wagenmakers get the issue started with a brief review of recent scandals, failures to replicate, and other reports -- both in psychology and in other fields -- that replicability may be an issue. John Ioannidis, Legend of Replicability, has a contribution at the end of the issue ("Why science is not necessarily self-correcting") which similarly summarizes many of the main arguments of the issue. If you haven't been following along, these are good primers. Neuroskeptic ("The nine circles of scientific hell") makes many of the same arguments, but in the satirical style of Dante Alighieri (I only wish it was in rhyme).

Is there a replicability crisis?

Pashler and Harris ("Is the replicability crisis overblown?") address three arguments sometimes given that "the replicability crisis is overblown". The first is that the low alpha level (5%) ensures that, of published results, most are right. They show that this is most likely incorrect, given current publishing practices. The second is that even if not many direct replications are published, many "conceptual" replications are. They argue that the notion of "conceptual replication" is so fuzzy that it actually compounds the problem (one problem: there is no such thing as a failure to conceptually replicate):
We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling "pathological science" cases that embarrassed the natural sciences... What many observers found peculiar in these cases [like cold fusion] was that it took many years for a complete consensus to emerge that the phenomena lacked any reality... It appears that many exact replication attempts of the initial [cold fusion studies] were undertaken soon after the first dramatic reports of cold fusion. Such attempts produced generally negative results (Taubes, 1993). However, what kept faith in cold fusion alive for some time ... was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications).
Finally, they take issue with the claim that over the long run, science is self-correcting, given the lack of evidence that this is true (as far as I can tell, it's not even clear how you would show that science is self-correcting without rigorous, systematic studies of replicability).
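The first argument (that a 5% alpha guarantees most published results are right) is easy to poke at with a back-of-the-envelope calculation. The sketch below uses the standard positive-predictive-value formula; the prior (10% of tested hypotheses are true) and the power (35%) are illustrative assumptions, not figures taken from the paper, and the calculation ignores publication bias and flexible analysis, both of which would make things worse.

```python
def positive_predictive_value(prior, power, alpha=0.05):
    """Fraction of significant results that reflect a real effect.

    prior: assumed share of tested hypotheses that are actually true
    power: assumed chance of detecting a true effect
    alpha: false-positive rate when there is no effect
    """
    true_positives = prior * power
    false_positives = (1 - prior) * alpha
    return true_positives / (true_positives + false_positives)

# Illustrative (assumed) numbers: 1 in 10 hypotheses true, 35% power.
ppv = positive_predictive_value(prior=0.10, power=0.35)
print(f"Share of significant results that are real: {ppv:.1%}")
```

Under those assumptions fewer than half of the significant results are true, despite the 5% alpha. The point is only that alpha alone tells you little; the answer depends heavily on the prior and on power.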

While probably not everyone agrees with all the specifics of Pashler and Harris's argument, there does appear to be general consensus that there is a real replicability problem. 1,292 psychologists around the world surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") believed, on average, that 53% of the results in psychology could be replicated in direct replications. (One word of caution: it may be that psychologists most concerned about replication were also disproportionately likely to agree to participate in the survey; still, 1300 psychologists is a significant fraction of the field.)

What's the best way of getting a significant result?

Ideally, the answer would be "test for a real effect", but that of course requires that you know whether your hypothesis is true or not, which presumably you don't (otherwise, why run the study?). Bakker, van Dijk and Wicherts ("The rules of the game called psychological science") show that, as a general rule, you are better off running a few underpowered studies than one large, sufficiently-powered study (particularly if any real effect sizes are small). That is, the more likely it is that your results aren't true, the more likely you are to have a publishable finding. This is bad news when coupled with the fact that humans are good at finding efficient strategies. (Note that this doesn't require bad actors: One can simply find that certain procedures -- like testing around X subjects -- have generally produced "good", publishable research in the past.)
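A rough way to see the logic (a simplified sketch, not the simulations Bakker and colleagues actually ran): compare the chance that one large two-group study reaches significance with the chance that at least one of several small studies does, using the same total number of subjects. The effect size, sample sizes, and the normal approximation to the t-test are all assumptions chosen for illustration.

```python
from statistics import NormalDist

def power_two_group(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation).

    d is the assumed standardized effect size (Cohen's d).
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    ncp = d * (n_per_group / 2) ** 0.5  # expected z-score under the assumed effect
    return nd.cdf(ncp - z_crit) + nd.cdf(-ncp - z_crit)

def chance_of_any_hit(d, n_per_group, n_studies, alpha=0.05):
    """Chance that at least one of several independent studies is significant."""
    miss = 1 - power_two_group(d, n_per_group, alpha)
    return 1 - miss ** n_studies

# Same 200 subjects spent two ways (hypothetical numbers).
for d in (0.0, 0.2, 0.5):
    one_large = power_two_group(d, n_per_group=100)
    five_small = chance_of_any_hit(d, n_per_group=20, n_studies=5)
    print(f"d={d}: one large study {one_large:.2f}, any of five small {five_small:.2f}")
```

In this sketch the five-small-studies strategy comes out ahead when the effect is small or absent, while the single large study wins when the effect is moderately large. That is the uncomfortable pattern: the strategy pays off most exactly when the finding is least likely to be real.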

The bottleneck

Giner-Sorolla ("Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science") argues that a certain amount of bad science is expected when there are too few publication outlets:
Articles cannot pass through just by showing theoretical meaning and methodological rigor; their results must appear to support the hypothesis perfectly... Imagine that two colleagues in psychology each show you a manuscript. One has two studies with roughly equal numbers of participants. Both studies support the hypothesis, each with a significant key result at p=.04. The other paper has three studies, also supporting the hypothesis, but the last two studies' individual results are only near significant: p=.02, p=.07, and that most annoying figure, p=.11.
Although the second paper actually presents stronger evidence statistically, it should be obvious that the first one is more likely to be published.
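One way to check the claim about the strength of evidence is to pool each paper's p-values. The sketch below uses Fisher's method for combining independent p-values; that choice is an assumption made here for illustration (the article may have had a different calculation in mind), and it presumes the studies are independent and the effects all point in the predicted direction.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null. For even degrees of freedom the
    tail probability has a closed form, so no statistics library is needed.
    """
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    k = len(p_values)
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

paper_one = fisher_combined_p([0.04, 0.04])        # two "clean" studies
paper_two = fisher_combined_p([0.02, 0.07, 0.11])  # three "messier" studies
print(f"Paper one: {paper_one:.4f}   Paper two: {paper_two:.4f}")
```

By this measure the three-study paper, with its two non-significant results, has the smaller combined p-value of the two.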

There are two ways of having perfect-looking results, and unfortunately one is a lot easier than the other. Again, this doesn't require bad actors -- just competent ones. Giner-Sorolla's solution is not to make publication easier -- various reasons are given for why that is unlikely to work -- but rather to develop an alternative method for distributing less "pretty" results to those who might be interested (such as those doing meta-analysis). 

(This summary does not do the article justice. Among other things, it also includes a fascinating history of hypothesis testing, including a previous crisis of confidence in psychology in the 1970s.)

Too good to be true

Francis ("The psychology of replication and the replication of psychology") takes the point a bit further: our love of "pretty" data has led us to expect data that are too good to be true. Most psychology studies have low power to detect an effect; thus, presenting several studies in a single paper all of which show the same significant effect is highly suspicious. He goes through eight experiments recently published in Galak & Meyvis (2011), seven of which show significant results despite their apparent power ranging from about 0.25 to 0.8 (that is, between a 25% and 80% chance of detecting the effect). The likelihood of this happening even if the hypothesis were true was about 0.08. (This is actually an underestimate: the one "non-significant" experiment had a p-value of 0.056, which many authors would count as significant -- which, again, makes the reported findings even less likely.)
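The flavor of this calculation can be sketched as follows. The per-experiment power values below are hypothetical placeholders spread over the 0.25 to 0.8 range mentioned above (the actual values are in Francis's paper and are not reproduced here), and this is one plausible way to set up the computation rather than necessarily the exact procedure Francis used. Given each experiment's power, the code works out the probability that at least seven of the eight would come out significant.

```python
def prob_at_least(powers, k):
    """Probability that at least k of the experiments reach significance,
    treating them as independent with the given (possibly unequal) powers."""
    dist = [1.0]  # dist[j] = probability of exactly j significant results so far
    for p in powers:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1 - p)   # this experiment misses
            nxt[j + 1] += mass * p     # this experiment hits
        dist = nxt
    return sum(dist[k:])

# Hypothetical powers for eight experiments (assumed, for illustration only).
assumed_powers = [0.25, 0.40, 0.50, 0.60, 0.65, 0.70, 0.75, 0.80]
p_seven_plus = prob_at_least(assumed_powers, 7)
print(f"Chance of 7+ significant results out of 8: {p_seven_plus:.3f}")
```

Even when every experiment has a real effect to find, a run of successes like this is unlikely unless power is high across the board.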

Francis goes out of his way to say that this does not necessarily represent malicious intent:
Publication bias could occur in a set of experiments because authors closely follow the standards of the field. Indeed, the experiments reported in Galak and Meyvis (2011) appear to meet (and often exceed) the standards of experimental psychology.
Francis argues that changing the aesthetic can be accomplished by changing what results we report. We should report effect sizes and worry less about p-values. I've always been somewhat skeptical of effect size cheerleaders, but Francis put the argument in a new way that (at least for the moment) makes sense to me:
Given that almost every empirical study in psychology currently uses hypothesis testing, it may seem bizarre to claim that effect sizes are more important than the outcome of hypothesis tests... [But] surely, the theory of Galak and Meyvis (2011) does not predict that every experiment testing the theory will reject the null (e.g., even with sample sizes of, say, n = 3). The authors probably meant that if the theory were true, then an experiment with a large enough sample would reject the null hypothesis. But the definition of "large enough sample" is determined by the magnitude of the effect size ... you still have to focus on effect sizes in order to predict the outcome of experiments.
Measuring and reporting effect sizes would highlight when data are "too pretty", making such data actually less attractive, mitigating some of the factors encouraging publication bias. It would also lead to better research. Francis then discusses some ways to easily detect publication bias.

This paper generated a lot of attention and two commentaries were published alongside it. Simonsohn ("It does not follow") notes that Francis had to look at many papers to find any that have evidence of publication bias but does not correct for multiple comparisons. That is, Francis's test for the file-drawer effect must itself take into account the file-drawer effect; you'd want to see that, in a sample of papers, more show evidence of a file-drawer effect than you would have expected by chance. In other words, it is possible that Galak and Meyvis (2011) failed the test through pure random chance alone. This argument is somewhat weakened by Galak and Meyvis's admission ("You could have just asked") that they had indeed excluded several experiments that had not reached significance. However, the larger point stands.

Both Simonsohn and Galak & Meyvis push back against Francis's argument that if a paper has been shown to suffer from the file-drawer effect (unpublished null results), the paper should be ignored altogether. Just because the file-drawer effect has inflated the observed effect size, that doesn't mean the effect size is 0. Simonsohn points out that there are meta-analytic ways to correct for the file-drawer effect. Galak and Meyvis argue that not publishing null results is inconsequential, because having several positive results is simply too unlikely. Whether this is true, they acknowledge, depends on whether false-positive rates are inflated overall (as many of the other authors in this issue argue).

Can meta-analyses save us?

Ferguson and Heene ("A vast graveyard of undead theories") focus on the problems for meta-analyses posed by publication bias. Obviously, meta-analyses of biased data will be biased meta-analyses (junk in - junk out). They take a look at some ways researchers have tried to de-bias meta-analyses, such as including unpublished data. The problem is that the selection of unpublished data itself is often biased, and they show that in some cases this actually makes the problem worse.

They also take on an apparently common argument, which is that if several published studies have shown an effect, you'd need there to be a huge number of studies showing no effect in the proverbial file drawer in order to wash out the significant studies.
Now let us assume we assembled 50 studies in a meta-analysis with a mean [effect size] of 2.0. In this case, the FSN [a statistic indicating the number of null results needed to cancel out the significant ones] would yield a value of 6,854.
However, this statistic makes the incorrect assumption that if all the published studies were false positives, the mean effect size of the remainder would be 0.
In fact, if only 5% of studies that show Type I errors were published, the mean [effect size] of the remaining unpublished studies cannot be zero but must be negative... The FSN turns out to be a gross overestimate...
Despite the math, the best evidence of their claims comes from a plausibility argument:
It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing meta-analysis of their own [long list of citations].
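The claim that the unpublished remainder cannot average zero is easy to verify in the simplest case. Suppose (as an assumption for this sketch) that there is no real effect, every study's z-score is a draw from a standard normal distribution, and only the top 5% get published. The mean of what is left in the file drawer is then the mean of a normal distribution with its upper tail cut off, which has a closed form.

```python
from statistics import NormalDist

def mean_of_unpublished_z(published_fraction=0.05):
    """Mean z-score of the studies left in the file drawer, assuming no true
    effect and that only the top `published_fraction` of z-scores are published."""
    nd = NormalDist()
    cutoff = nd.inv_cdf(1 - published_fraction)  # about 1.645 for 5%
    # Mean of a standard normal truncated above at `cutoff`.
    return -nd.pdf(cutoff) / nd.cdf(cutoff)

file_drawer_mean = mean_of_unpublished_z()
print(f"Mean z of unpublished studies: {file_drawer_mean:.3f}")
```

The result is slightly below zero, so a fail-safe calculation that assumes the hidden studies average exactly zero overstates how many of them would be needed to cancel out the published ones.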

Participants as passive receptacles of stimuli

Klein and colleagues ("Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments") write:
Though in general there may be many reasons why a study fails to replicate (e.g., lack of statistical power...), we suggest that the central issue is a widespread failure to acknowledge, recognize, and document the circumstances and social context in which an experiment takes place.
In theory, if aspects of social context (e.g., whether the experimenter was male or female) are relevant for the effect but not documented, that part of the method cannot be replicated. But more generally, the failure to document important aspects of the design may suggest -- to put it bluntly -- incompetent experimenters, in which case a failure to replicate is no surprise. For instance: a significant percentage of studies (published in premier journals) that use deception did not include an "awareness probe" (basically, asking the participant if they were aware of the deception). This is problematic because sometimes the deception is too obvious and participants figure it out, invalidating the results of the study.

Their proposed solutions mostly amount to "do good studies". For instance, if experimenter bias is a concern, use double-blind procedures. If you are worried about demand characteristics, include relevant probes and controls. This is something that should be taught in every intro methods class but that, according to the authors' research, is often neglected even in prestigious journals.

Who should be doing replications?

Given that researchers are already over-loaded with responsibilities, if more replications are to be conducted, who is going to conduct them? Frank and Saxe ("Teaching replication") argue that undergraduate and graduate research methods classes should. Students in these classes already have to conduct experiments, but usually they are replications of classic experiments (like Stroop) or novel experiments thought up by the students. Either kind is usually boring: The first kind because the answer is already known, and the second kind because the students don't know enough to think up an interesting, novel experiment that makes contact with cutting-edge research. Replicating recent, cutting-edge studies solves both problems. While they acknowledge that sometimes the experiments aren't done well enough to be informative, around half of the time (in their experience, both having taught such classes), they are.

Grahe and colleagues ("Harnessing the undiscovered resource of student research projects") likewise argue that undergraduates are capable of doing useful research, and given that there are many undergraduates out there taking many methods courses, quite a lot of useful work could be collectively accomplished by them.


How often are replications published in psychology?

Makel, Plucker, and Hegarty ("Replications in psychology research: How often do they really occur?") find -- probably to nobody's surprise -- that replications are rarely published. Looking at all the articles published by 100 psych journals since 1900, just under 2% of the articles even mention "replication" in the text. Based on further investigation, they estimate that only 1% of articles are actual replications, most of which were successful replications. The good news is that the rate has been rising steadily in recent decades. The bad news is that the vast majority were conceptual replications, not direct replications. Worse news: replications, whether successful or not, were cited on average 17 times, compared with an average of 64.5 times for the articles replicated. (Interestingly, they find that some papers do include failures to conceptually replicate -- despite what most people would expect -- though these are nonetheless fairly rare.)

Incentivizing replication and replicability

Nosek, Spies and Motyl ("Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability") note, as have others, that there isn't much incentive to engage in activities that would improve the replicability rate. In a succinct article, they list a number of strategies that they think are unlikely to work -- conceptual replication, waiting for science to "self-correct", establishing journals devoted to publishing replications or negative results, education campaigns, increasing expectations of reviewers, and raising the barrier for publication. The reasons why are straightforward and can be found in the article.

They also offer several strategies they think will work. For instance, they recommend simple checklists, which "ensure disclosure of obvious items that are sometimes forgotten: sample sizes, statistical tests, effect sizes, and covariates." While this might seem similar to their discarded strategies of "raising expectations" and "educational campaigns", they argue that checklists are an exception, and point to research showing their efficacy (as opposed to raised expectations and educational campaigns) in medicine.

Other proposals include establishing metrics to identify what is worth replicating, thereby making replications of these papers seem important enough (to journal editors) to publish, crowd sourcing replication efforts, and establishing journals (like PLoS ONE) with review standards focused on the soundness, not importance, of the research. While this latter approach may seem like a journal signing its own death warrant, they point to data suggesting that journals aren't actually very good at figuring out what papers will or won't be cited in the future. They also note that making data, materials, and the workflow itself more open and transparent will make it easier to re-use data and materials, with additional replications as an inevitable, positive side effect.

Koole and Lakens ("Rewarding replications") focus more squarely on the lack of incentive to conduct replications because they are rarely published and rarely cited when published. Their proposed solution is that when journals publish an original finding, they agree to publish future replications as an online supplement. They then could create a meta-reference that refers to the original paper and subsequent replications, which researchers could then cite, increasing the citations received by the replications (since they are cited automatically).  This has the nice advantage of working within the existing infrastructure (always easier than starting from scratch). The only remaining issue is how to convince the existing infrastructure (i.e., the journals) to participate, given that it will cost them a lot and there isn't any clear advantage to participation (some would argue that it's actually against the journal's interest to participate). There is also the first-mover disadvantage: the system only becomes particularly useful once many journals participate.

Confirmatory research

Wagenmakers, Wetzels, Borsboom, van der Maas and Kievit ("An agenda for purely confirmatory research") argue that researchers should be required to pre-register their methods and intended analyses prior to running a study. The idea is to make it clear which results are and are not exploratory. The suggestion is that exploratory analyses are often circular (you look at the data, develop a hypothesis, and then test the hypothesis on those data), which increases the likelihood of false positives.
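To make the worry about circularity concrete, here is a minimal simulation sketch. It is not taken from Wagenmakers and colleagues; the setup (ten pure-noise measures, 20 observations each, a z-test with known standard deviation of 1) is entirely an assumption chosen for illustration. It compares testing a measure chosen in advance against testing whichever measure looks most extreme in the very data being tested.

```python
import math
import random

def two_sided_p(sample_mean, n):
    """p-value for H0: mean = 0, assuming normal data with known sd of 1."""
    z = abs(sample_mean) * math.sqrt(n)
    return math.erfc(z / math.sqrt(2))  # equals 2 * (1 - Phi(z))

def false_positive_rates(n_sims=2000, n_measures=10, n=20, alpha=0.05, seed=1):
    rng = random.Random(seed)
    planned_hits = 0
    circular_hits = 0
    for _ in range(n_sims):
        # Every measure is pure noise, so any "significant" result is a false positive.
        means = [sum(rng.gauss(0, 1) for _ in range(n)) / n for _ in range(n_measures)]
        # Planned: the measure to test (the first one) was fixed before seeing the data.
        if two_sided_p(means[0], n) < alpha:
            planned_hits += 1
        # Circular: look at the data, pick the most extreme measure, test it on the same data.
        most_extreme = max(means, key=abs)
        if two_sided_p(most_extreme, n) < alpha:
            circular_hits += 1
    return planned_hits / n_sims, circular_hits / n_sims

planned_rate, circular_rate = false_positive_rates()
print(f"planned: {planned_rate:.3f}, circular: {circular_rate:.3f}")
```

Under these assumptions the planned test should stay near the nominal 5% rate, while the circular one should come out several times higher. Note that this only illustrates the circular case; it says nothing about how often exploratory analyses are in fact circular.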

Pre-registration obviously requires extra work, both on the part of the scientists and on the part of the people who maintain the registry, so one can reasonably ask whether the gain is worth the cost. First, we need to know if the problem itself is common (no point in investing in this infrastructure in order to combat a non-existent problem). Wagenmakers and colleagues argue that it is, but unconvincingly:
Almost no psychological research is conducted in a purely confirmatory fashion (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001).
As far as I can tell, Wagenmakers et al. (2011) speculate that confirmatory research is rare, but offer no data (that I noticed when skimming). Kerr (1998) did conduct a survey of 156 social psychologists, clinical/community psychologists, and sociologists, finding that confirmatory research was reported as occurring most frequently, though not significantly more frequently than some forms of circular research; even if we assume that these results generalize to all of psychology (not just social and clinical), it is a far cry from "almost nothing".

But let's even suppose for the moment that non-confirmatory (that is, exploratory) research is very common. Is that a problem? As I already mentioned, Wagenmakers and colleagues are worried about circular analyses. What they suggest is tracking exploratory analyses, which are not necessarily the same thing: Suppose I run a study, and in review, a reviewer asks, "Was there a difference between the male and female participants?" (This is not a made-up example). Well, I hadn't ever thought about that question, so I never checked. Now I check, and either there is or there isn't. This is not a circular analysis; retroactively deciding this is an interesting question should not affect the results (unless you believe in ESP, which Wagenmakers and colleagues clearly do not).

In the absence of additional information, then, the fact that an analysis was exploratory tells you little about whether it is circular, which was the real issue. Whether an analysis was exploratory or planned only suggests that it is circular or not if large percentages of exploratory analyses are circular. I don't know; there doesn't appear to be any data. But even if there were, that would just tell you likelihood -- it wouldn't settle the issue for any particular case.

Enough smart people think that pre-registration would be a good idea that I'm willing to suppose that it is. I just don't see anything in this particular article to suggest that the benefit is worth the price. One last thought: Wagenmakers and colleagues argue that researchers don't fully appreciate the distinction between planned and circular analyses. If so, would marking certain analyses as planned and certain analyses as circular help?

What should we do? 

The 1,292 psychologists surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") were also asked about specific suggested reforms (proposed in a recent paper, not by them). I summarize the results below:

Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article:
should be good practice: 60%
should be condition of publication: 46%

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification:
should be good practice: 47%
should be condition of publication: 30%  [presumably none of these were psychophysicists]

Authors must list all variables collected in a study:
should be good practice: 58%
should be condition of publication: 46%

Authors must report all experimental conditions, including failed manipulations:
should be good practice: 80%
should be condition of publication: 66%

If observations are eliminated, authors must also report what the statistical results are if those observations are included:
should be good practice: 69%
should be condition of publication: 52%

If analysis includes a covariate, authors must report the statistical results of the analysis without the covariate:
should be good practice: 67%
should be condition of publication: 50%

The survey also asked about four suggested guidelines; for space reasons, I have left them out. You can find them in the paper.

Nobody asked my opinion, but here it is anyway. I think having and reporting a termination rule is relatively costless and helpful in certain situations, so I support it as good practice and wouldn't mind if it were a condition of publication. The question about 20 observations per cell ... do we mean 20 items or 20 subjects or both? I suspect the authors were thinking of studies with a single item per condition per subject (which are common in some areas of psychology but essentially non-existent in others; note that Fuchs, Jenny & Fiedler also note the problem of finding one set of standards that is appropriate to all subdisciplines of psychology), so it's hard to know how this is meant to be applied. In any case, I would prefer a standard based on expected power, which is more important (though harder to define).
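To give a rough sense of what a power-based standard would look at, here is a back-of-the-envelope sketch. It assumes the simplest possible reading of the rule (two independent groups, 20 subjects per cell, one observation per subject) and uses a normal approximation rather than an exact t-test calculation; the effect sizes are just the conventional small/medium/large benchmarks, not values from the survey or from the paper that proposed the rule.

```python
from statistics import NormalDist

def approx_power(effect_size_d, n_per_cell, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison (normal approximation)."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    # Expected z statistic for two independent groups of equal size.
    noncentrality = effect_size_d * (n_per_cell / 2) ** 0.5
    upper_tail = 1 - std_normal.cdf(z_crit - noncentrality)
    lower_tail = std_normal.cdf(-z_crit - noncentrality)
    return upper_tail + lower_tail

for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: approximate power with 20 per cell = {approx_power(d, 20):.2f}")
```

On these assumptions, 20 per cell falls well short of the conventional 80% power target for a medium-sized effect, which is one reason a fixed count per cell is a blunt instrument. A within-subject design with many items per subject would need a different calculation entirely.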

I strongly support listing all variables collected and all experimental conditions, though it quickly runs up against the word limits in many high-impact journals (which I would be happy to do away with; to paraphrase Einstein, papers should be as short as possible, but not shorter). One issue: what counts as an experimental condition is not always clear, particularly if you are doing between-subject manipulations (is this data set a "condition" or a separate experiment?). So I don't know how to enforce these as conditions of publication.

As far as covariates, certain kinds of measures make no sense without the covariate, so in some cases you'd be adding irrelevant, misleading information. Leaving this up to reviewers (the current system) seems perfectly reasonable to me.

In praise of experiments

Today, the excellent Neuroskeptic writes about a new study investigating which US states are most suicidal. The interesting twist was the form of the data: Google searches. It's an interesting study and an interesting use of Google searches, but what struck me was Neuroskeptic's closing thoughts.
Over the past couple of years there's been a flurry of studies based on analyzing Google and Twitter trends. What's interesting to me is that we're really in the early days of this, when you think about likely future technologies. What will happen when everyone's wearing a computer 24/7 that records their every word and move, and even what they see? 
Eventually, psychology and sociology might evolve (or degenerate) into no more than the analysis of such data...
It's always dangerous to predict the future, but here's my prediction: Not a chance. It gets down to a distinction between observational studies and experiments. Observational studies (where you record what happens in the course of normal events) are useful, particularly when you care about questions like what is the state of the world? They are much less useful when you want to know why is the world the way it is?

There are a couple reasons.

Reason #1: The correlation fallacy

First, observational studies are really about studying correlations. To have much power to analyze interesting correlations, you need a lot of data. This is what makes Google and Twitter powerful: they provide a lot of data. But correlation, famously, doesn't always tell you much about causation.

For instance, it is now well-known that you can use the number of pirates active in the world's oceans and seas to reasonably predict average global temperature (there's a strong correlation):



I did not know until recently that Google search data has now definitively shown a correlation between the amount of movie piracy and global warming as well:

In the case of real pirates vs. the temperature, the correlation runs the other way (temperature affects weather affects seafaring activities). I have no idea what causes the correlation between searches for free movies and searches about global warming; perhaps some third factor. To give another silly example, there is a lot more traffic on the roads during daylight than at night, but this isn't because cars are solar-powered!

The point is that experiments don't have this problem: you go out and manipulate the world to see what happens. Change the number of pirates and see if global temperatures change. Nobody has tried this (to my knowledge), but I'm willing to bet it won't work.
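The difference between observing and manipulating can be sketched with a toy simulation. Everything in it is made up for illustration: a hidden third factor z drives both x and y, and x has no effect on y at all. In the "observational" data x and y are nonetheless correlated; once x is assigned at random by the experimenter, the correlation should disappear.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)
n = 5000

# Observational world: the hidden factor z drives both x and y; x never affects y.
z = [rng.gauss(0, 1) for _ in range(n)]
x_observed = [zi + rng.gauss(0, 1) for zi in z]
y_observed = [zi + rng.gauss(0, 1) for zi in z]

# Experiment: we set x ourselves, at random, so it no longer depends on z.
x_assigned = [rng.gauss(0, 1) for _ in range(n)]
y_after = [zi + rng.gauss(0, 1) for zi in z]

r_observational = pearson_r(x_observed, y_observed)
r_experimental = pearson_r(x_assigned, y_after)
print(f"observational r = {r_observational:.2f}, experimental r = {r_experimental:.2f}")
```

With these settings the observational correlation should land around 0.5 and the experimental one near zero, even though the causal structure linking x to y (none) is identical in both cases.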

(Of course, there are natural experiments, which are a hybrid of observational studies and experiments: the experimenter doesn't manipulate the world herself but rather waits until somebody else, in the course of normal events, does it for her. A good example is tracking when different states adopt bicycle helmet laws and comparing that against head injury stats in those states. These are rarely as well-controlled as an actual experiment, but have the advantage of ecological validity.)

Reason #2: Life's too short

The second is that observational studies are limited by what actually happens in the world. You won't, from an observational study, find out what the effect on US politics is of every US senator taking up crack while every US representative takes up meth. (I hope not, anyway.)

That was an absurd example, but the problem is real. Language gives lots of great examples. Suppose you want to find out what sentences in any given language are grammatical and what sentences are not. You could do an observational study and see what sentences people say. Those are grammatical; sentences you haven't heard probably aren't.

The problem with this is that people are boring and repetitive. A small number of words (heck, a small number of sentence fragments) accounts for most of what people say and write. The vast majority of grammatical sentences will never appear in your observational sample no matter how long you wait, because there are actually an infinite number of grammatical English sentences. (In his impressive "Who's afraid of George Kingsley Zipf?", Charles Yang shows how a number of prominent language researchers went astray by paying too much attention to this kind of observational study.)
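How lopsided is "a small number of words accounts for most of what people say"? Here is a quick sketch under an idealized assumption: that word frequency is exactly proportional to 1/rank (the textbook form of Zipf's law) over a hypothetical 50,000-word vocabulary. Real corpora only approximate this, and the vocabulary size is an arbitrary choice.

```python
def zipf_coverage(top_k, vocab_size):
    """Share of all word tokens covered by the top_k most frequent words,
    assuming frequency is proportional to 1 / rank (idealized Zipf's law)."""
    weights = [1 / rank for rank in range(1, vocab_size + 1)]
    return sum(weights[:top_k]) / sum(weights)

vocab_size = 50_000  # hypothetical vocabulary size, chosen for illustration
for top_k in (10, 100, 1000):
    share = zipf_coverage(top_k, vocab_size)
    print(f"top {top_k:>4} words cover about {share:.0%} of tokens")
```

Under this assumption the 100 most frequent words alone cover a bit under half of everything said, which leaves very little of any sample for the rare cases. For whole sentences, where the space of possibilities is unbounded, the situation is far more extreme.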

The basic feature of the problem is that for building theories -- explaining why things are the way they are -- very often what you care about are the border cases. Human behavior is largely repetitive, and the  border cases are quite rare. Experiments turn this around: by deliberately choosing the situations we put our participants in, we can focus on the informative test cases.

The experimental method: Here to stay

None of this should be taken as meaning that I don't think observational studies are useful. I conduct them myself. A prerequisite to asking the question Why are things the way they are is knowing, in fact, what way things are. There is also the question of ecological validity. When we conduct laboratory experiments, we construct artificial situations and then try to generalize the results to real life. It's good to know something about real life in order to inform those generalizations.

But just as I can't imagine observational studies disappearing, I can't imagine them replacing experimentation, either.



Lab Notebook: You know you are writing a paper when

Your "recently added" list in Mendeley is growing at an exponential rate:


Every new paper you read results in downloading at least two more (not unlike the Hounds of Resurrection. Coincidence? I think not).


I don't think I've ever actually finished my reading list for a paper. At some point, I shut down the process before it overwhelms my hard drive.

Findings: That Kind of Person

That Kind of Person is now complete. Many thanks to all who answered the call to participate.


For some time now, I have been studying the effect of context on pronoun interpretation. If words and sentences always meant what they meant regardless of context, linguistics and psycholinguistics would be much easier, and we would have much better computer translation, speech recognition, etc. Unfortunately, the same word (bank) can often mean different things in different contexts (he paddled over to the bank versus he cashed a check at the bank).

Pronouns are a great guinea pig for studying the role of context, because they derive almost all their meaning from context (try to define “she” or “he” and compare it to your definition of “Martha Washington” or “George Washington”).

Great Expectations

Recently, a picture has started to emerge, at least in the case of pronouns. The basic idea, due mostly to the work of Andrew Kehler at UCSD*, is that our initial interpretation of a pronoun is driven by what we think is likely to be talked about next. If this seems obvious, the dominant theory at the time Kehler started working (Centering Theory and variants) argued that our initial interpretation of the pronoun is that it refers to whatever person or thing is currently most “salient” (what counts as "salient" depends on the version of the theory) -- a hypothesis that also usually strikes folks as obvious.

Kehler's big contribution was articulating a theory of discourse structure – that is, how sentences relate to one another – that can be used to fairly accurately predict what people expect to be mentioned next. (If you are interested in these issues and have a little background in linguistics, Kehler's book, Coherence, Reference, and the Theory of Grammar is fantastic.) For instance, sometimes one sentence introduces the consequence of another sentence:

(1) John frightened Bill, so he ran away.

Here, the second sentence (or, if you prefer, second clause) describes a consequence of the first sentence. Most likely "he" refers to Bill, because Bill running away would be a reasonable consequence of John frightening him. In contrast, other sentences explain the previous sentence:

(2) John frightened Bill because he is scary.

Here, "he" probably refers to John, since John being scary would be a good explanation of his frightening of Bill.

There are many other types of relationships between sentences, and they have predictable effects on pronoun interpretation. Although Kehler's theory explains a lot, it does not explain, for example, why we think Bill running away is a more likely effect of John frightening Bill than John running away.

The role of verbs

In two recent papers, which I discussed on this blog, my colleagues and I argued that verbs play a major role. Verbs -- specifically, the relationship between a verb and its subject and object -- provide a lot of information about events. We drew in particular on one line of theoretical work (usually called "predicate decomposition theory"), which tries to explain how verb meaning can be built out of a few constituent parts. The details aren't important here. What is important is that this theory argues that some  verbs specify who the cause of the event was. What we showed was that usually, in sentences like (2), people think the pronoun refers to the person that the verb specifies as the cause. In this case, "frighten" means something like "John caused Bill to be afraid". Remember that "he is scary" is an explanation of "John frightened Bill." Explanations usually refer to causes.

In short, by drawing on independent theories of discourse structure and verb meaning, we were able to predict very well how people will interpret pronouns in various contexts. At least, we could do so in the ones we tried -- there's a lot of work left to be done to fully flesh out this work.

The problem

I have been presenting this work for a while, and I often get the following objection: We already know that verbs can't be doing all (or even much) of the work. The real story, it was argued, is much more complex. Thinking just about the explanation sentences like (2), Pickering and Majid (2007) noted that multiple factors "affect the construction of the event representation, and it is this event representation that is used to infer the cause..." They cite experimental findings argued to show that pronoun interpretation in sentences like (2) depends in complex ways not just on the verb but on what you know about the subject and the object:
In addition, properties of the participants affect implicit causality. Changing the gender (Lafrance, Brownell, & Hahn, 1997), animacy (Corrigan, 1988, 1992), or typicality (Corrigan, 1992; Garvey et al., 1976) of the participants changes the [pronoun interpretation].
After hearing this enough times, I started what I thought would be a series of studies to look at how information about the subject and object interact with the verb in real time during sentence comprehension. This project never got off the ground because I couldn't find any such effects. That is, I have now run a number of studies where I manipulate the gender or typicality, etc., of the subject and object, and they have no effect on pronoun interpretation.

It turns out that there was some confusion in the literature. The studies that Pickering and Majid cite in the quote above mostly don't look at pronoun interpretation at all. Most look at a different task:

(3) John frightened Bill.
a. How likely is this because John is the kind of person who frightens people? 1 2 3 4 5 6 7 8 9
b. How likely is this because Bill is the kind of person people frighten? 1 2 3 4 5 6 7 8 9

Researchers look at whether the answer to (a) is greater or less than the answer to (b) to decide who people think caused the event: John or Bill? Much of the literature has assumed that the answer to this question should predict what happens in pronoun sentences like (2), even though this has never been rigorously shown. (Why it hasn't been carefully tested is a bit of a mystery. It is so widely believed to be true that I suspect many folks don't realize that it hasn't been tested. It actually took me several years to pick up on this fact myself.)

I now have a long line of studies showing that there is little relationship between the two tasks. Also, although manipulating who the subject and object are affects the task in (3), I find very little evidence that it affects pronoun interpretation in (2). For instance, compare the following:

(4) a. The king frightened the page because he....
     b. The page frightened the king because he....

Everybody agrees that, in general, it is more likely that kings frighten pages than that pages frighten kings, and so if you use these sentences in (3), you get a nice effect of who the subject is. But it doesn't affect pronoun interpretation at all.

This is a serious blow to Pickering and Majid's argument. They argued that pronoun interpretation cannot be all (or mostly) about discourse structure and verb meaning because these interact in complex ways with knowledge about the subject and object (I should add: non-linguistic knowledge. It presumably is not part of the definition of king and page that kings frighten pages but not vice versa, but rather something you learn about the world). If it turns out that this is not the case, then discourse structure + verb meaning may well explain much or all of the phenomenon at hand.

That Kind of Person

That was my argument, anyway, in a paper that I have been shopping around for a couple years now. The difficulty with publishing this paper is that it makes a null argument: you can't find effects of knowledge about the subject and object on pronoun interpretation. In fact, all I can show is that the manipulations I have tried haven't worked, not that no manipulation works (you can't try everything!). So much of the review process has been reviewers suggesting additional experiments and me running them. The latest -- and I hope last -- one was That Kind of Person.

A reviewer very smartly noted that a big difference between (2) and (3) is that (3) asks about the kind of person the subject is and the kind of person the object is, whereas (2) does not. What we are manipulating in our king/page manipulation is, of course, the kind of person the subject is and the kind of person that the object is. So the reviewer suggested the following pronoun task:

(5) a. The king frightened the page because he is the kind of person that...
     b. The page frightened the king because he is the kind of person that...

The specific manipulation was one of status. It was argued in the literature that people are more likely to think that high-status folk (kings) caused the event than low-status folk (pages). This does turn out to be true if you use the task in (3), but yet again I found no effect on pronouns, either using sentences like (4) or like (5). (Sorry -- I was going to include a graph, but the results aren't formatted for graphing yet, and it's time for lunch! Maybe when the paper is published...)

Conclusions

I think the result of this work is that it suggests that we really are narrowing in on "the" theory of pronoun interpretation (though there is a lot of work left), a theory in which most of the work is done by discourse structure and verb meaning. This is pretty exciting, because it would be one of the rare cases where we have a reasonably complete theory of how context affects word meaning. It does leave open the question of what the task in (3) is measuring, and why it doesn't match what the pronoun tasks measure. That's still the sticking point in the review. I have a few new ideas, and we'll see what the reviewers say this time around.

----
*Editors at newspapers and magazines usually request that, whenever you introduce a scientist in an article, you state name, institution, and scientific field. The first two are easy, but the last one is hard, particularly when you frequently write about interdisciplinary research (which I do). I wrote about Kehler in an article for Scientific American Mind a while back, and introducing him caused a long debate. His degree is in computer science, he works in a linguistics department, but his work is probably best described as psychology. So what is he?

Just another reason I prefer blogging.

Small World of Words

A group of researchers in Belgium is putting together a very large word association network by asking volunteers to say which words are related to which other words. They are hoping to recruit around 300,000 participants, which makes it my kind of study! (Technically, I've never tried 300,000 participants -- I think we've never gone beyond about 50,000, though we have some new things in the pipeline...)

It looks interesting. To participate, go to www.smallworldofwords.com. You can read more about the project here.

I say "uncle", you say "DaJiu"

Kinship terms (mother, uncle, niece, etc.) are socially important and generally learned early in acquisition. Interestingly, different languages have different sets of terms. Mandarin, for instance, divides "uncle" into "father's older brother", "father's younger brother", and "mother's brother".
Stranger things (to an anglophone, anyway) happen, too: In Northern Paiute, the kin terms for grandparents and grandchildren are self-reciprocal: you would use the same word to refer to your grandmother (if you are female) that she uses to refer to you. (See my previous post on "mommy" across languages.)

Kinship terms in English and Northern Paiute. Ignore all the logical terms for now.
(Figure taken from Kemp & Regier, 2012)

Even so, there are a lot of similarities across languages. Disjunctions are relatively rare; that is, it's unusual to see a word that means "father or cousin". Usually there are more words to distinguish varieties of closely-related relatives (sister, brother) than distant relatives (cousin). How come? One obvious answer is that maybe the kinship systems we have are just better than the alternatives (ones with words like "facousin" = "father or cousin"), but it would be nice to show this.

Optimal Kinship Terms

In a paper earlier this year, Charles Kemp and Terry Regier did just that.
We show that major aspects of kin classification follow directly from two general principles: Categories tend to be simple, which minimizes cognitive load, and to be informative, which maximizes communicative efficiency ... The principles of simplicity and informativeness trade off against each other... A system with a single category that includes all possible relatives would be simple but uninformative because this category does not help to pick out specific relatives. A system with a different name for each relative would be complex but highly informative because it picks out individual relatives perfectly. 
That seems intuitively reasonable, but these are computational folk, so they formalized this with math. The details are in the paper, but roughly: They formalize the notion of complexity by using minimum description length in a representational language based on primitives like FEMALE and PARENT. The descriptions of the various terms in English and Northern Paiute are shown in parts C and D of the figure above. Communicativeness is formalized by measuring how ambiguous each term is (how many people it could potentially refer to).

A language is considered "better" than another if it out-scores the other on one dimension (e.g., simplicity) and is no worse on the other (informativeness). A language is near-optimal if there is hardly any possible language that is better. They looked at a number of different existing kinship systems (English, Northern Paiute, and a bunch of others) and found that all of them were near-optimal.
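The "better than" relation described here is what is usually called Pareto dominance, and it is easy to sketch. The scores below are invented purely for illustration (Kemp and Regier's actual measures and values are in the paper), as are the names KinSystem, is_better and share_of_better_systems.

```python
from typing import NamedTuple

class KinSystem(NamedTuple):
    name: str
    complexity: float          # lower means simpler
    communicative_cost: float  # lower means more informative

def is_better(a, b):
    """True if a beats b on one dimension and is no worse on the other."""
    no_worse = (a.complexity <= b.complexity
                and a.communicative_cost <= b.communicative_cost)
    strictly_better = (a.complexity < b.complexity
                       or a.communicative_cost < b.communicative_cost)
    return no_worse and strictly_better

def share_of_better_systems(target, alternatives):
    """Fraction of alternatives that are better than target (near 0 = near-optimal)."""
    return sum(is_better(alt, target) for alt in alternatives) / len(alternatives)

# Made-up scores for an attested system and four hypothetical competitors.
attested = KinSystem("attested", complexity=20, communicative_cost=3.0)
alternatives = [
    KinSystem("one word for everyone", 1, 9.0),   # simple but uninformative
    KinSystem("a word per relative", 60, 0.0),    # informative but complex
    KinSystem("facousin-style", 25, 3.5),         # worse on both dimensions
    KinSystem("slightly tidier", 19, 3.0),        # simpler, equally informative
]
print(share_of_better_systems(attested, alternatives))
```

Note that the two extreme systems do not count as better than the attested one: each wins on one dimension but loses on the other, which is exactly the trade-off described in the quoted passage. Only the last alternative dominates it.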

Evolution, Culture, or Development?

There are generally three ways of explaining any given behavior: evolution (we evolved to behave that way), culture (culture -- possibly through cultural evolution -- made us that way), or development (we learned to behave that way). For instance, it's rare to find people who chiefly eat arsenic. This could be because of evolution (we evolved to avoid arsenic because the arsenic-eaters don't have children and pass on their genes), cultural evolution (cultures that prized arsenic-eating all died out, leaving the non-arsenic cultures as the only game in town), or development (we learned as children, through trial and error, that eating arsenic is a bad idea). If I remember my Psych 101, food preferences actually involve all three.

What about kinship terms? If they are optimal, who do we credit with their optimality? Probably not development (we don't each individually create optimal kinship terms in childhood). Kemp and Regier seem to favor cultural evolution: over time, more useful kinship terms stuck in the lexicon of a given language and useless ones like "facousin" died out. It would be nice to show, however, that it is not actually genetic. This wouldn't have to be genes for kinship terms, but it could be genes that bias you to learn naming systems that are near-optimal (kinship naming systems or otherwise). One would need to show that these arose for language and not just cognition in general.

------
Kemp, C., and Regier, T. (2012). Kinship Categories Across Languages Reflect General Communicative Principles. Science, 336 (6084), 1049-1054. DOI: 10.1126/science.1218811

Still testing...

I was hoping to post the results of That Kind of Person today. When I announced the study two weeks ago, I estimated that it would take about two weeks to get enough data. For some reason, traffic on the site plummeted late last week.

So maybe one more week. As soon as I know the results, you will, and since this is (please let it be) the last experiment (#8!) for a paper, I am checking the numbers constantly. Many thanks to those who have already participated (those who haven't, you can find the experiment here; it shouldn't take more than 5 minutes).

Findings: Linguistic Universals in Pronoun Resolution - Episode II

A new paper, based on data collected through GamesWithWords.org, is now in press (click here for the accepted draft). Below is an overview of the paper.

Many of the experiments at GamesWithWords.org have to do with pronouns. I find pronouns interesting because, unlike many other words, the meaning of a pronoun is almost entirely dependent on context. So while "Jane Austen" refers to Jane Austen no matter who says it or when, "I" refers to a different person, depending mostly on who says it (but not entirely: an actor playing a part uses "I" to refer not to himself but to the character he's playing). Things get even hairier when we start looking at other pronouns like "he" and "she". This means that pronouns are a good laboratory animal for investigating how people use context to help interpret language.

Mice make lousy laboratory animals for studying the role of context in language.
Pronouns are better.

I have spent a lot of time looking at one particular contextual effect, originally discovered by Garvey and Caramazza in the mid-70s:

(1) Sally frightens Mary because she...
(2) Sally loves Mary because she...

Although the pronoun is ambiguous, most people guess that she refers to Sally in (1) but Mary in (2). That is, the verb used (frightens, loves) seems to affect pronoun resolution. Replace "frightens" and "loves" with other verbs, and what happens to the pronoun depends on the verb: some verbs lead to subject resolutions like frightens, some to object resolutions like loves, and some leave people unsure (that is, they think that either interpretation of the pronoun is equally reasonable).

The question is why. One possibility is that this is some idiosyncratic fact about the verb. Just as you learn that the past tense of walk is walked but the past tense of run is ran, you learn that some verbs lead you to resolve pronouns to the verb's subject and some to the verb's object (and some verbs have no preference). This was what was tentatively suggested in the original Garvey and Caramazza paper.

Does the meaning of the verb matter?

One of the predictions of this account is that there's nothing necessary about the fact that frightens leads to subject resolutions whereas loves leads to object resolutions, just as there is no deep reason that run's past tense is ran. English could have been different.

Many researchers have suspected that the pronoun effects we see are not accidental; the pronoun effects arise from some fundamental aspect of the meanings of frightens and loves. Even Garvey & Caramazza suspected this, but they were able to rule out every hypothesis they considered. Recently, using data from GamesWithWords.org, we presented some evidence that this suspicion is right. Interestingly, while researchers studying pronouns were busy trying to come up with some theory of verb meaning that would explain the pronoun effects, many semanticists were independently busy trying to explain verb meaning for entirely different reasons. Usually, they are interested in explaining things like verb alternations. So, for instance, they might notice that verbs for which the subject experiences an emotion about the object:

(3) Mary likes/loves/hates/fears John.

can take "that" complements:

(4) Mary likes/loves/hates/fears that John climbs mountains.

However, verbs for which the object experiences an emotion caused by the subject do not:

(5) Mary pleases/delights/angers/frightens John.
(6) *Mary pleases/delights/angers/frightens that John climbs mountains.

[The asterisk means that the sentence is ill-formed in English.]

Linguists working on these problems have put together lists of verbs, all of which have similar meanings and which can be used in the same way. (VerbNet is the most comprehensive of these.) Notice that in this particular work, "please" and "frighten" end up in the same group as each other, while "like" and "fear" are in a different one: Even though "frighten" and "fear" are similar in terms of the emotion they describe, they have a very different structure in terms of who -- the subject or the object -- feels the emotion.

We took one such list of verb classes and showed that it explained the pronoun effect quite well: Verbs that were in the same meaning class had the same pronoun effect. This suggests that meaning is what is driving the pronoun effect.
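For readers who like to see the logic spelled out, here is a minimal sketch of that kind of comparison in Python. It is not the analysis from the paper. The class labels ("experiencer-subject" for verbs like love and fear, "experiencer-object" for verbs like frighten and please), the function name and, above all, the numbers are assumptions made up purely for illustration. The idea is simply to group verbs by meaning class and ask how often, on average, people resolve the pronoun to the subject.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (verb, verb class, proportion of subject resolutions).
# These proportions are invented for illustration; they are NOT results
# from the paper or from GamesWithWords.org.
TOY_DATA = [
    ("frighten", "experiencer-object", 0.80),
    ("please", "experiencer-object", 0.75),
    ("anger", "experiencer-object", 0.85),
    ("love", "experiencer-subject", 0.20),
    ("fear", "experiencer-subject", 0.25),
    ("hate", "experiencer-subject", 0.15),
]


def subject_bias_by_class(rows):
    """Return the mean proportion of subject resolutions for each verb class."""
    by_class = defaultdict(list)
    for _verb, verb_class, prop_subject in rows:
        by_class[verb_class].append(prop_subject)
    return {cls: mean(props) for cls, props in by_class.items()}


if __name__ == "__main__":
    # Print one line per verb class, sorted by class name.
    for cls, bias in sorted(subject_bias_by_class(TOY_DATA).items()):
        print(f"{cls}: {bias:.2f}")
```

If meaning class really drives the effect, verbs within a class should cluster together and the class averages should be far apart. A real analysis would also need to look at the spread within each class, not just the means.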

Or does it?

If the pronoun effect is driven by the meaning of a verb, then it shouldn't matter what language that verb is in. If you have two verbs in two languages with the same meaning, they should both show the same pronoun effect.

We aren't the first people to have thought of this. As early as 1983, Brown and Fish compared English and Mandarin. The most comprehensive study so far is probably Goikoetxea, Pascual and Acha's mammoth study of Spanish verbs. The problem was identifying cross-linguistic synonyms. Does the Spanish word asustar mean frighten, scare, or terrify?
Is this orangutan scared, frightened or terrified? Does it matter?

Once we showed that frighten, scare and terrify all have the same pronoun effect in English, the problem disappeared. It no longer mattered what the exact translation of asustar or any other word was: Given that entire classes of verbs in English have the same pronoun effect, all we needed to do was find verbs in other languages that fit into the same class.

We focused on transitive verbs of emotion. These are the two classes already introduced: those where the subject experiences the emotion (like/love/hate/fear) and those where the object does (please/delight/anger/frighten) (note that there are quite a few of both types of verbs). We collected new data in Japanese, Mandarin and Russian (the Japanese and Russian studies were run at GamesWithWords.org and/or its predecessor, CogLangLab.org) and re-analyzed published data from English, Dutch, Italian, Spanish, and Finnish.

Results for English verbs (above). "Experiencer-Subject" verbs are the ones like "fear" and "Experiencer-Object" are the ones like "frighten". You can see that people were consistently more likely to think that the pronoun in sentences like (1-2) referred to the subject of Experiencer-Object verbs than Experiencer-Subject verbs.

The results are the same for Mandarin (above). There aren't as many dots because we didn't test as many of the verbs in Mandarin, but the pattern is striking.

The Dutch results (above). The pattern is again the same. Again, Dutch has more of these verbs, but the study we re-analyzed had only tested a few of them.

You can read the paper and see the rest of the graphs here. In the future, we would like to test more different kinds of verbs and more languages, but the results so far are striking, and suggest that the pronoun effect is caused by what verbs mean, not some idiosyncratic grammatical feature of the language. There is still a lot to be worked out, though. For instance, we're now pretty sure that some component of meaning is relevant to the pronoun effect, but which component and why?

------------
Hartshorne, J., and Snedeker, J. (2012). Verb argument structure predicts implicit causality: The advantages of finer-grained semantics. Language and Cognitive Processes, 1-35. DOI: 10.1080/01690965.2012.689305

Goikoetxea, E., Pascual, G., and Acha, J. (2008). Normative study of the implicit causality of 100 interpersonal verbs in Spanish. Behavior Research Methods, 40 (3), 760-772. DOI: 10.3758/BRM.40.3.760

Garvey, C., and Caramazza, A. (1974). Implicit causality in verbs. Linguistic Inquiry, 5 (3), 459-464.

Brown, R., and Fish, D. (1983). Are there universal schemas of psychological causality? Archives de Psychologie, 51, 145-153.

New Experiment: That Kind of Person

I just got back reviews on one of the pronoun papers. Although the paper already had seven experiments, they want two more. The worst part about it is that they are right.

Luckily, the experiment they asked for can be done online. It takes about 5 minutes. Native English speakers preferred (though I look at all data).

That Kind of Person (takes about 5 minutes)

My target is to post the results for this and the seven previous experiments in 2 weeks ... if I get enough participants quickly. Thank you in advance to everyone who participates.

Is Psychology a science?: Redux

The third-most read post on this blog is "Is Psychology a science?". I was a few years younger then and still had strong memories of one of my friends complaining, when we were both undergraduates, that he had to take a psychology course as part of his science distributional requirements. "Psychology isn't a science," he said, "because they don't do experiments." Since he was telling me this over AIM as I was sitting in my psychology laboratory, analyzing an experiment, it didn't go over well.

It's been a popular post, but I haven't written about the subject much since, in part because I started to suspect that the "psychology isn't a science" bias might actually be a thing of ignorant undergraduates and a few cranks. It's something I've rarely heard in the last few years, and there's no need to write diatribes against a nonexistent prejudice.

In retrospect, maybe I haven't come across these opinions because I mostly hang out with other psychologists. A colleague recently forwarded me this blog post ("Keep Psychology out of the science club"), which links to a few other similar pieces on blogs and in newspapers. So it seems the issue is alive and well.

Some articles one comes across are of the "psychologists don't do experiments" variety; these are easily explained by ignorance and an inability to use Google. But some folks raise some real concerns which, while I think they are misplaced, really are worth thinking about.


Psychology is too hard

One common theme that I came across is that psychology is simply too difficult. We'll never understand human behavior very well, so maybe we shouldn't even try. For instance, Gary Gutting, writing in the Opinionator at the New York Times, said:
Social sciences may be surrounded by the "paraphernalia" of the natural sciences, such as technical terminology, mathematical equations, empirical data and even carefully designed experiments. But when it comes to generating reliable scientific knowledge, there is nothing more important than frequent and detailed predictions of future events ... while the physical sciences produce many detailed and precise predictions, the social sciences do not ... Because of the many interrelated causes at work in social systems, many questions are simply "impervious to experimentation" ... even when we can get reliable experimental results, the causal complexity restricts us...
In a Washington Post editorial, Charles Lane wrote:
The NSF shouldn't fund any social science. Federal funding for mathematics, engineering and other "hard" sciences is appropriate. In these fields, researchers can test their hypotheses under controlled conditions; then those experiments can be repeated by others. Though quantitative methods may rule economics, political science and psychology, these disciplines can never achieve the objectivity of the natural sciences. Those who study social behavior -- or fund studies of it -- are inevitably influenced by value judgments, left, right, and center. And unlike hypotheses in the hard sciences, hypotheses about society usually can't be proven or disproven by experimentation. Society is not a laboratory.
Alex Berezow at the Newton Blog agrees:
Making useful predictions is a vital part of the scientific process, but psychology has a dismal record in this regard.
Is that a fair critique?

These writers don't entirely miss the mark. It really is true that psychology does not make as precise or as accurate predictions as, say, physics. That is not the same thing as saying that we can't make any predictions. Berezow complains about happiness research:
Happiness research is a great example of why psychology isn't a science. How exactly should "happiness" be defined? The meaning of the word differs from person to person, and especially between cultures. What makes Americans happy doesn't necessarily make Chinese people happy. How does one measure happiness? Psychologists can't use a ruler or a microscope, so they invent an arbitrary scale. Today, personally, I'm feeling about a 3.7 out of 5. How about you? ...  How can an experiment be consistently reproducible or provide any useful predictions if the basic terms are vague and unquantifiable?
That's a great question! Let's start with the facts. It is true that we don't know exactly what it means to be a 3.7 on a scale of 1-5. But we do know a few interesting things.

People's predictions of how happy they will rate themselves in the future are systematically biased. People will say that good things (like getting tenure) will make them very happy (a 5 out of 5) and that bad things (like not getting tenure) will make them very sad (a 1 out of 5). But when you ask those same people to rate their happiness a little while after the event, they generally rate themselves as not nearly so happy or unhappy as they predicted. (Similarly, people who lose a limb usually rate themselves as about as happy afterwards as before, provided you give them a little time to adjust.) People who have children normally see a drop in how happy they rate themselves. They only start to recover after their children leave the nest. There is also the "future anhedonia" effect: people think that good things (e.g., an ice cream sundae) will make them happier if they happen now (on our 1-5 scale) than those same good things would if they happened in the future, and conversely for bad things (e.g., doing my homework won't feel so bad if I do it tomorrow rather than today). And so on. (These and many other examples can be found in Dan Gilbert's Stumbling on Happiness.)

These and other findings are highly reliable, despite the fact that we don't have a direct, objective measurement of happiness. In fact, as Dan Gilbert has pointed out, we would only consider that "direct" measurement to be a measurement of happiness if it correlated really well with how happy people said they were. To the extent it diverged from how happy people claim to be, we would start to distrust the "direct" measurement.

I personally am glad that we know what we know about happiness, though I wish we knew more. I picked happiness to defend because I've noticed that even those who defend psychology in comments sections give up happiness research as a lost cause. I think it's pretty interesting, useful work. It would be even easier to defend, for instance, low-level vision research, which makes remarkably precise predictions, has clear theories of the relationship between the psychological phenomena and the neural implementations, etc. (See also this post for some psychology success stories.)

Just how good do you need your predictions to be?

Still, it is true that we can't always make the precise predictions that can be made in some other fields. Of course, other fields can't always make precise predictions, either. While physicists are great at telling you what will happen to rigid objects moving through vacuums, predicting the motions of real objects in the real world has traditionally been a lot harder, and understanding fluid dynamics has been deeply problematic (though I understand this has been getting a lot better in recent years). And that's without pulling out the Heisenberg Uncertainty Principle, which should cause anyone who wants precise, deterministic predictions to declare physics a non-science.

Also, some parts of psychology are able to make much more precise predictions than others do. Anything amenable to psychophysics tends to be much more precise, and vision researchers, as already noted, have remarkably well worked-out theories of low- and mid-level vision.

This line of discussion also raises an interesting question: when exactly did physics become a science? Was it a science in Newton's day, when we still knew squat about electromagnetism -- much less elementary particles -- and couldn't make even rough predictions about turbulent air or fluid systems? And to people 350 years from now, will the physics of today seem like a "real" science? (My guess: no.)

Worries

Berezow ends his post with the following caution:
To claim [psychology] is a "science" is inaccurate. Actually, it's worse than that. It's an attempt to redefine science. Science, redefined, is no longer the empirical analysis of the natural world; instead, it is any topic that sprinkles a few numbers around. This is dangerous, because, under such a loose definition, anything can qualify as science. And when anything qualifies as science, science can no longer claim to have a unique grasp on secular truth.
I have a different worry. My worry is that someone gets ahold of a time machine, goes back in time to 1661 and convinces Newton to lay off that non-scientific "physics" crap. Pre-Newtonian physics was a hodgepodge of knowledge, little resembling what we think of as science today. Making precise predictions about the messy, physical world we live in no doubt seemed an impossible pipe dream to many. Luckily, folks like Newton kept plugging away, and three and a half centuries later, here we are.

We should keep in mind that the serious study of the mind only began in the mid-1800s; physics has a significant head start. And, as the anti-psychology commentators are happy to point out, psychology is much, much harder than physics or chemistry. But the only reason I can see to pull the plug is if we are sure that (a) we have learned nothing in the last 150 years, and (b) we will never make any further progress. These are empirical claims and so subject to test (I think the first one has already been falsified). So here's a proposed experiment: psychologists keep on doing psychology, and people who don't want to don't have to. And we'll wait a few decades and see who knows more about the human mind.