Field of Science

Everlasting Love

I just got back data from a survey in which we asked people to estimate how long different emotions are likely to last. We'll use this information to design a future experiment looking at how people expect emotions to be encoded in language. In the meantime, what struck me is that of all the emotions we asked about, the one that people expected to last the longest was "being head-over-heels in love". Which is awesome.

(Image courtesy of Faizal Sharif)

New Experiment: The Vocab Quiz

Curious how good your vocabulary is? I just posted a new experiment that will tell you. There are 32 questions. At the end, you'll see your score and how it compares with others who have done the experiment. This should be a fairly hard test. I piloted it on around 40 people, and only a few managed to get all the questions right. Then I made it harder. You can find the experiment here.

What is the purpose of the experiment?

We are interested in why some people have better vocabularies than others. So before you take the test, you'll answer some questions about your background, such as your age, level of education, and birth order. The predictions for age and level of education are probably fairly obvious. The predictions for birth order are less clear. Some researchers would predict that eldest children will have better vocabularies (they spent more time with their parents and so got a jump start). Others would predict that the youngest would have better vocabularies (they had extra teachers in the home!). Still other researchers would argue that birth order (being the oldest or youngest, etc.) should have no effect on vocabulary, because they argue that pretty much nothing is affected by birth order.

We are particularly interested in people for whom English is a second language. What factors lead some people to easily acquire a second language and others not?

Take the Vocab Quiz.


New Experiment: The Language & Memory Test

There is a close relationship between language and memory, since of course whenever you use words and grammar, you have to access your memory for those words and that grammar. If you couldn't remember anything, you couldn't learn language to begin with.

The relationship between language and memory is not well understood, partly because they tend to be studied by different people, though there are a few labs squarely interested in the relationship between language and memory, such as the Brain and Language Lab at Georgetown University.

This week, I posted a new experiment, "The Language & Memory Test", which explores the relationship between memory and language. The experiment consists of two components. One is a memory test. At the end, you will see your score and how it compares with other people who took the test. This test is surprisingly hard for how simple it seems.

In the other part, you will try to learn to use some new words. We'll be studying the relationship between different aspects of your memory performance and how you learn these new words. As always, there will be a bit more explanation at the end of the experiment. When the experiment is done and the results are known, there will be a full description of them and what we learned here at the blog and at GamesWithWords.org.

Try the Language & Memory test here.

New Experiment: Collecting Fancy Art

Over the last few years, we've run a lot of experiments online at GamesWithWords.org, resulting so far in four publications, with a number of others currently under review at various journals. Most of these experiments have focused on how people process and interpret language. I just posted a new experiment (Collecting Fancy Art) that is more squarely focused on learning language. Language learning experiments are somewhat tricky to do online, since they tend to take longer than the 5-10 minute format of most online experiments, but they are important.

One of the most salient truths about language is that language has to be learned. This is clearly pretty hard, or other animals would be able to do it and we'd already have computers that were pretty good at language. But just how the learning process happens is a bit of a mystery, partly because language is a complex, interconnected system. When you learn one word, it affects how you use other words.

In this experiment, you will simultaneously learn the meanings of three different words. We're interested in seeing how your understanding of these words develops. As always, you'll learn more about the experiment at the end. And check back here in the future: After the experiment is completed, the results will be posted here.

The experiment is called "Collecting Fancy Art". You can find it here.

Lab Notebook: Social Networking

The problem with websites is they quickly become obsolete. A few years ago, I updated the website to make it easier to share pages, adding buttons for Facebook, Twitter, Digg, and Reddit. A little while ago, I noticed that the Digg button wasn't working anymore. Then the Twitter button disappeared. 

I just updated the website, switching from native buttons for social networking systems to ShareThis. ShareThis has the advantage of incorporating every social networking system you've heard of and a bunch you haven't heard of (I've put Google+, Facebook, Twitter, Tumblr, and email up front, but by clicking on the ShareThis button, users can choose from dozens of networks). 

Fieldofscience (the network this blog is a part of) has been using ShareThis for a couple years. However, it went through several periods where it wasn't working. Periodically, it would have memory failures, and posts that had once had dozens of likes suddenly went to zero. But lately it seems much more stable, so I'm trying it out.

The disadvantage is that every page says that it hasn't been liked by anybody, which isn't great advertising for the website. (*UPDATE* We've got a few shares now on some of the pages.) I hope this changes quickly.

The $64,000 question is, of course, whether this update changes the overall amount of traffic to the website. It's been averaging around 2,000 visitors/month for a couple years now. That's very respectable for a research website. However, many of the experiments now running (like the Mind Reading Quotient and Finding Explanations) require large numbers of participants, and they would really benefit from an uptick in traffic.

Who you gonna believe: E. O. Wilson or common sense?

I was planning a post on E. O. Wilson's recent flight of fancy, "Great Scientist ≠ Good at Math", in which he tells potential future scientists that knowing math isn't all that important, but it turns out Jeremy Fox has already said everything I was going to say, only better. It's a long post, though, so here are some key passages:
Wilson’s claim that deep interest in a subject, combined with deep immersion in masses of data, is sufficient, because hey, it worked for Charles Darwin, is utter rubbish. First of all, just because it worked for Darwin (or Wilson) doesn’t mean it will work for you, and just because it worked in the 19th century doesn’t mean it will work in the 21st. If for no other reason than that there are plenty of people out there, in every field, who not only have a deep interest in the subject and an encyclopedic knowledge of the data, but who know a lot of mathematics and statistics.
and

Wilson claims that strong math skills are relevant only a few disciplines, like physics. Elsewhere, great science is a matter of “conjuring images and processes by intuition”... I’m sure Wilson is describing his own approach here, and it’s worked for him. But I have to say, it’s surprising to find someone as famous for his breadth of knowledge as E. O. Wilson generalizing so unthinkingly from his own example. I wonder what his late collaborator Robert MacArthur would think of the notion that intuition alone is enough. I wonder what Bill Hamilton would think. Or R. A. Fisher. Or J. B. S. Haldane. Or Robert May. Or John Maynard Smith. Or George Price. Or Peter Chesson. Or Dave Tilman. Or lots of other great ecologists and evolutionary biologists I could name off the top of my head. Would Wilson seriously argue that none of those people were great scientists, or that they never made any great discoveries, or that the great discoveries they made arose from intuition unaided by mathematics?
Meanwhile, over at Finding the Next Einstein, Jonathan Wai draws on his own research to argue that mathematics ability is key to success in a wide range of scientific fields (though these data are unfortunately correlational).

International Journal of Lousy Research

Jeffrey Beall's blacklist of "predatory open-access journals" -- discussed in yesterday's New York Times -- provides evidence for my long-standing suspicion of any journal named "International Journal of ..." There probably are some good journals named "International Journal of...", but I don't know of any off-hand. And there seem to be an awful lot of bad ones, probably for good reason: An internationally-recognized journal doesn't have to say so. So almost by definition a journal that has to call itself "International Journal of" is probably not a well-known journal.

In general, nearly every journal on the list has some location in its name, such as South Asian Journal of Mathematics, which doubles down by referring to itself on its home page as an "international journal". Again, there are, of course, good journals with region-specific names. But there don't seem to be many. I'm less sure of the reason for this one.

[Future Post: Explaining why universities that market themselves as "The Harvard of" some region are frequently not even the most prestigious school in that region.]

Laying to rest an old myth about Chinese

I just got back from my second research trip to Taiwan in three years (with another planned soon!) and fourth trip overall. As always, I had a great time and ate as much beef noodle soup as I could manage.


As always, I spent a couple months beforehand brushing up my reading and writing. This isn't something I have to do before trips to Spain or Russia. A few hours spent learning Spanish or Russian orthography, and you are set for life. But as soon as I blink, I forget how to read and write Chinese. This is because, as is well known, rather than a couple dozen phonetic symbols, Chinese employs thousands of characters which, if you don't use them for a while, you end up confusing.

This isn't just a problem for foreigners. Students in Taiwan (and China or Japan, I assume) continue investing significant amounts of time into learning to read and write additional characters well through secondary school. This raises the question of why Chinese speakers don't just adopt a phonetic writing system.

Problems with a Chinese phonetic writing system

The argument one often hears is that Chinese has so many homophones (words that sound alike) that if you wrote them all the same way, there would be so much ambiguity that it would be impossible to read. The character system solves this by having different characters for different words, even ones that sound alike.

In the last century, when switching to a phonetic system was proposed, a scholar illustrated this problem with the following poem, which reads something like this:
Shi shi shi shi shi shi, shi shi, shi shi shi shi. Shi shi shi shi shi shi shi shi shi, shi shi shi shi shi, shi shi, shi shi shi shi shi. Shi shi shi shi shi, shi shi shi, shi shi shi shi shi shi. Shi shi shi shi shi shi, shi shi shi. Shi shi shi, shi shi shi shi shi shi. Shi shi shi, shi shi shi shi shi shi shi shi. Shi shi shi shi shi shi shi shi shi shi shi shi shi. Shi shi shi shi.
As written, this is incomprehensible. Only if you write it in characters does the meaning become clear:
A poet named Shi lived in a stone house and liked to eat lion flesh and he vowed to eat ten of them. He used to go to the market in search of lions and one day chanced to see ten of them there. Shi killed the lions with arrows and picked up their bodies carrying them back to his stone house. His house was dripping with water so he requested that his servants proceed to dry it. Then he began to try to eat the bodies of the ten lions. It was only then he realized that these were in fact ten lions made of stone. Try to explain the riddle.
Problems with this argument

This argument sounds compelling until you realize that what is being claimed is that you can't understand a Chinese sentence based on its sound alone. This means that not only is it impossible to understand phonetically-written Chinese, it is impossible to understand spoken Chinese (which, like phonetically-written Chinese, doesn't have any characters to help disambiguate similar-sounding words). Since a billion people speak Mandarin Chinese every day, there must be a problem with this argument!

There are a few. First of all, I wrote the poem phonetically ignoring the five Chinese tones. Like many languages, Chinese uses intonation phonetically -- an 'i' with a rising tone is different from an 'i' with a falling tone. Writing a tonal language without tones is like writing English without vowels -- much harder to read. Similarly, the phonetic writing above does not have any breaks between words, making it much harder to read (imaginewritingEnglishwithoutspacesbetweenwords). True, written Chinese doesn't mark word boundaries, but then it has all the extra information encoded in the characters to help with any ambiguity.

Second, this poem uses very archaic Chinese (different vocabulary and different grammar than modern Mandarin). It's not clear how many people would understand the poem spoken aloud. Wikipedia gives a nice translation of the poem into modern Mandarin, which involves many different sounds, not just 'shi'.

The most important problem is that there actually is a perfectly good phonetic system for writing Chinese. Actually, there are several, but the most common is pinyin. People can and do write entire texts in pinyin.

Why care? 

Why go to the effort of debunking this myth? This often comes up in arguments over whether the Chinese should adopt a new writing system, but that's not really my concern. Very often, there is a tendency to believe that different cultures and languages are much more different from one another than they are. One hears about strange aspects of other languages without even pausing to think about the fact that your own language has many of those same features. The writing systems of English and Chinese are actually alike in many ways (both are partially phonetic and partially semantic -- a topic for a different post). I can only speak for myself, but the more I learn about a given language, usually the less foreign it seems. Which is a fact worth thinking about.

Update on DuoLingo

I have been using Duolingo for a few months to brush up my Spanish. I have generally found it to be pretty useful and a significant improvement over my previous strategy, which was to listen to the news in Spanish. So I was interested to see a report on the effectiveness of Duolingo.

Even though most people enrolled in the study did not actually spend much time using Duolingo (only a handful managed more than 30 hours in two months), there was a statistically significant improvement. How much improvement? The report estimates that a person with no prior knowledge of Spanish would be able to place into 2nd-semester Spanish after 34 hours with Duolingo.

Reasons for skepticism

While this is certainly good to see, Duolingo goes a bit far in concluding that this means Duolingo is more effective than a university course. It might be true, but these aren't the kind of data you would want to show it. What we would want to know is how much the participants in this study would have learned if they had taken one semester of college Spanish. I doubt the answer is "exactly enough to place into 2nd-semester Spanish on the study's placement test." One reason is that placement tests are designed to tell you whether someone has enough background to take a class, not whether they know exactly as much as the typical student starting that class. A second reason is that the study population is very different from your typical undergrad. In fact, nearly 3/4 of them had graduated from college already, and over 1/4 had a graduate degree. These are people who are highly experienced at education and who have been very successful at it, and either fact might make you expect them to learn faster than your typical college freshman. Then there's the fact that the study doesn't seem to control for whether participants were using any other methods to learn Spanish at the same time (like taking a class).

I suppose my main reason for being skeptical is that while I find Duolingo incredibly useful for learning nouns and adjectives and for practicing what I already know, I've found it less useful for learning grammar or verbs. Grammar is not explicitly taught at all (you're supposed to work out the rules of grammar from seeing example sentences). There are a lot of Spanish verbs with irregular endings, and Duolingo gives you no information about those (except what you might glean from seeing an example sentence with such a verb).

Perhaps this is closer to how children learn (though not really -- Duolingo is all about translation, and generally children don't learn their first language by learning how to translate it into another language!), but I suspect there's a reason that language classes the world over explicitly teach you grammar rules. Babies might not need it, but adults seem to.

What Duolingo is good for

This doesn't mean I've got it in for Duolingo. As I said, I've been using it and intend to keep on using it. Much of Duolingo consists of trying to create new sentences in Spanish and then getting feedback on whether you did it right or not. This is fantastic practice, maybe even better than what you'd get in an immersion environment (in which you create sentences but don't always get feedback), and I highly recommend it to anyone trying to revive moribund language skills or as an addition to an ongoing course of study. I just don't see it standing all by itself.

The other useful tidbit from this study: Most people who started using Duolingo quit, and quit quickly. Which is a reminder that the limiting factor in language learning is not what textbook or website you use, but your own dedication.

Chemistry has its own problems with replication, according to Nature:
Scrounging chemicals and equipment in their spare time, a team of chemistry bloggers is trying to replicate published protocols for making molecules. The researchers want to check how easy it is to repeat the recipes that scientists report in papers ... Among the frustrations [chemists] have experienced with the chemical literature ... are claims that reactions yield products in greater amounts than seem reasonable, and scanty detail about specific conditions in which to run reactions. In some cases, reactions are reported which seem too good to be true - such as a 2009 paper which was corrected within 24 hours by web-savvy chemists live-blogging the experiment.
It's hard to tell from the article how common it is for a reaction simply not to be possible at all as opposed to simply producing less product than reported. Presumably either is problematic, but the causes would be different.

Given the recent excitement about (non-)replication, one has to wonder if this problem is more or less common than in the past. While my gut instinct is that replication was probably less of a problem in the earlier, smaller days of science, it's also quite possible that it's like many forms of violent crime: extremely rare today by historical standards, but we care much more about it.

What makes psychology and neuroscience hard

Explained by today's XKCD:

Ambrose Bierce pointed out the same problem in his 1911 satirical dictionary (The Devil's Dictionary):
Mind, n. A mysterious form of matter secreted by the brain. Its chief activity consists in the endeavor to ascertain its own nature, the futility of the attempt being due to the fact that it has nothing but itself to know itself with. 

Fractionating IQ

Near the dawn of the modern study of the mind, the great psychological pioneer Charles Spearman noticed that people who are good at one kind of mental activity tend to be good at most other mental activities. Thus, the notion of g (for "general intelligence") was born: the idea that there is some underlying factor that determines -- all else equal -- how good someone is at any particular intelligent task. This of course fits folk psychology quite well: g is just another word for "smarts".

The whole idea has always been controversial, and many people have argued that there is more than one kind of smarts out there (verbal vs. numeric, logical vs. creative, etc.). Enter a recent paper by Hampshire and colleagues (Hampshire, Highfield, Parkin & Owen, 2012), which tries to bring both neuroimaging and large-scale Web-based testing to bear on the question.

In the neuroimaging component, they asked sixteen participants to carry out twelve difficult cognitive tasks while their brains were scanned and applied principal components analysis (PCA) to the results. PCA is a sophisticated statistical method for grouping things.

A side note on PCA

If you already know what PCA is, skip to the next section. Basically, PCA is a very sophisticated way of sorting things. Imagine you are sorting dogs. The simplest thing you could do is have a list of dog breeds and go through each dog and sort it according to its breed.

What if you didn't already have a dog breed manual? Well, German shepherds are more similar to one another than any given German shepherd is to a poodle. So by looking through the range of dogs you see, you could probably find a reasonable way of sorting them, "rediscovering" the various dog breeds in the process. (In more difficult cases, there are algorithms you could use to help out.)

That works great if you have purebreds. What if you have mutts? This is where PCA comes in. PCA assumes that there are some number of breeds and that each dog you see is a mixture of those breeds. So a given dog may be 25% German Shepherd, 25% border collie, and 50% poodle. PCA tries to "learn" how many breeds there are, the characteristics of those breeds, and the mixture of breeds that makes up each dog -- all at the same time. It's a very powerful technique (though not without its flaws).
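To make that concrete, here is a minimal sketch in Python (assuming NumPy and scikit-learn are available; the breed profiles and numbers are invented for illustration). Each simulated "mutt" is built as a mixture of a few underlying "breed" profiles, and PCA notices that a small number of components account for nearly all of the variation in the observed traits.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Three hypothetical "breed" profiles, each described by five made-up traits.
breeds = np.array([
    [9.0, 2.0, 1.0, 7.0, 3.0],   # "shepherd"
    [3.0, 8.0, 6.0, 2.0, 5.0],   # "poodle"
    [5.0, 4.0, 9.0, 6.0, 1.0],   # "collie"
])

# Each simulated "mutt" is a random mixture of the three profiles, plus noise.
n_dogs = 500
mixtures = rng.dirichlet([1.0, 1.0, 1.0], size=n_dogs)   # each row sums to 1
dogs = mixtures @ breeds + rng.normal(scale=0.3, size=(n_dogs, breeds.shape[1]))

# PCA looks for a small set of components that account for most of the
# variation across dogs -- in effect, noticing that the five observed traits
# boil down to a couple of underlying dimensions of mixture.
pca = PCA(n_components=3)
pca.fit(dogs)
print("variance explained per component:", pca.explained_variance_ratio_.round(2))
```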

Neuroimaging intelligence

Analysis focused only on the "multiple demands" network previously identified as being related to IQ and shown in red in part A of the graph below. PCA discovered two underlying components that accounted for about 90% of the variance in the brain scans across the twelve tasks. One was particularly important for working memory tasks, so the authors called it MDwm (see part B of the graph below), and it involved mostly the IFO, SFS and ventral ACC/preSMA (see part A below for locations). The other was mostly involved in various reasoning tasks and involved more IFS, IPC and dorsal ACC/preSMA.


Notice that all tasks involved both factors, and some tasks (like the paired associates memory task) involved a roughly equal portion of each.

Sixteen subjects isn't very many

The authors put versions of those same twelve tasks on the Internet. They were able to get data from 44,600 people, which makes it one of the larger Internet studies I've seen. The authors then applied PCA to those data. This time they got three components, two of which were quite similar to the two components found in the neuroimaging study (they correlated at around r=.7, which is a very strong correlation in psychology). The third component seemed to be particularly involved in tasks requiring language. Most likely that did not show up in the neuroimaging study because the neuroimaging study focused on the "multiple demands" network, whereas language primarily involves other parts of the brain.

The factors dissociated in other ways as well. Whereas people's working memory and reasoning abilities start to decline about the time people reach the legal drinking age in the US (coincidence?), verbal skills remain largely undiminished until around age 50. People who suffer from anxiety had lower than average working memory abilities, but average reasoning and verbal abilities. Several other demographic factors similarly had differing effects on working memory, reasoning, and verbal abilities.

Conclusions

The data in this paper are very pretty, and it was a particularly nice demonstration of converging behavioral and neuropsychological methods. I am curious what the impact will be. The authors are clearly arguing against a view on which there is some unitary notion of IQ/g. It occurred to me as I wrote this that while I've read many papers lately discussing the different components of IQ, I haven't read anything recent that endorses the idea of a unitary g. I wonder if there is anyone who does, and, if so, how they account for this kind of data. If I come across anything, I will post it here.


------
Hampshire, A., Highfield, R., Parkin, B., & Owen, A. (2012). Fractionating human intelligence. Neuron, 76(6), 1225-1237. DOI: 10.1016/j.neuron.2012.06.022

Professor -- The Easiest Job in the World

There has been a small kerfuffle over Susan Adams's article at Forbes, titled "The least stressful jobs of 2013":
University professors have a lot less stress than most of us. Unless they teach summer school, they are off between May and September and they enjoy long breaks during the school year, including a month over Christmas and New Year's and another chunk of time in the spring. Even when school is in session they don't spend too many hours in the classroom ... Working conditions tend to be cozy and civilized and there are minimal travel demands...
She also mentions the great job prospects ("Universities are expected to add 305,700 adjunct and tenure-track professors by 2020").

To her credit, Adams has added a sizable addendum to her article, correcting -- but not apologizing for -- her mistakes. Unfortunately, this is far from the first time this kind of article has appeared in a major publication. Some time back, a columnist for the New York Times wrote an article suggesting that the solution to rising costs of higher education was to make professors work more than a few hours a week. An article in the New Yorker casually noted that the new head of a particular company was concerned that his employees worked "the hours of college professors" (I initially assumed they meant "way too hard" and that the boss wanted them to take a break!). What gives?

Scicurious suggests it's the curse of half-knowledge:
The vast majority of us aren't teachers or professors, but we've all been students, right? ... We thought that, because of what we saw of them in our classes, we knew what they did ... Because of this half-knowledge, people make assumptions about our jobs, assumptions that can really affect how we are perceived as people...
 That is no doubt part of it, but it also requires that people not think very hard. If I heard that someone made a pretty good living working only a few hours a week, it would immediately set off my implausibility alarm. I mean, what are the chances? And you'd only have to think for a moment to realize this can't be true.

Adams got hundreds of comments and letters pointing out that professors, in addition to giving a few lectures a week, also grade papers, advise students, write papers and books, go to conferences, give invited talks, etc. Adams presents this as if it came as a surprise, but that seems equally implausible. I'm going to assume she's read one or two articles about medicine or science, and the people discussed in such articles are inevitably professors. In fact, articles about politics occasionally cite professors as well. If she went to college, she knows that professors have office hours and grade papers. Many of the books on science and politics in the bookstore are written by faculty, as are essentially all college textbooks.

Even if she had never attended college, never interacted with a professor, and didn't read articles about higher education, a few minutes of Googling prior to writing her article would have corrected that mistake. My guess is that she didn't really think about her article before writing it and didn't consult either her own memory or Google because she -- and the others who write similar articles -- wanted this crazy claim about the lazy professor to be true. The interesting question is why she wanted it to be true. Anti-intellectualism? A desire to believe that such cushy jobs really exist? Or is this just an example of one of those ideas that are crazy enough that they inspire belief (like one of those many apocryphal "weird facts")?

*I do realize that some professors do very little work. Some people in all professions do very little work.

Transferring Consciousness

My brother was just in town, and we had our usual argument about Old Man's War, which he loves and about which I'm less enthusiastic (it was a fun read, but...). Perhaps one issue that keeps me from enjoying it fully is that whenever I think about it I think about an early scene, in which a character's consciousness was transferred from an old body to a new body. This is presented in the book as just one more futuristic miracle, but I can't stop thinking about the deeper questions it raises.

What does it mean to transfer consciousness from one body to another? Our current scientific understanding is that there is no consciousness separate from the underlying physical machinery, so such a transfer could not happen. But you might be able to create the illusion of a consciousness transfer, which I explain below. So we can make sense of Old Man's War if we assume that the doctors are deliberately lying about what is going on, covering up the murder that lies at the heart of the procedure.

Here's what might be going on (yes, I realize this is fiction, but good science fiction almost always has a thought experiment at its heart): It should be possible, at least in principle, to create a new body that has identical machinery to an existing body. This would be a new person who is a twin not only physically but mentally, down to having the same memories (by definition, since they have the same brains down to the microcircuitry). From the new person's perspective, he finds himself suddenly in a "new" body. (This is much like the old philosophical puzzle: what if the world was created yesterday, all of us with artificial memories?)

So now we've got a consciousness that believes itself to have transferred into a new body from an old body. What happened to the consciousness in the old body? The doctors in Old Man's War claim that it is now a vegetable, with no consciousness inside, because that consciousness has transferred. Since that can't happen, they are lying: either the process of creating the new copy of the old brain destroys the old brain, or the doctors deliberately destroy the old brain to preserve the illusion of the transfer (after all, if transfer is impossible, why go through this procedure? It's very nice for your twin to have a new body, but it's not going to do you any good at all!).

Here's the question: does this matter? If John undergoes this procedure on a Wednesday, then the world on Thursday is much the same as the world on Tuesday: on both days, there is a consciousness calling itself "John" with roughly the same memories. It only gets tricky when you think too much about Wednesday. You might be tempted to say you have John 1 on Tuesday and John 2 on Thursday, who are duplicates but nonetheless not the same because they have different bodies. But, of course, John had a different body when he was 5yo than when he was 75yo, down even to being made up of different atoms. So if we're willing to call 5yo John and 75yo John the same person, why aren't John 1 and John 2?

This confuses the heck out of me, which is why I have difficulty paying attention to the novel itself.

Eighteen papers on replication

Last month, Perspectives on Psychological Science published a special section on replicability in psychological science. With 18 papers, it took me a while to get through all of them, so I am only blogging about them now.

The issue contains many useful articles, particularly if you have not been following the replicability crisis/discussion carefully. I have been following the discussion pretty closely, and I still found a lot worth reading (a particularly enjoyable surprise was Giner-Sorolla's historical review of replicability problems, which is absolutely fascinating).

So I hope you check the issue out. It is currently open access. But if you'd rather have someone else read it first and summarize the main points ... then lucky for you, I've done just that. Below, I describe and discuss all 18 papers, though it should be said that these descriptions are through the lens of what I happen to find particularly interesting or novel. There are likely many other things in these papers that would be of interest to you. So check them out.

Bookends

The issue is bookended by two overviews. Harold Pashler and Eric-Jan Wagenmakers get the issue started with a brief review of recent scandals, failures to replicate, and other reports -- both in psychology and in other fields -- that replicability may be an issue. John Ioannidis, Legend of Replicability, has a contribution at the end of the issue ("Why science is not necessarily self-correcting") which similarly summarizes many of the main arguments of the issue. If you haven't been following along, these are good primers. Neuroskeptic ("The nine circles of scientific hell") makes many of the same arguments, but in the satirical style of Dante Alighieri (I only wish it was in rhyme).

Is there a replicability crisis?

Pashler and Harris ("Is the replicability crisis overblown?") address three arguments sometimes given that "the replicability crisis is overblown". The first is that the low alpha level (5%) ensures that, of published results, most are right. They show that this is most likely incorrect, given current publishing practices. The second even if not many direct replications are published, many "conceptual" replications are. They argue that the notion of "conceptual replication" is so fuzzy that it actually compounds the problem (one problem: there is no such thing as a failure to conceptually replicate):
We speculate that the harmful interaction of publication bias and a focus on conceptual rather than direct replications may even shed light on some of the famous and puzzling "pathological science" cases that embarrassed the natural sciences... What many observers found peculiar in these cases [like cold fusion] was that it took many years for a complete consensus to emerge that the phenomena lacked any reality... It appears that many exact replication attempts of the initial [cold fusion studies] were undertaken soon after the first dramatic reports of cold fusion. Such attempts produced generally negative results (Taubes, 1993). However, what kept faith in cold fusion alive for some time ... was a trickle of positive results achieved using very different designs than the originals (i.e., what psychologists would call conceptual replications).
Finally, they take issue with the claim that over the long run, science is self-correcting, given the lack of evidence that this is true (as far as I can tell, it's not even clear how you would show that science is self-correcting without rigorous, systematic studies of replicability).
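Going back to the first of those arguments: the reason a 5% alpha level by itself doesn't guarantee that most published findings are correct is that what matters is the proportion of significant results that reflect true effects, which also depends on power and on how many of the tested hypotheses are true to begin with. Here is a back-of-the-envelope sketch (my own illustration with made-up numbers, not a calculation from the paper):

```python
def prop_significant_results_true(prior_true, power, alpha=0.05):
    """Of all results that come out significant, what share reflect a true effect?"""
    true_positives = prior_true * power          # true hypotheses that reach significance
    false_positives = (1 - prior_true) * alpha   # false hypotheses that reach significance anyway
    return true_positives / (true_positives + false_positives)

# Made-up numbers: 10% of tested hypotheses are true, and studies have 35% power.
print(round(prop_significant_results_true(prior_true=0.10, power=0.35), 2))  # ~0.44
```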

While probably not everyone agrees with all the specifics of Pashler and Harris's argument, there does appear to be general consensus that there is a real replicability problem. 1,292 psychologists around the world surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") believed, on average, that 53% of the results in psychology could be replicated in direct replications. (One word of caution: it may be that psychologists most concerned about replication were also disproportionately likely to agree to participate in the survey; still, 1300 psychologists is a significant fraction of the field.)

What's the best way of getting a significant result?

Ideally, the answer would be "test for a real effect", but that of course requires that you know whether your hypothesis is true or not, which presumably you don't (otherwise, why run the study?). Bakker, van Dijk and Wicherts ("The rules of the game called psychological science") show that, as a general rule, you are better off running a few underpowered studies than one large, sufficiently-powered study (particularly if any real effect sizes are small). That is, the more likely it is that your results aren't true, the more likely you are to have a publishable finding. This is bad news when coupled with the fact that humans are good at finding efficient strategies. (Note that this doesn't require bad actors: One can simply find that certain procedures -- like testing around X subjects -- have generally produced "good", publishable research in the past.)
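Here is a rough simulation of the incentive Bakker and colleagues describe (my own sketch with made-up numbers, not their actual model, and assuming NumPy and SciPy are available): with the same total number of participants and a small true effect, splitting them into several small studies gives you a better chance that at least one study comes out significant than running one adequately-sized study does.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d = 0.2        # a small true effect (Cohen's d); value made up for illustration
n_sims = 5000

def significant(n_per_group):
    """Simulate one two-group study and report whether it reaches p < .05."""
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(d, 1.0, n_per_group)
    return stats.ttest_ind(a, b).pvalue < 0.05

# Strategy 1: one big study with 100 participants per group.
big = np.mean([significant(100) for _ in range(n_sims)])

# Strategy 2: the same 100 participants per group, split into five studies of 20
# per group; count it a "success" if at least one of the five is significant.
small = np.mean([any(significant(20) for _ in range(5)) for _ in range(n_sims)])

print(f"one big study significant:        {big:.2f}")   # roughly 0.3
print(f"at least one small study 'works': {small:.2f}") # roughly 0.4
```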

The bottleneck

Giner-Sorolla ("Science or art? How aesthetic standards grease the way through the publication bottleneck but undermine science") argues that a certain amount of bad science is expected when there are too few publication outlets:
Articles cannot pass through just by showing theoretical meaning and methodological rigor; their results must appear to support the hypothesis perfectly... Imagine that two colleagues in psychology each show you a manuscript. One has two studies with roughly equal numbers of participants. Both studies support the hypothesis, each with a significant key result at p=.04. The other paper has three studies, also supporting the hypothesis, but the last two studies' individual results are only near significant: p=.02, p=.07, and that most annoying figure, p=.11.
Although, statistically, the second paper actually presents stronger evidence, it should be obvious that the first one is more likely to be published.

There are two ways of having perfect-looking results, and unfortunately one is a lot easier than the other. Again, this doesn't require bad actors -- just competent ones. Giner-Sorolla's solution is not to make publication easier -- various reasons are given for why that is unlikely to work -- but rather to develop an alternative method for distributing less "pretty" results to those who might be interested (such as those doing meta-analysis). 

This summary does not do the article justice; among other things, it includes a fascinating history of hypothesis testing, including a previous crisis of confidence in psychology in the 1970s.

Too good to be true

Francis ("The psychology of replication and the replication of psychology") takes the point a bit further:  our love of "pretty" data has led us to expect data that are too good to be true. Most psychology studies have low power to detect an effect; thus, presenting several studies in a single paper all of which show the same significant effect is highly suspicious. He goes through eight experiments recently published in Galak & Meyvis (2011), seven of which show significant results despite their apparent power ranging from about 0.25 to 0.8 (that is, between a 25% and 80% chance of detecting the effect). The likelihood of this happening even if the hypothesis was true was about 0.08. (This is actually an underestimate: the one "non-signifcant" experiment had a p-value of 0.056, which many authors would count as significant -- which, again, makes the reported findings even less likely.)

Francis goes out of his way to say that this does not necessarily represent malicious intent:
Publication bias could occur in a set of experiments because authors closely follow the standards of the field. Indeed, the experiments reported in Galak and Meyvis (2011) appear to meet (and often exceed) the standards of experimental psychology.
Francis argues that changing the aesthetic can be accomplished by changing what results we report. We should report effect sizes and worry less about p-values. I've always been somewhat skeptical of effect size cheerleaders, but Francis put the argument in a new way that (at least for the moment) makes sense to me:
Given that almost every empirical study in psychology currently uses hypothesis testing, it may seem bizarre to claim that effect sizes are more important than the outcome of hypothesis tests... [But] surely, the theory of Galak and Meyvis (2011) does not predict that every experiment testing the theory will reject the null (e.g., even with sample sizes of, say, n = 3). The authors probably meant that if the theory were true, then an experiment with a large enough sample would reject the null hypothesis. But the definition of "large enough sample" is determined by the magnitude of the effect size ... you still have to focus on effect sizes in order to predict the outcome of experiments.
Measuring and reporting effect sizes would highlight when data are "too pretty", making such data actually less attractive, mitigating some of the factors encouraging publication bias. It would also lead to better research. Francis then discusses some ways to easily detect publication bias.

This paper generated a lot of attention and two commentaries were published alongside it. Simonsohn ("It does not follow") notes that Francis had to look at many papers to find any that have evidence of publication bias but does not correct for multiple comparisons. That is, Francis's test for the file-drawer effect must itself take into account the file-drawer effect; you'd want to see that, in a sample of papers, more show evidence of a file-drawer effect than you would have expected by chance. In other words, it is possible that Galak and Meyvis (2011) failed the test through pure random chance alone. This argument is somewhat weakened by Galak and Meyvis's admission ("You could have just asked") that they had indeed excluded several experiments that had not reached significance. However, the larger point stands.

Both Simonsohn and Galak & Meyvis push back against Francis's argument that if a paper has been shown to suffer from the file-drawer effect (unpublished null results), the paper should be ignored altogether. Just because the file-drawer effect has inflated the observed effect size, that doesn't mean the true effect size is 0. Simonsohn points out that there are meta-analytic ways to correct for the file-drawer effect. Galak and Meyvis argue that their unpublished null results are inconsequential, because obtaining several positive results by chance alone is simply too unlikely. Whether this is true, they acknowledge, depends on whether false-positive rates are inflated overall (as many of the other authors in this issue argue).

Can meta-analyses save us?

Ferguson and Heene ("A vast graveyard of undead theories") focus on the problems for meta-analyses posed by publication bias. Obviously, meta-analyses of biased data will be biased meta-analyses (junk in - junk out). They take a look at some ways researchers have tried to de-bias meta-analyses, such as including unpublished data. The problem is that the selection of unpublished data itself is often biased, and they show that in some cases this actually makes the problem worse.

They also take on an apparently common argument, which is that if several published studies have shown an effect, you'd need there to be a huge number of studies showing no effect in the proverbial file drawer in order to wash out the significant studies.
Now let us assume we assembled 50 studies in a meta-analysis with a mean [effect size] of 2.0. In this case, the FSN [a statistic indicating the number of null results needed to cancel out the significant ones] would yield a value of 6,854.
However, this statistic makes the incorrect assumption that if all the published studies were false positives, the mean effect size of the remainder would be 0.
In fact, if only 5% of studies that show Type I errors were published, the mean [effect size] of the remaining unpublished studies cannot be zero but must be negative... The FSN turns out to be a gross overestimate...
Despite the math, the best evidence of their claims comes from a plausibility argument:
It is thus not surprising that we have seldom seen a meta-analysis resolve a controversial debate in a field. Typically, the antagonists simply decry the meta-analysis as fundamentally flawed or produce a competing meta-analysis of their own [long list of citations].
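The intuition behind their point about the file drawer is easy to check with a quick simulation (my own sketch, assuming NumPy and SciPy; it is not from the paper): if every study tests a truly null effect and only significant positive results get "published", the studies left in the file drawer do not average out to an effect of zero. Their mean is slightly negative, because all the significant negative results end up in the drawer too.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_studies, n_per_group = 20000, 30

effects, published = [], []
for _ in range(n_studies):
    # Every simulated study tests a true effect of exactly zero.
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(0.0, 1.0, n_per_group)
    d = (b.mean() - a.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    effects.append(d)
    # Only significant results in the positive direction get "published".
    published.append(stats.ttest_ind(a, b).pvalue < 0.05 and d > 0)

effects, published = np.array(effects), np.array(published)
print("mean effect of published studies:  ", effects[published].mean().round(3))   # clearly positive
print("mean effect of file-drawer studies:", effects[~published].mean().round(3))  # slightly negative
```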
Participants as passive receptacles of stimuli

Klein and colleagues ("Low hopes, high expectations: Expectancy effects and the replicability of behavioral experiments") write:
Though in general there may be many reasons why a study fails to replicate (e.g., lack of statistical power...), we suggest that the central issue is a widespread failure to acknowledge, recognize, and document the circumstances and social context in which an experiment takes place.
In theory, if aspects of social context (e.g., whether the experimenter was male or female) are relevant for the effect but not documented, that part of the method cannot be replicated. But more generally, the failure to document important aspects of the design may suggest -- to put it bluntly -- incompetent experimenters, in which case a failure to replicate is no surprise. For instance: a significant percentage of studies (published in premier journals) that use deception did not include an "awareness probe" (basically, asking the participant if they were aware of the deception). This is problematic because sometimes the deception is too obvious and participants figure it out, invalidating the results of the study.

Their proposed solutions mostly amount to "do good studies". For instance, if experimenter bias is a concern, use double-blind procedures. If you are worried about demand characteristics, include relevant probes and controls. This is something that should be taught in every intro methods class, but -- according to the authors' research -- is often neglected even in prestigious journals.

Who should be doing replications?

Given that researchers are already over-loaded with responsibilities, if more replications are to be conducted, who is going to conduct them? Frank and Saxe ("Teaching replication") argue that undergraduate and graduate research methods classes should. Students in these classes already have to conduct experiments, but usually they are replications of classic experiments (like Stroop) or novel experiments thought up by the students. Either kind is usually boring: The first kind because the answer is already known, and the second kind because the students don't know enough to think up an interesting, novel experiment that makes contact with cutting-edge research. Replicating recent, cutting-edge studies solves both problems. They acknowledge that sometimes the experiments aren't done well enough to be informative, but around half of the time (in their experience, both having taught such classes) they are.

Grahe and colleagues ("Harnessing the undiscovered resource of student research projects") likewise argue that undergraduates are capable of doing useful research, and given that there are many undergraduates out there taking many methods courses, quite a lot of useful work could be collectively accomplished by them.


How often are replications published in psychology?

Makel, Plucker, and Hegarty ("Replications in psychology research: How often do they really occur?") find -- probably to nobody's surprise -- that replications are rarely published. Looking at all the articles published by 100 psych journals since 1900, just under 2% of the articles even mention "replication" in the text. Based on further investigation, they estimate that only 1% of articles are actual replications, most of which were successful replications. The good news is that the rate has been rising steadily in recent decades. The bad news is that the vast majority were conceptual replications, not direct replications. Worse news: replications, whether successful or not, were cited on average 17 times, compared with an average of 64.5 times for the articles replicated. (Interestingly, they find that some papers do include failures to conceptually replicate -- despite what most people would expect -- though these are nonetheless fairly rare.)

Incentivizing replication and replicability

Nosek, Spies and Motyl ("Scientific Utopia II: Restructuring incentives and practices to promote truth over publishability") note, as have others, that there isn't much incentive to engage in activities that would improve the replicability rate. In a succinct article, they list a number of strategies that they think are unlikely to work -- conceptual replication, waiting for science to "self-correct", establishing journals devoted to publishing replications or negative results, education campaigns, increasing expectations of reviewers, and raising the barrier for publication. The reasons why are straightforward and can be found in the article.

They also offer several strategies they think will work. For instance, they recommend simple check-lists, which "ensure disclosure of obvious items that are sometimes forgotten: sample sizes, statistical tests, effect sizes, and covariates." While this might seem similar to their discarded strategy of "raising expectations" and "educational campaigns", they argue that checklists are an exception, and point to research showing their efficacy (as opposed to raised expectations and educational campaigns) in medicine.

Other proposals include establishing metrics to identify what is worth replicating, thereby making replications of these papers seem important enough (to journal editors) to publish, crowd sourcing replication efforts, and establishing journals (like PLoS ONE) with review standards focused on the soundness, not importance, of the research. While this latter approach may seem like a journal signing its own death warrant, they point to data suggesting that journals aren't actually very good at figuring out what papers will or won't be cited in the future. They also note that making data, materials, and the workflow itself more open and transparent will make it easier to re-use data and materials, with additional replications as an inevitable, positive side effect.

Koole and Lakens ("Rewarding replications") focus more squarely on the lack of incentive to conduct replications because they are rarely published and rarely cited when published. Their proposed solution is that when journals publish an original finding, they agree to publish future replications as an online supplement. They then could create a meta-reference that refers to the original paper and subsequent replications, which researchers could then cite, increasing the citations received by the replications (since they are cited automatically).  This has the nice advantage of working within the existing infrastructure (always easier than starting from scratch). The only remaining issue is how to convince the existing infrastructure (i.e., the journals) to participate, given that it will cost them a lot and there isn't any clear advantage to participation (some would argue that it's actually against the journal's interest to participate). There is also the first-mover disadvantage: the system only becomes particularly useful once many journals participate.

Confirmatory research

Wagenmakers, Wetzels, Borsboom, van der Maas and Kievit ("An agenda for purely confirmatory research") argue that researchers should be required to pre-register their methods and intended analyses prior to running a study. The idea is to make it clear which results are and are not exploratory. The suggestion is that exploratory analyses are often circular (you look at the data, develop a hypothesis, and then test the hypothesis on those data), which increases the likelihood of false positives.

Pre-registration obviously requires extra work, both on the part of the scientists and on the part of the people who maintain the registry, so one can reasonably ask whether the gain is worth the cost. First, we need to know if the problem itself is common (no point in investing in this infrastructure in order to combat a non-existent problem). Wagenmakers and colleagues argue that it is, but unconvincingly:
Almost no psychological research is conducted in a purely confirmatory fashion (e.g., Kerr, 1998; Wagenmakers, Wetzels, Borsboom, & van der Maas, 2011; for similar discussion in biology, see Anderson, Burnham, Gould, & Cherry, 2001).
As far as I can tell, Wagenmakers et al. (2011) speculates that confirmatory research is rare, but offers no data (that I noticed when skimming). Kerr (1998) did conduct a survey of 156 social psychologists, clinical/community psychologists, and sociologists, finding that confirmatory research was reported as occurring most frequently, though not significantly more often than some forms of circular research; even if we assume that these results generalize to all of psychology (not just social and clinical), it is a far cry from "almost nothing".

But let's even suppose for the moment that non-confirmatory (that is, exploratory) research is very common. Is that a problem? As I already mentioned, Wagenmakers and colleagues are worried about circular analyses. What they suggest is tracking exploratory analyses, which are not necessarily the same thing: Suppose I run a study, and in review, a reviewer asks, "Was there a difference between the male and female participants?" (This is not a made-up example.) Well, I hadn't ever thought about that question, so I never checked. Now I check, and either there is or there isn't. This is not a circular analysis; retroactively deciding this is an interesting question should not affect the results (unless you believe in ESP, which Wagenmakers and colleagues clearly do not).

In the absence of additional information, then, the fact that an analysis was exploratory tells you little about whether it is circular, which was the real issue. Whether an analysis was exploratory or planned only suggests that it is circular or not if large percentages of exploratory analyses are circular. I don't know; there doesn't appear to be any data. But even if there were, that would just tell you likelihood -- it wouldn't settle the issue for any particular case.

Enough smart people think that pre-registration would be a good idea that I'm willing to suppose that it is. I just don't see anything in this particular article to suggest that the price is worth the benefit. One last thought: Wagenmakers and colleagues argue that researchers don't fully appreciate the distinction between planned and circular analyses. If so, would marking certain analyses as planned and certain analyses as circular help?

What should we do? 

The 1,292 psychologists surveyed by Fuchs, Jenny and Fiedler ("Psychologists are open to change, yet wary of rules") were also asked about specific suggested reforms (proposed in a recent paper, not by them). I summarize the results below:

Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article:
should be good practice: 60%
should be condition of publication: 46%

Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data collection justification:
should be good practice: 47%
should be condition of publication: 30%  [presumably none of these were psychophysicists]

Authors must list all variables collected in a study.
should be good practice: 58%
should be condition of publication: 46%

Authors must report all experimental conditions, including failed manipulations:
should be good practice: 80%
should be condition of publication: 66%

If observations are eliminated, authors must also report what the statistical results are if those observations are included:
should be good practice: 69%
should be condition of publication: 52%

If analysis includes a covariate, authors must report the statistical results of the analysis without the covariate:
should be good practice: 67%
should be condition of publication: 50%

The survey also asked about four suggested guidelines; for space reasons, I have left them out. You can find them in the paper.

Nobody asked my opinion, but here it is anyway. I think having and reporting a termination rule is relatively costless and helpful in certain situations, so I support it as good practice and wouldn't mind if it were a condition of publication. The question about 20 observations per cell ... do we mean 20 items or 20 subjects or both? I suspect the authors were thinking of studies with a single item per condition per subject (which are common in some areas of psychology but essentially non-existent in others; note that Fuchs, Jenny & Fiedler also point out the problem of finding one set of standards that is appropriate to all subdisciplines of psychology), so it's hard to know how this is meant to be applied. In any case, I would prefer a standard based on expected power, which is more important (though harder to define).
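For what it's worth, the kind of power-based standard I have in mind is easy to compute with off-the-shelf tools. Here is a minimal sketch using statsmodels (the effect size and target power are just example numbers, not a recommendation):

```python
from statsmodels.stats.power import TTestIndPower

# Participants needed per group for an 80% chance of detecting a medium effect
# (Cohen's d = 0.5) at alpha = .05 in a simple two-group design.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_per_group))  # roughly 64 per group
```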

I strongly support listing all variables collected and all experimental conditions, though it runs up quickly against the word limits in many high-impact journals (which I would be happy to do away with; to paraphrase Einstein, papers should be as short as possible, but not shorter). One issue: what counts as an experimental condition is not always clear, particularly if you are doing between-subject manipulations (is this data set a "condition" or a separate experiment?). So I don't know how to enforce these as conditions of publication.

As far as covariates, certain kinds of measures make no sense without the covariate, so in some cases you'd be adding irrelevant, misleading information. Leaving this up to reviewers (the current system) seems perfectly reasonable to me.