
Garbage in, Garbage out

While watching television, have you ever had a fatal heart attack?

If you answered "yes" to this question, you would have been marked as a "bad participant" in Experimental Turk's recent study. The charitable assumption would be that you weren't paying attention. Importantly for those interested in using Amazon Mechanical Turk for research, participants recruited through AMT were no more likely to answer "yes" than participants tested in a traditional lab-based setting (neither group was likely to say "yes").

It's a nice post, though I think that Experimental Turk's analysis is over-optimistic, for reasons that I'll explain below. More interesting, though, is that Experimental Turk apparently does not always include such catch trials in their experiments. In fact, they found the idea novel enough to cite a 2009 paper from the Journal of Experimental Social Psychology that "introduces" the technique -- which means the editors and reviewers at that journal were similarly impressed with the idea.

That's surprising.

Always include catch trials

Including catch trials is often taught as basic experimental method, and for good reason. As Experimental Turk points out, you never know if your participants are paying attention. Inevitably, some aren't -- participants are usually paid or given course credit for participation, so they aren't always very motivated. Identifying and excluding the apathetic participants can clean up your results. But that's not the most important reason to include catch trials.

Even the best participant may not understand the instructions. I have certainly run experiments in which the majority of the participants interpreted the instructions differently from how I intended. A good catch trial is designed such that the correct answer can only be arrived at if you understand the instructions. It is also a good way of making sure you're analyzing your data correctly -- you'd be surprised how often a stray negative sign worms its way into analysis scripts.
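
To make this concrete, here is a minimal sketch of what that screening step might look like in an analysis script. The file name and column names ("subject", "item_type", "correct") are made up for illustration; this isn't from any particular study of mine:

import pandas as pd

data = pd.read_csv("results.csv")  # hypothetical long-format results file

# Accuracy on the obvious-answer items, per participant
catch = data[data["item_type"] == "catch"]
catch_accuracy = catch.groupby("subject")["correct"].mean()

# Keep only participants who got all the obvious answers right
good_subjects = catch_accuracy[catch_accuracy == 1.0].index
clean = data[data["subject"].isin(good_subjects)]

# Sanity check on the analysis itself: if a stray negative sign had flipped
# the scoring, accuracy on the catch items would come out near 0, not near 1
assert catch_accuracy.mean() > 0.5, "Catch-trial accuracy suspiciously low -- check the scoring code"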

Sometimes participants also forget instructions. In a recent study, I wasn't finding a difference between the control and experimental groups. I discovered in debriefing that most of the participants in the experimental group had forgotten the key instruction that made the experimental group the experimental group. No wonder there wasn't a difference! And good thing I asked. 

The catch trial -- the question with the obvious answer -- is just one tool in a whole kit of tricks used to validate one's results. There are other options, too. In reading studies, researchers often ask comprehension questions -- not because the answers themselves matter (the real interest is in what the participants do while reading), but simply to prove that the participants in fact did read and understand the material. 

Similar is the embedded experiment -- a mini experiment embedded into your larger experiment, the only purpose of which is to replicate a well-established result. For instance, in a recent experiment I included a vocabulary test (which you can also find in this experiment I'm running with Laura Germine at TestMyBrain.org). I also asked the participants for their SAT scores (these were undergraduates), not because I cared about the scores per se, but because showing that Verbal SAT scores correlated nicely with performance on the vocabulary test (Math SAT scores less so) helped validate our vocab test. 
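
For what it's worth, the arithmetic behind that kind of validation check is simple -- here is a rough sketch, with made-up file and column names, of the correlation I have in mind:

import pandas as pd
from scipy.stats import pearsonr

scores = pd.read_csv("vocab_and_sat.csv").dropna(
    subset=["vocab_score", "sat_verbal", "sat_math"]
)

r_verbal, p_verbal = pearsonr(scores["vocab_score"], scores["sat_verbal"])
r_math, p_math = pearsonr(scores["vocab_score"], scores["sat_math"])

# Expectation: a solid correlation with Verbal SAT, a weaker one with Math SAT
print(f"vocab ~ Verbal SAT: r = {r_verbal:.2f}, p = {p_verbal:.3f}")
print(f"vocab ~ Math SAT:   r = {r_math:.2f}, p = {p_math:.3f}")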


Beyond Surveys

Although I described catch trials mostly in terms of survey-format studies, the same techniques can be embedded into nearly any experiment. I've used them in reading-time, eye-tracking, and ERP experiments as well. The practice isn't even specific to psychology/cognitive science. During my brief sojourn in a wet lab in high school, my job was to help genotype knock-out mice -- to make sure that the genes in question really were missing from the relevant mice and not from the control mice. It probably wouldn't have occurred to the PIs in that lab to simply assume the knock-out manipulation had worked: if it hadn't, none of the rest of the experiment would be interpretable. 

A version of the catch trial even shows up in software debugging, where the programmer inserts code that isn't relevant to the function of the program per se, but whose output helps determine whether the code is doing what it's supposed to.
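
The software analogue would look something like this -- a toy example of my own, not from any real code base, where the "catch trial" is an input whose correct output is already known:

def score_response(response: str, answer: str) -> int:
    """Return 1 for a correct response, 0 otherwise (hypothetical scoring function)."""
    return int(response.strip().lower() == answer.strip().lower())

# Catch trials for the scorer: cases where the right answer is obvious.
# If either assertion fails, the scoring code is broken, and no analysis
# built on top of it can be trusted.
assert score_response("Dog", "dog") == 1
assert score_response("cat", "dog") == 0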

It is true that some experiments resist checks of this sort. I have certainly run experiments where, by design, I couldn't easily confirm that the participants understood the experiment, were paying attention, etc. But that is better avoided if possible -- which is why when I don't see such checks in an experimental write-up, I assume either (a) the checks were performed but deemed too unimportant/obvious to mention, or (b) they weren't performed at all, in which case I take the results with an extra grain of salt.

An Odd Omission

If catch trials are a basic aspect of good experimental design, how is it that Experimental Turk and the Journal of Experimental Social Psychology didn't know about them? I'm not sure. Part of it may be due to how experimental design is taught. It's not something you look up in an almanac, and though there are classes on technique (at least in psychology departments), they aren't necessarily that helpful, since there are hundreds of types of experiments out there, each with its own quirks, and a class can only cover a few.

At least in my experience, experimental design is learned through a combination of the apprenticeship method (working with professors -- or, more often, more experienced graduate students) and figuring it out for yourself. The authors at Experimental Turk, it turns out, come from fields relatively new to experimental design (business, management, and political science), so it's possible they had less access to such institutional knowledge. 

As for the Journal of Experimental Social Psychology... I'm not a social psychologist, and I hesitate to generalize about the field. A lot of social psychology uses questionnaires as instruments. Researchers go to a great deal of trouble to validate those questionnaires -- showing that they predict results on other tests or questionnaires, that they have good test-retest reliability, etc. Many of the techniques they use are ones I would like to learn better. But I haven't ever run across a questionnaire (again, in my limited experience) that actually includes catch trials. Which in itself is interesting.

A clever idea 

I should add that while Experimental Turk cites said journal article as suggesting the use of questions with obvious answers, that's not actually what the paper suggests. Rather, it suggests using instructions that tell participants to ignore certain questions. For instance: 
Sports Participation
Most modern theories of decision making recognize the fact that decisions do not take place in a vacuum. Individual preferences and knowledge, along with situational variables can greatly impact the decision process. In order to facilitate our research on decision making we are interested in knowing certain factors about you, the decision maker. Specifically, we are interested in whether you actually take the time to read the directions; if not, then some of our manipulations that rely on changes in the instructions will be ineffective. So, in order to demonstrate that you have read the instructions, please ignore the sports item below, as well as the continue button. Instead, simply click on the title at the top of this screen (i.e., "sports participation") to proceed to the next screen. Thank you very much.
That's a clever idea. One of my elementary school teachers actually wrote a whole test with instructions like that to teach the class a lesson about reading instructions carefully (and it worked -- I still do!). It's a good idea that I've never seen used in an experimental setting before, though that doesn't mean it hasn't been. In any case, the discussion in the paper doesn't mention catch trials or other methods of validating data, so it's hard to know how thorough their literature search was.

More training

A bad movie can still make entertaining watching. A bad experiment is irredeemable. If the participants didn't understand the instructions, nothing can be gleaned from the data. And there are so many ways to run bad experiments -- I know, because I've employed many of them myself. There are a lot of datasets out there in psychology that have proven, shall we say, resistant to replication. Some of this has to be due to the fact that experimental design is not as good as it could and should be. 

Addendum

As I mentioned higher up, I think Experimental Turk is overly optimistic about the quality of data from AMT. I've run a couple dozen experiments on AMT now, and the percentage of participants who fail the catch trials varies a great deal, from as few as 0% to as many as 20-30%. I haven't made a systematic study of it, but there seem to be a number of contributing factors, some of which are general to all experimental venues (the length of the experiment, how interesting it is, how complicated the instructions are) and some of which are specific to AMT (the more related HITs there are, the more attractive a target the experiment is to spammers). 
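
If you keep a record of who passed the catch trials in each experiment, tracking that variability takes only a couple of lines. Again, the file and column names here are invented:

import pandas as pd

runs = pd.read_csv("amt_runs.csv")  # one row per participant per experiment

# Proportion of participants failing the catch trial, by experiment
failure_rate = 1.0 - runs.groupby("experiment")["passed_catch"].mean()
print(failure_rate.sort_values(ascending=False))  # roughly 0% to 20-30% in my experience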

All the more reason to always include catch trials.


-----------
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867-872.

2 comments:

Disgruntled PhD said...

I'm not surprised that exp soc psych hadn't heard of this -- it was a timely reminder for me.

Catch trials were mentioned in my first year undergrad, but only for surveys and they were called Lie scales.

Thanks for the post, very interesting.

Gabriele Paolacci said...

We haven't really been optimistic about attention provided by workers on AMT.

As I mentioned in a comment on the blog post, in my studies I found highly variable failure rates on AMT. This obviously depends on a number of factors, as you correctly pointed out. In our post, we have been tacit about whether AMT workers are generally "good" or "bad"; we just reported how they compared to other subject pools in the only comparative study we ran and knew about. For these kinds of data, it's really the more the merrier - experimenters need to know when it's ok to use AMT and when it's not.

Regarding the novelty of the procedure suggested in Oppenheimer et al. (2009), all I can say is that detecting actual attention is hardly a new thing (nor does our post imply that we were ignorant about "catch trials" and the like...). What those authors provided is a theoretical and empirical argument in favor of its use.

Thanks for *your* attention :-) I'm glad we involuntarily stimulated some discussion on research methods.