Field of Science

Showing posts with label On method mavenry.

In praise of experiments

Today, the excellent Neuroskeptic writes about a new study investigating which US states are most suicidal. The interesting twist was the form of the data: Google searches. It's an interesting study and an interesting use of Google searches, but what struck me was Neuroskeptic's closing thoughts.
Over the past couple of years there's been a flurry of studies based on analyzing Google and Twitter trends. What's interesting to me is that we're really in the early days of this, when you think about likely future technologies. What will happen when everyone's wearing a computer 24/7 that records their every word and move, and even what they see? 
Eventually, psychology and sociology might evolve (or degenerate) into no more than the analysis of such data...
It's always dangerous to predict the future, but here's my prediction: Not a chance. It comes down to a distinction between observational studies and experiments. Observational studies (where you record what happens in the course of normal events) are useful, particularly when you care about questions like what is the state of the world? They are much less useful when you want to know why is the world the way it is?

There are a couple reasons.

Reason #1: The correlation fallacy

First, observational studies are really about studying correlations. To have much power to analyze interesting correlations, you need a lot of data. This is what makes Google and Twitter powerful: they provide a lot of data. But correlation, famously, doesn't always tell you much about causation.

For instance, it is now well-known that you can use the number of pirates active in the world's oceans and seas to reasonably predict average global temperature (there's a strong correlation):



I did not know until recently that Google search data has now definitively shown a correlation between the amount of movie piracy and global warming as well:

In the case of real pirates vs. the temperature, the causation presumably runs the other way (temperature affects weather, which affects seafaring activity). I have no idea what causes the correlation between searches for free movies and searches about global warming; perhaps some third factor. To give another silly example, there is a lot more traffic on the roads during daylight than at night, but this isn't because cars are solar-powered!

The point is that experiments don't have this problem: you go out and manipulate the world to see what happens. Change the number of pirates and see if global temperatures change. Nobody has tried this (to my knowledge), but I'm willing to bet it won't work.

(Of course, there are natural experiments, which are a hybrid of observational studies and experiments: the experimenter doesn't manipulate the world herself but rather waits until somebody else, in the course of normal events, does it for her. A good example is comparing head-injury statistics across states as they adopt bicycle helmet laws at different times. These are rarely as well-controlled as an actual experiment, but they have the advantage of ecological validity.)

Reason #2: Life's too short

The second is that observational studies are limited by what actually happens in the world. You won't, from an observational study, find out what the effect on US politics would be of every US senator taking up crack while every US representative takes up meth. (I hope not, anyway.)

That was an absurd example, but the problem is real. Language gives lots of great examples. Suppose you want to find out what sentences in any given language are grammatical and what sentences are not. You could do an observational study and see what sentences people say. Those are grammatical; sentences you haven't heard probably aren't.

The problem with this is that people are boring and repetitive. A small number of words (heck, a small number of sentence fragments) accounts for most of what people say and write. The vast majority of grammatical sentences will never appear in your observational sample no matter how long you wait, because there are actually an infinite number of grammatical English sentences. (In his impressive "Who's afraid of George Kingsley Zipf?", Charles Yang shows how a number of prominent language researchers went astray by paying too much attention to this kind of observational study.)
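To get a feel for the problem, here's a toy sketch (my own illustrative numbers, not anything from Yang's actual analyses): draw a large "corpus" from a Zipfian distribution over possible sentence types and see how much of it the most frequent handful of types covers, and how many types never show up at all.

```python
# Toy illustration (my own numbers): sample a "corpus" from a Zipfian
# distribution over 50,000 possible sentence types and count how many types
# never appear, even in a corpus of 50,000 tokens.
import numpy as np

rng = np.random.default_rng(0)
n_types, n_tokens = 50_000, 50_000

ranks = np.arange(1, n_types + 1)
probs = 1.0 / ranks              # Zipf's law: frequency proportional to 1/rank
probs /= probs.sum()

corpus = rng.choice(n_types, size=n_tokens, p=probs)
counts = np.bincount(corpus, minlength=n_types)

print(f"top 10 types cover {counts[:10].sum() / n_tokens:.0%} of the corpus")
print(f"{(counts == 0).sum()} of {n_types} types never appear at all")
```

The handful of high-frequency types dominates the sample, while a large share of the perfectly legitimate low-frequency types never occurs even once.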

The basic feature of the problem is that for building theories -- explaining why things are the way they are -- very often what you care about are the border cases. Human behavior is largely repetitive, and the  border cases are quite rare. Experiments turn this around: by deliberately choosing the situations we put our participants in, we can focus on the informative test cases.

The experimental method: Here to stay

None of this should be taken as meaning that I don't think observational studies are useful. I conduct them myself. A prerequisite to asking the question Why are things the way they are is knowing, in fact, what way things are. There is also the question of ecological validity. When we conduct laboratory experiments, we construct artificial situations and then try to generalize the results to real life. It's good to know something about real life in order to inform those generalizations.

But just as I can't imagine observational studies disappearing, I can't imagine them replacing experimentation, either.



Faster fMRI?

A paper demonstrating a new technique for "ultrafast fMRI" has been getting some buzz on the blogosphere. Although movies often depict fMRI showing real-time activity in the brain, in fact typical methods only collect from one slice of the brain at a time, taking a fair amount of time to cover the entire brain (Neuroskeptic puts this at about 2-3 seconds). This new technique (GIN) can complete the job in 50 ms, and without sacrificing spatial resolution (which is the great advantage of fMRI relative to other neuroimaging techniques like EEG or MEG).

Does this mean fMRI is about to get 50 times faster?

Not exactly. What fMRI is measuring is the change in blood oxygenation in areas of your brain. When a particular area starts working harder, more oxygen-rich blood is sent in its direction, and that can be detected using MRI. The limitation is that it takes a while for this blood to actually get there (around 5-10 seconds). One commenter on the Neuroskeptic post (which is where I heard about this article) wrote "making fMRI 50 times faster is like using an atomic clock to time the cooking of a chicken."

The basic fact is that fMRI is never going to compete with EEG or MEG in terms of temporal resolution, because the latter directly measure the electrical activity in the brain and can do so on very fine time-scales. But that doesn't mean that speeding up fMRI data acquisition isn't a good idea. As the authors of the paper write:
fMRI studies, especially related to causality and connectivity, would benefit from reduced repetition time in terms of better statistics and physiological noise characteristics...
They don't really say *how* these studies would achieve this benefit. The rest of the discussion is mostly about how their technique improves on other attempts at ultra-fast fMRI, which tend to have poor spatial resolution. They do mention that maybe ultra-fast fMRI would help simultaneous EEG-fMRI studies to strengthen the link between the EEG signal and the fMRI signal, but it's not obvious to me just how helpful this would be, given the very different timing of EEG and fMRI.

But that's not going to stop me from speculating as to how faster data-acquisition might improve fMRI. (Any readers who know more about fMRI should feel free to step in for corrections/additions).

Speculations

The basic problem is that what you want to do is model the hemodynamic response (the change in blood oxygenation levels) due to a given trial. This response unfolds over a time-course of 5-10 seconds. If you are only measuring what is happening every couple seconds, you have pretty sparse data from which to reconstruct that response. Here's an example of some reconstructed responses (notice they seem to be sampling once every second or so):


Much faster data-collection would help with this reconstruction, leading to more accurate results (and conclusions). The paper also mentions that their technique helps with motion-correction. One of the basic problems in fMRI is that if somebody moves their head/brain even just a few millimeters, everything gets thrown off. It's very hard to sit in a scanner for an hour or two without moving even a smidge (one technique, used by some hard-core researchers, is a bite bar, which is perfectly fitted to your jaw and keeps you completely stabilized). Various statistical techniques can be used to try to mitigate any movement that happens, but they only work so well. The authors of the paper write:
Obviously, all InI-based and comparable imaging methods are sensitive to motion especially at the edges of the brain with possible incorrect estimation of prior information. However, due to the large amount of data, scan times are currently short (4 min in the current study), which mitigates the motion problem.
I take this to mean that because their ultra-rapid scanning technique collects so much data from each trial, you don't need as many trials, so the entire experiment can be shortened. Note that they are focused on the comparison between their technique and other related techniques, not the comparison between their technique and standard fMRI techniques. But it does seem reasonable that more densely sampling the hemodynamic response for an individual trial should mean you need fewer trials overall, thus shortening experiments.

-----
Boyacioğlu, R., & Barth, M. (2012). Generalized iNverse imaging (GIN): Ultrafast fMRI with physiological noise correction. Magnetic Resonance in Medicine. DOI: 10.1002/mrm.24528

Estimating replication rates in psychology

The Open Science Collaboration's interim report, which will come out shortly in Perspectives on Psychological Science, is available. We nearly pulled off the physics trick of having a paper where the author list is longer than the paper itself. I think there are about 70 of us (if you scroll down, you'll find me in the H's).

The abstract says it all:
Reproducibility is a defining feature of science. However, because of strong incentives for innovation and weak incentives for confirmation, direct replication is rarely practiced or published. The Reproducibility Project is an open, large-scale, collaborative effort to systematically examine the rate and predictors of reproducibility in psychological science. So far, 72 volunteer researchers from 41 institutions have organized to openly and transparently replicate studies published in three prominent psychological journals from 2008. Multiple methods will be used to evaluate the findings, calculate an empirical rate of replication, and investigate factors that predict reproducibility. Whatever the result, a better understanding of reproducibility will ultimately improve confidence in scientific methodology and findings.
If you are interested in participating, there is still time. Go to the website for more information. 

And for my next trick, I'll make this effect disappear!

In this week's New Yorker, Jonah Lehrer shows once again just how hard it is to do good science journalism if you are not yourself a scientist.

His target is the strange phenomenon that many high profile papers are failing to replicate. This has been very much a cause celebre lately, and Lehrer follows a series of scientific papers on the topic as well as an excellent Atlantic article by David Freedman. At this point, many of the basic facts are well-known: anecdotally, many scientists report repeated failures to replicate published findings. The higher-profile the paper, the less likely it is to replicate, with around 50% of the highest-impact papers in medicine failing to replicate. As Lehrer points out, this isn't just scientists failing to replicate each other's work, but scientists failing to replicate their own work: a thread running through the article is the story of Jonathan Schooler, a professor at UC-Santa Barbara who has been unable to replicate his own seminal graduate student work on memory.

Lehrer's focus in this article is shrinking effects.


No, not this one.

Some experimental effects seem to shrink steadily over time:
In 2001, Michael Jennions, a biologist at the Australian National University, set out to analyze "temporal trends" across a wide range of subjects in ecology and evolutionary biology. He looked at hundreds of papers and forty-four meta-analyses (that is, statistical syntheses of related studies), and discovered a consistent decline effect over time, as many of the theories seemed to fade into irrelevance.
As described, that's weird. But there is a good explanation for such effects, and Lehrer brings it up. Some results are spurious. It's just one of those things. Unfortunately, spurious results are also likely to be exciting. Let's say I run a study looking for a relationship between fruit-eating habits and IQ. I look at the effects of 20 different fruits. By chance, one of them will likely show a significant -- but spurious -- effect. So let's say I find that eating an apple every day leads to a 5-point increase in IQ. That's really exciting because it's surprising -- and the fact that it's not true is integral to what makes it surprising. So I get it published in a top journal (top journals prefer surprising results).

Now, other people try replicating my finding. Many, many people. Most will fail to replicate, but some -- again by chance -- will replicate. It is extremely difficult to get a failure to replicate published, so only the replications get published. Over time, the "genius apple hypothesis" becomes part of established dogma. Remember that anything that challenges established dogma is exciting and surprising and thus easier to publish. So now failures to replicate are surprising and exciting and get published. When you look at effect-sizes in published papers over time, you will see a gradual but steady decrease in the "effect" of apples -- from 5 points to 4 points down to 0.
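To see how publication filtering alone can manufacture a big initial effect that later "declines," here's a toy simulation sketch (my own numbers, not anything from Lehrer, Jennions, or Schooler): there is no real effect at all, but a journal that publishes only positive, significant results will still report a sizeable one.

```python
# Toy simulation (illustrative numbers): a null effect, many labs, and a
# journal that only publishes positive, significant results. The "published"
# effect size is inflated; the unfiltered average is ~zero, so the apparent
# effect shrinks once unselected replications start appearing.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group = 1000, 20

effects, pvals = [], []
for _ in range(n_studies):
    treat = rng.normal(0, 1, n_per_group)   # "apple eaters" -- no true effect
    ctrl = rng.normal(0, 1, n_per_group)    # controls
    t, p = stats.ttest_ind(treat, ctrl)
    effects.append(treat.mean() - ctrl.mean())
    pvals.append(p)

effects, pvals = np.array(effects), np.array(pvals)
published = (pvals < .05) & (effects > 0)   # the journal's selection rule
print(f"mean 'effect' in published studies: {effects[published].mean():.2f}")
print(f"mean effect across all studies:     {effects.mean():.2f}")
```

Once unselected replications start appearing in print, the published average drifts back toward zero -- which looks exactly like a mysteriously shrinking effect.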

Where I get off the Bus

So far so good, except here's Lehrer again:
While the publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the initial prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts.
Huh? Lehrer seems to be suggesting that it is publication that makes a result spurious. But that can't be right. Rather, there are just lots of spurious results out there. It happens that journals preferentially publish spurious results, leading to biases in the published record, and eventually the decline effect.

Some years ago, I had a bad habit of getting excited about my labmate's results and trying to follow them up. Just like a journal, I was interested in the most exciting results. Not surprisingly, most of these failed to replicate. The result was that none of them got published. Again, this was just a factor of some results being spurious -- disproportionately, the best ones. (Surprisingly, this labmate is still a friend of mine; personally, I'd hate me.)

The Magic of Point O Five

Some readers at this point might be wondering: wait -- people do statistics on their data and only accept a result that is extremely unlikely to have happened by chance. The cut-off is usually 0.05 -- a 5% chance of having a false positive. And many studies that turn out later to have been wrong pass even stricter statistical tests. Notes Lehrer:
And yet Schooler has noticed that many of the data sets that end up declining seem statistically solid--that is, they contain enough data that any regression to the mean shouldn't be dramatic. "These are the results that pass all the tests," he says. "The odds of them being random are typically quite remote, like one in a million. This means that the decline effect should almost never happen. But it happens all the time!"
So there's got to be something making these results look more unlikely than they really are. Lehrer suspects unconscious bias:
Theodore Sterling, in 1959 ... noticed that ninety-seven percent of all published psychological studies with statistically significant data found the effect they were looking for ... Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments

and again:
The problem seems to be one of subtle omissions and unconscious misperceptions, as researchers struggle to make sense of their results.
I expect that unconscious bias is a serious problem (I illustrate some reasons below), but this is pretty unsatisfactory, as he doesn't explain how unconscious bias would affect results, and the Schooler effect is a complete red herring. 


I wasn't around in 1959, so I can't speak to that time, but I suspect that the numbers are similar today. In fact, though, Sterling was measuring the wrong thing. Nobody cares what our hypotheses were. They don't care what order the experiments were actually run in. They care about the truth, and they have very limited time to read papers (most papers are never read, only skimmed). Good scientific writing is clear and concise. The mantra is: Tell them what you're going to tell them. Tell them. And then tell them what you told them. No fishing excursions, no detours. When we write scientific papers, we're writing science, not history.

And this means we usually claim to have expected to find whatever it is that we found. It just makes for a more readable paper. So when a scientist reads the line, "We predicted X," we know that really means "We found X" -- what the author actually predicted is beside the point.

Messing with that Point O Five

So where do all the false positives come from, if they should be less than 5% of conducted studies? There seem to be a number of issues.

First, it should be pointed out that the purpose of statistical tests (and the magic .05 threshold for significance) is to make a prediction as to how likely it is that a particular result will replicate. A p-value of .05 means roughly that there is a 95% chance that the basic result will replicate (sort of; this is not technically true but is a good approximation for present purposes).

But statistics are estimates, not facts. They are based on a large number of idealizations. For instance, many require that measurement error is distributed normally



a normal distribution

meaning that the bulk of measurements are very close to the true measurement and a measurement is as likely to be larger than the true value as it is to be smaller. In fact, most data is heavily skewed, with measurements more likely to be too large than too small (or vice versa).

For instance, give someone an IQ test. IQ tests have some measurement error -- people will score higher or lower than their "true" score due to random factors such as guessing answers correctly (or incorrectly), being sleepy (or not), etc. But it's a lot harder to get an IQ score higher than your true score than lower, because getting a higher score requires a lot of good luck (unlikely) whereas there are all sorts of ways to get a low score (brain freeze, etc.). 

Most statistical tests make a number of assumptions (like normally distributed error) that are not true of actual data. That leads to incorrect estimates of how likely a particular result is to replicate. The truth is most scientists -- at the very least, most psychologists -- aren't experts in statistics, and so statistical tests are misapplied all the time.

I don't actually think that issues like the ones I just discussed lead to most of the difficulties (though I admit I have no data one way or another). I bring these issues up mainly to point out that statistical tests are tools that are either used or misused according to the skill of the experimenter. And there are lots of nasty ways to misuse statistical tests. I discuss a few of them below:

Run enough experiments and...

Let's go back to my genius fruit experiment. I ask a group of people to eat an apple and then give them an IQ test. I compare their IQ scores with scores from a control group that didn't eat an apple. Now let's say in fact eating apples doesn't affect IQ scores. Assuming I do my statistics correctly and all the assumptions of the statistical tests are met, I should have only a 5% chance of finding a "significant" effect of apple-eating.

Now let's say I'm disappointed in my result. So I try the same experiment with kiwis. Again, I have only a 5% chance of getting a significant result for kiwis. So that's not very likely to happen either.

Next I try oranges....

Hopefully you see where this is going. If I try only one fruit, I have a 5% chance of getting a significant result. If I try 2 fruits, I have a 1 - .95*.95 = 9.8% chance of getting a significant result for at least one of the fruits. If I try 4 fruits, now I'm up to a 1 - .95*.95*.95*.95 = 18.5% chance that I'll "discover" that one of these fruits significantly affects IQ. By the time I've tried 14 fruits, I've got a better than 50% chance of an amazing discovery. But my p-value for that one experiment -- that is, my estimate that these results won't replicate -- is less than 5%, suggesting there is only a 5% chance the results were due to chance.
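For anyone who wants to check the arithmetic, here's a minimal sketch of the same calculation:

```python
# Chance of at least one spurious "discovery" when testing k independent
# fruits, each at the conventional alpha = .05 threshold.
alpha = 0.05
for k in (1, 2, 4, 14):
    p_at_least_one = 1 - (1 - alpha) ** k
    print(f"{k:2d} fruits -> {p_at_least_one:.1%} chance of at least one false positive")
```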

While there are ways of statistically correcting for this increased likelihood of false positives, my experience suggests that it's relatively rare for anyone to do so. And it's not always possible. Consider the fact that there may be 14 different labs all testing the genius fruit hypothesis (it's suddenly very fashionable for some reason). There's a better than 50% chance that one of these labs will get a significant result, even though from the perspective of an individual lab, they only ran one experiment.

Data peeking

Many researchers peek at their data. There are good reasons for doing this. One is curiosity (we do experiments because we really want to know the outcome). Another is to make sure all your equipment is working (don't want to waste time collecting useless data). Another reason -- and this is the problematic one -- is to see if you can stop collecting data.

Time is finite. Nobody wants to spend longer on an experiment than necessary. Let's say you have a study where you expect to need -- based on intuition and past experience -- around 20 subjects. You might check your data after you've run 12, just in case that's enough. What usually happens is that if the results are significant, you stop running the study and move on. If they aren't, you run more subjects. Now maybe after you've got 20 subjects, you check your data. If it's significant, you stop the study; if not, you run some more. And you keep on doing this until either you get a significant result or you give up.

It's a little harder to do back-of-the-envelope calculations on the importance of this effect, but it should be clear that this habit has the unfortunate result of increasing the relative likelihood of a false positive, since false positives lead you to declare victory and end the experiment, whereas false negatives are likely to be corrected (since you keep on collecting more subjects until the false negative is overcome). I read a nice paper on this issue that actually crunched the numbers a while back (for some reason I can't find it at the moment), and I remember the result was a pretty significant increase in the expected number of false positives.
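I can't vouch for that paper's exact figure, but the inflation is easy to demonstrate with a quick simulation sketch of my own (a null effect, peeking after every few subjects, stopping at the first significant result):

```python
# Toy simulation (my own numbers): there is no true effect, but we peek after
# every batch of 4 subjects per group and stop as soon as p < .05, giving up
# at 40 subjects per group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_experiment(start=12, step=4, max_n=40):
    a = list(rng.normal(0, 1, start))
    b = list(rng.normal(0, 1, start))
    while True:
        if stats.ttest_ind(a, b).pvalue < .05:
            return True                  # "significant" -- stop and write it up
        if len(a) >= max_n:
            return False                 # give up
        a.extend(rng.normal(0, 1, step))
        b.extend(rng.normal(0, 1, step))

false_positives = sum(one_experiment() for _ in range(2000))
print(f"false positive rate with peeking: {false_positives / 2000:.1%}")  # well above 5%
```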

Data massaging

The issues I've discussed so far are real problems, but they are pretty common and not generally regarded as ethical violations. Data massaging is at the borderline.

Any dataset can be analyzed in a number of ways. Once again, if people get the result they were expecting with the first analysis they run, they're generally going to declare victory and start writing the paper. If you don't get the results you expect, you try different analysis methods. There are different statistical tests that can be used. There are different covariates that could be factored out. You can throw out "bad" subjects or items. This is going to significantly increase the rate of false positives.

It should be pointed out that interrogating your statistical model is a good thing. Ideally, researchers should check to see if there are bad subjects or items, check whether there are covariates to be controlled for, check whether different analysis techniques give different results. But doing this affects the interpretation of your p-value (the estimate of how likely it is that your results will replicate), and most people don't know how to appropriately control for that. And some are frankly more concerned with getting the results they want than doing the statistics properly (this is where the "borderline" comes in).

Better estimates

The problem, at least from where I stand, is one of statistics. We want our statistical tests to tell us how likely it is that our results will replicate. We have statistical tests which, if used properly, will give us just such an estimate. However, there are lots and lots of ways to use them incorrectly.

So what should we do? One possibility is to train people to use statistics better. And there are occasional revisions in standard practice that do result in better use of statistics.

Another possibility is to lower the p-value that is considered significant. The choice of p=0.05 as a cutoff was, as Lehrer notes, arbitrary. Picking a smaller number would decrease the number of false positives. Unfortunately, it also decreases the number of real positives by a lot. People who don't like math can skip this next section.

Let's assume we're running studies with a single dependent variable and one manipulation, and that we're going to test for significance with a t-test. Let's say the manipulation really should work -- that is, it really does have an effect on our dependent measure. Let's say that the effect size is large-ish (Cohen's d of .8, which is large by psychology standards) and that we run 50 subjects. The chance of actually finding a significant effect at the p=.05 level is 79%. For people who haven't done power analyses before, this might seem low, but actually an 80% chance of finding an effect is pretty good. Dropping our significant threshold to p=.01 drops the chance of finding the effect to 56%. To put this in perspective, if we ran 20 such studies, we'd find 16 significant effects at the p=.05 level but only 11 at the p=.01 level. (If you want to play around with these numbers yourself, try this free statistical power calculator.)
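If you'd rather script it than use a web calculator, here's a sketch of the same power calculation using statsmodels, assuming (as I read the example) a two-sided two-sample t-test with 25 subjects per group:

```python
# Rough check of the power numbers above, assuming a two-sided two-sample
# t-test with d = 0.8 and 25 subjects per group (my reading of "50 subjects").
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for alpha in (0.05, 0.01):
    power = analysis.power(effect_size=0.8, nobs1=25, alpha=alpha,
                           ratio=1.0, alternative='two-sided')
    print(f"alpha = {alpha}: power ~ {power:.0%}, "
          f"~{20 * power:.0f} of 20 such studies come out significant")
```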

Now consider what happens if we're running studies where the manipulation shouldn't have an effect. If we run 20 such studies, 1 of them will nonetheless give us a false positive at the p=.05 level, whereas we probably won't get any at the p=.01 level. So we've eliminated one false positive, but at the cost of nearly 1/3 of our true positives.

No better prediction of replication than replication

Perhaps the easiest method is to just replicate studies before publishing them. The chances of getting the same spurious result twice in a row are vanishingly small. Most of the issues I outlined above -- other than data massaging, which I'll get to in a moment -- won't increase the chances that a spurious result replicates. Test 14 different fruits to see if any of them increase IQ scores, and you have over a 50% chance that one of them will spuriously do so. Test that same fruit again, and you've only got a 5% chance of repeating the effect. So replication decreases your false positive rate 20-fold. Similarly, data massaging may get you that coveted p<.05, but the chances of the same massages producing the same result again are very, very low.

True positives aren't nearly so affected. Again, a typical power level is 0.80 -- 80% of the time that an effect is really there, you'll be able to find it. So when you try to replicate a true positive, you'll succeed 80% of the time. So replication decreases your true positives by only 20%.

So let's say the literature has a 30% false positive rate (which, based on current estimates, seems quite reasonable). Attempting to replicate every positive result prior to publication -- and note that it's extremely rare to publish a null result (no effect), so almost all published results are positive results -- should decrease the false positives 20-fold and the true positives by 20%, leaving us with a 2.6% false positive rate. That's a huge improvement.
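For the curious, here's the back-of-the-envelope arithmetic behind that 2.6% figure, as a short sketch using the same assumed numbers (30% of published positives are false, 80% power, 5% chance a false positive repeats):

```python
# The back-of-the-envelope arithmetic from the paragraph above: require one
# successful in-house replication before publishing.
false_rate, alpha, power = 0.30, 0.05, 0.80    # assumed literature-wide numbers

surviving_false = false_rate * alpha           # false positives that replicate by chance
surviving_true = (1 - false_rate) * power      # true positives that replicate
new_false_rate = surviving_false / (surviving_false + surviving_true)
yield_after = surviving_false + surviving_true

print(f"false positive rate after replication: {new_false_rate:.1%}")   # ~2.6%
print(f"fraction of positive results surviving: {yield_after:.1%}")     # i.e. ~43% of would-be publications lost
```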

So why not replicate more?

So why don't people replicate before publishing? If 30% of your own publishable results are false positives, and you eliminate them, you've just lost 30% of your potential publications. You've also lost 20% of your true positives as well, btw, which means overall you've decreased your productivity by 43%. And that's without counting the time it takes to run the replication. Yes, it's nice that you've eliminated your false positives, but you also may have eliminated your own career!

When scientists are ranked, they're largely ranked on (a) number of publications, (b) number of times a publication is cited, and (c) quality of journal that the publications are in. Notice that you can improve your score on all of these metrics by publishing more false positives. Taking the time to replicate decreases your number of publications and eliminates many of the most exciting and surprising results (decreasing both citations and quality of journal). Perversely, even if someone publishes a failure to replicate your false positive, that's a citation and another feather in your cap.

I'm not saying that people are cynically increasing their numbers of bogus results. Most of us got into science because we actually want to know the answers to stuff. We care about science. But there is limited time in the day, and all the methods of eliminating false positives take time. And we're always under incredible pressure to pick up the pace of research, not slow it down.

I'm not sure how to solve this problem, but any solution I can think of involves some way of tracking not just how often a researcher publishes or how many citations those publications get, but how often those publications are replicated. Without having a way of tracking which publications replicate and which don't, there is no way to reward meticulous researchers or hold sloppy researchers to account.

Also, I think a lot of people just don't believe that false positives are that big a problem. If you think that only 2-3% of published papers contain bogus results, there's not a lot of incentive to put in a lot of hard work learning better statistical techniques, replicating everything, etc. If you think the rate is closer to 100%, you'd question the meaning of your own existence. As long as we aren't keeping track of replication rates, nobody really knows for sure where we are on this continuum.

That's my conclusion. Here's Lehrer's:
The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that's often not the case. Just because an idea is true doesn't mean it can be proved. And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe.
I say it again: huh?

Did your genes make you liberal?

"The new issue of the Journal of Politics, published by Cambridge University, carries the study that says political ideology may be caused by genetic predisposition."
  --- RightPundits.com

"Scientists find 'liberal gene.'"
  --- NBC San Diego

"Liberals may owe their political outlook partly to their genetic make-up, according to new research from the University of California, San Diego, and Harvard University.  Ideology is affected not just by social factors, but also by a dopamine receptor gene called DRD4."
 -- University press release

As in the case yesterday of the study about sisters making you happy, these statements are all technically true (ish -- read below) but deeply misleading. The study in question looks at the effects of number of friends and the DRD4 gene on political ideology. Specifically, they asked people to self-rate on a 5-point scale from very conservative to very liberal. They tested for the DRD4 gene. They also asked people to list up to 5 close friends.

The number of friends one listed did not significantly predict political ideology, nor did the presence or absence of the DRD4 gene. However, there was a significant (p=.02) interaction ... significant, but apparently tiny. The authors do not discuss effect size, but we can try to piece together the information by looking at the regression coefficients.

An estimated coefficient means that if you increase the value of the predictor by 1, the outcome variable increases by the size of the coefficient. So imagine the coefficient between the presence of the gene and political orientation was 2. That would mean that, on average, people with the gene score 2 points higher (more liberal) on the 5-point political orientation scale.

The authors seem to be reporting standardized coefficients, which means that we're looking at increasing values by one standard deviation rather than by one point. The coefficient of the significant interaction is 0.04. This means, roughly, that as the number of friends and the presence of the gene increase by one standard deviation, political orientation scores increase by 0.04 standard deviations. The information we'd need to correctly interpret that isn't given in the paper, but a reasonable estimate is that someone with one extra friend and the gene would score anywhere from .01 to .2 points higher on the scale (remember, 1=very conservative, 2=conservative, 3=moderate, 4=liberal, 5=very liberal).

The authors give a little more information:
For people who have two copies of the [gene], an increase in number of friendships from 0 to 10 friends is associated with increasing ideology in the liberal direction by about 40% of a category on our five-category scale.
People with no copies of the gene were unaffected by the number of friends they had.

None of what I wrote above detracts from the theoretical importance of the paper. Identifying genes that influence behavior, even just a tiny bit, is important as it opens windows into the underlying mechanisms. And to their credit, the authors are very guarded and cautious in their discussion of the results. The media reports -- fed, no doubt, by the university press release -- have focused on the role of the gene in predicting behavior. It should be clear that the gene is next to useless in predicting, for instance, who somebody is going to vote for. Does that make it a gene for liberalism? Maybe.

I would point out one other worry about the study, which even the authors point out. They tested a number of different possible predictors. The chances of getting a false positive increase with every statistical test you run, and they do not appear to have corrected for multiple comparisons. Even with 2,000 participants (which is a large sample), the p-value for the significant interaction was only p=.02, which is significant but not very strong, so the risk that this will not replicate is real. As the authors say, "the way forward is to seek replication in different populations and age groups."
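To put rough numbers on that worry, here's a tiny sketch of how a simple Bonferroni correction would move the significance threshold; the numbers of tests below are made up for illustration, since I'm not relying on the paper's actual count of predictors.

```python
# Hypothetical illustration: a Bonferroni correction divides the threshold by
# the number of tests run. The observed p = .02 survives only if very few
# tests were run; the counts below are made up for illustration.
alpha, observed_p = 0.05, 0.02
for n_tests in (1, 2, 5, 10):
    threshold = alpha / n_tests
    verdict = "still significant" if observed_p < threshold else "no longer significant"
    print(f"{n_tests:2d} tests -> corrected threshold {threshold:.3f}: p = .02 is {verdict}")
```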

Overheard: Scientific Prejudice

A senior colleague recently attended an Autism conference. Language is frequently impaired in Autism, and so many of the neuroscientists there were trying to look at the effects of their animal models of Autism on language.

Yes, you read that correctly: animal models of language. In many cases, rats.


This colleague and I both believe in some amount of phylogenetic continuity: some aspects of language are no doubt built on mechanisms that existed in our distant ancestors (and therefore may exist in other modern-day animals). But given that we have, at best, a rudimentary understanding of the mechanisms underlying language in humans -- and certainly little or no agreement in the field at present -- arguing that a particular behavior in a rat is related to some aspect of language is at best wild-eyed conjecture (and I say this with a great deal of respect for the people who have been taking this problem seriously).

Unfortunately, this colleague didn't get very far in discussing these issues with these researchers. One, for instance, said, "I know rat squeaks are related to language because they're auditory!"

Sure, so's sneezing:



The problem with such conversations, as this colleague pointed out, is that neuroscientists often don't take us cognitive types seriously. After all, they work on a "harder" science. (For those who haven't seen it yet, read this by DrugMonkey -- tangential but fun.) A friend of mine, who is a physicist, once told me that he wasn't sure why psychology was even called a "science" since psychologists don't do experiments -- never mind that I was IMing him from my lab at the time (which he knew).

When I applied to graduate school, I applied to one interdisciplinary program that included cognitive people and also physiology folk. During my interview with one professor who worked on monkey physiology, he interrupted me as I was describing the work I had done as an undergraduate. "Nothing of value about language," he told me, "can be learned by studying humans." When I suggested that perhaps there weren't any good animal models of language to work with, he said, "No, that's just a prejudice on the part of you cognitive people."

Keep in mind that there were several faculty in his department who studied language in humans. This is why such mixed departments aren't always particularly collegial places to work.

I bring this up not to rag on neuroscientists or physicists, but to remind the psychologists in the audience that we have this exact same problem. I don't know how many people have told me that linguistics is mostly bullshit (when I was an undergraduate, one professor of psychology told me: "Don't study linguistics, Josh. It will ruin you as a scientist.") and that philosophy has nothing to offer. When you talk to them in detail, though, you quickly realize that most of them have no idea what linguists or philosophers do, what their methods are, or why those fields have settled on those methods. And that's the height of arrogance: linguists and philosophers include, in their numbers, some of the smartest people on the planet. It only stands to reason that they know something of value.

I'm not defending all the methods used by linguists. They could be improved. (So could methods used by physicists, too.) But linguists do do experiments, and they do work with empirical data. And they're damn smart.

Just sayin'.



Photos: mcfarlando, Jessica Florence.

1/3 of Americans can't speak?

A number of people have been blogging about a recent, still unpublished study suggesting that "a significant proportion of native English speakers are unable to understand some basic sentences." Language Log has a detailed explanation of the methods, but in essence participants were asked to match sentences to pictures. A good fraction made large numbers of mistakes, particularly those who had been high-school drop-outs.

What's going on here? To an extent, this shouldn't be that surprising. We all know there are people who regularly mangle language. But, as Mark Liberman at Language Log points out, at least some of these data are no doubt ascribable to the "paper airplane effect":
At one point we thought we had discovered that a certain fraction of the population is surprisingly deaf to certain fairly easy speech-perception distinctions; the effect, noted in a population of high-school-student subjects, was replicable; but observing one group of subjects more closely, we observed that a similar fraction spent the experiment surreptitiously launching paper airplanes and spitballs at one another.
It's worth remembering that, while many participants in an experiment take it seriously and are happy to help out the researcher, some are just there for the money they get paid. Since we're required to pay people whether they pay attention to the experiment or not, they really don't have any incentive to try hard. Does it surprise anyone that high-school drop-outs are particularly likely to be bad at/uninterested in taking tests?

It's probably relevant that the researchers involved in this study are linguists. There are some linguists who run fabulous experiments, but as a general rule, linguists don't have much training in doing experiments or much familiarity with what data looks like. So it's not surprising that the researchers in question -- and the people to whom they presented the data -- weren't aware of the paper airplane effect.

(I should say that psychology is by no means immune to this problem. Whenever a new method is adopted, it takes a while before there's a critical mass of people who really understand it, and in the meantime a lot of papers with spurious conclusions get written. I'm thinking of fMRI here.)

Garbage in, Garbage out

While watching television, have you ever had a fatal heart attack?

If you answered "yes" to this question, you would have been marked as a "bad participant" in Experimental Turk's recent study. The charitable assumption would be that you weren't paying attention. Importantly for those interested in using Amazon Mechanical Turk for research, participants recruited through AMT were no more likely to answer "yes" than participants tested in a traditional lab-based setting (neither group was likely to say "yes").

It's a nice post, though I think that Experimental Turk's analysis is over-optimistic, for reasons that I'll explain below. More interesting, though, is that Experimental Turk apparently does not always include such catch trials in their experiments. In fact, they find the idea so novel that they actually cited a 2009 paper from the Journal of Experimental Social Psychology that "introduces" the technique -- which means the editors and reviewers at this journal were similarly impressed with the idea.

That's surprising.

Always include catch trials

Including catch trials is often taught as basic experimental method, and for good reason. As Experimental Turk points out, you never know if your participants are paying attention. Inevitably, some aren't -- participants are usually paid or given course credit for participation, so they aren't always very motivated. Identifying and excluding the apathetic participants can clean up your results. But that's not the most important reason to include catch trials.

Even the best participant may not understand the instructions. I have certainly run experiments in which the majority of the participants interpreted the instructions differently from how I intended. A good catch trial is designed such that the correct answer can only be arrived at if you understand the instructions. It is also a good way of making sure you're analyzing your data correctly -- you'd be surprised how often a stray negative sign worms its way into analysis scripts.
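In practice, the exclusion step itself is trivial; here's a minimal sketch (the column names and data are made up) of dropping participants who fail a catch question like the fatal-heart-attack item above:

```python
# A minimal sketch of the exclusion step a catch trial enables. The column
# names and responses here are hypothetical.
import pandas as pd

data = pd.DataFrame({
    "subject": [1, 2, 3, 4],
    "fatal_heart_attack_while_watching_tv": ["no", "no", "yes", "no"],
    "rating": [4, 5, 1, 3],
})

passed = data["fatal_heart_attack_while_watching_tv"] == "no"
print(f"excluding {(~passed).sum()} of {len(data)} participants who failed the catch trial")
clean = data[passed]          # analyze only participants who passed
```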

Sometimes participants also forget instructions. In a recent study, I wasn't finding a difference between the control and experimental groups. I discovered in debriefing that most of the participants in the experimental group had forgotten the key instruction that made the experimental group the experimental group. No wonder there wasn't a difference! And good thing I asked. 

The catch trial -- the question with the obvious answer -- is just one tool in a whole kit of tricks used to validate one's results. There are other options, too. In reading studies, researchers often ask comprehension questions -- not because the answers themselves matter (the real interest is in what the participants do while reading), but simply to prove that the participants in fact did read and understand the material. 

Similar is the embedded experiment -- a mini experiment embedded into your larger experiment, the only purpose of which is to replicate a well-established result. For instance, in a recent experiment I included a vocabulary test (which you can also find in this experiment I'm running with Laura Germine at TestMyBrain.org). I also asked the participants for their SAT scores (these were undergraduates), not because I cared about their scores per se, but because I was able to show that their Verbal SAT scores correlated nicely with performance on the vocabulary test (Math SAT scores less so), helping to validate our vocab test.


Beyond Surveys

Although I described catch trials mostly in terms of survey-format studies, the same techniques can be embedded into nearly any experiment. I've used them for reading-time, eye-tracking and ERP experiments as well. The practice isn't even specific to psychology/cognitive science. During my brief sojourn in a wet lab in high school, my job was to help genotype genetic knock-out mice to make sure that the genes in question really were missing from the relevant mice and not from the control mice. It probably wouldn't have occurred to the PIs in that lab to just assume the knock-out manipulation worked. Fail that, and none of the rest of the experiment is interpretable. 

A version of the catch trial is even seen in debugging software, where the programmer inserts code that isn't relevant to the function of the program per se, but the output of which helps determine whether the code is doing what it's supposed to.

It is true that some experiments resist checks of this sort. I have certainly run experiments where by design I couldn't easily confirm that the participants understood the experiment, were paying attention, etc. But that is better avoided if possible -- which is why when I don't see such checks in an experimental write-up, I assume either (a) the checks were performed but deemed too unimportant/obvious to mention, or (b) they simply weren't performed.

An Odd Omission

If catch trials are a basic aspect of good experimental design, how is it that Experimental Turk and the Journal of Experimental Social Psychology didn't know about it? I'm not sure. Part of it may be due to how experimental design is taught. It's not something you look up in an almanac, and though there are classes on technique (at least in psychology departments), they aren't necessarily that helpful since there are hundreds of types of experiments out there, each of which has its own quirks, and a class can only cover a few.

At least in my experience, experimental design is learned through a combination of the apprenticeship method (working with professors -- or, more often, more experienced graduate students) and figuring it out for yourself. The authors at Experimental Turk, it turns out, come from fields relatively new to experimental design (business, management, and political science), so it's possible they had less access to such institutional knowledge. 

As far as the Journal of Experimental Social Psychology goes... I'm not a social psychologist, and I hesitate to generalize about the field. A lot of social psychology uses questionnaires as instruments. They go to a great deal of trouble to validate the questionnaires -- show that they are predictive of results on other tests or questionnaires, show that the questionnaires have good test-retest reliability, etc. Many of the techniques they use are ones I would like to learn better. But I haven't ever run across one (again, in my limited experience) that actually includes catch trials. Which in itself is interesting.

A clever idea 

I should add that while Experimental Turk cites said journal article for suggesting using questions with obvious answers, that's not actually what the paper suggests. Rather, it suggests using instructions telling participants to ignore certain questions. For instance: 
Sports Participation
Most modern theories of decision making recognize the fact that decisions do not take place in a vacuum. Individual preferences and knowledge, along with situational variables can greatly impact the decision process. In order to facilitate our research on decision making we are interested in knowing certain factors about you, the decision maker. Specifically, we are interested in whether you actually take the time to read the directions; if not, then some of our manipulations that rely on changes in the instructions will be ineffective. So, in order to demonstrate that you have read the instructions, please ignore the sports item below, as well as the continue button. Instead, simply click on the title at the top of this screen (i.e., "sports participation") to proceed to the next screen. Thank you very much.
That's a clever idea. One of my elementary school teachers actually wrote a whole test with instructions like that to teach the class a lesson about reading instructions carefully (and it worked -- I still do!). So it's a good idea I've never seen used in an experimental setting before, but that doesn't mean it hasn't been used. In any case, the discussion in the paper doesn't mention catch trials or other methods of validating data, so it's hard to know whether they did a thorough literature search.

More training

A bad movie can still make entertaining watching. A bad experiment is irredeemable. If the participants didn't understand the instructions, nothing can be gleaned from the data. And there are so many ways to run bad experiments -- I know, because I've employed many of them myself. There are a lot of datasets out there in psychology that have proven, shall we say, resistant to replication. Some of this has to be due to the fact that experimental design is not as good as it could and should be. 

Addendum

As I mentioned higher up, I think Experimental Turk is overly optimistic about the quality of data from AMT. I've run a couple dozen experiments on AMT now, and the percentage of participants that fail the catch trials varies a great deal, from as few as 0% to as many as 20-30%. I haven't made a systematic study of it, but there seem to be a number of contributing factors, some of which are general to all experimental venues (length of the experiment, how interesting it is, how complicated the instructions are) and some of which are specific to AMT (the more related HITs, the more attractive a target the experiment is to spammers).

All the more reason to always include catch trials.


-----------
Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867-872.

Do Language Universals Exist?

Is there an underlying structure common to all languages? There are at least two arguments in favor of that position. One is an in principle argument, and one is based on observed data.

Since Chomsky, many researchers have noted that language would be impossible to learn if one approached it without preconceptions. It's like solving for 4 variables with only 3 equations -- for those of you who have forgotten your math, that can't be done. Quine pointed out the problem for semantics, but the problem extends to syntax.

The data-driven argument is based on the observation that diverse languages share many properties. All languages, it is claimed, have nouns and verbs. All languages have consonants and vowels. All languages put agents (the do-ers; Jane in Jane broke the window) in subject position and patients (the do-ees; the window in Jane broke the window) in object position. And so on. (Here's an extensive list.)

Though many researchers subscribe to this universal grammar hypothesis, it has always been controversial. Last year, Evans and Levinson published an extensive refutation of the hypothesis in Behavioral and Brain Sciences. They don't tackle the in principle argument (it's actually tough to argue against, since it turns out to be logically necessary), but they do take issue with the data-based argument.

Rare languages

Evans and Levinson point out that at best 10% of the world's 7,000 or so languages have been studied in any great detail, and that the bulk of all work on language has focused on English. They claim that researchers only believe in linguistic universals because they've only looked at a relatively small number of often closely-related languages, and they bring up counter-examples to proposed universals found in obscure languages.

This argument cuts both ways. The correct characterization of a language is very, very hard. Much of the work I have been doing lately has been an attempt to correctly characterize the semantics of about 300 related verbs in English. Hundreds of papers have been written about these verbs over the last half-century. Many of them have turned out to be wrong --  not because the researchers were bad, but because the problem is hard.

That's 300 verbs in the most-studied language on the planet, and we still have work to do. Evans and Levinson are basing their arguments on broad-scale phenomena in extremely rare, poorly-studied languages.

A friend of a friend told me...

The rare languages that Evans and Levinson make use of are not -- as they readily acknowledge -- well-understood. In arguing against recursion as a linguistic universal, they bring up Piraha, a language spoken in a handful of villages deep in the Amazon. Without discussing recursion in detail, the basic claim is that there are sentences that are ungrammatical in Piraha, and these sentences are ungrammatical because they require recursion.

To my knowledge, there is one Spanish-Piraha bilingual speaker, in addition to two English-speaking missionaries who, as adults, learned Piraha. The claim that Piraha doesn't have recursion is based on the work of one of those missionaries. So the data that sentences with recursion are ungrammatical in Piraha is based on a limited number of observations. It's not that I don't trust that particular researcher -- it's that I don't trust any single study (including my own), because it's easy to make mistakes.

Looking back at English, I study emotion verbs in which the subject of the verb experiences an emotion (e.g., fear, like, love). A crucial pillar of one well-known theory from the 1990s was that such verbs can't be prefixed with "un". That is, English doesn't have the words unfeared or unliked. While I agree that these words sound odd, a quick Google search shows that unfeared and unliked are actually pretty common. Even more problematic for the theory, unloved is a perfectly good English word. In fact, many of these verbs do allow "un" prefixation. The author, despite being an experienced researcher and a native speaker of English, was just wrong.

Even assuming that you are correct in claiming that a certain word or sentence doesn't appear in a given language, you could be wrong about why. Some years ago, Michael Tomasello (and others) noticed that certain constructions are more rare in child speech than one might naively expect. He assumed this was because the children didn't know those constructions were grammatical. For instance, in inflected languages such as Spanish or Italian, young children rarely use any verbs in all possible forms. A number of people (e.g., Charles Yang) have pointed out that this assumes that the children would want to say all those words. Take a look at this chart of all the forms of the Spanish verbs hablar, comer and vivir. The child might be excused for never using the form habriamos hablado ("we would have spoken") -- that doesn't mean she doesn't know what it is.

In short, even in well-studied languages spoken by many linguists, there can be a lot of confusion. This should give us pause when looking at evidence from a rare language, spoken by few and studied by fewer.

Miracles are unlikely, and rare

Some centuries ago, David Hume got annoyed at people claiming God must exist, otherwise how can you explain the miracles recorded in the Bible? Hume pointed out that by definition, a miracle is something that is essentially impossible. As a general rule, seas don't part, water doesn't turn into wine, and nobody turns into pillars of salt. Then consider that any evidence you have that a miracle did in fact happen could be wrong. If a friend tells you they saw someone turn into a pillar of salt, they could be lying. If you saw it yourself, you could be hallucinating. Hume concludes that however strong your evidence that a miracle happened is, that could never be as strong as the extreme unlikelihood of a miracle actually happening -- and, in any case, the chance that the Bible is wrong is way higher than the chance that Moses in fact did part the Sea of Reeds.

(For those of you who are worried, this isn't necessarily an argument against the existence of God, just an argument against gullibility.)

Back to the question of universals. Let's say you have a candidate linguistic universal, such as recursion, that has shown up in a large number of unrelated and well-studied languages. These facts have been verified by many, many researchers, and you yourself speak several of the languages in question. So the evidence that this is in fact a linguistic universal is very strong.

Then you come across a paper that claims said linguistic universal doesn't apply in some language X. Either the paper is right, and you have to toss out the linguistic universal, or it's wrong, and you don't. Evans and Levinson err on the side of tossing out the linguistic universal. Given the strength of evidence in favor of some of these universals, and the fact that the counter-examples involve relatively poorly-understood languages, I think one might rather err on the other side. As they say, extraordinary claims require extraordinary evidence.

The solution

Obviously, the solution is not to mutter something about "extraordinary claims" and move on. Evans and Levinson's paper includes a plea to researchers to look beyond the usual suspects and start doing more research on distant languages. I couldn't agree more, particularly as many of the world's languages are dying and the opportunity to study them is quickly disappearing.

-------
Evans, N., & Levinson, S. (2009). The myth of language universals: Language diversity and its importance for cognitive science. Behavioral and Brain Sciences, 32(5). DOI: 10.1017/S0140525X0999094X

Retrevo responds

Earlier this month I blogged about a study supposedly produced by Retrevo on texting during sex. The main point of the post was that researchers have to be careful about ensuring data quality (e.g., are the participants actually paying attention to the questions?). I also remarked that I had been unable to find the original article.

Jennifer Jacobson at Retrevo very kindly emailed me several days ago to point me to the original study. The survey question under discussion can be found in the "We interrupt this dinner" section.

So at least the study exists. I'm hoping to find out more about the methods they used.

Texting during sex

"Teens surprisingly OK with texting during sex," notes Slate's news aggregator. This seemed like a good lead for a piece I've wanted to write for a while: just how much we should trust claims that 10% of people agree to claim X. In many cases, we probably should put little faith in those numbers.

As usual, Stephen Colbert explains why. In his infamous roast of George W. Bush, he notes:

Now I know there's some polls out there that say this man has a 32 percent approval rating ... Pay no attention to people who say the glass is half empty ... because 32 percent means it's 2/3 empty. There's still some liquid in that glass, is my point. But I wouldn't drink it. The last third is usually backwash.

This was meant as a dig at those who still supported Bush, but there's a deeper point to be made: there's a certain percentage of people who, in a survey, will say "yes" to anything.

Numbers

For instance, many of my studies involve asking people's opinions about various sentences. In a recent one I ran on Amazon Mechanical Turk, I presented people with sentence fragments and asked them which pronoun they thought would likely be the next word in the sentence:

John went to the store with Sally. She/he...

In that case, it could be either pronoun, so I'm trying to get a sense of what people's biases are. However, I put in some filler trials just to make sure people were paying attention:

Billy went to the store with Alfred. She/he...

In this case, it's really, really unlikely that the pronoun will be "she," since there aren't any female characters in the story. Even so, participants still clicked on "she" over 4% of the time. This wasn't a matter of a few bad participants dragging down the average: I included 10 such sentences, and only one person got more than 1 of them wrong. However, a lot of people did manage to miss 1 ... probably because they were simply sloppy, made a mistake, or were momentarily not thinking ... or because they really thought the next word would be "she."

Those numbers are actually pretty good. In another, slightly harder experiment that I ran on my website, people didn't do so well. This one was shorter, so I included only 4 "catch trials" -- questions for which there was only one reasonable answer. Below is a pie chart of the participants, according to how many of these they got right:

[Pie chart: participants grouped by the number of catch trials they answered correctly]

You can see that over half got them all right, but around a quarter missed 1, and a significant sliver got no more than 50% correct. This could suggest many things: my questions weren't as well-framed as I thought, I had a lot of participants who weren't paying attention, some people were deliberately goofing off, etc.
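For what it's worth, here is a minimal sketch of the kind of screening I have in mind, assuming trial-level data with one row per response. The file name and column names ("subject", "is_catch", "correct") are hypothetical, and the exclusion rule is just one option among many:

```python
import pandas as pd

# Hypothetical data: one row per response, with a boolean "is_catch" column
# marking catch trials and a boolean "correct" column for accuracy.
data = pd.read_csv("responses.csv")

catch = data[data["is_catch"]]
n_correct = catch.groupby("subject")["correct"].sum()
n_catch = catch.groupby("subject")["correct"].count()

# Distribution of participants by number of catch trials answered correctly
# (the breakdown behind a chart like the one above).
print(n_correct.value_counts().sort_index())

# One possible exclusion rule: drop anyone who missed more than one catch trial.
keep = n_correct >= n_catch - 1
clean_data = data[data["subject"].isin(keep[keep].index)]
```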

Poll numbers

This isn't a problem specific to experiments. As we all learned in 2000, a certain number of people accidentally vote for the wrong candidate through some combination of not paying attention and poor ballot design.

So there is a difference between a survey finding that 10% of teens say that they think texting during sex is fine and 10% of teens actually thinking that texting during sex is fine. A good survey will incorporate methods of sussing out who is pulling the surveyor's leg (or not paying attention, or having a slip of the tongue, etc.).
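To make the arithmetic concrete, here is a toy correction under the purely hypothetical assumption that we somehow know what fraction of respondents answered carelessly:

```python
# Back-of-the-envelope correction for careless responding.
# Every number here is invented for illustration.
observed_yes = 0.10      # reported rate of "texting during sex is fine"
careless = 0.05          # assumed fraction answering more or less at random
p_yes_if_careless = 0.5  # chance a careless respondent clicks "yes"

# observed = (1 - careless) * true_rate + careless * p_yes_if_careless
true_rate = (observed_yes - careless * p_yes_if_careless) / (1 - careless)
print(f"Estimated true rate: {true_rate:.1%}")  # about 7.9%
```

Of course, in real life you don't know the careless rate, which is exactly why surveys need catch questions and other quality checks built in.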

Real Surveys

I didn't want to unnecessarily pick on this particular study, so I tried to hunt down the original source to see if they had done anything to protect against the "backwash" factor. Slate linked to a story on mashable.com. Mashable claimed that the research was done by the consumer electronics shopping and review site Retrevo, but only linked to Retrevo's main page, not any particular article. I did find a blog on Retrevo that frequently presents data from surveys, but nothing this year matched the survey in question (though this comes close). I found several other references to this study using Google, but all referenced Retrevo.

If anyone knows how to find the original study, I'd love to see it -- but if it doesn't exist, it wouldn't be the first apocryphal study. So it turns out that the backwash effect isn't the only thing to be careful of when reading survey results.

UPDATE

I have since heard from Retrevo. See here.

Are elders better scientists?

A recent paper, discussed in a recent issue of Nature, found that across disciplines, professors in their 50s and 60s published about twice the number of papers each year as professors in their 30s. This is taken in the article as evidence that older professors can be very productive.

Nature allows readers to comment on news briefs, and the comments raised the same issues I had. Here are the first two, for instance:

They don't seem to consider that older professors have larger research groups, i.e. more underlings to actually write the papers. Perhaps a better photo to illustrate the story would be the aged professor in their office wielding a red pen over their students' manuscripts.

Well, the older professors are also more established and have more connections, and therefore can participate in both small and large collaborative projects. No offense, but this survey only seems to prove an already obvious point.

Basically, older faculty not only tend to have more graduate students and post-docs, they also tend to have broader collaboration networks. This is not to say that older researchers are not productive, or that even less-productive older researchers aren't valuable members of the community -- just that these data seem hard to interpret.

Language Wars

I was struck by a comment to a post a while back on Cognitive Daily:

It's "DOES the use of hand gestures." Please, pay attention; grammar matters. "The use of hand gestures" is the subject, and it is singular.

Grammar matters?

A certain segment of the population gets very worked up about "correct usage" of language. As a scientist, I understand the difference between "standard" and "non-standard" language, and why one might care, as an educator, about the standard. Language is most useful when everybody understands one another (cf. the Tower of Babel). This is why the standardization of spelling was such an important -- and relatively recent -- achievement.

However, the people who say "pay attention; grammar matters" seem to be concerned with something else entirely. I can't say for sure what this particular commenter cared about, but most such people that I know believe that without proper language, one cannot have proper thoughts. Thus, if we could make everybody produce perfectly diagrammable sentences, everyone would finally think right, too.

Really?
To actually prove this contention, you would have to do a controlled experiment. Find two people who speak with "poor" grammar and have similarly sloppy thinking, teach one the correct grammar, and see if that person now thinks more clearly than the uneducated speaker.

To the best of my knowledge, no such experiment has been done -- no doubt partly because scientists as a group seem to reject such thinking altogether. For one thing, you would have to define "correct grammar," which is a priori impossible to do. The only known way to determine whether a sentence is grammatical is to query a native speaker of the language. That's it. There are no other methods.

So, now suppose we have two people (for instance, Henry Higgins and Eliza Doolittle) who disagree as to whether a sentence is grammatical. How do we decide between the two of them? Typically, most people for whatever reason side with the wealthier and more politically powerful of the two (in this case, Henry Higgins).

That doesn't sound very democratic. So we could take a poll. Typically, you'll find that one judgment is more common than another. But now we have only defined a standard: not necessarily a "correct" judgment. Moreover, these differences in judgments often vary as a function of where you live. As I understand it, there are parts of the South where most people will agree that you simply can't refer to a group of people as "you" -- "y'all" is the correct term.

A war of words
If it is the case that there is no evidence that "correct grammar" helps people think more correctly, and that this is because there is no such thing as correct grammar -- and I assure you, there isn't -- then why do people get so hung up on it?

First, you might answer that most people live their lives just fine without ever thinking about correct and incorrect grammar. I suspect that is false. Much hay has been made of Sarah Palin's "mangling" of the English language, some of which is valid, but much of which comes down to the fact that she speaks a nonstandard dialect. More than one Southerner has remarked that Yankees think they are dumb just because of their accent. If you believe you've never done this yourself, ask: have you really never assumed someone with a West Virginian accent was dumb? If you haven't, then at least accept that even babies prefer people who speak with the local standard accent (note that somewhat older children may actually prefer a person with a locally high-status accent over their own accent).

I've heard it claimed that wars have been fought over linguistic differences, but I couldn't think of any obvious examples (please comment away if you have one). Still, I think the evidence is compelling that people really, really care about accent and language use, and this goes beyond a belief in the empirical claim that right language leads to right thoughts. This runs deeper. Hopefully we will some day understand why.

A rash of scientific plagiarism?

Nature reports that Harold Garner of the University of Texas Southwestern Medical Center in Dallas has been scouring the medical literature using an automated text-matching software package to catch plagiarized articles.

A surprising number have been found: 181 papers have been classified as duplicates, sharing, on average, 85% of their text with a previous paper. One quarter of these are nearly 100% identical to a previous publication.

While it is troubling that anybody would be so brazen, the fact that they have gotten away with it so far says something: there are a lot of journals. And a lot of papers. For a plagiarist to be successful, it must be the case that neither the editor nor any of the referees have read the original article -- this despite the fact that referees are typically chosen because they are experts in the field the article addresses.

That, I think, is the big news: that it is possible to plagiarize so blatantly.

Incidentally, the Nature news brief suggests that the confirmed plagiarism is usually carried out in obscure journals. This means that the plagiarists are gaining relatively little for their effort, and the original authors are losing little.

That said

Garner's project has apparently identified 75,000 abstracts that seem highly similar. It's hard to tell what that means, so we'll have to wait for the full report.

An abstract is about 200 words long. PsychInfo currently lists 10,098 abstracts that contain the phrase "working memory." One would assume that, even if all of them report independent work, many are highly similar just by chance. So I hope to find out more about how "highly similar" is being operationalized in this project.
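For concreteness, here is one way "highly similar" might be operationalized -- this is just a guess on my part, not a description of Garner's actual software: measure the overlap between the short word sequences (shingles) that two abstracts share.

```python
# A guess at one possible similarity measure (not Garner's actual method):
# Jaccard overlap of word 3-grams between two texts.
def shingles(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(a, b, n=3):
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

# Invented example abstracts, for illustration only.
abstract1 = "working memory capacity predicts performance on many reasoning tasks"
abstract2 = "working memory capacity predicts performance across many reasoning tasks"
print(f"similarity = {jaccard_similarity(abstract1, abstract2):.2f}")
```

Whatever the actual measure, the threshold matters: with a loose enough criterion, plenty of honestly independent abstracts in the same subfield will look "highly similar."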

While I suspect that plagiarism is not a huge problem, I still think it is fantastic that people are attacking it with these modern tools. I think we will be seeing a lot more of this type of work. (Actually, come to think of it, a professor I had in 2002 actually used an automated plagiarism-catching software program to screen student homework, so this has been around for a while.)

Neuroscience that matters

Science, like any other human activity, is subject to trends and fashions. Some are brief fads; others are slow waves that wash through society. For the last decade or two, cognitive neuroscience has been hot -- particularly neuroimaging.

A pretty typical example of cognitive neuroscience appears in this recent piece by the New York Times about research into the brain basis of sarcasm, which I read because I've been considering starting some work on sarcasm.

I generally don't like the media coverage of cognitive neuroscience, since it often acts surprised that human behavior is the result of activity in our brain. This particular article did not have that problem, but it still suffered from failing to answer the most important question any article has to answer.

The Right Parahippocampal Gyrus detects Sarcasm. So What?

The punch line of the article was that a neuroimaging study found the right parahippocampal gyrus to be active in sarcasm detection. Why this is important is left to the reader to decide.

So why is it?

In a lecture last spring, Randy Buckner distinguished between two types of cognitive neuroscience. 

In one, neuroscience techniques (patient studies, fMRI, single-cell recording, etc.) are used as behavioral measures. The goal of that type of research is to better understand human behavior. For instance, you might use fMRI to see if different brain regions are used in interpreting sarcasm and irony, which would suggest that the two phenomena are truly distinct.

The other kind of cognitive neuroscience uses the techniques of neuroscience to better understand how the brain produces the behavior in question. For instance, what computations do the neurons perform such that a person can perceive sarcasm?

I am sympathetic to both types of cognitive neuroscience, though I tend to feel that there are very few human behaviors we understand well enough to seriously explore their neural instantiations (the basic phenomena of sensory perception are the only clear candidates I can think of, though basic memory processes might also make that list). You can't reverse-engineer a product if you don't know what it does. 

Interpreting Cognitive Neuroscience

In terms of the sarcasm article, it wasn't clear what this study adds to our understanding of what sarcasm is.  So I don't think it counts as the first type of cognitive neuroscience. 

Is it the second type? Some part of the brain must be involved in detecting sarcasm, so discovering which part it is doesn't, in and of itself, tell us much about implementation. Finding out that Sprint's national HQ is in Overland Park, KS, doesn't, by itself, tell you very much about Sprint, other than that it has an HQ, which you probably already guessed.

That doesn't mean it's without information. Based on what you know about Overland Park, KS -- its tax regulations, local worker pool, lines of transportation and communication, etc. -- you might derive a great deal of information about how Sprint works. But, unfortunately, the Times article didn't tell us much that was useful. I certainly don't know enough about the right parahippocampal gyrus to tell much of a story.

This is not a criticism of the journal article, which I haven't yet read. I'm actually pretty happy somebody is working on this issue. I just wish the Times had told me something useful about their work.

Sorry, New York Times, cognitive dissonance still exists

Earlier this week, New York Times columnist John Tierney reported a potential flaw in a classic psychology experiment. It turns out that the experimental finding -- cognitive dissonance -- is safe and sound (see below). But first, here are the basic claims:

Cognitive dissonance generally refers to changing your beliefs and desires to match what you do. That is, rather than working hard for something you like, you may believe you like something because you worked so hard for it. 

Laboratory experiments (of which there have been hundreds if not thousands) tend to be of the following flavor (quoted from the Tierney blog post). Have someone rate several different objects (such as different colored M&Ms) in terms of how much they like them. From that set of objects, choose three (say, red, blue and green) that the person likes equally well. Then let the person choose between two of them (the red and blue M&M). 

Presumably (and this will be the catch) the person chooses randomly, since she likes both equally. Say she chooses the red M&M. Then let her choose between red and green. You would predict that she would again choose randomly, since she likes the two colors equally, but she will nearly invariably choose the red M&M. This is taken as evidence that her originally random choice of the red M&M actually changed her preferences, so that she now likes red better than either blue or green.

The basic problem with this experiment, according to M. Keith Chen of Yale and as reported by Tierney, is that we don't really know that the person didn't originally prefer red. She may have rated them similarly, but she chose red over blue. The math works out such that if she in fact already preferred red over blue, she probably also actually preferred red over green.
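Chen's point is easy to demonstrate with a toy simulation. In the sketch below (my own toy model, not Chen's actual analysis), each color has a fixed "true" utility, the initial ratings are assumed to be too coarse to detect the differences, and no preference ever changes -- yet the person who picks red over blue goes on to pick red over green about two-thirds of the time:

```python
import random

# Toy model of Chen's critique (my construction, not his actual math):
# true preferences are fixed and never change, but the initial ratings were
# too coarse to detect them, so "tied" colors still differ in true utility.
def p_red_over_green_given_red_over_blue(n=100_000):
    chose_red_over_blue = 0
    also_chose_red_over_green = 0
    for _ in range(n):
        utils = {color: random.gauss(0, 1) for color in ("red", "blue", "green")}
        if utils["red"] > utils["blue"]:          # first choice: red over blue
            chose_red_over_blue += 1
            if utils["red"] > utils["green"]:     # second choice: red over green
                also_chose_red_over_green += 1
    return also_chose_red_over_green / chose_red_over_blue

print(f"{p_red_over_green_given_red_over_blue():.2f}")  # roughly 0.67, not 0.50
```

So a better-than-chance preference for red over green is, by itself, consistent with no preference change at all -- which is exactly Chen's worry.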

Tierney calls this a "fatal flaw" in cognitive dissonance research, and asks "choice rationalization has been considered one of the most well-established theories in social psychology. Does it need to be reconsidered?"

Short answer: No.

First, it is important to point out that Chen has shown that if the original preferences were measured incorrectly, then this type of experiment might suggest cognitive dissonance even where there is none. He does not show that the original measurements were in error. 

However, even if that were true, it would not mean that cognitive dissonance does not exist. This is a classic fallacy in logic (denying the antecedent). Chen's argument is of the following form: If Socrates is a woman, then he is mortal. Socrates is not a woman. Therefore, he is not mortal.

In any case, cognitive dissonance has been shown in studies that do not fall under Chen's criticisms. Louisa Egan and collaborators solved this problem by having their subjects choose between items they couldn't see. Since the subjects knew nothing about the items, they couldn't possibly have a pre-existing preference. Even so, they showed the classic pattern of results.

By all appearances in the Tierney article, Chen is unaware of this study (which, to be fair, has not yet been published): "I wouldn't be completely surprised if [cognitive dissonance] exists, but I've never seen it measured correctly." This is hard to believe, since Chen not only works at the same university as Egan, he is a close collaborator of Laurie Santos (Egan's graduate advisor). It's not clear why he would neglect to mention this study, particularly since a blanket critique of cognitive dissonance research in the New York Times is embarrassing to Egan and Santos at a time when Egan is on the job market (and it appears to have upset a lot of people).

Thus, it's puzzling that Chen claims that no existing study unambiguously shows cognitive dissonance. He might, however, be able to make the weaker claim that some studies said to show cognitive dissonance in fact do not. That is a reasonable claim and worth testing; indeed, Chen reports that he is testing it now. It is worth keeping in mind that, for the time being, Chen has only an untested hypothesis. It's an intriguing and potentially valuable one, but there isn't any evidence yet that it is correct.

See the original article here.