Games with Words: scientific methods


In praise of experiments

Posted by GamesWithWords on Friday, December 14, 2012

Today, the excellent Neuroskeptic writes about a new study investigating which US states are most suicidal. The interesting twist was the form of the data: Google searches. It's an interesting study and an interesting use of Google searches, but what struck me was Neuroskeptic's closing thoughts.

Over the past couple of years there's been a flurry of studies based on analyzing Google and Twitter trends. What's interesting to me is that we're really in the early days of this, when you think about likely future technologies. What will happen when everyone's wearing a computer 24/7 that records their every word and move, and even what they see?

Eventually, psychology and sociology might evolve (or degenerate) into no more than the analysis of such data...

It's always dangerous to predict the future, but here's my prediction: Not a chance. It gets down to a distinction between observational studies and experiments. Observational studies (where you record what happens in the course of normal events) are useful, particularly when you care about questions like what is the state of the world? They are much less useful when you want to know why is the world the way it is?

There are a couple reasons.

Reason #1: The correlation fallacy

First, observational studies are really about studying correlations. To have much power to analyze interesting correlations, you need a lot of data. This is what makes Google and Twitter powerful: they provide a lot of data. But correlation, famously, doesn't always tell you much about causation.

For instance, it is now well-known that you can use the number of pirates active in the world's oceans and seas to reasonably predict average global temperature (there's a strong correlation):

I did not know until recently that Google search data has now definitively shown a correlation between the amount of movie piracy and global warming as well:

In the case of real pirates vs. the temperature, the causation runs the other way (temperature affects weather, which affects seafaring activities). I have no idea what causes the correlation between searches for free movies and searches about global warming; perhaps some third factor. To give another silly example, there is a lot more traffic on the roads during daylight than at night, but this isn't because cars are solar-powered!

The point is that experiments don't have this problem: you go out and manipulate the world to see what happens. Change the number of pirates and see if global temperatures change. Nobody has tried this (to my knowledge), but I'm willing to bet it won't work.

(Of course, there are natural experiments, which are a hybrid of observational studies and experiments: the experimenter doesn't manipulate the world herself but rather waits until somebody else, in the course of normal events, does it for her. Good examples are comparing different states as they adopt bicycle helmet laws at different times and comparing that against head injury stats in the various states. These are rarely as well-controlled as an actual experiment, but have the advantage of ecological validity.)

Reason #2: Life's too short

The second is that observational studies are limited by what actually happens in the world. You won't, from an observational study, find out what the effect on US politics would be of every US senator taking up crack while every US representative takes up meth. (I hope not, anyway.)

That was an absurd example, but the problem is real. Language gives lots of great examples. Suppose you want to find out what sentences in any given language are grammatical and what sentences are not. You could do an observational study and see what sentences people say. Those are grammatical; sentences you haven't heard probably aren't.

The problem with this is that people are boring and repetitive. A small number of words (heck, a small number of sentence fragments) accounts for most of what people say and write. The vast majority of grammatical sentences will never appear in your observational sample no matter how long you wait, because there are actually an infinite number of grammatical English sentences. (In his impressive "Who's afraid of George Kingsley Zipf?", Charles Yang shows how a number of prominent language researchers went astray by paying too much attention to this kind of observational study.)

The basic feature of the problem is that for building theories -- explaining why things are the way they are -- very often what you care about are the border cases. Human behavior is largely repetitive, and the border cases are quite rare. Experiments turn this around: by deliberately choosing the situations we put our participants in, we can focus on the informative test cases.

The experimental method: Here to stay

None of this should be taken as meaning that I don't think observational studies are useful. I conduct them myself. A prerequisite to asking the question Why are things the way they are is knowing, in fact, what way things are. There is also the question of ecological validity. When we conduct laboratory experiments, we construct artificial situations and then try to generalize the results to real life. It's good to know something about real life in order to inform those generalizations.

But just as I can't imagine observational studies disappearing, I can't imagine them replacing experimentation, either.

Is Psychology a science?: Redux

Posted by GamesWithWords on Tuesday, November 13, 2012

The third-most read post on this blog is "Is Psychology a science?". I was a few years younger then and still had strong memories of one of my friends complaining, when we were both undergraduates, that he had to take a psychology course as part of his science distributional requirements. "Psychology isn't a science," he said, "because they don't do experiments." Since he was telling me this over AIM as I was sitting in my psychology laboratory, analyzing an experiment, it didn't go over well.

It's been a popular post, but I haven't written about the subject much since, in part because I started to suspect that the "psychology isn't a science" bias might actually be a thing of ignorant undergraduates and a few cranks. It's something I've rarely heard in the last few years, and there's no need to write diatribes against a nonexistent prejudice.

In retrospect, maybe I haven't come across these opinions because I mostly hang out with other psychologists. A colleague recently forwarded me this blog post ("Keep Psychology out of the science club"), which links to a few other similar pieces on blogs and in newspapers. So it seems the issue is alive and well.

Some articles one comes across are of the "psychologists don't do experiments" variety; these are easily explained by ignorance and an inability to use Google. But some folks raise some real concerns which, while I think they are misplaced, really are worth thinking about.

Psychology is too hard

One common theme that I came across is that psychology is simply too difficult. We'll never understand human behavior very well, so maybe we shouldn't even try. For instance, Gary Gutting, writing in the Opinionator at the New York Times, said:

Social sciences may be surrounded by the "paraphernalia" of the natural sciences, such as technical terminology, mathematical equations, empirical data and even carefully designed experiments. But when it comes to generating reliable scientific knowledge, there is nothing more important than frequent and detailed predictions of future events ... while the physical sciences produce many detailed and precise predictions, the social sciences do not ... Because of the many interrelated causes at work in social systems, many questions are simply "impervious to experimentation" ... even when we can get reliable experimental results, the causal complexity restricts us...

In a Washington Post editorial, Charles Lane wrote:

The NSF shouldn't fund any social science. Federal funding for mathematics, engineering and other "hard" sciences is appropriate. In these fields, researchers can test their hypotheses under controlled conditions; then those experiments can be repeated by others. Though quantitative methods may rule economics, political science and psychology, these disciplines can never achieve the objectivity of the natural sciences. Those who study social behavior -- or fund studies of it -- are inevitably influenced by value judgments, left, right, and center. And unlike hypotheses in the hard sciences, hypotheses about society usually can't be proven or disproven by experimentation. Society is not a laboratory.

Alex Berezow at the Newton Blog agrees:

Making useful predictions is a vital part of the scientific process, but psychology has a dismal record in this regard.

Is that a fair critique?

These writers don't entirely miss the mark. It really is true that psychology does not make as precise or as accurate predictions as, say, physics. That is not the same thing as saying that we can't make any predictions. Berezow complains about happiness research:

Happiness research is a great example of why psychology isn't a science. How exactly should "happiness" be defined? The meaning of the word differs from person to person, and especially between cultures. What makes Americans happy doesn't necessarily make Chinese people happy. How does one measure happiness? Psychologists can't use a ruler or a microscope, so they invent an arbitrary scale. Today, personally, I'm feeling about a 3.7 out of 5. How about you? ... How can an experiment be consistently reproducible or provide any useful predictions if the basic terms are vague and unquantifiable?

That's a great question! Let's start with the facts. It is true that we don't know exactly what it means to be a 3.7 on a scale of 1-5. But we do know a few interesting things.

People's predictions of how happy they will rate themselves in the future are systematically biased. People will say that good things (like getting tenure) will make them very happy (a 5 out of 5) and that bad things (like not getting tenure) will make them very sad (a 1 out of 5). But when you ask those same people to rate their happiness a little while after the event, they generally rate themselves as not nearly so happy or unhappy as they predicted. (Similarly, people who lose a limb usually rate themselves as about as happy afterwards as before, provided you give them a little time to adjust.) People who have children normally see a drop in how happy they rate themselves. They only start to recover after their children leave the nest. There is also the "future anhedonia" effect: people think that good things (e.g., an ice cream sundae) will make them happier now (on our 1-5 scale) than those same good things would make them in the future, and conversely for bad things (e.g., doing my homework won't feel so bad if I do it tomorrow rather than today). And so on. (These and many other examples can be found in Dan Gilbert's Stumbling on Happiness.)

These and other findings are highly reliable, despite the fact that we don't have a direct, objective measurement of happiness. In fact, as Dan Gilbert has pointed out, we would only consider that "direct" measurement to be a measurement of happiness if it correlated really well with how happy people said they were. To the extent it diverged from how happy people claim to be, we would start to distrust the "direct" measurement.

I personally am glad that we know what we know about happiness, though I wish we knew more. I picked happiness to defend because I've noticed that even those who defend psychology in comments sections give up happiness research as a lost cause. I think it's pretty interesting, useful work. It would be even easier to defend, for instance, low-level vision research, which makes remarkably precise predictions, has clear theories of the relationship between the psychological phenomena and the neural implementations, etc. (See also this post for some psychology success stories.)

Just how good do you need your predictions to be?

Still, it is true that we can't always make the precise predictions that can be made in some other fields. Of course, other fields can't always make the precise predictions, either. While physicists are great at telling you what will happen to rigid objects moving through vacuums, predicting the motions of real objects in the real world has been traditionally a lot harder, and understanding fluid dynamics has been deeply problematic (though I understand this has been getting a lot better in recent years). And that's without pulling out the Heisenberg Uncertainty Principle, which should cause anyone who wants precise, deterministic predictions to declare physics a non-science.

Also, some parts of psychology are able to make much more precise predictions than others do. Anything amenable to psychophysics tends to be much more precise, and vision researchers, as already noted, have remarkably well worked-out theories of low- and mid-level vision.

This line of discussion also raises an interesting question: when exactly did physics become a science? Was it a science in Newton's day, when we still knew squat about electromagnetism -- much less elementary particles -- and couldn't make even rough predictions about turbulent air or fluid systems? And to people 350 years from now, will the physics of today seem like a "real" science? (My guess: no.)

Worries

Berezow ends his post with the following caution:

To claim [psychology] is a "science" is inaccurate. Actually, it's worse than that. It's an attempt to redefine science. Science, redefined, is no longer the empirical analysis of the natural world; instead, it is any topic that sprinkles a few numbers around. This is dangerous, because, under such a loose definition, anything can qualify as science. And when anything qualifies as science, science can no longer claim to have a unique grasp on secular truth.

I have a different worry. My worry is that someone gets ahold of a time machine, goes back in time to 1661 and convinces Newton to lay off that non-scientific "physics" crap. Pre-Newtonian physics was a hodgepodge of knowledge, little resembling what we think of as science today. Making precise predictions about the messy, physical world we live in no doubt seemed an impossible pipe-dream to many. Luckily, folks like Newton kept plugging away, and three and a half centuries later, here we are.

We should keep in mind that the serious study of the mind only began in the mid-1800s; physics has a significant head-start. And, as the anti-psychology commentators are happy to point out, psychology is much, much harder than physics or chemistry. But the only reason I can see to pull the plug is if we are sure that (a) we have learned nothing in the last 150 years, and (b) we will never make any further progress. These are empirical claims and so subject to test (I think the first one has already been falsified). So here's a proposed experiment: psychologists keep on doing psychology, and people who don't want to don't have to. And we'll wait a few decades and see who knows more about the human mind.

Results: Replication in Psychology

Posted by GamesWithWords on Wednesday, March 07, 2012

My paper with Adena Schachner on replication in psychology is now published. The paper contains 3 main sections: a reasonably thorough literature review on replication rates in psychology, a proposal as to how to improve replication rates (primarily through tracking replication rates), and the results of a survey of psychologists on replication practices (many thanks to all who participated). The result of the survey was that while not nearly enough replications are attempted, there are actually more being attempted than we had guessed (or than many of our colleagues that we discussed this project with had guessed).
This paper is part of a larger collection of papers on reimagining the publication and review process, and as more of those papers are printed, I plan to discuss at least some of them.

Survey on Replication

Posted by GamesWithWords on Saturday, April 23, 2011

Are you a researcher working in psychology or related domains (neuroscience, linguistics, etc.)? A colleague and I are conducting a survey on replication in these fields, for inclusion in an upcoming special issue of Frontiers in Computational Neuroscience. You can fill out the survey here.


Overheard: Converting common knowledge into scientific knowledge

Posted by josh on Sunday, February 20, 2011

Because they are so familiar, it is easy to assume that category labels drawn from everyday language are self-evidently the correct way to describe emotion. However, transforming everyday categorical descriptions into an effective research tool is at the least a challenge.

Cowie & Cornelius (2003) Describing the emotional states that are expressed in speech. Speech Communication 40, 5-32.

When should an effect be called significant?

Posted by GamesWithWords on Thursday, December 16, 2010

note: This post originally contained an error in the mathematics, which Tal of Citation Needed kindly pointed out. This error has been corrected.

In the thread following my earlier post on false positives, Tal made the observation that a typical study that is significant at the p=.05 level has a 50% chance of being replicated.

It turns out that this depends heavily on what you mean by replicate. I'm going to work through some numbers below. Again, stats isn't my specialty, so please anyone jump in to correct errors. But I think I've got the general shape of the issues correct.

I got a significant result! Can I get it again?

Let's say you ran an experiment comparing the IQ scores of 15 people who prefer kiwi with the IQ scores of 15 people who prefer rambutan. You find that people who prefer rambutan have IQs 11.2 points higher than those who prefer kiwi. Assuming the standard deviation is 15 (which is how IQ tests are normalized), that should give you a t-value of 11.2 / (15 * (2/15)^.5) = 2.04 and a p-value of about .05.
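For readers who want to check the arithmetic above, here is a minimal Python sketch. It assumes two equal-sized groups, a standard deviation of exactly 15 in both, and a two-sided test with 2n - 2 degrees of freedom. The function names are made up for illustration, and the p-value comes from numerically integrating the t density rather than from a statistics package.

```python
import math

def two_sample_t(mean_diff, sd, n_per_group):
    """t statistic for two equal-sized groups that share the same SD."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    return mean_diff / standard_error

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_t = total * h / 3
    return 1 - 2 * area_zero_to_t

t = two_sample_t(mean_diff=11.2, sd=15, n_per_group=15)
p = two_sided_p(t, df=2 * 15 - 2)
print(f"t = {t:.2f}, p = {p:.3f}")
```

The t-value should come out at about 2.04, with a p-value right around the .05 line.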

So you've now got a significant result! You tell all your friends, and they try running the same experiment. What are the chances they'll get the same results, significant at the p=.05 level?

The chances are not great. Even assuming that the underlying effect is real (rambutan-eaters really are smarter), your friends will only replicate your result about 51% of the time, assuming they use exactly the same methods (according to a nifty power calculator found online here).
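If you don't trust an online calculator, you can get a rough check by brute force. The sketch below is a simulation, not the calculator's method: it assumes IQ scores are normally distributed with a standard deviation of 15, that the true difference really is 11.2 points, and that "replicating" means a two-tailed t-test significant at .05. The helper names and the fixed random seed are arbitrary choices for illustration.

```python
import math
import random

def pooled_t(a, b):
    """Equal-variance two-sample t statistic for two lists of scores."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    return (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

def replication_rate(true_diff, sd, n, t_crit, runs=10000, seed=1):
    """Fraction of simulated studies whose |t| exceeds the critical value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        kiwi = [rng.gauss(100, sd) for _ in range(n)]
        rambutan = [rng.gauss(100 + true_diff, sd) for _ in range(n)]
        if abs(pooled_t(kiwi, rambutan)) > t_crit:
            hits += 1
    return hits / runs

# 2.048 is the two-tailed .05 critical value of t with 28 degrees of freedom.
rate = replication_rate(true_diff=11.2, sd=15, n=15, t_crit=2.048)
print(f"replicated in {rate:.0%} of simulated studies")
```

The simulated rate should land close to one half, give or take sampling noise.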

Define "get it"

Of course, we were assuming above that rambutan-eaters really are 11.2 IQ points smarter than kiwi-eaters (BTW I like both kiwi and rambutan, so nothing is meant by this example). In which case, your friends might not have gotten results significant at the p=.05 level, but they very likely found higher average IQs for their samples of rambutan-eaters relative to kiwi-eaters. And of course, what we really care about is how easy it will be to replicate the rambutan/kiwi difference, not how easy it will be to get the significant p-value again. The point of science is not to be able to predict statistically-significant differences but simply to predict differences. It's well beyond my statistical abilities to say how often this would happen, but hopefully someone will step up in the comments and let us know.
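Here is one hedged stab at that question. If we assume the true difference really is 11.2 points, that the standard deviation is 15 in both groups, and that sample means are approximately normally distributed, then the observed difference between two groups of 15 is itself roughly normal, centered on 11.2 with a standard error of 15 * (2/15)^.5. The chance that a replication finds rambutan-eaters ahead at all is then just the area of that curve above zero. The function name below is made up for illustration.

```python
import math
from statistics import NormalDist

def prob_same_direction(true_diff, sd, n_per_group):
    """P(observed difference > 0) under a normal model of the mean difference."""
    standard_error = sd * math.sqrt(2 / n_per_group)
    observed_diff = NormalDist(mu=true_diff, sigma=standard_error)
    return 1 - observed_diff.cdf(0)

p_direction = prob_same_direction(true_diff=11.2, sd=15, n_per_group=15)
print(f"same-direction difference in about {p_direction:.0%} of replications")
```

Under these assumptions the answer is very high, far higher than the chance of hitting p=.05 again.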

In practice, though, other people are only going to follow up on your effect if they can replicate it at the standard p=.05 level. What can we do to improve the chances of replicability?

Lower alphas

Let's suppose your effect had been significant at the p=.01 level. We can manage that while keeping the effect-size the same (11.2 IQ points) if we increase our sample to 26 kiwi-eaters and 26 rambutan-eaters (t = 11.2 / (15 * (2/26)^.5) = 2.7). Now our chance of getting another significant result at the p=.01 level is ... 52%.

But we don't really care about getting a p=.01 again; we want to get the result again at the p=.05 level, which should happen around 76% of the time. Now, what if we had a result significant at the p=.001 level the first time around? We'd have needed about 42 subjects per condition. The chance of replicating that at the p=.05 level is 92%.

p-value    Subjects per condition    Likelihood of repeating at p=.05 level
.05        15                        51%
.01        26                        76%
.001       42                        92%
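The numbers in the table can be roughly reproduced with a back-of-the-envelope normal approximation. This is a sketch, not the method the online calculator uses: it treats the test statistic as normal rather than t-distributed, assumes the true effect is 11.2 points with a standard deviation of 15, and counts only replications in the original direction. Because it ignores the extra uncertainty in small samples, it runs a little optimistic for the smallest group size. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(true_diff, sd, n_per_group, alpha=0.05):
    """Normal-approximation power of a two-tailed test, same-direction only."""
    z = NormalDist()
    expected_z = true_diff / (sd * math.sqrt(2 / n_per_group))
    z_crit = z.inv_cdf(1 - alpha / 2)
    return 1 - z.cdf(z_crit - expected_z)

powers = {n: approx_power(11.2, 15, n) for n in (15, 26, 42)}
for n, power in powers.items():
    print(f"{n} per condition: about {power:.0%}")
```

The approximation tracks the table within a few percentage points, which is about as much as one should expect from it.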

Replication

Of course, there are things that I'm not considering here, like the possibility that your original experiment underestimated the effect size. For instance, let's say that the true effect size is 15 IQ points (which is a lot!). Now, your chance of finding an effect significant at the p=.05 level with only 15 participants per condition is 75%. That's a lot better than what we started with, though not perfect. To have an effect large enough to see it 95% of the time at the p=.05 level, it would need to be over 20 IQ points, which is a monstrously large effect. Incidentally, if you ran this study with 15 rambutan-eaters and 15 kiwi-eaters and found a 20 IQ point effect, that would be significant at roughly the p=.001 level.

What I get from all this is that if you want a result that you and others will be able to replicate, you're going to want the p-value in your original experiment to have been lower than .05.

Poll: Do You Care about Effect Size?

Posted by GamesWithWords on Wednesday, December 15, 2010

My recent post on false positives has generated a long thread, with a large number of informative comments from Tal, who has convinced me to think a lot more about power analyses. I recommend reading the comments.

One issue that has come up is if and when we actually care about the null hypothesis. I argue that a fair amount of the time we really are deeply interested in knowing whether an effect exists or not. I don't entirely understand Tal's argument -- I'm sure he'll help out in the comments -- but I think he is saying that in any given experiment, there are always confounds such that if you have enough power, you'll find a significant result. So whether or not the manipulation has its intended effect, the confounds will ensure that the null hypothesis is false.

Perhaps. Having run studies with thousands of participants and no significant effect, I'm skeptical that this is always true, but obviously the data we'd need to test his claim does not and never will exist. In any case, this is why we use converging methods: the undetected confounds in one method will (hopefully) not appear in the others, and across studies the truth will emerge.

Still, this discussion has led me to wonder: across fields, how often are people deeply interested in the existence or absence of an effect (as opposed to the size of the effect)? Please leave a comment with your field and how often you really are interested in the presence or absence of an effect. Examples are encouraged but are unnecessary.

I'm already on the record saying I am often interested in the existence of an effect and rarely care about its size. Below I give my examples.

Why I rarely care about effect size

Priming: Priming is expected to occur whenever two mental constructs share underlying structure or recruit the same underlying processes. There is a lot of interest in the underlying representations of certain verb forms. Verbs of transfer can be used two ways. Compare: John gave the book to Sally vs. John gave Sally the book. The order of the words changes and there either is or isn't a preposition.

In a number of experiments, Thothathiri and Snedeker asked whether hearing give in one form would make it easier for people to understand other verbs of transfer in the same form (e.g., send). On some theories, it should (due to shared structure between verbs). On some theories, it shouldn't (due to verbs not sharing structure). So the existence of the effect mattered. But what about effect size: how much of an effect should priming have? It's an interesting question, but irrelevant to the hypotheses this study was testing, and frankly currently nobody has any hypotheses one way or another.

Development: Thothathiri and Snedeker found the priming effect in adults. They also tested children. For any adult behavior, there is always the question of at what point in development the behavior should appear. This is a deep, interesting question, since some behaviors are (roughly-speaking) innate and some are learned and you'd expect the former to appear earlier than the latter. Again, there are theories that very strongly predict that young children should or should not show the same effect as adults. Once again, the existence of the effect matters.

What about the size? Again, nobody has any predictions, and effect size cannot be used to tease apart theories. Even if the effect were much smaller in children, that wouldn't really matter, since in general children are difficult participants to work with and their effects are often smaller because a certain number simply didn't understand the task.

Eyetracking: Many of my experiments use the Visual World Paradigm. The basic idea is that if you show people a picture and start talking about it, they will look at the parts you are talking about as you are talking about them. If there is a picture of a cat, a dog and a horse, and I say "dog," participants will look at the part of the picture with a dog. We can then use their eye movements to see how quickly people understood the word. So we're looking for the first point in time at which more people are looking at "dog" than you'd expect by chance. At any given time point, either there is an effect or there isn't -- and there had better be a point at which there isn't, such as before I said the word "dog"!

As far as effect size, though, it's not going to be the case that everyone is looking at the dog at any given time point (these effects are probabilistic). You'd expect somewhere between 50% and 80% of people to be looking at the dog. But as long as you have more than 33% looking at the dog (remember, there are 3 things to look at: the cat, the dog and the horse), that's an effect. As far as size...you can measure it, but it won't help you distinguish between existing theories, which is what a good experiment is supposed to do.
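To make the "more than chance" comparison concrete, here is a minimal sketch using an exact one-sided binomial test against the one-in-three chance level. The counts (20 of 40 participants looking at the dog at some time point) are invented purely for illustration, and real eyetracking analyses are usually more involved than this, so treat it as a toy version of the logic rather than the analysis actually used in these studies.

```python
from math import comb

def binomial_upper_tail(successes, n, chance):
    """P(X >= successes) when X ~ Binomial(n, chance)."""
    return sum(
        comb(n, k) * chance**k * (1 - chance) ** (n - k)
        for k in range(successes, n + 1)
    )

# Hypothetical time point: 20 of 40 participants are looking at the dog.
p_value = binomial_upper_tail(successes=20, n=40, chance=1 / 3)
print(f"P(20 or more of 40 looking at the dog by chance) = {p_value:.3f}")
```

With these made-up counts, 50% looking at the dog is already reliably above the 33% chance level, even though it is nowhere near "everyone".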

Etc.: It's easy to generate more examples. I'm pretty sure every study I've ever run has been of this sort, as are most of the studies I have read. Sometimes we're interested in knowing more than just whether an effect exists. Sometimes we also care about the direction. But existence in and of itself is a real question.

And for my next trick, I'll make this effect disappear!

Posted by GamesWithWords on Friday, December 10, 2010

In this week's New Yorker, Jonah Lehrer shows once again just how hard it is to do good science journalism if you are not yourself a scientist.

His target is the strange phenomenon that many high profile papers are failing to replicate. This has been very much a cause celebre lately, and Lehrer follows a series of scientific papers on the topic as well as an excellent Atlantic article by David Freedman. At this point, many of the basic facts are well-known: anecdotally, many scientists report repeated failures to replicate published findings. The higher-profile the paper, the less likely it is to replicate, with around 50% of the highest-impact papers in medicine failing to replicate. As Lehrer points out, this isn't just scientists failing to replicate each other's work, but scientists failing to replicate their own work: a thread running through the article is the story of Jonathan Schooler, a professor at UC-Santa Barbara who has been unable to replicate his own seminal graduate student work on memory.

Lehrer's focus in this article is shrinking effects.

No, not this one.

Some experimental effects seem to shrink steadily over time:

In 2001, Michael Jennions, a biologist at the Australian National University, set out to analyze "temporal trends" across a wide range of subjects in ecology and evolutionary biology. He looked at hundreds of papers and forty-four meta-analyses (that is, statistical syntheses of related studies), and discovered a consistent decline effect over time, as many of the theories seemed to fade into irrelevance.

As described, that's weird. But there is a good explanation for such effects, and Lehrer brings it up. Some results are spurious. It's just one of those things. Unfortunately, spurious results are also likely to be exciting. Let's say I run a study looking for a relationship between fruit-eating habits and IQ. I look at the effects of 20 different fruits. By chance, one of them will likely show a significant -- but spurious -- effect. So let's say I find that eating an apple every day leads to a 5-point increase in IQ. That's really exciting because it's surprising -- and the fact that it's not true is integral to what makes it surprising. So I get it published in a top journal (top journals prefer surprising results).
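How likely is "likely"? A quick sketch, assuming the 20 fruit tests are independent, that none of the fruits has any real effect, and that each test uses the usual .05 threshold (the function name is made up):

```python
def familywise_false_positive_rate(n_tests, alpha=0.05):
    """Chance that at least one of n independent null tests comes out significant."""
    return 1 - (1 - alpha) ** n_tests

rate = familywise_false_positive_rate(20)
print(f"chance of at least one spurious fruit effect: {rate:.0%}")
```

Under those assumptions, the chance of at least one spurious "hit" is about 64%.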

Now, other people try replicating my finding. Many, many people. Most will fail to replicate, but some -- again by chance -- will replicate. It is extremely difficult to get a failure to replicate published, so only the replications get published. Over time, the "genius apple hypothesis" becomes part of established dogma. Remember that anything that challenges established dogma is exciting and surprising and thus easier to publish. So now failures to replicate are surprising and exciting and get published. When you look at effect-sizes in published papers over time, you will see a gradual but steady decrease in the "effect" of apples -- from 5 points to 4 points down to 0.
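The story above can be caricatured in a few lines of simulation. This sketch makes several simplifying assumptions that are mine, not Lehrer's or Jennions's: the true effect of apples is exactly zero, every study compares two groups of 15 with a standard deviation of 15, the "early" literature contains only significant results in the exciting direction, and the "late" literature contains everything. Real publication dynamics are messier and more gradual.

```python
import math
import random

def run_study(rng, n=15, sd=15, true_diff=0.0):
    """Return (observed difference, t statistic) for one two-group study."""
    a = [rng.gauss(100, sd) for _ in range(n)]
    b = [rng.gauss(100 + true_diff, sd) for _ in range(n)]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    ss = sum((x - mean_a) ** 2 for x in a) + sum((x - mean_b) ** 2 for x in b)
    standard_error = math.sqrt(ss / (2 * n - 2) * (2 / n))
    diff = mean_b - mean_a
    return diff, diff / standard_error

rng = random.Random(0)
studies = [run_study(rng) for _ in range(4000)]

T_CRIT = 2.048  # two-tailed .05 critical value of t with 28 degrees of freedom
early = [diff for diff, t in studies if t > T_CRIT]  # only exciting results published
late = [diff for diff, _ in studies]                 # failures now publishable too

early_mean = sum(early) / len(early)
late_mean = sum(late) / len(late)
print(f"early published effect: {early_mean:.1f} IQ points ({len(early)} studies)")
print(f"later published effect: {late_mean:.1f} IQ points ({len(late)} studies)")
```

Even though nothing real is going on, the early literature shows a large average effect and the later literature shows it melting away toward zero.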

Where I get off the Bus

So far so good, except here's Lehrer again:

While the publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the initial prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts.

Huh? Lehrer seems to be suggesting that it is publication that makes a result spurious. But that can't be right. Rather, there are just lots of spurious results out there. It happens that journals preferentially publish spurious results, leading to biases in the published record, and eventually the decline effect.

Some years ago, I had a bad habit of getting excited about my labmate's results and trying to follow them up. Just like a journal, I was interested in the most exciting results. Not surprisingly, most of these failed to replicate. The result was that none of them got published. Again, this was just a factor of some results being spurious -- disproportionately, the best ones. (Surprisingly, this labmate is still a friend of mine; personally, I'd hate me.)

The Magic of Point O Five

Some readers at this point might be wondering: wait -- people do statistics on their data and only accept a result that is extremely unlikely to have happened by chance. The cut-off is usually 0.05 -- a 5% chance of having a false positive. And many studies that turn out later to have been wrong pass even stricter statistical tests. Notes Lehrer:

And yet Schooler has noticed that many of the data sets that end up declining seem statistically solid--that is, they contain enough data that any regression to the mean shouldn't be dramatic. "These are the results that pass all the tests," he says. "The odds of them being random are typically quite remote, like one in a million. This means that the decline effect should almost never happen. But it happens all the time!"

So there's got to be something making these results look more unlikely than they really are. Lehrer suspects unconscious bias:

Theodore Sterling, in 1959 ... noticed that ninety-seven percent of all published psychological studies with statistically significant data found the effect they were looking for ... Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments

and again:

The problem seems to be one of subtle omissions and unconscious misperceptions, as researchers struggle to make sense of their results.

I expect that unconscious bias is a serious problem (I illustrate some reasons below), but this is pretty unsatisfactory, as he doesn't explain how unconscious bias would affect results, and the Schooler effect is a complete red herring.

I wasn't around in 1959, so I can't speak to that time, but I suspect that the numbers are similar today ... but in fact Sterling was measuring the wrong thing. Nobody cares what our hypotheses were. They don't care what order the experiments were actually run in. They care about the truth, and they have very limited time to read papers (most papers are never read, only skimmed). Good scientific writing is clear and concise. The mantra is: Tell them what you're going to tell them. Tell them. And then tell them what you told them. No fishing excursions, no detours. When we write scientific papers, we're writing science, not history.

And this means we usually claim to have expected to find whatever it is that we found. It just makes for a more readable paper. So when a scientist reads the line, "We predicted X," we know that really means "We found X" -- what the author actually predicted is beside the point.

Messing with that Point O Five

So where do all the false positives come from, if they should be less than 5% of conducted studies? There seem to be a number of issues.

First, it should be pointed out that the purpose of statistical tests (and the magic .05 threshold for significance) is to make a prediction as to how likely it is that a particular result will replicate. A p-value of .05 means roughly that there is a 95% chance that the basic result will replicate (sort of; this is not technically true but is a good approximation for present purposes).

But statistics are estimates, not facts. They are based on a large number of idealizations. For instance, many require that measurement error is distributed normally, meaning that the bulk of measurements are very close to the true measurement and a measurement is as likely to be larger than the true number as it is likely to be smaller. In fact, most data is heavily skewed, with measurements more likely to be too large than too small (or vice versa).

[Figure: a normal distribution]

For instance, give someone an IQ test. IQ tests have some measurement error -- people will score higher or lower than their "true" score due to random factors such as guessing answers correctly (or incorrectly), being sleepy (or not), etc. But it's a lot harder to get an IQ score higher than your true score than lower, because getting a higher score requires a lot of good luck (unlikely) whereas there are all sorts of ways to get a low score (brain freeze, etc.).

Most statistical tests make a number of assumptions (like normally distributed error) that are not true of actual data. That leads to incorrect estimates of how likely a particular result is to replicate. The truth is most scientists -- at the very least, most psychologists -- aren't experts in statistics, and so statistical tests are misapplied all the time.

I don't actually think that issues like the ones I just discussed lead to most of the difficulties (though I admit I have no data one way or another). I bring these issues up mainly to point out that statistical tests are tools that are either used or misused according to the skill of the experimenter. And there are lots of nasty ways to misuse statistical tests. I discuss a few of them below:

Run enough experiments and...

Let's go back to my genius fruit experiment. I ask a group of people to eat an apple and then give them an IQ test. I compare their IQ scores with scores from a control group that didn't eat an apple. Now let's say in fact eating apples doesn't affect IQ scores. Assuming I do my statistics correctly and all the assumptions of the statistical tests are met, I should have only a 5% chance of finding a "significant" effect of apple-eating.

Now let's say I'm disappointed in my result. So I try the same experiment with kiwis. Again, I have only a 5% chance of getting a significant result for kiwis. So that's not very likely to happen either.

Next I try oranges....

Hopefully you see where this is going. If I try only one fruit, I have a 5% chance of getting a significant result. If I try 2 fruits, I have a 1 - .95*.95 = 9.8% chance of getting a significant result for at least one of the fruits. If I try 4 fruits, now I'm up to a 1 - .95*.95*.95*.95 = 18.5% chance that I'll "discover" that one of these fruits significantly affects IQ. By the time I've tried 14 fruits, I've got a better than 50% chance of an amazing discovery. But my p-value for that one experiment -- that is, my estimate that these results won't replicate -- is less than 5%, suggesting there is only a 5% chance the results were due to chance.
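The arithmetic above is easy to check for any number of fruits. Here is a minimal sketch, assuming every test is independent and every fruit truly has no effect; the function name is just for illustration.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n independent tests of a null effect
    comes out "significant" at the given alpha."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    # The fruit counts mentioned in the text.
    for n_fruits in (1, 2, 4, 14):
        print(f"{n_fruits:2d} fruits: {chance_of_false_positive(n_fruits):.3f}")
```

Fourteen is the smallest number of fruits that pushes the chance past one half; thirteen falls just short.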

While there are ways of statistically correcting for this increased likelihood of false positives, my experience suggests that it's relatively rare for anyone to do so. And it's not always possible. Consider the fact that there may be 14 different labs all testing the genius fruit hypothesis (it's suddenly very fashionable for some reason). There's a better than 50% chance that one of these labs will get a significant result, even though from the perspective of an individual lab, they only ran one experiment.

Data peeking

Many researchers peek at their data. There are good reasons for doing this. One is curiosity (we do experiments because we really want to know the outcome). Another is to make sure all your equipment is working (don't want to waste time collecting useless data). Another reason -- and this is the problematic one -- is to see if you can stop collecting data.

Time is finite. Nobody wants to spend longer on an experiment than necessary. Let's say you have a study where you expect to need -- based on intuition and past experience -- around 20 subjects. You might check your data after you've run 12, just in case that's enough. What usually happens is that if the results are significant, you stop running the study and move on. If they aren't, you run more subjects. Now maybe after you've got 20 subjects, you check your data. If it's significant, you stop the study; if not, you run some more. And you keep on doing this until either you get a significant result or you give up.

It's a little harder to do back-of-the-envelope calculations on the importance of this effect, but it should be clear that this habit has the unfortunate result of increasing the relative likelihood of a false positive, since false positives lead you to declare victory and end the experiment, whereas false negatives are likely to be corrected (since you keep on collecting more subjects until the false negative is overcome). I read a nice paper on this issue that actually crunched the numbers a while back (for some reason I can't find it at the moment), and I remember the result was a pretty significant increase in the expected number of false positives.
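Since the back-of-the-envelope version is awkward, a small simulation can stand in for it. This is a rough sketch rather than the analysis from the paper mentioned above: it assumes the true effect is zero, that scores are normally distributed with a known standard deviation of 1 (so a simple z-test replaces the usual t-test), and that the experimenter checks after 12, 20, 28 and 36 subjects. The last two look points are my own illustrative choice.

```python
import random
from statistics import NormalDist


def false_positive_rate(looks, n_sims=5000, alpha=0.05, seed=0):
    """Fraction of null experiments declared significant at any look.

    looks: increasing sample sizes at which the data are checked.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for look in looks:
            # Keep running subjects until the next look; true effect is zero.
            while n < look:
                total += rng.gauss(0.0, 1.0)
                n += 1
            z = total / n ** 0.5  # z-test with known standard deviation 1
            if abs(z) > z_crit:
                hits += 1  # "significant": stop and declare victory
                break
    return hits / n_sims


fixed_rate = false_positive_rate([20])               # one look, as planned
peeking_rate = false_positive_rate([12, 20, 28, 36])  # stop at first success

if __name__ == "__main__":
    print(f"single look at 20 subjects: {fixed_rate:.3f}")
    print(f"peeking at 12, 20, 28, 36:  {peeking_rate:.3f}")
```

Under these assumptions the single-look rate stays near the nominal 5%, while the peeking rate comes out noticeably higher, because every extra look is another chance for noise to cross the threshold.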

Data massaging

The issues I've discussed so far are real problems, but they are pretty common and not generally regarded as ethical violations. Data massaging is at the borderline.

Any dataset can be analyzed in a number of ways. Once again, if people get the result they were expecting with the first analysis they run, they're generally going to declare victory and start writing the paper. If you don't get the results you expect, you try different analysis methods. There are different statistical tests that can be used. There are different covariates that could be factored out. You can throw out "bad" subjects or items. This is going to significantly increase the rate of false positives.

It should be pointed out that interrogating your statistical model is a good thing. Ideally, researchers should check to see if there are bad subjects or items, check whether there are covariates to be controlled for, check whether different analysis techniques give different results. But doing this affects the interpretation of your p-value (the estimate of how likely it is that your results will replicate), and most people don't know how to appropriately control for that. And some are frankly more concerned with getting the results they want than doing the statistics properly (this is where the "borderline" comes in).

Better estimates

The problem, at least from where I stand, is one of statistics. We want our statistical tests to tell us how likely it is that our results will replicate. We have statistical tests which, if used properly, will give us just such an estimate. However, there are lots and lots of ways to use them incorrectly.

So what should we do? One possibility is to train people to use statistics better. And there are occasional revisions in standard practice that do result in better use of statistics.

Another possibility is to lower the p-value that is considered significant. The choice of p=0.05 as a cutoff was, as Lehrer notes, arbitrary. Picking a smaller number would decrease the number of false positives. Unfortunately, it also decreases the number of real positives by a lot. People who don't math can skip this next section.

Let's assume we're running studies with a single dependent variable and one manipulation, and that we're going to test for significance with a t-test. Let's say the manipulation really should work -- that is, it really does have an effect on our dependent measure. Let's say that the effect size is large-ish (Cohen's d of .8, which is large by psychology standards) and that we run 50 subjects. The chance of actually finding a significant effect at the p=.05 level is 79%. For people who haven't done power analyses before, this might seem low, but actually an 80% chance of finding an effect is pretty good. Dropping our significance threshold to p=.01 drops the chance of finding the effect to 56%. To put this in perspective, if we ran 20 such studies, we'd find 16 significant effects at the p=.05 level but only 11 at the p=.01 level. (If you want to play around with these numbers yourself, try this free statistical power calculator.)

Now consider what happens if we're running studies where the manipulation shouldn't have an effect. If we run 20 such studies, 1 of them will nonetheless give us a false positive at the p=.05 level, whereas we probably won't get any at the p=.01 level. So we've eliminated one false positive, but at the cost of nearly 1/3 of our true positives.
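For readers who would rather script it than use a calculator, the trade-off above can be approximated in a few lines. This is a sketch under stated assumptions: the 50 subjects are split into two groups of 25 (the text doesn't say how they are divided), and the t-test is replaced by a normal approximation. That approximation ignores the uncertainty in estimating the variance, so it runs somewhat higher than the 79% and 56% quoted above, especially at the stricter threshold.

```python
from statistics import NormalDist


def approx_power(d: float, n_per_group: int, alpha: float) -> float:
    """Normal approximation to the power of a two-sided, two-sample test.

    d: true effect size (Cohen's d); n_per_group: subjects in each group.
    The negligible chance of rejecting in the wrong direction is ignored.
    """
    norm = NormalDist()
    z_crit = norm.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the effect is real.
    expected_z = d * (n_per_group / 2) ** 0.5
    return norm.cdf(expected_z - z_crit)


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(f"alpha={alpha}: power is roughly {approx_power(0.8, 25, alpha):.2f}")
```

The qualitative point survives the approximation: tightening the threshold from .05 to .01 costs a large share of the true positives.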

No better prediction of replication than replication

Perhaps the easiest method is to just replicate studies before publishing them. The chances of getting the same spurious result twice in a row are vanishingly small. Many of the issues I outlined above -- other than data massaging -- won't increase your replication rate. Test 14 different fruits to see if any of them increase IQ scores, and you have over a 50% chance that one of them will spuriously do so. Test that same fruit again, and you've only got a 5% chance of repeating the effect. So replication decreases your false positive rate 20-fold. Similarly, data massaging may get you that coveted p<.05, but the chances of the same massages producing the same result again are very, very low.

True positives aren't nearly so affected. Again, a typical power level is 0.80 (that is, 1 - β = 0.80) -- 80% of the time that an effect is really there, you'll be able to find it. So when you try to replicate a true positive, you'll succeed 80% of the time. So replication decreases your true positives by only 20%.

So let's say the literature has a 30% false positive rate (which, based on current estimates, seems quite reasonable). Attempting to replicate every positive result prior to publication -- and note that it's extremely rare to publish a null result (no effect), so almost all published results are positive results -- should decrease the false positives 20-fold and the true positives by 20%, leaving us with a 2.6% false positive rate. That's a huge improvement.

So why not replicate more?

So why don't people replicate before publishing? If 30% of your own publishable results are false positives, and you eliminate them, you've just lost 30% of your potential publications. You've also lost 20% of your true positives as well, btw, which means overall you've decreased your productivity by 43%. And that's without counting the time it takes to run the replication. Yes, it's nice that you've eliminated your false positives, but you also may have eliminated your own career!
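The figures in this section and the previous one come from the same short calculation. Here it is as a sketch, using the numbers already given in the text: a 30% false positive share, a 5% chance that a false positive replicates, and an 80% chance that a true positive does. The function name is mine.

```python
def after_replication(false_share=0.30, alpha=0.05, power=0.80):
    """Return (false positive share, fraction of results kept) after
    requiring one successful replication of every positive result."""
    false_kept = false_share * alpha        # false positives that replicate by luck
    true_kept = (1 - false_share) * power   # true positives that replicate
    kept = false_kept + true_kept
    return false_kept / kept, kept


if __name__ == "__main__":
    share, kept = after_replication()
    print(f"false positive share after replication: {share:.1%}")
    print(f"publishable results lost: {1 - kept:.1%}")
```

With the defaults, about 2.6% of the surviving results are false positives, and 57.5% of the original results survive, which is where the roughly 43% drop in productivity comes from.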

When scientists are ranked, they're largely ranked on (a) number of publications, (b) number of times a publication is cited, and (c) quality of journal that the publications are in. Notice that you can improve your score on all of these metrics by publishing more false positives. Taking the time to replicate decreases your number of publications and eliminates many of the most exciting and surprising results (decreasing both citations and quality of journal). Perversely, even if someone publishes a failure to replicate your false positive, that's a citation and another feather in your cap.

I'm not saying that people are cynically increasing their numbers of bogus results. Most of us got into science because we actually want to know the answers to stuff. We care about science. But there is limited time in the day, and all the methods of eliminating false positives take time. And we're always under incredible pressure to pick up the pace of research, not slow it down.

I'm not sure how to solve this problem, but any solution I can think of involves some way of tracking not just how often a researcher publishes or how many citations those publications get, but how often those publications are replicated. Without having a way of tracking which publications replicate and which don't, there is no way to reward meticulous researchers or hold sloppy researchers to account.

Also, I think a lot of people just don't believe that false positives are that big a problem. If you think that only 2-3% of published papers contain bogus results, there's not a lot of incentive to put in a lot of hard work learning better statistical techniques, replicating everything, etc. If you think the rate is closer to 100%, you'd question the meaning of your own existence. As long as we aren't keeping track of replication rates, nobody really knows for sure where we are on this continuum.

That's my conclusion. Here's Lehrer's:

The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that's often not the case. Just because an idea is true doesn't mean it can be proved. And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe.

I say it again: huh?

Did your genes make you liberal?

Posted by GamesWithWords on Friday, October 29, 2010

"The new issue of the Journal of Politics, published by Cambridge University, carries the study that says political ideology may be caused by genetic predisposition."
--- RightPundits.com

"Scientists find 'liberal gene.'"
--- NBC San Diego

"Liberals may owe their political outlook partly to their genetic make-up, according to new research from the University of California, San Diego, and Harvard University. Ideology is affected not just by social factors, but also by a dopamine receptor gene called DRD4."
-- University press release

As in the case yesterday of the study about sisters making you happy, these statements are all technically true (ish -- read below) but deeply misleading. The study in question looks at the effects of number of friends and the DRD4 gene on political ideology. Specifically, they asked people to self-rate on a 5-point scale from very conservative to very liberal. They tested for the DRD4 gene. They also asked people to list up to 5 close friends.

The number of friends one listed did not significantly predict political ideology, nor did the presence or absence of the DRD4 gene. However, there was a significant (p=.02) interaction ... significant, but apparently tiny. The authors do not discuss effect size, but we can try to piece together the information by looking at the regression coefficients.

An estimated coefficient means that if you increase the value of the predictor by 1, the outcome variable increases by the size of the coefficient. So imagine the coefficient between the presence of the gene and political orientation was 2. That would mean that, on average, people with the gene score 2 points higher (more liberal) on the 5-point political orientation scale.

The authors seem to be reporting standardized coefficients, which means that we're looking at increasing values by one standard deviation rather than by one point. The coefficient of the significant interaction is 0.04. This means that roughly as the number of friends and presence of the gene increase by one standard deviation, political orientation scores increase by 0.04 standard deviations. The information we'd need to correctly interpret that isn't given in the paper, but a reasonable estimate is that this means that someone with one extra friend and the gene would score anywhere from .01 to .2 points higher on the score (remember, 1=very conservative, 2=conservative, 3=moderate, 4=liberal, 5=very liberal).

The authors give a little more information:

For people who have two copies of the [gene], an increase in number of friendships from 0 to 10 friends is associated with increasing ideology in the liberal direction by about 40% of a category on our five-category scale.

People with no copies of the gene were unaffected by the number of friends they had.

None of what I wrote above detracts from the theoretical importance of the paper. Identifying genes that influence behavior, even just a tiny bit, is important as it opens windows into the underlying mechanisms. And to their credit, the authors are very guarded and cautious in their discussion of the results. The media reports -- fed, no doubt, by the university press release -- have focused on the role of the gene in predicting behavior. It should be clear that the gene is next to useless in predicting, for instance, who somebody is going to vote for. Does that make it a gene for liberalism? Maybe.

I would point out one other worry about the study, which even the authors point out. They tested a number of different possible predictors. The chance of getting a false positive increases with every statistical test you run, and they do not appear to have corrected for multiple comparisons. Even with 2,000 participants (which is a large sample), the p-value for the significant interaction was only p=.02, which is significant but not very strong, so the risk that this will not replicate is real. As the authors say, "the way forward is to seek replication in different populations and age groups."
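To see how fragile a p-value of .02 is under a correction for multiple comparisons, here is a sketch using the Bonferroni correction, one common (and conservative) method. The number of tests below is hypothetical; I don't have the actual count from the paper, and this is not a reanalysis of it.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def survives_correction(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True if p_value is still significant after correcting for n_tests."""
    return p_value <= bonferroni_threshold(alpha, n_tests)


if __name__ == "__main__":
    # Hypothetical counts of tests, for illustration only.
    for n_tests in (1, 2, 3, 5):
        verdict = "survives" if survives_correction(0.02, n_tests) else "does not survive"
        print(f"{n_tests} test(s): p=.02 {verdict}")
```

Under this correction, p=.02 holds up only if no more than two tests were run; with three or more, it no longer clears the bar.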

You're Wrong

Posted by GamesWithWords on Saturday, October 23, 2010

John Ioannidis has been getting a lot of press lately. He reached the cover of the last issue of The Atlantic Monthly. David Dobbs wrote about him here (and a few years ago, here). This is the doctor known for his claim that around half of medical studies are false -- that is about 80% of non-randomized trials and even 25% of randomized trials. These are not just dinky findings published in dinky journals; of 49 of the most highly-regarded papers published over a 13-year period, 34 of the 45 that claimed to have found effective treatments had been retested, and 41% of those tests failed to replicate the original result.

Surprised?

Quoting the Atlantic Monthly:

Ioannidis initially thought the community might come out fighting. Instead, it seemed relieved, as if it had been guiltily waiting for someone to blow the whistle...

Well, it's not surprising. The appropriate analog in psychology is the randomized trial, of which in medicine 25% turn out to be false according to this research (which hopefully isn't itself false). As Ioannidis has detailed, the system is set up to reward false positives. Journals -- particularly glamour mags like Science -- preferentially accept surprising results, and the best way to have a surprising result is to have one that is wrong. Incorrect results happen: "statistically significant" means "has only a 5% probability of happening by random chance." This means (in theory) that 5% of all experiments published in journals should reach the wrong conclusions. If journals are biased in favor of accepting exactly those 5%, then the proportion should be higher.

There are other factors at work. Some scientists are sloppier than others, and many of the ways in which one can be sloppy lead to significant and/or surprising results. For instance, 5% of experiments have false positives. There are labs that will run the same experiment 6 times with minor tweaks. There is a (1-.95^6) * 100 = 26.5% chance that one of those will have a significant result. The lab may then publish only that final experiment and not report the others. If sloppy results lead to high-impact publications, survival of the fittest dictates that sloppy labs will reap the accolades, get the grant money, tenure, etc.

Keep in mind that often many different labs are trying to do the same thing. For instance, in developmental psychology, one of the deep questions is what is innate? So many labs are testing younger and younger infants, trying to find evidence that these younger infants can do X, Y or Z. If 10 labs all run the same experiment, there's a (1-.95^10) * 100 = 40.1% chance of one of the labs finding a significant result.

Countervailing Forces

Thus, there are many incentives to publish incorrect data. Meanwhile, there are very few disincentives to doing so. If you publish something that turns out not to replicate, it is very unlikely that anyone will publish a failure to replicate -- simply because it is very difficult to publish a failure to replicate. If someone does manage to publish such a paper, it will certainly be in a lower-profile journal (which is, incidentally, a disincentive to publishing such work to begin with).

Similarly, consider what happens when you run a study and get a surprising result. You could replicate it yourself to make sure you trust the result. That takes time, and there's a decent chance it won't replicate. If you do replicate it, you can't publish the replication (I tried to in a recent paper submission, and a reviewer insisted that I remove reference to the replication on account of it being "unnecessary"). If the replication works, you'll gain nothing. If it fails, you won't get to publish the paper. Either way, you'll have spent valuable time you could have spent working on a different study leading to a different paper.

In short, there are good reasons to expect that 25% of studies -- particularly in the high-profile journals -- are un-replicable.

What to do?

Typically, solutions proposed involve changing attitudes. The Atlantic Monthly suggests:

We could solve much of the wrongness problem, Ioannidis says, if the world simply stopped expecting scientists to be right. That's because being wrong in science is fine, and even necessary ... But as long as careers remain contingent on producing a stream of research that's dressed up to seem more right than it is, scientists will keep delivering exactly that.

I've heard this idea expressed elsewhere. In the aftermath of Hausergate, a number of people suggested that a factor was the pressure-cooker that is the Harvard tenure process, and that Harvard needs to stop putting so much pressure on people to publish exciting results.

So the idea is that we should stop rewarding scientists for having interesting results, and instead reward the ones who have uninteresting results? Journals should publish only the most staid research, and universities should reward tenure not based on the number of highly-cited papers you have written, but based on how many papers you've written which have never been cited? I like that idea. I can run a boring study in a few hours and write it up in the afternoon: "Language Abilities in Cambridge Toddlers are Unaffected by Presence or Absence of Snow in Patagonia." That's boring and almost certainly true. And no one will ever cite it.

Seriously, though, public awareness campaigns telling people to be more responsible are great, and sometimes they even help, but I don't know how much can be done without changing the incentive structure itself.

Reputation

I don't have a solution, but I think Ioannidis again points us towards one. He found that papers continue to be cited long after they have been convincingly and publicly refuted. I was discussing this issue with a colleague some time back and mentioned a well-known memory paper that nobody can replicate. Multiple failures-to-replicate have been published. Yet I still see it cited all the time. The colleague said, "Wow! I wish you had told me earlier. We just had a student spend two years trying to follow up that paper, and the student just couldn't get the method to work."

Never mind that researchers rarely bother to replicate published work -- even if they did, we have no mechanism for tracking which papers have been successfully replicated and which papers can't be replicated.

Tenure is awarded partly on how often your work has been cited, and we have many nice, accessible databases that will tell you how often a paper has been cited. Journals are ranked by how often their papers are cited. What if we rewarded researchers and journals based on how well their papers hold up to replication? Maybe it would help, maybe it wouldn't, but without a mechanism for tracking this information, this is at best an intellectual enterprise.

Even if such a database wasn't ultimately useful in decreasing the number of wrong papers, at least we'd know which papers were wrong.

Lab Notebook: Verb Resources

Posted by GamesWithWords on Wednesday, October 13, 2010

It's good to be studying language now, and not a few decades ago. There are a number of invaluable resources freely available on the Web.

The resource I use the most -- and without which much of my research would have been impossible -- is Martha Palmer & co.'s VerbNet, which is a meticulous semantic analysis of several thousand English verbs. This is invaluable when choosing verbs for stimuli, as you can choose verbs that are similar to or differ from one another along particular dimensions. It's also useful for finding polysemous and nonpolysemous verbs where polysemy is defined in a very rigorous way.

Meichun Liu and her students at NCTU in Taiwan have been working on a similar project in Mandarin, Mandarin VerbNet. This resource has proved extremely valuable as I've been writing up some work I've been doing in Mandarin, and I only wish I had known about it when I constructed my stimuli.

I bring this up in case these resources are of use to anyone else. Mandarin VerbNet is particularly hard to find. I personally spent several months looking for it.

Overheard: Scientific Prejudice

Posted by GamesWithWords on Thursday, September 30, 2010

A senior colleague recently attended an Autism conference. Language is frequently impaired in Autism, and so many of the neuroscientists there were trying to look at the effects of their animal models of Autism on language.

Yes, you read that correctly: animal models of language. In many cases, rats.

This colleague and I both believe in some amount of phylogenetic continuity: some aspects of language are no doubt built on mechanisms that existed in our distant ancestors (and therefore may exist in other modern-day animals). But given that we have, at best, a rudimentary understanding of the mechanisms underlying language in humans -- and certainly little or no agreement in the field at present -- arguing that a particular behavior in a rat is related to some aspect of language is at best wild-eyed conjecture (and I say this with a great deal of respect for the people who have been taking this problem seriously).

Unfortunately, this colleague didn't get very far in discussing these issues with these researchers. One, for instance, said, "I know rat squeaks are related to language because they're auditory!"

Sure, so's sneezing:

The problem with such conversations, as this colleague pointed out, is that neuroscientists often don't take us cognitive types seriously. After all, they work on a "harder" science. (For those who haven't seen it yet, read this by DrugMonkey -- tangential but fun.) A friend of mine, who is a physicist, once told me that he wasn't sure why psychology was even called a "science" since psychologists don't do experiments -- never mind that I was IMing him from my lab at the time (which he knew).

When I applied to graduate school, I applied to one interdisciplinary program that included cognitive people and also physiology folk. During my interview with one professor who worked on monkey physiology, he interrupted me as I was describing the work I had done as an undergraduate. "Nothing of value about language," he told me, "can be learned by studying humans." When I suggested that perhaps there weren't any good animal models of language to work with, he said, "No, that's just a prejudice on the part of you cognitive people."

Keep in mind that there were several faculty in his department who studied language in humans. This is why such mixed departments aren't always particularly collegial places to work.

I bring this up not to rag on neuroscientists or physicists, but to remind the psychologists in the audience that we have this exact same problem. I don't know how many people have told me that linguistics is mostly bullshit (when I was an undergraduate, one professor of psychology told me: "Don't study linguistics, Josh. It will ruin you as a scientist.") and that philosophy has nothing to offer. When you talk to them in detail, though, you quickly realize that most of them have no idea what linguists or philosophers do, what their methods are, or why those fields have settled on those methods. And that's the height of arrogance: linguists and philosophers include, in their numbers, some of the smartest people on the planet. It only stands to reason that they know something of value.

I'm not defending all the methods used by linguists. They could be improved. (So could methods used by physicists, too.) But linguists do do experiments, and they do work with empirical data. And they're damn smart.

Just sayin'.

Photos: mcfarlando, Jessica Florence.

Sorry, Sharing My Data is Illegal

Posted by GamesWithWords on Tuesday, September 14, 2010

I recently got back from collecting data in Russia. This particular study brought into focus for me the issues involved in making experimental data public. In this study, I videotape people as they listen to stories, look at pictures, and answer questions about the stories. The videotape is key, since what I'm actually analyzing is the participants' eye-gaze direction during different parts of the stories (this can be used to partially determine what the participants were thinking at different points in time).

Sharing raw data would mean sharing the videos...which I can't do. These videos are confidential, and there's no easy way of making them anonymous, since they are close-up videos of people's faces. I could ask participants to sign a waiver allowing me to put up their videos on the Internet, but I suspect most of my participants would just refuse to participate. Many were concerned enough about the video as it was.

Now, I could share the semi-processed data -- that is, not the videos themselves but the information gleaned from them. I already discussed some of the problems with that, namely that getting the data into a format that's easy for someone else to analyze is extremely time-consuming.

This isn't an issue with just one study -- more than half the studies I run are eye-tracking studies. Many of the rest are EEG studies, which can generate several gigabytes of data each, so sharing the data is simply impractical (plus, when dealing with brain data, anonymity is even more of a concern). I do some kid studies where I simply write down participants' responses, but if your goal was to check that I'm recording my data correctly, that wouldn't help -- what you'd want are tapes of the experiments, but good luck convincing the ethics board to allow me to post videos of young children participating in experiments on the Internet.

[Those are my laboratory studies. Data from my Web-based studies is actually relatively easy to share -- though you'd have to be proficient in ActionScript to understand it.]

Certainly, there are many behavioral researchers that wouldn't have this problem. But there are many who would. Mandating that everyone make their data publicly available would mean that many kinds of experiments simply couldn't be done anymore.

Is psychology a science, redux

Posted by GamesWithWords on Tuesday, August 31, 2010

Is psychology a science? I see this question asked a lot on message boards, and it's time to discuss it again here. I think the typical response by a researcher like myself is an annoyed "of course, you ignoramus." But a more subtle response is deserved, as the answer depends entirely on what you mean by "psychology" and what you mean by "science."

Two Psychologies

First, if by "psychology" you mean seeing clients (like in Good Will Hunting or Silence of the Lambs), then, no, it's probably not a science. But that's a bit like asking whether engineers or doctors are scientists. Scientists create knowledge. Client-visiting psychologists, doctors and engineers use knowledge. Of course, you could legitimately ask whether client-visiting psychologists base their interventions on good science. Many don't, but that's also true about some doctors and, I'd be willing to bet, engineers.

Helpfully, "engineering" and "physics" are given different names, while the research and application ends of psychology confusingly share the same name. (Yes, I'm aware that engineering is not the application of physics writ broadly -- what's the application of string theory? -- and one can be a chemical engineer, etc. I actually think that makes the analogy to the two psychologies even more apt). It doesn't help that the only psychologists who show up in movies are the Good Will Hunting kind (though if paleoglaciologists get to save the world, I don't see why experimental psychologists don't!), but the research kind does exist.

A friend of mine (a physicist) once claimed psychologists don't do experiments (he said this un-ironically over IM while I was killing time in a psychology research lab). My response now would be to invite him to participate in one of these experiments. Based on this Facebook group, I know I'm not the only one who has heard this.

Methods

There are also those, however, who are aware that psychologists do experiments, but deny that it's a true science. Some of this has to do with the belief that psychologists still use introspection (there are probably some somewhere, but I suspect there are also physicists who use voodoo dolls somewhere as well, along with mathematicians who play the lottery).

The more serious objection has to do with the statistics used in psychology. In the physical sciences, typically a reaction takes place or does not, or a neutrino is detected or is not. There is some uncertainty given the precision of the tools being used, but on the whole the results are fairly straightforward and the precision is pretty good (unless you study turbulence or something similar).

In psychology, however, the phenomena we study are noisy and the tools lack much precision. When studying a neutrino, you don't have to worry about whether it's hungry or sleepy or distracted. You don't have to worry about whether the neutrino you are studying is smarter than average, or maybe too tall for your testing booth, or maybe it's only participating in your experiment to get extra credit in class and isn't the least bit motivated. It does what it does according to fairly simple rules. Humans, on the other hand, are terrible test subjects. Psychology experiments require averaging over many, many observations in order to detect patterns within all that noise.
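To make the averaging point concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration (the effect size of 0.2, the noise level, the function name `spread_of_means`); it is not drawn from any real experiment. It shows how much the average of a batch of noisy observations bounces around from one batch to the next, and how that bouncing shrinks as the batch gets bigger.

```python
import random
import statistics


def spread_of_means(n_obs, effect=0.2, noise=1.0, runs=1000, seed=0):
    """Standard deviation of the sample mean across many simulated studies.

    Each simulated study averages n_obs observations drawn from a normal
    distribution centered on a small, hypothetical effect. All parameter
    values are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(runs):
        observations = [rng.gauss(effect, noise) for _ in range(n_obs)]
        means.append(statistics.mean(observations))
    return statistics.stdev(means)


if __name__ == "__main__":
    for n in (4, 40, 400):
        print(n, round(spread_of_means(n), 3))
```

With only a handful of observations, the run-to-run spread is larger than the assumed effect itself; with a few hundred, it is much smaller, which is the sense in which a pattern becomes detectable within the noise.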

Science is about predictions. In theory, we'd like to predict what an individual person will do in a particular instance. In practice, we're largely in the business of predicting what the average person will do in an average instance. Obviously we'd like to make more specific predictions (and there are those who can and do), but they're still testable (and tested) predictions. The alternative is to declare much of human and animal behavior outside the realm of science.

Significant differences

There are some who are on board so far but get off the bus when it comes to how statistics are done in psychology. Usually an experiment consists of determining statistically whether a particular result was likely to have occurred by chance alone. Richard Feynman famously thought this was nuts (the thought experiment is that it's unlikely to see a license plate printed CPT 349, but you wouldn't want to conclude much from it).

That's missing the point. The notion of significant difference is really a measure of replicability. We're usually comparing a measurement across two populations. We may find population A is better than population B on some test. That could be because population A is underlyingly better at such tests. Alternatively, population A was lucky that day. A significant difference is essentially a prediction that if we test population A and population B again, we'll get the same results (better performance for population A). Ultimately, though, the statistical test is just a prediction (one that typically works pretty well) that the results will replicate. Ideally, all experiments would be replicated multiple times, but that's expensive and time-consuming, and -- to the extent that the statistical analysis was done correctly (a big if) -- largely unnecessary.
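One way to see the "significance as a prediction about replication" idea is to simulate it. The sketch below is a toy under stated assumptions, not an analysis of real data: two hypothetical populations with normally distributed scores, 30 participants per group, and a cutoff of |t| >= 2.0 standing in for "significant" (roughly p < .05, two-sided, at these sample sizes). The names `welch_t` and `replication_rate` are mine. For every simulated study that comes out significant, it runs the study again and asks whether the same population comes out ahead.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for the difference between the means of a and b."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def sample(rng, mean, n):
    """Draw n scores from a normal distribution with the given mean and SD 1."""
    return [rng.gauss(mean, 1.0) for _ in range(n)]


def replication_rate(true_diff, n=30, runs=2000, threshold=2.0, seed=0):
    """Among significant simulated studies, the fraction whose rerun
    shows a difference in the same direction.

    true_diff is the assumed real advantage of population A over B, in SD
    units. Returns None if no simulated study reached the threshold.
    """
    rng = random.Random(seed)
    significant = 0
    replicated = 0
    for _ in range(runs):
        t_first = welch_t(sample(rng, true_diff, n), sample(rng, 0.0, n))
        if abs(t_first) < threshold:
            continue  # not significant, so no prediction was made
        significant += 1
        t_second = welch_t(sample(rng, true_diff, n), sample(rng, 0.0, n))
        if (t_second > 0) == (t_first > 0):
            replicated += 1
    return replicated / significant if significant else None


if __name__ == "__main__":
    print("real difference (0.5 SD):", replication_rate(0.5))
    print("no real difference:", replication_rate(0.0))
```

When population A really is better, the significant results almost always point the same way on the rerun. When there is no real difference, the occasional significant result is about as likely to flip as to hold, which is why the "big if" about doing the analysis correctly matters.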

So what do you think? Are the social sciences sciences? Comments are welcome.

Making data public

Posted by GamesWithWords on Monday, August 16, 2010

Lately, there have been a lot of voices (e.g., this one) calling for scientists to make raw data immediately available to the general public. In the interest of answering that call, here's some of my raw data:

female	no	English	English	no	no	yes	United States	1148478773	312	0	helped	1	daxed	59	0	1	1
female	no	English	English	no	no	yes	United States	1148478773	312	1	heard	2	blied	33	0	0	2
female	no	English	English	no	no	yes	United States	1148478773	312	2	decelerated	2	lenked	45.4	1	0	2
female	no	English	English	no	no	yes	United States	1148478773	312	3	startled	1	gamped	31.1	1	0	3
female	no	English	English	no	no	yes	United States	1148478773	312	4	prompted	2	henterred	59	0	1	4
female	no	English	English	no	no	yes	United States	1148478773	312	5	engrossed	2	nazored	31.1	0	1	5
female	no	English	English	no	no	yes	United States	1148478773	312	6	obliged	1	ablined	59	1	0	6
female	no	English	English	no	no	yes	United States	1148478773	312	7	tantalized	2	bosined	31.1	1	1	7
female	no	English	English	no	no	yes	United States	1148478773	312	8	bled for	1	breened	31.3	1	1	8
female	no	English	English	no	no	yes	United States	1148478773	312	9	loathed	2	gaubled	31.2	0	0	9
female	no	English	English	no	no	yes	United States	1148478773	312	10	mourned for	1	ginked	31.3	1	1	10
female	no	English	English	no	no	yes	United States	1148478773	312	11	wounded	2	jarined	31.1	0	0	10

Do you feel enlightened? Probably not. Raw data isn't all that useful if you don't know how it was collected, what the different numbers refer to, etc. Even if I told you this is data from this experiment, that probably wouldn't help much. Even showing you the header rows for these data will help only so much:

sex	subject_already	nat_language	prime_language	autism	dyslexia	psychiatric	country	randomID	startTime	trial	word	choice	conclusion	wordClass	whichLocation	because	totalCorrect

Some things are straightforward. Some are not. It's important to know that I record data with a separate row for every trial, so each participant has multiple trials. Also, I record all data, even data from participants who did not complete the experiment. If you're unaware of that, your data analyses would come out very wrong. Also I have some codes I use to mark that the participant is an experimenter checking to make sure everything is running correctly. You'd need to know those. It's key to know how responses are coded (it's not simply "right" or "wrong" -- and in fact the column called totalCorrect does not record whether the participant got anything correct).
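To give a flavor of the bookkeeping involved, here is a minimal sketch of just the first step: figuring out which participants actually finished. It assumes several things the sample above doesn't establish: that the file is tab-separated with no header line, that `randomID` identifies a participant, that a complete session has 12 trials, and that experimenter test runs can be recognized by ID. The value in `EXPERIMENTER_IDS` is a made-up placeholder, not my real code, and `complete_participants` is a hypothetical helper name.

```python
import csv
import io
from collections import defaultdict

# Column names from the header row above, in file order.
COLUMNS = [
    "sex", "subject_already", "nat_language", "prime_language", "autism",
    "dyslexia", "psychiatric", "country", "randomID", "startTime", "trial",
    "word", "choice", "conclusion", "wordClass", "whichLocation", "because",
    "totalCorrect",
]

EXPECTED_TRIALS = 12                # assumption: trials in a complete session
EXPERIMENTER_IDS = {"0000000000"}   # hypothetical marker for test runs


def complete_participants(tsv_text):
    """Return sorted randomIDs of participants who finished every trial.

    There is one row per trial, so a participant who quit early simply
    has fewer rows. Experimenter test runs are dropped as well.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=COLUMNS, delimiter="\t"
    )
    trials_seen = defaultdict(set)
    for row in reader:
        trials_seen[row["randomID"]].add(int(row["trial"]))
    return sorted(
        pid
        for pid, trials in trials_seen.items()
        if pid not in EXPERIMENTER_IDS and len(trials) == EXPECTED_TRIALS
    )
```

Note that this only gets you to the point of knowing whose data to look at. It deliberately does nothing with `totalCorrect` or the response columns, since interpreting those requires knowing the coding scheme.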

The truth is, even though I designed this study myself and wrote the program that outputs the data, every time I go back to data from a study I haven't worked with in a while, it takes me a few hours to orient myself -- and I'm actually relatively good about documenting my data.

So if a law were passed -- as some have advocated for -- requiring that data be made public, one of two things would happen: either people would post uninterpretable data like my mini-spreadsheet above, or they'd spend huge amounts of time preparing their data for others' consumption. The former would help no one. And the latter is expensive, and someone has to pay for that. And this all has to be balanced against the fact that there are very few data sets anyone would want to reanalyze.

There are important datasets that should be made available. And in fact there are already mechanisms for doing this (in my field, CHILDES is a good example). This kind of sharing should be encouraged, but mandated sharing is likely to cause more problems than it solves.
