Field of Science

Showing posts with label statistics. Show all posts

Unrealistic Scientific Optimism

Not enough published studies replicate. There are a number of reasons for this, but one of the most difficult to address is statistical power.

Why is low power a problem?

Suppose you want to test the effect of some training on IQ scores. You test 50 subjects: 25 in your experimental condition and 25 in the control condition. That's a fairly typical size for a psychology study. And you get a significant result. You might be tempted to conclude that your manipulation worked, but it might actually be more likely that your results are due to chance or experimenter error.

It depends on how large an effect your manipulation ought to have. If the training typically raises people's IQs by 7.5 points, your study would only have had a 41% chance of detecting it (given normal assumptions about normal distributions). A more modest 5 point effect could be detected 20% of the time. You'd need a 14 point effect to have a 90% chance of detecting it.
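These power figures can be checked with base R's power.t.test. This is a minimal sketch, assuming a two-sided, two-sample t-test at alpha = .05 and an IQ standard deviation of 15; the variable names are only for illustration.

```r
# Power of a two-sample t-test with 25 subjects per condition,
# assuming IQ scores have a standard deviation of 15 and alpha = .05.
effects <- c(5, 7.5, 14)  # hypothesized training effects, in IQ points
power <- sapply(effects, function(delta) {
  power.t.test(n = 25, delta = delta, sd = 15, sig.level = 0.05)$power
})
print(round(data.frame(effect = effects, power = power), 2))
```

Changing n shows how quickly (or slowly) power grows with sample size for a given effect.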

For reference, a 14 point effect is enough to move someone of average intelligence into the top fifth of humanity. We are in miracle drug territory. More realistically, your manipulation is unlikely to have nudged IQ by more than a point or two, in which case there's essentially no chance your study could have detected it. So if you did detect something, it's probably an error. 

Well, how much power do studies have?

Concerns about statistical power (among other things) have led some researchers to declare that more than half of all published findings are false. Other researchers are more sanguine. In short, if you think that the effects we are studying tend to be pretty large (a standard deviation or more), there is little to worry about. If you think they tend to be small, the situation is dire.

Unfortunately, the only way to accurately determine how large an effect is is to measure it multiple times. Obviously, you can't have done that in advance when running a brand new study. You might be able to guesstimate based on the typical effect size in your field. Unfortunately, many fields primarily publish significant results. This introduces a bias, because effect size and significance are correlated.

Suppose we ran the above experiment and the true effect size is 7.5 IQ points. On average, that is what we would find. But of course sometimes we'll run the experiment and the effect will be larger and sometimes it will be smaller, simply due to random chance. By paying attention only to the significant results, we're selectively ignoring those experiments that happened, through no fault of their own, to underestimate the effect. This skews our results, and on average we would report an effect of 11.8 IQ points -- much higher than the truth. 
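The selection effect described here can be illustrated with a short simulation. This is a sketch under assumed settings (25 subjects per condition, a true effect of 7.5 points, a standard deviation of 15, 5,000 simulated experiments). It reports the raw difference in means, so the exact figure may differ a little from the one quoted above.

```r
set.seed(1)
n <- 25            # subjects per condition
true_effect <- 7.5 # true training effect, in IQ points
sd_iq <- 15

# Run many simulated experiments, keeping the observed difference and p-value.
sims <- replicate(5000, {
  control <- rnorm(n, 100, sd_iq)
  trained <- rnorm(n, 100 + true_effect, sd_iq)
  c(diff = mean(trained) - mean(control),
    p = t.test(trained, control)$p.value)
})

all_mean <- mean(sims["diff", ])                    # average over every experiment
sig_mean <- mean(sims["diff", sims["p", ] < .05])   # average over "publishable" ones only
print(round(c(all_experiments = all_mean, significant_only = sig_mean), 1))
```

The average over all experiments stays close to the true effect, while the average over only the significant ones is noticeably larger.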

Effects aren't as big as you think.

The typical effect reported in the psychology literature is about half a standard deviation, which is why I've been using the example of 7.5 IQ points above. However, because of the bias against publishing null results or replications, this is inflated. That means that the expectations of psychologists are inflated. We are used to seeing effects of half a standard deviation or more. As a result, we are going to chronically underestimate the number of subjects we need to run. 

Unfortunately, without widespread publication of null results and replications, we cannot say how badly our perception is distorted, because the degree of distortion depends on how large effects really are. I ran a series of simulations involving a two-condition, between-subjects design to see how bad the distortion might be. In these simulations, I assumed that null results are never reported, which is only a slight exaggeration of the psychology literature.

In the graph below, the true effect size (measured in standard deviations) is on the X-axis, and the average reported effect size is on the Y-axis. The different lines reflect different numbers of subjects per condition.

As you can see, if you have 50 or fewer subjects per condition, you'll hardly ever report an effect size smaller than half a standard deviation, even when the true effect size is one tenth of a standard deviation. This is because reliably detecting an effect of one tenth of a standard deviation requires about 2,000 subjects per condition.
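The sample size needed for an effect of one tenth of a standard deviation can be estimated with power.t.test. This sketch assumes a two-sided, two-sample t-test at alpha = .05 and tries two conventional power targets, since "reliably" is not pinned to a specific value in the text.

```r
# Subjects per condition needed to detect d = 0.1 with a two-sided,
# two-sample t-test at alpha = .05, for two conventional power targets.
n_80 <- power.t.test(delta = 0.1, sd = 1, sig.level = 0.05, power = 0.80)$n
n_90 <- power.t.test(delta = 0.1, sd = 1, sig.level = 0.05, power = 0.90)$n
print(ceiling(c(power_80 = n_80, power_90 = n_90)))
```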

Even with 1,000 subjects per condition, there is some distortion for effects smaller than one quarter standard deviation.




Note that these simulations assume that the researcher is correcting for multiple comparisons, isn't p-hacking, etc. The situation gets worse if we relax those assumptions.

In the simulation on the left, I assumed the researcher is engaging in contingent stopping. After every 5 subjects, the researcher checks her data. If the effect is significant, she stops and reports the result. If the effect is in the "wrong direction", she decides there's something wrong with her stimuli, revamps the experiment, and tries again.

While this is a little extreme, most researchers engage in some amount of contingent stopping. As you can see, this badly distorts the results. Even with 1,000 subjects, we end up distorting even large effects.
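To see why checking the data as you go is a problem, here is a simplified sketch. Unlike the simulation described above, it assumes there is no true effect at all and it does not model the "revamp and try again" step. The researcher simply checks after every 5 subjects per condition and stops at the first p < .05. The function name peek_sim and its defaults are made up for this example.

```r
set.seed(1)
# Simplified contingent stopping: no true effect (d = 0), check after every
# 5 subjects per condition, stop as soon as p < .05, give up at 50.
peek_sim <- function(max_n = 50, step = 5) {
  con1 <- con2 <- c()
  while (length(con1) < max_n) {
    con1 <- c(con1, rnorm(step))
    con2 <- c(con2, rnorm(step))
    if (t.test(con1, con2)$p.value < .05) return(TRUE)  # "significant": stop and report
  }
  FALSE  # never reached significance
}

# Proportion of null experiments that end up "significant".
false_positive_rate <- mean(replicate(2000, peek_sim()))
print(false_positive_rate)
```

With a single test at the end, this rate would be about 5%. With repeated peeking it comes out well above that.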

Failure to correct for multiple comparisons will have similar effects.

What this means is that if you are basing your expected effect size on prior experience, the effect you are trying to detect is probably not anywhere near as large as you think, and you may need many more subjects to detect it than you were planning.

But I usually get significant results. Doesn't that mean I have enough subjects?

Maybe. You might be testing an abnormally large effect. Such effects do exist. Alternatively, you may be engaging in contingent stopping, failing to adequately correct for multiple comparisons, or simply making a mistake somewhere in your analysis. It's probably worth checking.

To adapt a metaphor from Uri Simonsohn, you can search for exoplanets with binoculars. But you should keep in mind that it is so unlikely you could see an exoplanet with your binoculars, that if you do see one, you are probably mistaken in some way.

I don't use t-tests. How does this apply to me?

My examples above use t-tests because they are simple and widely known. But the same basic issues apply no matter what kind of analysis you do. If you are looking for some kind of effect, and if that effect is small, you'll need a lot of data to detect it. And it's probably smaller than you think.

If you are model-fitting, the noise in your data puts an upper limit on how well you can fit the underlying phenomenon. If your data are fairly noisy and yet your model fits really well, you are probably fitting the noise, not the signal. And your data are probably noisier than you think.

Recommendations

I can't tell you how many subjects to run. I don't know. Nobody knows. Until there is a good mechanism for reporting null results, nobody will know.

In the meantime, I recommend running more than you think you need. If there are published studies looking at a similar phenomenon, look at their effect size and assume the true effect size is significantly smaller, then do a power analysis. If you run your study and get a positive result, it's probably a good idea to replicate it. If you get a null result, you might not wish to read too much into that.
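As a sketch of that kind of power analysis: suppose a published study reports an effect of half a standard deviation, and you guess the true effect is only half as large. Both numbers are hypothetical, and the halving is an arbitrary assumption for illustration, not a rule.

```r
# Hypothetical planning example: a published effect of d = 0.5, and a guess
# that the true effect is only half as large (the halving is an assumption).
published_d <- 0.5
assumed_d <- published_d / 2

# Subjects per condition for 80% power, two-sided two-sample t-test, alpha = .05.
n_published <- power.t.test(delta = published_d, sd = 1, power = 0.80)$n
n_assumed <- power.t.test(delta = assumed_d, sd = 1, power = 0.80)$n
print(ceiling(c(n_published = n_published, n_assumed = n_assumed)))
```

Halving the assumed effect size roughly quadruples the number of subjects needed.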

If you don't have the means to test a lot of subjects, you have a few options. Your ability to detect an effect depends on both the signal-to-noise ratio and the amount of data you collect. If you can't collect more data, you can try to decrease the noise (e.g., by using a better, more accurate instrument). You can't necessarily increase the signal, because you can't change the laws of nature. But you can decide which laws of nature to study, and you might be better off studying one that has powerful effects.

R code

See below for the R code to run the simulations above. If anyone knows how to convince blogspot to allow indention, please lemmeno.

runsim <- function(N, d, iter, inter = N, verbose = FALSE) {
  # N = number of subjects/condition
  # d = effect size in Cohen's d
  # iter = number of simulations to run
  # inter = after this number of subjects, check results. Stop if significant *or* in the wrong direction. Default to running all subjects.
  # verbose = should we print interim results?
  ES <- c()  # vector of effect sizes
  for (i in 1:iter) {
    # do iter simulations
    if (verbose) { print(paste("Starting simulation", i)) }
    con1 <- con2 <- c()  # initialize scores for the two conditions
    for (j in 1:ceiling(N / inter)) {
      # on each round, test an additional inter subjects.
      n <- inter  # how many subjects do we need for this round?
      if (j * inter > N) { n <- N - (j - 1) * inter }  # adjust if last round would result in more than N subjects
      con1 <- c(con1, rnorm(n, 0, 1))
      con2 <- c(con2, rnorm(n, d, 1))
      if (verbose) { print(paste("pval:", round(t.test(con1, con2)$p.value, 2), "  diff in means:", round(mean(con2) - mean(con1), 2))) }

      if (t.test(con1, con2)$p.value < .05) {
        # reached a significant result!
        break
      }
      if (mean(con1) > mean(con2)) {
        # preliminary results are in the wrong direction. Revamp experiment and try again.
        break
      }
    }

    if (t.test(con1, con2)$p.value < .05) {
      ES <- c(ES, round((mean(con2) - mean(con1)) / ((sd(con1) + sd(con2)) / 2), 2))
    } else {
      # wasn't significant, so don't report
    }
  }
  return(ES)
}

## graph observed effect size against actual effect size for different N

results <- data.frame(N = c(rep(10, 10), rep(25, 10), rep(50, 10), rep(100, 10), rep(1000, 10)), GT = c(rep(seq(.1, 1, .1), 5)), ES = 0)

for (i in 1:length(results$N)) {
  results$ES[i] <- mean(runsim(results$N[i], results$GT[i], 1000))
}

results$N <- as.factor(results$N)
results$expected <- results$GT

library(ggplot2)
quartz(width = 5, height = 4)  # opens a plotting window on macOS
p <- ggplot(data = results, aes(x = GT, y = ES, fill = N, linetype = N, colour = N))
p <- p + geom_line() + scale_color_manual(values = c("red", "blue", "dark green", "purple", "black"))
p <- p + scale_x_continuous(limits = c(0, 1), breaks = seq(0, 1, .25)) + scale_y_continuous(limits = c(0, 1.5), breaks = seq(0, 1.5, .25))
p

Language fact of the day

The name that appears most often in Genesis is "Jacob", followed by "Joseph".

In other news, the most common word in Moby Dick is "the"; the most common noun (excluding pronouns) is, not surprisingly, "whale".

In Genesis, Moby Dick, and a number of other texts, three-letter words are more common than words of any other length (the one exception I've found so far is Moby Dick).

(Yes, I am learning to use NLTK, which so far I like a lot)

Lying with statistics

One of the most concise explanations of why your units of measurement matter, courtesy of XKCD:


Statistics for Idiots

Republicans in the House are proposing to cut funding for food safety programs, despite a rise in food-borne illness. Congressman Jack Kingston explains that the nation's food supply is "99.99 percent safe". Politifact says, "That sounds great, but is it true?"

Actually, it doesn't sound that good to me. Suppose Kingston means that you only have a 0.01% chance of getting ill any particular time you eat (which seems to be the case). And let's say people eat 3 times a day. That gives you a 10.4% chance of getting sick any given year. I'd rather not get sick at all, particularly when many of the illnesses are easily preventable.
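The arithmetic, as a short sketch. It assumes every meal carries the same 0.01% risk and that meals are independent of one another, which is a simplification.

```r
p_ill_per_meal <- 0.0001   # 0.01% chance of getting ill from any one meal
meals_per_year <- 3 * 365  # three meals a day

# Chance of at least one illness = 1 - chance of getting through every meal safely.
p_ill_per_year <- 1 - (1 - p_ill_per_meal)^meals_per_year
print(round(100 * p_ill_per_year, 1))  # percent; about 10.4
```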

missing 2

One of the formulas in the last post was missing a 2. Everything has now been recalculated. Some numbers changed. The basic result is that some of the numbers are not quite as dire as I had stated: the original example experiment, which had 15 participants per condition and an effect significant at p=.05 has a 51% chance of replicating (in the sense of producing another significant p-value when re-run exactly), again assuming the effect was real and the effect size is as measured in the first experiment.

When should an effect be called significant?

Note: This post originally contained an error in the mathematics, which Tal of Citation Needed kindly pointed out. This error has been corrected.

In the thread following my earlier post on false positives, Tal made the observation that a typical study that is significant at the p=.05 level has a 50% chance of being replicated.

It turns out that this depends heavily on what you mean by replicate. I'm going to work through some numbers below. Again, stats isn't my specialty, so please anyone jump in to correct errors. But I think I've got the general shape of the issues correct.

I got a significant result! Can I get it again?

Let's say you ran an experiment comparing the IQ scores of 15 people who prefer kiwi with the IQ scores of 15 people who prefer rambutan. You find that people who prefer rambutan have IQs 11.2 points higher than those who prefer kiwi. Assuming the standard deviation is 15 (which is how IQ tests are normalized), then that should give you a t-value of 11.2 / (15 * (2/15)^.5) = 2.04 and a p-value of about .05.
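The same calculation in R, as a sketch. It assumes both groups have a standard deviation of exactly 15 and uses a two-sided test with 28 degrees of freedom.

```r
diff_iq <- 11.2  # observed difference in mean IQ
sd_iq <- 15      # assumed standard deviation in both groups
n <- 15          # people per group

se <- sd_iq * sqrt(2 / n)                          # standard error of the difference
t_value <- diff_iq / se
p_value <- 2 * pt(-abs(t_value), df = 2 * n - 2)   # two-sided p-value
print(round(c(t = t_value, p = p_value), 3))
```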

So you've now got a significant result! You tell all your friends, and they try running the same experiment. What are the chances they'll get the same results, significant at the p=.05 level?

The chances are not great. Even assuming that the underlying effect is real (rambutan-eaters really are smarter), your friends will only replicate your result about 51% of the time, assuming they use exactly the same methods (according to a nifty power calculator found online here).

Define "get it"

Of course, we were assuming above that rambutan-eaters really are 11.2 IQ points smarter than kiwi-eaters (BTW I like both kiwi and rambutan, so nothing is meant by this example). In which case, your friends might not have gotten results significant at the p=.05 level, but they very likely found higher average IQs for their samples of rambutan-eaters relative to kiwi-eaters. And of course, what we really care about is how easy it will be to replicate the rambutan/kiwi difference, not how easy it will be to get the significant p-value again. The point of science is not to be able to predict statistically-significant differences but simply to predict differences. It's well beyond my statistical abilities to say how often this would happen, but hopefully someone will step up in the comments and let us know.
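One way to put a rough number on this, offered as a sketch rather than a definitive answer: if the true difference really is 11.2 points and the standard deviation is 15, the difference between two sample means of 15 people each is approximately normal, with mean 11.2 and standard error 15 * sqrt(2/15). The chance that a replication finds rambutan-eaters ahead at all is then the probability that this difference comes out above zero.

```r
true_diff <- 11.2             # assumed true difference, in IQ points
se_diff <- 15 * sqrt(2 / 15)  # standard error of the difference in sample means

# Probability that a replication's observed difference has the same sign.
p_same_direction <- pnorm(true_diff / se_diff)
print(round(p_same_direction, 2))
```

Under these assumptions the direction of the difference replicates far more often than the significant p-value does.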

In practice, though, other people are only going to follow up on your effect if they can replicate it at the standard p=.05 level. What can we do to improve the chances of replicability?

Lower alphas

Let's suppose your effect had been significant at the p=.01 level. We can manage that while keeping the effect-size the same (11.2 IQ points) if we increase our sample to 26 kiwi-eaters and 26 rambutan-eaters (t = 11.2/(15 * (2/26)^.5) = 2.7). Now our chance of getting another significant result at the p=.01 level is ... 52%.

But we don't really care about getting a p=.01 again; we want to get the result again at the p=.05 level, which should happen around 76% of the time. Now, what if we had a result significant at the p=.001 level the first time around? We'd have needed about 42 subjects per condition. The chance of replicating that at the p=.05 level is 92%.

p-value    #subjects/condition    Likelihood of repeating at p=.05 level
.05        15                     51%
.01        26                     76%
.001       42                     92%
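The table can be approximately reproduced with power.t.test instead of an online calculator. This sketch assumes the true effect equals the observed 11.2 points, a standard deviation of 15, and a two-sided test. Small differences from the table are possible because of rounding or differences between calculators.

```r
effect <- 11.2
sd_iq <- 15
n <- c(15, 26, 42)  # subjects per condition in the original study

# t and two-sided p-value of the original result at each sample size.
t_orig <- effect / (sd_iq * sqrt(2 / n))
p_orig <- 2 * pt(-t_orig, df = 2 * n - 2)

# Chance that an exact replication is significant at the p = .05 level.
power_05 <- sapply(n, function(k) {
  power.t.test(n = k, delta = effect, sd = sd_iq, sig.level = 0.05)$power
})

print(data.frame(n = n, t = round(t_orig, 2), p = round(p_orig, 3),
                 replication_power = round(power_05, 2)))
```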

Replication

Of course, there are things that I'm not considering here, like the possibility that your original experiment underestimated the effect size. For instance, let's say that the true effect size is 15 IQ points (which is a lot!). Now, your chance of finding an effect significant at the p=.05 level with only 15 participants per condition is 75%. That's a lot better than what we started with, though not perfect. To have an effect large enough to see it 95% of the time at the p=.05 level, it would need to be over 20 IQ points, which is a monstrously large effect. Incidentally, if you ran this study with 15 rambutan-eaters and 15 kiwi-eaters and found a 20 IQ point effect, that would be significant at roughly the p=.001 level.
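A sketch of the power calculation for these larger effects, again assuming 15 participants per condition, a standard deviation of 15, and a two-sided test at p=.05:

```r
# Power to detect a true effect of 15 or 20 IQ points with 15 per condition.
power_15 <- power.t.test(n = 15, delta = 15, sd = 15, sig.level = 0.05)$power
power_20 <- power.t.test(n = 15, delta = 20, sd = 15, sig.level = 0.05)$power
print(round(c(effect_15 = power_15, effect_20 = power_20), 2))
```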

What I get from all this is that if you want a result that you and others will be able to replicate, you're going to want the p-value in your original experiment to have been lower than p=.05.

Poll: Do You Care about Effect Size?

My recent post on false positives has generated a long thread, with a large number of informative comments from Tal, who has convinced me to think a lot more about power analyses. I recommend reading the comments.

One issue that has come up is if and when we actually care about the null hypothesis. I argue that a fair amount of the time we really are deeply interested in knowing whether an effect exists or not. I don't entirely understand Tal's argument -- I'm sure he'll help out in the comments -- but I think he is saying that in any given experiment, there are always confounds such that if you have enough power, you'll find a significant result. So whether or not the manipulation has its intended effect, the confounds will ensure that the null hypothesis is false.

Perhaps. Having run studies with thousands of participants and no significant effect, I'm skeptical that this is always true, but obviously the data we'd need to test his claim does not and never will exist. In any case, this is why we use converging methods: the undetected confounds in one method will (hopefully) not appear in the others, and across studies the truth will emerge.

Still, this discussion has led me to wonder: across fields, how often are people deeply interested in the existence or absence of an effect (as opposed to the size of the effect)? Please leave a comment with your field and how often you really are interested in the presence or absence of an effect. Examples are encouraged but are unnecessary.

I'm already on the record saying I am often interested in the existence of an effect and rarely care about its size. Below I give my examples.

Why I rarely care about effect size

Priming: Priming is expected to occur whenever two mental constructs share underlying structure or recruit the same underlying processes. There is a lot of interest in the underlying representations of certain verb forms. Verbs of transfer can be used two ways. Compare: John gave the book to Sally vs. John gave Sally the book. The order of the words changes and there either is or isn't a preposition.

In a number of experiments, Thothathiri and Snedeker asked whether hearing give in one form would make it easier for people to understand other verbs of transfer in the same form (e.g., send). On some theories, it should (due to shared structure between verbs). On some theories, it shouldn't (due to verbs not sharing structure). So the existence of the effect mattered.  But what about effect size: how much of an effect should priming have? It's an interesting question, but irrelevant to the hypotheses this study was testing, and frankly currently nobody has any hypotheses one way or another.

Development: Thothathiri and Snedeker found the priming effect in adults. They also tested children. For any adult behavior, there is always the question of at what point in development the behavior should appear. This is a deep, interesting question, since some behaviors are (roughly-speaking) innate and some are learned and you'd expect the former to appear earlier than the latter. Again, there are theories that very strongly predict that young children should or should not show the same effect as adults. Once again, the existence of the effect matters.

What about the size? Again, nobody has any predictions, and effect size cannot be used to tease apart theories. Even if the effect were much smaller in children, that wouldn't really matter, since in general children are difficult participants to work with and their effects are often smaller because a certain number simply didn't understand the task.

Eyetracking: Many of my experiments use the Visual World Paradigm. The basic idea is that if you show people a picture and start talking about it, they will look at the parts you are talking about as you are talking about them. If there is a picture of a cat, a dog and a horse, and I say "dog," participants will look at the part of the picture with a dog. We can then use their eye movements to see how quickly people understood the word. So we're looking for the first point in time at which more people are looking at "dog" than you'd expect by chance. At any given time point, either there is an effect or there isn't -- and there had better be a point at which there isn't, such as before I said the word "dog"!

As far as effect size, though, it's not going to be the case that everyone is looking at the dog at any given time point (these effects are probabilistic). You'd expect somewhere between 50% and 80% of people looking at the dog. But as long as you have more than 33% looking at the dog (remember, there are 3 things to look at: the cat, the dog and the horse), that's an effect. As far as size...you can measure it, but it won't help you distinguish between existing theories, which is what a good experiment is supposed to do.
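As a sketch of what "more than chance" means at a single time point, the example below tests a made-up count against the one-in-three chance level with an exact binomial test. The numbers (20 of 40 participants) are hypothetical, and real eyetracking analyses are usually more involved than one test at one time point.

```r
# Hypothetical counts: 20 of 40 participants looking at the dog at one time point.
looking <- 20
participants <- 40
chance <- 1 / 3  # three pictures: cat, dog, horse

# One-sided exact binomial test: is the proportion looking at the dog above chance?
result <- binom.test(looking, participants, p = chance, alternative = "greater")
print(result$p.value)
```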

Etc.: It's easy to generate more examples. I'm pretty sure every study I've ever run has been of this sort, as are most of the studies I have read. Sometimes we're interested in knowing more than just whether an effect exists. Sometimes we also care about the direction. But existence in and of itself is a real question.

Is psychology a science, redux

Is psychology a science? I see this question asked a lot on message boards, and it's time to discuss it again here. I think the typical response by a researcher like myself is an annoyed "of course, you ignoramus." But a more subtle response is deserved, as the answer depends entirely on what you mean by "psychology" and what you mean by "science."

Two Psychologies

First, if by "psychology" you mean seeing clients (like in Good Will Hunting or Silence of the Lambs), then, no, it's probably not a science. But that's a bit like asking whether engineers or doctors are scientists. Scientists create knowledge. Client-visiting psychologists, doctors and engineers use knowledge. Of course, you could legitimately ask whether client-visiting psychologists base their interventions on good science. Many don't, but that's also true about some doctors and, I'd be willing to bet, engineers.

Helpfully, "engineering" and "physics" are given different names, while the research and application ends of psychology confusingly share the same name. (Yes, I'm aware that engineering is not the application of physics writ broadly -- what's the application of string theory? -- and one can be a chemical engineer, etc. I actually think that makes the analogy to the two psychologies even more apt). It doesn't help that the only psychologists who show up in movies are the Good Will Hunting kind (though if paleoglaciologists get to save the world, I don't see why experimental psychologists don't!), but experimental psychology does exist.

A friend of mine (a physicist) once claimed psychologists don't do experiments (he said this un-ironically over IM while I was killing time in a psychology research lab). My response now would be to invite him to participate in one of these experiments. Based on this Facebook group, I know I'm not the only one who has heard this.

Methods

There are also those, however, who are aware that psychologists do experiments, but deny that it's a true science. Some of this has to do with the belief that psychologists still use introspection (there are probably some somewhere, but I suspect there are also physicists who use voodoo dolls somewhere as well, along with mathematicians who play the lottery).

The more serious objection has to do with the statistics used in psychology. In the physical sciences, typically a reaction takes place or does not, or a neutrino is detected or is not. There is some uncertainty given the precision of the tools being used, but on the whole the results are fairly straight-forward and the precision is pretty good (unless you study turbulence or something similar).

In psychology, however, the phenomena we study are noisy and the tools lack much precision. When studying a neutrino, you don't have to worry about whether it's hungry or sleepy or distracted. You don't have to worry about whether the neutrino you are studying is smarter than average, or maybe too tall for your testing booth, or maybe it's only participating in your experiment to get extra credit in class and isn't the least bit motivated. It does what it does according to fairly simple rules. Humans, on the other hand, are terrible test subjects. Psychology experiments require averaging over many, many observations in order to detect patterns within all that noise.

Science is about predictions. In theory, we'd like to predict what an individual person will do in a particular instance. In practice, we're largely in the business of predicting what the average person will do in an average instance. Obviously we'd like to make more specific predictions (and there are those who can and do), but they're still testable (and tested) predictions. The alternative is to declare much of human and animal behavior outside the realm of science.

Significant differences

There are some who are on board so far but get off the bus when it comes to how statistics are done in psychology. Usually an experiment consists of determining statistically whether a particular result was likely to have occurred by chance alone. Richard Feynman famously thought this was nuts (the thought experiment is that it's unlikely to see a license plate printed CPT 349, but you wouldn't want to conclude much from it).

That's missing the point. The notion of significant difference is really a measure of replicability. We're usually comparing a measurement across two populations. We may find population A is better than population B on some test. That could be because population A is underlyingly better at such tests. Alternatively, population A was lucky that day. A significant difference is essentially a prediction that if we test population A and population B again, we'll get the same results (better performance for population A). Ultimately, though, the statistical test is just a prediction (one that typically works pretty well) that the results will replicate. Ideally, all experiments would be replicated multiple times, but that's expensive and time-consuming, and -- to the extent that the statistical analysis was done correctly (a big if) -- largely unnecessary.

So what do you think? Are the social sciences sciences? Comments are welcome.

Another reason everyone should learn statistics

Here is another insightful experiment from Tversky and Kahneman:

In a discussion of flight training, experienced instructors noted that praise for an exceptionally smooth landing is typically followed by a poorer landing on the next try, while harsh criticism after a rough landing is usually followed by an improvement on the next try. The instructors concluded that verbal rewards are detrimental to learning, while verbal punishments are beneficial.

It's not clear from the description whether the instructors considered whether their lesson plan would be beneficial to morale, but in any case, they were almost certainly wrong. They fell for a statistical phenomenon known as "regression to the mean."

Basically, every time you measure something, there is some error. For instance, although Sally may be a B student, sometimes she gets As on her tests and sometimes she gets Cs. Some days she has good days and some days she has bad days.

Now suppose you give a test to the whole class and then select all the students who got As to be in a special program. That group of students who got As will include some who are normally A students, but it will also include some people who normally are B or C students but who had a good day. So, if you were to re-test those same students, the average grade would decline, perhaps to an A- or B+. This isn't because the students got stupider; it's just that the students who got fluke As the first time are unlikely to repeat their performance.

Similarly, if you had picked all the students who failed the exam, that group of students would have included both true F students as well as a few C or D students (maybe even B or A students) who were having a rough day. If you retest them, the average grade will move up, because those C and D students will likely do better the second time. They haven't gotten smarter; it's just regression to the mean.
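Regression to the mean is easy to simulate. In this sketch each student has a fixed "true" level, and each test score is that level plus independent day-to-day luck. The scale, the spreads, and the 10% cutoffs are all made up for illustration.

```r
set.seed(1)
n_students <- 10000
ability <- rnorm(n_students, mean = 80, sd = 8)  # each student's true level (hypothetical scale)
test1 <- ability + rnorm(n_students, sd = 8)     # first test: ability plus luck on the day
test2 <- ability + rnorm(n_students, sd = 8)     # retest: same ability, fresh luck

top <- test1 >= quantile(test1, 0.9)     # the best 10% on the first test
bottom <- test1 <= quantile(test1, 0.1)  # the worst 10% on the first test

print(round(c(top_first = mean(test1[top]), top_retest = mean(test2[top]),
              bottom_first = mean(test1[bottom]), bottom_retest = mean(test2[bottom])), 1))
```

Nobody's ability changes between the two tests, yet the top group's average falls and the bottom group's average rises on the retest.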

Those flight students would probably have had a better experience if their instructors knew about regression to the mean.


----
Some readers might have wondered the following: if, in the group of students who got As, some of their scores will go down upon re-testing, shouldn't some go up? Yes and no. Some scores would go up, but those are mostly typically A students who got Bs or Cs on that particular exam. However, you have already excluded them from the group, so their rebounding scores can't off-set the falling scores of the typically B and C students.



Tversky, A., Kahneman, D. (1974). Judgment under uncertainty: heuristics and biases. Science, 185(4157), 1124-1131.

Do ballplayers really hit in the clutch?

If you've been watching the playoffs on FOX, you'll notice that rather than present a given player's regular-season statistics, they've been mostly showing us their statistics either for all playoff games in their career, or just for the 2007 post-season. Is that trivia, or is it an actual statistic? For instance, David Ortiz hits better in the post-season than during the regular season. OK, one number is higher than the other, but that could just be random variation. Does he really hit better during the playoffs?

Why does this even matter? There is conventional wisdom in baseball that certain players hit better in clutch situations -- for instance, when men are on base. This is why RBIs (runs-batted-in) are treated as a statistic, rather than as trivia. Some young Turks (e.g., Billy Beane of the Oakland A's) have argued vigorously that RBIs don't tell you anything about the batter -- they tell you about the people who bat in front of him (that is, they are good at getting on base). Statistically, it is said, few to no ballplayers hit better with men on and 2 outs.

So what about in the post-season?

I couldn't find Ortiz's lifetime post-season stats, so I compared this post-season, during which he's been phenomenally hot (.773 on-base percentage through the weekend -- I did this math last night during the game, so I didn't include last night's game), with the 2007 regular season, during which he was just hot (.445 on-base percentage).

There are probably several ways to do the math. I used a formula to compare two independent proportions (see the math below). I found that his OBP is significantly better this post-season than during the regular season. So that's at least one example...

Here's the math.

You need to calculate a t statistic, which is the difference between the two means (.773 and .445) divided by the standard deviation of the difference between those two means. The first part is easy, but the latter part is complicated by the fact that we're dealing with ratios. That formula is:

square root of: (P1*(1-P1)/N1 + P2*(1-P2)/N2)
where P1 = .773 and N1 = 21 (post-season at-bats - 1), and P2 = .445 and N2 = 658 (regular-season at-bats - 1).

t = 3.51, which gives a p value of less than .01.
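The same calculation as a sketch in R. It assumes each proportion is paired with the sample size from its own season (.773 with the post-season count, .445 with the regular-season count), that 659 and 22 are the at-bat counts before subtracting 1, and that a normal approximation is good enough for the p-value.

```r
p_post <- 0.773; n_post <- 22 - 1   # post-season on-base percentage, at-bats - 1
p_reg <- 0.445;  n_reg <- 659 - 1   # regular-season on-base percentage, at-bats - 1

# Standard error of the difference between two independent proportions.
se_diff <- sqrt(p_post * (1 - p_post) / n_post + p_reg * (1 - p_reg) / n_reg)
t_stat <- (p_post - p_reg) / se_diff
p_value <- 2 * pnorm(-abs(t_stat))  # two-sided, normal approximation (an assumption here)
print(round(c(t = t_stat, p = p_value), 4))
```

With only 22 post-season at-bats, the normal approximation is rough, so the exact p-value should not be taken too literally.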

I was also considering checking just how unusual Colorado's winning streak is, but that's where my knowledge of statistics broke down (maybe we'll learn how to do that next semester). If anybody has comments or corrections on the stats above or can produce other MLB-related math, please post it in the comments.