Comments on Games with Words: And for my next trick, I'll make this effect disappear!

2012-07-15T03:37:01.222-04:00

You may have already seen this, but if not it is definitely worth a look. Dance of the p-values is a short YouTube video explaining why p < .05 does not mean there is a 95% chance of replication, amongst other things. http://www.youtube.com/watch?v=ez4DgdurRPg

2010-12-15T16:31:19.749-05:00

@tal. Don't worry about the long thread. I know what your negative argument is, but what is your positive argument? Even if we accept that in practice there are always confounds and that, with sufficient power, we'll always reject the null hypothesis, my claim that approximately half the time we care about whether our *intended* manipulation has an effect (intended manipulation, not anything we manipulated by accident) still stands.

If you deny that's the case, then the place to write about that is in the post I just put up. But if you agree that we often do care, how do we go about testing the existence or absence of effects?

---

BTW I worked through the calculations you suggested, and you seem to be overly optimistic. A 2-sample t-test with 15 subjects per condition is significant at the p=.05 level with t=1.7. That's an effect size (in Cohen's terms) of .44. An effect size of .44 with the same alpha has observed power of .21. So even though this effect is real, we expect to see it only 1/5 of the time we test for it.

Is that right?

2010-12-15T14:10:13.430-05:00

(and again, sorry, last bit...)

I'm also still not sure why you think a 50% chance of replication doesn't allow any progress. When you test against a null of zero, your expectation isn't that the null hypothesis is false 50% of the time (if it is, you're testing against the wrong null!), it's that there is no effect whatsoever. Learning the direction of the effect, and that it can be reliably distinguished from zero 50% of the time, is hugely informative! Now, it's true that the same result might be completely uninformative if your prior exactly matched the posterior, and that could certainly happen. But that's not an argument for hypothesis testing, it's precisely the kind of argument Bayesians use for why you should use a Bayesian approach.

But for fields where effect size matters, these are interesting ideas.

There aren't any fields of psychology where effect size doesn't matter, for the reasons articulated above. The fact that you're rarely explicit about effect sizes in your studies doesn't mean you aren't making implicit assumptions about effect size all the time. You're just using p values as a proxy, because your sample sizes aren't big enough that everything comes out statistically significant. If you doubt this to be true, run a few experiments with 10,000 subjects each and then tell me how you're going to decide which effects you care about and which you don't.

Anyway, this comment thread is getting kind of long (sorry, I know I'm not being very succinct), so the last word is yours.

2010-12-15T14:08:44.427-05:00

continued from above...

I don't know what area of psych you're referring to in your post, but in my subfield (language) and most subfields I read, we mostly do not care about effect size. We're studying underlying structure, so any effect of any size is meaningful -- and, in fact, effect size has no theoretical meaning in most cases.

I don't see how it's possible not to care about effect size. I'll grant that most psychologists may not stop to think about what constitutes a meaningful effect size very often, but that doesn't mean they're not making implicit assumptions about effect size every time they run a test. To put it in perspective, consider that if you routinely conducted your studies with 1,000,000 subjects each, all of your tests would produce statistically significant results (the critical effect size for p < .05 with that sample size is around 2/1000th of a standard deviation--good luck getting effects smaller than that!). So how is it possible not to care about effect size and only about rejecting the null, if all it takes in order to reject the null is collecting a large enough sample? For that matter, I assume you tend to treat a finding significant at p < .05 differently from one significant at p < .00001--and the only difference between the two is... effect size.
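To make the sample-size point concrete, here is a minimal sketch of how the smallest "significant" effect shrinks as n grows. It assumes a one-sample, two-tailed z-test with a known standard deviation of 1, which is a simplification of the t-tests under discussion; the function name `critical_effect_size` is just an illustrative label.

```python
from math import sqrt
from statistics import NormalDist


def critical_effect_size(n, alpha=0.05):
    """Smallest standardized effect (in SD units) that reaches two-tailed
    significance in a one-sample z-test with n subjects (assumed design)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z_crit / sqrt(n)


for n in (15, 100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: critical effect size = {critical_effect_size(n):.4f} SD")
```

Under these assumptions the threshold at a million subjects works out to roughly 0.002 SD, in line with the "2/1000th of a standard deviation" figure above.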

So if it were actually the case that we knew the null hypothesis had to be false, we'd never bother to run the experiment -- except in the case where we need to know the direction of an effect, though that only comes up every so often.

That's exactly why null hypothesis testing is kind of absurd. That's not to say it isn't a useful fiction, but it's still a fiction--there are few if any situations in which a null of zero is at all meaningful. The only reason the framework actually works is because we tend to run samples small enough that we don't run into the problem of having everything be statistically significant, so we rarely have to think about how absurd it is. In other words, what's happening is that p values end up being proxies for effect size in virtue of the kind of sample sizes we use.

In such cases, I don't think knowing you have a 50% chance of being able to repeat an effect carries much value, if any. Would progress even be possible, if this were true? I think not. Yet there has been progress, which is why I'm skeptical of your claim. I'm not saying you've done the math wrong, but perhaps some of the assumptions are incorrect, at least for the branches of cognitive and social psychology/neuroscience that I follow.

I'm not sure what you're objecting to here... If you take the significance testing framework at face value, it's simply a fact that a finding that's significant at p = .05 will, on average, have a 50% chance of replicating if you repeat the study. I'm not expressing my opinion or building in any extra assumptions beyond what you already assume when you run a t-test; that's just the reality. If you doubt it, just take the critical effect size that corresponds to p = .05 for a given sample size and run a power calculation for the same effect size, alpha, and sample size. You will get 50%. If you don't like it, your problem is with the hypothesis testing framework.
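The power calculation described above can be sketched in a few lines. This is only an approximation under stated assumptions: it uses a one-sample z-test rather than a t-test, it treats the observed critical effect size as if it were the true effect, and it counts only replications that are significant in the same direction. The function names are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_same_direction(true_d, n, alpha=0.05):
    """Probability that a study of size n yields a significant result in the
    direction of true_d (two-tailed alpha, one-sample z-test approximation)."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    expected_z = true_d * sqrt(n)  # mean of the test statistic under true_d
    return 1 - STD_NORMAL.cdf(z_crit - expected_z)


def replication_power_at_threshold(n, alpha=0.05):
    """Power of an identical replication when the true effect equals the
    effect size that would just reach p = alpha."""
    d_crit = STD_NORMAL.inv_cdf(1 - alpha / 2) / sqrt(n)
    return power_same_direction(d_crit, n, alpha)


for n in (15, 50, 200):
    print(f"n = {n}: replication power = {replication_power_at_threshold(n):.3f}")
```

Because the expected test statistic sits exactly on the critical value, half of the sampling distribution falls above it whatever n is, which is where the 50% comes from.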

2010-12-15T14:07:18.215-05:00

Do you think so? I think it's pretty simple to design experiments where we expect the null hypothesis to be true. Tests of ESP, for instance.

ESP is an excellent example, because that's about as clean a case as you could make, and even there, I don't think it's plausible to expect that you'd ever get an effect of exactly zero if you sampled the entire population. Remember, the null hypothesis isn't something that lives in construct land (where you can say things like "ESP doesn't exist, so it can't be associated with anything"); it has to be operationalized, otherwise we can't test it. I'd argue that for any operationalization of ESP, there are going to be many potential confounds and loose ends that are necessarily going to make your effect non-zero. Take precognition experiments. A very basic requirement is that you have to have a truly random number generator. Well, there isn't really any such thing. Not in the sense you'd require in order to ensure an effect of literally zero. Remember, any systematic blip, no matter how small, renders the null false. If the body heat generated by larger subjects systematically throws off the computation by one bit in eight trillion, that's still a non-zero association. And this is for a contrived example; for an average hypothesis that most psychologists would actually be willing to entertain, you could easily reel off dozens if not hundreds of factors that would ensure you have a non-zero association...

What's left is to decide in practice, for a given field, how often the null hypothesis is likely to be true. I've already put my money on 50% for psych -- at least, the areas of psych I'm familiar with.

I guess I don't understand what this means, or where that number comes from. The probability of rejecting the null hypothesis depends in large part on your sample size. So that 50% number can't be referring to the probability of rejecting the null hypothesis in actual experiments, because if that were the case, so long as the effect wasn't exactly zero, you'd be able to turn 50% into almost any other probability just by collecting more or fewer subjects. Which would render the number meaningless.

The only way I can make sense of this is if you really believe that for 50% of all hypotheses that people suggest, the effect in the entire population is literally zero. Not small; not close to zero; not a correlation of 0.0001 (which is still statistically significant in a sample of 7 billion people!); but exactly zero. If that's really what you believe, then we're at an impasse, but frankly the idea that any association between two meaningful psychological variables in a dense causal system like our world would ever be exactly nil seems inconceivable to me.
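As a rough check on the parenthetical about a correlation of 0.0001, here is a sketch that tests a correlation against zero using the Fisher z-transformation. The normal approximation and the function name `correlation_test` are assumptions for illustration, not anything taken from the thread.

```python
from math import atanh, erfc, sqrt


def correlation_test(r, n):
    """Two-tailed test of H0: rho = 0 using the Fisher z-transformation.
    Returns the z statistic and its p-value (normal approximation)."""
    z = atanh(r) * sqrt(n - 3)
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) and stays accurate far
    # out in the tail, where 1 - cdf would round to zero.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


for n in (1_000, 1_000_000, 7_000_000_000):
    z, p = correlation_test(0.0001, n)
    print(f"n = {n:>13,}: z = {z:.3f}, p = {p:.3g}")
```

With r = 0.0001 the z statistic is close to r times the square root of n, so it is negligible at a thousand subjects and above 8 at seven billion.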

2010-12-14T23:04:24.063-05:00

I'm not sure it's meaningful to talk about the probability of the null being true or false. Strictly speaking, the null is (for practical intents) always false.

Do you think so? I think it's pretty simple to design experiments where we expect the null hypothesis to be true. Tests of ESP, for instance. Since there are an infinite number of ESP manipulations one could try, there is necessarily an infinite number of experiments where we expect the null hypothesis to be true. (If you believe in ESP, I'm sure you can work out your own class of examples.)

What's left is to decide in practice, for a given field, how often the null hypothesis is likely to be true. I've already put my money on 50% for psych -- at least, the areas of psych I'm familiar with.

I don't know what area of psych you're referring to in your post, but in my subfield (language) and most subfields I read, we mostly do not care about effect size. We're studying underlying structure, so any effect of any size is meaningful -- and, in fact, effect size has no theoretical meaning in most cases.

So if it were actually the case that we knew the null hypothesis had to be false, we'd never bother to run the experiment -- except in the case where we need to know the direction of an effect, though that only comes up every so often.

In such cases, I don't think knowing you have a 50% chance of being able to repeat an effect carries much value, if any. Would progress even be possible, if this were true? I think not. Yet there has been progress, which is why I'm skeptical of your claim. I'm not saying you've done the math wrong, but perhaps some of the assumptions are incorrect, at least for the branches of cognitive and social psychology/neuroscience that I follow.

But for fields where effect size matters, these are interesting ideas.

2010-12-14T20:20:21.342-05:00

(continued from above comment)

If you think about it in terms of effect sizes, then this problem goes away. You have some prior belief about how big the effect is, then you look at the data, and you update your belief accordingly. The standard practice in psychology is to effectively assume a completely uniform prior distribution (and this is one of the things Bayesians rail against, because if you really had no reason to think any value was likelier than zero, why would you ever do the study in the first place?). So, if we're being strict about it, then achieving a result that's significant at p < .05 is giving you information, because you started out with the null (which is effectively a prior in this case) that there wasn't any effect. If you intuitively feel like that's wrong, then I think what you're tacitly saying is that it's silly to use uniform priors, and we should build at least some sense of the expected effect size into the test--which I'd agree with.

Notice that once you do that, it does become entirely possible that you would obtain a result statistically significant at p < .05 yet walk away feeling less confident about your prediction. For instance, if you think the effect you're testing is one standard deviation in magnitude, and you're conducting a one-sample t-test on 50 subjects, you could get an effect of a quarter of a standard deviation and it would still be significant at p < .05. But you wouldn't want to walk away concluding that your hypothesis was borne out--you would in fact conclude the evidence was weaker than you thought.

2010-12-14T20:19:05.152-05:00

I'm still not sure I know what you mean by "50% chance of replication". If you mean "50% chance the null hypothesis is in fact false" -- that is, that the effect is real and should be found again -- then that seems low for the following reason: I think it's a reasonable assumption that any given tested hypothesis has around a 50% chance of being true (and thus the null hypothesis has a 50% chance of being false).

By "chance of replication" I just mean the probability of obtaining a statistically significant effect in the same direction given an identical study (i.e., same design, number of subjects, etc.). In a world in which all hypothesis tests are unbiased, obtaining a result significant at p < .05 implies a roughly 50% chance of obtaining a second statistically significant result if you were to redo the study without changing anything.

In the real world, hypothesis tests aren't unbiased, of course; that's what this entire debate is largely about. There's a tendency to selectively report and pursue effects that attained statistical significance, so the reality is that most of the time, the true probability of replication (in the same sense as above) is going to be lower than 50%. And again, there are domains where we can say pretty confidently that it's going to be much lower.

If you mean that p=.05 is equivalent to a 50% chance that the null hypothesis is false, then what you're saying is that being significant at that level carries no information whatsoever. I believe there's a real problem, but that strikes me as overly pessimistic.

That's not the implication I'd draw... I'm not sure it's meaningful to talk about the probability of the null being true or false. Strictly speaking, the null is (for practical intents) always false. What effect could you possibly name where you really believe that if you sampled the entire population, the magnitude of the effect would literally be exactly zero? In a dense causal system, that's inconceivable. Everything has an influence on everything, however small.

2010-12-13T13:46:21.551-05:00

@Tal: You are right to point out that what we are estimating is the probability of the data given the hypothesis, whereas what we want to estimate is the probability of the hypothesis. My understanding was that there is no way to estimate the latter, and so we use the former as a proxy.

I'm still not sure I know what you mean by "50% chance of replication". If you mean "50% chance the null hypothesis is in fact false" -- that is, that the effect is real and should be found again -- then that seems low for the following reason: I think it's a reasonable assumption that any given tested hypothesis has around a 50% chance of being true (and thus the null hypothesis has a 50% chance of being false).

If you mean that p=.05 is equivalent to a 50% chance that the null hypothesis is false, then what you're saying is that being significant at that level carries no information whatsoever. I believe there's a real problem, but that strikes me as overly pessimistic.

I should say that whenever somebody has argued with my hypothesis that 50% of tested null hypotheses should be false, the argument is that the number should be much higher, since researchers single out null hypotheses particularly likely to be false (that is, positive hypotheses particularly likely to be true). On this account, then, a p-value of .05 carries less information than the fact that the researcher ran the experiment to begin with.

Possibly I've completely misunderstood what you're saying. Is that about right?

2010-12-12T15:03:54.282-05:00

There were several typos/misspellings/words missing that would have been helpful. Try using spell-check next time.

2010-12-12T14:21:15.945-05:00

I guess I don't see why saying there's approximately a 95% probability of replication is more succinct than saying there's approximately a 50% probability of replication... and the latter has the benefit of being true (or at least, much closer to the truth).

I think in your example you might be confusing the probability of the data given the hypothesis (P(D|H)) with the probability of the hypothesis given the data (P(H|D)). Observing that p < .05 means that P(D|H0) is < 0.05 (where H0 is the null). When you talk about the probability of the samples being from the same or different populations, you're talking about P(H|D)--the probability that the null is true (or that the alternative hypothesis H1 is true, which is the complement) given the data. You can't actually calculate that just from the observed p value, because you don't know the prior probabilities P(H0) and P(D). I think what you're thinking of as the complement, which really is 0.95, is P(~D|H0)--the probability that you wouldn't observe the data if the null were true. But that's generally not an interesting quantity.
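To illustrate why P(D|H0) and P(H0|D) come apart, here is a small Bayes' rule sketch. It simplifies "the data" to "a significant result", and the prior and power values are purely hypothetical inputs chosen for illustration; nothing in the thread fixes them.

```python
def prob_null_given_significant(prior_null, alpha, power):
    """P(H0 | significant result) via Bayes' rule.

    prior_null: assumed prior probability that the null is true
    alpha:      P(significant | H0), the false-positive rate
    power:      P(significant | H1), the true-positive rate
    """
    p_sig_and_null = alpha * prior_null
    p_sig_and_alt = power * (1 - prior_null)
    return p_sig_and_null / (p_sig_and_null + p_sig_and_alt)


# Hypothetical scenarios: the same alpha gives very different posteriors.
for prior_null, power in ((0.5, 0.8), (0.5, 0.2), (0.9, 0.2)):
    posterior = prob_null_given_significant(prior_null, 0.05, power)
    print(f"P(H0) = {prior_null}, power = {power}: P(H0 | significant) = {posterior:.3f}")
```

The point is only that the answer moves with the prior and the power, so a fixed p < .05 cannot translate into a fixed "95% chance the effect is real".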

On the power thing, there have been many reviews in many different domains, and they tend to converge pretty strongly on the conclusion that most studies are underpowered. The classic paper is Cohen's (1962) review of social and abnormal psychology articles, and then Sedlmeier and Gigerenzer followed up about 20 years later and showed that power hadn't increased at all. More recent reviews all basically show the same thing--power hasn't budged (though there are some domains where people have been much better about this--e.g., population genetics, where people now routinely use samples in the tens of thousands). I report some power analyses for fMRI studies in this paper, and the results are not encouraging. Actually, I think if anything, most of the power reviews may even be too optimistic, because they tend to assume medium or large effect sizes, when in fact meta-analyses suggest that most effects are small. So while it certainly varies by domain, as a general rule of thumb, I think it's safe to assume that the average study is underpowered.

It would be nice if the prevalence of null results pushed people to routinely collect much larger samples, as you suggest; but as I discuss in the same paper, and as people like Ioannidis have pointed out, that's counteracted by the fact that smaller samples give biased effect size estimates, leading people to think that their effects are bigger than they are (and hence, to think they need smaller samples than they do).
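The bias described here can be seen in a quick simulation. This is a sketch under simplifying assumptions (normally distributed scores with a known SD of 1, a z-test, and only significant studies being "reported"); the true effect of 0.2 SD and the sample sizes are arbitrary illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist


def mean_significant_estimate(true_d, n, n_studies=2000, alpha=0.05, seed=0):
    """Average effect-size estimate among simulated studies that reached
    two-tailed significance (one-sample z-test, SD assumed known and 1)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_studies):
        d_hat = sum(rng.gauss(true_d, 1.0) for _ in range(n)) / n
        if abs(d_hat) * sqrt(n) > z_crit:
            significant.append(d_hat)
    if not significant:
        return float("nan")
    return sum(significant) / len(significant)


true_d = 0.2
for n in (20, 200):
    estimate = mean_significant_estimate(true_d, n)
    print(f"n = {n}: mean significant estimate = {estimate:.2f} (true effect {true_d})")
```

With a small sample, only estimates well above the true effect can clear the significance threshold, so the studies that "worked" overstate the effect; with a larger sample the inflation mostly disappears.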

The other problem is that people rarely actually attribute null results to low power; it's much more common to see invisible moderating variables invoked--e.g., "the fact that we didn't get the effect when the stimuli were faces rather than words may mean there are dissociable systems involved in processing words and faces." Power is not an intuitive concept, and it's hard to remember that a failure to obtain meaningful results often says more about what you didn't do (collect enough subjects) than what you did do (vary some experimental variable).

2010-12-11T23:04:24.728-05:00

@Tal -- I was being succinct when saying "95% chance of replicating". I'll stick with my t-test example. Significance at the p=.05 level means there's less than a 5% probability the two samples were drawn from the same population. So conversely there's a 95% chance the samples are from different populations. If the samples are from different populations, you should be able to replicate the effect, given sufficient power. Whether you will actually replicate the effect depends a great deal on the statistical power of the replication.

As far as whether we normally have .8 power ... I don't know what it ends up being in practice, but let's point out that .8 power is called "adequate" but failing to find real effects 20% of the time is actually kinda lousy. So I *hope* people have at least .8 power. I expect the fact that it is hard to publish null results also pushes people in the direction of collecting enough data to have sufficient power.

What makes you suspect otherwise?

2010-12-10T17:12:57.397-05:00

In physics, the rule is "An experiment doesn't prove a hypothesis unless it was formulated before the experiment." This means that experimental results like the one with apples must be replicated.

Of course, physicists have it easier, as in many cases they know what they expect. However, when someone says that they expected to get what they actually got, they are misleading the reader into thinking that the hypothesis is proved according to this rule. OTOH, if this is standard practice, it isn't dishonest in the same way that "Pleased to meet you" can't really be a lie.

2010-12-10T15:56:29.797-05:00

Excellent post! I'm just a lowly undergraduate, but I feel as though the three biggest issues in the scientific method are a) biases, b) the incentive to publish positive results, and c) the incentive to make money.

The combination of all three, I think, accurately describes the problems we are facing when referencing the failure to replicate and the major flaws of the scientific process.

The question remains: how do we remove these incentives from the scientific process?

2010-12-10T15:07:25.714-05:00

@ Michael B -- Interesting point. I hadn't considered that analogy. We typically think of our statistical tests as predictors of replicability. But there is also a sense in which we are fitting a model, and you're right: the test of a model is not its ability to model the data it was fitted to but its ability to model a new set of data.

2010-12-10T14:00:33.845-05:00

Great post! I really enjoy your blog. You are absolutely right about the replication before publishing --- in fact, this is exactly what is known as a test set in machine learning. And you can't just publish the "training set" results to show that your methods work. I cannot understand why it is not the standard practice in scientific publications in general.