Field of Science

Poll: Do You Care about Effect Size?

My recent post on false positives has generated a long thread, with a large number of informative comments from Tal, who has convinced me to think a lot more about power analyses. I recommend reading the comments.

One issue that has come up is if and when we actually care about the null hypothesis. I argue that a fair amount of the time we really are deeply interested in knowing whether an effect exists or not. I don't entirely understand Tal's argument -- I'm sure he'll help out in the comments -- but I think he is saying that in any given experiment, there are always confounds such that if you have enough power, you'll find a significant result. So whether or not the manipulation has its intended effect, the confounds will ensure that the null hypothesis is false.

Perhaps. Having run studies with thousands of participants and no significant effect, I'm skeptical that this is always true, but obviously the data we'd need to test his claim does not and never will exist. In any case, this is why we use converging methods: the undetected confounds in one method will (hopefully) not appear in the others, and across studies the truth will emerge.

Still, this discussion has led me to wonder: across fields, how often are people deeply interested in the existence or absence of an effect (as opposed to the size of the effect)? Please leave a comment with your field and how often you really are interested in the presence or absence of an effect. Examples are encouraged but not required.

I'm already on the record saying I am often interested in the existence of an effect and rarely care about its size. Below I give my examples.

Why I rarely care about effect size

Priming: Priming is expected to occur whenever two mental constructs share underlying structure or recruit the same underlying processes. There is a lot of interest in the underlying representations of certain verb forms. Verbs of transfer can be used two ways. Compare: John gave the book to Sally vs. John gave Sally the book. The order of the words changes and there either is or isn't a preposition.

In a number of experiments, Thothathiri and Snedeker asked whether hearing give in one form would make it easier for people to understand other verbs of transfer in the same form (e.g., send). On some theories, it should (due to shared structure between verbs). On some theories, it shouldn't (due to verbs not sharing structure). So the existence of the effect mattered. But what about effect size: how much of an effect should priming have? It's an interesting question, but irrelevant to the hypotheses this study was testing, and frankly nobody currently has any hypotheses one way or another.

Development: Thothathiri and Snedeker found the priming effect in adults. They also tested children. For any adult behavior, there is always the question of at what point in development the behavior should appear. This is a deep, interesting question, since some behaviors are (roughly-speaking) innate and some are learned and you'd expect the former to appear earlier than the latter. Again, there are theories that very strongly predict that young children should or should not show the same effect as adults. Once again, the existence of the effect matters.

What about the size? Again, nobody has any predictions, and effect size cannot be used to tease apart theories. Even if the effect were much smaller in children, that wouldn't really matter, since in general children are difficult participants to work with and their effects are often smaller because a certain number simply didn't understand the task.

Eyetracking: Many of my experiments use the Visual World Paradigm. The basic idea is that if you show people a picture and start talking about it, they will look at the parts you are talking about as you are talking about them. If there is a picture of a cat, a dog and a horse, and I say "dog," participants will look at the part of the picture with a dog. We can then use their eye movements to see how quickly people understood the word. So we're looking for the first point in time at which more people are looking at "dog" than you'd expect by chance. At any given time point, either there is an effect or there isn't -- and there had better be a point at which there isn't, such as before I said the word "dog"!

As far as effect size, though, it's not going to be the case that everyone is looking at the dog at any given time point (these effects are probabilistic). You'd expect somewhere between 50% and 80% of people to be looking at the dog. But as long as you have more than 33% looking at the dog (remember, there are 3 things to look at: the cat, the dog and the horse), that's an effect. As far as size...you can measure it, but it won't help you distinguish between existing theories, which is what a good experiment is supposed to do.
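To make the "more than chance" comparison concrete, here is a minimal sketch of a one-sided exact binomial test against the 1-in-3 chance level. It is not the analysis used in any actual Visual World study: the sample of 30 participants and the counts are made up for illustration, the function name is hypothetical, and treating each participant as a single independent yes/no observation at one time point is a simplifying assumption.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p): a one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


chance = 1 / 3        # three pictures: cat, dog, horse
n_participants = 30   # hypothetical sample size

# Hypothetical counts of participants looking at the dog at one time point.
for n_looking in (10, 15, 20):
    p_value = binomial_upper_tail(n_looking, n_participants, chance)
    share = n_looking / n_participants
    print(f"{n_looking}/{n_participants} looking ({share:.0%}): one-sided p = {p_value:.4f}")
```

The test answers only the existence question (is the proportion above 1/3?); it says nothing about whether 50% versus 80% matters theoretically.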

Etc.: It's easy to generate more examples. I'm pretty sure every study I've ever run has been of this sort, as are most of the studies I have read. Sometimes we're interested in knowing more than just whether an effect exists. Sometimes we also care about the direction. But existence in and of itself is a real question.

10 comments:

Josef said...

I think Andrew Gelman made important points here:
http://www.stat.columbia.edu/~gelman/research/published/power4r.pdf

Specifically, if you have no prior hypotheses about the size and direction of the effect, then you run the risk of having "vampirical" hypotheses: ones that cannot be killed by evidence. It seems to me that you are more likely to get any result at all than a result in a specific direction, and more likely to get a result in a specific direction than a result in a specific direction of a specific size. Moreover, if you could have written as convincing a paper with results where all the signs were flipped, and were different sizes, then why bother to have run an experiment in the first place?

I would guess that even if you don't have precise estimates about how large an effect will be, you have a guess as to the order of magnitude. It should be smaller than the largest effect size you've ever seen, and large enough to be the product of a biological system with finite precision. For example, I'd never believe a study with the result that there is a significant effect of 5 nanoseconds on latency to orient to an object, or a significant difference of 0.0025 in proportion looking times, etc. Even if such a result is statistically significant at p < 0.05, it is, in effect, not an effect.

This should go double for experiments which are not simply about architecture of the system, but may attempt to explain certain phenomena, like acquisition patterns, etc., with the hypothesis under examination.

GamesWithWords said...

@Josef:

if you could have written as convincing a paper with results where all the signs were flipped, and were different sizes, then why bother to have run an experiment in the first place?

Because you actually need to know the answer? Seriously, one probably typically has strong hypotheses about the directionality of effects, but not always. I work a lot on pronoun resolution, mainly looking at cases where a given pronoun could resolve to one object or another or neither. So all three possibilities are really up in the air (bias one direction, bias the other, or no bias) and I really do want to know which it is. And not just me: natural language processing is stymied right now by the fact that we know so little about how people interpret pronouns that there's no way of telling whether a computer is doing so correctly or incorrectly.

The point about effects which are too small to be believed is an interesting one. I suppose we're assuming that we know for sure what the smallest effect that is "the product of a biological system with finite precision" is, right? Can we make such claims?

Relatedly: let's say you got a 5 nanosecond effect consistently across many different experimental paradigms. What would you do then?

Just being devil's advocate.

Tim said...

For the record, I am 100% in the Gelman camp here (represented by Tal, apparently). All null hypotheses are false. All that matters is effect size.

In psychology, this can alternatively be expressed as: Every single thing in your brain affects every single other thing, to some degree or another. Seeing a particular prime word affects your probability of thinking about what to get your mother for Christmas, which affects the probability you respond more quickly to a word about gifts (or mind-wander to a slower RT). Counterbalancing doesn't help this, either: For any particular ordering of particular words, some set of non-random things happens in the brain that is different from any other ordering of the same words.

In group analyses, in particular, it is easy to see this is true. If you could measure all men and all women in the world on ANY trait, do you believe you could find a single one where they would be identical to infinity decimal places? No? Then you believe the null hypothesis that men and women are equivalent is false for every possible trait. All that is left is to ask whether the effect size is large enough to be meaningful.

tal said...

Thanks for bumping this! I posted my thoughts here. I think ultimately it boils down to what you mean when you say 'care about'. If you mean "are interested in thinking about," then sure, many psychologists don't care. But if you mean "are impacted by, and implicitly make assumptions about," then it's hard to see how we don't all care about effect sizes pretty much all the time.

GamesWithWords said...

Hey Tim --

Your male/female example is interesting. You're probably right that we'd find a difference (perhaps infinitely small) between male and female means for any measurement, even if we test all men and women in existence. But are we asking in our research whether our findings generalize to existing human beings? Perhaps, but then of course we aren't interested in predicting anything about human beings not yet born or human beings already dead, right?

On the other issues, I have the worrying feeling we're talking about different things. There seem to be at least two issues: whether the intended aspect of the manipulation -- as opposed to any confounds -- has an effect, and whether it's ever possible to eliminate confounds.

Your statement that "every single thing in your brain affects every single other thing, to some degree or another" seems to address the first issue. I think I agree, with emphasis on "some degree or another."

Would you agree in the physical world that every physical action affects everything else, to some degree or another? Now if you try to pick up a 1,000 ton brick, do you lift it off the ground to any degree? I suspect not, though we'd have to define "lift" and "it".

Thoughts?

Josef said...

So all three possibilities are really up in the air (bias one direction, bias the other, or no bias) and I really do want to know which it is.

But that's more exploratory, isn't it? What I was trying to say was that if you had hypothesis H, and hypothesis H could be confirmed (or H0 falsified or whatever) by an effect of any size of any sign, then why bother doing an experiment, because any experiment is guaranteed to have some effect of some size and sign. With the situation you described, it sounds like there is less of a specific hypothesis being tested, and more exploration as to the directionality of an effect. That is, there isn't some model or hypothesis at stake if there is an effect in the opposite direction, or a weak to no effect in an expected direction.

In a situation like that, effect sizes and directions actually sound more important than their mere existence (at least to me), especially given your allusion to applications to NLP. Say you ran an experiment, and there was a 0.495 probability of resolving to A, and a 0.505 probability of resolving to B, and the difference is significant. How much of human behavior have you actually captured with your experimental manipulations? And how worthwhile to an engineer would it be to spend their time accounting for this effect in their parser?
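As a rough illustration of the 0.495 versus 0.505 example, the sketch below asks how many trials it would take for an observed proportion of 0.505 to differ significantly from 0.5. It assumes a two-sided one-sample z-test (normal approximation) at alpha = 0.05 and treats the observed proportion as exactly 0.505; the function names are hypothetical and nothing here comes from a real dataset.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_for_proportion(p_hat: float, n: int, p_null: float = 0.5) -> float:
    """z statistic for an observed proportion p_hat from n trials against p_null."""
    return (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n)


def min_n_for_significance(p_hat: float, p_null: float = 0.5, alpha: float = 0.05) -> int:
    """Smallest n at which |z| reaches the two-sided critical value."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil((z_crit / abs(p_hat - p_null)) ** 2 * p_null * (1 - p_null))


n_needed = min_n_for_significance(0.505)
print(f"trials needed for 0.505 vs 0.5 to reach p < 0.05: {n_needed}")
print(f"z at n = 1000:       {z_for_proportion(0.505, 1000):.3f}")
print(f"z at n = {n_needed}: {z_for_proportion(0.505, n_needed):.3f}")
```

Under these assumptions the required sample runs to tens of thousands of trials, which is the sense in which a difference can be statistically significant while capturing very little of the behavior.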

As for precision, I was assuming that there is some finite limit to human precision that, while it might not be knowable, might be estimable. At least, your effects should be larger than what other uncontrollable effects (microvariation in environmental temperature, lighting, etc.) might cause. But there's also the issue of the precision of our tools. If the stats returned a significant latency effect that was smaller than the sampling rate of the eyetracker, or a response time effect that was small enough to have been caused by noise between the button press mechanism and the computer, the effect isn't "real."

As for what I would do about a consistent 5 nanosecond effect that was definitely "real," I guess it would depend on how important this was for theory building and evaluation. If I studied neurons, it'd probably be really important. Otherwise, I'd have to decide how sensitive my theory was to 5 nanoseconds. That is, could it produce a 5 nanosecond difference? Also, how large are other effects in other experiments that my theory is supposed to also generate?

I think those last two questions are really important for any effect of any size, actually. And really, we're not talking about an in-principle-5-nanosecond effect, but probably something like, "is 3% a real difference?" If all the other experiments on the phenomenon have turned up 5% to 6% differences, then 3% is more likely to be "real" than if the other experiments have turned up 30% effects.

GamesWithWords said...

@Josef:

But that's more exploratory, isn't it?

Maybe? In this particular case, there are multiple theories out there that give clashing predictions: some predict subject-resolution and some object-resolution. And the null hypothesis is neither.

What I was trying to say was that if you had hypothesis H, and hypothesis H could be confirmed (or H0 falsified or whatever) by an effect of any size of any sign, then why bother doing an experiment, because any experiment is guaranteed to have some effect of some size and sign.

I suspect we're talking about different things. Read on.

GamesWithWords said...

There have been a lot of really insightful comments to this and the last post by some very smart people who number among my favorite bloggers and researchers. The fact that we seem to be disagreeing suggests either that I'm completely wrong or that we're talking about different things. I'm not yet willing to give up on being right, so my working hypothesis is that we're talking somewhat past each other and focusing on different issues. I'm going to try to lay out where I think the issues are, and you let me know whether you agree.

In any given experiment, we're measuring the effect of three things: (a) the intended manipulation, (b) any confounds of the manipulation, and (c) random preexisting differences in populations. For instance, I might want to know whether verbs are harder to learn than nouns, so I try to teach people both. My intended manipulation (a) is part of speech. Then there might be some confounds (b) of the manipulation (e.g., if this is a between-subjects study, differences between the subjects). Then it may also be the case that existing verbs and nouns differ in some nonnecessary way (c) such that if I tested all existing nouns and verbs, the nouns would be harder simply because existing nouns are harder, not because nouns are necessarily harder.

The hypothesis that the intended manipulation always has some effect that could be observed with sufficient power is trivially false, as hopefully the ESP examples have convinced you. Given that the brain is a complex causal system, many intended manipulations may have tiny effects that could be observed with sufficient power but which we might not be interested in -- I think both Tim and Tal, at least, are concerned with such effects. I think it's an empirical question just how many there are. They have analogies that suggest they should be rampant; I have analogies that suggest they could be quite rare. I don't think we'll solve this by analogy.

[continued below]

GamesWithWords said...

[continued from above]

The third issue (c) I tackled in my comment to Tim. It is a concern, but primarily when you are measuring existing subjects and items, whereas much of the time we're interested in possible subjects and items. There can be no accidental differences between possible verbs and possible nouns. Any differences, by definition, are due to their noun-ness and verb-ness ... or due to confounds, which brings up:

The second issue is one of confounds of manipulation. There are differences between verbs and nouns that are not necessarily due to their verbness or nounness. In any given sample of nouns and verbs, you can never equate exactly for frequency or phoneme length, etc.

I believe several of the commenters have argued that even if there is no effect of the intended manipulation (a), there will always be at least one confound such that if you have a large enough sample, you'll eventually get a significant effect of that confound. For instance, Josef writes:

any experiment is guaranteed to have some effect of some size and sign

I'll grant that it's true that whenever you get a significant effect, you have to determine whether it's due to your intended manipulation or some confound. That's one of the oldest problems in science and one of the most difficult.

I *think* what commenters like Josef mean when they say "effect size is all that matters" is that you can use effect size to rule out some effects as being so small or so large that they are probably not the result of the intended manipulation. That would be a heuristic like Occam's Razor, not a necessary fact, since you could always be wrong about the size of effect your intended manipulation should have. Also, it'll only really help in extreme cases -- where the observed effect is orders of magnitude too large or too small.

If I've got anyone's position wrong, please clarify. I'll clarify my own. When I say that I don't care about effect size, I mean I don't care about the size of any effect that is the result of the intended manipulation. To the extent that we can use effect size to identify the result of confounds, then that is a nice heuristic (as long as you don't get carried away and reject "real" effects).

GamesWithWords said...

PS Tal -- for some reason Blogger is convinced you're a spammer. Almost all your comments go into my Spam box. Not sure why. I've found Blogger's new spam filter more aggravating than helpful.