Field of Science

Qing Wen!

In the process of encouraging more Americans to study Spanish rather than Mandarin, Nicholas Kristof notes that in Mandarin

there are thousands of characters to memorize as well as the landmines of any tonal language.
How true! How true! Kristof shortly proves the latter point in more ways than one:
The standard way to ask somebody a question in Chinese is “qing wen,” with the “wen” in a falling tone. That means roughly: May I ask something? But ask the same “qing wen” with the “wen” first falling and then rising, and it means roughly: May I have a kiss?

Just one possible reaction if you use the wrong tone.

Kristof is right, so long as you don't mind sounding like a speech synthesizer. The classic description of third tone is a falling tone followed by a rising tone, but in practice it is relatively rare to pronounce the second half (the rising tone), particularly in fluent speech (in Taiwan, anyway; China has a lot of regional variation in Mandarin, so I don't know whether this holds everywhere). Figuring out when to pronounce the full tone and when not to is just one of many issues L2 Mandarin speakers run into.

Actually, third tone is worse than I just suggested. Qing wen is a good example, because the qing is also in third tone. When there are two third tones in a row -- as there are in the qing3 wen3 that means "may I have a kiss?" (I'm writing in the tones with numbers here) -- the first one is pronounced as if it were second tone (start low and rise high). So even though qing technically doesn't change, its pronunciation depends on which wen you are using.
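For the programmers in the audience, here is a minimal Python sketch of the two-syllable rule just described. The function name and the (pinyin, tone) tuple format are made up for illustration, and the sketch deliberately refuses anything longer than two syllables, since longer runs of third tones need more than this simple rule.

```python
def apply_pair_sandhi(syllables):
    """Apply third-tone sandhi to a two-syllable sequence.

    Each syllable is a (pinyin, tone) tuple such as ("qing", 3).
    Hypothetical helper: only the two-syllable case is covered.
    """
    if len(syllables) != 2:
        raise ValueError("this sketch only covers two-syllable sequences")
    (first, tone1), (second, tone2) = syllables
    # Two third tones in a row: the first is pronounced as second tone.
    if tone1 == 3 and tone2 == 3:
        tone1 = 2
    return [(first, tone1), (second, tone2)]


ask = apply_pair_sandhi([("qing", 3), ("wen", 4)])   # "may I ask?"
kiss = apply_pair_sandhi([("qing", 3), ("wen", 3)])  # "may I have a kiss?"
print(ask)
print(kiss)
```

The written tones stay the same in both cases; only the pronounced tone of qing differs.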

If you have three or more third tones in a row (e.g., ni3 you3 hao3 gou3 gou3 ma0?), deciding which syllables will be pronounced as if they were second tone is a complicated issue. I'd explain it to you, but I don't actually know myself. I've been told you actually have some flexibility in what you do, but I'm not sure that wasn't just another way of saying, "Sorry, I can't really explain it to you."


I've been an advocate for and member of Zipcar since my wife and I moved to Boston four and a half years ago. For that time period, I thought Zipcar was in every way superior to owning a car. Until last week, anyway. Now I'm reconsidering the car ownership issue.

Own or rent? That's the question.

It begins

I was already unhappy early in the week, having discovered Zipcar had overcharged us $375 over the past few months. At the beginning of August, I had added my wife as a driver on our account (we'd always had just one driver to save on the yearly membership fee) and also upgraded us to a fixed $75/month plan (which has some added benefits), having noticed that we'd spent more than $75 pretty much every month for the last year. I confirmed carefully with the representative that we would only be billed $75 combined, not $75 each.

It turns out, the hapless representative, rather than simply putting two of us on one account, made two accounts and put us each on both, and then charged us each every month. I didn't notice earlier as the charges appeared on different accounts, and I thought we only had one account, but the credit card charges were looking suspiciously large. It took a series of emails and a phone call to get that straightened out. They eventually agreed to refund us the bogus charges "as a one-time courtesy." That's a direct quote.


We had an overnight trip to our favorite New England B&B for Christmas weekend (seriously, this place is fantastic and has one of the best restaurants I've been to anywhere in the world).

Our room was even more charming than it looks. Yes, that's a working fireplace.

As usual, I booked a Zipcar for the purpose. I believe it was a Nissan Sentra, helpfully parked in our apartment building's garage. On Thursday, I got an email from Zipcar saying that due to an unforeseen circumstance, they were bumping us to a Civic in the Government Center garage.

The exterior of the building is distinctive.

I've rented cars once or twice from that location. During the day, it's ok. At night, it's spooky as hell. I'd say it's deserted, but there are occasionally roving bands of teens doing who knows what.

I couldn't find a good picture of the interior.
This illustration is true to the spirit of the place.

So I emailed Zipcar, explaining that I didn't really like returning cars to that location at night, so was there maybe another car nearby I could use. Or could the Civic be relocated somewhere more pleasant for a couple days?


I don't know if anybody read that email, since I never got a reply. I did get another email on Friday, though, saying that due to another unforeseen circumstance, my reservation had been moved to a Smart Car parked in Somerville (a 15-20 minute drive from where we live).

Did I mention that the B&B was a 3 hour drive away? I didn't really want to drive a Smart Car on the highway for 3 hours, and I wasn't sure our stuff would fit (I had planned on bringing skis). So I emailed, saying if it was a choice between a Civic parked in the Dungeon of Despair or a Smart Car, I'd take my chances with the dungeon.

This time, I got a quick email saying there were no other cars available. So I called and explained my situation. A very polite representative explained that there really were no other cars nearby, but would I take a Mazda 3 in Arlington (3 suburbs out from Boston, where I live)? They'd pick up the cab fare. Either that, or I could have $200 towards some other form of transportation. Or I could cancel my reservation. $200 wasn't going to cover a standard car rental (I checked), and the B&B reservation was nonrefundable (plus we'd been looking forward to it), so we went with the car in Arlington.


We took a cab out to Arlington in the morning. It took maybe 20 minutes (we helpfully live next to I-93, making getting out of the city easy -- Thank you, Tip O'Neill and Ted Kennedy) and cost $33.75.

We'd only just gotten back onto I-93 in the direction of Vermont when there was a loud pop under the car and it sounded like something was dragging. On inspection (I took the next exit and pulled over), there was some piece of plastic hanging loose. The plastic itself didn't look problematic, but I wasn't sure what it had previously been holding in place. So I called Zipcar.

The representative agreed that the car was not safe to drive and asked us to return the car to its original location, and could we perhaps take a Prius from Wellesley College instead? If we needed a cab ride, they'd cover the fare. I pointed out that (a) we were already running pretty late, and (b) a cab fare to Wellesley was going to be pretty serious, esp. on top of our cab fare to Arlington, so could I just drive the car to Wellesley, drop it off there and take the Prius? She said that wasn't possible, since when the mechanics came out to service the Mazda, they wouldn't know where it was. I said if that was the problem, I was happy to tell the mechanics that the car was in Wellesley.

She put me on hold.

After a brief wait, she came back on the line to apologize, saying she hadn't gotten permission to drop the car off in Wellesley. Did I want to make the switch anyway? Or there was another car in Salem, MA, if we wanted.

Witch trials: Popular entertainment in Salem, MA

I've actually wanted to see Wellesley for a while (I like college towns), so we went with Wellesley. We called yet another cab (have I mentioned this was Christmas? Not a lot of cabs wanting to go to a deserted college town) and went to Wellesley. That fare was $66.05, including a Christmas-appropriate tip.

Finally, we got the Prius and set off. We had been traveling for 2 1/2 hours and were now farther from our destination than when we started.


Our itinerary: A: Home, B: Mazda in Arlington, C: Roughly where the car broke, D: Back in Arlington, E: Wellesley College. Vermont is to the north.


The rest of Saturday was pretty good, and the B&B was everything we remembered (dinner, which was, as always, excellent, included what may be the perfect bread, from Orchard Hill Breadworks). Sunday morning we heard rumors that a serious blizzard was heading our way, though it wasn't expected to be bad until evening (before we had left, the weather report had put the chance of snow at only 30%). We got a slightly earlier start than we had planned and stopped at a few places on the way back.

Somewhere around 3:30 or 4:00, we entered Massachusetts and it began to snow. The state had put up blizzard warnings on the roads, requesting everyone to get off the road and go home. If we had been going straight home -- as we would have were we driving that much-mourned Sentra -- that wouldn't have been a problem. But we were going to Wellesley.

At least, we tried. As we neared Wellesley, the snow got very bad, and I frankly wasn't that comfortable driving, particularly once we left the highway and the streets weren't as well-plowed. My wife called a cab company to make sure we could get a ride back to Boston. They agreed to take us, but then called back shortly thereafter, saying that nobody was driving anywhere, didn't we know there was a blizzard going on? I think they made the right decision.

I did know there was a commuter rail station in Wellesley. We called the MBTA to see if the trains were still running. They said the trains would most likely run, but with significant delays. The next one wouldn't be for 3 hours. Oh, and the station we'd be waiting at is outside. In a blizzard.

As a backup plan, we checked to see if there was anywhere we could stay the night in Wellesley. However, as Wellesley doesn't have any hotels, there appeared to be only one option:

The only available room in Wellesley last night.

We called up Zipcar to consult. They agreed to let us leave the car in Boston at no charge, as long as we told them where the car was. I'll give them credit for that decision, at least. We drove home, very slowly.


I wish I could say the story was over, but this morning I checked my email and saw that we were billed, not only for the Prius, but also for the Mazda 3 (the one that broke down). Plus, there was a late fee for returning the Mazda late. It seems that when the representative switched our rental from the Mazda to the Prius, she did so before we actually got back to Arlington. I sent in another email this morning. We can hope that they'll remove those charges "as a one-time courtesy."

I realize that owning a car has its own hassles. I don't expect Zipcar to be perfect, either. Everyone's allowed a bad week. As long as it's just one week. And as long as they give me my money back.

What did I learn from this experience? What I learned -- and what you should take home from this as well -- is to get your bread from Orchard Hill. Because it is fucking awesome bread.

*Update: Tuesday*

This morning I got a call from someone higher up in customer service at Zipcar, who listened to the whole story. She took the numbers for the cab rides in order to reimburse us directly, rather than my having to send in the receipts, which was nice. She also comped the entire weekend trip and added a $50 driving credit, which I also appreciate. The part I cared about more was that she at least seemed very interested in improving the service so that such problems would not be repeated, or would be mitigated more quickly when they do occur. If this reflects a real commitment to efficient service, then hopefully this last week is an aberration, and we'll be able to go back to trusting and relying on Zipcar, as we have in the past.

*Another Update: Tuesday*

Now the vice president for member services has called to apologize in addition. It's great that they take this stuff seriously. I was going to take a temporary break from Zipcar and use a regular rental car company for some upcoming stuff, but now I think I'll give them another shot.

Crowdsourcing My Data Analysis

I just finished collecting data for a study. Do you want to help analyze it?


What makes a pun funny? If you said "nothing," then you should probably skip this post. But even admirers of puns recognize that while some are sublime, others are ... well, not.

Over the last year, I've been asking people to rate the funniness of just over 2300 different puns. (Where did I get 2300 puns? The user-submitted site PunoftheDay. PunoftheDay also has funniness ratings, but I wanted a bit more control over how the puns were rated and who was rating them.)

Why care what makes puns funny?

There are three reasons I ran this experiment. I do mostly basic research, and while I believe in its importance and think it's fun, the idea of doing a project I could actually explain to relatives was appealing. I was partly inspired by Zenzi Griffin's 2009 CUNY talk reporting a study she ran on why parents call their kids by the wrong names (typically, calling younger children by elder children's names), work which has now been published in a book chapter.

Plus, I was just interested. I mean: puns!

Finally, I was beginning a line of work on the interpretation of homophones. One of the best-established facts about homophones is that we very rapidly suppress context-irrelevant meanings of words -- in fact, so rapidly that we rarely even notice. If your friend said, "I'm out of money, so I'm going to stop by the bank," would you really even notice considering that bank might mean the side of a river?

A river bank. 
photo: Istvan, creative commons 

A successful pun, on the other hand, requires that at least two meanings be accessed and remain active. In some sense, a pun is homophone processing gone bad. By better understanding puns, I thought I might get some insight into language processing.


As already mentioned, my first step down this road was to collect funniness ratings for a whole bunch of puns. I popped them into a Flash survey, called it Puntastic, and put it on the Games With Words website. The idea was to mine the data and try to find patterns which could then be systematically manipulated in subsequent experiments.

It turns out that there are a lot of ways that 2300 puns can be measured and categorized. So while I have a few ideas I want to try out, no doubt many of the best ones have not occurred to me. Data collection was crowdsourced, and I see no reason why the analyses shouldn't be as well.

I have posted the data on my website. If you have some ideas about what might make one pun funnier than another -- or just want to play around with the data -- you are welcome to it. Please post your findings here.

If you are a researcher and might use the data in an official publication, please contact me directly before beginning analysis (gameswithwords$at* just so there aren't misunderstandings down the line. Failure to get permission to publish analyses of these data may be punished by extremely bad karma and/or nasty looks cast your way at conferences.

The results so far...

Unfortunately for the crowd, I've already done the easiest analyses. The following are based on nearly 800 participants over the age of 13 who listed English as both their native and primary languages (there weren't enough non-native English speakers to conduct meaningful analyses on their responses).

The average was 2.6 stars out of 7 (participants could choose anywhere from 1 to 7 stars, as well as "I don't get it," which was scored as -1 for these analyses), which says something either about the puns I used or the people who rated them.

First I looked at differences between participants to see if I could find types of people who like puns more than others. There was no significant difference in overall ratings between men and women.

I also asked participants if they thought they had good or poor social skills. There was no significant difference there, either.

I also asked them if they had difficulty reading or if they had ever been diagnosed with any psychiatric illnesses, but neither of those factors had any significant effect either (got tired of making graphs, so just trust me on this one).

The effect of age was unclear.

It was the case that the youngest participants produced lower ratings than the older participants (p=.0029), which was significant even after a conservative Bonferroni correction for 15 possible pairwise comparisons (alpha=.0033). However, the 10-19 year-olds' ratings were also significantly lower than the 20-29 year-olds' (p=.0014) and the 30-39 year-olds' (p=.0008), but obviously this was not true of the 40-49 year-olds' or 50-59 year-olds' ratings. So it's not clear what to make of that. Given that the overall effect size was small and that this is an exploratory analysis, I wouldn't make much of the effect without corroboration from an independent data set.
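For readers who want to see where the alpha=.0033 threshold comes from, here is a minimal Python sketch of the Bonferroni arithmetic. The 15 comparisons and the three p-values are the ones reported above; the dictionary labels are just shorthand for this illustration.

```python
# Bonferroni correction: divide the usual alpha by the number of comparisons.
n_comparisons = 15
corrected_alpha = 0.05 / n_comparisons

# p-values as reported in the text; the labels are informal shorthand.
reported_p_values = {
    "youngest vs. older": 0.0029,
    "10-19 vs. 20-29": 0.0014,
    "10-19 vs. 30-39": 0.0008,
}

# A comparison survives the correction if its p-value is below the new alpha.
survives = {label: p < corrected_alpha for label, p in reported_p_values.items()}

print(round(corrected_alpha, 4))
for label, ok in survives.items():
    print(label, "survives correction" if ok else "does not survive")
```

All three reported comparisons fall under the corrected threshold, which is why the caveats about effect size and exploratory analysis matter more than the p-values themselves.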

The funniest puns

The only factor I've looked at so far that might explain pun funniness is the length of the joke. I considered only the 2238 puns for which I had at least 5 ratings (which was most of them). I asked whether there might be a relationship between the length of the pun and how funny it was. I could imagine this going either way, with concise jokes being favored (short and sweet) or long jokes having a better lead-up (the shaggy dog effect). In fact, the correlations between pun ratings and length, whether in terms of number of characters (r=.05) or number of words (r=.05), were both so small I didn't bother to do significance tests.
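If you want to try this kind of analysis on the posted data yourself, the following is a rough Python sketch of a length-versus-rating correlation. The puns and ratings in it are placeholders invented for the example (they are not from the data set), and the format of the real data file may differ, so treat the loading step as yours to fill in.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Placeholder (text, mean rating) pairs, invented purely for illustration.
toy_puns = [
    ("Tiny.", 2.2),
    ("A short pun.", 3.1),
    ("A slightly longer placeholder pun.", 2.4),
    ("This placeholder pun goes on for quite a while before the punchline.", 2.9),
]

ratings = [rating for _, rating in toy_puns]
char_lengths = [len(text) for text, _ in toy_puns]
word_lengths = [len(text.split()) for text, _ in toy_puns]

r_chars = pearson_r(char_lengths, ratings)
r_words = pearson_r(word_lengths, ratings)
print(r_chars, r_words)
```

With only four made-up rows the resulting numbers mean nothing; the point is just the shape of the calculation.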

I broke up the puns into five groups according to length to see if maybe there was a bimodal effect (shortest and longest jokes are funniest) or a Goldilocks effect (average-length jokes are best). There wasn't.

In short, I can't tell you anything about what makes some people like puns more than others, or why people like some puns more than others. What I can tell you is which puns people did or didn't like. Here are the top 5 and bottom 5 puns:

1. He didn't tell his mother that he ate some glue. His lips were sealed.
2. Cartoonist found dead in home. Details are sketchy.
3. Biologists have recently produced immortal frogs by removing their vocal cords. They can't croak.
4. The frustrated cannibal threw up his hands.
5. Can Napoleon return to his place of birth? Of Corsican.
2234. The Egyptian cinema usherette sold religious icons in the daytime. Sometimes she got confused and called out, 'Get your choc isis here!'
2235. Polly the senator's parrot swallowed a watch.
2236. Two pilgrims were left behind after their diagnostic test came back positive.
2237. In a baseball season, a pitcher is worth a thousands blurs.
2238. He said, "Hones', that is the truth', but I knew elide.

Ten points to anyone who can even figure out what those five puns are about. Mostly participants rated these as "I don't get it."

BTW Please don't take from this discussion that there haven't been any serious studies of puns. There have been a number, going back at least as far as Sapir of the Sapir-Whorf hypothesis, who wrote a paper on "Two Navaho Puns." There is a well-known linguistics paper by Zwicky & Zwicky and at least one computer model that generates its own puns. However, I know a lot less about this literature than I would like to, so if there are any experts in the audience, please feel free to send me links.

Paper submitted

I just submitted a new paper on pronoun resolution ("Do inferred causes really drive pronoun resolution"), in which I argue that a widely-studied phenomenon called "implicit causality" has been misanalyzed and is in fact at least two different phenomena (as described in this previous post). You can find the paper on my publications page. Comments are welcome.

I always find writing up methods and results relatively easy. The trick is fitting the research into the literature in a way that will make sense and be useful to readers. That is, while the narrow implications are often clear, it's not always obvious which broader implications are most relevant. In this case, the paper has clear implications for the few dozen people who study implicit causality, but one would like people beyond that small group to also find the results relevant.

I tried a few different approaches before ultimately settling on a particular introduction and conclusion. I was curious how much the paper had changed from the first draft to the last.

Here's the first draft, according to Wordle:

Here's draft 2:

The most obvious difference is that I hyphenated a lot more in the final draft (I was trying to make the word limit). But it doesn't appear that the changes in theme -- as measured by Wordle -- were all that drastic. That's either a good sign (my paper didn't lose its soul in the process of editing) or a bad sign (I didn't edit it enough).

I guess we'll see when the reviews come back in.

Do you speak Japanese?

Do you speak Japanese?

If so, I've got an experiment for you. A while back I presented some results from a project comparing pronoun processing in English, Spanish, Mandarin and Russian. We're also testing Japanese. So if you speak Japanese and have a few minutes, please follow this link. Even better, if you know someone who is a fluent Japanese speaker -- or, better yet, a native Japanese speaker -- please send him/her the link.

If you speak English -- and you probably do if you're reading this post -- and have never participated in any of my English pronoun experiments, you can follow this link. These experiments usually take less than 5 minutes.

Huh? Pronoun processing?

For those of you wondering what I could possibly be studying, the interesting thing about pronouns is that their meaning changes wildly depending on context. Given the right context, she can refer to any female (and some things that aren't actually female, like ships). That isn't true of proper names (Jane Austen can only be used to refer to one person).

Some theories state that we learn language-specific cues that help us figure out what a given pronoun in a given context means. Other theories state we use general intelligence to pull off the feat. On the second theory, if you use the same contexts in different languages, people should interpret pronouns the same way. On the first theory, that isn't necessarily the case.

(Obviously I'm being cagey here in terms of how exactly we're manipulating context in the experiment, since I don't want to bias any potential participants.)

More posts on pronouns: here, here and here.

missing 2

One of the formulas in the last post was missing a 2. Everything has now been recalculated. Some numbers changed. The basic result is that some of the numbers are not quite as dire as I had stated: the original example experiment, which had 15 participants per condition and an effect significant at p=.05, has a 51% chance of replicating (in the sense of producing another significant p-value when re-run exactly), again assuming the effect was real and the effect size is as measured in the first experiment.

When should an effect be called significant?

note: This post originally contained an error in the mathematics, which Tal of Citation Needed kindly pointed out. This error has been corrected.

In the thread following my earlier post on false positives, Tal made the observation that a typical study that is significant at the p=.05 level has a 50% chance of being replicated.

It turns out that this depends heavily on what you mean by replicate. I'm going to work through some numbers below. Again, stats isn't my specialty, so please anyone jump in to correct errors. But I think I've got the general shape of the issues correct.

I got a significant result! Can I get it again?

Let's say you ran an experiment comparing the IQ scores of 15 people who prefer kiwi with the IQ scores of 15 people who prefer rambutan. You find that people who prefer rambutan have IQs 11.2 points higher than those who prefer kiwi. Assuming the standard deviation is 15 (which is how IQ tests are normalized), that should give you a t-value of 11.2 / (15 * (2/15)^.5) = 2.04 and a p-value of about .05.
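To make the arithmetic easy to check, here is a small Python sketch of that t-value calculation. The function name is made up, and it assumes two groups of equal size with the standard deviation taken as given (15) rather than estimated from the sample.

```python
from math import sqrt


def t_for_equal_groups(mean_difference, sd, n_per_group):
    """t-value for a difference between two equal-sized groups.

    Assumes both groups share the same standard deviation `sd`.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    return mean_difference / standard_error


t_value = t_for_equal_groups(mean_difference=11.2, sd=15, n_per_group=15)
print(round(t_value, 2))  # 2.04
```

Plugging in a different number of participants per condition shows how the same 11.2-point difference produces a larger t-value as the sample grows.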

So you've now got a significant result! You tell all your friends, and they try running the same experiment. What are the chances they'll get the same results, significant at the p=.05 level?

The chances are not great. Even assuming that the underlying effect is real (rambutan-eaters really are smarter), your friends will only replicate your result about 51% of the time, assuming they use exactly the same methods (according to a nifty power calculator found online here).
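If you don't trust online calculators, one way to sanity-check a figure like this is to simulate the replication many times. The Python sketch below is my own illustration, not the calculator's method: it assumes normally distributed IQs, a true difference of exactly 11.2 points, and a two-tailed critical t of 2.048 for 28 degrees of freedom. The function names, the seed and the number of simulated experiments are arbitrary choices.

```python
import random
from math import sqrt


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    v = sum((x - m) ** 2 for x in values) / (n - 1)
    return m, v


def simulate_replication_rate(true_difference, sd, n_per_group,
                              t_critical, n_simulations=10000, seed=1):
    """Fraction of simulated experiments with |t| above the critical value."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_simulations):
        kiwi = [rng.gauss(100, sd) for _ in range(n_per_group)]
        rambutan = [rng.gauss(100 + true_difference, sd)
                    for _ in range(n_per_group)]
        mean_k, var_k = mean_and_variance(kiwi)
        mean_r, var_r = mean_and_variance(rambutan)
        # Pooled variance for equal group sizes, then the usual two-sample t.
        pooled_variance = (var_k + var_r) / 2
        t = (mean_r - mean_k) / sqrt(pooled_variance * 2 / n_per_group)
        if abs(t) > t_critical:
            significant += 1
    return significant / n_simulations


rate = simulate_replication_rate(11.2, 15, 15, t_critical=2.048)
print(rate)
```

The simulated rate should land close to one half, in line with the figure quoted above, though the exact value will wobble with the seed and the number of simulations.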

Define "get it"

Of course, we were assuming above that rambutan-eaters really are 11.2 IQ points smarter than kiwi-eaters (BTW I like both kiwi and rambutan, so nothing is meant by this example). In which case, your friends might not have gotten results significant at the p=.05 level, but they very likely found higher average IQs for their samples of rambutan-eaters relative to kiwi-eaters. And of course, what we really care about is how easy it will be to replicate the rambutan/kiwi difference, not how easy it will be to get the significant p-value again. The point of science is not to be able to predict statistically-significant differences but simply to predict differences. It's well beyond my statistical abilities to say how often this would happen, but hopefully someone will step up in the comments and let us know.

In practice, though, other people are only going to follow up on your effect if they can replicate it at the standard p=.05 level. What can we do to improve the chances of replicability?

Lower alphas

Let's suppose your effect had been significant at the p=.01 level. We can manage that while keeping the effect-size the same (11.2 IQ points) if we increase our sample to 26 kiwi-eaters and 26 rambutan-eaters (t = 11.2/(15 * (2/26)^.5) = 2.7). Now our chance of getting another significant result at the p=.01 level is ... 52%.

But we don't really care about getting a p=.01 again; we want to get the result again at the p=.05 level, which should happen around 76% of the time. Now, what if we had a result significant at the p=.001 level the first time around? We'd have needed about 42 subjects per condition. The chance of replicating that at the p=.05 level is 92%.

p-value   Subjects per condition   Likelihood of repeating at p=.05 level
.05       15                       51%
.01       26                       76%
.001      42                       92%
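The pattern in the table can be roughly reproduced with a back-of-the-envelope normal approximation. The Python sketch below is an approximation I'm adding for illustration, not the calculator used for the numbers above: it treats the test statistic as normally distributed around its expected value, which slightly overstates the probabilities at small sample sizes. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_replication_probability(mean_difference, sd, n_per_group, alpha=0.05):
    """Normal approximation to the chance of a two-tailed significant result."""
    standard_normal = NormalDist()
    expected_t = mean_difference / (sd * sqrt(2 / n_per_group))
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Probability of landing beyond either critical value.
    upper_tail = standard_normal.cdf(expected_t - z_critical)
    lower_tail = standard_normal.cdf(-expected_t - z_critical)
    return upper_tail + lower_tail


approximations = {n: approx_replication_probability(11.2, 15, n)
                  for n in (15, 26, 42)}
for n, probability in approximations.items():
    print(n, round(probability, 2))
```

The approximate values come out within a few percentage points of the table, and they show the same steep climb as the number of subjects per condition increases.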


Of course, there are things that I'm not considering here, like the possibility that your original experiment underestimated the effect size. For instance, let's say that the true effect size is 15 IQ points (which is a lot!). Now, your chance of finding an effect significant at the p=.05 level with only 15 participants per condition is 75%. That's a lot better than what we started with, though not perfect. To have an effect large enough to see it 95% of the time at the p=.05 level, it would need to be over 20 IQ points, which is a monstrously large effect. Incidentally, if you ran this study with 15 rambutan-eaters and 15 kiwi-eaters and found a 20 IQ point effect, that would be significant below the p=.001 level.

What I get from all this is that if you want a result that you and others will be able to replicate, you're going to want the p-value in your original experiment to have been lower than .05.

Poll: Do You Care about Effect Size?

My recent post on false positives has generated a long thread, with a large number of informative comments from Tal, who has convinced me to think a lot more about power analyses. I recommend reading the comments.

One issue that has come up is if and when we actually care about the null hypothesis. I argue that a fair amount of the time we really are deeply interested in knowing whether an effect exists or not. I don't entirely understand Tal's argument -- I'm sure he'll help out in the comments -- but I think he is saying that in any given experiment, there are always confounds such that if you have enough power, you'll find a significant result. So whether or not the manipulation has its intended effect, the confounds will ensure that the null hypothesis is false.

Perhaps. Having run studies with thousands of participants and no significant effect, I'm skeptical that this is always true, but obviously the data we'd need to test his claim does not and never will exist. In any case, this is why we use converging methods: the undetected confounds in one method will (hopefully) not appear in the others, and across studies the truth will emerge.

Still, this discussion has led me to wonder: across fields, how often are people deeply interested in the existence or absence of an effect (as opposed to the size of the effect)? Please leave a comment with your field and how often you really are interested in the presence or absence of an effect. Examples are encouraged but are unnecessary.

I'm already on the record saying I am often interested in the existence of an effect and rarely care about its size. Below I give my examples.

Why I rarely care about effect size

Priming: Priming is expected to occur whenever two mental constructs share underlying structure or recruit the same underlying processes. There is a lot of interest in the underlying representations of certain verb forms. Verbs of transfer can be used two ways. Compare: John gave the book to Sally vs. John gave Sally the book. The order of the words changes and there either is or isn't a preposition.

In a number of experiments, Thothathiri and Snedeker asked whether hearing give in one form would make it easier for people to understand other verbs of transfer in the same form (e.g., send). On some theories, it should (due to shared structure between verbs). On some theories, it shouldn't (due to verbs not sharing structure). So the existence of the effect mattered.  But what about effect size: how much of an effect should priming have? It's an interesting question, but irrelevant to the hypotheses this study was testing, and frankly currently nobody has any hypotheses one way or another.

Development: Thothathiri and Snedeker found the priming effect in adults. They also tested children. For any adult behavior, there is always the question of at what point in development the behavior should appear. This is a deep, interesting question, since some behaviors are (roughly-speaking) innate and some are learned and you'd expect the former to appear earlier than the latter. Again, there are theories that very strongly predict that young children should or should not show the same effect as adults. Once again, the existence of the effect matters.

What about the size? Again, nobody has any predictions, and effect size cannot be used to tease apart theories. Even if the effect were much smaller in children, that wouldn't really matter, since in general children are difficult participants to work with and their effects are often smaller because a certain number simply didn't understand the task.

Eyetracking: Many of my experiments use the Visual World Paradigm. The basic idea is that if you show people a picture and start talking about it, they will look at the parts you are talking about as you are talking about them. If there is a picture of a cat, a dog and a horse, and I say "dog," participants will look at the part of the picture with a dog. We can then use their eye movements to see how quickly people understood the word. So we're looking for the first point in time at which more people are looking at "dog" than you'd expect by chance. At any given time point, either there is an effect or there isn't -- and there had better be a point at which there isn't, such as before I said the word "dog"!

As far as effect size, though, it's not going to be the case that everyone is looking at the dog at any given time point (these effects are probabilistic). You'd expect somewhere between 50% and 80% of people to be looking at the dog. But as long as you have more than 33% looking at the dog (remember, there are 3 things to look at: the cat, the dog and the horse), that's an effect. You can measure the effect size, but it won't help you distinguish between existing theories, which is what a good experiment is supposed to do.

Etc.: It's easy to generate more examples. I'm pretty sure every study I've ever run has been of this sort, as are most of the studies I have read. Sometimes we're interested in knowing more than just whether an effect exists. Sometimes we also care about the direction. But existence in and of itself is a real question.

Slate calls for More Republican Scientists

Daniel Sarewitz of Slate worries that there aren't enough Republican scientists. Is it any wonder that Republicans don't trust science, if it's all coming from the laboratories of Democrats?

I don't know. Is it?

What would a Republican scientist look like? Would she accept the reality of man-made global warming? Evolution? Would she be aware that gay couples are at least as good parents as straight couples? Or that states that allow gay marriage and civil unions have lower divorce rates than those that don't? That true Keynesian economic stimulus works pretty well, and that tax cuts for the rich have little effect on the economy? That the health care system in Western Europe is cheaper and more effective than the American health care system?

As Colbert once noted, reality has a well-known liberal bias.

I strongly believe that communities are stronger when they are made up of people with diverse viewpoints. There is no benefit to a community made up of people with diverse facts.

Yes, we should be worried by the paucity of Republican scientists. But what does Sarewitz mean by saying we need to make the scientific community "more welcoming" to Republicans? If that means wearing elephant pins, I'll go along with it. If it means abandoning facts for fiction...

And for my next trick, I'll make this effect disappear!

In this week's New Yorker, Jonah Lehrer shows once again just how hard it is to do good science journalism if you are not yourself a scientist.

His target is the strange phenomenon that many high profile papers are failing to replicate. This has been very much a cause celebre lately, and Lehrer follows a series of scientific papers on the topic as well as an excellent Atlantic article by David Freedman. At this point, many of the basic facts are well-known: anecdotally, many scientists report repeated failures to replicate published findings. The higher-profile the paper, the less likely it is to replicate, with around 50% of the highest-impact papers in medicine failing to replicate. As Lehrer points out, this isn't just scientists failing to replicate each other's work, but scientists failing to replicate their own work: a thread running through the article is the story of Jonathan Schooler, a professor at UC-Santa Barbara who has been unable to replicate his own seminal graduate student work on memory.

Lehrer's focus in this article is shrinking effects.

No, not this one.

Some experimental effects seem to shrink steadily over time:
In 2001, Michael Jennions, a biologist at the Australian National University, set out to analyze "temporal trends" across a wide range of subjects in ecology and evolutionary biology. He looked at hundreds of papers and forty-four meta-analyses (that is, statistical syntheses of related studies), and discovered a consistent decline effect over time, as many of the theories seemed to fade into irrelevance.
As described, that's weird. But there is a good explanation for such effects, and Lehrer brings it up. Some results are spurious. It's just one of those things. Unfortunately, spurious results are also likely to be exciting. Let's say I run a study looking for a relationship between fruit-eating habits and IQ. I look at the effects of 20 different fruits. By chance, one of them will likely show a significant -- but spurious -- effect. So let's say I find that eating an apple every day leads to a 5-point increase in IQ. That's really exciting because it's surprising -- and the fact that it's not true is integral to what makes it surprising. So I get it published in a top journal (top journals prefer surprising results).

Now, other people try replicating my finding. Many, many people. Most will fail to replicate, but some -- again by chance -- will replicate. It is extremely difficult to get a failure to replicate published, so only the replications get published. After time, the "genius apple hypothesis" becomes part of established dogma. Remember that anything that challenges established dogma is exciting and surprising and thus easier to publish. So now failures to replicate are surprising and exciting and get published. When you look at effect-sizes in published papers over time, you will see a gradual but steady decrease in the "effect" of apples -- from 5 points to 4 points down to 0.

Where I get off the Bus

So far so good, except here's Lehrer again:
While the publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the initial prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts.
Huh? Lehrer seems to be suggesting that it is publication that makes a result spurious. But that can't be right. Rather, there are just lots of spurious results out there. It happens that journals preferentially publish spurious results, leading to biases in the published record, and eventually the decline effect.

Some years ago, I had a bad habit of getting excited about my labmate's results and trying to follow them up. Just like a journal, I was interested in the most exciting results. Not surprisingly, most of these failed to replicate. The result was that none of them got published. Again, this was just a factor of some results being spurious -- disproportionately, the best ones. (Surprisingly, this labmate is still a friend of mine; personally, I'd hate me.)

The Magic of Point O Five

Some readers at this point might be wondering: wait -- people do statistics on their data and only accept a result that is extremely unlikely to have happened by chance. The cut-off is usually 0.05 -- a 5% chance of having a false positive. And many studies that turn out later to have been wrong pass even stricter statistical tests. Notes Lehrer:
And yet Schooler has noticed that many of the data sets that end up declining seem statistically solid--that is, they contain enough data that any regression to the mean shouldn't be dramatic. "These are the results that pass all the tests," he says. "The odds of them being random are typically quite remote, like one in a million. This means that the decline effect should almost never happen. But it happens all the time!"
So there's got to be something making these results look more unlikely than they really are. Lehrer suspects unconscious bias:
Theodore Sterling, in 1959 ... noticed that ninety-seven percent of all published psychological studies with statistically significant data found the effect they were looking for ... Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments

and again:
The problem seems to be one of subtle omissions and unconscious misperceptions, as researchers struggle to make sense of their results.
I expect that unconscious bias is a serious problem (I illustrate some reasons below), but this is pretty unsatisfactory, as he doesn't explain how unconscious bias would affect results, and the Schooler effect is a complete red herring. 

I wasn't around in 1959, so I can't speak to that time, but I suspect that the numbers are similar today ... but in fact Sterling was measuring the wrong thing. Nobody cares what our hypotheses were. They don't care what order the experiments were actually run in. They care about the truth, and they have very limited time to read papers (most papers are never read, only skimmed). Good scientific writing is clear and concise. The mantra is: Tell them what you're going to tell them. Tell them. And then tell them what you told them. No fishing excursions, no detours. When we write scientific papers, we're writing science, not history.

And this means we usually claim to have expected to find whatever it is that we found. It just makes for a more readable paper. So when a scientist reads the line, "We predicted X," we know that really means "We found X" -- what the author actually predicted is beside the point.

Messing with that Point O Five

So where do all the false positives come from, if they should be less than 5% of conducted studies? There seem to be a number of issues.

First, it should be pointed out that the purpose of statistical tests (and the magic .05 threshold for significance) is to make a prediction as to how likely it is that a particular result will replicate. A p-value of .05 means roughly that there is a 95% chance that the basic result will replicate (sort of; this is not technically true but is a good approximation for present purposes).

But statistics are estimates, not facts. They are based on a large number of idealizations. For instance, many require that measurement error is distributed normally, meaning that the bulk of measurements are very close to the true measurement and a measurement is as likely to be larger than the true number as it is likely to be smaller. In fact, most data is heavily skewed, with measurements more likely to be too large than too small (or vice versa).

(Figure: a normal distribution)

For instance, give someone an IQ test. IQ tests have some measurement error -- people will score higher or lower than their "true" score due to random factors such as guessing answers correctly (or incorrectly), being sleepy (or not), etc. But it's a lot harder to get an IQ score higher than your true score than lower, because getting a higher score requires a lot of good luck (unlikely) whereas there are all sorts of ways to get a low score (brain freeze, etc.). 

Most statistical tests make a number of assumptions (like normally distributed error) that are not true of actual data. That leads to incorrect estimates of how likely a particular result is to replicate. The truth is most scientists -- at the very least, most psychologists -- aren't experts in statistics, and so statistical tests are misapplied all the time.

I don't actually think that issues like the ones I just discussed lead to most of the difficulties (though I admit I have no data one way or another). I bring these issues up mainly to point out that statistical tests are tools that are either used or misused according to the skill of the experimenter. And there are lots of nasty ways to misuse statistical tests. I discuss a few of them below:

Run enough experiments and...

Let's go back to my genius fruit experiment. I ask a group of people to eat an apple and then give them an IQ test. I compare their IQ scores with scores from a control group that didn't eat an apple. Now let's say in fact eating apples doesn't affect IQ scores. Assuming I do my statistics correctly and all the assumptions of the statistical tests are met, I should have only a 5% chance of finding a "significant" effect of apple-eating.

Now let's say I'm disappointed in my result. So I try the same experiment with kiwis. Again, I have only a 5% chance of getting a significant result for kiwis. So that's not very likely to happen either.

Next I try oranges....

Hopefully you see where this is going. If I try only one fruit, I have a 5% chance of getting a significant result. If I try 2 fruits, I have a 1 - .95*.95 = 9.8% chance of getting a significant result for at least one of the fruits. If I try 4 fruits, now I'm up to a 1 - .95*.95*.95*.95 = 18.5% chance that I'll "discover" that one of these fruits significantly affects IQ. By the time I've tried 14 fruits, I've got a better than 50% chance of an amazing discovery. But my p-value for that one experiment -- that is, my estimate that these results won't replicate -- is less than 5%, suggesting there is only a 5% chance the results were due to chance.
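For anyone who wants to check the fruit arithmetic, here is a minimal sketch in Python. The function name is my own invention for illustration, and it assumes every test is independent and that none of the fruits has any real effect.

```python
def chance_of_any_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent null tests is 'significant'."""
    # Each test stays non-significant with probability (1 - alpha);
    # the chance that all of them do is that value raised to num_tests.
    return 1 - (1 - alpha) ** num_tests


for fruits in (1, 2, 4, 14):
    print(f"{fruits:2d} fruits: {chance_of_any_false_positive(fruits):.1%}")
```

Fourteen is the first number of fruits at which the chance passes 50%; with 13 it is still a bit under.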

While there are ways of statistically correcting for this increased likelihood of false positives, my experience suggests that it's relatively rare for anyone to do so. And it's not always possible. Consider the fact that there may be 14 different labs all testing the genius fruit hypothesis (it's suddenly very fashionable for some reason). There's a better than 50% chance that one of these labs will get a significant result, even though from the perspective of an individual lab, they only ran one experiment.

Data peeking

Many researchers peek at their data. There are good reasons for doing this. One is curiosity (we do experiments because we really want to know the outcome). Another is to make sure all your equipment is working (don't want to waste time collecting useless data). Another reason -- and this is the problematic one -- is to see if you can stop collecting data.

Time is finite. Nobody wants to spend longer on an experiment than necessary. Let's say you have a study where you expect to need -- based on intuition and past experience -- around 20 subjects. You might check your data after you've run 12, just in case that's enough. What usually happens is that if the results are significant, you stop running the study and move on. If they aren't, you run more subjects. Now maybe after you've got 20 subjects, you check your data. If it's significant, you stop the study; if not, you run some more. And you keep on doing this until either you get a significant result or you give up.

It's a little harder to do back-of-the-envelope calculations on the importance of this effect, but it should be clear that this habit has the unfortunate result of increasing the relative likelihood of a false positive, since false positives lead you to declare victory and end the experiment, whereas false negatives are likely to be corrected (since you keep on collecting more subjects until the false negative is overcome). I read a nice paper on this issue that actually crunched the numbers a while back (for some reason I can't find it at the moment), and I remember the result was a pretty significant increase in the expected number of false positives.
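In place of the envelope, a rough simulation gives a feel for the size of the problem. This is only a sketch under assumptions of my own, not the analysis from the paper I mentioned: there is no true effect, scores have a known standard deviation of 1 (so a simple z-test will do), and the checkpoints at 12, 20, 28 and 36 subjects are made up for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(checkpoints, n_sims=5000, alpha=0.05, seed=0):
    """Share of null experiments that look 'significant' at any checkpoint.

    checkpoints: increasing subject counts at which the data are checked.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided cutoff, about 1.96
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for checkpoint in checkpoints:
            # Run more subjects until we reach the next checkpoint.
            while n < checkpoint:
                total += rng.gauss(0, 1)  # true mean is 0: no real effect
                n += 1
            z = total / n ** 0.5  # sample mean divided by its standard error
            if abs(z) > z_crit:
                hits += 1  # declare victory and stop collecting data
                break
    return hits / n_sims


print("check once at 20 subjects:", false_positive_rate([20]))
print("check at 12, 20, 28, 36:  ", false_positive_rate([12, 20, 28, 36]))
```

Checking once should land near the nominal 5%, while stopping at the first significant checkpoint should come out noticeably higher. The exact figures depend on the random seed and on how often you look.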

Data massaging

The issues I've discussed so far are real problems but are pretty common and not generally regarded as ethical violations. Data massaging is at the borderline.

Any dataset can be analyzed in a number of ways. Once again, if people get the result they were expecting with the first analysis they run, they're generally going to declare victory and start writing the paper. If you don't get the results you expect, you try different analysis methods. There are different statistical tests that can be used. There are different covariates that could be factored out. You can throw out "bad" subjects or items. This is going to significantly increase the rate of false positives.

It should be pointed out that interrogating your statistical model is a good thing. Ideally, researchers should check to see if there are bad subjects or items, check whether there are covariates to be controlled for, check whether different analysis techniques give different results. But doing this affects the interpretation of your p-value (the estimate of how likely it is that your results will replicate), and most people don't know how to appropriately control for that. And some are frankly more concerned with getting the results they want than doing the statistics properly (this is where the "borderline" comes in).

Better estimates

The problem, at least from where I stand, is one of statistics. We want our statistical tests to tell us how likely it is that our results will replicate. We have statistical tests which, if used properly, will give us just such an estimate. However, there are lots and lots of ways to use them incorrectly.

So what should we do? One possibility is to train people to use statistics better. And there are occasional revisions in standard practice that do result in better use of statistics.

Another possibility is to lower the p-value that is considered significant. The choice of p=0.05 as a cutoff was, as Lehrer notes, arbitrary. Picking a smaller number would decrease the number of false positives. Unfortunately, it also decreases the number of real positives by a lot. People who don't math can skip this next section.

Let's assume we're running studies with a single dependent variable and one manipulation, and that we're going to test for significance with a t-test. Let's say the manipulation really should work -- that is, it really does have an effect on our dependent measure. Let's say that the effect size is large-ish (Cohen's d of .8, which is large by psychology standards) and that we run 50 subjects. The chance of actually finding a significant effect at the p=.05 level is 79%. For people who haven't done power analyses before, this might seem low, but actually an 80% chance of finding an effect is pretty good. Dropping our significance threshold to p=.01 drops the chance of finding the effect to 56%. To put this in perspective, if we ran 20 such studies, we'd find 16 significant effects at the p=.05 level but only 11 at the p=.01 level. (If you want to play around with these numbers yourself, try this free statistical power calculator.)

Now consider what happens if we're running studies where the manipulation shouldn't have an effect. If we run 20 such studies, 1 of them will nonetheless give us a false positive at the p=.05 level, whereas we probably won't get any at the p=.01 level. So we've eliminated one false positive, but at the cost of nearly 1/3 of our true positives.
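If you'd rather simulate than trust a calculator, here is a rough sketch of the same power numbers. It rests on assumptions of my own: the 50 subjects are split into two groups of 25, scores are normal with a standard deviation of 1, and the test is a two-tailed two-sample t-test. The critical values for 48 degrees of freedom are hard-coded from a standard t table, so they only hold for that group size.

```python
import random

N_PER_GROUP = 25  # assumed: 50 subjects split evenly into two groups
# Two-tailed critical t values for 48 degrees of freedom (from a t table).
T_CRIT = {0.05: 2.011, 0.01: 2.682}


def mean_and_variance(xs):
    """Sample mean and unbiased sample variance."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def simulated_power(effect_size, alpha, n_sims=4000, seed=1):
    """Share of simulated studies whose t-test comes out significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        control = [rng.gauss(0, 1) for _ in range(N_PER_GROUP)]
        treated = [rng.gauss(effect_size, 1) for _ in range(N_PER_GROUP)]
        m_c, v_c = mean_and_variance(control)
        m_t, v_t = mean_and_variance(treated)
        # Standard error of the difference for equal-sized groups.
        std_err = ((v_c + v_t) / N_PER_GROUP) ** 0.5
        if abs(m_t - m_c) / std_err > T_CRIT[alpha]:
            hits += 1
    return hits / n_sims


for d, alpha in [(0.8, 0.05), (0.8, 0.01), (0.0, 0.05)]:
    print(f"d={d}, alpha={alpha}: {simulated_power(d, alpha):.2f}")
```

The first two lines should come out in the neighborhood of the 79% and 56% quoted above, and the last one (no real effect) near the 5% false positive rate, give or take simulation noise.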

No better prediction of replication than replication

Perhaps the easiest method is to just replicate studies before publishing them. The chances of getting the same spurious result twice in a row are vanishingly small. Many of the issues I outlined above -- other than data massaging -- won't increase your replication rate. Test 14 different fruits to see if any of them increase IQ scores, and you have over a 50% chance that one of them will spuriously do so. Test that same fruit again, and you've only got a 5% chance of repeating the effect. So replication decreases your false positive rate 20-fold. Similarly, data massaging may get you that coveted p<.05, but the chances of the same massages producing the same result again are very, very low.

True positives aren't nearly so affected. Again, a typical power level is B=0.80 -- 80% of the time that an effect is really there, you'll be able to find it. So when you try to replicate a true positive, you'll succeed 80% of the time. So replication decreases your true positives by only 20%.

So let's say the literature has a 30% false positive rate (which, based on current estimates, seems quite reasonable). Attempting to replicate every positive result prior to publication -- and note that it's extremely rare to publish a null result (no effect), so almost all published results are positive results -- should decrease the false positives 20-fold and the true positives by 20%, leaving us with a 2.6% false positive rate. That's a huge improvement.

So why not replicate more?

So why don't people replicate before publishing? If 30% of your own publishable results are false positives, and you eliminate them, you've just lost 30% of your potential publications. You've also lost 20% of your true positives as well, btw, which means overall you've decreased your productivity by 43%. And that's without counting the time it takes to run the replication. Yes, it's nice that you've eliminated your false positives, but you also may have eliminated your own career!
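The arithmetic behind the 2.6% and 43% figures is easy to check. A minimal sketch, using only the numbers assumed in the text (30% of positive results are false, a false positive replicates 5% of the time, a true positive replicates 80% of the time); the variable names are mine:

```python
false_share = 0.30  # share of positive results that are false (assumed above)
alpha = 0.05        # chance that a false positive replicates
power = 0.80        # chance that a true positive replicates

surviving_false = false_share * alpha        # false positives that replicate
surviving_true = (1 - false_share) * power   # true positives that replicate
surviving_total = surviving_false + surviving_true

false_rate_after = surviving_false / surviving_total
productivity_loss = 1 - surviving_total

print(f"false positive rate after replication: {false_rate_after:.1%}")
print(f"results lost to replication:           {productivity_loss:.1%}")
```

Out of every 100 publishable results, 57.5 survive replication, and 1.5 of those are false.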

When scientists are ranked, they're largely ranked on (a) number of publications, (b) number of times a publication is cited, and (c) quality of journal that the publications are in. Notice that you can improve your score on all of these metrics by publishing more false positives. Taking the time to replicate decreases your number of publications and eliminates many of the most exciting and surprising results (decreasing both citations and quality of journal). Perversely, even if someone publishes a failure to replicate your false positive, that's a citation and another feather in your cap.

I'm not saying that people are cynically increasing their numbers of bogus results. Most of us got into science because we actually want to know the answers to stuff. We care about science. But there is limited time in the day, and all the methods of eliminating false positives take time. And we're always under incredible pressure to pick up the pace of research, not slow it down.

I'm not sure how to solve this problem, but any solution I can think of involves some way of tracking not just how often a researcher publishes or how many citations those publications get, but how often those publications are replicated. Without having a way of tracking which publications replicate and which don't, there is no way to reward meticulous researchers or hold sloppy researchers to account.

Also, I think a lot of people just don't believe that false positives are that big a problem. If you think that only 2-3% of published papers contain bogus results, there's not a lot of incentive to put in a lot of hard work learning better statistical techniques, replicating everything, etc. If you think the rate is closer to 100%, you'd question the meaning of your own existence. As long as we aren't keeping track of replication rates, nobody really knows for sure where we are on this continuum.

That's my conclusion. Here's Lehrer's:
The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that's often not the case. Just because an idea is true doesn't mean it can be proved. And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe.
I say it again: huh?

Priest, Altars and Peer Review

David Dobbs at Neuron Culture is complaining about NASA and peer review:
A NASA spokesperson has dismissed a major critique of the Science arsenic bug paper based not on the criticism's merits, but on its venue -- it appeared in a blog rather than a peer-reviewed journal. Apparently ideas are valid (or not) based not on their content, or even the reputation of the author, but on where they're published.
I'm not known for my strong endorsement of the fetishism of peer review, but even so I think Dobbs is being somewhat unfair. My reading of history is that scientists have been plugging the peer-review mantra because they're tired of having to respond to ignorant assholes who appear on Oprah spouting nonsense. I mean, yes, you can address wacko claims about vaccines causing autism or the lack of global warming on their merits (they have none), but it gets tiresome to repeat. In any case, relatively few members of the public can follow the actual arguments, so it becomes an issue of who you believe. And that's a hard game to win, since saying "so-and-so doesn't know what they're talking about" sounds elitist even when it's true, and "elitism" (read: "meritocracy") is for some reason unpopular.

Focusing on peer review as a mechanism for establishing authority is convenient, because the public (thinks it) understands the mechanisms. You're not saying, "Believe me because I am a wise scientist," but "Believe the documented record." And since Jenny McCarthy doesn't publish in peer-reviewed journals, you can (try to) exclude her and other nuttos from the conversation.

So I think there are good reasons for a NASA spokesman, when speaking with a reporter, to dismiss blogs. Taking a critique in a blog seriously in public is only going to open the floodgates. I mean, there are a *lot* of blogs out there. That doesn't mean that the scientists involved aren't taking a serious critique by a serious scientist seriously just because the criticism appeared in a blog. I hope that they are, and we don't want to read too much into NASA's official statement.

All that said, I'm not sure focusing on peer-reviewed science has been helping very much. I mean, McCarthy still gets booked on Oprah anyway.

More Politics

As expected, the President appears to be caving on massive tax-breaks for the mega-wealthy while cutting back on services for everyone else and key investments for the future (in the name of fiscal responsibility). The fact that this is a Republican proposal doesn't make him any less responsible. I'll be voting for someone else in two years.

Google Translate Fail

Google Translate's blog:
There are some things we still can't translate. A baby babbling, for example. For the week of November 15th we are releasing five videos of things Google can’t translate (at least not yet)! Check out the videos and share them with your friends. If you can think of other things you wish Google translated (like your calculus homework or your pet hamster), tweet them with the tag #GoogleTranslate. We’ll be making a video of at least one of the suggestions and adding it to our page.
What do I wish Google Translate could translate? I'll bite. How about Russian? Or Japanese?

I mean, have the folks over at GT ever actually used their product? It's not very good. I'll admit that machine translation has improved a lot in recent years, but I doubt it's as good as a second-year Spanish student armed with a pocket dictionary.

Nothing against the fine engineers working at Google. GT is an achievement to be proud of, but when they go around claiming to have solved machine translation, it makes those of us still working on the problems of language look bad. It's hard enough to convince my parents that I'm doing something of value without Google claiming to have already solved all the problems.