Field of Science

Making up your data

Having finished reading the Simonsohn paper on detecting fraud, I have come to two conclusions:

1. Making up high-quality data is really hard. Part of the problem with making up data is that you have to introduce some randomness into it. If your study involves asking people how much they are willing to pay for a black t-shirt, you can't just write down that they all were willing to pay the average (say $12). You have to write down some variation ($12, $14, $7, $9, etc.).

The problem is that humans are notoriously bad at generating random number sequences. Simonsohn discusses this in terms of Tversky and Kahneman's famous, tongue-in-cheek paper "Belief in the law of small numbers." People think that random sequences should look roughly "average", even if the sample is small: flip a coin 4 times and you should get 2 heads and 2 tails, when in fact getting 4 heads isn't all that improbable (it happens 1 time in 16, or about 6% of the time).
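If you want to check the coin-flip intuition for yourself, here is a minimal sketch that simply enumerates every possible sequence of fair flips. The function name `prob_heads_count` is my own invention for illustration, not something from the paper.

```python
from fractions import Fraction
from itertools import product


def prob_heads_count(n_flips, n_heads):
    """Exact probability of exactly n_heads heads in n_flips fair coin flips.

    Brute force: list all 2**n_flips equally likely sequences and count
    the ones with the requested number of heads.
    """
    outcomes = list(product("HT", repeat=n_flips))
    hits = sum(1 for seq in outcomes if seq.count("H") == n_heads)
    return Fraction(hits, len(outcomes))


p_all_heads = prob_heads_count(4, 4)   # 4 heads out of 4 flips
p_even_split = prob_heads_count(4, 2)  # the "average-looking" 2 and 2

print("P(4 heads in 4 flips):", p_all_heads)
print("P(2 heads, 2 tails):  ", p_even_split)
```

The even split is the single most likely count, but it still happens well under half the time, so a made-up data set in which every small sample looks "average" is itself suspicious.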

So your best bet, if you are making up data, is to use a computer program to generate it from your favorite distribution (the normal distribution would be a good choice in most cases). The problem is that data can have funny idiosyncrasies. One of the problems with the string of numbers I suggested above ($12, $14, $7, $9, etc.) is that humans like round numbers. So when people say what they are willing to pay for a t-shirt, what you should see is a lot of $10s, $20s and maybe some $5s and $15s. The numbers in my list are relatively unlikely.
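To make the round-number point concrete, here is a rough sketch of one check you could run. It is only an illustration under my own assumptions, not the method from the paper: the mean of $12 comes from the example above, but the standard deviation of $3, the sample size, and the function names `fake_prices` and `round_number_share` are all made up for the demonstration.

```python
import random


def round_number_share(prices, step=5):
    """Fraction of prices that are exact multiples of `step` dollars."""
    if not prices:
        raise ValueError("need at least one price")
    return sum(1 for p in prices if p % step == 0) / len(prices)


def fake_prices(n, mean=12.0, sd=3.0, seed=0):
    """Draw n whole-dollar prices from a normal distribution.

    mean, sd and seed are assumptions chosen for illustration only.
    Negative draws are clamped to 0, since nobody pays a negative price.
    """
    rng = random.Random(seed)
    return [max(0, round(rng.gauss(mean, sd))) for _ in range(n)]


simulated = fake_prices(1000)
my_list = [12, 14, 7, 9]        # the numbers I suggested above
round_list = [10, 20, 5, 15]    # the kind of answers people tend to give

print("normal draws:", round_number_share(simulated))
print("my list:     ", round_number_share(my_list))
print("round list:  ", round_number_share(round_list))
```

Computer-generated normal draws land on multiples of $5 only a minority of the time, so data that should have come from humans but shows no pile-up at $10 or $20 stands out.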

The paper goes on to describe other problems as well. What I get from this is that making up data in a way that is undetectable is a lot of work, and you might as well actually run the study. So even leaving aside the other reasons not to commit fraud (ethics, a desire for knowledge or a belief in its importance, etc.), it seems sheer laziness alone should steer you in the other direction.

2. The Dark Knight Rises is awesome. Seriously. Technically there was nothing about that in the paper, but I was thinking about the movie while reading the paper. Since I saw the show this morning, it's been hard to think of anything else. The most negative thing I can say about it is that it wasn't better than the last one, which is grading on a pretty steep curve.
