Jeffrey Beall's blacklist of "predatory open-access journals" -- discussed in yesterday's New York Times -- provides evidence for my long-standing suspicion of any journal named "International Journal of ..." There probably are some good journals named "International Journal of...", but I don't know of any off-hand. And there seem to be an awful lot of bad ones, probably for good reason: An internationally-recognized journal doesn't have to say so. So almost by definition a journal that has to call itself "International Journal of" is probably not a well-known journal.
In general, nearly every journal on the list has some location in its name, such as South Asian Journal of Mathematics, which doubles down by referring to itself on its home page as an "international journal". Again, there are, of course, good journals with region-specific names. But there don't seem to be many. I'm less sure of the reason for this one.
[Future Post: Explaining why universities that market themselves as "The Harvard of" some region are frequently not even the most prestigious school in that region.]
Lab Notebook: You know you are writing a paper when
Your "recently added" list in Mendeley is growing at an exponential rate:
(click to expand. note time added.)
Every new paper you read results in downloading at least two more (not unlike the Hounds of Resurrection. Coincidence? I think not).
I don't think I've ever actually finished my reading list for a paper. At some point, I shut down the process before it overwhelms my hard drive.
Poll Results: How much do you revise your papers
In the last poll, I asked folks how much they revise their papers subsequent to first submission. A substantial majority said that they engaged in revisions that were "substantial" or "more than substantial". I was glad to hear that I'm not the only one.
Predicting my h-index
A new article in Nature presents a model for predicting a neuroscientist's future h-index based on current output. For those who don't know, the h-index is the largest N such that you have N papers, each of which has at least N citations. The model makes its prediction based on your current h-index, total number of published articles, years since first article, the total number of journals published in, and the number of papers in Nature, Science, Nature Neuroscience, PNAS and Neuron.
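(For the curious, here's a minimal sketch of how an h-index is computed from a list of per-paper citation counts. The citation counts below are made up purely for illustration.)

```python
def h_index(citations):
    """Largest N such that at least N papers each have at least N citations."""
    counts = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(counts, start=1) if c >= rank)

# Hypothetical citation counts for six papers
print(h_index([10, 8, 5, 4, 3, 0]))  # 4: four papers with at least 4 citations each
```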
I'm not a neuroscientist (though I *am* in Neurotree, upon which the data are based), so I figured I'd see what it predicts for me. I ran into a problem right away, though: how do we count papers? Does my book chapter count? How about two editorials that were recently published (that is, non-empirical papers appearing in empirical journals)? What about two papers that are in press but not yet published?
If we are conservative and count only empirical papers and only papers currently in print, my predicted h-index in 2022 is 12:
If we count everything I've published -- that is, including papers in press, book chapters, and non-empirical papers -- things improve somewhat (apparently it's a good thing that almost everything I publish is in different outlets):
Interestingly, if I pretend that I currently have an h-index of 9 (that is, all my papers have been cited at least 9 times), it doesn't actually do any good. I've increased my current h-index by 6 but only my predicted 2022 h-index by 3:
I guess the model has discovered regression to the mean.
(BTW I've noticed that neuroscientists really like h-index, probably because they cite each other so much. H-indexes in other fields, such as -- to take a random example -- psychology, tend to be much lower.)
Since one can't be snarky in a response to a review...
I'll do it here. I am currently revising a paper for resubmission. On the whole, the reviews are fairly reasonable, with the exception of one cranky comment from a reviewer who complains that our literature review is woefully incomplete. This incompleteness seems to be our failure to cite one particular study. The reviewer writes
It is possible that this work is flawed, but it really should be discussed.

It does seem to be a relevant study and we would have cited it, had we known about it. Why didn't we know about it? Because it has never been published. It hasn't even been presented at any of the normal psycholinguistics conferences (though it has appeared at some linguistics conferences). Short of emailing every researcher who might be conducting a study that might be relevant, I'm not sure what this reviewer was expecting of us.
I'd also love to know what the folks who are obsessed with only citing studies published in peer-reviewed journals would say. (It's possible that some of these conferences it has been presented at have pretty thorough review procedures; I wouldn't know.)
Fair Use & FedEx
And now for something completely different: one private citizen's trials and travails trying to convince FedEx to print posters.
I have wanted a map of Hong Kong on my wall for some time. The Survey & Mapping Office of the Hong Kong government helpfully provides some free maps for public use on its website. You will notice how the website includes a "free maps" logo, along with a copyright notice forbidding only commercial use of the map. Presumably they thought this was a good way of providing some publicity for the Special Administrative Region.
They did not take into account FedEx Office. I put this map on a USB stick and went to the FedEx Office at Government Center to have it printed. The manager there refused to print it as I didn't have proof of copyright ownership. I showed him the website (particularly where it says "free maps"). He said the fact that the map is free for public use was irrelevant; he needed a signed document from the copyright owner (the government of Hong Kong) stating that I, personally, had the right to print off the map.
His explanation for his refusal was simple: "I can't get between me and the copyright holder." I pointed out that he was getting between me -- who wants to print the map -- and the copyright owner -- who also wants me to print the map. He repeated that even so, he "can't get between me and the copyright holder." This was just repetition, so I pointed out again that the map is clearly labeled for public use. He said that was just "he said/she said" business; what he needed was a signed document.
I'm curious what he would do with a signed document in Chinese, and whether he would require a notarized translation. I realized as I was leaving that at the beginning when the manager was trying to establish whether I had the right to print the map, he had asked me if I was a member of the organization that made the map -- that is the Hong Kong government. I'm curious what would have happened if I had said yes.
The "copyright waiver"
This is not the 1st time that I've had a run-in with the copyright police at FedEx. Last year the Palo Alto FedEx refused to print a poster that I was supposed to present at a conference at Stanford. I study story comprehension in small children, and a common practice is to use stories about familiar characters. In this case, I had stories about Dora the Explorer and a few other cartoon characters. Because my poster showed an example of one of the pictures that we had drawn to go with the stories, FedEx initially refused to print the poster, saying that it violated copyright.
After a long discussion about fair use and noncommercial uses, one of the employees remembered that they have a “copyright release” form that they can use in these circumstances. Unfortunately, they couldn't find any blank copies. One enterprising employee simply wrote the words “copyright release” on a piece of paper and asked me to sign that piece of paper.
I wasn't sure about the wisdom of signing an essentially blank piece of paper (you can see a photo of it on the right), so they came up with another plan, which was to white out all the writing on a previously filled-out form, which they then copied (not waiting for the whiteout to dry and getting whiteout all over their copier in the process) and which I signed. Then they printed my poster and I went on to have an otherwise successful conference.
Copyright and FedEx
Clearly somebody has instilled the Fear of the Lord into the employees at FedEx with regards to copyright infringement. FedEx is understandably concerned about their liability, since unlike me, they have actual assets. I also realize that FedEx may not have the resources to have somebody on staff who has been adequately trained to deal with copyright issues ... but in that case, it suggests that maybe they do not have the resources to run a print shop. After all, it is not like they are not making determinations now. They are just doing it randomly and incorrectly.
Results: Replication in Psychology
My paper with Adena Schachner on replication in psychology is now published. The paper contains 3 main sections: a reasonably thorough literature review on replication rates in psychology, a proposal as to how to improve replication rates (primarily through tracking replication rates), and the results of a survey of psychologists on replication practices (many thanks to all who participated). The result of the survey was that while not nearly enough replications are attempted, there are actually more being attempted than we had guessed (or than many of the colleagues we discussed this project with had guessed).
This paper is part of a larger collection of papers on reimagining the publication and review process, and as more of those papers are printed, I plan to discuss at least some of them.
Above average!
It's often repeated that the median study is cited 0 times. I haven't been able to find a citation for that claim, but if it is true, all my papers are now above the median. My birth order paper has now been cited. Actually, it was cited last year, but I didn't notice for a while. Granted, it was cited in a paper appearing in the Journal of Language, Technology & Entrepreneurship in Africa, which is apparently not a high-impact journal, but a citation is a citation.
For rather boring reasons not related to the data or the review process itself, the birth order paper appeared in a journal that is not widely read by researchers, which probably has reduced its visibility. Certainly, plenty has been published on the topic in the last few years. This is a lesson for the future: it really does matter which journal you publish in, despite the wide-spread use of search engines.
For more on my birth order research, click here.
The pace of review
One of my manuscripts will shortly enter its 7th month of being under review. Apparently one of the three reviewers keeps promising to send in a review but never does. Now the 4+ months a different manuscript languished under review seems speedy.
Ray Kurzweil is convinced that the pace of science is increasing exponentially and will continue to do so. I think he's neglected one rate-limiting step in the process.
Paper submitted
I just submitted a new paper on pronoun resolution ("Do inferred causes really drive pronoun resolution"), in which I argue that a widely-studied phenomenon called "implicit causality" has been misanalyzed and is in fact at least two different phenomena (as described in this previous post). You can find the paper on my publications page. Comments are welcome.
I always find writing up methods and results relatively easy. The trick is fitting the research into the literature in a way that will make sense and be useful to readers. While the narrow implications are often clear, it's not always obvious which broader implications are most relevant. For instance, the paper has clear implications for the few dozen people who study implicit causality, but one would like people beyond that small group to also find the results relevant.
I tried a few different approaches before ultimately settling on a particular introduction and conclusion. I was curious how much the paper had changed from the first draft to the last.
Here's the first draft, according to Wordle:
Here's draft 2:
The most obvious difference is that I hyphenated a lot more in the final draft (I was trying to make the word limit). But it doesn't appear that the changes in theme -- as measured by Wordle -- were all that drastic. That's either a good sign (my paper didn't lose its soul in the process of editing) or a bad sign (I didn't edit it enough).
I guess we'll see when the reviews come back in.
missing 2
One of the formulas in the last post was missing a 2. Everything has now been recalculated, and some numbers changed. The basic result is that some of the numbers are not quite as dire as I had stated: the original example experiment, which had 15 participants per condition and an effect significant at p=.05, has a 51% chance of replicating (in the sense of producing another significant p-value when re-run exactly), again assuming the effect was real and the effect size is as measured in the first experiment.
When should an effect be called significant?
note: This post originally contained an error in the mathematics, which Tal of Citation Needed kindly pointed out. This error has been corrected.
In the thread following my earlier post on false positives, Tal made the observation that a typical study that is significant at the p=.05 level has about a 50% chance of being replicated.
It turns out that this depends heavily on what you mean by replicate. I'm going to work through some numbers below. Again, stats isn't my specialty, so please anyone jump in to correct errors. But I think I've got the general shape of the issues correct.
I got a significant result! Can I get it again?
Let's say you ran an experiment comparing the IQ scores of 15 people who prefer kiwi with the IQ scores of 15 people who prefer rambutan. You find that people who prefer rambutan have IQs 11.2 points higher than those who prefer kiwi. Assuming the standard deviation is 15 (which is how IQ tests are normalized), that should give you a t-value of 11.2 / (15 * (2/15)^.5) = 2.04 and a p-value of about .05.
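(If you want to check that arithmetic, here's a quick sketch using scipy -- the numbers are simply the ones from the example above.)

```python
import numpy as np
from scipy import stats

diff, sd, n = 11.2, 15.0, 15            # IQ difference, SD, subjects per group
se = sd * np.sqrt(2 / n)                # standard error of the difference in group means
t = diff / se                           # ~2.04
p = 2 * stats.t.sf(t, df=2 * n - 2)     # two-tailed p-value, ~.05
print(round(t, 2), round(p, 3))
```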
So you've now got a significant result! You tell all your friends, and they try running the same experiment. What are the chances they'll get the same results, significant at the p=.05 level?
The chances are not great. Even assuming that the underlying effect is real (rambutan-eaters really are smarter), your friends will only replicate your result about 51% of the time, assuming they use exactly the same methods (according to a nifty power calculator found online here).
Define "get it"
Of course, we were assuming above that rambutan-eaters really are 11.2 IQ points smarter than kiwi-eaters (BTW I like both kiwi and rambutan, so nothing is meant by this example). In which case, your friends might not have gotten results significant at the p=.05 level, but they very likely found higher average IQs for their samples of rambutan-eaters relative to kiwi-eaters. And of course, what we really care about is how easy it will be to replicate the rambutan/kiwi difference, not how easy it will be to get the significant p-value again. The point of science is not to be able to predict statistically-significant differences but simply to predict differences. It's well beyond my statistical abilities to say how often this would happen, but hopefully someone will step up in the comments and let us know.
In practice, though, other people are only going to follow up on your effect if they can replicate it at the standard p=.05 level. What can we do to improve the chances of replicability?
Lower alphas
Let's suppose your effect had been significant at the p=.01 level. We can manage that while keeping the effect size the same (11.2 IQ points) if we increase our sample to 26 kiwi-eaters and 26 rambutan-eaters (t = 11.2/(15 * (2/26)^.5) = 2.7). Now our chance of getting another significant result at the p=.01 level is ... 52%.
But we don't really care about getting p=.01 again; we want to get the result again at the p=.05 level, which should happen around 76% of the time. Now, what if we had a result significant at the p=.001 level the first time around? We'd have needed about 42 subjects per condition. The chance of replicating that at the p=.05 level is 92%.
p-value #subjects/condition Likelihood of repeating at p=.05 level
.05 15 51%
.01 26 76%
.001 42 92%
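(The numbers in that table come from a standard power calculation. Here's a sketch that reproduces them, assuming a two-tailed independent-samples t-test and the Cohen's d implied by the example, 11.2/15; the original post used an online calculator rather than code.)

```python
from statsmodels.stats.power import TTestIndPower

sd = 15.0
power = TTestIndPower()
for diff, n in [(11.2, 15), (11.2, 26), (11.2, 42)]:
    d = diff / sd                                   # Cohen's d from the original experiment
    p_rep = power.power(effect_size=d, nobs1=n, alpha=0.05,
                        ratio=1.0, alternative='two-sided')
    print(f"n = {n} per condition: ~{p_rep:.0%} chance of replicating at p=.05")
```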
Replication
Of course, there are things that I'm not considering here, like the possibility that your original experiment underestimated the effect size. For instance, let's say that the true effect size is 15 IQ points (which is a lot!). Now, your chances of finding an effect significant at the p=.05 level with only 15 participants per condition is 75%. That's a lot better than what we started with, though not perfect. To have an effect large enough to see it 95% of the time at the p=.05 level, it would need to be over 20 IQ points, which is a monstrously large effect. Incidentally, if you ran this study with 15 rambutan-eaters and 15 kiwi-eaters and found a 20 IQ point effect, that would be significant below the p=.001 level.
What I get from all this is that if you want a result that you and others will be able to replicate, you're going to want the p-value in your original experiment to have been lower than p=.05.
And for my next trick, I'll make this effect disappear!
In this week's New Yorker, Jonah Lehrer shows once again just how hard it is to do good science journalism if you are not yourself a scientist.
His target is the strange phenomenon that many high profile papers are failing to replicate. This has been very much a cause celebre lately, and Lehrer follows a series of scientific papers on the topic as well as an excellent Atlantic article by David Freedman. At this point, many of the basic facts are well-known: anecdotally, many scientists report repeated failures to replicate published findings. The higher-profile the paper, the less likely it is to replicate, with around 50% of the highest-impact papers in medicine failing to replicate. As Lehrer points out, this isn't just scientists failing to replicate each other's work, but scientists failing to replicate their own work: a thread running through the article is the story of Jonathan Schooler, a professor at UC-Santa Barbara who has been unable to replicate his own seminal graduate student work on memory.
Lehrer's focus in this article is shrinking effects.
Some experimental effects seem to shrink steadily over time:
In 2001, Michael Jennions, a biologist at the Australian National University, set out to analyze "temporal trends" across a wide range of subjects in ecology and evolutionary biology. He looked at hundreds of papers and forty-four meta-analyses (that is, statistical syntheses of related studies), and discovered a consistent decline effect over time, as many of the theories seemed to fade into irrelevance.

As described, that's weird. But there is a good explanation for such effects, and Lehrer brings it up. Some results are spurious. It's just one of those things. Unfortunately, spurious results are also likely to be exciting. Let's say I run a study looking for a relationship between fruit-eating habits and IQ. I look at the effects of 20 different fruits. By chance, one of them will likely show a significant -- but spurious -- effect. So let's say I find that eating an apple every day leads to a 5-point increase in IQ. That's really exciting because it's surprising -- and the fact that it's not true is integral to what makes it surprising. So I get it published in a top journal (top journals prefer surprising results).
Now, other people try replicating my finding. Many, many people. Most will fail to replicate, but some -- again by chance -- will replicate. It is extremely difficult to get a failure to replicate published, so only the replications get published. After time, the "genius apple hypothesis" becomes part of established dogma. Remember that anything that challenges established dogma is exciting and surprising and thus easier to publish. So now failures to replicate are surprising and exciting and get published. When you look at effect-sizes in published papers over time, you will see a gradual but steady decrease in the "effect" of apples -- from 5 points to 4 points down to 0.
Where I get off the Bus
So far so good, except here's Lehrer again:
While the publication bias almost certainly plays a role in the decline effect, it remains an incomplete explanation. For one thing, it fails to account for the initial prevalence of positive results among studies that never even get submitted to journals. It also fails to explain the experience of people like Schooler, who have been unable to replicate their initial data despite their best efforts.

Huh? Lehrer seems to be suggesting that it is publication that makes a result spurious. But that can't be right. Rather, there are just lots of spurious results out there. It happens that journals preferentially publish spurious results, leading to biases in the published record, and eventually the decline effect.
Some years ago, I had a bad habit of getting excited about my labmate's results and trying to follow them up. Just like a journal, I was interested in the most exciting results. Not surprisingly, most of these failed to replicate. The result was that none of them got published. Again, this was just a factor of some results being spurious -- disproportionately, the best ones. (Surprisingly, this labmate is still a friend of mine; personally, I'd hate me.)
The Magic of Point O Five
Some readers at this point might be wondering: wait -- people do statistics on their data and only accept a results that is extremely unlikely to have happened by chance. The cut-off is usually 0.05 -- a 5% chance of having a false positive. And many studies that turn out later to have been wrong pass even stricter statistical tests. Notes Lehrer:
And yet Schooler has noticed that many of the data sets that end up declining seem statistically solid -- that is, they contain enough data that any regression to the mean shouldn't be dramatic. "These are the results that pass all the tests," he says. "The odds of them being random are typically quite remote, like one in a million. This means that the decline effect should almost never happen. But it happens all the time!"

So there's got to be something making these results look more unlikely than they really are. Lehrer suspects unconscious bias:
Theodore Sterling, in 1959 ... noticed that ninety-seven percent of all published psychological studies with statistically significant data found the effect they were looking for ... Sterling saw that if ninety-seven per cent of psychology studies were proving their hypotheses, either psychologists were extraordinarily lucky or they published only the outcomes of successful experiments
and again:
The problem seems to be one of subtle omissions and unconscious misperceptions, as researchers struggle to make sense of their results.
I expect that unconscious bias is a serious problem (I illustrate some reasons below), but this is pretty unsatisfactory, as he doesn't explain how unconscious bias would affect results, and the Schooler effect is a complete red herring.
I wasn't around in 1959, so I can't speak to that time, but I suspect that the numbers are similar today ... but in fact Sterling was measuring the wrong thing. Nobody cares what our hypotheses were. They don't care what order the experiments were actually run in. They care about the truth, and they have very limited time to read papers (most papers are never read, only skimmed). Good scientific writing is clear and concise. The mantra is: Tell them what you're going to tell them. Tell them. And then tell them what you told them. No fishing excursions, no detours. When we write scientific papers, we're writing science, not history.
And this means we usually claim to have expected to find whatever it is that we found. It just makes for a more readable paper. So when a scientist reads the line, "We predicted X," we know that really means "We found X" -- what the author actually predicted is beside the point.
Messing with that Point O Five
So where do all the false positives come from, if they should be less than 5% of conducted studies? There seem to be a number of issues.
First, it should be pointed out that the purpose of statistical tests (and the magic .05 threshold for significance) is to make a prediction as to how likely it is that a particular result will replicate. A p-value of .05 means roughly that there is a 95% chance that the basic result will replicate (sort of; this is not technically true but is a good approximation for present purposes).
But statistics are estimates, not facts. They are based on a large number of idealizations. For instance, many tests require that measurement error is normally distributed, meaning that the bulk of measurements are very close to the true value and a measurement is as likely to be larger than the true value as it is to be smaller. In fact, most data are heavily skewed, with measurements more likely to be too large than too small (or vice versa).
For instance, give someone an IQ test. IQ tests have some measurement error -- people will score higher or lower than their "true" score due to random factors such as guessing answers correctly (or incorrectly), being sleepy (or not), etc. But it's a lot harder to get an IQ score higher than your true score than lower, because getting a higher score requires a lot of good luck (unlikely) whereas there are all sorts of ways to get a low score (brain freeze, etc.).
Most statistical tests make a number of assumptions (like normally distributed error) that are not true of actual data. That leads to incorrect estimates of how likely a particular result is to replicate. The truth is most scientists -- at the very least, most psychologists -- aren't experts in statistics, and so statistical tests are misapplied all the time.
I don't actually think that issues like the ones I just discussed lead to most of the difficulties (though I admit I have no data one way or another). I bring these issues up mainly to point out that statistical tests are tools that are either used or misused according to the skill of the experimenter. And there are lots of nasty ways to misuse statistical tests. I discuss a few of them below:
Run enough experiments and...
Let's go back to my genius fruit experiment. I ask a group of people to eat an apple and then give them an IQ test. I compare their IQ scores with scores from a control group that didn't eat an apple. Now let's say in fact eating apples doesn't affect IQ scores. Assuming I do my statistics correctly and all the assumptions of the statistical tests are met, I should have only a 5% chance of finding a "significant" effect of apple-eating.

Now let's say I'm disappointed in my result. So I try the same experiment with kiwis. Again, I have only a 5% chance of getting a significant result for kiwis. So that's not very likely to happen either.
Next I try oranges....
Hopefully you see where this is going. If I try only one fruit, I have a 5% chance of getting a significant result. If I try 2 fruits, I have a 1 - .95*.95 = 9.8% chance of getting a significant result for at least one of the fruits. If I try 4 fruits, now I'm up to a 1 - .95*.95*.95*.95 = 18.5% chance that I'll "discover" that one of these fruits significantly affects IQ. By the time I've tried 14 fruits, I've got a better than 50% chance of an amazing discovery. But my p-value for that one experiment -- that is, my estimate that these results won't replicate -- is less than 5%, suggesting there is only a 5% chance the results were due to chance.
While there are ways of statistically correcting for this increased likelihood of false positives, my experience suggests that it's relatively rare for anyone to do so. And it's not always possible. Consider the fact that there may be 14 different labs all testing the genius fruit hypothesis (it's suddenly very fashionable for some reason). There's a better than 50% chance that one of these labs will get a significant result, even though from the perspective of an individual lab, they only ran one experiment.
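(A couple of lines of code make the same point; the only assumption is the 5% per-fruit false-positive rate.)

```python
# Chance of at least one spurious "significant" fruit when testing k independent fruits
for k in [1, 2, 4, 14]:
    print(f"{k:2d} fruits: {1 - 0.95 ** k:.1%} chance of a false discovery")
```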
Data peeking
Many researchers peek at their data. There are good reasons for doing this. One is curiosity (we do experiments because we really want to know the outcome). Another is to make sure all your equipment is working (don't want to waste time collecting useless data). Another reason -- and this is the problematic one -- is to see if you can stop collecting data.
Time is finite. Nobody wants to spend longer on an experiment than necessary. Let's say you have a study where you expect to need -- based on intuition and past experience -- around 20 subjects. You might check your data after you've run 12, just in case that's enough. What usually happens is that if the results are significant, you stop running the study and move on. If they aren't, you run more subjects. Now maybe after you've got 20 subjects, you check your data. If it's significant, you stop the study; if not, you run some more. And you keep on doing this until either you get a significant result or you give up.
It's a little harder to do back-of-the-envelope calculations on the importance of this effect, but it should be clear that this habit has the unfortunate result of increasing the relative likelihood of a false positive, since false positives lead you to declare victory and end the experiment, whereas false negatives are likely to be corrected (since you keep on collecting more subjects until the false negative is overcome). I read a nice paper on this issue that actually crunched the numbers a while back (for some reason I can't find it at the moment), and I remember the result was a pretty significant increase in the expected number of false positives.
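(Since I can't find that paper, here's a rough simulation of the peeking strategy described above under the assumption that there is no real effect. The particular settings -- start at 12 subjects per group, peek every 4, give up at 60 -- are made up for illustration.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def peeking_yields_significance(start_n=12, max_n=60, step=4, alpha=0.05):
    """Simulate two groups with no true difference, testing after every
    `step` additional subjects per group and stopping at the first p < alpha."""
    a = rng.standard_normal(max_n)
    b = rng.standard_normal(max_n)
    for n in range(start_n, max_n + 1, step):
        p = stats.ttest_ind(a[:n], b[:n]).pvalue
        if p < alpha:
            return True
    return False

sims = 10_000
rate = sum(peeking_yields_significance() for _ in range(sims)) / sims
print(rate)
```

With these made-up settings the false-positive rate comes out well above the nominal 5%, which is the basic point.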
Data massaging
The issues I've discussed so far are real problems, but they are pretty common and not generally regarded as ethical violations. Data massaging is at the borderline.
Any dataset can be analyzed in a number of ways. Once again, if people get the result they were expecting with the first analysis they run, they're generally going to declare victory and start writing the paper. If you don't get the results you expect, you try different analysis methods. There are different statistical tests that can be used. There are different covariates that could be factored out. You can throw out "bad" subjects or items. This is going to significantly increase the rate of false positives.
It should be pointed out that interrogating your statistical model is a good thing. Ideally, researchers should check to see if there are bad subjects or items, check whether there are covariates to be controlled for, check whether different analysis techniques give different results. But doing this affects the interpretation of your p-value (the estimate of how likely it is that your results will replicate), and most people don't know how to appropriately control for that. And some are frankly more concerned with getting the results they want than doing the statistics properly (this is where the "borderline" comes in).
Better estimates
The problem, at least from where I stand, is one of statistics. We want our statistical tests to tell us how likely it is that our results will replicate. We have statistical tests which, if used properly, will give us just such an estimate. However, there are lots and lots of ways to use them incorrectly.
So what should we do? One possibility is to train people to use statistics better. And there are occasional revisions in standard practice that do result in better use of statistics.
Another possibility is to lower the p-value that is considered significant. The choice of p=0.05 as a cutoff was, as Lehrer notes, arbitrary. Picking a smaller number would decrease the number of false positives. Unfortunately, it also decreases the number of real positives by a lot. People who don't math can skip this next section.
Let's assume we're running studies with a single dependent variable and one manipulation, and that we're going to test for significance with a t-test. Let's say the manipulation really should work -- that is, it really does have an effect on our dependent measure. Let's say that the effect size is large-ish (Cohen's d of .8, which is large by psychology standards) and that we run 50 subjects. The chance of actually finding a significant effect at the p=.05 level is 79%. For people who haven't done power analyses before, this might seem low, but actually an 80% chance of finding an effect is pretty good. Dropping our significant threshold to p=.01 drops the chance of finding the effect to 56%. To put this in perspective, if we ran 20 such studies, we'd find 16 significant effects at the p=.05 level but only 11 at the p=.01 level. (If you want to play around with these numbers yourself, try this free statistical power calculator.)
Now consider what happens if we're running studies where the manipulation shouldn't have an effect. If we run 20 such studies, 1 of them will nonetheless give us a false positive at the p=.05 level, whereas we probably won't get any at the p=.01 level. So we've eliminated one false positive, but at the cost of nearly 1/3 of our true positives.
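(Here's a sketch of those numbers, assuming "50 subjects" means 25 per group and a two-tailed t-test; again, statsmodels rather than the calculator linked above.)

```python
from statsmodels.stats.power import TTestIndPower

power = TTestIndPower()
d, n_per_group, studies = 0.8, 25, 20
for alpha in (0.05, 0.01):
    hit = power.power(effect_size=d, nobs1=n_per_group, alpha=alpha,
                      ratio=1.0, alternative='two-sided')
    print(f"alpha = {alpha}: power ~ {hit:.0%}, "
          f"~{studies * hit:.0f} of {studies} real effects found, "
          f"~{studies * alpha:.1f} false positives expected from {studies} null studies")
```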
No better prediction of replication than replication
Perhaps the easiest method is to just replicate studies before publishing them. The chances of getting the same spurious result twice in a row are vanishingly small. Most of the issues I outlined above -- data massaging aside -- inflate your false positive rate, not your replication rate. Test 14 different fruits to see if any of them increase IQ scores, and you have over a 50% chance that one of them will spuriously do so. Test that same fruit again, and you've only got a 5% chance of repeating the effect. So replication decreases your false positive rate 20-fold. Similarly, data massaging may get you that coveted p < .05, but the chances of the same massages producing the same result again are very, very low.
True positives aren't nearly so affected. Again, a typical power level is 0.80 -- 80% of the time that an effect is really there, you'll be able to find it. So when you try to replicate a true positive, you'll succeed 80% of the time. So replication decreases your true positives by only 20%.
So let's say the literature has a 30% false positive rate (which, based on current estimates, seems quite reasonable). Attempting to replicate every positive result prior to publication -- and note that it's extremely rare to publish a null result (no effect), so almost all published results are positive results -- should decrease the false positives 20-fold and the true positives by 20%, leaving us with a 2.6% false positive rate. That's a huge improvement.
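The arithmetic behind that 2.6% figure, under the assumptions above (30% of published positives are false, spurious effects replicate about 5% of the time, real effects replicate about 80% of the time), is easy to check:

```python
false_pos_rate = 0.30   # assumed share of published positives that are false
p_rep_false = 0.05      # a spurious effect replicates about 5% of the time
p_rep_true = 0.80       # a real effect replicates about 80% of the time (typical power)

surviving_false = false_pos_rate * p_rep_false       # 0.015
surviving_true = (1 - false_pos_rate) * p_rep_true   # 0.56
print(f"{surviving_false / (surviving_false + surviving_true):.1%}")   # about 2.6%
```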
So why not replicate more?
So why don't people replicate before publishing? If 30% of your own publishable results are false positives, and you eliminate them, you've just lost 30% of your potential publications. You've also lost 20% of your true positives, by the way, which means overall you've decreased your productivity by about 43%. And that's without counting the time it takes to run the replication. Yes, it's nice that you've eliminated your false positives, but you also may have eliminated your own career!
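One way to arrive at a number in that ballpark, using the same assumptions as above (my sketch; the post doesn't show its arithmetic, but these assumptions land very close to 43%):

```python
false_pos_rate, p_rep_false, p_rep_true = 0.30, 0.05, 0.80

# You keep the false positives that happen to replicate and the true positives that do.
kept = false_pos_rate * p_rep_false + (1 - false_pos_rate) * p_rep_true   # 0.575
print(f"Publishable results lost by replicating first: {1 - kept:.1%}")   # about 42.5%
```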
When scientists are ranked, they're largely ranked on (a) number of publications, (b) number of times a publication is cited, and (c) quality of journal that the publications are in. Notice that you can improve your score on all of these metrics by publishing more false positives. Taking the time to replicate decreases your number of publications and eliminates many of the most exciting and surprising results (decreasing both citations and quality of journal). Perversely, even if someone publishes a failure to replicate your false positive, that's a citation and another feather in your cap.
I'm not saying that people are cynically increasing their numbers of bogus results. Most of us got into science because we actually want to know the answers to stuff. We care about science. But there is limited time in the day, and all the methods of eliminating false positives take time. And we're always under incredible pressure to pick up the pace of research, not slow it down.
I'm not sure how to solve this problem, but any solution I can think of involves some way of tracking not just how often a researcher publishes or how many citations those publications get, but how often those publications are replicated. Without having a way of tracking which publications replicate and which don't, there is no way to reward meticulous researchers or hold sloppy researchers to account.
Also, I think a lot of people just don't believe that false positives are that big a problem. If you think that only 2-3% of published papers contain bogus results, there's not a lot of incentive to put in a lot of hard work learning better statistical techniques, replicating everything, etc. If you think the rate is closer to 100%, you'd question the meaning of your own existence. As long as we aren't keeping track of replication rates, nobody really knows for sure where we are on this continuum.
That's my conclusion. Here's Lehrer's:
The decline effect is troubling because it reminds us how difficult it is to prove anything. We like to pretend that our experiments define the truth for us. But that's often not the case. Just because an idea is true doesn't mean it can be proved. And just because an idea can be proved doesn't mean it's true. When the experiments are done, we still have to choose what to believe.

I say it again: huh?
Priest, Altars and Peer Review
David Dobbs at Neuron Culture is complaining about NASA and peer review:
A NASA spokesperson has dismissed a major critique of the Science arsenic bug paper based not on the criticism's merits, but on its venue -- it appeared in a blog rather than a peer-reviewed journal. Apparently ideas are valid (or not) based not on their content, or even the reputation of the author, but on where they're published.

I'm not known for my strong endorsement of the fetishism of peer review, but even so I think Dobbs is being somewhat unfair. My reading of history is that scientists have been plugging the peer-review mantra because they're tired of having to respond to ignorant assholes who appear on Oprah spouting nonsense. I mean, yes, you can address wacko claims about vaccines causing autism or the lack of global warming on their merits (they have none), but it gets tiresome to repeat. In any case, relatively few members of the public can follow the actual arguments, so it becomes an issue of who you believe. And that's a hard game to win, since saying "so-and-so doesn't know what they're talking about" sounds elitist even when it's true, and "elitism" (read: "meritocracy") is for some reason unpopular.
Focusing on peer review as a mechanism for establishing authority is convenient, because the public (thinks it) understands the mechanisms. You're not saying, "Believe me because I am a wise scientist," but "Believe the documented record." And since Jenny McCarthy doesn't publish in peer-reviewed journals, you can (try to) exclude her and other nuttos from the conversation.
So I think there are good reasons for a NASA spokesman, when speaking with a reporter, to dismiss blogs. Taking a critique in a blog seriously in public is only going to open the floodgates. I mean, there are a *lot* of blogs out there. That doesn't mean that the scientists involved aren't taking a serious critique by a serious scientist seriously just because the criticism appeared in a blog. I hope that they are, and we shouldn't read too much into NASA's official statement.
All that said, I'm not sure focusing on peer-reviewed science has been helping very much. I mean, McCarthy still gets booked on Oprah anyway.
You're Wrong
John Ioannidis has been getting a lot of press lately. He reached the cover of the last issue of The Atlantic Monthly. David Dobbs wrote about him here (and a few years ago, here). This is the doctor known for his claim that around half of medical studies are false -- that is, about 80% of non-randomized trials and even 25% of randomized trials. These are not just dinky findings published in dinky journals: of 49 of the most highly-regarded papers published over a 13-year period, 34 of the 45 that claimed to have found effective treatments had been retested, and 41% of those tests failed to replicate the original result.
Surprised?
Quoting the Atlantic Monthly:
Ioannidis initially thought the community might come out fighting. Instead, it seemed relieved, as if it had been guiltily waiting for someone to blow the whistle...

Well, it's not surprising. The appropriate analog in psychology is the randomized trial, of which in medicine 25% turn out to be false according to this research (which hopefully isn't itself false). As Ioannidis has detailed, the system is set up to reward false positives. Journals -- particularly glamour mags like Science -- preferentially accept surprising results, and the best way to have a surprising result is to have one that is wrong. Incorrect results happen: a result that is "statistically significant" at the p=.05 level has, by construction, a 5% chance of arising by chance when there is no real effect. So even in theory, a certain fraction of published experiments will have reached the wrong conclusions. If journals are biased in favor of accepting exactly those experiments, the proportion should be higher still.
There are other factors at work. Some scientists are sloppier than others, and many of the ways in which one can be sloppy lead to significant and/or surprising results. For instance, when there is no real effect, 5% of experiments will nonetheless produce a false positive. There are labs that will run the same experiment 6 times with minor tweaks. There is a (1-.95^6) * 100 = 26.5% chance that one of those runs will come out significant. The lab may then publish only that final experiment and not report the others. If sloppy results lead to high-impact publications, survival of the fittest dictates that sloppy labs will reap the accolades, get the grant money, tenure, etc.
Keep in mind that often many different labs are trying to do the same thing. For instance, in developmental psychology, one of the deep questions is what is innate? So many labs are testing younger and younger infants, trying to find evidence that these younger infants can do X, Y or Z. If 10 labs all run the same experiment, there's a (1-.95^10) * 100 = 40.1% chance of one of the labs finding a significant result.
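Both the six-tweaks number and the ten-labs number come from the same formula -- the chance of at least one false positive in k independent shots at p < .05 -- which is easy to check:

```python
for k in (6, 10):
    print(f"{k} attempts: {1 - 0.95 ** k:.1%} chance of at least one false positive")
# Prints 26.5% for 6 attempts and 40.1% for 10.
```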
Countervailing Forces
Thus, there are many incentives to publish incorrect data. Meanwhile, there are very few disincentives to doing so. If you publish something that turns out not to replicate, it is very unlikely that anyone will publish the failure to replicate -- simply because it is very difficult to publish a failure to replicate. If someone does manage to publish such a paper, it will certainly be in a lower-profile journal (which is, incidentally, a disincentive to publishing such work to begin with).
Similarly, consider what happens when you run a study and get a surprising result. You could replicate it yourself to make sure you trust the result. That takes time, and there's a decent chance it won't replicate. If you do replicate it, you can't publish the replication (I tried to in a recent paper submission, and a reviewer insisted that I remove reference to the replication on account of it being "unnecessary"). If the replication works, you'll gain nothing. If it fails, you won't get to publish the paper. Either way, you'll have spent valuable time you could have spent working on a different study leading to a different paper.
In short, there are good reasons to expect that 25% of studies -- particularly in the high-profile journals -- are un-replicable.
What to do?
Typically, solutions proposed involve changing attitudes. The Atlantic Monthly suggests:
We could solve much of the wrongness problem, Ioannidis says, if the world simply stopped expecting scientists to be right. That's because being wrong in science is fine, and even necessary ... But as long as careers remain contingent on producing a stream of research that's dressed up to seem more right than it is, scientists will keep delivering exactly that.

I've heard this idea expressed elsewhere. In the aftermath of Hausergate, a number of people suggested that a factor was the pressure-cooker that is the Harvard tenure process, and that Harvard needs to stop putting so much pressure on people to publish exciting results.
So the idea is that we should stop rewarding scientists for having interesting results, and instead reward the ones who have uninteresting results? Journals should publish only the most staid research, and universities should reward tenure not based on the number of highly-cited papers you have written, but based on how many papers you've written which have never been cited? I like that idea. I can run a boring study in a few hours and write it up in the afternoon: "Language Abilities in Cambridge Toddlers are Unaffected by Presence or Absence of Snow in Patagonia." That's boring and almost certainly true. And no one will ever cite it.
Seriously, though, public awareness campaigns telling people to be more responsible are great, and sometimes they even help, but I don't know how much can be done without changing the incentive structure itself.
Reputation
I don't have a solution, but I think Ioannidis again points us towards one. He found that papers continue to be cited long after they have been convincingly and publicly refuted. I was discussing this issue with a colleague some time back and mentioned a well-known memory paper that nobody can replicate. Multiple failures-to-replicate have been published. Yet I still see it cited all the time. The colleague said, "Wow! I wish you had told me earlier. We just had a student spend two years trying to follow up that paper, and the student just couldn't get the method to work."
Never mind that researchers rarely bother to replicate published work -- even if they did, we have no mechanism for tracking which papers have been successfully replicated and which papers can't be replicated.
Tenure is awarded partly on how often your work has been cited, and we have many nice, accessible databases that will tell you how often a paper has been cited. Journals are ranked by how often their papers are cited. What if we rewarded researchers and journals based on how well their papers hold up to replication? Maybe it would help, maybe it wouldn't, but without a mechanism for tracking this information, the question is at best an intellectual exercise.
Even if such a database wasn't ultimately useful in decreasing the number of wrong papers, at least we'd know which papers were wrong.
Tables, Charts & Figures
APA format (required for most journals I read/publish in) stipulates that figures and tables should not be included in the parts of the manuscript in which you actually talk about them; instead, they should all come at the end of the manuscript. I understand how this might be of use to the type-setter, but I find it a pain when actually trying to read a manuscript. I know I'm not the only one, because in some of the manuscripts I've submitted for review before, I actually violated APA format and put the figures in-line, and the reviewers thanked me in their reviews and suggested that this should become journal policy. (The idea is that after acceptance, you resubmit with the figures and tables in APA format, but that during the review process, you put them in-line.)
With that in mind, I left my figures in situ in my last journal submission. The staff at the journal promptly returned the manuscript without review, saying that they couldn't/wouldn't review a paper that didn't follow APA guidelines on tables and figures.
Obviously I reformatted and resubmitted (the customer/journal is always right), but I put this out to the blogosphere: does anyone actually like having the figures at the end of the manuscript?
Masked review?
I just submitted a paper to the Journal of Experimental Psychology: General. Like many journals, this journal allows masked review -- that is, at least in theory, the reviewers won't know who you are.
On the whole, I'm not sure how useful blind review is. If you're pushing an unpopular theoretical position, I expect reviewers will jump on you no matter what. If you're simply personally so unpopular that no one will sign off on your work, you have problems that masked review won't solve.
But the real reason I chose not to use blind review was laziness -- it's a pain to go through the manuscript and remove anything that might give away who you are (assuming this is even possible -- for instance, if you cite your own unpublished work a lot, that's going to give away the game and there's not much you can do about that, except not cite that work).
But I'm curious how other people feel about this. Do you usually request masked review? Those who have reviewed papers, do you treat masked papers differently from signed ones?
photo: Ben Fredericson (xjrlokix)
Making data public
Lately, there have been a lot of voices (e.g., this one) calling for scientists to make raw data immediately available to the general public. In the interest of answering that call, here's some of my raw data:
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 0 | helped | 1 | daxed | 59 | 0 | 1 | 1 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 1 | heard | 2 | blied | 33 | 0 | 0 | 2 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 2 | decelerated | 2 | lenked | 45.4 | 1 | 0 | 2 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 3 | startled | 1 | gamped | 31.1 | 1 | 0 | 3 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 4 | prompted | 2 | henterred | 59 | 0 | 1 | 4 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 5 | engrossed | 2 | nazored | 31.1 | 0 | 1 | 5 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 6 | obliged | 1 | ablined | 59 | 1 | 0 | 6 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 7 | tantalized | 2 | bosined | 31.1 | 1 | 1 | 7 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 8 | bled for | 1 | breened | 31.3 | 1 | 1 | 8 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 9 | loathed | 2 | gaubled | 31.2 | 0 | 0 | 9 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 10 | mourned for | 1 | ginked | 31.3 | 1 | 1 | 10 |
| female | no | English | English | no | no | yes | United States | 1148478773 | 312 | 11 | wounded | 2 | jarined | 31.1 | 0 | 0 | 10 |
Do you feel enlightened? Probably not. Raw data isn't all that useful if you don't know how it was collected, what the different numbers refer to, etc. Even if I told you this is data from this experiment, that probably wouldn't help much. Even showing you the header rows for these data will help only so much:
| sex | subject_already | nat_language | prime_language | autism | dyslexia | psychiatric | country | randomID | startTime | trial | word | choice | conclusion | wordClass | whichLocation | because | totalCorrect |
Some things are straightforward. Some are not. It's important to know that I record data with a separate row for every trial, so each participant has multiple trials. Also, I record all data, even data from participants who did not complete the experiment. If you're unaware of that, your data analyses would come out very wrong. I also have some codes I use to mark that the participant is an experimenter checking to make sure everything is running correctly; you'd need to know those. And it's key to know how responses are coded (it's not simply "right" or "wrong" -- in fact, the column called totalCorrect does not record whether the participant got anything correct).
The truth is, even though I designed this study myself and wrote the program that outputs the data, every time I go back to data from a study I haven't worked with in a while, it takes me a few hours to orient myself -- and I'm actually relatively good about documenting my data.
So if a law were passed -- as some have advocated for -- requiring that data be made public, one of two things will happen: either people will post uninterpretable data like my mini-spreadsheet above, or they'll spend huge amounts of time preparing their data for others' consumption. The former will help no one. And the latter is expensive, and someone has to pay for that. And this all has to be balanced against the fact that there are very few data sets anyone would want to reanalyze.
There are important datasets that should be made available. And in fact there are already mechanisms for doing this (in my field, CHILDES is a good example). This kind of sharing should be encouraged, but mandated sharing is likely to cause more problems than it solves.
Making research open-access
The US Office of Science & Technology Policy is continuing to receive comments on the future of its open-access policy. The comments have been overwhelmingly in favor of making research papers available for free electronically, with many suggesting they be available immediately on publication (or even upon acceptance, which can be as much as a year or two before publication). The comment that was most on-point, from my perspective, came from one person who wondered what purpose publishers even serve beyond copy-editing, given that all the work is done by volunteers (and, come to think of it, I helped with the copy-editing of my last paper myself).
Some comments go even farther, suggesting that all data should be made public immediately. I'm not sure about that idea. Preparing data so that others can easily understand it is no small task. When I go back to look at data I collected a few years ago, it often takes me hours to interpret it, and I remember what the study was about. In fact, the basic purpose of a paper is to take data and make them easy to understand. Finally, while there are rare cases where I wish I had access to somebody's original data, for most studies I can't even imagine what value having the original data would have.
Certainly, there are cases in which having the raw data would be valuable. But is the value worth the cost of preparing all data from all studies for the public? Maybe in fields in which the data are easier to publish it is. In psychology, I'm not sure.
A Poorly-edited Editors' Handbook
Most psychology journals require that papers follow the American Psychological Association's style guide. This guidebook covers everything from the structure of the paper to the right way of formatting section headings, and it is updated every so often.
The sixth edition was released over the summer, and it seems it had to be recalled due to "errors and inconsistencies."
I haven't actually seen the 6th edition myself (I just bought the 5th edition a couple years ago and am not in a hurry to buy the new one). On the whole, it's a good manual and the rules make sense. However, reviewers will sometimes thank you for breaking the more frustrating rules, like the rule that charts and tables should be appended to the end of the manuscript rather than included in the body of the document. This probably made sense in the day of type-written manuscripts, but it makes modern electronic manuscripts very hard to read. Electronic documents are wonderful for many things, but the ease of flipping back and forth from one section to another is not one of them.
Hopefully the 6th edition fixed some of those out-dated rules. But I'll wait to find out once the fixed version appears.


.png)
.png)
.png)





