It seems that the National Science Foundation will be asking new grant applicants to submit a data management plan, apparently including plans for how to make their data available to others.
I have mixed feelings about this. I certainly approve of high-value data sets being made available. I've benefitted a great deal from the wonderful people who put together Penn Tree Bank, VerbNet and similar projects. There are now some useful data sets included in libraries for R as well. I intend to make the summary data from my pronoun studies available when I publish the associated papers.
That said, getting data together in a manner that's interpretable and usable by somebody else is hard. However much I document my own data, whenever I go back to look at an old data set it takes hours, if not days, to figure out what I'm looking at. And I'm the one who created it. Fully documenting a data set for someone not associated with the project takes time.
Given that NSF will be paying the salaries of the people who spend the time to document the data sets, it's reasonable to ask whether it's cost-effective. Just how much demand is there for data from other labs? I can think of many papers for which I wish I had the original stimuli. The number for which I want the original data is much smaller (though there are some for which it would be really useful).