Field of Science

Data wants to be free

It seems that the National Science Foundation will be asking new grant applicants to submit a data management plan, apparently including plans for how they will make their data available to others.

I have mixed feelings about this. I certainly approve of high-value data sets being made available. I've benefitted a great deal from the wonderful people who put together the Penn Treebank, VerbNet and similar projects. There are now some useful data sets included in libraries for R as well. I intend to make the summary data from my pronoun studies available when I publish the associated papers.

That said, getting data together in a form that's interpretable and usable by somebody else is hard. However much I document my own data, whenever I have to go back and look at some old data, it takes hours if not days to figure out what I'm looking at. And I'm the one who created it. Fully documenting a data set for someone not associated with the project takes time.

Given that NSF will be paying the salaries of the people who spend the time to document the data sets, it's reasonable to ask whether it's cost-effective. Just how much demand is there for data from other labs? I can think of many papers for which I wish I had the original stimuli. The number for which I want the original data is much smaller (though there are some for which it would be really useful).

2 comments:

Mark J. said...

Yes, I agree! In an ideal world all data would be optimally packaged and documented, and freely available to anyone. But doing this requires valuable resources, and the time spent doing this is time not spent on other things, such as research, teaching and other outreach activities (which the NSF increasingly wants from researchers).

Bob Carpenter said...

That's a very good point. I have a similar response when I hear people say "let's try to make research software more professional".

You might also mention that it's not just time, but skill -- it takes some practice to define reusable standards, write understandable documentation, etc. Just like any other kind of writing. This goes double for coding.

Here's the devil's advocate position: if your data's not in good enough order to distribute, you probably haven't done your experiments cleanly enough to be reliable. And certainly not cleanly enough to be directly replicable.