Field of Science

Sorry, Sharing My Data is Illegal

I recently got back from collecting data in Russia. This particular study brought into focus for me the issues involved in making experimental data public. In this study, I videotape people as they listen to stories, look at pictures, and answer questions about the stories. The videotape is key, since what I'm actually analyzing is the participants' eye-gaze direction during different parts of the stories (this can be used to partially determine what the participants were thinking at different points in time).

Sharing raw data would mean sharing the videos...which I can't do. These videos are confidential, and there's no easy way of making them anonymous, since they are close-up videos of people's faces. I could ask participants to sign a waiver allowing me to put their videos on the Internet, but I suspect most of my participants would just refuse to participate. Many were concerned enough about the video as it was.

Now, I could share the semi-processed data -- that is, not the videos themselves but the information gleaned from them. I already discussed some of the problems with that, namely that getting the data into a format that's easy for someone else to analyze is extremely time-consuming.

This isn't an issue with just one study -- more than half the studies I run are eye-tracking studies. Many of the rest are EEG studies, which can produce several gigabytes of data each, making it simply impractical to share the data (plus, when dealing with brain data, anonymity is even more of a concern). I do some kid studies where I simply write down participants' responses, but if your goal were to check that I'm recording my data correctly, that wouldn't help -- what you'd want are tapes of the experiments, but good luck convincing the ethics board to allow me to post videos of young children participating in experiments on the Internet.

[Those are my laboratory studies. Data from my Web-based studies is actually relatively easy to share -- though you'd have to be proficient in ActionScript to understand it.]

Certainly, there are many behavioral researchers who wouldn't have this problem. But there are many who would. Mandating that everyone make their data publicly available would mean that many kinds of experiments simply couldn't be done anymore.


Mr. Gunn said...

I don't think it's entirely impractical to share GBs of data. There are repositories (your institution may have one) which are set up precisely for this.

Your point about the videos is a good one, though, and I think the best you could do there would be to release the data extracted therefrom.

More to the point, I think it's great that you're thinking about these sorts of things in the first place.

GamesWithWords said...

The amount of data grows very quickly. As far as sharing data in extracted form:

There are two reasons people suggest sharing data. One is as a check against fraud. Of course, "extracted" data is easy to fake, so that would be of no help.

The other is to allow people to reanalyze the data for new purposes. When you consider that the majority of papers are never even cited, it seems unlikely that there's much desire to reanalyze the vast bulk of them. So you have to consider the very large cost of annotating the data such that it's interpretable by someone else, which I've discussed here previously.

Valuable datasets should be shared. I've definitely reanalyzed other people's data before, and I have made my data available to others. But those are special cases. Using a lot of time that could be spent running a new study in order to annotate data nobody will ever want to see strikes me as inefficient. I suspect a one-size-fits-all solution is likely to hurt as much as it helps.

John Hawks said...

Does your human subjects protocol not specify what can and can't be released? No data release guidelines can trump the human subjects protocol.

As far as making things "easy to analyze" -- I think this is a straw man. Most things in science are hard to analyze, and if anyone needs your data (for example, because they are interested in a meta-analysis) they will be motivated to take the effort. They can't do this at all if the data aren't available to them.

There is a third issue beyond combating fraud and enhancing third-party studies. Keeping your data in a form that is *not* easily usable by others is bad practice. You may think your effort is better allocated to further studies, but what worth will these have to you in five years when you want to do a meta-analysis and the software platform is obsolete?

GamesWithWords said...

@John -- if you think the ease of analysis is a straw man, I suggest you try analyzing the data in one of my previous posts.

You are right that it would be ideal to carefully document everything (and then document the documentation, so you can find it when you need it). But it's not without cost. And while there is a (very) slight chance you might want to reanalyze the data some day, it's also a truism that reanalyzing data to answer questions the study was not designed to answer is rarely particularly productive. You're usually better off running a new study...

...unless your data are extremely hard to gather. So as I've been saying all along, careful documentation and free exchange of data are appropriate in some cases, but you need to weigh the costs against the benefits.