Field of Science

Making data public

Lately, there have been a lot of voices (e.g., this one) calling for scientists to make raw data immediately available to the general public. In the interest of answering that call, here's some of my raw data:


female no English English no no yes United States 1148478773 312 0 helped 1 daxed 59 0 1 1
female no English English no no yes United States 1148478773 312 1 heard 2 blied 33 0 0 2
female no English English no no yes United States 1148478773 312 2 decelerated 2 lenked 45.4 1 0 2
female no English English no no yes United States 1148478773 312 3 startled 1 gamped 31.1 1 0 3
female no English English no no yes United States 1148478773 312 4 prompted 2 henterred 59 0 1 4
female no English English no no yes United States 1148478773 312 5 engrossed 2 nazored 31.1 0 1 5
female no English English no no yes United States 1148478773 312 6 obliged 1 ablined 59 1 0 6
female no English English no no yes United States 1148478773 312 7 tantalized 2 bosined 31.1 1 1 7
female no English English no no yes United States 1148478773 312 8 bled for 1 breened 31.3 1 1 8
female no English English no no yes United States 1148478773 312 9 loathed 2 gaubled 31.2 0 0 9
female no English English no no yes United States 1148478773 312 10 mourned for 1 ginked 31.3 1 1 10
female no English English no no yes United States 1148478773 312 11 wounded 2 jarined 31.1 0 0 10


Do you feel enlightened? Probably not. Raw data isn't all that useful if you don't know how it was collected, what the different numbers refer to, etc. Even if I told you this is data from this experiment, that probably wouldn't help much. Even showing you the header rows for these data will help only so much:


sex subject_already nat_language prime_language autism dyslexia psychiatric country randomID startTime trial word choice conclusion wordClass whichLocation because totalCorrect


Some things are straightforward. Some are not. It's important to know that I record data with a separate row for every trial, so each participant has multiple rows. Also, I record all data, even data from participants who did not complete the experiment. If you were unaware of that, your analyses would come out very wrong. I also use some codes to mark sessions in which the "participant" is really an experimenter checking that everything is running correctly. You'd need to know those. It's also key to know how responses are coded (it's not simply "right" or "wrong" -- and in fact the column called totalCorrect does not record whether the participant got anything correct).
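To make that bookkeeping concrete, here is a minimal Python sketch of what a reanalysis would have to do before computing anything. It is not the code actually used for this study. It assumes the file is tab-separated (fields like "United States" contain spaces), that randomID identifies one participant, and that "completed" means a participant has the expected number of trial rows. The names load_trials, complete_participants, expected_trials and test_ids are hypothetical, and because the post does not say how experimenter test runs are marked, they are passed in here as a plain set of IDs.

```python
import csv
import io
from collections import defaultdict

# Column names copied from the header row shown above.
HEADER = [
    "sex", "subject_already", "nat_language", "prime_language",
    "autism", "dyslexia", "psychiatric", "country", "randomID",
    "startTime", "trial", "word", "choice", "conclusion",
    "wordClass", "whichLocation", "because", "totalCorrect",
]


def load_trials(text):
    """Turn raw rows into dicts keyed by column name.

    Assumption: fields are separated by tabs, one trial per line.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(HEADER, row)) for row in reader if row]


def complete_participants(trials, expected_trials, test_ids=frozenset()):
    """Group trial rows by participant and keep only usable participants.

    Assumptions: randomID identifies a participant, a complete session has
    exactly expected_trials rows, and test_ids lists experimenter test runs.
    """
    by_participant = defaultdict(list)
    for trial in trials:
        by_participant[trial["randomID"]].append(trial)
    return {
        pid: rows
        for pid, rows in by_participant.items()
        if pid not in test_ids and len(rows) == expected_trials
    }
```

Even this much only gets you to one labeled record per trial for participants who finished. It says nothing about what the values in columns like choice or because actually mean, which is the harder part.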

The truth is, even though I designed this study myself and wrote the program that outputs the data, every time I go back to data from a study I haven't worked with in a while, it takes me a few hours to orient myself -- and I'm actually relatively good about documenting my data.

So if a law were passed -- as some have advocated -- requiring that data be made public, one of two things would happen: either people would post uninterpretable data like my mini-spreadsheet above, or they would spend huge amounts of time preparing their data for others' consumption. The former would help no one. The latter is expensive, and someone has to pay for it. And all of this has to be balanced against the fact that there are very few datasets anyone would want to reanalyze.

There are important datasets that should be made available. And in fact there are already mechanisms for doing this (in my field, CHILDES is a good example). This kind of sharing should be encouraged, but mandated sharing is likely to cause more problems than it solves.

2 comments:

Ryan said...

Many of the issues could be solved by making a standardized format, at least for the typical fields. If everyone is outputting the same sort of ID, demographic information, etc. then it reduces the effort for understanding it.

The rest of the data interpretation issues could be solved by also releasing the source code for your program. The act of writing a computer program requires that you bring all of your assumptions about the data into one place. Sure, source code isn't transparent to everyone, but you can't throw a stone without hitting a CS major nowadays.

GamesWithWords said...

Releasing code is an option...assuming, of course, that there was a computer program involved (not always the case). Though I'm not sure how many people would really go to the effort to read source code.

Something I didn't mention, of course, is storage costs. Some datasets involve a gigabyte or more of data (I'm thinking of ERP or fMRI studies, for instance). One would need to figure out how to host that. This won't necessarily get easier as storage gets cheaper, as the datasets themselves have been growing in size.