When data are not archived, they rapidly become unavailable

The value of data for long-term studies increases over time, but data availability decreases over time. The lack of available raw data from past decades hinders long-term research and wastes funding, say proponents of open-access online data archives. Everyone assumes data are lost over time: metadata are lost, scientists age and become more difficult to contact, and storage devices become obsolete, all processes that contribute to the data decay described by William K. Michener in a 1997 paper. But no one seemed to know how fast data were lost. Tim Vines, managing editor of Molecular Ecology, along with several collaborators, found that unarchived data are lost much faster than had been presumed.

Vines and colleagues focused on papers published over a 20-year span (1991-2011) that collected the same type of data, measured in the same way, and used the same statistical analysis (they plan to test the reproducibility of the analysis in a future paper). The individual data for each paper also had to be in the care of one of the original authors. The researchers made every effort to reach an author of each study by email.

For papers where the authors responded with the status of their data, the odds that the data were still available went down by 17% per year. For this subset of studies, 100% of the data from papers published in 2011 could be retrieved. By contrast, data from papers published in 1991—the earliest year included in the study—were only available about 30% of the time. “I am going to speculate that the obsolete hardware starts becoming a much bigger problem after 10 or 12 years,” said Vines in a recent interview. “The other obstacle was actually getting through to the authors in the first place. The proportion of papers where at least one of the emails appeared to work went down by about 7% a year. That’s not as dramatic as the 17% loss per year in data availability, but it still adds to the overall decay and the unavailability of these data, because if you can’t get hold of the authors in the first place and all the emails you can find are broken, then you just can’t even begin to get hold of the data…. Papers before 2000 had very few emails either in the paper or that we could find online, and the proportion of the emails listed that were broken was pretty high, at 50%.”
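A 17% yearly drop in the odds is not the same as a 17-point yearly drop in the probability, so a short sketch of the arithmetic may help. The Python below assumes a constant multiplicative decline in the odds of 17% per year; the function names and the 90% starting probability are illustrative assumptions, not figures or methods from the study. Note that a starting probability of exactly 100% corresponds to infinite odds, so the sketch requires a value strictly between 0 and 1.

```python
def odds_from_probability(p: float) -> float:
    """Convert a probability strictly between 0 and 1 into odds."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return p / (1.0 - p)


def probability_from_odds(odds: float) -> float:
    """Convert odds back into a probability."""
    return odds / (1.0 + odds)


def availability_after(p_start: float, years: int,
                       yearly_odds_drop: float = 0.17) -> float:
    """Probability that data are still available after `years`,
    assuming the odds shrink by `yearly_odds_drop` every year."""
    odds = odds_from_probability(p_start) * (1.0 - yearly_odds_drop) ** years
    return probability_from_odds(odds)


if __name__ == "__main__":
    # Hypothetical starting point: 90% availability shortly after publication.
    for years in (0, 5, 10, 20):
        p = availability_after(0.90, years)
        print(f"{years:2d} years: {p:.1%}")
```

Because the decline acts on the odds, the probability falls slowly at first while availability is high and faster as the odds approach even.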

The data from the vast majority of papers are not stored in a public archive (such as Dryad), an option that ensures the data can be checked and remain available to future generations of scientists interested in long-term research. As Vines pointed out: “Only 13 of 529 papers that met our initial criteria had made their data available at publication. That’s 2.4%.” The data behind the remaining roughly 98% were susceptible to rapid deterioration in accessibility. Vines asserts that scientists need to archive their data to improve public confidence in science.

To solve the problem, scientists must be both encouraged and required to archive their data. As Vines and his colleagues showed in a paper published earlier this year, requiring that data be archived, and enforcing that requirement, is much more successful than encouragement alone. If scientists included their data archiving plans in the methods section, those plans would be peer reviewed along with the rest of the paper. Journal editors could then ensure that the data are archived before the paper goes to press. Vines says that Molecular Ecology will soon adopt a policy that further encourages thorough data archiving: when all else is equal, editors will favor papers that show exemplary data archiving methods.

Read more about Vines’s research and his proposed solutions in this recent American Scientist report.
