“Making data more accessible needs to make sense to the specific field of research”
Name: Prof Jeppe C. Dyre
Position: Professor of Physics, and leader of the research group “Glass and Time”
Institution: Roskilde University, Department of Science and Environment
More info: Home Page
ORCID ID: http://orcid.org/0000-0002-0770-5690
An interview with Prof Jeppe C. Dyre on 15 June 2017
It was suggested by the Glass and Time postdoc Bo Jakobsen that we should make a data repository of our data because it may be difficult to find reliable experimental data in the literature in a form that can be used for further analysis. Although it is always possible to write directly to the author of a paper, it became obvious that it would be good to have publicly-available data.
We opened our data repository http://glass.ruc.dk/data back in 2008. We make available the finished data. It wouldn’t make sense to make the raw measurements available, as you need to work on these to make sense of them. Basically, our procedure for each of our experimental papers is to give the public the data presented in the paper’s figures. We provide the data in a simple format (ASCII file), which can be imported into any software. A ReadMe.txt file says how the data are structured; the actual data are compressed into a zip file, which anyone can download. This is a minimal solution, but it does give people who want to build models input for doing the modelling.
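As an illustration of how such a download could be used (this sketch is not taken from the interview), the Python snippet below reads a zip archive containing a ReadMe.txt and plain ASCII data files. The file name fig1.dat, the two-column whitespace-separated layout, the '#' comment convention, and the helper names are all assumptions made for the example; the real layout of each data set is whatever its ReadMe.txt describes. The archive is built in memory so the example runs without downloading anything.

```python
import io
import zipfile


def read_ascii_columns(text):
    """Parse whitespace-separated numeric columns, skipping blank and '#' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rows.append([float(value) for value in line.split()])
    return rows


def load_archive(zip_bytes, readme_name="ReadMe.txt"):
    """Return (readme_text, {filename: rows}) for the files in a zip archive."""
    readme = ""
    datasets = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # directory entry, nothing to read
            text = archive.read(name).decode("ascii")
            if name == readme_name:
                readme = text
            else:
                datasets[name] = read_ascii_columns(text)
    return readme, datasets


def make_example_archive():
    """Build a small stand-in archive (hypothetical contents, not real data)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("ReadMe.txt", "fig1.dat: column 1 = x, column 2 = y\n")
        archive.writestr("fig1.dat", "# x y\n1.0 2.0\n2.0 4.5\n")
    return buffer.getvalue()


readme, datasets = load_archive(make_example_archive())
print(readme.strip())
print(datasets["fig1.dat"])
```

Because the data are plain ASCII, only the standard library is needed here; the same rows could equally be loaded into a spreadsheet or any plotting package.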
Our research field, glass science, is not a big research field, but several researchers worldwide have benefitted from our data repository. It is known in the glass community that if you have an idea or theory, you can go to our website and download data. We ask people to send us an email if they download and use the data. They don’t always do that, which we know because we can track how many people have downloaded the data, but we can see in the references or acknowledgements of several published papers that people have made use of our data. As an example, yesterday we had a visitor from the USA, who had collected data from three groups, including ours, for his own work.
In computer simulations, if you have written a proper and well-documented paper, it should be possible for anyone to reproduce the data by using the same software and doing exactly the same thing in order to get the same numbers. It doesn’t make much sense to store those data in a repository. We’re storing them because we have backup, and if people ask, we’ll be happy to provide them. That is also scientifically good practice. If people ask after five years, for example, whether they can have the data because they can’t reproduce it, we need to have the data available.
The question is how to make simulation data available to the community in a constructive way. We’re thinking about that, but it’s not so obvious because these huge amounts of data are useless if you can’t find your way around them. For many years we had Denmark’s fastest computer, and I still think that it is one of the fastest ones: 400 teraflops of GPU-based computing power, just for doing our simulations. We were first movers in using graphics cards in 2008, and we have the fastest code for doing molecular dynamics simulations, publicly available at http://rumd.org .
We could describe the data from the simulations in the same way as we describe the data in our experimental papers, but I’m not sure how much sense that would make because people would want more information. We have very long time series where we have thousands of particles and we calculate their positions every 10⁻¹⁵ seconds, and we have a lot of snapshots and we go on for billions and billions of time steps. More or less all this information is stored. From this you can extract a lot of information. If you had all these data, you wouldn’t need to do the simulation yourself. That would, of course, be nice, but it wouldn’t make sense to transfer these data over the internet. It would be too much data and would take too much time. In upcoming legislation it is really important that scientists are consulted, so that they don’t have to spend too much time on data management rather than doing research, and that it is recognised that data management differs from field to field.
We have not done a lot to promote our data repository for the experimental data, but sometimes at conferences we mention that data are available on the internet, and also in papers we give a footnote or reference with a link to the data repository. We do not require users to log in to gain access to data – we do not consider this necessary as users (to our knowledge) always cite the related publication or the repository. We don’t want bureaucratic obstacles.
We have not, however, considered giving access to unpublished data. It would not make sense, as reliable data from Glass and Time will be published sooner or later. We should be the first to publish our own data.
When does data become data then? We need a definition. Should the raw numbers coming from instruments be regarded as data or do they first become data when they have been processed through our algorithms and are useful and reliable? In our case we have the capacity to process all experimental data, so we will only make processed and published data available. But, in other cases, groups may not have the capacity to process all their data, and then it may make sense to make raw unprocessed data available for others to process.
“The idea must be to promote research, not to impede it.”
Sharing data is not a problem if you, for example, ask for the underlying data of a curve in a published paper. We’ve never had problems with that, but usually it is people asking for our specific data. Then we point to the data repository. In regard to simulations, I don’t recall that we have ever been asked to give simulation data, but sometimes we have heard from people claiming they had problems reproducing something in our papers, and then we looked into it. This has only happened a few times, and we’ve always managed to resolve the problem. I hope that the new rules won’t be a great problem for us in terms of bureaucracy. However, if what we’re doing at the moment is not enough in this area, then we could do less research, but that is not the idea. The idea must be to promote research, not to impede it.
I’m in favour of both Open Access and Open Data; it would be great if all our collaborators and competitors felt the same. What we do with the data repository is rather unusual, although it hasn’t meant a lot of work. It would have been nice if you could have compared data by going to the website instead of writing to the author. I’m not sure how to persuade others to do the same, but I guess that there are some rules coming that will oblige one to do so. If that is the case, I really hope that the minimal solution we have chosen will be sufficient because, otherwise, the risk is that it impedes research. In my opinion, it is okay to say that you need to make your data available, but it should be formulated in a way that makes sense in the specific field. I think it is almost impossible to make general rules for this that work in practice. If you make a rule, then it should be a flexible rule in the sense that the scientist decides. It might make sense to have a common Roskilde University repository or a national Danish repository, but it should still be possible to link directly to our website with the data underlying the publications.
Copyright: Sacha Zurcher (Research Librarian) and Søren Møller (Associate Professor), Roskilde University Library. Creative Commons CC-BY-ND Licence.