
Research data should not be published in high-ranking journals without the storage of raw data in open and easily accessible repositories

Name: Dr Nicole Jung
Position: Group leader Compound Platform/Research Assistant
Institution: Karlsruhe Institute of Technology, Institute of Organic Chemistry and Institute of Toxicology and Genetics
Country: Germany
More info: Home Page, Other

An interview with Dr Nicole Jung on 12 April 2017

What is your interest in Open Data?

I should give a short introduction to explain the value of data in the field of natural sciences or, in particular, in chemical sciences:

Data in chemical sciences consist to a large extent of the analytical data for chemical structures or the details for a reaction and the conditions of chemical procedures. The analytical data are a mapping of molecules or mixtures with spectroscopic and spectrometric methods; they form a fingerprint of the molecules. The data related to a description of a chemical experiment are the recipe to repeat the experiment and to gain the targeted chemical compounds.

“The amount and quality of available data have a direct effect on the time and money invested in a research project.”

For all researchers doing experimental work in chemistry labs, the analytical and reaction data are the basis of their work. All researchers have to start their work at some point with the synthesis of a previously published chemical compound. If the description and analytical data are not available, the researcher loses a lot of time and money searching for the best procedure and identifying the target compound. Raw data are of high importance because researchers need to compare their own results with the published ones, and the published description of an analysis is often not detailed enough for a suitable comparison. Additionally, more and more publishers require the corresponding research data to be submitted alongside a publication.

To summarise the importance of Open Data in chemistry: the data are the basis of every synthetic researcher's work, and the amount and quality of available data have a direct effect on the time and money invested in a research project. The availability of data determines our routine daily work.

What is your own experience with Open Data activities?

As the need for data is so strong in chemical sciences, there are already commercial databases offering chemical data in a systematic manner, and also initiatives for Open Data. The commercial databases draw their data directly from the information that is available in publications. The disadvantages of the commercial databases are:

(1) Access is limited to users having a license (which is very expensive).

(2) Data are limited concerning their detail level (no primary or raw data). Some information is only available via the publisher.

(3) Only those data are given that have been part of the publication.

The disadvantages of Open Data archives/repositories include:

(1) There are only very few repositories that contain raw data (like SDBS (http://sdbs.db.aist.go.jp/sdbs/cgi-bin/cre_index.cgi) and ChemSpider (http://www.chemspider.com/)).

(2) The datasets are very limited because of several concerns related to Open Data, and the effort to maintain the databases is very high. Supporting them earns no impact points.

(3) Distinct data can be retrieved via excellent data sources like the Cambridge Crystallographic Data Centre (CCDC). The deposit includes the raw data but is limited to crystallographic data.

Therefore, we initiated our own repository for research data, which is called the Chemotion repository (https://chemotion.net/). The project started in 2014 and was funded in 2015 by the German Research Foundation (DFG).

The concept was to allow the deposit of all relevant raw data and to create visibility of the datasets via the generation of DOIs, a DataCite registration and the listing of the results in the open platform PubChem (https://pubchem.ncbi.nlm.nih.gov/). The functionality is currently being extended from the deposit of analytical data to the additional deposit of reaction data. The structure of the repository supports several chemistry-related functions, like the search for molecular structures, the labeling of new molecules (that haven’t been registered in PubChem before), and the summary of structure-dependent data and values like the molecular weight, the molecular formula, etc. These automated functions facilitate the registration of new molecules for contributors. All data are clearly organised, allowing for a fast association with the molecular structure and the comparison with data from other sources. Additional information on the molecule or corrections can be made at any time via the upload of new or better material (the old information is kept).
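As an illustration of what a "structure-dependent value" is, the following minimal Python sketch derives a molecular weight from a molecular formula. It is not the Chemotion implementation: the names `ATOMIC_WEIGHTS`, `parse_formula` and `molecular_weight` are hypothetical, the table holds only a few elements with rounded standard atomic weights, and formulas with brackets, charges or isotopes are not handled.

```python
import re

# Rounded standard atomic weights in g/mol (illustrative subset, an assumption of this sketch).
ATOMIC_WEIGHTS = {
    "H": 1.008,
    "C": 12.011,
    "N": 14.007,
    "O": 15.999,
    "S": 32.06,
    "Cl": 35.45,
}

# One token is an element symbol followed by an optional count, e.g. "C6" or "Cl".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula: str) -> dict:
    """Count atoms in a flat formula such as 'C6H12O6' (no brackets or charges)."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, number in TOKEN.findall(formula):
        if symbol not in ATOMIC_WEIGHTS:
            raise ValueError(f"element not in table: {symbol}")
        # A missing count means one atom; repeated symbols are summed.
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum the atomic weights of all atoms in the formula."""
    counts = parse_formula(formula)
    return sum(ATOMIC_WEIGHTS[symbol] * n for symbol, n in counts.items())


if __name__ == "__main__":
    for f in ("H2O", "C6H12O6", "C2H5OH"):
        print(f, parse_formula(f), round(molecular_weight(f), 3))
```

Because such values follow directly from the structure, a repository can compute them automatically instead of asking the depositor to enter them by hand.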

There are two ideas to motivate people to participate in the project and to overcome the problems that other repositories are facing.

First, the upload of data is possible via a transfer from an Electronic Lab Notebook, which reduces the effort to a minimum. Second, we initiated a competition with incentives for students with a high contribution rate to the repository.

The realisation of the first idea, the upload from an electronic laboratory notebook (ELN), was the most laborious one because at the moment there is no ELN worldwide that supports the upload of analytical data and reaction data directly to an Open Data repository. With a clear idea of how detailed Open Data should be and in what form it should be provided, we started to develop such an ELN, which supports the transfer of data directly into the Chemotion repository. The ELN is still under development, but a stable version has already been published as open source and is available via GitHub. With the ELN, researchers can manage their data, and selected datasets can be transferred to the open repository for chemistry research data with a few mouse clicks. Of course, the ELN and the repository can each be used on their own, and the upload of data is completely voluntary. Only the data that are explicitly chosen for open publishing will be transferred to the open repository.

A second method was tested to motivate students to contribute to the project: we offered an incentive for students who publish interesting information and complete data relating to their research projects. A committee of three supervisors decided on the best contributions, and the students could earn a prize for their contributions. The money was donated by companies, foundations or private persons. This incentive was very successful and we will continue with this procedure this year.

What are some of your concerns and frustrations as regards Open Data?

The concern that publishers might refuse to accept publications whose corresponding data have already been released in the repository could not be confirmed. On the contrary, authors can link a publication and its corresponding data via the assigned DOI, which can then be used in the reviewing process. It is also possible to set an embargo.

The only frustration to mention is this: we found, as others did before, that researchers upload and allocate data only if the work is fast and uncomplicated (with almost no additional effort) or if incentives can be earned. The barrier to voluntary support of Open Data initiatives is very high because of the time investment required. This is the reason why, regardless of the need for Open Data, almost no databases are available.

In summary, as long as the community accepts the publication of research results without the availability of the data in open repositories, there is almost no chance to change the systematic lack of data.

Who, or what project or service, inspires you and makes you optimistic about the future of Open Science?

We hope that our projects may serve as an example for other researchers and other fields of research as well. But one has to mention that the infrastructure needed for such fast data transfer depends on the long-term engagement of both the developers and the hosting institute. The programming, development and installation of all necessary components for a suitable infrastructure for Open Data publishing take a lot of time and human resources. In addition, the research institution has to support the initiatives by providing the necessary server infrastructure and hosting the web portals. We are thankful for the support our group receives, in particular from the KIT Library and the Steinbuch Centre for Computing (SCC), who strongly support the ELN and repository projects.

I would also like to mention how far Open Data may reach. We recently established a database for physical reference material that is called “Molekülarchiv” or molecule archive. Researchers looking for material for their work can apply for samples, which are provided to facilitate the identification process and further research with the obtained molecules. This is probably the most advanced interpretation of Open Data.

What still needs to be done to improve access to data?

Concerning our work in particular, we are still working on improving the infrastructure to allow a comfortable transfer of data, which will be launched in a few weeks. We need to improve and extend the ELN functionality step by step over the coming months. The work for Open Data will need continuous commitment, especially for the search for new sponsors for our publication prize and the search for new funding for the adaptation and improvement of the current software tools.

“The work for Open Data will need continuous commitment.”

In general (without direct connection to our project): I think that a change of publishing procedure is necessary; research data should not be published in high-ranking journals without the storage of raw data in open and easily accessible repositories.

“The availability of data changes the speed and quality of research tremendously.”

Consequences of positive action: As mentioned before, the availability of data changes the speed and quality of research tremendously. But it’s more than that; the availability of data enables researchers to work with large datasets and to evaluate their work at another level. The availability of research data can, for example, support data-mining and deep learning methods. This has an enormous impact on the availability of general knowledge and the understanding of scientific work.

Consequences of negative action: The consequences of unavailable datasets are clear; as in the past, researchers will waste time, money and resources in general to repeat experiments that have been done by others before. Scientific work will therefore be slower, have lower impact and lower quality standards.

Copyright: Dr. Dzulia Terzijska. Creative Commons CC-BY Licence.

Tags: DOI, ELN, Open Access, Open Data, Open Science, TDM, accessibility, analytical data, awards, benefits, commercial, data curation, impact, infrastructure, investment, labs, licensing, motivation, open source, publishing, quality, replication, repository, simplicity, speed, time, tools, upload, value, waste
