Rzepa, Henry; Mclean, Andrew; Harvey, Matthew J.
Chem Int 38 (3–4) 24–26 (2016).
Science Published: (May/2016)
DOI: https://doi.org/10.1515/ci-2016-3-408
Abstract:
Progress in science has always been driven by data as a primary research output. This is especially true of the data-centric fields of molecular sciences. Scholarly journals in chemistry in the 19th century captured a (probably small) proportion of research data in printed journals, books, and compendia. The curation of this data from its origins in the 1880s and for most of the 20th century was largely driven by a few organisations as a commercial and proprietary activity. The online era, dating from around 1995, saw much experimentation centred around the presentation and delivery of journals, but less so of the data. The latter evolved, almost by accident, into what is now known as electronic supporting or supplemental information (SI), associated with journal articles. [1] That there was still a general problem in science was revealed by the “Climategate” events in 2009, where a lack of access to the data on which climate models are based induced all manner of unfortunate conspiracy theories. [2] These events catalysed a change in policy at, amongst others, UK research funders. One outcome of this change was seen in May 2015 with the introduction of new research data management (RDM) requirements for funded researchers. This centred around the precept that primary research data should be made openly available [3] and coincided with the evolution of the open science tripod of open data, open access articles, and open science notebooks. [4]
These new funder policies now require researchers to develop research data management plans, part of which involves publishing their data in what is called FAIR form. [5] The four components of FAIR are:
F: Findable. Data should be discoverable by searches, ideally on a global scale using consistent interfaces.
A: Accessible. Data should be openly retrievable not only by humans, but by machines operating on a larger scale for the purpose of data or content mining.
I: Interoperable. Once discovered and retrieved, data should be capable of validation and re-use, again not merely by humans but also by software.
R: Reusable with a commensurate and declared license that allows this.
Although SI is nowadays a virtually mandatory component of the journal publication process in chemistry, very little of it actually fulfils all these FAIR criteria, for a variety of reasons. SI is mostly delivered as a PDF document containing page breaks and page headers or footers. The PDF wrapper was never designed as a data container; such containment can easily disable data discoverability. Some data, such as crystallographic information, is contained in structured semantic form, but this is not generally true. Crucially, the PDF-based SI document never has formally declared metadata (information about the data contained therein), and its monolithic structure (examples have reached 504 pages in length, [6] and this may not even have been close to the maximum) means that even a simple index of the text content is probably next to useless for satisfying the F of FAIR. SI is a child of its parent, the scientific journal article, and as such inherits the persistent (digital object) identifier or DOI of the article. The article DOI, however, carries no information (metadata) about the SI itself or about any data contained in the SI. The DOI normally points to a landing page for the article, and this page has to be visually inspected by a human to ascertain the existence and whereabouts of SI, often in a manner parochial to the journal; a fail for both the F and the A of FAIR. Validation of data held inside a PDF file is rarely possible with any semantic assurance; a fail for the I of FAIR. Finally, the licenses that cover data are, or should be, fundamentally different from those that cover copyrightable materials such as journal articles. These are rarely declared; a fail for the R of FAIR.
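The four tests applied above to PDF-based SI can be written down as a simple checklist. The following Python sketch is purely illustrative: the `DataRecord` class, its field names, the `fair_report` function, and the placeholder identifier are all assumptions made for this example, not part of any real metadata schema or repository API. The pass/fail rules are deliberately crude, and the PDF case is modelled as the worst case described in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataRecord:
    """Hypothetical description of one data item (not a real schema)."""
    own_identifier: Optional[str] = None          # persistent identifier of the data itself
    metadata: dict = field(default_factory=dict)  # declared metadata about the data
    machine_retrievable: bool = False             # software can fetch it without a human reading a landing page
    structured_format: bool = False               # held in a form that software can validate
    license: Optional[str] = None                 # declared data license


def fair_report(record: DataRecord) -> dict:
    """Return a crude pass/fail flag for each letter of FAIR."""
    return {
        "F": record.own_identifier is not None and bool(record.metadata),
        "A": record.machine_retrievable,
        "I": record.structured_format,
        "R": record.license is not None,
    }


# PDF-based SI in the worst case described above: it inherits the article DOI,
# has no metadata of its own, and declares no data license.
pdf_si = DataRecord()

# A hypothetical deposited dataset; every value here is a placeholder.
deposited = DataRecord(
    own_identifier="doi:10.0000/example-dataset",
    metadata={"title": "Example dataset", "type": "crystal structure"},
    machine_retrievable=True,
    structured_format=True,
    license="CC0",
)

print(fair_report(pdf_si))
print(fair_report(deposited))
```

In this sketch the PDF record fails all four tests, while the placeholder dataset passes them only because it carries its own identifier, metadata, and license rather than relying on those of the parent article.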