Materials databases: the need for open, interoperable databases with standardized data and rich metadata

François-Xavier Coudert
Chimie ParisTech, PSL University, CNRS, Institut de Recherche de Chimie Paris, 75005 Paris, France
fx.coudert@chimieparistech.psl.eu
(Dated: July 8, 2019)
arXiv:1907.02791v1 [cond-mat.mtrl-sci] 5 Jul 2019

Driven by the recent rapid increase in the number of materials databases published (open and commercial), I discuss here some perspectives on the growing need for standardized, interoperable, open databases. The field of computational materials discovery is quickly expanding, and recent advances in data mining, high-throughput screening, and machine learning highlight the potential of open databases.

One of the recent important trends in materials science is the emergence of several large-scale online databases of materials, which try to bring together experimental data and computational techniques in order to understand the behavior of materials families and design novel materials through data mining. Maybe the most visible effort in this area is the Materials Project,[1] a database of computed information on known and predicted properties, part of the US-funded Materials Genome Initiative[2] launched in 2011. However, this trend is more general, and an increasing amount of research in the field focuses on the generation of these databases, their extension with additional data, and the use of these databases for analysis, screening, and prediction. This was clearly exemplified at recent materials meetings and materials modeling workshops, such as the MOFSIM 2019[3] meeting, whose excellent discussion provoked this short comment.

I. EXISTING DATABASES

The need for aggregation of curated data in the physical and chemical sciences was recognized very early, and possibly the best-known database in our field is the CRC Handbook of Chemistry and Physics[4] (also known as the "rubber book"), which has been published since 1914. When it comes to materials, databases ranging in size and scope have emerged since the beginning of computing, and have advanced at the same pace as both computer hardware and networking capabilities. The Cambridge Structural Database (CSD),[5] launched in 1965, is one of the first numerical scientific databases, operating as a repository for validated experimental crystal structures of organic and organometallic compounds. Additional structural databases have emerged focusing on other categories of materials, including the Inorganic Crystal Structure Database (ICSD),[6] the Protein Data Bank (PDB),[7] the American Mineralogist Crystal Structure Database,[8] and the Crystallography Open Database (COD).[9]

In addition to these structural databases, databases of materials properties have also been compiled by standardization institutes, learned societies, or commercial entities. Examples include the NIST databases for materials and fluids properties, the glass property database SciGlass, Pearson's Crystal Data, and more specific examples like the Polymer Gas Separation Membrane Database from the Membrane Society of Australasia. Moreover, databases of hypothetical structures (predicted by theory or computations) have also appeared over time. In the field of porous materials, for example, databases of computed zeolitic structures were published more than 15 years ago.[10, 11] More recently, this effort has intensified, probably in response to the increase in computational capacity as well as the ease of hosting large datasets online. Databases of various scales, containing hypothetical (enumerated) metal–organic frameworks, have been published.[12] Other groups have worked on refining experimental databases to include additional data (such as atomic charges), making them suitable for computational applications and screening.[13, 14] All these databases, however, are hosted independently as archive files, with heterogeneous file formats, on individual research groups' websites.

We note, however, that there have been some recent initiatives to integrate data from different sources into larger, coherent databases. This is particularly the case for computational data, whose volume increases with high-performance computing capabilities. The goals differ for the various initiatives, but in general, they aim at providing large-scale platforms for open science and data sharing, as well as improving the discoverability and searchability of existing data. A first example is the Materials Project,[1] the aim of which is to "remove guesswork from materials design in a variety of applications" by computing properties of all known materials (and many that are not yet synthesized, too) through electronic structure analyses, a project funded as part of the bigger Materials Genome Initiative.[2] It aggregates structural data from other existing databases, as well as physical and chemical properties (band structures, elastic constants, piezoelectric tensors, electrode properties) computed as part of the project itself. Other platforms have been built to share results and resources in computational materials sciences, such as the ioChem-BD digital repository[15] and the more recent Materials Cloud.[16]

II. THE CURRENT STATE OF AFFAIRS

As shown above, the number of existing databases is increasing at a fast pace. Yet, the datasets themselves are often hosted using makeshift solutions, in different places. They are typically too large to be hosted as supporting information by the publisher of the associated research paper, and they might be expanded and refined over time, with the publication of updates not related to a particular peer-reviewed paper. In the most common case, they are hosted on the web server of the research group, on institutional repositories offered by some universities, or on free data hosting solutions such as Figshare,[17] GitHub,[18] etc. This leads to the available data being dispersed between several platforms and, in some cases, raises the question of long-term availability of the data (e.g., when it is hosted on a group web server or a commercial data hosting solution). Moreover, there exists no universal way to access these data, unlike what exists for open archives, where protocols such as OAI-PMH have been developed for interoperability and discoverability of data sources.

This fragmentation of the landscape of materials databases is accompanied by a large heterogeneity in the formats used: while the CIF (Crystallographic Information File) format is predominant for crystalline material structures, the data made available as part of the CIF file is not homogeneous between different groups. The use of symmetry operators, for example, is not always consistent, with some databases being stored with symmetry systematically lowered to P1. In addition to this heterogeneity in data format, there is also a general lack of metadata, meaning that most of the databases do not contain information about how the data was generated, gathered, curated, and possibly updated. Yet, this metadata can be crucial in exploiting databases, in order to identify identical or related data items, to understand how databases evolve over time, and to allow further investigation of specific data items. Metadata enables researchers to answer simple queries such as: Where does this structure come from? Where was it first reported, and under what conditions was it synthesized? How was this computational property calculated? What are the conditions of reuse of this data?

Here, before highlighting some requirements for truly open materials databases, we want to introduce the FAIR data principles. The FAIR principles are a set of guidelines for making data findable, accessible, interoperable and reusable (Figure 1). They were formalized in 2016 by a consortium of scientists and organizations,[19] and were formally endorsed by the G20 Leaders at their 2016 summit in Hangzhou (杭州), China, in order to "promote open science and facilitate appropriate access to publicly funded research results".[20] Requirements for FAIR data are:

Findable: data are indexed in a searchable resource, have a stable unique identifier, and are described with rich metadata.

Accessible: data are retrievable using a standardized communications protocol that is open, free, and universally implementable.

Interoperable: data are represented using an open, well-defined format; data and metadata are interlinked.

Reusable: data contain relevant metadata about their origin, carry a clear and accessible data usage license, and meet the community standards of their domain.

FIG. 1. FAIR data principles:[19] make data findable, accessible, interoperable and reusable.

III. REQUIREMENTS FOR OPEN DATABASES

Based on these formal requirements, and on the discussions at MOFSIM 2019 and other workshops, there appears to be a need for more open and interoperable materials databases. We outline below some of the requirements, drawing both on the FAIR principles and on shared experiences and discussions.

Open databases. A majority of the available data on materials is produced as part of academic research, and largely funded by public monies. Moreover, for data associated with public research, there is a general consensus that it should be accessible to all. Recent years have seen the addition of "data availability" requirements in several journals, and it is now a required part of any funding application. This creates a need for open databases, where content can be hosted regardless of its origin (contrary to institution-wide repositories) and where it is accessible to all. Moreover, the database should provide clear information about its users' rights when it comes to reusing the data, mining it, and republishing derived products. Because open databases do not charge their users for subscription or access, they allow the dissemination of knowledge to categories of users that would otherwise find it difficult to obtain access: researchers in developing countries, nongovernmental organizations, independent researchers, journalists, even enthusiastic citizens. Operating such open databases of course requires funding, and it is important to note that a number of national and supranational initiatives have been launched in that direction (as discussed in the introduction).

Interoperable databases. It is relatively clear that, given the vastly different needs of scientists working in different areas of materials science, there can be no "one size fits all" database, i.e., no single centralized database that fulfills the needs of every community. So, how can a good balance be reached in developing specific topical databases while retaining some uniformity, in order to avoid a complete fragmentation of the field? It turns out that this problem has been worked on for many years in a related area, that of document (or paper) archives. While there are many different open archives on the internet, they have been developed in a way that allows interoperability between them. Specifically, the Open Archives Initiative (OAI) has standardized a Protocol for Metadata Harvesting (OAI-PMH) through which each archive exposes its metadata, in a common format, allowing for cross-database search and discoverability.[21]

In order to achieve this goal, several design choices are needed. One is the use of a well-documented, standardized Application Programming Interface (API). With such an API, the data does not have to be retrieved with a database-specific client or web portal,[22] but can be retrieved by code written in any programming language, without inside knowledge of how the database operates internally. This means, in turn, that code developed for one specific database will work seamlessly with all others that implement the same API.

Another is the inclusion of data in standard, publicly documented file formats. Given that most current databases are structural databases, a part of this problem has already been addressed over the past several decades: crystal structures are uniformly reported in CIF format (although the details available are not always consistent), macromolecular structures are consistently in PDB format, etc. However, there is currently no unified format for storing the properties of these materials. This is made difficult by the fact that properties are rather diverse in their mathematical nature: some are dimensionless but others have units; some are integers or half-integers, others vary continuously; some are scalars, others are matrices or higher-order tensors. Moreover, they sometimes need to be accompanied by additional information: expected uncertainty, reference orientation, etc.

With rich metadata and interlinked datasets. Metadata can be defined, in its simplest form, as "data that provides information about other data". As in any database, metadata in a materials database is crucial in assessing the data present, answering questions such as: How was this data gathered, by whom, when? Under which conditions was a given property measured? For computational information, what was the theoretical method used, and what is the level of description of the system? This is particularly important in databases of computational properties, where there can be a clear influence (and sometimes even a systematic bias) of the computational method chosen on the physical and chemical data calculated. If metadata is present in the databases, it opens the door to large-scale systematic explorations of various theoretical methods, and to their comparison with experimental results obtained with different techniques.

Moreover, metadata can also provide much-needed links between several different interoperable databases. If a unique identifier is given to each dataset, and databases are interlinked through their metadata, exploration becomes much simpler for users. It becomes easy to determine, e.g., whether two properties from two datasets are independent or come from the same original calculation. It also allows greater discoverability, making it possible to find, in other databases, other properties related to any given entry.

Curation that preserves the scientific record. The requirements listed above do not stop at a fixed point in time, but instead must be considered throughout the databases' timeline. For example, metadata can record the time of measurement of a given data item, but also its time of inclusion in the database, and its further history. Indeed, with any database of significant size, curation of the data is expected to be an important topic: the dataset will be modified to remove errors and updated to reflect new measurements, and sometimes data will be removed for a variety of legitimate reasons. However, for the sake of research reproducibility and conserving the scientific record, it is important that such modifications be recorded in the metadata, just as corrections and retractions are publicly announced and archived for scientific articles. To my knowledge, this is not currently the case in existing databases, although the Materials Project is publishing "release logs",[23] which are kept on a separate page but not recorded in the database as metadata.

Long-term availability. Finally, this discussion cannot be concluded without addressing the issue of long-term availability of the deposited data, meaning that it is necessary, over time, to build institutional support with long-term commitments. This also requires planning for what happens if and when the hosting institutions decide to "pull the plug" on the project. There, having an open database with an API for direct access to bulk data is a benefit, because it means other interested parties can duplicate the database and take over hosting. Other lessons can be learned from open archives and from initiatives in that field such as CLOCKSS, a long-term preservation project for articles and books with highly redundant mirroring.[24]

[1] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, APL Mater. 1, 011002 (2013).
[2] A. White, MRS Bull. 37, 715 (2012).
[3] MOFSIM2019 meeting, April 10–12, 2019, Ghent (Belgium).
[4] J. R. Rumble, CRC Handbook of Chemistry and Physics, 99th Edition (CRC Press, 2018).
[5] C. R. Groom, I. J. Bruno, M. P. Lightfoot, and S. C. Ward, Acta Cryst. B 72, 171 (2016).
[6] M. Hellenbrandt, Cryst. Rev. 10, 17 (2014).
[7] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, Nucleic Acids Research 28, 235 (2000).
[8] R. T. Downs and M. Hall-Wallace, American Mineralogist 88, 247 (2003).
[9] S. Gražulis, A. Daškevič, A. Merkys, D. Chateigner, L. Lutterotti, M. Quirós, N. R. Serebryanaya, P. Moeck, R. T. Downs, and A. Le Bail, Nucleic Acids Research 40, D420 (2012).
[10] Y. Li, J. Yu, D. Liu, W. Yan, R. Xu, and Y. Xu, Chem. Mater. 15, 2780 (2003).
[11] D. J. Earl and M. W. Deem, Ind. Eng. Chem. Res. 45, 5449 (2006).
[12] C. E. Wilmer, M. Leaf, C. Y. Lee, O. K. Farha, B. G. Hauser, J. T. Hupp, and R. Q. Snurr, Nature Chem 4, 83 (2012).
[13] Y. G. Chung, J. Camp, M. Haranczyk, B. J. Sikora, W. Bury, V. Krungleviciute, T. Yildirim, O. K. Farha, D. S. Sholl, and R. Q. Snurr, Chem. Mater. 26, 6185 (2014).
[14] D. Nazarian, J. S. Camp, Y. G. Chung, R. Q. Snurr, and D. S. Sholl, Chem. Mater. 29, 2521 (2016).
[15] Available online at https://www.iochem-bd.org/.
[16] Available online at https://www.materialscloud.org/.
[17] Available online at https://www.figshare.com/.
[18] Available online at https://www.github.com/.
[19] M. D. Wilkinson, M. Dumontier, I. J. Aalbersberg, G. Appleton, M. Axton, A. Baak, N. Blomberg, J.-W. Boiten, L. B. da Silva Santos, P. E. Bourne, J. Bouwman, A. J. Brookes, T. Clark, M. Crosas, I. Dillo, O. Dumon, S. Edmunds, C. T. Evelo, R. Finkers, A. Gonzalez-Beltran, A. J. Gray, P. Groth, C. Goble, J. S. Grethe, J. Heringa, P. A. 't Hoen, R. Hooft, T. Kuhn, R. Kok, J. Kok, S. J. Lusher, M. E. Martone, A. Mons, A. L. Packer, B. Persson, P. Rocca-Serra, M. Roos, R. van Schaik, S.-A. Sansone, E. Schultes, T. Sengstag, T. Slater, G. Strawn, M. A. Swertz, M. Thompson, J. van der Lei, E. van Mulligen, J. Velterop, A. Waagmeester, P. Wittenburg, K. Wolstencroft, J. Zhao, and B. Mons, Sci. Data 3, e1002295 (2016).
[20] "G20 Leaders' Communiqué Hangzhou Summit", available online at http://europa.eu/rapid/press-release_STATEMENT-16-2967_en.htm.
[21] Available online at http://www.openarchives.org/OAI/openarchivesprotocol.html.
[22] Although of course such clients can exist, providing a user-friendly way to query the database! The existence of a public API makes it possible for advanced users to develop their own portals, bringing added value to the database.
[23] https://discuss.materialsproject.org/t/materials-project-database-release-log/1609.
[24] Available online at https://clockss.org.
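
As an illustration of the requirements discussed in Section III (rich metadata, unique identifiers, and curation that preserves the scientific record), a property entry could be sketched as below. This is a minimal sketch under stated assumptions, not an existing database schema: the class names, field names, action vocabulary, identifiers, and numerical values are all hypothetical and chosen only for illustration. The point it tries to show is that corrections and withdrawals are appended to a history rather than overwriting the deposited value.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)
class Revision:
    """One curation event; events are appended, never overwritten."""
    timestamp: str  # ISO 8601 time at which the event entered the database
    action: str     # "created", "corrected" or "withdrawn" (hypothetical vocabulary)
    value: Any      # value after this event (None for a withdrawal)
    reason: str     # human-readable justification


@dataclass
class PropertyRecord:
    """A property value plus the metadata needed to interpret and reuse it."""
    identifier: str          # stable unique identifier of this entry
    structure_id: str        # identifier of the structure it refers to
    name: str                # e.g. "bulk_modulus"
    units: Optional[str]     # None for dimensionless quantities
    method: str              # experimental technique or level of theory
    source: str              # where the value was first reported
    license: str             # conditions of reuse
    uncertainty: Optional[float] = None
    history: list = field(default_factory=list)

    def _log(self, action: str, value: Any, reason: str) -> None:
        stamp = datetime.now(timezone.utc).isoformat()
        self.history.append(Revision(stamp, action, value, reason))

    def create(self, value: Any, reason: str = "initial deposit") -> None:
        self._log("created", value, reason)

    def correct(self, value: Any, reason: str) -> None:
        self._log("corrected", value, reason)

    def withdraw(self, reason: str) -> None:
        self._log("withdrawn", None, reason)

    @property
    def current_value(self) -> Any:
        """Latest value, or None if the entry is empty or withdrawn."""
        return self.history[-1].value if self.history else None


# Hypothetical usage: all identifiers and numbers are made up.
record = PropertyRecord(
    identifier="example:property/0001",
    structure_id="example:structure/0042",
    name="bulk_modulus",
    units="GPa",
    method="DFT (settings not specified in this sketch)",
    source="hypothetical deposit",
    license="CC-BY-4.0",
)
record.create(12.3)
record.correct(11.8, "recomputed after an error was found")
```

Because the value itself may be a scalar, a matrix, or a higher-order tensor, the sketch leaves it untyped; a real format would need to pin down its shape, units, and reference orientation explicitly.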
Condensed Matter – arXiv (Cornell University)
Published: Jul 5, 2019