Machine learning modeling of superconducting critical temperature

Valentin Stanev, Corey Oses, A. Gilad Kusne, Efrain Rodriguez, Johnpierre Paglione, Stefano Curtarolo and Ichiro Takeuchi

Superconductivity has been the focus of enormous research effort since its discovery more than a century ago. Yet, some features of this unique phenomenon remain poorly understood; prime among these is the connection between superconductivity and the chemical/structural properties of materials. To bridge the gap, several machine learning schemes are developed herein to model the critical temperatures (Tc) of the 12,000+ known superconductors available via the SuperCon database. Materials are first divided into two classes based on their Tc values, above and below 10 K, and a classification model predicting this label is trained. The model uses coarse-grained features based only on the chemical compositions. It shows strong predictive power, with out-of-sample accuracy of about 92%. Separate regression models are developed to predict the values of Tc for cuprate, iron-based, and low-Tc compounds. These models also demonstrate good performance, with learned predictors offering potential insights into the mechanisms behind superconductivity in different families of materials. To improve the accuracy and interpretability of these models, new features are incorporated using materials data from the AFLOW Online Repositories. Finally, the classification and regression models are combined into a single integrated pipeline and employed to search the entire Inorganic Crystallographic Structure Database (ICSD) for potential new superconductors. We identify >30 non-cuprate and non-iron-based oxides as candidate materials.
npj Computational Materials (2018) 4:29; doi:10.1038/s41524-018-0085-8

INTRODUCTION

Superconductivity, despite being the subject of intense physics, chemistry, and materials science research for more than a century, remains among the most puzzling scientific topics. It is an intrinsically quantum phenomenon caused by a finite attraction between paired electrons, with unique properties including zero DC resistivity, the Meissner and Josephson effects, and an ever-growing list of current and potential applications. There is even a profound connection between phenomena in the superconducting state and the Higgs mechanism in particle physics. However, understanding the relationship between superconductivity and materials' chemistry and structure presents significant theoretical and experimental challenges. In particular, despite focused research efforts in the last 30 years, the mechanisms responsible for high-temperature superconductivity in the cuprate and iron-based families remain elusive [3,4].

Recent developments, however, allow a different approach to investigate what ultimately determines the superconducting critical temperatures (Tc) of materials. Extensive databases covering various measured and calculated materials properties have been created over the years [5-9]. The sheer quantity of accessible information also makes possible, and even necessary, the use of data-driven approaches, e.g., statistical and machine learning (ML) methods [10-13]. Such algorithms can be developed/trained on the variables collected in these databases, and employed to predict macroscopic properties, such as the melting temperatures of binary compounds [15], the likely crystal structure at a given composition [16,17], band gap energies [16], and the density of states of certain classes of materials.

Taking advantage of this immense increase of readily accessible and potentially relevant information, we develop several ML methods modeling Tc from the complete list of reported (inorganic) superconductors. In their simplest form, these methods take as input a number of predictors generated from the elemental composition of each material. Models developed with these basic features are surprisingly accurate, despite lacking information on relevant properties such as space group, electronic structure, and phonon energies. To further improve the predictive power of the models, as well as the ability to extract useful information out of them, another set of features is constructed based on crystallographic and electronic information taken from the AFLOW Online Repositories [19-22].

Application of statistical methods in the context of superconductivity began in the early eighties with simple clustering methods [23,24]. In particular, three "golden" descriptors confine the 60 known (at the time) superconductors with Tc > 10 K to three small islands in space: the averaged valence-electron numbers, orbital radii differences, and metallic electronegativity differences. Conversely, about 600 other superconductors with Tc < 10 K appear randomly dispersed in the same space.
Affiliations: 1 Department of Materials Science and Engineering, University of Maryland, College Park, MD 20742-4111, USA; 2 Center for Nanophysics and Advanced Materials, University of Maryland, College Park, MD 20742, USA; 3 Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC 27708, USA; 4 Center for Materials Genomics, Duke University, Durham, NC 27708, USA; 5 National Institute of Standards and Technology, Gaithersburg, MD 20899, USA; 6 Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20742, USA; 7 Department of Physics, University of Maryland, College Park, MD 20742, USA; 8 Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin-Dahlem, Germany. Correspondence: Valentin Stanev (vstanev@umd.edu). Received: 22 November 2017; Revised: 12 May 2018; Accepted: 17 May 2018. Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences.

These descriptors were selected heuristically due to their success in classifying binary/ternary structures and predicting stable/metastable ternary quasicrystals. Recently, an investigation stumbled on this clustering problem again by observing a threshold Tc closer to log Tc_thres ≈ 1.3 (Tc_thres = 20 K). Instead of a heuristic approach, random forests and simplex fragments were leveraged on the structural/electronic properties data from the AFLOW Online Repositories to find the optimum clustering descriptors. A classification model was developed showing good performance. Separately, a sequential learning framework was evaluated on superconducting materials, exposing the limitations of relying on random-guess (trial-and-error) approaches for breakthrough discoveries. That study also highlights the impact machine learning can have on this particular field. In another early work, statistical methods were used to find correlations between normal-state properties and Tc of the metallic elements in the first six rows of the periodic table. Other contemporary works hone in on specific materials [28,29] and families of superconductors [30,31] (see also ref. ).

Whereas previous investigations explored several hundred compounds at most, this work considers >16,000 different compositions. These are extracted from the SuperCon database, which contains an exhaustive list of superconductors, including many closely related materials varying only by small changes in stoichiometry (doping plays a significant role in optimizing Tc). The order-of-magnitude increase in training data (i) presents crucial subtleties in chemical composition among related compounds, (ii) affords family-specific modeling exposing different superconducting mechanisms, and (iii) enhances model performance overall. It also enables the optimization of several model construction procedures. Large sets of independent variables can be constructed and rigorously filtered by predictive power (rather than selecting them by intuition alone). These advances are crucial to uncovering insights into the emergence/suppression of superconductivity with composition.

As a demonstration of the potential of ML methods in looking for novel superconductors, we combined and applied several models to search for candidates among the roughly 110,000 different compositions contained in the Inorganic Crystallographic Structure Database (ICSD), a large fraction of which have not been tested for superconductivity. The framework highlights 35 compounds with predicted Tc's above 20 K for experimental validation. Of these, some exhibit interesting chemical and structural similarities to cuprate superconductors, demonstrating the ability of the ML models to identify meaningful patterns in the data. In addition, most materials from the list share a peculiar feature in their electronic band structure: one (or more) flat/nearly-flat bands just below the energy of the highest occupied electronic state. The associated large peak in the density of states (infinitely large in the limit of truly flat bands) can lead to strong electronic instability, and has been discussed recently as one possible route to high-temperature superconductivity [33,34].

RESULTS

Data and predictors

The success of any ML method ultimately depends on access to reliable and plentiful data. Superconductivity data used in this work is extracted from the SuperCon database, created and maintained by the Japanese National Institute for Materials Science. It houses information such as the Tc and reporting journal publication for superconducting materials known from experiment. Assembled within it is a uniquely exhaustive list of all reported superconductors, as well as related non-superconducting compounds. As such, SuperCon is the largest database of its kind, and has never before been employed en masse for machine learning modeling.

From SuperCon, we have extracted a list of ~16,400 compounds, of which 4000 have no Tc reported (see Methods section for details). Of these, roughly 5700 compounds are cuprates and 1500 are iron-based (about 35 and 9%, respectively), reflecting the significant research efforts invested in these two families. The remaining set of about 8000 is a mix of various materials, including conventional phonon-driven superconductors (e.g., elemental superconductors, A15 compounds), known unconventional superconductors like the layered nitrides and heavy fermions, and many materials for which the mechanism of superconductivity is still under debate (such as bismuthates and borocarbides). The distribution of materials by Tc for the three groups is shown in Fig. 2a.

Use of this data for the purpose of creating ML models can be problematic. ML models have an intrinsic applicability domain, i.e., predictions are limited to the patterns/trends encountered in the training set. As such, training a model only on superconductors can lead to significant selection bias that may render it ineffective when applied to new materials (N.B., a model suffering from selection bias can still provide valuable statistical information about known superconductors). Even if the model learns to correctly recognize factors promoting superconductivity, it may miss effects that strongly inhibit it. To mitigate the effect, we incorporate about 300 materials found by H. Hosono's group not to display superconductivity. However, the presence of non-superconducting materials, along with those without Tc reported in SuperCon, leads to a conceptual problem. Surely, some of these compounds emerge as non-superconducting "end-members" from doping/pressure studies, indicating no superconducting transition was observed despite some effort to find one. However, a transition may still exist, albeit at temperatures that are experimentally difficult to reach or altogether inaccessible (for most practical purposes, below 10 mK). (There are theoretical arguments for this: according to the Kohn-Luttinger theorem, a superconducting instability should be present as T → 0 in any fermionic metallic system with Coulomb interactions.) This presents a conundrum: ignoring compounds with no reported Tc disregards a potentially important part of the dataset, while assuming Tc = 0 K prescribes an inadequate description for (at least some of) these compounds.

To circumvent the problem, materials are first partitioned in two groups by their Tc, above and below a threshold temperature (Tsep), for the creation of a classification model. Compounds with no reported critical temperature can be classified in the "below-Tsep" group without the need to specify a Tc value (or assume it is zero). The "above-Tsep" bin also enables the development of a regression model for ln(Tc), without problems arising in the Tc → 0 limit.

For most materials, the SuperCon database provides only the chemical composition and Tc. To convert this information into meaningful features/predictors (used interchangeably), we employ the Materials Agnostic Platform for Informatics and Exploration (Magpie). Magpie computes a set of attributes for each material, including elemental property statistics like the mean and the standard deviation of 22 different elemental properties (e.g., period/group on the periodic table, atomic number, atomic radii, melting temperature), as well as electronic structure attributes, such as the average fraction of electrons from the s, p, d, and f valence shells among all elements present.

The application of Magpie predictors, though appearing to lack a priori justification, expands upon past clustering approaches by Villars and Rabe [23,24]. They show that, in the space of a few judiciously chosen heuristic predictors, materials separate and cluster according to their crystal structure and even complex properties, such as high-temperature ferroelectricity and superconductivity. Similar to these features, Magpie predictors capture significant chemical information, which plays a decisive role in determining structural and physical properties of materials.

Fig. 1 Schematic of the random forest ML approach. Example of a single decision tree used to classify materials depending on whether Tc is above or below 10 K. A tree can have many levels, but only the top three are shown. The decision rules leading to each subset are written inside individual rectangles. The subset population percentage is given by "samples", and the node color/shade represents the degree of separation, i.e., dark blue/orange illustrates a high proportion of Tc > 10 K / Tc < 10 K materials (the exact value is given by "proportion"). A random forest consists of a large number (hundreds or thousands) of such individual trees.
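The composition-weighted statistics described above can be illustrated with a short sketch. This is not Magpie itself: the single-property table, the function name, and the YBa2Cu3O7 example are illustrative assumptions.

```python
import math

# Illustrative elemental property table (atomic weights, g/mol); a Magpie-style
# featurizer tabulates ~22 such properties per element.
ATOMIC_WEIGHT = {"Y": 88.906, "Ba": 137.327, "Cu": 63.546, "O": 15.999}

def weighted_stats(composition, prop_table):
    """Composition-weighted mean and standard deviation of one elemental property."""
    total = sum(composition.values())
    fractions = {el: n / total for el, n in composition.items()}
    mean = sum(f * prop_table[el] for el, f in fractions.items())
    var = sum(f * (prop_table[el] - mean) ** 2 for el, f in fractions.items())
    return mean, math.sqrt(var)

# Two of the predictor columns, evaluated for YBa2Cu3O7
avg_weight, std_weight = weighted_stats({"Y": 1, "Ba": 2, "Cu": 3, "O": 7}, ATOMIC_WEIGHT)
```

Repeating this over the tabulated elemental properties yields predictors of the kind used throughout the text, such as avg(atomic weight) and std(melting temperature).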
Despite the success of Magpie predictors in modeling materials properties, interpreting their connection to superconductivity presents a serious challenge. They do not encode (at least directly) many important properties, particularly those pertinent to superconductivity. Incorporating features like lattice type and density of states would undoubtedly lead to significantly more powerful and interpretable models. Since such information is not generally available in SuperCon, we employ data from the AFLOW Online Repositories [19-22]. The materials database houses nearly 170 million properties calculated with the software package AFLOW [6,38-46]. It contains information for the vast majority of compounds in the ICSD. Although the AFLOW Online Repositories contain calculated properties, the DFT results have been extensively validated with observed properties [17,25,47-50].

Unfortunately, only a small subset of materials in SuperCon overlaps with those in the ICSD: about 800 with finite Tc and <600 are contained within AFLOW. For these, a set of 26 predictors are incorporated from the AFLOW Online Repositories, including structural/chemical information like the lattice type, space group, volume of the unit cell, density, ratios of the lattice parameters, Bader charges and volumes, and formation energy (see Methods section for details). In addition, electronic properties are considered, including the density of states near the Fermi level as calculated by AFLOW. Previous investigations exposed limitations in applying ML methods to a similar dataset in isolation. Instead, a framework is presented here for combining models built on Magpie descriptors (large sampling, but features limited to compositional data) and AFLOW features (small sampling, but diverse and pertinent features).

Once we have a list of relevant predictors, various ML models can be applied to the data [51,52]. All ML algorithms in this work are variants of the random forest method. Fundamentally, this approach combines many individual decision trees, where each tree is a non-parametric supervised learning method used for modeling either categorical or numerical variables (i.e., classification or regression modeling). A tree predicts the value of a target variable by learning simple decision rules inferred from the available features (see Fig. 1 for an example).

Random forest is one of the most powerful, versatile, and widely used ML methods. There are several advantages that make it especially suitable for this problem. First, it can learn complicated non-linear dependencies from the data. Unlike many other methods (e.g., linear regression), it does not make assumptions about the functional form of the relationship between the predictors and the target variable (e.g., linear, exponential, or some other a priori fixed function). Second, random forests are quite tolerant to heterogeneity in the training data. They can handle both numerical and categorical data which, furthermore, do not need extensive and potentially dangerous preprocessing, such as scaling or normalization. Even the presence of strongly correlated predictors is not a problem for model construction (unlike for many other ML algorithms). Another significant advantage of this method is that, by combining information from individual trees, it can estimate the importance of each predictor, thus making the model more interpretable. However, unlike model construction, determination of predictor importance is complicated by the presence of correlated features. To avoid this, standard feature selection procedures are employed along with a rigorous predictor elimination scheme (based on their strength and correlation with others). Overall, these methods reduce the complexity of the models and improve our ability to interpret them.

Classification models

As a first step in applying ML methods to the dataset, a sequence of classification models is created, each designed to separate materials into two distinct groups depending on whether Tc is above or below some predetermined value. The temperature that separates the two groups (Tsep) is treated as an adjustable parameter of the model, though some physical considerations should guide its choice as well. Classification ultimately allows compounds with no reported Tc to be used in the training set by including them in the below-Tsep bin. Although discretizing continuous variables is not generally recommended, in this case the benefits of including compounds without Tc outweigh the potential information loss.

In order to choose the optimal value of Tsep, a series of random forest models is trained with different threshold temperatures separating the two classes. Since setting Tsep too low or too high creates strongly imbalanced classes (with many more instances in one group), it is important to compare the models using several different metrics. Focusing only on the accuracy (count of correctly classified instances) can lead to deceptive results. Hypothetically, if 95% of the observations in the dataset are in the below-Tsep group, simply classifying all materials as such would yield a high accuracy (95%), while being trivial in any other sense.
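The classification setup just described, thresholding Tc at Tsep, fitting a random forest, and scoring on held-out data, can be sketched as follows. The text does not prescribe a software stack; scikit-learn and the synthetic feature matrix below are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the SuperCon feature matrix: rows are materials,
# columns are composition-derived predictors; tc plays the role of Tc in K.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
tc = 5.0 * np.exp(X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=2000))

T_SEP = 10.0                   # threshold temperature separating the two classes
y = (tc > T_SEP).astype(int)   # compounds with no reported Tc would receive label 0

# 85%/15% train/test split, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)

clf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
clf.fit(X_tr, y_tr)
test_accuracy = clf.score(X_te, y_te)  # out-of-sample accuracy
oob_accuracy = clf.oob_score_          # internal estimate from bootstrap leftovers
```

The out-of-bag score is the intrinsic accuracy estimate available at training time, obtained without touching the test set.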
There are more sophisticated techniques to deal with severely imbalanced datasets, like undersampling the majority class or generating synthetic data points for the minority class (see, for example, ref. ). To avoid this potential pitfall, three other standard metrics for classification are considered: precision, recall, and F1 score. They are defined using the values tp, tn, fp, and fn for the count of true/false positive/negative predictions of the model:

accuracy ≡ (tp + tn) / (tp + tn + fp + fn),   (1)

precision ≡ tp / (tp + fp),   (2)

recall ≡ tp / (tp + fn),   (3)

F1 ≡ 2 × (precision × recall) / (precision + recall),   (4)

where positive/negative refers to above-Tsep/below-Tsep.

The most important factors that determine the model's performance are the size of the available dataset and the number of meaningful predictors. As can be seen in Fig. 2c, all metrics improve significantly with the increase of the training set size. The effect is most dramatic for sizes between several hundred and a few thousand instances, but there is no obvious saturation even for the largest available datasets. This validates efforts herein to incorporate as much relevant data as possible into model training. The number of predictors is another very important model parameter. In Fig. 2d, the accuracy is calculated at each step of the backward feature elimination process. It quickly saturates when the number of predictors reaches 10. In fact, a model using only the five most informative predictors, selected out of the full list of 145, achieves almost 90% accuracy.

To gain some understanding of what the model has learned, an analysis of the chosen predictors is needed. In the random forest method, features can be ordered by their importance, quantified via the so-called Gini importance or "mean decrease in impurity" [51,52]. For a given feature, it is the sum of the Gini impurity (calculated as ∑_i p_i(1 − p_i), where p_i is the probability of a randomly chosen data point from a given decision tree leaf to be in class i) over the number of splits that include the feature, weighted by the number of samples it splits, and averaged over the entire forest. Due to the nature of the algorithm, the closer to the top of the tree a predictor is used, the greater the number of predictions it impacts.
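Eqs. (1)-(4) and the per-leaf Gini impurity can both be computed directly from their definitions. A minimal sketch; the function names and the example counts are illustrative:

```python
from collections import Counter

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts (Eqs. 1-4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def gini_impurity(leaf_labels):
    """Gini impurity of one decision-tree leaf: sum_i p_i * (1 - p_i)."""
    n = len(leaf_labels)
    return sum((c / n) * (1 - c / n) for c in Counter(leaf_labels).values())

# Illustrative imbalanced case: most materials fall in the below-Tsep (negative) class.
acc, prec, rec, f1 = classification_metrics(tp=8, tn=85, fp=2, fn=5)

pure_leaf = gini_impurity([1, 1, 1, 1])   # fully separated leaf
mixed_leaf = gini_impurity([0, 0, 1, 1])  # 50/50 leaf, the two-class maximum
```

With these counts the accuracy is 0.93 while the recall is only 8/13 ≈ 0.62, illustrating why accuracy alone is deceptive on imbalanced data.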
The accuracy of a classifier is the total proportion of correctly classified materials, while precision measures the proportion of correctly classified above-Tsep superconductors out of all predicted above-Tsep. The recall is the proportion of correctly classified above-Tsep materials out of all truly above-Tsep compounds. While the precision measures the probability that a material selected by the model actually has Tc > Tsep, the recall reports how sensitive the model is to above-Tsep materials. Maximizing the precision or recall would require some compromise with the other, i.e., a model that labels all materials as above-Tsep would have perfect recall but dismal precision. To quantify the trade-off between recall and precision, their harmonic mean (F1 score) is widely used to measure the performance of a classification model. With the exception of accuracy, these metrics are not symmetric with respect to the exchange of positive and negative labels.

For a realistic estimate of the performance of each model, the dataset is randomly split (85%/15%) into training and test subsets. The training set is employed to fit the model, which is then applied to the test set for subsequent benchmarking. The aforementioned metrics (Eqs. (1)-(4)) calculated on the test set provide an unbiased estimate of how well the model is expected to generalize to a new (but similar) dataset. With the random forest method, similar estimates can be obtained intrinsically at the training stage. Since each tree is trained only on a bootstrapped subset of the data, the remaining subset can be used as an internal test set. These two methods for quantifying model performance usually yield very similar results.

With the procedure in place, the models' metrics are evaluated for a range of Tsep and illustrated in Fig. 2b. The accuracy increases as Tsep goes from 1 to 40 K, and the proportion of above-Tsep compounds drops from above 70% to about 15%, while the recall and F1 score generally decrease. The region between 5 and 15 K is especially appealing in (nearly) maximizing all benchmarking metrics while balancing the sizes of the bins. In fact, setting Tsep = 10 K is a particularly convenient choice. It is also the temperature used in refs. 23,24 to separate the two classes, as it is just above the highest Tc of all elements and pseudoelemental materials (solid solutions whose range of composition includes a pure element). Here, the proportion of above-Tsep materials is ~38% and the accuracy is about 92%, i.e., the model can correctly classify nine out of ten materials, much better than random guessing. The recall, quantifying how well all above-Tsep compounds are labeled and, thus, the most important metric when searching for new superconducting materials, is even higher. (Note that the models' metrics also depend on random factors such as the composition of the training and test sets, and their exact values can vary.)

Although correlations between predictors do not affect the model's ability to learn, they can distort importance estimates. For example, a material property with a strong effect on Tc can be shared among several correlated predictors. Since the model can access the same information through any of these variables, their relative importances are diluted across the group. To reduce the effect and limit the list of predictors to a manageable size, the backward feature elimination method is employed. The process begins with a model constructed with the full list of predictors, and iteratively removes the least significant one, rebuilding the model and recalculating importances with every iteration. (This iterative procedure is necessary since the ordering of the predictors by importance can change at each step.) Predictors are removed until the overall accuracy of the model drops by 2%, at which point there are only five left. Furthermore, two of these predictors are strongly correlated with each other, and we remove the less important one. This has a negligible impact on the model performance, yielding four predictors total (see Table 1) with an above 90% accuracy score, only slightly worse than the full model.

Scatter plots of the pairs of the most important predictors are shown in Fig. 3, where blue/red denotes whether the material is in the below-Tsep/above-Tsep class. Figure 3a shows a scatter plot of 3000 compounds in the space spanned by the standard deviations of the column numbers and electronegativities calculated over the elemental values. Superconductors with Tc > 10 K tend to cluster in the upper-right corner of the plot and in a relatively thin elongated region extending to the left of it. In fact, the points in the upper-right corner represent mostly cuprate materials, which, with their complicated compositions and large number of elements, are likely to have high standard deviations in these variables. Figure 3b shows the same compounds projected in the space of the standard deviations of the melting temperatures and the averages of the atomic weights of the elements forming each compound. The above-Tsep materials tend to cluster in areas with lower mean atomic weights, not a surprising result given the role of phonons in conventional superconductivity.

Fig. 2 SuperCon dataset and classification model performance. a Histogram of materials categorized by Tc (bin size is 2 K; only those with finite Tc are counted). Blue, green, and red denote low-Tc, iron-based, and cuprate superconductors, respectively. In the inset: histogram of materials categorized by ln(Tc), restricted to those with Tc > 10 K. b Performance of different classification models as a function of the threshold temperature (Tsep) that separates materials in two classes by Tc. Performance is measured by accuracy (gray), precision (red), recall (blue), and F1 score (purple). The scores are calculated from predictions on an independent test set, i.e., one separate from the dataset used to train the model. In the inset: the dashed red curve gives the proportion of materials in the above-Tsep set. c Accuracy, precision, recall, and F1 score as a function of the size of the training set with a fixed test set. d Accuracy, precision, recall, and F1 as a function of the number of predictors.
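The backward elimination loop traced in Fig. 2d can be sketched as below. The stopping rule is simplified to a fixed target count rather than the 2% accuracy-drop criterion used in the text, and the toy data, names, and scikit-learn usage are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def backward_eliminate(X, y, names, n_keep):
    """Drop the least important predictor one at a time, refitting at each step.

    Importances are recalculated every iteration because the ranking can change
    as predictors are removed.
    """
    keep = list(range(X.shape[1]))
    while len(keep) > n_keep:
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[:, keep], y)
        weakest = keep[int(np.argmin(clf.feature_importances_))]
        keep.remove(weakest)
    return [names[i] for i in keep]

# Toy data: only the first two of six predictors carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
kept = backward_eliminate(X, y, ["f0", "f1", "f2", "f3", "f4", "f5"], n_keep=2)
```

On this toy problem the two informative predictors survive while the noise columns are removed first, mirroring how the procedure pares 145 predictors down to a handful.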
For comparison, we create another classifier based on the average number of valence electrons, metallic electronegativity differences, and orbital radii differences, i.e., the predictors used in refs. 23,24 to cluster materials with Tc > 10 K. A classifier built only with these three predictors is less accurate than both the full and the truncated models presented herein, but comes quite close: the full model has about 3% higher accuracy and F1 score, while the truncated model with four predictors is less than 2% more accurate. The rather small (albeit not insignificant) differences demonstrate that even on the scale of the entire SuperCon dataset, the predictors used by Villars and Rabe [23,24] capture much of the relevant chemical information for superconductivity.

Table 1. The most relevant predictors and their importances for the classification and general regression models.

Rank | Classification | Regression (general; Tc > 10 K)
1 | std(column number) 0.26 | avg(number of unfilled orbitals) 0.26
2 | std(electronegativity) 0.26 | std(ground state volume) 0.18
3 | std(melting temperature) 0.23 | std(space group number) 0.17
4 | avg(atomic weight) 0.24 | avg(number of d unfilled orbitals) 0.17
5 | — | std(number of d valence electrons) 0.12
6 | — | avg(melting temperature) 0.10

avg(x) and std(x) denote the composition-weighted average and standard deviation, respectively, calculated over the vector of elemental values for each compound. For the classification model, all predictor importances are quite close.

Fig. 3 Scatter plots of 3000 superconductors in the space of the four most important classification predictors. Blue/red represent below-Tsep/above-Tsep materials, where Tsep = 10 K. a Feature space of the first and second most important predictors: standard deviations of the column numbers and electronegativities (calculated over the values for the constituent elements in each compound). b Feature space of the third and fourth most important predictors: standard deviation of the elemental melting temperatures and average of the atomic weights.

Regression models

After constructing a successful classification model, we now move to the more difficult challenge of predicting Tc. Creating a regression model may enable better understanding of the factors controlling Tc of known superconductors, while also serving as an organic part of a system for identifying potential new ones. Leveraging the same set of elemental predictors as the classification model, several regression models are presented focusing on materials with Tc > 10 K. This approach avoids the problem of materials with no reported Tc with the assumption that, if they were to exhibit superconductivity at all, their critical temperature would be below 10 K.
a Feature space of the first and second most important predictors: standard deviations of the column sep sep numbers and electronegativities (calculated over the values for the constituent elements in each compound). b Feature space of the third and fourth most important predictors: standard deviation of the elemental melting temperatures and average of the atomic weights bc de Fig. 4 Benchmarking of regression models predicting ln(T ). a Predicted vs. measured ln(T ) for the general regression model. The test set c c comprising a mix of low-T , iron-based, and cuprate superconductors with T > 10 K. With an R of about 0.88, this one model can accurately c c predict T for materials in different superconducting groups. b, c Predictions of the regression model trained solely on low-T compounds for c c test sets containing cuprate and iron-based materials. d, e Predictions of the regression model trained solely on cuprates for test sets containing low-T and iron-based superconductors. Models trained on a single group have no predictive power for materials from other groups (T ) as the target variable (which is problematic as T → 0), and model does reasonably well among the different c c thus addresses the problem of the uneven distribution of families–benchmarked on the test set, the model achieves R ≈ materials along the T -axis (Fig. 2a). Using ln(T ) creates a more 0.88 (Fig. 4a). It suggests that the random forest algorithm is c c uniform distribution (Fig. 2a inset), and is also considered a best flexible and powerful enough to automatically separate the practice when the range of a target variable covers more than one compounds into groups and create group-specific branches with order-of-magnitude (as in the case of T ). Following this distinct predictors (no explicit group labels were used during transformation, the dataset is parsed randomly (85%/15%) into training and testing). 
As validation, three separate models are training and test subsets (similarly performed for the classification trained only on a specific family, namely the low-T , cuprate, and model). iron-based superconductors, respectively. Benchmarking on Present within the dataset are distinct families of super- mixed-family test sets, the models performed well on compounds conductors with different driving mechanisms for superconduc- belonging to their training set family while demonstrating no tivity, including cuprate and iron-based high-temperature predictive power on the others. Figure 4b–d illustrates a cross- superconductors, with all others denoted “low-T ” for brevity (no section of this comparison. Specifically, the model trained on low- specific mechanism in this group). Surprisingly, a single-regression T compounds dramatically underestimates the T of both high- c c npj Computational Materials (2018) 29 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences Machine learning modeling of superconducting critical V Stanev et al. Table 2. 
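The general regression setup described above (ln(Tc) as the target, an 85%/15% random split, and a random forest regressor) can be sketched with scikit-learn, the library named in Methods. This is a minimal sketch: the feature matrix is a synthetic stand-in for the Magpie predictors, so the printed R² is illustrative only and not the paper's 0.88.

```python
# Sketch of the general regression model: random forest on composition
# features with ln(Tc) as the target. Data are synthetic stand-ins.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))  # stand-in for the Magpie feature matrix
# Synthetic critical temperatures spanning more than one order of magnitude.
tc = np.exp(2.5 + X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500))

# ln(Tc) as the target: materials with Tc <= 10 K are excluded upstream,
# so the logarithm is well defined and the target distribution is uniform.
y = np.log(tc)

# 85%/15% random train/test split, as in the text.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("out-of-sample R^2:", round(r2_score(y_test, model.predict(X_test)), 2))
```

Back-transforming predictions with np.exp recovers Tc in kelvin; squashing the target with the logarithm is what lets the low- and high-Tc ranges contribute comparably to the fit.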
The most significant predictors and their importances for the three material-specific regression models Predictor rank Model Regression (low-T ) Regression (cuprates) Regression (Fe-based) 1 frac(d valence electrons) 0.18 avg(number of unfilled orbitals) 0.22 std(column number) 0.17 2 avg(number of d unfilled orbitals) 0.14 std(number of d valence electrons) 0.13 avg(ionic character) 0.15 3 avg(number of valence electrons) 0.13 frac(d valence electrons) 0.13 std(Mendeleev number) 0.14 4 frac(s valence electrons) 0.11 std(ground state volume) 0.13 std(covalent radius) 0.14 5 avg(number of d valence electrons) 0.09 std(number of valence electrons) 0.1 max(melting temperature) 0.14 6 avg(covalent radius) 0.09 std(row number) 0.08 avg(Mendeleev number) 0.14 7 avg(atomic weight) 0.08 ||composition|| 0.07 ||composition|| 0.11 2 2 8 avg(Mendeleev number) 0.07 std(number of s valence electrons) 0.07 — 9 avg(space group number) 0.07 std(melting temperature) 0.07 — 10 avg(number of unfilled orbitals) 0.06—— avg(x), std(x), max(x), and frac(x) denote the composition-weighted average, standard deviation, maximum, and fraction, respectively, taken over the elemental pffiffiffiffiffiffiffiffiffiffiffiffi 2 2 values for each compound. l -norm of a composition is calculated by kk x ¼ x , where x is the proportion of each element i in the compound 2 i i temperature superconducting families (Fig. 4b, c), even though recovers the empirical relation first discovered by Matthias more this test set only contains compounds with T < 40 K. Conversely, than 60 years ago. Such findings validate the ability of ML the model trained on the cuprates tends to overestimate the T of approaches to discover meaningful patterns that encode true low-T (Fig. 4d) and iron-based (Fig. 4e) superconductors. This is a physical phenomena. clear indication that superconductors from these groups have Similar T -vs.-predictor plots reveal more interesting and subtle different factors determining their T . 
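The composition-weighted statistics defined in the Table 2 footnote can be written out directly. The helper below (`features`) is an invented name and the property values are illustrative; this is not Magpie's implementation, just the footnote's formulas made explicit.

```python
# avg(x), std(x), and max(x) are composition-weighted statistics over the
# elemental values of a property; ||composition||_2 is the l2-norm of the
# element fractions, per the Table 2 footnote.
import math

def features(composition, prop):
    """composition: {element: stoichiometric amount}; prop: {element: value}."""
    total = sum(composition.values())
    fracs = {el: n / total for el, n in composition.items()}
    avg = sum(f * prop[el] for el, f in fracs.items())
    var = sum(f * (prop[el] - avg) ** 2 for el, f in fracs.items())
    return {
        "avg": avg,                       # composition-weighted average
        "std": math.sqrt(var),            # composition-weighted std deviation
        "max": max(prop[el] for el in composition),
        "l2_norm": math.sqrt(sum(f * f for f in fracs.values())),
    }

# Toy example: MgB2 with elemental atomic weights.
weights = {"Mg": 24.305, "B": 10.811}
f = features({"Mg": 1, "B": 2}, weights)
print(f)
```

For MgB2 the element fractions are 1/3 and 2/3, giving avg(atomic weight) ≈ 15.3 and ||composition||2 = sqrt(5)/3 ≈ 0.745.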
Interestingly, the family- features. A narrow cluster of materials with T > 20 K emerges in specific models do not perform better than the general regression the context of the mean covalent radii of compounds (Fig. 5b)— containing all the data points: R for the low-T materials is about c another important predictor for low-T superconductors. The 0.85, for cuprates is just below 0.8, and for iron-based compounds cluster includes (left-to-right) alkali-doped C ,MgB -related 60 2 is about 0.74. In fact, it is a purely geometric effect that the compounds, and bismuthates. The sector likely characterizes a combined model has the highest R . Each group of super- region of strong covalent bonding and corresponding high- conductors contributes mostly to a distinct T range, and, as a c frequency phonon modes that enhance T (however, frequencies result, the combined regression is better determined over longer that are too high become irrelevant for superconductivity). Another temperature interval. interesting relation appears in the context of the average number In order to reduce the number of predictors and increase the of d valence electrons. Figure 5c illustrates a fundamental bound interpretability of these models without significant detriment to on T of all non-cuprate and non-iron-based superconductors. their performance, a backward feature elimination process is again A similar limit exists for cuprates based on the average number employed. The procedure is very similar to the one described of unfilled orbitals (Fig. 5d). It appears to be quite rigid—several previously for the classification model, with the only difference data points found above it on inspection are actually incorrectly being that the reduction is guided by R of the model, rather than recorded entries in the database and were subsequently removed. the accuracy (the procedure stops when R drops by 3%). 
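The backward feature elimination described above can be sketched as a simple loop: repeatedly drop the least important predictor, retrain, and stop once R² falls more than 3% below the full model's score. The data are synthetic and the schedule is simplified (no hyperparameter re-optimization at each step, in line with the Methods note).

```python
# Sketch of R^2-guided backward feature elimination over a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 10))
y = X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.normal(size=400)  # 2 informative features

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def fit_score(column_subset):
    m = RandomForestRegressor(n_estimators=50, random_state=0)
    m.fit(X_tr[:, column_subset], y_tr)
    return m, r2_score(y_te, m.predict(X_te[:, column_subset]))

cols = list(range(X.shape[1]))
model, full_score = fit_score(cols)
while len(cols) > 1:
    # Drop the predictor the current forest considers least important.
    weakest = cols[int(np.argmin(model.feature_importances_))]
    trial = [c for c in cols if c != weakest]
    trial_model, score = fit_score(trial)
    if score < full_score - 0.03 * abs(full_score):  # stop on a 3% R^2 drop
        break
    cols, model = trial, trial_model
print("retained predictors:", cols)
```

With this synthetic target, the loop strips the pure-noise columns and halts as soon as removing an informative predictor would cost more than 3% of the full model's R².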
The connection between T and the average number of unfilled The most important predictors for the four models (one general orbitals may offer new insight into the mechanism for super- and three family-specific) together with their importances are conductivity in this family. (The number of unfilled orbitals refers shown in Tables 1 and 2. Differences in important predictors to the electron configuration of the substituent elements before across the family-specific models reflect the fact that distinct combining to form oxides. For example, Cu has one unfilled orbital mechanisms are responsible for driving superconductivity among 2 9 14 2 10 3 ([Ar]4s 3d ) and Bi has three ([Xe]4f 6s 5d 6p ). These values are these groups. The list is longest for the low-T superconductors, averaged per formula unit.) Known trends include higher T ’s for reflecting the eclectic nature of this group. Similar to the general structures that (i) stabilize more than one superconducting Cu–O regression model, different branches are likely created for distinct plane per unit cell and (ii) add more polarizable cations such as Tl sub-groups. Nevertheless, some important predictors have + 2+ and Hg between these planes. The connection reflects these straightforward interpretation. As illustrated in Fig. 5a, low average observations, since more copper and oxygen per formula unit atomic weight is a necessary (albeit not sufficient) condition for leads to lower average number of unfilled orbitals (one for copper, achieving high T among the low-T group. In fact, the maximum c c pffiffiffiffiffiffiffi two for oxygen). Further, the lower-T cuprates typically consist of T for a given weight roughly follows 1= m . Mass plays a c A 2− 3− Cu /Cu -containing layers stabilized by the addition/substition significant role in conventional superconductors through the 2+ 3+ of hard cations, such as Ba and La , respectively. 
These cations Debye frequency of phonons, leading to the well-known formula pffiffiffiffi have a large number of unfilled orbitals, thus increasing the 56– T  1= m, where m is the ionic mass (see, for example, refs. compound’s average. Therefore, the ability of between-sheet ). Other factors like density of states are also important, which cations to contribute charge to the Cu–O planes may be indeed explains the spread in T for a given m . Outlier materials clearly c A pffiffiffiffiffiffiffi quite important. The more polarizable the A cation, the more above the  1= m line include bismuthates and chloronitrates, electron density it can contribute to the already strongly covalent suggesting the conventional electron-phonon mechanism is not 2+ Cu –O bond. driving superconductivity in these materials. Indeed, chloroni- trates exhibit a very weak isotope effect, though some Including AFLOW unconventional electron-phonon coupling could still be relevant for superconductivity. Another important feature for low-T The models described previously demonstrate surprising accuracy materials is the average number of valence electrons. This and predictive power, especially considering the difference Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2018) 29 Machine learning modeling of superconducting critical V Stanev et al. Fig. 5 Scatter plots of T for superconducting materials in the space of significant, family-specific regression predictors. For 4000 “low-T ” c c superconductors (i.e., non-cuprate and non-iron-based), T is plotted vs. the a average atomic weight, b average covalent radius, and c average pffiffiffiffiffiffiffi number of d valence electrons. The dashed red line in a is  1= m . Having low average atomic weight and low average number of d valence electrons are necessary (but not sufficient) conditions for achieving high T in this group. 
d Scatter plot of T for all known superconducting c c cuprates vs. the mean number of unfilled orbitals. c, d suggest that the values of these predictors lead to hard limits on the maximum achievable T between the relevant energy scales of most Magpie predictors the superconducting compounds in the ICSD also yields an (typically in the range of eV) and superconductivity (meV scale). unsatisfactory regression model. The issue is not the lack of This disparity, however, hinders the interpretability of the models, compounds per se, as models created with randomly drawn i.e., the ability to extract meaningful physical correlations. Thus, it subsets from SuperCon with similar counts of compounds perform is highly desirable to create accurate ML models with features much better. In fact, the problem is the chemical sparsity of based on measurable macroscopic properties of the actual superconductors in the ICSD, i.e., the dearth of closely related compounds (e.g., crystallographic and electronic properties) rather compounds (usually created by chemical substitution). This than composite elemental predictors. Unfortunately, only a small translates to compound scatter in predictor space—a challenging subset of materials in SuperCon is also included in the ICSD: about learning environment for the model. 1500 compounds in total, only about 800 with finite T , and even The chemical sparsity in ICSD superconductors is a significant fewer are characterized with ab initio calculations. (Most of the hurdle, even when both sets of predictors (i.e., Magpie and AFLOW superconductors in ICSD but not in AFLOW are non-stoichio- features) are combined via feature fusion. Additionally, this metric/doped compounds, and thus not amenable to conven- approach neglects the majority of the 16,000 compounds tional DFT methods. For the others, AFLOW calculations were available via SuperCon. 
Instead, we constructed separate models employing Magpie and AFLOW features, and then judiciously attempted but did not converge to a reasonable solution.) In fact, a good portion of known superconductors are disordered (off- combined the results to improve model metrics—known as late or stoichiometric) materials and notoriously challenging to address decision-level fusion. Specifically, two independent classification with DFT calculations. Currently, much faster and efficient models are developed, one using the full SuperCon dataset and methods are becoming available for future applications. Magpie predictors, and another based on superconductors in the To extract suitable features, data are incorporated from the ICSD and AFLOW predictors. Such an approach can improve the AFLOW Online Repositories—a database of DFT calculations recall, for example, in the case where we classify “high-T ” managed by the software package AFLOW. It contains information superconductors as those predicted by either model to be above- for the vast majority of compounds in the ICSD and about T . Indeed, this is the case here where, separately, the models sep 550 superconducting materials. In ref. , several ML models using obtain a recall of 40 and 66%, respectively, and together achieve a a similar set of materials are presented. Though a classifier shows recall of about 76%. (These numbers are based on a relatively good accuracy, attempts to create a regression model for T led to small test set benchmarking and their uncertainty is roughly 3%.) disappointing results. We verify that using Magpie predictors for In this way, the models’ predictions complement each other in a npj Computational Materials (2018) 29 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences Machine learning modeling of superconducting critical V Stanev et al. constructive way such that above-T materials missed by one sep Table 3. 
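The decision-level fusion rule described above (flag a material as "above Tsep" if either classifier says so) is easy to state precisely. The prediction arrays below are toy values chosen so that each model catches positives the other misses; the actual recalls reported in the text (40, 66, and about 76%) come from the paper's benchmark, not from this sketch.

```python
# OR-combination of two binary classifiers, as in late/decision-level fusion.
import numpy as np

def fused_positive(pred_a, pred_b):
    """A compound is labeled positive if either model predicts positive."""
    return np.logical_or(pred_a, pred_b).astype(int)

def recall(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    return tp / np.sum(y_true == 1)

# Toy labels: five true positives, five true negatives.
y_true  = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
model_a = np.array([1, 1, 0, 0, 0, 0, 0, 0, 1, 0])  # catches positives 0, 1
model_b = np.array([0, 1, 1, 1, 0, 0, 1, 0, 0, 0])  # catches positives 1, 2, 3
fused = fused_positive(model_a, model_b)
print(recall(y_true, model_a), recall(y_true, model_b), recall(y_true, fused))
```

Fusing by OR can only increase recall (every positive caught by either model survives), at some potential cost in precision from accumulated false positives.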
Searching for new superconductors in the ICSD

As a final proof-of-concept demonstration, the classification and regression models described previously are integrated into one pipeline and employed to screen the entire ICSD database for candidate "high-Tc" superconductors. (Note that "high-Tc" is a label, the precise meaning of which can be adjusted.) Similar tools power high-throughput screening workflows for materials with desired thermal conductivity and magnetocaloric properties [50, 62]. As a first step, the full set of Magpie predictors is generated for all compounds in the ICSD. A classification model similar to the one presented above is constructed, but trained only on materials in SuperCon and not in the ICSD (the overlap is used as an independent test set). The model is then applied to the ICSD set to create a list of materials predicted to have Tc above 10 K. Opportunities for model benchmarking are limited to those materials present in both the SuperCon and ICSD datasets, though this test set is shown to be problematic. The set includes about 1500 compounds, with Tc reported for only about half of them. The model achieves an impressive accuracy of 0.98, which is overshadowed by the fact that 96.6% of these compounds belong to the Tc < 10 K class. The precision, recall, and F1 scores are about 0.74, 0.66, and 0.70, respectively. These metrics are lower than the estimates calculated for the general classification model, which is expected, given that this set cannot be considered randomly selected. Nevertheless, the performance suggests a good opportunity to identify new candidate superconductors.

Next in the pipeline, the list is fed into a random forest regression model (trained on the entire SuperCon database) to predict Tc. Filtering on the materials with predicted Tc > 20 K, the list is further reduced to about 2000 compounds. This count may appear daunting, but it should be compared with the total number of compounds in the database: about 110,000. Thus, the method selects <2% of all materials, which, in the context of the training set (containing >20% with "high-Tc"), suggests that the model is not overly biased toward predicting high critical temperatures.

The vast majority of the compounds identified as candidate superconductors are cuprates, or at least compounds that contain copper and oxygen. There are also some materials clearly related to the iron-based superconductors. The remaining set has 35 members, and is composed of materials that are not obviously connected to any high-temperature superconducting family (see Table 3). (For at least one compound from the list, Na3Ni2BiO6, low-temperature measurements have been performed and no signs of superconductivity were observed.) None of them is predicted to have Tc in excess of 40 K, which is not surprising, given that no such instances exist in the training dataset. All contain oxygen, also not a surprising result, since the group of known superconductors with Tc > 20 K is dominated by oxides.

Table 3. List of potential superconductors identified by the pipeline. Also shown are their ICSD numbers and symmetries. Note that for some compounds there are several entries. All of the materials contain oxygen.

Compound                ICSD number(s)             Symmetry
CsBe(AsO4)              074027                     Orthorhombic
RbAsO2                  413150                     Orthorhombic
KSbO2                   411214                     Monoclinic
RbSbO2                  411216                     Monoclinic
CsSbO2                  059329                     Monoclinic
AgCrO2                  004149/025624              Hexagonal
K0.8(Li0.2Sn0.76)O2     262638                     Hexagonal
Cs(MoZn)(O3F3)          018082                     Cubic
Na3Cd2(IrO6)            404507                     Monoclinic
Sr3Cd(PtO6)             280518                     Hexagonal
Sr3Zn(PtO6)             280519                     Hexagonal
(Ba5Br2)Ru2O9           245668                     Hexagonal
Ba4(AgO2)(AuO4)         072329                     Orthorhombic
Sr5(AuO4)2              071965                     Orthorhombic
RbSeO2F                 078399                     Cubic
CsSeO2F                 078400                     Cubic
KTeO2F                  411068                     Monoclinic
Na2K4(Tl2O6)            074956                     Monoclinic
Na3Ni2BiO6              237391                     Monoclinic
Na3Ca2BiO6              240975                     Orthorhombic
CsCd(BO3)               189199                     Cubic
K2Cd(SiO4)              083229/086917              Orthorhombic
Rb2Cd(SiO4)             093879                     Orthorhombic
K2Zn(SiO4)              083227                     Orthorhombic
K2Zn(Si2O6)             079705                     Orthorhombic
K2Zn(GeO4)              069018/085006/085007       Orthorhombic
(K0.6Na1.4)Zn(GeO4)     069166                     Orthorhombic
K2Zn(Ge2O6)             065740                     Orthorhombic
Na6Ca3(Ge2O6)3          067315                     Hexagonal
Cs3(AlGe2O7)            412140                     Monoclinic
K4Ba(Ge3O9)             100203                     Monoclinic
K16Sr4(Ge3O9)4          100202                     Cubic
K3Tb[Ge3O8(OH)2]        193585                     Orthorhombic
K3Eu[Ge3O8(OH)2]        262677                     Orthorhombic
KBa6Zn4(Ga7O21)         040856                     Trigonal

The list comprises several distinct groups. Most of the materials are insulators, similar to stoichiometric (and underdoped) cuprates; charge doping and/or pressure will be required to drive these materials into a superconducting state. Especially interesting are the compounds containing heavy metals (such as Au, Ir, and Ru), metalloids (Se, Te), and heavier post-transition metals (Bi, Tl), which are, or could be pushed into, interesting/unstable oxidation states. The most surprising and non-intuitive compounds in the list are the silicates and the germanates. These materials form corner-sharing SiO4 or GeO4 polyhedra, similar to quartz glass, and also have counter-cations with full or empty shells, such as Cd²⁺ or K⁺. Converting these insulators to metals (and possibly superconductors) likely requires significant charge doping. However, the similarity between these compounds and the cuprates is meaningful. In compounds like K2CdSiO4 or K2ZnSiO4, the K2Cd (or K2Zn) unit carries a 4+ charge that offsets the (SiO4)⁴⁻ (or (GeO4)⁴⁻) charge. This is reminiscent of the way Sr2 balances the (CuO4)⁴⁻ unit in Sr2CuO4. Such chemical similarities, based on charge balancing and stoichiometry, were likely identified and exploited by the ML algorithms.

Fig. 6 DOS of four compounds identified by the ML algorithm as potential materials with Tc > 20 K: Cs3(AlGe2O7), CsBe(AsO4), Sr3Cd(PtO6), and Ba4(AgO2)(AuO4). The partial DOS contributions from s, p, and d electrons and the total DOS are shown in blue, green, red, and black, respectively. The large peak just below EF is a direct consequence of the flat band(s) present in all these materials. These images were generated automatically via AFLOW. In the case of substantial overlap among k-point labels, the right-most label is offset below.

The electronic properties calculated by AFLOW offer additional insight into the results of the search, and suggest a possible connection among these candidates. Plotting the electronic structure of the potential superconductors exposes a rather unusual feature shared by almost all of them: one or several (nearly) flat bands just below the energy of the highest occupied electronic state. Such bands lead to a large peak in the DOS (Fig. 6) and can cause a significant enhancement in Tc. Peaks in the DOS elicited by van Hove singularities can enhance Tc if sufficiently close to EF [64–66]. However, note that, unlike typical van Hove points, a true flat band creates a divergence in the DOS itself (as opposed to in its derivatives), which in turn leads to a critical-temperature dependence that is linear in the pairing interaction strength, rather than the usual exponential relationship yielding lower Tc.
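The screening pipeline described above is, at its core, a classifier-then-regressor cascade with two thresholds (10 K, then 20 K). The sketch below shows that control flow only; the data are synthetic and the `screen` helper is an invented name, not the paper's actual code.

```python
# Two-stage screening: classify above/below 10 K, then keep candidates
# whose regression-predicted Tc exceeds 20 K. Synthetic stand-in data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(2)
X_train = rng.normal(size=(600, 6))
tc_train = np.exp(2.0 + 1.5 * X_train[:, 0])           # synthetic Tc in kelvin

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, tc_train > 10.0)                      # stage 1: 10 K label
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, np.log(tc_train))                     # stage 2: ln(Tc) model

def screen(X_candidates):
    keep = clf.predict(X_candidates)                   # predicted above 10 K
    stage1 = X_candidates[keep]
    tc_pred = np.exp(reg.predict(stage1))              # back-transform to Tc
    return stage1[tc_pred > 20.0]                      # final 20 K filter

candidates = rng.normal(size=(200, 6))
shortlist = screen(candidates)
print(len(shortlist), "of", len(candidates), "candidates pass both stages")
```

As in the text, the cascade is deliberately conservative: most candidates fall out at one of the two thresholds, leaving a small shortlist for inspection.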
Additionally, there is significant similarity with the band structure and DOS of the layered BiS2-based superconductors.

This band structure feature came as a surprising result of applying the ML model. It was not sought for and, moreover, no explicit information about the electronic band structure has been included in these predictors. This is in contrast to the algorithm presented in ref. , which was specifically designed to filter ICSD compounds based on several preselected electronic structure features. While at the moment it is not clear if some (or indeed any) of these compounds are really superconducting, let alone with Tc's above 20 K, the presence of this highly unusual electronic structure feature is encouraging. Attempts to synthesize several of these compounds are already underway.

DISCUSSION

Herein, several machine learning tools are developed to study the critical temperature of superconductors. Based on information from the SuperCon database, initial coarse-grained chemical features are generated using the Magpie software. As a first application of ML methods, materials are divided into two classes depending on whether Tc is above or below 10 K. A non-parametric random forest classification model is constructed to predict the class of superconductors. The classifier shows excellent performance, with out-of-sample accuracy and F1 score of about 92%. Next, several successful random forest regression models are created to predict the value of Tc, including separate models for the three material sub-groups, i.e., cuprate, iron-based, and low-Tc compounds. By studying the importance of predictors for each family of superconductors, insights are obtained about the physical mechanisms driving superconductivity among the different groups. With the incorporation of crystallographic- and electronic-based features from the AFLOW Online Repositories, the ML models are further improved. Finally, we combined these models into one integrated pipeline, which is employed to search the entire ICSD database for new inorganic superconductors. The model identified 35 oxides as candidate materials. Some of these are chemically and structurally similar to the cuprates (even though no explicit structural information was provided during training of the model). Another feature that unites almost all of these materials is the presence of flat or nearly-flat bands just below the energy of the highest occupied electronic state.

In conclusion, this work demonstrates the important role ML models can play in superconductivity research. Records collected over several decades in SuperCon and other relevant databases can be consumed by ML models, generating insights and promoting better understanding of the connection between materials' chemistry/structure and superconductivity. Application of sophisticated ML algorithms has the potential to dramatically accelerate the search for candidate high-temperature superconductors.

METHODS

Superconductivity data

The SuperCon database consists of two separate subsets: "Oxide and Metallic" (inorganic materials containing metals, alloys, cuprate high-temperature superconductors, etc.) and "Organic" (organic superconductors). Downloading the entire inorganic materials dataset and removing compounds with incompletely specified chemical compositions leaves about 22,000 entries. If a single Tc record exists for a given material, it is taken to accurately reflect the critical temperature of this material. In the case of multiple records for the same compound, the reported Tc's are averaged, but only if their standard deviation is <5 K; the compound is discarded otherwise. This brings the total down to about 16,400 compounds, of which around 4,000 have no critical temperature reported. Each entry in the set contains fields for the chemical composition, Tc, structure, and a journal reference to the information source. Here, structural information is ignored, as it is not always available.

There are occasional problems with the validity and consistency of some of the data. For example, the database includes some reports based on tenuous experimental evidence and only indirect signatures of superconductivity, as well as reports of inhomogeneous (surface, interfacial) and non-equilibrium phases. Even in cases of bona fide bulk superconducting phases, important relevant variables like pressure are not recorded. Though some of the obviously erroneous records were removed from the data, these issues were largely ignored, assuming their effect on the entire dataset to be relatively modest. The data cleaning and processing is carried out using the Python Pandas package for data analysis.

Chemical and structural features

The predictors are calculated using the Magpie software. It computes a set of 145 attributes for each material, including: (i) stoichiometric features (which depend only on the ratio of elements and not the specific species); (ii) elemental property statistics: the mean, mean absolute deviation, range, minimum, maximum, and mode of 22 different elemental properties (e.g., period/group on the periodic table, atomic number, atomic radii, melting temperature); (iii) electronic structure attributes: the average fraction of electrons from the s, p, d, and f valence shells among all elements present; and (iv) ionic compound features, which include whether it is possible to form an ionic compound assuming all elements exhibit a single oxidation state.

Fig. 7 Regression model predictions of Tc. Predicted vs. measured Tc for the general regression model. The R² score is comparable to that obtained for the regression model of ln(Tc).
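The Tc-deduplication rule in the Methods (average multiple reports for the same compound only if their standard deviation is below 5 K, otherwise discard the compound) maps naturally onto a pandas groupby. The column names and values below are illustrative, not SuperCon's actual schema.

```python
# Sketch of the multiple-record cleaning rule with pandas.
import pandas as pd

records = pd.DataFrame({
    "composition": ["MgB2", "MgB2", "NbN", "NbN", "XYZ", "XYZ"],
    "tc":          [39.0,   38.6,   16.0,  15.8,  4.0,   30.0],
})

stats = records.groupby("composition")["tc"].agg(["mean", "std", "count"])
# std is NaN for single-record compounds; treat those as consistent.
consistent = stats["std"].fillna(0.0) < 5.0
cleaned = stats.loc[consistent, "mean"].rename("tc")
print(cleaned)
```

Here the hypothetical "XYZ" rows (reports of 4 K and 30 K) disagree by far more than 5 K, so the compound is dropped, while the closely agreeing MgB2 and NbN reports are averaged.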
ML models are also constructed with the superconducting materials in the AFLOW Online Repositories. AFLOW is a high-throughput ab initio framework that manages density functional theory (DFT) calculations in accordance with the AFLOW Standard. The Standard ensures that the calculations and derived properties are empirical (reproducible), reasonably well-converged, and, above all, consistent (fixed set of parameters), a particularly attractive feature for ML modeling. Many materials properties important for superconductivity have been calculated within the AFLOW framework and are easily accessible through the AFLOW Online Repositories. The features are built with the following properties: number of atoms, space group, density, volume, energy per atom, electronic entropy per atom, valence of the cell, scintillation attenuation length, the ratios of the unit cell's dimensions, and Bader charges and volumes. For the Bader charges and volumes (vectors), the following statistics are calculated and incorporated: the maximum, minimum, average, standard deviation, and range.

Machine learning algorithms

Once we have a list of relevant predictors, various ML models can be applied to the data [51, 52]. All ML algorithms in this work are variants of the random forest method. It is based on creating a set of individual decision trees (hence the "forest"), each built to solve the same classification/regression problem. The model then combines their results, either by voting or averaging, depending on the problem. The deeper the individual trees are, the more complex the relationships the model can learn, but also the greater the danger of overfitting, i.e., learning some irrelevant information or just "noise". To make the forest more robust to overfitting, individual trees in the ensemble are built from samples drawn with replacement (a bootstrap sample) from the training set. In addition, when splitting a node during the construction of a tree, the model chooses the best split of the data considering only a random subset of the features.

The random forest models above are developed using scikit-learn, a powerful and efficient machine learning Python library. Hyperparameters of these models include the number of trees in the forest, the maximum depth of each tree, the minimum number of samples required to split an internal node, and the number of features to consider when looking for the best split. To optimize the classifier and the combined/family-specific regressors, the GridSearch function in scikit-learn is employed, which generates and compares candidate models from a grid of parameter values. To reduce computational expense, models are not optimized at each step of the backward feature selection process.

Fig. 8 Flat-bands feature. Comparison between the normalized average DOS of 380 known superconductors in the ICSD (left) and the normalized average DOS of the potential high-temperature superconductors from Table 3 (right).

Flat bands feature

The flat-band attribute is unusual for a superconducting material: the average DOS of the known superconductors in the ICSD has no distinct features, demonstrating a roughly uniform distribution of electronic states. In contrast, the average DOS of the potential superconductors in Table 3 shows a sharp peak just below EF (Fig. 8). Also, note that most of the flat bands in the potential superconductors we discuss have a notable contribution from the oxygen p-orbitals. Accessing/exploiting the potentially strong instability this electronic structure feature creates may require significant charge doping.
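The hyperparameter optimization described above (number of trees, maximum tree depth, minimum samples to split a node, and features per split) can be sketched with scikit-learn's GridSearchCV. The grid and data here are small illustrative stand-ins, not the grids actually used in the paper.

```python
# Sketch of random forest hyperparameter tuning via GridSearchCV.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=200)

param_grid = {
    "n_estimators": [25, 50],          # number of trees in the forest
    "max_depth": [4, None],            # maximum depth of each tree
    "min_samples_split": [2, 5],       # samples required to split a node
    "max_features": ["sqrt", None],    # features considered per split
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print("best parameters:", search.best_params_)
```

GridSearchCV fits one model per grid point per cross-validation fold and keeps the combination with the best mean validation score (R² for a regressor), which is why, as the Methods note, re-running it at every feature-elimination step would be expensive.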
To test the influence of using the log-transformed target variable ln(Tc), a general regression model is trained and tested on raw Tc data (shown in Fig. 7). This model is very similar to the one described in the section "Results", and its R² value is fairly similar as well (although comparing R² scores of models built using different target data can be misleading). However, note the relative sparsity of data points in some Tc ranges, which makes the model susceptible to outliers.

Prediction errors of the regression models

Previously, several regression models were described, each one designed to predict the critical temperatures of materials from different superconducting groups. These models achieved impressive R² scores, demonstrating good predictive power for each group. However, it is also important to consider the accuracy of the predictions for individual compounds (rather than for the aggregate set), especially in the context of searching for new materials. To do this, we calculate the prediction errors for about 300 materials from a test set. Specifically, we consider the difference between the logarithms of the measured and predicted critical temperatures, Δln(Tc) ≡ ln(Tc,meas) − ln(Tc,pred), normalized by the value of ln(Tc,meas) (the normalization compensates for the different Tc ranges of the different groups).

Fig. 9 Histograms of Δln(Tc)/ln(Tc,meas) for the four regression models, where Δln(Tc) ≡ ln(Tc,meas) − ln(Tc,pred).
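The per-compound error measure just defined, Δln(Tc)/ln(Tc,meas), is a one-liner; the helper name and toy values below are illustrative, chosen so that predictions sit slightly above the measurements.

```python
# Normalized log-error between measured and predicted Tc, as in Fig. 9.
import numpy as np

def normalized_log_errors(tc_measured, tc_predicted):
    log_meas = np.log(tc_measured)
    return (log_meas - np.log(tc_predicted)) / log_meas

# Toy values: predictions slightly above the measurements give small
# negative normalized errors.
tc_meas = np.array([12.0, 35.0, 90.0])
tc_pred = np.array([13.0, 38.0, 95.0])
errors = normalized_log_errors(tc_meas, tc_pred)
print(errors)
```

Dividing by ln(Tc,meas) is what makes a fixed multiplicative miss count for less at high Tc, so errors from the low-Tc, cuprate, and iron-based groups can be compared on one histogram.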
The models show a comparable spread of errors. The histograms of errors for the four models (combined and three group-specific) are shown in Fig. 9. The errors approximately follow a normal distribution, centered not at zero but at a small negative value. This suggests the models are marginally biased and, on average, tend to slightly underestimate Tc. The variance is comparable for all models, but largest for the model trained and tested on iron-based materials, which also shows the smallest R². Performance of this model is expected to benefit from a larger training set.

Data availability

The superconductivity data used to generate the results in this work can be downloaded from https://github.com/vstanev1/Supercon.

ACKNOWLEDGEMENTS

The authors are grateful to Daniel Samarov, Victor Galitski, Cormac Toher, Richard L. Greene, and Yibin Xu for many useful discussions and suggestions. We acknowledge Stephan Rühl for ICSD. This research is supported by ONR N000141512222, ONR N00014-13-1-0635, and AFOSR No. FA 9550-14-10332. C.O. acknowledges support from the National Science Foundation Graduate Research Fellowship under grant No. DGF1106401. J.P. acknowledges support from the Gordon and Betty Moore Foundation's EPiQS Initiative through grant No. GBMF4419. S.C. acknowledges support by the Alexander von Humboldt-Foundation.

AUTHOR CONTRIBUTIONS

V.S., I.T., and A.G.K. designed the research. V.S. worked on the model. C.O. and S.C. performed the AFLOW calculations. V.S., I.T., E.R., and J.P. analyzed the results. V.S., C.O., I.T., and E.R. wrote the text of the manuscript. All authors discussed the results and commented on the manuscript.

ADDITIONAL INFORMATION

Competing interests: The authors declare no competing interests.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

REFERENCES

1. Hirsch, J. E., Maple, M. B. & Marsiglio, F. Superconducting materials: conventional, unconventional and undetermined. Phys. C 514, 1–444 (2015).
2. Anderson, P. W. Plasmons, gauge invariance, and mass. Phys. Rev. 130, 439–442 (1963).
3. Chu, C. W., Deng, L. Z. & Lv, B. Hole-doped cuprate high temperature superconductors. Phys. C 514, 290–313 (2015).
4. Paglione, J. & Greene, R. L. High-temperature superconductivity in iron-based materials. Nat. Phys. 6, 645–658 (2010).
5. Bergerhoff, G., Hundt, R., Sievers, R. & Brown, I. D. The inorganic crystal structure data base. J. Chem. Inf. Comput. Sci. 23, 66–69 (1983).
6. Curtarolo, S. et al. AFLOW: an automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58, 218–226 (2012).
7. Landis, D. D. et al. The computational materials repository. Comput. Sci. Eng. 14, 51–57 (2012).
8. Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM 65, 1501–1509 (2013).
9. Jain, A. et al. Commentary: the Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
10. Agrawal, A. & Choudhary, A. Perspective: materials informatics and big data: realization of the "fourth paradigm" of science in materials science. APL Mater. 4, 053208 (2016).
11. Lookman, T., Alexander, F. J. & Rajan, K. (eds). A Perspective on Materials Informatics: State-of-the-Art and Challenges. https://doi.org/10.1007/978-3-319-23871-5 (Springer International Publishing, 2016).
12. Jain, A., Hautier, G., Ong, S. P. & Persson, K. A. New opportunities for materials informatics: resources and data mining techniques for uncovering hidden relationships. J. Mater. Res. 31, 977–994 (2016).
13. Mueller, T., Kusne, A. G. & Ramprasad, R. Machine Learning in Materials Science, pp. 186–273. https://doi.org/10.1002/9781119148739.ch4 (John Wiley & Sons, Inc., 2016).
14. Seko, A., Maekawa, T., Tsuda, K. & Tanaka, I. Machine learning with systematic density-functional theory calculations: application to melting temperatures of single- and binary-component solids. Phys. Rev. B 89, 054303–054313 (2014).
15. Balachandran, P. V., Theiler, J., Rondinelli, J. M. & Lookman, T. Materials prediction via classification learning. Sci. Rep. 5, 13285–13301 (2015).
16. Pilania, G. et al. Machine learning bandgaps of double perovskites. Sci. Rep. 6, 19375 (2016).
17. Isayev, O. et al. Universal fragment descriptors for predicting electronic properties of inorganic crystals. Nat. Commun. 8, 15679 (2017).
18.
41. Levy, O., Hart, G. L. W. & Curtarolo, S. Structure maps for hcp metals from first-principles calculations. Phys. Rev. B 81, 174106 (2010).
42. Levy, O., Chepulskii, R. V., Hart, G. L. W. & Curtarolo, S. The new face of rhodium alloys: revealing ordered structures from first principles. J. Am. Chem. Soc. 132, 833–837 (2010).
43. Levy, O., Hart, G. L. W. & Curtarolo, S. Uncovering compounds by synergy of cluster expansion and high-throughput methods. J. Am. Chem. Soc. 132, 4830–4833 (2010).
44. Hart, G. L. W., Curtarolo, S., Massalski, T. B. & Levy, O. Comprehensive search for new phases and compounds in binary alloy systems based on platinum-group metals, using a computational first-principles approach. Phys. Rev. X 3, 041035 (2013).
45. Mehl, M. J. et al. The AFLOW library of crystallographic prototypes: part 1. Comput. Mater. Sci. 136, S1–S828 (2017).
46. Supka, A. R. et al. AFLOWπ: a minimalist approach to high-throughput ab initio calculations including the generation of tight-binding hamiltonians. Comput. Mater. Sci. 136, 76–84 (2017).
47. Toher, C. et al. High-throughput computational screening of thermal conductivity
National Institute of Materials Science, Materials Information Station, SuperCon, ductivity, Debye temperature, and Grüneisen parameter using a quasiharmonic http://supercon.nims.go.jp/index_en.html (2011). Debye model. Phys. Rev. B 90, 174107 (2014). 19. Curtarolo, S. et al. AFLOWLIB.ORG: a distributed materials properties repository 48. Perim, E. et al. Spectral descriptors for bulk metallic glasses based on the from high-throughput ab initio calculations. Comput. Mater. Sci. 58, 227–235 thermodynamics of competing crystalline phases. Nat. Commun. 7, 12315 (2012). (2016). 20. Taylor, R. H. et al. A RESTful API for exchanging materials data in the AFLOWLIB. 49. Toher, C. et al. Combining the AFLOW GIBBS and Elastic Libraries to efficiently org consortium. Comput. Mater. Sci. 93, 178–192 (2014). and robustly screen thermomechanical properties of solids. Phys. Rev. Mater. 1, 21. Calderon, C. E. et al. The AFLOW standard for high-throughput materials science 015401 (2017). calculations. Comput. Mater. Sci. 108 Part A, 233–238 (2015). 50. van Roekeghem, A., Carrete, J., Oses, C., Curtarolo, S. & Mingo, N. High-throughput 22. Rose, F. et al. AFLUX: the LUX materials search API for the AFLOW data reposi- computation of thermal conductivity of high-temperature solid phases: the case tories. Comput. Mater. Sci. 137, 362–370 (2017). of oxide and fluoride perovskites. Phys. Rev. X 6, 041061 (2016). 23. Villars, P. & Phillips, J. C. Quantum structural diagrams and high-T super- 51. Bishop, C. Pattern Recognition and Machine Learning. (Springer-Verlag, NY, 2006). conductivity. Phys. Rev. B 37, 2345–2348 (1988). 52. Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data 24. Rabe, K. M., Phillips, J. C., Villars, P. & Brown, I. D. Global multinary structural Mining, Inference, and Prediction. (Springer-Verlag, NY, 2001). chemistry of stable quasicrystals, high-T ferroelectrics, and high-T super- 53. Breiman, L. Random forests. Mach. Learn. 
45,5–32 (2001). C c conductors. Phys. Rev. B 45, 7650–7676 (1992). 54. Caruana, R. & Niculescu-Mizil, A. An Empirical Comparison of Supervised Learning 25. Isayev, O. et al. Materials cartography: representing and mining materials space Algorithms. In Proceedings of the 23rd International Conference on Machine using structural and electronic fingerprints. Chem. Mater. 27, 735–743 (2015). Learning, ICML ’06, 161–168 (ACM, New York, NY, 2006). https://doi.org/10.1145/ 26. Ling J., Hutchinson M., Antono E., Paradiso S., and Meredig B. High-dimensional 1143844.1143865. materials and process optimization using data-driven experimental design with 55. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic well-calibrated uncertainty estimates. Integr. Mater. Manuf. Innov. 6, 207–217 minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002). (2017). 56. Maxwell, E. Isotope effect in the superconductivity of mercury. Phys. Rev. 78, 27. Hirsch, J. E. Correlations between normal-state properties and superconductivity. 477–477 (1950). Phys. Rev. B 55, 9007–9024 (1997). 57. Reynolds, C. A., Serin, B., Wright, W. H. & Nesbitt, L. B. Superconductivity of 28. Owolabi, T. O., Akande, K. O. & Olatunji, S. O. Estimation of superconducting isotopes of mercury. Phys. Rev. 78, 487–487 (1950). transition temperature T for superconductors of the doped MgB system from 58. Reynolds, C. A., Serin, B. & Nesbitt, L. B. The isotope effect in superconductivity. I. C 2 the crystal lattice parameters using support vector regression. J. Supercond. Nov. Mercury. Phys. Rev. 84, 691–694 (1951). Magn. 28,75–81 (2015). 59. Kasahara, Y., Kuroki, K., Yamanaka, S. & Taguchi, Y. Unconventional super- 29. Ziatdinov, M. et al. Deep data mining in a real space: separation of intertwined conductivity in electron-doped layered metal nitride halides MNX (M = Ti, Zr, Hf; electronic responses in a lightly doped BaFe As . Nanotechnology 27, 475706 (2016). X = Cl, Br, I). 
Phys. C. 514, 354–367 (2015). 2 2 30. Klintenberg, M. & Eriksson, O. Possible high-temperature superconductors pre- 60. Yin, Z. P., Kutepov, A. & Kotliar, G. Correlation-enhanced electron-phonon dicted from electronic structure and data-filtering algorithms. Comput. Mater. Sci. coupling: applications of GW and screened hybrid functional to bismuthates, 67, 282–286 (2013). chloronitrides, and other high-T superconductors. Phys. Rev. X 3, 021011 31. Owolabi, T. O., Akande, K. O. & Olatunji, S. O. Prediction of superconducting (2013). transition temperatures for Fe-based superconductors using support vector 61. Matthias, B. T. Empirical relation between superconductivity and the number of machine. Adv. Phys. Theor. Appl. 35,12–26 (2014). valence electrons per atom. Phys. Rev. 97,74–76 (1955). 32. Norman, M. R. Materials design for new superconductors. Rep. Prog. Phys. 79, 62. Bocarsly, J. D. et al. A simple computational proxy for screening magnetocaloric 074502 (2016). compounds. Chem. Mater. 29, 1613–1622 (2017). 33. Kopnin, N. B., Heikkilä, T. T. & Volovik, G. E. High-temperature surface super- 63. Seibel, E. M. et al. Structure and magnetic properties of the α-NaFeO -type conductivity in topological flat-band systems. Phys. Rev. B 83, 220503 (2011). honeycomb compound Na Ni BiO . Inorg. Chem. 52, 13605–13611 (2013). 3 2 6 34. Peotta, S. & Törmä, P. Superfluidity in topologically nontrivial flat bands. Nat. 64. Labbé, J., Barišić, S. & Friedel, J. Strong-coupling superconductivity in V X type of Commun. 6, 8944 (2015). compounds. Phys. Rev. Lett. 19, 1039–1041 (1967). 35. Hosono, H. et al. Exploration of new superconductors and functional materials, 65. Hirsch, J. E. & Scalapino, D. J. Enhanced superconductivity in quasi two- and fabrication of superconducting tapes and wires of iron pnictides. Sci. Technol. dimensional systems. Phys. Rev. Lett. 56, 2732–2735 (1986). Adv. Mater. 16, 033503 (2015). 66. Dzyaloshinskiĭ, I. E. 
Maximal increase of the superconducting transition 36. Kohn, W. & Luttinger, J. M. New mechanism for superconductivity. Phys. Rev. Lett. temperature due to the presence of van’t Hoff singularities. JETP Lett. 46, 118 15, 524–526 (1965). (1987). 37. Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine 67. Yazici, D., Jeon, I., White, B. D. & Maple, M. B. Superconductivity in layered BiS - learning framework for predicting properties of inorganic materials. NPJ Comput. based compounds. Phys. C. 514, 218–236 (2015). Mater. 2, 16028 (2016). 68. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and 38. Setyawan, W. & Curtarolo, S. High-throughput electronic band structure calcu- IPython (O’Reilly Media, 2012). lations: challenges and tools. Comput. Mater. Sci. 49, 299–312 (2010). 69. Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. Magpie Software, https:// 39. Yang, K., Oses, C. & Curtarolo, S. Modeling off-stoichiometry materials with a high- bitbucket.org/wolverton/magpie (2016). https://doi.org/10.1038/npjcompumats. throughput ab-initio approach. Chem. Mater. 28, 6484–6492 (2016). 2016.28 40. Levy, O., Jahnátek, M., Chepulskii, R. V., Hart, G. L. W. & Curtarolo, S. Ordered 70. Pedregosa, F. et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. structures in rhenium binary alloys from first-principles calculations. J. Am. Chem. 12, 2825–2830 (2011). Soc. 133, 158–163 (2011). Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2018) 29 Machine learning modeling of superconducting critical V Stanev et al. 
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

© The Author(s) 2018

npj Computational Materials (2018) 29. Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences.

Machine learning modeling of superconducting critical temperature



Publisher: Springer Journals
Copyright: Copyright © 2018 by The Author(s)
Subject: Materials Science; Materials Science, general; Characterization and Evaluation of Materials; Mathematical and Computational Engineering; Theoretical, Mathematical and Computational Physics; Computational Intelligence; Mathematical Modeling and Industrial Mathematics
eISSN: 2057-3960
DOI: 10.1038/s41524-018-0085-8

Abstract

Valentin Stanev (1,2), Corey Oses (3,4), A. Gilad Kusne (1,5), Efrain Rodriguez (2,6), Johnpierre Paglione (2,7), Stefano Curtarolo (3,4,8) and Ichiro Takeuchi (1,2)

Superconductivity has been the focus of enormous research effort since its discovery more than a century ago. Yet, some features of this unique phenomenon remain poorly understood; prime among these is the connection between superconductivity and the chemical/structural properties of materials. To bridge the gap, several machine learning schemes are developed herein to model the critical temperatures (T_c) of the 12,000+ known superconductors available via the SuperCon database. Materials are first divided into two classes based on their T_c values, above and below 10 K, and a classification model predicting this label is trained. The model uses coarse-grained features based only on the chemical compositions. It shows strong predictive power, with out-of-sample accuracy of about 92%. Separate regression models are developed to predict the values of T_c for cuprate, iron-based, and low-T_c compounds. These models also demonstrate good performance, with learned predictors offering potential insights into the mechanisms behind superconductivity in different families of materials. To improve the accuracy and interpretability of these models, new features are incorporated using materials data from the AFLOW Online Repositories. Finally, the classification and regression models are combined into a single integrated pipeline and employed to search the entire Inorganic Crystallographic Structure Database (ICSD) for potential new superconductors. We identify >30 non-cuprate and non-iron-based oxides as candidate materials.
npj Computational Materials (2018) 4:29; doi:10.1038/s41524-018-0085-8

(1) Department of Materials Science and Engineering, University of Maryland, College Park, MD 20742-4111, USA; (2) Center for Nanophysics and Advanced Materials, University of Maryland, College Park, MD 20742, USA; (3) Department of Mechanical Engineering and Materials Science, Duke University, Durham, NC 27708, USA; (4) Center for Materials Genomics, Duke University, Durham, NC 27708, USA; (5) National Institute of Standards and Technology, Gaithersburg, MD 20899, USA; (6) Department of Chemistry and Biochemistry, University of Maryland, College Park, MD 20742, USA; (7) Department of Physics, University of Maryland, College Park, MD 20742, USA; (8) Fritz-Haber-Institut der Max-Planck-Gesellschaft, 14195 Berlin-Dahlem, Germany. Correspondence: Valentin Stanev (vstanev@umd.edu). Received: 22 November 2017; Revised: 12 May 2018; Accepted: 17 May 2018.

INTRODUCTION
Superconductivity, despite being the subject of intense physics, chemistry, and materials science research for more than a century, remains among one of the most puzzling scientific topics. It is an intrinsically quantum phenomenon caused by a finite attraction between paired electrons, with unique properties including zero DC resistivity, Meissner, and Josephson effects, and with an ever-growing list of current and potential applications. There is even a profound connection between phenomena in the superconducting state and the Higgs mechanism in particle physics. However, understanding the relationship between superconductivity and materials' chemistry and structure presents significant theoretical and experimental challenges. In particular, despite focused research efforts in the last 30 years, the mechanisms responsible for high-temperature superconductivity in the cuprate and iron-based families remain elusive [3,4].

Recent developments, however, allow a different approach to investigate what ultimately determines the superconducting critical temperatures (T_c) of materials. Extensive databases covering various measured and calculated materials properties have been created over the years [5–9]. The sheer quantity of accessible information also makes possible, and even necessary, the use of data-driven approaches, e.g., statistical and machine learning (ML) methods [10–13]. Such algorithms can be developed/trained on the variables collected in these databases, and employed to predict macroscopic properties, such as the melting temperatures of binary compounds, the likely crystal structure at a given composition [15], band gap energies [16,17], and density of states of certain classes of materials [16].

Taking advantage of this immense increase of readily accessible and potentially relevant information, we develop several ML methods modeling T_c from the complete list of reported (inorganic) superconductors. In their simplest form, these methods take as input a number of predictors generated from the elemental composition of each material. Models developed with these basic features are surprisingly accurate, despite lacking information on relevant properties, such as space group, electronic structure, and phonon energies. To further improve the predictive power of the models, as well as the ability to extract useful information out of them, another set of features is constructed based on crystallographic and electronic information taken from the AFLOW Online Repositories [19–22].

Application of statistical methods in the context of superconductivity began in the early eighties with simple clustering methods [23,24]. In particular, three "golden" descriptors confine the 60 known (at the time) superconductors with T_c > 10 K to three small islands in space: the averaged valence-electron numbers, orbital radii differences, and metallic electronegativity differences. Conversely, about 600 other superconductors with T_c < 10 K appear randomly dispersed in the same space. These descriptors were selected heuristically due to their success in classifying binary/ternary structures and predicting stable/metastable ternary quasicrystals. Recently, an investigation stumbled on this clustering problem again by observing a threshold closer to log T_c^thres ≃ 1.3 (T_c^thres = 20 K). Instead of a heuristic approach, random forests and simplex fragments were leveraged on the structural/electronic properties data from the AFLOW Online Repositories to find the optimum clustering descriptors. A classification model was developed showing good performance. Separately, a sequential learning framework was evaluated on superconducting materials, exposing the limitations of relying on random-guess (trial-and-error) approaches for breakthrough discoveries. Subsequently, this study also highlights the impact machine learning can have on this particular field. In another early work, statistical methods were used to find correlations between normal-state properties and T_c of the metallic elements in the first six rows of the periodic table. Other contemporary works hone in on specific materials [28,29] and families of superconductors [30,31] (see also ref. [32]).

Whereas previous investigations explored several hundred compounds at most, this work considers >16,000 different compositions. These are extracted from the SuperCon database, which contains an exhaustive list of superconductors, including many closely related materials varying only by small changes in stoichiometry (doping plays a significant role in optimizing T_c). The order-of-magnitude increase in training data (i) presents crucial subtleties in chemical composition among related compounds, (ii) affords family-specific modeling exposing different superconducting mechanisms, and (iii) enhances model performance overall. It also enables the optimization of several model construction procedures. Large sets of independent variables can be constructed and rigorously filtered by predictive power (rather than selecting them by intuition alone). These advances are crucial to uncovering insights into the emergence/suppression of superconductivity with composition.

As a demonstration of the potential of ML methods in looking for novel superconductors, we combined and applied several models to search for candidates among the roughly 110,000 different compositions contained in the Inorganic Crystallographic Structure Database (ICSD), a large fraction of which have not been tested for superconductivity. The framework highlights 35 compounds with predicted T_c's above 20 K for experimental validation. Of these, some exhibit interesting chemical and structural similarities to cuprate superconductors, demonstrating the ability of the ML models to identify meaningful patterns in the data. In addition, most materials from the list share a peculiar feature in their electronic band structure: one (or more) flat/nearly-flat bands just below the energy of the highest occupied electronic state. The associated large peak in the density of states (infinitely large in the limit of truly flat bands) can lead to strong electronic instability, and has been discussed recently as one possible route to high-temperature superconductivity [33,34].

RESULTS
Data and predictors
The success of any ML method ultimately depends on access to reliable and plentiful data. Superconductivity data used in this work is extracted from the SuperCon database, created and maintained by the Japanese National Institute for Materials Science. It houses information such as the T_c and reporting journal publication for superconducting materials known from experiment. Assembled within it is a uniquely exhaustive list of all reported superconductors, as well as related non-superconducting compounds. As such, SuperCon is the largest database of its kind, and has never before been employed en masse for machine learning modeling.

From SuperCon, we have extracted a list of ~16,400 compounds, of which 4000 have no T_c reported (see Methods section for details). Of these, roughly 5700 compounds are cuprates and 1500 are iron-based (about 35 and 9%, respectively), reflecting the significant research efforts invested in these two families. The remaining set of about 8000 is a mix of various materials, including conventional phonon-driven superconductors (e.g., elemental superconductors, A15 compounds), known unconventional superconductors like the layered nitrides and heavy fermions, and many materials for which the mechanism of superconductivity is still under debate (such as bismuthates and borocarbides). The distribution of materials by T_c for the three groups is shown in Fig. 2a.

Use of this data for the purpose of creating ML models can be problematic. ML models have an intrinsic applicability domain, i.e., predictions are limited to the patterns/trends encountered in the training set. As such, training a model only on superconductors can lead to significant selection bias that may render it ineffective when applied to new materials (N.B., a model suffering from selection bias can still provide valuable statistical information about known superconductors). Even if the model learns to correctly recognize factors promoting superconductivity, it may miss effects that strongly inhibit it. To mitigate the effect, we incorporate about 300 materials found by H. Hosono's group not to display superconductivity. However, the presence of non-superconducting materials, along with those without T_c reported in SuperCon, leads to a conceptual problem. Surely, some of these compounds emerge as non-superconducting "end-members" from doping/pressure studies, indicating no superconducting transition was observed despite some efforts to find one. However, a transition may still exist, albeit at experimentally difficult to reach or altogether inaccessible temperatures (for most practical purposes, below 10 mK). (There are theoretical arguments for this: according to the Kohn–Luttinger theorem, a superconducting instability should be present as T → 0 in any fermionic metallic system with Coulomb interactions.) This presents a conundrum: ignoring compounds with no reported T_c disregards a potentially important part of the dataset, while assuming T_c = 0 K prescribes an inadequate description for (at least some of) these compounds. To circumvent the problem, materials are first partitioned in two groups by their T_c, above and below a threshold temperature (T_sep), for the creation of a classification model. Compounds with no reported critical temperature can be classified in the "below-T_sep" group without the need to specify a T_c value (or assume it is zero). The "above-T_sep" bin also enables the development of a regression model for ln(T_c), without problems arising in the T_c → 0 limit.

For most materials, the SuperCon database provides only the chemical composition and T_c. To convert this information into meaningful features/predictors (used interchangeably), we employ the Materials Agnostic Platform for Informatics and Exploration (Magpie). Magpie computes a set of attributes for each material, including elemental property statistics like the mean and the standard deviation of 22 different elemental properties (e.g., period/group on the periodic table, atomic number, atomic radii, melting temperature), as well as electronic structure attributes, such as the average fraction of electrons from the s, p, d, and f valence shells among all elements present.

The application of Magpie predictors, though appearing to lack a priori justification, expands upon past clustering approaches by Villars and Rabe [23,24]. They show that, in the space of a few judiciously chosen heuristic predictors, materials separate and cluster according to their crystal structure and even complex properties, such as high-temperature ferroelectricity and superconductivity. Similar to these features, Magpie predictors capture significant chemical information, which plays a decisive role in determining structural and physical properties of materials.

Fig. 1 Schematic of the random forest ML approach. Example of a single decision tree used to classify materials depending on whether T_c is above or below 10 K. A tree can have many levels, but only the top three are shown. The decision rules leading to each subset are written inside individual rectangles. The subset population percentage is given by "samples", and the node color/shade represents the degree of separation, i.e., dark blue/orange illustrates a high proportion of T_c > 10 K / T_c < 10 K materials (the exact value is given by "proportion"). A random forest consists of a large number (hundreds or thousands) of such individual trees.
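To make the elemental-property statistics concrete, a minimal sketch is given below. The two-property table and the helper function are hypothetical stand-ins, not the actual Magpie feature set (which covers statistics of 22 elemental properties); a composition is assumed to be given as a mapping from element symbol to stoichiometric amount.

```python
# Sketch of Magpie-style composition features: stoichiometry-weighted
# mean and standard deviation of elemental properties.
import math

# Hypothetical elemental data: (atomic number, covalent radius in pm).
ELEMENTS = {
    "Mg": (12, 141),
    "B": (5, 84),
    "Nb": (41, 164),
    "Sn": (50, 139),
}

def composition_features(composition):
    """composition: dict element -> stoichiometric amount, e.g. MgB2."""
    total = sum(composition.values())
    weights = [n / total for n in composition.values()]
    props = [ELEMENTS[el] for el in composition]
    feats = {}
    for i, name in enumerate(("atomic_number", "covalent_radius")):
        values = [p[i] for p in props]
        mean = sum(w * v for w, v in zip(weights, values))
        var = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))
        feats[f"mean_{name}"] = mean
        feats[f"std_{name}"] = math.sqrt(var)
    return feats

print(composition_features({"Mg": 1, "B": 2}))  # e.g. mean_atomic_number = 22/3
```

The same pattern extends to any tabulated elemental property, which is why purely composition-based predictors already carry substantial chemical information.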
A c c random forest consists of a large number—could be hundreds or thousands—of such individual trees Despite the success of Magpie predictors in modeling materials methods (e.g., linear regression), it does not make assumptions properties, interpreting their connection to superconductivity about the functional form of the relationship between the presents a serious challenge. They do not encode (at least directly) predictors and the target variable (e.g., linear, exponential or many important properties, particularly those pertinent to super- some other a priori fixed function). Second, random forests are conductivity. Incorporating features like lattice type and density of quite tolerant to heterogeneity in the training data. It can handle states would undoubtedly lead to significantly more powerful and both numerical and categorical data which, furthermore, does not interpretable models. Since such information is not generally need extensive and potentially dangerous preprocessing, such as available in SuperCon, we employ data from the AFLOW Online scaling or normalization. Even the presence of strongly correlated 19–22 Repositories. The materials database houses nearly 170 predictors is not a problem for model construction (unlike many million properties calculated with the software package other ML algorithms). Another significant advantage of this 6,38–46 AFLOW. It contains information for the vast majority of method is that, by combining information from individual trees, compounds in the ICSD. Although, the AFLOW Online Reposi- it can estimate the importance of each predictor, thus making the tories contain calculated properties, the DFT results have been model more interpretable. However, unlike model construction, 17,25,47–50 extensively validated with observed properties. determination of predictor importance is complicated by the Unfortunately, only a small subset of materials in SuperCon presence of correlated features. 
To avoid this, standard feature overlaps with those in the ICSD: about 800 with finite T and <600 selection procedures are employed along with a rigorous are contained within AFLOW. For these, a set of 26 predictors are predictor elimination scheme (based on their strength and incorporated from the AFLOW Online Repositories, including correlation with others). Overall, these methods reduce the structural/chemical information like the lattice type, space group, complexity of the models and improve our ability to interpret volume of the unit cell, density, ratios of the lattice parameters, them. Bader charges and volumes, and formation energy (see Methods section for details). In addition, electronic properties are con- Classification models sidered, including the density of states near the Fermi level as As a first step in applying ML methods to the dataset, a sequence calculated by AFLOW. Previous investigations exposed limitations of classification models are created, each designed to separate in applying ML methods to a similar dataset in isolation. Instead, materials into two distinct groups depending on whether T is a framework is presented here for combining models built on above or below some predetermined value. The temperature that Magpie descriptors (large sampling, but features limited to separates the two groups (T ) is treated as an adjustable sep compositional data) and AFLOW features (small sampling, but parameter of the model, though some physical considerations diverse and pertinent features). should guide its choice as well. Classification ultimately allows Once we have a list of relevant predictors, various ML models 51,52 compounds with no reported T to be used in the training set by can be applied to the data. All ML algorithms in this work are c including them in the below-T bin. Although discretizing variants of the random forest method. 
Fundamentally, this sep continuous variables is not generally recommended, in this case approach combines many individual decision trees, where each the benefits of including compounds without T outweigh the tree is a non-parametric supervised learning method used for c modeling either categorical or numerical variables (i.e., classifica- potential information loss. tion or regression modeling). A tree predicts the value of a target In order to choose the optimal value of T , a series of random sep variable by learning simple decision rules inferred from the forest models are trained with different threshold temperatures available features (see Fig. 1 for an example). separating the two classes. Since setting T too low or too high sep Random forest is one of the most powerful, versatile, and widely creates strongly imbalanced classes (with many more instances in used ML methods. There are several advantages that make it one group), it is important to compare the models using several especially suitable for this problem. First, it can learn complicated different metrics. Focusing only on the accuracy (count of non-linear dependencies from the data. Unlike many other correctly classified instances) can lead to deceptive results. Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2018) 29 Machine learning modeling of superconducting critical V Stanev et al. Hypothetically, if 95% of the observations in the dataset are in the composition of the training and test sets, and their exact values below-T group, simply classifying all materials as such would can vary.) sep yield a high accuracy (95%), while being trivial in any other sense. 
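The threshold scan just described can be sketched as follows. The data here is a synthetic stand-in (random features and log-normal "critical temperatures", not the Magpie feature matrix), and the metric functions come from scikit-learn, which the authors cite [70]:

```python
# Sketch of the T_sep scan: for each candidate threshold, build the binary
# above-/below-T_sep label, train a random forest, and compare metrics.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))            # stand-in composition features
tc = np.exp(rng.normal(1.5, 1.2, 2000))   # stand-in critical temperatures (K)

for t_sep in (1.0, 10.0, 20.0):           # candidate thresholds (K)
    y = (tc > t_sep).astype(int)          # 1 = above-T_sep, 0 = below
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print(f"T_sep={t_sep:5.1f} K  accuracy={accuracy_score(y_te, pred):.2f}"
          f"  F1={f1_score(y_te, pred, zero_division=0):.2f}")
```

Because a high threshold makes the classes strongly imbalanced, accuracy alone looks deceptively good at large T_sep; this is exactly why several metrics are tracked side by side.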
There are more sophisticated techniques to deal with severely imbalanced datasets, like undersampling the majority class or generating synthetic data points for the minority class (see, for example, ref. ). To avoid this potential pitfall, three other standard metrics for classification are considered: precision, recall, and F_1 score. They are defined using the values tp, tn, fp, and fn for the count of true/false positive/negative predictions of the model:

accuracy ≡ (tp + tn) / (tp + tn + fp + fn),    (1)

precision ≡ tp / (tp + fp),    (2)

recall ≡ tp / (tp + fn),    (3)

F_1 ≡ 2 × (precision × recall) / (precision + recall),    (4)

where positive/negative refers to above-T_sep/below-T_sep. The accuracy of a classifier is the total proportion of correctly classified materials, while the precision measures the proportion of correctly classified above-T_sep superconductors out of all predicted above-T_sep. The recall is the proportion of correctly classified above-T_sep materials out of all truly above-T_sep compounds. While the precision measures the probability that a material selected by the model actually has T_c > T_sep, the recall reports how sensitive the model is to above-T_sep materials. Maximizing the precision or recall would require some compromise with the other: a model that labels all materials as above-T_sep would have perfect recall but dismal precision. To quantify the trade-off between recall and precision, their harmonic mean (the F_1 score) is widely used to measure the performance of a classification model. With the exception of accuracy, these metrics are not symmetric with respect to the exchange of positive and negative labels.

The most important factors that determine the model's performance are the size of the available dataset and the number of meaningful predictors. As can be seen in Fig. 2c, all metrics improve significantly with the increase of the training set size. The effect is most dramatic for sizes between several hundred and a few thousand instances, but there is no obvious saturation even for the largest available datasets. This validates efforts herein to incorporate as much relevant data as possible into model training. The number of predictors is another very important model parameter. In Fig. 2d, the accuracy is calculated at each step of the backward feature elimination process. It quickly saturates when the number of predictors reaches 10. In fact, a model using only the five most informative predictors, selected out of the full list of 145, achieves almost 90% accuracy.

To gain some understanding of what the model has learned, an analysis of the chosen predictors is needed. In the random forest method, features can be ordered by their importance, quantified via the so-called Gini importance or "mean decrease in impurity" (refs. 51,52). For a given feature, it is the sum of the Gini impurity (calculated as Σ_i p_i(1 − p_i), where p_i is the probability of a randomly chosen data point from a given decision tree leaf to be in class i) over the number of splits that include the feature, weighted by the number of samples it splits, and averaged over the entire forest. Due to the nature of the algorithm, the closer to the top of the tree a predictor is used, the greater the number of predictions it impacts.

Although correlations between predictors do not affect the model's ability to learn, they can distort importance estimates. For example, a material property with a strong effect on T_c can be shared among several correlated predictors. Since the model can access the same information through any of these variables, their relative importances are diluted across the group. To reduce this effect and limit the list of predictors to a manageable size, the backward feature elimination method is employed.
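Eqs. (1)-(4) above reduce to a few lines of plain Python; the degenerate "label everything above-T_sep" example from the text is reproduced at the end:

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (1)-(4): metrics from true/false positive/negative counts
    (positive = above-T_sep, negative = below-T_sep)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# A model that flags every one of 1000 materials as above-T_sep
# when only 50 actually are: perfect recall, dismal precision.
acc, prec, rec, f1 = classification_metrics(tp=50, tn=0, fp=950, fn=0)
print(acc, prec, rec, f1)  # recall = 1.0, precision = 0.05
```

The counts here are illustrative; in practice they come from comparing a model's predictions with the test-set labels.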
The process begins with a model constructed with the full list of predictors, and iteratively removes the least significant one, rebuilding the model and recalculating importances with every iteration. (This iterative procedure is necessary since the ordering of the predictors by importance can change at each step.) Predictors are removed until the overall accuracy of the model drops by 2%, at which point there are only five left. Furthermore, two of these predictors are strongly correlated with each other, and we remove the less important one. This has a negligible impact on the model performance, yielding four predictors total (see Table 1) with an above 90% accuracy score, only slightly worse than the full model.

For a realistic estimate of the performance of each model, the dataset is randomly split (85%/15%) into training and test subsets. The training set is employed to fit the model, which is then applied to the test set for subsequent benchmarking. The aforementioned metrics (Eqs. (1)–(4)) calculated on the test set provide an unbiased estimate of how well the model is expected to generalize to a new (but similar) dataset. With the random forest method, similar estimates can be obtained intrinsically at the training stage. Since each tree is trained only on a bootstrapped subset of the data, the remaining subset can be used as an internal test set. These two methods for quantifying model performance usually yield very similar results.

With the procedure in place, the models' metrics are evaluated for a range of T_sep and illustrated in Fig. 2b. The accuracy increases as T_sep goes from 1 to 40 K, and the proportion of above-T_sep compounds drops from above 70% to about 15%, while the recall and F_1 score generally decrease. The region between 5 and 15 K is especially appealing in (nearly) maximizing all benchmarking metrics while balancing the sizes of the bins. In fact, setting T_sep = 10 K is a particularly convenient choice. It is also the temperature used in refs. 23,24 to separate the two classes, as it is just above the highest T_c of all elements and pseudoelemental materials (solid solutions whose range of composition includes a pure element). Here, the proportion of above-T_sep materials is ~38% and the accuracy is about 92%, i.e., the model can correctly classify nine out of ten materials, much better than random guessing. The recall, quantifying how well all above-T_sep compounds are labeled and, thus, the most important metric when searching for new superconducting materials, is even higher. (Note that the models' metrics also depend on random factors such as the composition of the training and test sets, and their exact values can vary.)

Scatter plots of the pairs of the most important predictors are shown in Fig. 3, where blue/red denotes whether the material is in the below-T_sep/above-T_sep class. Figure 3a shows a scatter plot of 3000 compounds in the space spanned by the standard deviations of the column numbers and electronegativities calculated over the elemental values. Superconductors with T_c > 10 K tend to cluster in the upper-right corner of the plot and in a relatively thin elongated region extending to the left of it. In fact, the points in the upper-right corner represent mostly cuprate materials, which with their complicated compositions and large number of elements are likely to have high standard deviations in these variables. Figure 3b shows the same compounds projected in the space of the standard deviations of the melting temperatures and the averages of the atomic weights of the elements forming each compound. The above-T_sep materials tend to cluster in areas with lower mean atomic weights, not a surprising result given the role of phonons in conventional superconductivity.
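The backward elimination loop described earlier in this section can be sketched as follows. scikit-learn is assumed, out-of-bag accuracy stands in for the test-set accuracy, the 2% stopping rule is exposed as a parameter, and the data are synthetic (only the first two of six features carry signal):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def backward_feature_elimination(X, y, names, max_accuracy_drop=0.02):
    """Iteratively drop the least important predictor, refitting and
    re-ranking at every step (the importance ordering can change), and
    stop once out-of-bag accuracy falls more than max_accuracy_drop
    below that of the full model."""
    keep = list(range(X.shape[1]))

    def fit(cols):
        m = RandomForestClassifier(n_estimators=100, oob_score=True,
                                   random_state=0)
        m.fit(X[:, cols], y)
        return m

    model = fit(keep)
    baseline = model.oob_score_
    while len(keep) > 1:
        worst = int(np.argmin(model.feature_importances_))
        trial = keep[:worst] + keep[worst + 1:]
        trial_model = fit(trial)
        if trial_model.oob_score_ < baseline - max_accuracy_drop:
            break                      # accuracy dropped too far; stop
        keep, model = trial, trial_model
    return [names[i] for i in keep], model

rng = np.random.default_rng(1)
X = rng.normal(size=(1500, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only the first two columns matter
kept, model = backward_feature_elimination(X, y, [f"f{i}" for i in range(6)])
print(kept)
```

On this toy problem the loop discards the four noise features and halts when removing either informative feature would cost more than 2% accuracy.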
For comparison, we create another classifier based on the average number of valence electrons, metallic electronegativity differences, and orbital radii differences, i.e., the predictors used in refs. 23,24 to cluster materials with T_c > 10 K. A classifier built only with these three predictors is less accurate than both the full and the truncated models presented herein, but comes quite close: the full model has about 3% higher accuracy and F_1 score, while the truncated model with four predictors is less than 2% more accurate. The rather small (albeit not insignificant) differences demonstrate that even on the scale of the entire SuperCon dataset, the predictors used by Villars and Rabe capture much of the relevant chemical information for superconductivity.

Fig. 2 SuperCon dataset and classification model performance. a Histogram of materials categorized by T_c (bin size is 2 K; only those with finite T_c are counted). Blue, green, and red denote low-T_c, iron-based, and cuprate superconductors, respectively. In the inset: histogram of materials categorized by ln(T_c), restricted to those with T_c > 10 K. b Performance of different classification models as a function of the threshold temperature (T_sep) that separates materials in two classes by T_c. Performance is measured by accuracy (gray), precision (red), recall (blue), and F_1 score (purple). The scores are calculated from predictions on an independent test set, i.e., one separate from the dataset used to train the model. In the inset: the dashed red curve gives the proportion of materials in the above-T_sep set. c Accuracy, precision, recall, and F_1 score as a function of the size of the training set with a fixed test set. d Accuracy, precision, recall, and F_1 score as a function of the number of predictors.

Fig. 3 Scatter plots of 3000 superconductors in the space of the four most important classification predictors. Blue/red represent below-T_sep/above-T_sep materials, where T_sep = 10 K. a Feature space of the first and second most important predictors: standard deviations of the column numbers and electronegativities (calculated over the values for the constituent elements in each compound). b Feature space of the third and fourth most important predictors: standard deviation of the elemental melting temperatures and average of the atomic weights.

Regression models

After constructing a successful classification model, we now move to the more difficult challenge of predicting T_c. Creating a regression model may enable better understanding of the factors controlling T_c of known superconductors, while also serving as an organic part of a system for identifying potential new ones. Leveraging the same set of elemental predictors as the classification model, several regression models are presented focusing on materials with T_c > 10 K. This approach avoids the problem of materials with no reported T_c, with the assumption that, if they were to exhibit superconductivity at all, their critical temperature would be below 10 K.

Table 1. The most relevant predictors and their importances for the classification and general regression models

Rank | Classification                 | Regression (general; T_c > 10 K)
1    | std(column number), 0.26       | avg(number of unfilled orbitals), 0.26
2    | std(electronegativity), 0.26   | std(ground state volume), 0.18
3    | std(melting temperature), 0.23 | std(space group number), 0.17
4    | avg(atomic weight), 0.24       | avg(number of d unfilled orbitals), 0.17
5    | -                              | std(number of d valence electrons), 0.12
6    | -                              | avg(melting temperature), 0.10

avg(x) and std(x) denote the composition-weighted average and standard deviation, respectively, calculated over the vector of elemental values for each compound. For the classification model, all predictor importances are quite close.
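Importances of the kind listed in Table 1 are the normalized mean-decrease-in-impurity values that a random forest implementation exposes. In scikit-learn (assumed here; the data are synthetic, and Table 1's predictor names are reused only as labels) they are read off `feature_importances_`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 4))
y = (X[:, 2] > 0).astype(int)       # only the third predictor carries signal
names = ["std(column number)", "std(electronegativity)",
         "std(melting temperature)", "avg(atomic weight)"]

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# feature_importances_ holds the normalized Gini importances
# ("mean decrease in impurity") described in the text
for name, imp in sorted(zip(names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```

With real data, correlated predictors would dilute these values across the group, which is exactly why the elimination step above is paired with an importance re-ranking at each iteration.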
It also enables the substitution of T_c with ln(T_c) as the target variable (which is problematic as T_c → 0), and thus addresses the problem of the uneven distribution of materials along the T_c-axis (Fig. 2a). Using ln(T_c) creates a more uniform distribution (Fig. 2a inset), and is also considered a best practice when the range of a target variable covers more than one order of magnitude (as in the case of T_c). Following this transformation, the dataset is parsed randomly (85%/15%) into training and test subsets (as was done for the classification model).

Present within the dataset are distinct families of superconductors with different driving mechanisms for superconductivity, including cuprate and iron-based high-temperature superconductors, with all others denoted "low-T_c" for brevity (no specific mechanism is implied for this group). Surprisingly, a single regression model does reasonably well among the different families: benchmarked on the test set, the model achieves R^2 ≈ 0.88 (Fig. 4a). This suggests that the random forest algorithm is flexible and powerful enough to automatically separate the compounds into groups and create group-specific branches with distinct predictors (no explicit group labels were used during training and testing).

Fig. 4 Benchmarking of regression models predicting ln(T_c). a Predicted vs. measured ln(T_c) for the general regression model. The test set comprises a mix of low-T_c, iron-based, and cuprate superconductors with T_c > 10 K. With an R^2 of about 0.88, this one model can accurately predict T_c for materials in different superconducting groups. b, c Predictions of the regression model trained solely on low-T_c compounds for test sets containing cuprate and iron-based materials. d, e Predictions of the regression model trained solely on cuprates for test sets containing low-T_c and iron-based superconductors. Models trained on a single group have no predictive power for materials from other groups.

As validation, three separate models are trained only on a specific family, namely the low-T_c, cuprate, and iron-based superconductors, respectively. Benchmarking on mixed-family test sets, the models perform well on compounds belonging to their training set family while demonstrating no predictive power on the others. Figure 4b–e illustrates a cross-section of this comparison. Specifically, the model trained on low-T_c compounds dramatically underestimates the T_c of both high-temperature superconducting families (Fig. 4b, c), even though this test set only contains compounds with T_c < 40 K.
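The regression setup just described (ln T_c target, 85%/15% split, R^2 benchmark) can be sketched as follows; scikit-learn is assumed, and synthetic T_c values replace the SuperCon data, so the printed R^2 is illustrative only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 5))
# Fake critical temperatures spanning orders of magnitude (always > 0)
tc = np.exp(1.0 + 0.8 * X[:, 0] + 0.1 * rng.normal(size=800))

# Regress on ln(Tc) rather than Tc: the target distribution becomes far
# more uniform when the raw values span more than one order of magnitude
y = np.log(tc)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.15, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print(reg.score(X_te, y_te))   # R^2 on the held-out 15%
```

Predicted critical temperatures are recovered with `np.exp(reg.predict(...))`, which also guarantees positivity.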
Conversely, the model trained on the cuprates tends to overestimate the T_c of low-T_c (Fig. 4d) and iron-based (Fig. 4e) superconductors. This is a clear indication that superconductors from these groups have different factors determining their T_c.

Interestingly, the family-specific models do not perform better than the general regression containing all the data points: R^2 for the low-T_c materials is about 0.85, for cuprates is just below 0.8, and for iron-based compounds is about 0.74. In fact, it is a purely geometric effect that the combined model has the highest R^2. Each group of superconductors contributes mostly to a distinct T_c range, and, as a result, the combined regression is better determined over a longer temperature interval.

In order to reduce the number of predictors and increase the interpretability of these models without significant detriment to their performance, a backward feature elimination process is again employed. The procedure is very similar to the one described previously for the classification model, with the only difference being that the reduction is guided by the R^2 of the model, rather than the accuracy (the procedure stops when R^2 drops by 3%).

Table 2. The most significant predictors and their importances for the three material-specific regression models

Rank | Regression (low-T_c)                     | Regression (cuprates)                    | Regression (Fe-based)
1    | frac(d valence electrons), 0.18          | avg(number of unfilled orbitals), 0.22   | std(column number), 0.17
2    | avg(number of d unfilled orbitals), 0.14 | std(number of d valence electrons), 0.13 | avg(ionic character), 0.15
3    | avg(number of valence electrons), 0.13   | frac(d valence electrons), 0.13          | std(Mendeleev number), 0.14
4    | frac(s valence electrons), 0.11          | std(ground state volume), 0.13           | std(covalent radius), 0.14
5    | avg(number of d valence electrons), 0.09 | std(number of valence electrons), 0.10   | max(melting temperature), 0.14
6    | avg(covalent radius), 0.09               | std(row number), 0.08                    | avg(Mendeleev number), 0.14
7    | avg(atomic weight), 0.08                 | ||composition||_2, 0.07                  | ||composition||_2, 0.11
8    | avg(Mendeleev number), 0.07              | std(number of s valence electrons), 0.07 | -
9    | avg(space group number), 0.07            | std(melting temperature), 0.07           | -
10   | avg(number of unfilled orbitals), 0.06   | -                                        | -

avg(x), std(x), max(x), and frac(x) denote the composition-weighted average, standard deviation, maximum, and fraction, respectively, taken over the elemental values for each compound. The l2-norm of a composition is calculated by ||x||_2 = sqrt(Σ_i x_i²), where x_i is the proportion of each element i in the compound.

Fig. 5 Scatter plots of T_c for superconducting materials in the space of significant, family-specific regression predictors. For 4000 "low-T_c" superconductors (i.e., non-cuprate and non-iron-based), T_c is plotted vs. the a average atomic weight, b average covalent radius, and c average number of d valence electrons. The dashed red line in a is ∝ 1/√m_A. Having a low average atomic weight and a low average number of d valence electrons are necessary (but not sufficient) conditions for achieving high T_c in this group. d Scatter plot of T_c for all known superconducting cuprates vs. the mean number of unfilled orbitals. c, d suggest that the values of these predictors lead to hard limits on the maximum achievable T_c.

The most important predictors for the four models (one general and three family-specific), together with their importances, are shown in Tables 1 and 2. Differences in important predictors across the family-specific models reflect the fact that distinct mechanisms are responsible for driving superconductivity among these groups. The list is longest for the low-T_c superconductors, reflecting the eclectic nature of this group. Similar to the general regression model, different branches are likely created for distinct sub-groups. Nevertheless, some important predictors have a straightforward interpretation. As illustrated in Fig. 5a, a low average atomic weight is a necessary (albeit not sufficient) condition for achieving high T_c among the low-T_c group. In fact, the maximum T_c for a given weight roughly follows 1/√m_A. Mass plays a significant role in conventional superconductors through the Debye frequency of phonons, leading to the well-known formula T_c ∝ 1/√m, where m is the ionic mass (see, for example, refs. 56–). Other factors like the density of states are also important, which explains the spread in T_c for a given m_A. Outlier materials clearly above the ∝1/√m_A line include bismuthates and chloronitrates, suggesting the conventional electron-phonon mechanism is not driving superconductivity in these materials. Indeed, chloronitrates exhibit a very weak isotope effect, though some unconventional electron-phonon coupling could still be relevant for superconductivity. Another important feature for low-T_c materials is the average number of valence electrons. This recovers the empirical relation first discovered by Matthias more than 60 years ago. Such findings validate the ability of ML approaches to discover meaningful patterns that encode true physical phenomena.

Similar T_c-vs.-predictor plots reveal more interesting and subtle features. A narrow cluster of materials with T_c > 20 K emerges in the context of the mean covalent radii of compounds (Fig. 5b), another important predictor for low-T_c superconductors. The cluster includes (left-to-right) alkali-doped C60, MgB2-related compounds, and bismuthates. The sector likely characterizes a region of strong covalent bonding and corresponding high-frequency phonon modes that enhance T_c (however, frequencies that are too high become irrelevant for superconductivity). Another interesting relation appears in the context of the average number of d valence electrons: Fig. 5c illustrates a fundamental bound on the T_c of all non-cuprate and non-iron-based superconductors.

A similar limit exists for cuprates based on the average number of unfilled orbitals (Fig. 5d). It appears to be quite rigid: several data points found above it are, on inspection, actually incorrectly recorded entries in the database and were subsequently removed. The connection between T_c and the average number of unfilled orbitals may offer new insight into the mechanism for superconductivity in this family. (The number of unfilled orbitals refers to the electron configuration of the substituent elements before combining to form oxides. For example, Cu has one unfilled orbital ([Ar]4s²3d⁹) and Bi has three ([Xe]4f¹⁴6s²5d¹⁰6p³). These values are averaged per formula unit.) Known trends include higher T_c's for structures that (i) stabilize more than one superconducting Cu–O plane per unit cell and (ii) add more polarizable cations such as Tl⁺ and Hg²⁺ between these planes. The connection reflects these observations, since more copper and oxygen per formula unit leads to a lower average number of unfilled orbitals (one for copper, two for oxygen). Further, the lower-T_c cuprates typically consist of Cu²⁻/Cu³⁻-containing layers stabilized by the addition/substitution of hard cations, such as Ba²⁺ and La³⁺, respectively. These cations have a large number of unfilled orbitals, thus increasing the compound's average. Therefore, the ability of between-sheet cations to contribute charge to the Cu–O planes may indeed be quite important. The more polarizable the A cation, the more electron density it can contribute to the already strongly covalent Cu²⁺–O bond.
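The composition-weighted statistics avg(x) and std(x), and the l2-norm ||x||_2 defined in the Table 2 footnote, can be computed directly. The MgB2 example below uses elemental atomic weights purely for illustration:

```python
import math

def composition_stats(fractions, elemental_values):
    """Composition-weighted avg(x) and std(x) over elemental values,
    as used for the Magpie-style predictors (fractions sum to 1)."""
    avg = sum(f * v for f, v in zip(fractions, elemental_values))
    var = sum(f * (v - avg) ** 2 for f, v in zip(fractions, elemental_values))
    return avg, math.sqrt(var)

def composition_l2_norm(fractions):
    """||x||_2 from the Table 2 footnote: sqrt of the sum of squared
    elemental proportions x_i."""
    return math.sqrt(sum(f * f for f in fractions))

# MgB2: atomic fractions 1/3 Mg and 2/3 B; elemental values here are
# atomic weights (Mg ~24.305, B ~10.811)
fracs = [1 / 3, 2 / 3]
avg_w, std_w = composition_stats(fracs, [24.305, 10.811])
print(avg_w, std_w, composition_l2_norm(fracs))
```

The same two helpers, applied to different elemental property vectors (column number, electronegativity, melting temperature, ...), generate the bulk of the predictors appearing in Tables 1 and 2.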
Including AFLOW

The models described previously demonstrate surprising accuracy and predictive power, especially considering the difference between the relevant energy scales of most Magpie predictors (typically in the range of eV) and superconductivity (meV scale). This disparity, however, hinders the interpretability of the models, i.e., the ability to extract meaningful physical correlations. Thus, it is highly desirable to create accurate ML models with features based on measurable macroscopic properties of the actual compounds (e.g., crystallographic and electronic properties) rather than composite elemental predictors. Unfortunately, only a small subset of materials in SuperCon is also included in the ICSD: about 1500 compounds in total, only about 800 with finite T_c, and even fewer are characterized with ab initio calculations. (Most of the superconductors in the ICSD but not in AFLOW are non-stoichiometric/doped compounds, and thus not amenable to conventional DFT methods. For the others, AFLOW calculations were attempted but did not converge to a reasonable solution.) In fact, a good portion of known superconductors are disordered (off-stoichiometric) materials and notoriously challenging to address with DFT calculations. Currently, much faster and more efficient methods are becoming available for future applications.

To extract suitable features, data are incorporated from the AFLOW Online Repositories, a database of DFT calculations managed by the software package AFLOW. It contains information for the vast majority of compounds in the ICSD and about 550 superconducting materials. In ref. , several ML models using a similar set of materials are presented. Though a classifier shows good accuracy, attempts to create a regression model for T_c led to disappointing results. We verify that using Magpie predictors for the superconducting compounds in the ICSD also yields an unsatisfactory regression model. The issue is not the lack of compounds per se, as models created with randomly drawn subsets from SuperCon with similar counts of compounds perform much better. In fact, the problem is the chemical sparsity of superconductors in the ICSD, i.e., the dearth of closely related compounds (usually created by chemical substitution). This translates to compound scatter in predictor space, a challenging learning environment for the model.

The chemical sparsity in ICSD superconductors is a significant hurdle, even when both sets of predictors (i.e., Magpie and AFLOW features) are combined via feature fusion. Additionally, this approach neglects the majority of the 16,000 compounds available via SuperCon. Instead, we constructed separate models employing Magpie and AFLOW features, and then judiciously combined the results to improve model metrics, an approach known as late or decision-level fusion. Specifically, two independent classification models are developed, one using the full SuperCon dataset and Magpie predictors, and another based on superconductors in the ICSD and AFLOW predictors. Such an approach can improve the recall, for example, in the case where we classify "high-T_c" superconductors as those predicted by either model to be above-T_sep. Indeed, this is the case here where, separately, the models obtain a recall of 40 and 66%, respectively, and together achieve a recall of about 76%. (These numbers are based on a relatively small test set benchmarking and their uncertainty is roughly 3%.) In this way, the models' predictions complement each other in a constructive way, such that above-T_sep materials missed by one model (but not the other) are now accurately classified.
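Decision-level ("late") fusion as described, where a compound counts as above-T_sep if either classifier flags it, is sketched below with two synthetic feature views standing in for the Magpie- and AFLOW-based models. OR-ing predictions can only add positives, so recall cannot decrease:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Two complementary feature views of the same 2000 compounds; the label
# depends on both views, so each single-view model is blind to part of it
X_a = rng.normal(size=(2000, 6))
X_b = rng.normal(size=(2000, 6))
y = ((X_a[:, 0] > 1.0) | (X_b[:, 0] > 1.0)).astype(int)

idx_tr, idx_te = train_test_split(np.arange(2000), test_size=0.3,
                                  random_state=0)

clf_a = RandomForestClassifier(n_estimators=100, random_state=0)
clf_a.fit(X_a[idx_tr], y[idx_tr])
clf_b = RandomForestClassifier(n_estimators=100, random_state=0)
clf_b.fit(X_b[idx_tr], y[idx_tr])

pred_a = clf_a.predict(X_a[idx_te])
pred_b = clf_b.predict(X_b[idx_te])
fused = pred_a | pred_b        # late fusion: flagged by either model

print(recall_score(y[idx_te], pred_a),
      recall_score(y[idx_te], pred_b),
      recall_score(y[idx_te], fused))
```

The trade-off, not shown here, is that the union of predictions can lower precision, which is why the text frames this specifically as a recall-oriented strategy for candidate search.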
Searching for new superconductors in the ICSD

As a final proof-of-concept demonstration, the classification and regression models described previously are integrated in one pipeline and employed to screen the entire ICSD database for candidate "high-T_c" superconductors. (Note that "high-T_c" is a label, the precise meaning of which can be adjusted.) Similar tools power high-throughput screening workflows for materials with desired thermal conductivity and magnetocaloric properties (refs. 50,62).

As a first step, the full set of Magpie predictors are generated for all compounds in the ICSD. A classification model similar to the one presented above is constructed, but trained only on materials in SuperCon and not in the ICSD (used as an independent test set). The model is then applied on the ICSD set to create a list of materials predicted to have T_c above 10 K. Opportunities for model benchmarking are limited to those materials in both the SuperCon and ICSD datasets, though this test set is shown to be problematic. The set includes about 1500 compounds, with T_c reported for only about half of them. The model achieves an impressive accuracy of 0.98, which is overshadowed by the fact that 96.6% of these compounds belong to the T_c < 10 K class. The precision, recall, and F_1 scores are about 0.74, 0.66, and 0.70, respectively. These metrics are lower than the estimates calculated for the general classification model, which is expected given that this set cannot be considered randomly selected. Nevertheless, the performance suggests a good opportunity to identify new candidate superconductors.

Next in the pipeline, the list is fed into a random forest regression model (trained on the entire SuperCon database) to predict T_c. Filtering on the materials with predicted T_c > 20 K, the list is further reduced to about 2000 compounds. This count may appear daunting, but should be compared with the total number of compounds in the database: about 110,000. Thus, the method selects <2% of all materials, which in the context of the training set (containing >20% with "high-T_c"), suggests that the model is not overly biased toward predicting high critical temperatures.

The vast majority of the compounds identified as candidate superconductors are cuprates, or at least compounds that contain copper and oxygen. There are also some materials clearly related to the iron-based superconductors. The remaining set has 35 members, and is composed of materials that are not obviously connected to any high-temperature superconducting families (see Table 3). (For at least one compound from the list, Na3Ni2BiO6, low-temperature measurements have been performed and no signs of superconductivity were observed.) None of them is predicted to have T_c in excess of 40 K, which is not surprising, given that no such instances exist in the training dataset. All contain oxygen, also not a surprising result, since the group of known superconductors with T_c > 20 K is dominated by oxides.

Table 3. List of potential superconductors identified by the pipeline

Compound             | ICSD                   | SYM
CsBe(AsO4)           | 074027                 | Orthorhombic
RbAsO2               | 413150                 | Orthorhombic
KSbO2                | 411214                 | Monoclinic
RbSbO2               | 411216                 | Monoclinic
CsSbO2               | 059329                 | Monoclinic
AgCrO2               | 004149/025624          | Hexagonal
K0.8(Li0.2Sn0.76)O2  | 262638                 | Hexagonal
Cs(MoZn)(O3F3)       | 018082                 | Cubic
Na3Cd2(IrO6)         | 404507                 | Monoclinic
Sr3Cd(PtO6)          | 280518                 | Hexagonal
Sr3Zn(PtO6)          | 280519                 | Hexagonal
(Ba5Br2)Ru2O9        | 245668                 | Hexagonal
Ba4(AgO2)(AuO4)      | 072329                 | Orthorhombic
Sr5(AuO4)2           | 071965                 | Orthorhombic
RbSeO2F              | 078399                 | Cubic
CsSeO2F              | 078400                 | Cubic
KTeO2F               | 411068                 | Monoclinic
Na2K4(Tl2O6)         | 074956                 | Monoclinic
Na3Ni2BiO6           | 237391                 | Monoclinic
Na3Ca2BiO6           | 240975                 | Orthorhombic
CsCd(BO3)            | 189199                 | Cubic
K2Cd(SiO4)           | 083229/086917          | Orthorhombic
Rb2Cd(SiO4)          | 093879                 | Orthorhombic
K2Zn(SiO4)           | 083227                 | Orthorhombic
K2Zn(Si2O6)          | 079705                 | Orthorhombic
K2Zn(GeO4)           | 069018/085006/085007   | Orthorhombic
(K0.6Na1.4)Zn(GeO4)  | 069166                 | Orthorhombic
K2Zn(Ge2O6)          | 065740                 | Orthorhombic
Na6Ca3(Ge2O6)3       | 067315                 | Hexagonal
Cs3(AlGe2O7)         | 412140                 | Monoclinic
K4Ba(Ge3O9)          | 100203                 | Monoclinic
K16Sr4(Ge3O9)4       | 100202                 | Cubic
K3Tb[Ge3O8(OH)2]     | 193585                 | Orthorhombic
K3Eu[Ge3O8(OH)2]     | 262677                 | Orthorhombic
KBa6Zn4(Ga7O21)      | 040856                 | Trigonal

Also shown are their ICSD numbers and symmetries. Note that for some compounds there are several entries. All of the materials contain oxygen.

The list comprises several distinct groups. Most of the materials are insulators, similar to stoichiometric (and underdoped) cuprates; charge doping and/or pressure will be required to drive these materials into a superconducting state. Especially interesting are the compounds containing heavy metals (such as Au, Ir, and Ru), metalloids (Se, Te), and heavier post-transition metals (Bi, Tl), which are or could be pushed into interesting/unstable oxidation states. The most surprising and non-intuitive of the compounds in the list are the silicates and the germanates. These materials form corner-sharing SiO4 or GeO4 polyhedra, similar to quartz glass, and also have counter cations with full or empty shells, such as Cd²⁺ or K⁺. Converting these insulators to metals (and possibly superconductors) likely requires significant charge doping. However, the similarity between these compounds and cuprates is meaningful. In compounds like K2CdSiO4 or K2ZnSiO4, the K2Cd (or K2Zn) unit carries a 4+ charge that offsets the (SiO4)4− (or (GeO4)4−) charges. This is reminiscent of the way Sr balances the (CuO4)4− unit in Sr2CuO4. Such chemical similarities based on charge balancing and stoichiometry were likely identified and exploited by the ML algorithms.

The electronic properties calculated by AFLOW offer additional insight into the results of the search, and suggest a possible connection among these candidates. Plotting the electronic structure of the potential superconductors exposes a rather unusual feature shared by almost all: one or several (nearly) flat bands just below the energy of the highest occupied electronic state. Such bands lead to a large peak in the DOS (Fig. 6) and can cause a significant enhancement in T_c.
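The two-stage screening used in this section (classifier at T_sep = 10 K, then a regressor trained on ln T_c filtering at predicted T_c > 20 K) can be sketched as a toy pipeline; scikit-learn is assumed, with random data in place of the SuperCon training set and the ICSD pool:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(5)

# Toy stand-ins: "SuperCon" training data and an unlabeled "ICSD" pool
X_train = rng.normal(size=(2000, 6))
tc_train = np.exp(2.0 + X_train[:, 0])      # fake critical temperatures (K)
X_pool = rng.normal(size=(5000, 6))         # compounds to screen

# Stage 1: classifier flags candidates predicted above T_sep = 10 K
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, tc_train > 10.0)
candidates = np.flatnonzero(clf.predict(X_pool))

# Stage 2: regressor (trained on ln Tc) keeps those with predicted Tc > 20 K
reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X_train, np.log(tc_train))
pred_tc = np.exp(reg.predict(X_pool[candidates]))
shortlist = candidates[pred_tc > 20.0]
print(len(candidates), len(shortlist))
```

The staged design mirrors the text: the cheap classifier prunes the bulk of the pool, and the regressor only has to score the surviving candidates.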
6) and can also have counter cations with full or empty shells, such as Cd or cause a significant enhancement in T . Peaks in the DOS elicited K . Converting these insulators to metals (and possibly super- c by van Hove singularities can enhance T if sufficiently close to conductors) likely requires significant charge doping. However, 64–66 the similarity between these compounds and cuprates is mean- E . However, note that unlike typical van Hove points, a true ingful. In compounds like K CdSiO or K ZnSiO ,K Cd (or K Zn) flat band creates divergence in the DOS (as opposed to its 2 4 2 4 2 2 4− 4− unit carries a 4+ charge that offsets the (SiO ) (or (GeO ) ) derivatives), which in turn leads to a critical temperature 4 4 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2018) 29 Machine learning modeling of superconducting critical V Stanev et al. Cs (AlGe O ) CsBe(AsO ) 3 2 7 4 eDOS (states/eV) eDOS (states/eV) 4 4 3 3 total 2 2 1 1 0 0 -1 -1 -2 -2 -3 -3 -4 -4 Y H |I Γ Y F H Z I X Γ Z |M Γ N |X 0 20 40 60 80 Γ X S Y Γ Z U R T Z |Y X |S R 0 20 40 60 80 1 1 F T |U Sr Cd(PtO ) Ba (AgO )(AuO ) 3 6 eDOS (states/eV) 4 2 4 eDOS (states/eV) 4 4 3 3 2 2 1 1 0 0 -1 -1 -2 -2 -3 -3 -4 -4 Γ P Q Γ F P L Z 0 15 30 45 Γ X S R A Z Γ Y X A T Y |Z T 0 20 40 60 1 1 1 Fig. 6 DOS of four compounds identified by the ML algorithm as potential materials with T > 20 K. The partial DOS contributions from s, p, and d electrons and total DOS are shown in blue, green, red, and black, respectively. The large peak just below E is a direct consequence of the flat band(s) present in all these materials. These images were generated automatically via AFLOW . In the case of substantial overlap among k-point labels, the right-most label is offset below dependence linear in the pairing interaction strength, rather than different groups. 
Additionally, there is significant similarity with the band structure and DOS of the layered BiS2-based superconductors.

This band structure feature came as a surprising result of applying the ML model. It was not sought for, and, moreover, no explicit information about the electronic band structure has been included in these predictors. This is in contrast to the algorithm presented in ref. 30, which was specifically designed to filter ICSD compounds based on several preselected electronic structure features.

While at the moment it is not clear whether some (or indeed any) of these compounds are really superconducting, let alone with Tc's above 20 K, the presence of this highly unusual electronic structure feature is encouraging. Attempts to synthesize several of these compounds are already underway.

DISCUSSION
Herein, several machine learning tools are developed to study the critical temperature of superconductors. Based on information from the SuperCon database, initial coarse-grained chemical features are generated using the Magpie software. As a first application of ML methods, materials are divided into two classes depending on whether Tc is above or below 10 K. A non-parametric random forest classification model is constructed to predict the class of superconductors. The classifier shows strong performance, with an out-of-sample accuracy and F1 score of about 92%. Next, several successful random forest regression models are created to predict the value of Tc, including separate models for the three material sub-groups, i.e., cuprate, iron-based, and low-Tc compounds. By studying the importance of predictors for each family of superconductors, insights are obtained about the physical mechanisms driving superconductivity among the different groups. With the incorporation of crystallographic- and electronic-based features from the AFLOW Online Repositories, the ML models are further improved. Finally, we combined these models into one integrated pipeline, which is employed to search the entire ICSD database for new inorganic superconductors. The model identified 35 oxides as candidate materials. Some of these are chemically and structurally similar to the cuprates (even though no explicit structural information was provided during training of the model). Another feature that unites almost all of these materials is the presence of flat or nearly-flat bands just below the energy of the highest occupied electronic state.

In conclusion, this work demonstrates the important role ML models can play in superconductivity research. Records collected over several decades in SuperCon and other relevant databases can be consumed by ML models, generating insights and promoting better understanding of the connection between materials' chemistry/structure and superconductivity. Application of sophisticated ML algorithms has the potential to dramatically accelerate the search for candidate high-temperature superconductors.

METHODS
Superconductivity data
The SuperCon database consists of two separate subsets: "Oxide and Metallic" (inorganic materials containing metals, alloys, cuprate high-temperature superconductors, etc.) and "Organic" (organic superconductors). Downloading the entire inorganic materials dataset and removing compounds with incompletely specified chemical compositions leaves about 22,000 entries. If a single Tc record exists for a given material, it is taken to accurately reflect the critical temperature of this material. In the case of multiple records for the same compound, the reported Tc's are averaged, but only if their standard deviation is <5 K; otherwise the compound is discarded. This brings the total down to about 16,400 compounds, of which around 4,000 have no critical temperature reported.
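The record-merging rule just described (average multiple Tc reports only when their standard deviation is below 5 K, otherwise discard the compound) can be sketched with pandas. The records below are toy stand-ins for SuperCon entries, and the column names are hypothetical:

```python
import pandas as pd

# Toy records: one compound may have several reported Tc values.
df = pd.DataFrame({
    "formula": ["MgB2", "MgB2", "Nb3Sn", "BadEntry", "BadEntry"],
    "Tc":      [39.0,   38.5,   18.1,    5.0,        30.0],
})

stats = df.groupby("formula")["Tc"].agg(mean="mean", std="std", n="count")
# Single records pass through (std is NaN for n=1); multiple records are
# kept only if their standard deviation is below 5 K, as described above.
keep = stats["std"].isna() | (stats["std"] < 5.0)
cleaned = stats.loc[keep, "mean"].rename("Tc")
print(cleaned)
```

Here "BadEntry" (reports of 5 K and 30 K) is dropped, while the two consistent MgB2 reports are averaged to a single value.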
Each entry in the set contains fields for the chemical composition, Tc, structure, and a journal reference to the information source. Here, structural information is ignored, as it is not always available.

There are occasional problems with the validity and consistency of some of the data. For example, the database includes some reports based on tenuous experimental evidence and only indirect signatures of superconductivity, as well as reports of inhomogeneous (surface, interfacial) and non-equilibrium phases. Even in cases of bona fide bulk superconducting phases, important relevant variables like pressure are not recorded. Though some of the obviously erroneous records were removed from the data, these issues were largely ignored, assuming their effect on the entire dataset to be relatively modest. The data cleaning and processing is carried out using the Python Pandas package for data analysis.

Chemical and structural features
The predictors are calculated using the Magpie software. It computes a set of 145 attributes for each material, including: (i) stoichiometric features (these depend only on the ratio of elements, not the specific species); (ii) elemental property statistics: the mean, mean absolute deviation, range, minimum, maximum, and mode of 22 different elemental properties (e.g., period/group on the periodic table, atomic number, atomic radii, melting temperature); (iii) electronic structure attributes: the average fraction of electrons from the s, p, d, and f valence shells among all elements present; and (iv) ionic compound features, which include whether it is possible to form an ionic compound assuming all elements exhibit a single oxidation state.

Fig. 7 Regression model predictions of Tc. Predicted vs. measured Tc for the general regression model. The R² score is comparable to the one obtained when testing the regression model for ln(Tc).
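The "elemental property statistics" of item (ii) reduce a variable-length composition to fixed-length descriptors. Magpie itself is a Java library; the following Python sketch illustrates the idea for a single property (atomic number), with weights taken from the stoichiometry. The property table here is a tiny illustrative stand-in:

```python
import numpy as np

# Tiny stand-in for an elemental property table; Magpie uses 22 such
# properties per element.
ATOMIC_NUMBER = {"Mg": 12, "B": 5, "Nb": 41, "Sn": 50}

def property_stats(composition, table):
    """Composition-weighted statistics of one elemental property,
    in the spirit of Magpie's feature set."""
    elems, amounts = zip(*composition.items())
    w = np.array(amounts, dtype=float)
    w /= w.sum()                                   # stoichiometric weights
    v = np.array([table[e] for e in elems], dtype=float)
    mean = np.dot(w, v)
    return {
        "mean": mean,
        "mad": np.dot(w, np.abs(v - mean)),        # mean absolute deviation
        "range": v.max() - v.min(),
        "min": v.min(),
        "max": v.max(),
    }

feats = property_stats({"Mg": 1, "B": 2}, ATOMIC_NUMBER)  # MgB2
print(feats)
```

Repeating this for each tabulated property (and adding the mode) yields a fixed-length feature vector regardless of how many elements a compound contains.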
ML models are also constructed with data on the superconducting materials in the AFLOW Online Repositories. AFLOW is a high-throughput ab initio framework that manages density functional theory (DFT) calculations in accordance with the AFLOW Standard. The Standard ensures that the calculations and derived properties are empirical (reproducible), reasonably well-converged, and, above all, consistent (fixed set of parameters), a particularly attractive feature for ML modeling. Many materials properties important for superconductivity have been calculated within the AFLOW framework and are easily accessible through the AFLOW Online Repositories. The features are built from the following properties: number of atoms, space group, density, volume, energy per atom, electronic entropy per atom, valence of the cell, scintillation attenuation length, the ratios of the unit cell's dimensions, and Bader charges and volumes. For the Bader charges and volumes (vectors), the following statistics are calculated and incorporated: the maximum, minimum, average, standard deviation, and range.

Machine learning algorithms
Once we have a list of relevant predictors, various ML models can be applied to the data (refs. 51, 52). All ML algorithms in this work are variants of the random forest method (ref. 53). It is based on creating a set of individual decision trees (hence the "forest"), each built to solve the same classification/regression problem. The model then combines their results, either by voting or averaging, depending on the problem.

Fig. 8 Flat bands feature. Comparison between the normalized average DOS of 380 known superconductors in the ICSD (left) and the normalized average DOS of the potential high-temperature superconductors from Table 3 (right).
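The same collapse to fixed-length statistics is applied to the per-atom Bader charges and volumes. A minimal sketch, in which both the charge values and the feature-name prefix are hypothetical:

```python
import numpy as np

def vector_stats(values, prefix):
    """Collapse a per-atom vector (e.g., Bader charges) into the five
    fixed-length statistics used as ML features."""
    v = np.asarray(values, dtype=float)
    return {
        f"{prefix}_max": v.max(),
        f"{prefix}_min": v.min(),
        f"{prefix}_avg": v.mean(),
        f"{prefix}_std": v.std(),
        f"{prefix}_range": v.max() - v.min(),
    }

# Hypothetical Bader charges for a four-atom cell (illustrative numbers).
feats = vector_stats([1.6, 1.6, -0.8, -2.4], "bader_charge")
print(feats)
```

This keeps the feature vector the same length for cells with different numbers of atoms, which is what the tree-based models require.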
The deeper the individual trees are, the more complex the relationships the model can learn, but also the greater the danger of overfitting, i.e., learning some irrelevant information or just "noise". To make the forest more robust to overfitting, individual trees in the ensemble are built from samples drawn with replacement (a bootstrap sample) from the training set. In addition, when splitting a node during the construction of a tree, the model chooses the best split of the data considering only a random subset of the features.

The random forest models above are developed using scikit-learn, a powerful and efficient machine learning Python library. Hyperparameters of these models include the number of trees in the forest, the maximum depth of each tree, the minimum number of samples required to split an internal node, and the number of features to consider when looking for the best split. To optimize the classifier and the combined/family-specific regressors, the GridSearch function in scikit-learn is employed, which generates and compares candidate models from a grid of parameter values. To reduce computational expense, models are not optimized at each step of the backward feature selection process.

Flat bands feature
The flat band attribute is unusual for a superconducting material: the average DOS of the known superconductors in the ICSD has no distinct features, demonstrating a roughly uniform distribution of electronic states. In contrast, the average DOS of the potential superconductors in Table 3 shows a sharp peak just below EF (Fig. 8). Also, note that most of the flat bands in the potential superconductors we discuss have a notable contribution from the oxygen p-orbitals. Accessing/exploiting the potential strong instability this electronic structure feature creates can require significant charge doping.
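The hyperparameter optimization described above can be sketched with scikit-learn's GridSearchCV over the four hyperparameters named in the text. The data here are synthetic stand-ins for the real feature matrix and target, and the grid is deliberately tiny:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(200, 5)                          # stand-in feature matrix
y = X[:, 0] * 10 + rng.normal(0, 0.5, 200)    # synthetic target, one signal feature

# The same four hyperparameters named in the text, on a small grid.
param_grid = {
    "n_estimators": [50, 100],        # number of trees in the forest
    "max_depth": [5, None],           # maximum depth of each tree
    "min_samples_split": [2, 5],      # samples required to split a node
    "max_features": ["sqrt", 1.0],    # features considered per split
}
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

For a regressor, `best_score_` is the cross-validated R² of the best parameter combination; swapping in `RandomForestClassifier` gives the analogous accuracy-based search used for the classification model.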
To test the influence of using the log-transformed target variable ln(Tc), a general regression model is trained and tested on raw Tc data (shown in Fig. 7). This model is very similar to the one described in the section "Results", and its R² value is fairly similar as well (although comparing R² scores of models built using different target data can be misleading). However, note the relative sparsity of data points in some Tc ranges, which makes the model susceptible to outliers.

Prediction errors of the regression models
Previously, several regression models were described, each one designed to predict the critical temperatures of materials from different superconducting groups. These models achieved impressive R² scores, demonstrating good predictive power for each group. However, it is also important to consider the accuracy of the predictions for individual compounds (rather than on the aggregate set), especially in the context of searching for new materials. To do this, we calculate the prediction errors for about 300 materials from a test set. Specifically, we consider the difference between the logarithms of the predicted and measured critical temperatures, ln(Tc_meas) − ln(Tc_pred), normalized by the value of ln(Tc_meas) (the normalization compensates for the different Tc ranges of the different groups). The models show comparable spreads of errors. The histograms of errors for the four models (combined and three group-specific) are shown in Fig. 9. The errors approximately follow a normal distribution, centered not at zero but at a small negative value. This suggests the models are marginally biased and on average tend to slightly underestimate Tc. The variance is comparable for all models, but largest for the model trained and tested on iron-based materials, which also shows the smallest R². Performance of this model is expected to benefit from a larger training set.

Fig. 9 Histograms of Δln(Tc) × ln(Tc_meas)^−1 for the four regression models (a–d), where Δln(Tc) ≡ ln(Tc_meas) − ln(Tc_pred). [Panels omitted.]

AUTHOR CONTRIBUTIONS
V.S., I.T., and A.G.K. designed the research. V.S. worked on the model. C.O. and S.C. performed the AFLOW calculations. V.S., I.T., E.R., and J.P. analyzed the results. V.S., C.O., I.T., and E.R. wrote the text of the manuscript. All authors discussed the results and commented on the manuscript.

ADDITIONAL INFORMATION
Competing interests: The authors declare no competing interests.

Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Data availability
The superconductivity data used to generate the results in this work can be downloaded from https://github.com/vstanev1/Supercon.

ACKNOWLEDGEMENTS
The authors are grateful to Daniel Samarov, Victor Galitski, Cormac Toher, Richard L. Greene, and Yibin Xu for many useful discussions and suggestions. We acknowledge Stephan Rühl for ICSD. This research is supported by ONR N000141512222, ONR N00014-13-1-0635, and AFOSR no. FA 9550-14-10332. C.O. acknowledges support from the National Science Foundation Graduate Research Fellowship under grant no. DGF1106401. J.P. acknowledges support from the Gordon and Betty Moore Foundation's EPiQS Initiative through grant no. GBMF4419. S.C. acknowledges support by the Alexander von Humboldt Foundation.

REFERENCES
1. Hirsch, J. E., Maple, M. B. & Marsiglio, F. Superconducting materials: conventional, unconventional and undetermined. Phys. C 514, 1–444 (2015).
2. Anderson, P. W. Plasmons, gauge invariance, and mass. Phys. Rev. 130, 439–442 (1963).
3. Chu, C. W., Deng, L. Z. & Lv, B. Hole-doped cuprate high temperature superconductors. Phys. C 514, 290–313 (2015).
4. Paglione, J. & Greene, R. L. High-temperature superconductivity in iron-based materials. Nat. Phys. 6, 645–658 (2010).
5. Bergerhoff, G., Hundt, R., Sievers, R. & Brown, I. D. The inorganic crystal structure data base. J. Chem. Inf. Comput. Sci. 23, 66–69 (1983).
6. Curtarolo, S. et al. AFLOW: an automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58, 218–226 (2012).
7. Landis, D. D. et al. The computational materials repository. Comput. Sci. Eng. 14, 51–57 (2012).
8. Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM 65, 1501–1509 (2013).
9. Jain, A. et al. Commentary: the Materials Project: a materials genome approach to accelerating materials innovation. APL Mater. 1, 011002 (2013).
10. Agrawal, A. & Choudhary, A. Perspective: materials informatics and big data: realization of the "fourth paradigm" of science in materials science. APL Mater. 4, 053208 (2016).
11. Lookman, T., Alexander, F. J. & Rajan, K. (eds) A Perspective on Materials Informatics: State-of-the-Art and Challenges (Springer International Publishing, 2016).
12. Jain, A., Hautier, G., Ong, S. P. & Persson, K. A. New opportunities for materials informatics: resources and data mining techniques for uncovering hidden relationships. J. Mater. Res. 31, 977–994 (2016).
13. Mueller, T., Kusne, A. G. & Ramprasad, R. Machine Learning in Materials Science, pp. 186–273 (John Wiley & Sons, 2016).
14. Seko, A., Maekawa, T., Tsuda, K. & Tanaka, I. Machine learning with systematic density-functional theory calculations: application to melting temperatures of single- and binary-component solids. Phys. Rev. B 89, 054303 (2014).
15. Balachandran, P. V., Theiler, J., Rondinelli, J. M. & Lookman, T. Materials prediction via classification learning. Sci. Rep. 5, 13285 (2015).
16. Pilania, G. et al. Machine learning bandgaps of double perovskites. Sci. Rep. 6, 19375 (2016).
17. Isayev, O. et al. Universal fragment descriptors for predicting electronic properties of inorganic crystals. Nat. Commun. 8, 15679 (2017).
18. National Institute of Materials Science, Materials Information Station, SuperCon, http://supercon.nims.go.jp/index_en.html (2011).
19. Curtarolo, S. et al. AFLOWLIB.ORG: a distributed materials properties repository from high-throughput ab initio calculations. Comput. Mater. Sci. 58, 227–235 (2012).
20. Taylor, R. H. et al. A RESTful API for exchanging materials data in the AFLOWLIB.org consortium. Comput. Mater. Sci. 93, 178–192 (2014).
21. Calderon, C. E. et al. The AFLOW standard for high-throughput materials science calculations. Comput. Mater. Sci. 108, 233–238 (2015).
22. Rose, F. et al. AFLUX: the LUX materials search API for the AFLOW data repositories. Comput. Mater. Sci. 137, 362–370 (2017).
23. Villars, P. & Phillips, J. C. Quantum structural diagrams and high-Tc superconductivity. Phys. Rev. B 37, 2345–2348 (1988).
24. Rabe, K. M., Phillips, J. C., Villars, P. & Brown, I. D. Global multinary structural chemistry of stable quasicrystals, high-TC ferroelectrics, and high-Tc superconductors. Phys. Rev. B 45, 7650–7676 (1992).
25. Isayev, O. et al. Materials cartography: representing and mining materials space using structural and electronic fingerprints. Chem. Mater. 27, 735–743 (2015).
26. Ling, J., Hutchinson, M., Antono, E., Paradiso, S. & Meredig, B. High-dimensional materials and process optimization using data-driven experimental design with well-calibrated uncertainty estimates. Integr. Mater. Manuf. Innov. 6, 207–217 (2017).
27. Hirsch, J. E. Correlations between normal-state properties and superconductivity. Phys. Rev. B 55, 9007–9024 (1997).
28. Owolabi, T. O., Akande, K. O. & Olatunji, S. O. Estimation of superconducting transition temperature TC for superconductors of the doped MgB2 system from the crystal lattice parameters using support vector regression. J. Supercond. Nov. Magn. 28, 75–81 (2015).
29. Ziatdinov, M. et al. Deep data mining in a real space: separation of intertwined electronic responses in a lightly doped BaFe2As2. Nanotechnology 27, 475706 (2016).
30. Klintenberg, M. & Eriksson, O. Possible high-temperature superconductors predicted from electronic structure and data-filtering algorithms. Comput. Mater. Sci. 67, 282–286 (2013).
31. Owolabi, T. O., Akande, K. O. & Olatunji, S. O. Prediction of superconducting transition temperatures for Fe-based superconductors using support vector machine. Adv. Phys. Theor. Appl. 35, 12–26 (2014).
32. Norman, M. R. Materials design for new superconductors. Rep. Prog. Phys. 79, 074502 (2016).
33. Kopnin, N. B., Heikkilä, T. T. & Volovik, G. E. High-temperature surface superconductivity in topological flat-band systems. Phys. Rev. B 83, 220503 (2011).
34. Peotta, S. & Törmä, P. Superfluidity in topologically nontrivial flat bands. Nat. Commun. 6, 8944 (2015).
35. Hosono, H. et al. Exploration of new superconductors and functional materials, and fabrication of superconducting tapes and wires of iron pnictides. Sci. Technol. Adv. Mater. 16, 033503 (2015).
36. Kohn, W. & Luttinger, J. M. New mechanism for superconductivity. Phys. Rev. Lett. 15, 524–526 (1965).
37. Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2, 16028 (2016).
38. Setyawan, W. & Curtarolo, S. High-throughput electronic band structure calculations: challenges and tools. Comput. Mater. Sci. 49, 299–312 (2010).
39. Yang, K., Oses, C. & Curtarolo, S. Modeling off-stoichiometry materials with a high-throughput ab-initio approach. Chem. Mater. 28, 6484–6492 (2016).
40. Levy, O., Jahnátek, M., Chepulskii, R. V., Hart, G. L. W. & Curtarolo, S. Ordered structures in rhenium binary alloys from first-principles calculations. J. Am. Chem. Soc. 133, 158–163 (2011).
41. Levy, O., Hart, G. L. W. & Curtarolo, S. Structure maps for hcp metals from first-principles calculations. Phys. Rev. B 81, 174106 (2010).
42. Levy, O., Chepulskii, R. V., Hart, G. L. W. & Curtarolo, S. The new face of rhodium alloys: revealing ordered structures from first principles. J. Am. Chem. Soc. 132, 833–837 (2010).
43. Levy, O., Hart, G. L. W. & Curtarolo, S. Uncovering compounds by synergy of cluster expansion and high-throughput methods. J. Am. Chem. Soc. 132, 4830–4833 (2010).
44. Hart, G. L. W., Curtarolo, S., Massalski, T. B. & Levy, O. Comprehensive search for new phases and compounds in binary alloy systems based on platinum-group metals, using a computational first-principles approach. Phys. Rev. X 3, 041035 (2013).
45. Mehl, M. J. et al. The AFLOW library of crystallographic prototypes: part 1. Comput. Mater. Sci. 136, S1–S828 (2017).
46. Supka, A. R. et al. AFLOWπ: a minimalist approach to high-throughput ab initio calculations including the generation of tight-binding hamiltonians. Comput. Mater. Sci. 136, 76–84 (2017).
47. Toher, C. et al. High-throughput computational screening of thermal conductivity, Debye temperature, and Grüneisen parameter using a quasiharmonic Debye model. Phys. Rev. B 90, 174107 (2014).
48. Perim, E. et al. Spectral descriptors for bulk metallic glasses based on the thermodynamics of competing crystalline phases. Nat. Commun. 7, 12315 (2016).
49. Toher, C. et al. Combining the AFLOW GIBBS and Elastic Libraries to efficiently and robustly screen thermomechanical properties of solids. Phys. Rev. Mater. 1, 015401 (2017).
50. van Roekeghem, A., Carrete, J., Oses, C., Curtarolo, S. & Mingo, N. High-throughput computation of thermal conductivity of high-temperature solid phases: the case of oxide and fluoride perovskites. Phys. Rev. X 6, 041061 (2016).
51. Bishop, C. Pattern Recognition and Machine Learning (Springer-Verlag, New York, 2006).
52. Hastie, T., Tibshirani, R. & Friedman, J. H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer-Verlag, New York, 2001).
53. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
54. Caruana, R. & Niculescu-Mizil, A. An empirical comparison of supervised learning algorithms. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06), 161–168 (ACM, New York, 2006).
55. Chawla, N. V., Bowyer, K. W., Hall, L. O. & Kegelmeyer, W. P. SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res. 16, 321–357 (2002).
56. Maxwell, E. Isotope effect in the superconductivity of mercury. Phys. Rev. 78, 477 (1950).
57. Reynolds, C. A., Serin, B., Wright, W. H. & Nesbitt, L. B. Superconductivity of isotopes of mercury. Phys. Rev. 78, 487 (1950).
58. Reynolds, C. A., Serin, B. & Nesbitt, L. B. The isotope effect in superconductivity. I. Mercury. Phys. Rev. 84, 691–694 (1951).
59. Kasahara, Y., Kuroki, K., Yamanaka, S. & Taguchi, Y. Unconventional superconductivity in electron-doped layered metal nitride halides MNX (M = Ti, Zr, Hf; X = Cl, Br, I). Phys. C 514, 354–367 (2015).
60. Yin, Z. P., Kutepov, A. & Kotliar, G. Correlation-enhanced electron-phonon coupling: applications of GW and screened hybrid functional to bismuthates, chloronitrides, and other high-Tc superconductors. Phys. Rev. X 3, 021011 (2013).
61. Matthias, B. T. Empirical relation between superconductivity and the number of valence electrons per atom. Phys. Rev. 97, 74–76 (1955).
62. Bocarsly, J. D. et al. A simple computational proxy for screening magnetocaloric compounds. Chem. Mater. 29, 1613–1622 (2017).
63. Seibel, E. M. et al. Structure and magnetic properties of the α-NaFeO2-type honeycomb compound Na3Ni2BiO6. Inorg. Chem. 52, 13605–13611 (2013).
64. Labbé, J., Barišić, S. & Friedel, J. Strong-coupling superconductivity in V3X type of compounds. Phys. Rev. Lett. 19, 1039–1041 (1967).
65. Hirsch, J. E. & Scalapino, D. J. Enhanced superconductivity in quasi two-dimensional systems. Phys. Rev. Lett. 56, 2732–2735 (1986).
66. Dzyaloshinskiĭ, I. E. Maximal increase of the superconducting transition temperature due to the presence of van't Hoff singularities. JETP Lett. 46, 118 (1987).
67. Yazici, D., Jeon, I., White, B. D. & Maple, M. B. Superconductivity in layered BiS2-based compounds. Phys. C 514, 218–236 (2015).
68. McKinney, W. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython (O'Reilly Media, 2012).
69. Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. Magpie Software, https://bitbucket.org/wolverton/magpie (2016).
70. Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
article’s Creative Commons license and your intended use is not permitted by statutory Open Access This article is licensed under a Creative Commons regulation or exceeds the permitted use, you will need to obtain permission directly Attribution 4.0 International License, which permits use, sharing, from the copyright holder. To view a copy of this license, visit http://creativecommons. adaptation, distribution and reproduction in any medium or format, as long as you give org/licenses/by/4.0/. appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless © The Author(s) 2018 indicated otherwise in a credit line to the material. If material is not included in the npj Computational Materials (2018) 29 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences

Journal: npj Computational Materials (Springer Journals)
Published: Jun 28, 2018
