S. Ong, S. Cholia, Anubhav Jain, Miriam Brafman, D. Gunter, G. Ceder, K. Persson (2015)
The Materials Application Programming Interface (API): A simple, flexible and efficient API for materials data based on REpresentational State Transfer (REST) principles. Computational Materials Science, 97
Yixuan Shi, C. Sturm, H. Kleinke (2019)
Chalcogenides as thermoelectric materials. Journal of Solid State Chemistry
A. Furmanchuk, J. Saal, J. Doak, G. Olson, A. Choudhary, Ankit Agrawal (2018)
Prediction of Seebeck coefficient for compounds without restriction to fixed stoichiometry: A machine learning approach. Journal of Computational Chemistry, 39
B. S. Everitt, S. Landau, M. Leese, D. Stahl (2011)
Cluster Analysis
Shrayesh Patel, A. Glaudell, K. Peterson, E. Thomas, Kathryn O'Hara, E. Lim, M. Chabinyc (2017)
Morphology controls the thermoelectric power factor of a doped semiconducting polymer. Science Advances, 3
R. Bhatt, S. Bhattacharya, Miral Patel, R. Basu, Ajay Singh, C. Surger, M. Navaneethan, Y. Hayakawa, D. Aswal, S. Gupta (2013)
Thermoelectric performance of Cu intercalated layered TiSe2 above 300 K. Journal of Applied Physics, 114
Xinyue Zhang, Zhonglin Bu, Xuemin Shi, Zhiwei Chen, Siqi Lin, Bing-rui Shan, Max Wood, Alemayouh Snyder, Lidong Chen, G. Snyder, Y. Pei (2020)
Electronic quality factor for thermoelectrics. Science Advances, 6
Dan-qi He, Wenyu Zhao, X. Mu, Hongyu Zhou, P. Wei, Wanting Zhu, Xiaolei Nie, X. Su, Huijun Liu, Jiaqing He, Qingjie Zhang (2017)
Enhanced thermoelectric performance of heavy-fermion YbAl3 via multi-scale microstructures. Journal of Alloys and Compounds, 725
W. Zeier, Jennifer Schmitt, G. Hautier, U. Aydemir, Z. Gibbs, C. Felser, G. Snyder (2016)
Engineering half-Heusler thermoelectric materials using Zintl chemistry. Nature Reviews Materials, 1
E. Guilmeau, A. Maignan, C. Martin (2009)
Thermoelectric Oxides: Effect of Doping in Delafossites and Zinc Oxide. Journal of Electronic Materials, 38
Jong-Woon Park, D. Kwak, Sung‐Hwa Yoon, S. Choi (2009)
Thermoelectric properties of Bi, Nb co-substituted CaMnO3 at high temperature. Journal of Alloys and Compounds, 487
T. M. Mitchell (1997)
Machine Learning
W. S. Chen, C. H. Hsu, W. H. Kao, Y. T. Yu, P. C. Yang, C. F. Tseng, C. H. Lai, Y. M. Lee, H. W. Yang, J. S. Lin (2014)
Applied Mechanics and Materials, 535
D. Flahaut, T. Mihara, R. Funahashi, N. Nabeshima, Kyu Lee, H. Ohta, K. Koumoto (2006)
Thermoelectrical properties of A-site substituted Ca1-xRexMnO3 system. Journal of Applied Physics, 100
M. Gaultois, Taylor Sparks, Christopher Borg, R. Seshadri, W. Bonificio, D. Clarke (2013)
Data-Driven Review of Thermoelectric Materials: Performance and Resource Considerations. Chemistry of Materials, 25
Xun Shi, Jiong Yang, J. Salvador, M. Chi, J. Cho, Hsin Wang, S. Bai, Jihui Yang, Wenqing Zhang, Lidong Chen (2011)
Multiple-filled skutterudites: high thermoelectric figure of merit through separately optimizing electrical and thermal transports. Journal of the American Chemical Society, 133 (20)
G. Nolas, J. Cohn, G. Slack, S. Schujman (1998)
Semiconducting Ge clathrates: Promising candidates for thermoelectric applications. Applied Physics Letters, 73
Training and predicted data for machine learning on thermoelectric materials (2022), https://doi.org/10.25532/OPARA-164 (accessed
Jiawei Zhang, Lirong Song, B. Iversen (2019)
Insights into the design of thermoelectric Mg3Sb2 and its analogs by combining theory and experiment. npj Computational Materials, 5
Y. Putri, C. Wan, Yifeng Wang, W. Norimatsu, M. Kusunoki, K. Koumoto (2012)
Effects of alkaline earth doping on the thermoelectric properties of misfit layer sulfides. Scripta Materialia, 66
Y. Pei, S. Bai, Xu Zhao, W. Zhang, Longxin Chen (2008)
Thermoelectric properties of EuyCo4Sb12 filled skutterudites. Solid State Sciences, 10
C. Uhlig, Ekrem Guenes, Anne Schulze, M. Elm, P. Klar, S. Schlecht (2014)
Nanoscale FeS2 (Pyrite) as a Sustainable Thermoelectric Material. Journal of Electronic Materials, 43
K. Biswas, Jiaqing He, I. Blum, Chun Wu, T. Hogan, D. Seidman, V. Dravid, M. Kanatzidis (2012)
High-performance bulk thermoelectrics with all-scale hierarchical architectures. Nature, 489
S. Populoh, M. Trottmann, Myriam Aguire, A. Weidenkaff (2011)
Nanostructured Nb-substituted CaMnO3 n-type thermoelectric material prepared in a continuous process by ultrasonic spray combustion. Journal of Materials Research, 26
J. Lan, Yuanhua Lin, Huizheng Fang, A. Mei, C. Nan, Yong Liu, S. Xu, M. Peters (2010)
High-Temperature Thermoelectric Behaviors of Fine-Grained Gd-Doped CaMnO3 Ceramics. Journal of the American Ceramic Society, 93
G. Rogl, P. Rogl (2017)
Skutterudites, a most promising group of thermoelectric materials. Green and Sustainable Chemistry, 4
K. Hsu, S. Loo, F. Guo, Wei Chen, J. Dyck, C. Uher, T. Hogan, E. Polychroniadis, M. Kanatzidis (2004)
Cubic AgPbmSbTe2+m: Bulk Thermoelectric Materials with High Figure of Merit. Science, 303
Muskan Nabi, D. Gupta (2019)
Insight into various properties of rare-earth-based inverse perovskites Gd3AlX (X = B, N). International Journal of Energy Research, 44
Logan Ward, Alex Dunn, Alireza Faghaninia, Nils Zimmermann, Saurabh Bajaj, Qi Wang, Joseph Montoya, Jiming Chen, Kyle Bystrom, M. Dylla, K. Chard, M. Asta, K. Persson, G. Snyder, Ian Foster, Anubhav Jain (2018)
Matminer: An open source toolkit for materials data mining. Computational Materials Science
M. Gaultois, A. Oliynyk, A. Mar, Taylor Sparks, G. Mulholland, B. Meredig (2016)
Perspective: Web-based machine learning models for real-time screening of thermoelectric materials properties. APL Materials, 4
V. Stanev, C. Oses, A. Kusne, Efrain Rodriguez, J. Paglione, S. Curtarolo, I. Takeuchi (2017)
Machine learning modeling of superconducting critical temperature. npj Computational Materials, 4
M. Amado, R. Pinto, M. Braga, J. Sousa, P. Morin (1996)
Transport and magnetic properties of GdAg. Journal of Magnetism and Magnetic Materials, 153
H. Mclaughlin, Anna Liljestrom, J. Lim, Dawn Meyers (2002)
Learn. Education and Urban Society, 34
N. Chawla, K. Bowyer, L. Hall, W. Kegelmeyer (2002)
SMOTE: Synthetic Minority Over-sampling Technique. ArXiv, abs/1106.1813
Ya Zhuo, Aria Tehrani, Jakoah Brgoch (2018)
Predicting the Band Gaps of Inorganic Solids by Machine Learning. The Journal of Physical Chemistry Letters, 9 (7)
D. Bérardan, E. Guilmeau, A. Maignan, B. Raveau (2008)
In2O3:Ge, a promising n-type thermoelectric oxide composite. Solid State Communications, 146
N. Blake, L. Møllnitz, G. Kresse, H. Metiu (1999)
Why clathrates are good thermoelectrics: A theoretical study of Sr8Ga16Ge30. Journal of Chemical Physics, 111
S. Sakurada, N. Shutoh (2005)
Effect of Ti substitution on the thermoelectric properties of (Zr,Hf)NiSn half-Heusler compounds. Applied Physics Letters, 86
T. Zhu, Yintu Liu, C. Fu, J. Heremans, J. Snyder, Xinbing Zhao (2017)
Compromise and Synergy in High-Efficiency Thermoelectric Materials. Advanced Materials, 29
C. Wan, Yifeng Wang, Ning Wang, K. Koumoto (2010)
Low-Thermal-Conductivity (MS)1+x(TiS2)2 (M = Pb, Bi, Sn) Misfit Layer Compounds for Bulk Thermoelectric Materials. Materials, 3
Y. Putri, C. Wan, Feng Dang, T. Mori, Yuto Ozawa, W. Norimatsu, M. Kusunoki, K. Koumoto (2014)
Effects of Transition Metal Substitution on the Thermoelectric Properties of Metallic (BiS)1.2(TiS2)2 Misfit Layer Sulfide. Journal of Electronic Materials, 43
Introduction

The development of sustainable concepts is a major research direction in many fields. In the field of thermoelectrics (TE), TE modules promise energy savings by harvesting waste heat. However, several conditions have to be met simultaneously when engineering TE materials, which are the starting point for TE modules. First and foremost, the figure of merit (zT) or the power factor of the compound should be large, since it directly determines the efficiency of the device. The figure of merit is defined by the simple relation $zT = S^2 T/(\rho\kappa)$, where S, κ, ρ, and T are the Seebeck coefficient, thermal conductivity, electrical resistivity, and the measurement temperature, respectively.

Several material classes have been investigated for their promising large zT and/or power factors: lead telluride,[1] chalcogenides,[2] inorganic clathrates,[3] Zintl phase compounds,[4] skutterudites,[5] (half-)Heusler compounds,[6] or, more recently, semiconducting polymers,[7] to name a few. Ideally, all of these materials should contain neither critical elements such as cobalt or platinum[8] nor toxic elements such as lead or cadmium.[9] Unfortunately, the latter excludes lead telluride from the list of candidates despite its extraordinarily large zT of more than 2.[1]

In general, the synthesis and subsequent investigation of a material candidate is a very time-consuming process in chemistry, physics, and engineering and usually requires a few years. The search for new materials is based on existing information and the expertise of the researchers involved (e.g., ref. [10]).
This expertise allows the researchers to make educated assumptions about the TE performance of materials not yet investigated or, ideally, leads to new quality factors as, for example, suggested by Zhang and coworkers.[11] It is possible to scale up this approach, but doing so requires machine learning (ML), where computer algorithms improve automatically through experience and through the use of data.[12]

Here, we present a classification ML approach for TE materials, resulting in a list of possible material candidates. We also discuss results obtained by other authors in the context of machine learning for thermoelectrics. To improve on previously obtained results, we expand the training dataset, perform a classification boundary analysis, and propose prediction and filtering procedures that leave us with a set of materials having potentially good thermoelectric properties. The training dataset on which our models are based comprises experimentally measured thermoelectric properties at different temperatures. We constructed a set of machine learning models at both fixed and nonfixed temperatures. In contrast to previous works (see, e.g., ref. [13]), we find optimal boundary values for the thermoelectric characteristics, at which materials are separated into classification classes, by performing a classification threshold analysis. In the next step, we limit ourselves to sustainable materials, excluding critical and toxic elements as defined by the European Union[8] and the World Health Organization (WHO),[9] respectively. Finally, we sort the compounds by their cost, assuming the prices of the respective elements. This allows the identification of candidates that not only promise high TE performance but also a good potential for subsequent exploitation in TE modules for applications.

Generating Training Dataset

We collected the training dataset from open databases and expanded it with manually extracted values.
In particular, we generated our dataset using the open-source Python library Matminer,[14] which provides a user-friendly interface to the APIs of a number of materials databases. However, the only database with experimentally measured TE properties turned out to be Citrination.[15] Due to the limited availability of TE data in databases, our starting dataset is very similar to the dataset presented by Gaultois and coworkers.[13,16,17] Their dataset contains experimentally measured thermoelectric characteristics at 300, 400, 700, and 900 K for about 250 unique compounds. We extracted 269 unique compounds as a starting point. In the case of duplicate entries for one compound, we took the average value. Then, we collected data from publications containing experimentally measured TE properties. The values from the manuscripts were extracted by hand, cleaned, and preprocessed, adding ≈200 compounds. Overall, our training dataset consists of more than 450 compounds (see the data file[18]).

These materials are described by their stoichiometric formula, Seebeck coefficient, and thermal and electrical conductivities measured at temperatures from 300 to 900 K in 100 K steps. For a number of materials, only some of the properties are given, or they are given only in a narrower range of temperatures. For instance, for CaCo3.9Cr0.1O9, only the Seebeck coefficient and the resistivity between 400 and 900 K were measured. If the Seebeck coefficient and the thermal and electrical conductivities are known for a given material, the thermoelectric power factor and figure of merit are calculated. Most of the materials are polycrystalline samples; some of them have misfit layer structures. Most materials in our database are oxides. This is depicted in Figure 1, where we plot the frequency at which a given element is contained in the compounds of our dataset.

Figure 1: Frequency of compounds in the dataset containing a given element.
Most of the materials at hand are oxides.

Features

Features are an essential ingredient when constructing an ML model. One of the main challenges one encounters in applying ML in materials science is feature generation. Indeed, a lot of physically relevant information is encoded in the crystal structure of a material. If one deals with a class of materials exhibiting a similar structure, the task of encoding it in the form of features can in principle be solved. However, given materials with very different types of structures, as in our case, the encoding problem becomes much harder. Instead of using structural information, a common strategy in this case is to exploit the stoichiometry of the materials and generate features from it. This approach has proven to be quite efficient in a number of previous studies dedicated to constructing ML models for predicting the physical properties of materials (see, e.g., refs. [13, 19–21]). Thus, the only information we use about a material is the number of elements in the compound, the chemical formula, and the physical properties of the constituent atoms. For every experimentally measured property of the pure elements, we build four features according to the stoichiometry of a given material. For a formula $A_a B_b$ and a physical property i associated with the elements A and B, we define

$$\max_i = \max(A_i, B_i), \qquad \min_i = \min(A_i, B_i),$$
$$M_i = w_a A_i + w_b B_i,$$
$$S_i = w_a (A_i - M_i)^2 + w_b (B_i - M_i)^2 \qquad (1)$$

where the weights are given by $w_a = \frac{a}{a+b}$ and $w_b = \frac{b}{a+b}$. The quantity M represents an average of the physical characteristics of the constituent atoms, while S can be thought of as a deformed standard deviation.
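The four features of Equation (1) can be sketched in a few lines; the function below is an illustration generalized to an arbitrary number of elements, not the exact implementation used in this work, and the property values in the example are hypothetical:

```python
def stoichiometric_features(composition, prop):
    """Features of Eq. (1) for one elemental property.

    composition: {element: stoichiometric amount}, e.g. {"Bi": 2, "Te": 3}
    prop: {element: elemental property value} for the same elements
    Returns (max_i, min_i, M_i, S_i).
    """
    total = sum(composition.values())
    values = [prop[el] for el in composition]
    weights = [amount / total for amount in composition.values()]
    mean = sum(w * v for w, v in zip(weights, values))          # M_i
    spread = sum(w * (v - mean) ** 2 for w, v in zip(weights, values))  # S_i
    return max(values), min(values), mean, spread

# Hypothetical binary compound A1B1 with property values 2.0 and 4.0:
mx, mn, M, S = stoichiometric_features({"A": 1, "B": 1}, {"A": 2.0, "B": 4.0})
```

For an equiatomic formula the weights are both 0.5, so M is the plain average and S the plain variance of the two elemental values.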
All these quantities for each material are concatenated into one vector, which is then used for machine learning model construction. In this work, features were generated with the help of the dataset of elemental properties used previously in ref. [19] for constructing a band-gap prediction model. The total number of generated features is more than a hundred. Many of them do not carry any useful information on the target physical characteristics. Moreover, the feature space is redundant, as can be seen from a simple correlation analysis. We calculated the Pearson correlation coefficient for all features and, having fixed the threshold above which we consider two features strongly correlated at 0.75, we are left with 36 features.

Constructing Machine Learning Models

As explained above, the features we use for machine learning model construction ignore the crystal structure and take only stoichiometric information into account. However, the crystal structure will be considered after the ML procedure. Machine learning models are merely able to identify the main patterns in the data. Our aim is to construct models capable of catching trends in the feature space related to the thermoelectric properties of materials. As such, it is more reasonable to focus on classification machine learning models. In the classification approach, each sample in the training dataset is given a label indicating that the material belongs to a specific class. A model is trained to predict whether a given sample belongs to a given class. In what follows, we concentrate on binary classification, in which all the data are separated into two different classes. The performance of binary classifiers is most often evaluated by an accuracy metric, which simply states how many predicted values match the values in the test data. This measure is misleading when the dataset is imbalanced.
This is exactly the case if one separates into classes materials with rare properties, such as a high thermoelectric figure of merit. An adequate metric should consider the interplay between false and true negative predictions and false and true positive predictions. With this in mind, we introduce the precision P and recall R of a model's performance as

$$P = tp/(tp+fp), \qquad R = tp/(tp+fn) \qquad (2)$$

where $tp$ is the number of true positives, $fp$ the number of false positives, and $fn$ the number of false negatives. Recall represents the fraction of all positive instances that are predicted as positive, while precision gives the proportion of positive identifications that are actually correct. The F1-score combines both of these quantities into a binary classification metric

$$F_1 = 2PR/(P+R) \qquad (3)$$

which, in contrast to the accuracy metric, is able to capture the nuances of the different types of errors. In what follows, we will use this metric to evaluate the performance of our models. There are plenty of machine learning algorithms. The most popular and profound among them belong to the class of deep learning algorithms. However, considering the limited amount of data we have, it is more reasonable to use a "classical" algorithm instead of neural networks. In the following, we use XGBoost,[22] an improved version of the gradient boosting algorithm, which has proved to be a powerful tool for machine learning tasks.

Classification Models

We start by constructing classification models that predict whether given thermoelectric values are acceptable for thermoelectric applications. Consider, for example, the Seebeck coefficient and the range of its values: in our case, the absolute values of the Seebeck coefficient range between 0 and 500 μV K−1.
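The metrics of Equations (2) and (3), together with the binarization of a continuous property at a boundary value, can be sketched as follows; the Seebeck values and ground-truth labels below are hypothetical and chosen only for illustration:

```python
def f1_score(y_true, y_pred):
    """F1 = 2PR/(P+R) with P = tp/(tp+fp) and R = tp/(tp+fn), cf. Eqs. (2)-(3)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Binarize hypothetical |S| values (in uV/K) at a boundary of 100 uV/K:
seebeck = [250, 80, 130, 40]
y_pred = [1 if s > 100 else 0 for s in seebeck]
y_true = [1, 0, 0, 1]                 # hypothetical ground-truth labels
score = f1_score(y_true, y_pred)      # tp=1, fp=1, fn=1 -> F1 = 0.5
```

Note that a classifier predicting "negative" for every sample of a heavily imbalanced dataset can reach high accuracy but has recall 0, which is exactly the failure mode the F1-score exposes.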
In order to formulate the task as a classification problem, we put a boundary at a certain value in this range and declare that all materials are separated into two classes according to the chosen boundary value. We thus have to define such boundary values for all thermoelectric characteristics. For example, Gaultois and coworkers[13] suggested the following values: $|S| > 100$ μV K−1; $\sigma > 100$ S cm−1; $\kappa < 10$ W m−1 K−1. In the next section, we start by constructing classification machine learning models for the thermoelectric characteristics at fixed temperatures. We compare our results with those obtained by other authors and then propose another scheme for training the models, in which the temperature is treated as a feature. In particular, we propose a threshold analysis to find optimal boundary values for machine learning model construction.

Classification at Fixed Temperatures

Here, we construct a classification model for each of the thermoelectric characteristics (Seebeck coefficient, thermal, and electrical conductivity) at room temperature. But first, let us discuss the feature selection procedure that we followed. A preliminary feature selection was performed in Section 3 using Pearson's coefficient, which measures the linear correlation between features. We may perform an additional feature selection, which has two aims. The first is to avoid overfitting by decreasing the number of features used for machine learning model construction. The second is to attempt to improve the performance of the models by selecting only the relevant features. To make this additional feature selection, we applied a genetic algorithm using the Python module DEAP.[23] The genetic algorithm is a stochastic optimization method. In our case, it aims to find the (fixed number of) features that maximize the F1-score.
We analyzed different numbers of features and found that, without a great loss in accuracy, this number can be taken to be between 18 and 20. In some cases, reducing the number of features even increased the accuracy of the models. Running the algorithm several times for the same number of features, one notices that it yields slightly different feature sets. Namely, due to the stochastic nature of the algorithm, every run gives sets of features that include different samples. Overall, the models demonstrate similar performance, having approximately the same F1-score. One possible explanation for this observation is that the features in the whole set are still highly correlated, meaning that we are able to substitute them for one another without a loss in accuracy of the predictive model. The features used for the genetic algorithm can be found in the Supporting Information.

The dataset is imbalanced, meaning that the number of high-$zT$ materials is smaller than the number of materials with poor TE properties. When constructing machine learning models at fixed temperature, we extended the dataset with a number of synthetic samples. In an attempt to improve the performance of the constructed classifiers in this way, we applied the SMOTE algorithm.[24] The algorithm generates new samples in the minority class along the lines joining nearest neighbors of the minority class in the feature space. As a result, we obtained a balanced dataset with an equal number of representatives in both classes. Unfortunately, this approach did not yield a significant improvement in the models' performance.

Let us now turn to the results. First of all, we constructed classification models for the Seebeck coefficient and the electrical and thermal conductivities for the boundary values mentioned at the beginning of Section 4.1. Figure 2 shows the error distribution for our models in the form proposed in ref. [13].
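The core idea of SMOTE, interpolating between a minority-class point and one of its minority-class neighbours, can be sketched in a few lines of NumPy. This is a bare-bones illustration of the interpolation step only, not the full algorithm of ref. [24], and the minority points below are hypothetical:

```python
import numpy as np

def smote_like_sample(minority, rng):
    """Generate one synthetic minority sample by interpolating between a
    randomly chosen minority point and its nearest minority neighbour."""
    i = rng.integers(len(minority))
    x = minority[i]
    others = np.delete(minority, i, axis=0)          # remaining minority points
    j = np.argmin(np.linalg.norm(others - x, axis=1))  # nearest neighbour index
    gap = rng.random()                               # interpolation factor in [0, 1)
    return x + gap * (others[j] - x)                 # point on the joining line

rng = np.random.default_rng(0)
minority = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
synthetic = smote_like_sample(minority, rng)
```

Because the synthetic point lies on the segment joining two existing minority samples, it always stays inside the region already occupied by the minority class in feature space.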
The model for each of the TE properties gives a score indicating whether a material belongs to the class defined as good for thermoelectric applications. In order to validate the models, we used the leave-one-out cross-validation technique. Within this model validation approach, the whole dataset is split into train and test parts, where the test set consists of only one sample, while all the remaining samples are used to train the model. Errors approaching +1 correspond to false negative predictions, while those approaching −1 correspond to false positives. In order to map these scores to a discrete class, one has to select a threshold value according to which the materials are classified; it is usually set to 0.5. Taking this into account and comparing Figure 2 with the corresponding figure from the paper,[13] one may conclude that our model demonstrates better performance for the electrical and thermal conductivities, while showing similar results for the Seebeck coefficient.

Figure 2: Error distribution histograms of leave-one-out cross-validation for the Seebeck coefficient and the thermal and electrical conductivities measured at 300 K.

It should be noted that the values we have taken as boundary values correspond to a relatively small thermoelectric figure of merit. For instance, at 1000 K, the minimum value of $zT$ equals 10−2. Moreover, the high precision of the model at predicting the thermal conductivity can be explained by a simple fact: the fraction of materials declared as potentially good thermoelectrics according to the criterion $\kappa < 10$ W m−1 K−1 is 0.98, meaning that almost all materials in the dataset pass this criterion. We should mention that the value of this fraction was increased by the new data that we added to the original dataset of ref. [13]. If one calculates this value only for the data used by those authors, one obtains ≈0.62. A more accurate analysis is needed to define acceptable values for the boundaries.
There are two main criteria: First, the value of $zT$ corresponding to the boundary must be relatively high. At the same time, the constructed models should demonstrate reasonable performance; this second criterion is directly related to the boundary values one takes. It is thus reasonable to analyze the interplay between the accuracy of the models and the boundary values used to train them. We perform such an analysis in the next section.

Classification with Temperature as a Feature

The dataset we are using in this paper contains thermoelectric characteristics measured at different temperatures. A proper approach to constructing predictive models should rely on all the available data. Instead of training models at a fixed temperature, it is more reasonable to use the temperature as a feature and train the models on the whole dataset. To proceed, we added the temperature and its second, third, and fourth powers to the dataset and used them as features as well.

Let us first describe the procedure we used to evaluate the performance of the constructed machine learning models. The dataset comprises thermoelectric properties measured at different temperatures, which means that the features of two samples (materials) may differ only by this parameter. For such a dataset, evaluating the models' performance using the standard train-test split and cross-validation algorithms is not valid. The reason is that, when randomly selecting samples for the train and test subsets, there will be samples that correspond to the same material but are measured at different temperatures. In this case, it would be much easier for the models to make an accurate prediction, and we could thus no longer rely on the models' performance evaluation. To avoid this issue, in the train-test splitting procedure we select samples in such a way that those corresponding to the same material always end up either all in the train subset or all in the test subset.
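The grouped splitting rule just described can be sketched with the standard library alone; the split is done over materials rather than over individual measurements, so no material straddles the train/test boundary. The material names below are hypothetical placeholders:

```python
import random

def grouped_train_test_split(materials, test_fraction=0.2, seed=0):
    """Split sample indices so that all measurements belonging to one
    material land on the same side of the train/test boundary."""
    groups = sorted(set(materials))          # unique material identifiers
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(test_fraction * len(groups)))
    test_groups = set(groups[:n_test])
    train = [i for i, m in enumerate(materials) if m not in test_groups]
    test = [i for i, m in enumerate(materials) if m in test_groups]
    return train, test

# Six measurements of three hypothetical materials, two temperatures each:
materials = ["Bi2Te3", "Bi2Te3", "PbTe", "PbTe", "ZnO", "ZnO"]
train, test = grouped_train_test_split(materials)
```

Because whole materials are assigned to the test side, the realized test proportion matches the nominal 0.2 only on average, exactly as noted in the text.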
In the cross-validation algorithm, the proportion of the dataset to include in the test split was taken to be 0.2. As mentioned in Section 2, devoted to the dataset, there are materials for which we know the thermoelectric properties at certain temperature points only (e.g., at 400 and 500 K). This means that the test proportion is equal to 0.2 only on average.

In order to find optimal boundary values for the thermoelectric characteristics at which materials should be separated into two classes, we lay a grid over the range of values of each property and, considering the resulting grid points as classification boundaries, train and validate machine learning models (see Figure 3). When declaring a classification boundary at a certain value, one obtains two classes of materials according to this separation. Materials for which a given characteristic falls into the class in which the material is more likely to be a good thermoelectric are said to have positive labels. The red lines in Figure 3 show the fraction of positive labels at each boundary value for which we trained and validated a model. When moving to higher boundary values for the Seebeck coefficient and electrical conductivity, the fraction of positively labeled materials decreases, while the model performance worsens and the uncertainty of the prediction increases. One observes the same behavior for the thermal conductivity when choosing smaller boundary values for classification. In particular, as mentioned before, when choosing the boundary value for the thermal conductivity to be $\kappa = 10$ W m−1 K−1, the fraction of positive labels (those for which $\kappa < 10$ W m−1 K−1) is very close to 1.
Based on this analysis, a natural choice for the boundary values is the following: $|S| > 150$ μV K−1; $\sigma > 500$ S cm−1; $\kappa < 3$ W m−1 K−1.

Figure 3: Performance of the classification models at nonfixed temperature for a) Seebeck coefficient, b) electrical conductivity, c) thermal conductivity, and d) figure of merit, plotted against the F1-score and the fraction of positive labels. The gray area shows the standard deviation in the accuracy of the predictions (measured by the F1-score), while each point of the blue line corresponds to the mean value of the F1-score. Points of the red line indicate the fraction of positive labels in the dataset for a given boundary value.

The analysis made above is sensitive to materials that have, for instance, a relatively low Seebeck coefficient and, at the same time, a low thermal conductivity. The machine learning models may misclassify such materials and incorrectly label them as poor thermoelectrics. To avoid this issue, it is more appropriate to construct a model aimed at predicting zT (or the power factor) directly. We performed the same analysis as for the Seebeck coefficient and the thermal and electrical conductivities for the thermoelectric figure of merit. Figure 3d shows the performance of the constructed models at different boundary values for zT. One may see that when going from a zT value of 0.1 to 0.25, the fraction of positively labeled materials decreases drastically, while the performance of the prediction model does not change significantly. Again, the analysis of the figure suggests a natural choice for the classification boundary with respect to the figure of merit of zT = 0.25.
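The fraction-of-positive-labels curve swept over a grid of candidate boundaries (the red lines of Figure 3) can be sketched as below; in the full analysis a classifier would additionally be trained and F1-scored at each grid point. The zT values here are hypothetical:

```python
def positive_fraction(values, boundary, good_if_above=True):
    """Fraction of samples that receive a positive label at a given
    classification boundary."""
    flags = [(v > boundary) if good_if_above else (v < boundary) for v in values]
    return sum(flags) / len(flags)

# Hypothetical measured zT values, swept over a small grid of boundaries:
zt_values = [0.02, 0.05, 0.1, 0.2, 0.4, 1.2]
fractions = {b: positive_fraction(zt_values, b) for b in (0.1, 0.25, 0.5)}
```

For the thermal conductivity one would pass `good_if_above=False`, since there a material is labeled positive when the value falls *below* the boundary.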
Note that an analogous analysis could be done using the power factor, yielding similar results.

Cluster Analysis

The dataset contains samples from different families of materials, and each of them is represented by a point in the feature space. It is interesting to see whether these materials can be grouped into clusters in the feature space that share similar thermoelectric properties. In other words, our aim is to segregate groups of materials according to the features we generated. There are plenty of techniques within the unsupervised learning approach for solving such tasks, known as clustering. Here we use the hierarchical clustering method (see, e.g., ref. [25]). Hierarchical classifications may be represented by a 2D diagram known as a dendrogram (see Figure 4). Along the horizontal axis, each discrete point represents a material. We perform the hierarchical classification in the space of features that were used for the supervised classification analysis of the thermoelectric figure of merit in the previous section, except that we removed the temperature from the list of features and are thus left with only unique materials for the clustering. One may introduce a metric in the feature space and organize samples into groups on the basis of how close they are according to this metric. Here we choose the Euclidean metric. Each feature takes values in a certain range. In order to prevent a bias toward features taking values in a large range, all of them were normalized before proceeding with the cluster analysis. Within the normalization procedure used here, one takes the difference between the maximum and the minimum values of a given feature and, for each sample in the dataset, divides the feature values by this quantity. We performed the normalization procedure on the dataset combining the original dataset used for training our models and the dataset for which we are going to make predictions (described in more detail in the next section).
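The range normalization just described can be sketched in plain Python; the feature rows below are hypothetical, and a zero-range column is left untouched (a detail not specified in the text, assumed here for safety):

```python
def range_normalize(rows):
    """Divide each feature column by its (max - min) range so that
    wide-ranging features do not dominate the Euclidean distances."""
    cols = list(zip(*rows))                             # column-wise view
    spans = [(max(c) - min(c)) or 1.0 for c in cols]    # 0 range -> divide by 1
    return [[v / s for v, s in zip(row, spans)] for row in rows]

# Two hypothetical features with very different ranges (4 vs. 40):
normalized = range_normalize([[0.0, 10.0], [2.0, 30.0], [4.0, 50.0]])
```

After normalization, both columns vary over a unit range, so neither feature biases the Euclidean distances used for the hierarchical fusion.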
The clustering algorithm produces a series of partitions of the data, fusing the materials or groups of materials which are closest according to the Euclidean distances. At each stage of this fusion, the number of groups is reduced. The vertical axis in Figure 4 represents the distances. Each compound is joined with a vertical line, and the diagram shows at which distance each fusion of materials into groups is made. One has to choose where to stop the fusion process and thus be left with several groups of materials. Based on the hierarchical structure of the dendrogram, we choose the optimal number of clusters to be four.

Figure 4: a) Results of the clustering algorithm in the form of a dendrogram. The label "Distances" refers to the Euclidean distances between materials and groups of materials in the feature space. According to the clustering scheme, there are four classes having 75, 100, 165, and 133 elements, respectively. b) Performance of the zT-classification model in each class and the number of samples in each class having zT > 0.25. All these numbers were normalized such that they sum to one.

It should be noted that materials grouped into one family sometimes do not seem to have anything in common. Since the features were constructed from stoichiometry only, a cluster may include materials having completely different constituent elements; nevertheless, they may be quite close to each other according to the metric in the feature space. Let us now briefly describe the main representatives of the clusters (see the data file[18]).
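The fusion process described above, followed by a cut of the dendrogram into four groups, can be sketched with SciPy's hierarchical clustering tools. The toy 2D points stand in for the normalized material features; the choice of Ward linkage is an assumption, as the paper does not state the linkage criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four well-separated toy blobs standing in for the normalized features.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=center, scale=0.1, size=(10, 2))
                    for center in ([0, 0], [0, 3], [3, 0], [3, 3])])

# Build the fusion hierarchy with Euclidean distances (Ward linkage assumed).
Z = linkage(points, method="ward", metric="euclidean")

# Stop the fusion process so that exactly four groups remain.
cluster_labels = fcluster(Z, t=4, criterion="maxclust")
```

`Z` encodes every fusion step together with the distance at which it occurs, which is exactly the information a dendrogram plots; `fcluster` then cuts the hierarchy at the level that leaves four clusters.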
The orange cluster is populated predominantly by misfit layer materials with a series of transition metal elements (Fe, V, Cu, Mn, Zn, Co, Ni, Cr).[26–29] Several data points are represented by skutterudite compounds[30,31] and silver-based thermoelectric materials.[32,33] A large part of the materials in the green cluster comprises polycrystalline ceramics of the Ca$_{1-x}$Gd$_x$MnO$_3$[34] and Ca$_{1-x}$Bi$_x$Mn$_{1-y}$Nb$_y$O$_3$[35] families, along with other calcium-containing compounds.[36–38] The red and the violet clusters are more inhomogeneous. For instance, the red class includes Cu-intercalated selenides,[39] zinc,[40] and indium[41] oxides, while the violet family involves NiSn-based half-Heusler compounds[42] together with Ge-based clathrates.[43]

The dendrogram is accompanied by a histogram showing the performance of the constructed zT-classification model with respect to the class the materials belong to, as well as the number of samples in each class having zT > 0.25. The clustering procedure relies only on information encoded in the features and on distances in the feature space. Yet one may see that the frequency of occurrence of high-zT materials in the orange and green classes is much lower compared to the red and violet ones. This observation hints that the features we use properly capture the thermoelectric properties of the materials. The classification model demonstrates its best performance for the orange class, with only a small number of samples misclassified during the cross-validation procedure. This can possibly be attributed to the fact that this cluster is the most homogeneous, in the sense that it consists of materials belonging to a small number of different families. Conversely, the model performs worst on the violet cluster, owing to its inhomogeneity.

Predictions

We are now ready to use the constructed models to make predictions for materials which are not in our training dataset.
Using the Materials Project API,[44] we collected data for more than 125 000 materials. As we generate features from stoichiometric information only, we collected just the chemical formulas, together with the Materials Project IDs, so that further information on these materials can be retrieved if desired. As indicated before, we limited ourselves to sustainable, noncritical, and nontoxic materials. After applying this filtering procedure, we are left with a set of 25 111 materials having unique stoichiometry. Furthermore, we chose materials for which large-scale synthesis seems feasible, that is, those with a cubic unit cell, which, for example, leads to the material classes of skutterudites[5] and (half-)Heusler compounds.[6] This gave us a set of 4450 compounds which was used for further analysis. It should be noted that compounds with a higher number of constituent elements often exhibit a better TE performance due to the increased size of their unit cell.

In the previous sections, we described the set of prediction models that were constructed. When one deals with several machine learning models that are eventually aimed at predicting one property, a natural choice is to use a stacking algorithm, which by construction uses all the predictions made by the other models to make the final decision on whether a given material has good thermoelectric properties. However, here we have two sets of models, trained at fixed and at nonfixed temperatures, and it is thus not evident how to optimally combine them into one stacking model. We therefore propose another strategy for making predictions. First, each constructed machine learning model for each of the thermoelectric properties (Seebeck coefficient, thermoelectric figure of merit, electrical and thermal conductivity) assigns a label (0 or 1) to a given material. The label "1" classifies the material as having the property above or below a certain threshold for temperatures of 400, 500, and 700 K.
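The filtering step described at the beginning of this section (keeping only cubic materials free of critical or toxic elements) can be sketched as below. The records imitate fields one might request from the Materials Project API, for instance via pymatgen's MPRester; the field names, the element blacklist, and the helper function are illustrative assumptions, not the authors' pipeline.

```python
import re

# Illustrative stand-ins for records downloaded from the Materials Project
# API (field names are assumptions; only formula and ID are needed here).
records = [
    {"material_id": "mp-0001", "pretty_formula": "Y2AlAg",  "crystal_system": "cubic"},
    {"material_id": "mp-0002", "pretty_formula": "Bi2Te3",  "crystal_system": "trigonal"},
    {"material_id": "mp-0003", "pretty_formula": "SmAg2Sn", "crystal_system": "cubic"},
]

# Illustrative subset only; the paper uses the EU criticality list and the
# WHO IPCS toxicity information for the actual filtering.
TOXIC_OR_CRITICAL = {"Pb", "Cd", "Hg"}

def elements(formula: str) -> set:
    """Crude element extraction: a capital letter starts an element symbol."""
    return set(re.findall(r"[A-Z][a-z]?", formula))

cubic_and_benign = [
    r for r in records
    if r["crystal_system"] == "cubic"
    and not (elements(r["pretty_formula"]) & TOXIC_OR_CRITICAL)
]
```

Filtering on the crystal system first is cheap, and the element check then only has to parse the formulas of the surviving candidates.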
In total, we end up with 21 labels for each of the materials. The more positive labels a material has, the more certain one can be that the material has good thermoelectric properties; thus, we sort all the materials according to this criterion. In the next step, we arrange the materials according to the distances they have in the feature space with respect to the training data points: the closer a predicted material is located to a known material from the training dataset, the more certain one can be about the result of the prediction (see Supporting Information for details).

Overall, these steps attempt to ensure that only reasonable TE materials are included in our list. First, we filtered out critical and toxic elements, which reduced the number of candidates from 125k to 25k. Then, we considered only cubic materials to ensure feasible large-scale synthesis. Finally, we assigned 21 binary labels to find the most promising materials. The distance to known good TE materials additionally allows one to gauge the certainty of the prediction.

The complete list of materials for which we predicted thermoelectric properties can be found in the open repository of TU Dresden.[18] As a side note, we would like to point out that the metallic materials in the list will likely exhibit large power factors, while the materials with a band gap will display high zT values. Using the obtained results, we chose a list of potentially good thermoelectric materials. We then performed a literature search on the top predicted materials to investigate whether the algorithm led to reasonable results.
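The two-stage ordering described above (sort by the number of positive labels, then break ties by proximity to the training data) can be sketched as follows; the candidate names, label counts, and feature vectors are purely illustrative.

```python
import numpy as np

# Illustrative training points in the (normalized) feature space.
train_feats = np.array([[0.0, 0.0],
                        [1.0, 1.0]])

# name: (number of positive labels out of 21, feature vector) -- all made up.
candidates = {
    "A": (18, np.array([0.1, 0.1])),
    "B": (21, np.array([5.0, 5.0])),
    "C": (21, np.array([0.9, 1.1])),
}

def nearest_train_distance(x: np.ndarray) -> float:
    """Euclidean distance to the closest training point."""
    return float(np.min(np.linalg.norm(train_feats - x, axis=1)))

# Primary key: label count (descending); secondary key: distance (ascending).
ranked = sorted(candidates,
                key=lambda m: (-candidates[m][0],
                               nearest_train_distance(candidates[m][1])))
```

With these toy numbers, "B" and "C" tie on labels, but "C" sits closer to a known training material and is therefore ranked first, reflecting the higher confidence in its prediction.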
Here, we immediately found five publications on materials investigated for their TE properties that were not included in our original training dataset, namely ZnTe (despite its large bandgap), Gd3Alx, ZrAl3, GdAg, and YbAl3.[45–49] This already hints at the general functioning of our algorithm. From the remaining top predictions, we hand-picked five particularly interesting materials: Y2AlAg, SmAg2Sn, CaEuAg2, Cs3Cu20Te13, and AlCuSe2. Y2AlAg is an interesting modification of the C15 compound YAl2. SmAg2Sn and CaEuAg2 share some similarities with the full Heusler compounds, which are known to exhibit good TE properties. Cs3Cu20Te13 has a very large unit cell and promises an associated small thermal conductivity and, subsequently, good TE properties. Finally, AlCuSe2 consists only of elements which are particularly biocompatible.

Conclusion

In summary, we used machine learning to generate a list of experimentally uninvestigated materials with predicted good thermoelectric properties. This approach allows us to combine requirements which are mostly unrelated to each other but should all be considered for future thermoelectric applications: a high zT value or power factor is obviously required; in addition, the utilized materials should be neither toxic nor critical, where we used the definitions put forward by the European Union and the WHO; finally, good scalability is desired from an engineering perspective to allow cheap production and subsequent application in many fields. The latter can, for example, be achieved by preferring materials with a cubic unit cell.

Acknowledgements

All authors acknowledge the support by Oleg Janson. The work of D.C.
was supported by the Alexander von Humboldt Foundation. Open Access funding enabled and organized by Projekt DEAL.

Conflict of Interest

The authors declare no conflict of interest.

Data Availability Statement

The data that support the findings of this study are openly available in OPARA at https://doi.org/10.25532/OPARA‐164, reference number 164.

References

[1] K. Biswas, J. He, I. D. Blum, C.‐I. Wu, T. P. Hogan, D. N. Seidman, V. P. Dravid, M. G. Kanatzidis, Nature 2012, 489, 414.
[2] Y. Shi, C. Sturm, H. Kleinke, J. Solid State Chem. 2019, 270, 273.
[3] N. P. Blake, L. Mollnitz, G. Kresse, H. Metiu, J. Chem. Phys. 1999, 111, 3133.
[4] J. Zhang, L. Song, B. B. Iversen, npj Comput. Mater. 2019, 5, 76.
[5] G. Rogl, P. Rogl, Curr. Opin. Green Sustainable Chem. 2017, 4, 50.
[6] W. G. Zeier, J. Schmitt, G. Hautier, U. Aydemir, Z. M. Gibbs, C. Felser, G. J. Snyder, Nat. Rev. Mater. 2016, 1, 16032.
[7] S. N. Patel, A. M. Glaudell, K. A. Peterson, E. M. Thomas, K. A. O'Hara, E. Lim, M. L. Chabinyc, Sci. Adv. 2017, 3, e1700434.
[8] Critical raw materials resilience: Charting a path towards greater security and sustainability (2020), https://eur‐lex.europa.eu/legal‐content/EN/TXT/?uri=CELEX:52020DC0474 (accessed: December 2021).
[9] International Programme on Chemical Safety (IPCS) (2021), https://www.who.int/ipcs/ (accessed: December 2021).
[10] T. Zhu, Y. Liu, C. Fu, J. P. Heremans, J. G. Snyder, X. Zhao, Adv. Mater. 2017, 29, 1605884.
[11] X. Zhang, Z. Bu, X. Shi, Z. Chen, S. Lin, B. Shan, M. Wood, A. H. Snyder, L. Chen, G. J. Snyder, Y. Pei, Sci. Adv. 2020, 6, eabc0726.
[12] T. M. Mitchell, Machine Learning, McGraw‐Hill, New York 1997.
[13] M. W. Gaultois, A. O. Oliynyk, A. Mar, T. D. Sparks, G. J. Mulholland, B. Meredig, APL Mater. 2016, 4, 053213.
[14] L. Ward, A. Dunn, A. Faghaninia, N. E. Zimmermann, S. Bajaj, Q. Wang, J. Montoya, J. Chen, K. Bystrom, M. Dylla, K. Chard, M. Asta, K. A. Persson, G. J. Snyder, I. Foster, A. Jain, Comput. Mater. Sci. 2018, 152, 60.
[15] Citrination (2021), https://doi.org/10.25504/FAIRsharing.x6y19r (accessed: December 2021).
[16] A. Furmanchuk, J. E. Saal, J. W. Doak, G. B. Olson, A. Choudhary, A. Agrawal, J. Comput. Chem. 2018, 39, 191.
[17] M. W. Gaultois, T. D. Sparks, C. K. H. Borg, R. Seshadri, W. D. Bonificio, D. R. Clarke, Chem. Mater. 2013, 25, 2911.
[18] A. Thomas, D. Chernyavsky, Training and predicted data for machine learning on thermoelectric materials (2022), https://doi.org/10.25532/OPARA‐164 (accessed: September 2022).
[19] Y. Zhuo, A. M. Tehrani, J. Brgoch, J. Phys. Chem. Lett. 2018, 9, 1668.
[20] V. Stanev, C. Oses, A. G. Kusne, E. Rodriguez, J. Paglione, S. Curtarolo, I. Takeuchi, npj Comput. Mater. 2018, 4, 29.
[21] N. Claussen, B. A. Bernevig, N. Regnault, Phys. Rev. B 2020, 101, 245117.
[22] T. Chen, C. Guestrin, in Proc. of the 22nd ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, ACM Press, New York, NY 2016, pp. 785–794.
[23] F.‐A. Fortin, F.‐M. De Rainville, M.‐A. Gardner, M. Parizeau, C. Gagné, J. Mach. Learn. Res. 2012, 13, 2171.
[24] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, J. Artif. Intell. Res. 2002, 16, 321.
[25] B. S. Everitt, S. Landau, M. Leese, D. Stahl, Cluster Analysis, Wiley, New York 2011.
[26] Y. E. Putri, C. Wan, F. Dang, T. Mori, Y. Ozawa, W. Norimatsu, M. Kusunoki, K. Koumoto, J. Electron. Mater. 2014, 43, 1870.
[27] C. Uhlig, E. Guenes, A. S. Schulze, M. T. Elm, P. J. Klar, S. Schlecht, J. Electron. Mater. 2014, 43, 2362.
[28] C. Wan, Y. Wang, N. Wang, K. Koumoto, Materials 2010, 3, 2606.
[29] Y. E. Putri, C. Wan, Y. Wang, W. Norimatsu, M. Kusunoki, K. Koumoto, Scr. Mater. 2012, 66, 895.
[30] X. Shi, J. Yang, J. R. Salvador, M. Chi, J. Y. Cho, H. Wang, S. Bai, J. Yang, W. Zhang, L. Chen, J. Am. Chem. Soc. 2011, 133, 7837.
[31] Y. Z. Pei, S. Q. Bai, X. Y. Zhao, W. Zhang, L. D. Chen, Solid State Sci. 2008, 10, 1422.
[32] K. Ahn, M. G. Kanatzidis, MRS Online Proc. Libr. 2007, 1044408.
[33] K. F. Hsu, S. Loo, F. Guo, W. Chen, J. S. Dyck, C. Uher, T. Hogan, E. K. Polychroniadis, M. G. Kanatzidis, Science 2004, 303, 818.
[34] J. Lan, Y.‐H. Lin, H. Fang, A. Mei, C.‐W. Nan, Y. Liu, S. Xu, M. Peters, J. Am. Ceram. Soc. 2010, 93, 2121.
[35] J. W. Park, D. H. Kwak, S. H. Yoon, S. C. Choi, J. Alloys Compd. 2009, 487, 550.
[36] Y. Zhou, I. Matsubara, R. Funahashi, G. Xu, M. Shikano, Mater. Res. Bull. 2003, 38, 341.
[37] D. Flahaut, T. Mihara, R. Funahashi, N. Nabeshima, K. Lee, H. Ohta, K. Koumoto, J. Appl. Phys. 2006, 100, 084911.
[38] S. Populoh, M. Trottmann, M. H. Aguire, A. Weidenkaff, J. Mater. Res. 2011, 26, 1947.
[39] R. Bhatt, S. Bhattacharya, M. Patel, R. Basu, A. Singh, C. Sürger, M. Navaneethan, Y. Hayakawa, D. K. Aswal, S. K. Gupta, J. Appl. Phys. 2013, 114, 114509.
[40] E. Guilmeau, A. Maignan, C. Martin, J. Electron. Mater. 2009, 38, 1104.
[41] D. Bérardan, E. Guilmeau, A. Maignan, B. Raveau, Solid State Commun. 2008, 146, 97.
[42] S. Sakurada, N. Shutoh, Appl. Phys. Lett. 2005, 86, 082105.
[43] G. S. Nolas, J. L. Cohn, G. A. Slack, S. B. Schujman, Appl. Phys. Lett. 1998, 73, 178.
[44] S. P. Ong, S. Cholia, A. Jain, M. Brafman, D. Gunter, G. Ceder, K. A. Persson, Comput. Mater. Sci. 2015, 97, 209.
[45] W. S. Chen, C. H. Hsu, W. H. Kao, Y. T. Yu, P. C. Yang, C. F. Tseng, C. H. Lai, Y. M. Lee, H. W. Yang, J. S. Lin, in Applied Mechanics and Materials, Vol. 535, Trans Tech Publications, Warwick, NY 2014, pp. 688–691.
[46] M. Nabi, D. C. Gupta, Int. J. Energy Res. 2020, 44, 1654.
[47] C. Colinet, A. Pasturel, J. Alloys Compd. 2001, 319, 154.
[48] M. M. Amado, R. P. Pinto, M. E. Braga, J. B. Sousa, P. Morin, J. Magn. Magn. Mater. 1996, 153, 107.
[49] D. He, W. Zhao, X. Mu, H. Zhou, P. Wei, W. Zhu, X. Nie, X. Su, H. Liu, J. He, Q. Zhang, J. Alloys Compd. 2017, 725, 1297.
Advanced Theory and Simulations – Wiley
Published: Nov 1, 2022
Keywords: machine learning; materials science; thermoelectrics