X. Guo, O. Christensen, T. Ostersen, Y. Wang, M. Lund, G. Su (2015)
Improving genetic evaluation of litter size and piglet mortality for both genotyped and nongenotyped individuals using a single-step method. Journal of Animal Science, 93(2)
Mang Liang, Tianpeng Chang, B. An, X. Duan, Lili Du, Xiaoqiao Wang, Jian Miao, Lingyang Xu, Xue Gao, Lupei Zhang, Junya Li, Huijiang Gao (2020)
A Stacking Ensemble Learning Framework for Genomic Prediction. Frontiers in Genetics, 12
Kaname Kojima, Shu Tadaka, F. Katsuoka, G. Tamiya, Masayuki Yamamoto, K. Kinoshita (2020)
A genotype imputation method for de-identified haplotype reference information by using recurrent neural network. PLoS Computational Biology, 16
D. Habier, R. Fernando, K. Kızılkaya, D. Garrick (2011)
Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics, 12
(2016)
Machine Learning. Beijing, China: TU Press
O. González-Recio, G. Rosa, D. Gianola (2014)
Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livestock Science, 166
J. Steiger (1980)
Tests for comparing elements of a correlation matrix. Psychological Bulletin, 87
Hailiang Song, Junling Zhang, Y. Jiang, H. Gao, S. Tang, S. Mi, F. Yu, Q. Meng, W. Xiao, Qin Zhang, Xiangdong Ding (2017)
Genomic prediction for growth and reproduction traits in pig using an admixed reference population. Journal of Animal Science, 95(8)
F. Ghafouri-Kesbi, G. Rahimi-Mianji, M. Honarvar, A. Nejati-Javaremi (2016)
Predictive ability of Random Forests, Boosting, Support Vector Machines and Genomic Best Linear Unbiased Prediction in different scenarios of genomic evaluation. Animal Production Science, 57
T. Meuwissen, B. Hayes, M. Goddard (2001)
Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4)
Christina Azodi, Emily Bolger, A. Mccarren, M. Roantree, G. Campos, Shin-Han Shiu (2019)
Benchmarking Parametric and Machine Learning Models for Genomic Prediction of Complex Traits. G3: Genes|Genomes|Genetics, 9
A. Roos, C. Schrooten, R. Veerkamp, J. Arendonk (2011)
Effects of genomic selection on genetic improvement, inbreeding, and merit of young versus proven bulls. Journal of Dairy Science, 94(3)
Xiujin Li, Sheng Wang, Ju Huang, Leyi Li, Qin Zhang, Xiangdong Ding (2014)
Improving the accuracy of genomic prediction in Chinese Holstein cattle by using one-step blending. Genetics Selection Evolution, 46
Peter Exterkate, P. Groenen, C. Heij, Dick Dijk (2011)
Nonlinear Forecasting with Many Predictors Using Kernel Ridge Regression. ERN: Econometric Modeling in Macroeconomics (Topic)
O. Montesinos-López, A. Montesinos-López, P. Pérez-Rodríguez, José Barrón-López, J. Martini, S. Fajardo-Flores, L. Gaytán-Lugo, P. Santana-Mancilla, J. Crossa (2021)
A review of deep learning applications for genomic selection. BMC Genomics, 22
S. Forni, I. Aguilar, I. Misztal (2011)
Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genetics Selection Evolution, 43
Rui Fa, D. Cozzetto, Cen Wan, David Jones (2018)
Predicting human protein function with multi-task deep neural networks. PLoS ONE, 13
D. Shrestha, D. Solomatine (2006)
Experiments with AdaBoost.RT, an Improved Boosting Scheme for Regression. Neural Computation, 18
I. Misztal, A. Legarra, I. Aguilar (2009)
Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. Journal of Dairy Science, 92(9)
F. Noé, G. Fabritiis, C. Clementi (2019)
Machine learning for protein folding and dynamics. Current Opinion in Structural Biology, 60
J. González-Camacho, L. Ornella, P. Pérez-Rodríguez, D. Gianola, S. Dreisigacker, J. Crossa (2018)
Applications of Machine Learning Methods to Genomic Selection in Breeding Wheat for Rust Resistance. The Plant Genome, 11
Mang Liang, Jian Miao, Xiaoqiao Wang, Tianpeng Chang, B. An, X. Duan, Lingyang Xu, Xue Gao, Lupei Zhang, Junya Li, Huijiang Gao (2020)
Application of ensemble learning to genomic selection in Chinese Simmental beef cattle. Journal of Animal Breeding and Genetics
J. Jensen, O. Christensen, G. Sahana (2014)
DMU - A Package for Analyzing Multivariate Mixed Models in Quantitative Genetics and Genomics. Proceedings, 10th World Congress of Genetics Applied to Livestock Production
E. Heffner, J. Jannink, M. Sorrells (2011)
Genomic Selection Accuracy using Multifamily Prediction Models in a Wheat Breeding Program. The Plant Genome, 4
M. Piles, R. Bergsma, D. Gianola, H. Gilbert, L. Tusell (2021)
Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning. Frontiers in Genetics, 12
Shaolei Shi, Xiujin Li, L. Fang, Aoxing Liu, G. Su, Yi Zhang, Basang Luobu, Xiangdong Ding, Shengli Zhang (2021)
Genomic Prediction Using Bayesian Regression Models With Global–Local Prior. Frontiers in Genetics, 12
B. Browning, S. Browning (2009)
A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. American Journal of Human Genetics, 84(2)
O. Montesinos-López, J. Martín-Vallejo, J. Crossa, D. Gianola, C. Hernández-Suárez, A. Montesinos-López, Philomin Juliana, R. Singh (2018)
A Benchmarking Between Deep Learning, Support Vector Machine and Bayesian Threshold Best Linear Unbiased Prediction for Predicting Ordinal Traits in Plant Breeding. G3: Genes|Genomes|Genetics, 9
B. Hayes, P. Bowman, A. Chamberlain, M. Goddard (2009)
Invited review: Genomic selection in dairy cattle: progress and challenges. Journal of Dairy Science, 92(2)
Hailiang Song, Qin Zhang, Xiangdong Ding (2020)
The superiority of multi-trait models with genotype-by-environment interactions in a limited number of environments for genomic prediction in pigs. Journal of Animal Science and Biotechnology, 11
L. Schaeffer (2006)
Strategy for applying genome-wide selection in dairy cattle. Journal of Animal Breeding and Genetics, 123(4)
Andreas Müller, Sarah Guido (2016)
Introduction to Machine Learning with Python: A Guide for Data Scientists
L. Ornella, P. Pérez, E. Tapia, J. González-Camacho, J. Burgueño, Xuecai Zhang, Sukhwinder Singh, F. Vicente, D. Bonnett, S. Dreisigacker, R. Singh, N. Long, J. Crossa (2014)
Genomic-enabled prediction with classification algorithms. Heredity, 112
L. Varona, A. Legarra, M. Toro, Z. Vitezica (2018)
Non-additive Effects in Genomic Selection. Frontiers in Genetics, 9
O. Christensen, M. Lund (2010)
Genomic prediction when some animals are not genotyped. Genetics Selection Evolution, 42
D. Gianola (2010)
Statistical learning methods for genome‐based analysis of quantitative traits
A. García-Ruiz, J. Cole, P. VanRaden, G. Wiggans, F. Ruiz-Lopez, C. Tassell (2016)
Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proceedings of the National Academy of Sciences, 113
N. Long, D. Gianola, G. Rosa, K. Weigel (2011)
Application of support vector regression to genome-assisted prediction of quantitative traits. Theoretical and Applied Genetics, 123
L. Zingaretti, S. Gezan, L. Ferrão, L. Osorio, A. Monfort, P. Muñoz, V. Whitaker, M. Pérez-Enciso (2020)
Exploring Deep Learning for Complex Trait Genomic Prediction in Polyploid Outcrossing Species. Frontiers in Plant Science, 11
M. Goddard, B. Hayes (2009)
Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nature Reviews Genetics, 10
B. An, Mang Liang, Tianpeng Chang, X. Duan, Lili Du, Lingyang Xu, Lupei Zhang, Xue Gao, Junya Li, Huijiang Gao (2021)
KCRR: a nonlinear machine learning with a modified genomic similarity matrix improved the genomic prediction efficiency. Briefings in Bioinformatics
A. Alves, R. Espigolan, T. Bresolin, R. Costa, G. Júnior, R. Ventura, R. Carvalheiro, L. Albuquerque (2020)
Genome-enabled prediction of reproductive traits in Nellore cattle using parametric models and machine learning methods. Animal Genetics
G. Su, Per Madsen, U. Nielsen, Esa Mäntysaari, G. Aamand, O. Christensen, M. Lund (2012)
Genomic prediction for Nordic Red Cattle using one-step and selection index blending. Journal of Dairy Science, 95(2)
A. Statnikov, Lily Wang, C. Aliferis (2008)
A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics, 9
H. Drucker (1997)
Improving Regressors using Boosting Techniques
Hailiang Song, S. Ye, Yifan Jiang, Zhe Zhang, Qin Zhang, Xiangdong Ding (2019)
Using imputation-based whole-genome sequencing data to improve the accuracy of genomic prediction for combined populations in pigs. Genetics Selection Evolution, 51
R. Abdollahi-Arpanahi, D. Gianola, F. Peñagaricano (2020)
Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genetics Selection Evolution, 52
Christopher Chang, C. Chow, L. Tellier, S. Vattikuti, S. Purcell, James Lee (2014)
Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience, 4
P. VanRaden (2008)
Efficient methods to compute genomic predictions. Journal of Dairy Science, 91(11)
J. Whittaker, Robin Thompson, M. Denham (1999)
Marker-assisted selection using ridge regression. Genetical Research, 75(2)
D. Gianola, H. Okut, K. Weigel, G. Rosa (2011)
Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genetics, 12
L. Breiman (2001)
Random Forests. Machine Learning, 45
R. Tibshirani (2011)
Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society, 73
B. Boser, Isabelle Guyon, V. Vapnik (1992)
A training algorithm for optimal margin classifiers
Background: Recently, machine learning (ML) has become attractive in genomic prediction, but its superiority over conventional (ss)GBLUP methods and the choice of optimal ML methods need to be investigated.

Results: In this study, 2566 Chinese Yorkshire pigs with reproduction trait records were genotyped with the GenoBaits Porcine SNP 50 K and PorcineSNP50 panels. Four ML methods, including support vector regression (SVR), kernel ridge regression (KRR), random forest (RF) and Adaboost.R2, were implemented. Through 20 replicates of fivefold cross-validation (CV) and one forward prediction for younger individuals, the utility of ML methods in genomic prediction was explored. In CV, ML methods significantly outperformed the conventional methods genomic BLUP (GBLUP), single-step GBLUP (ssGBLUP) and the Bayesian method BayesHE, improving the genomic prediction accuracy of GBLUP, ssGBLUP and BayesHE by 19.3%, 15.0% and 20.8%, respectively. In addition, ML methods yielded smaller mean squared error (MSE) and mean absolute error (MAE) in all scenarios. ssGBLUP yielded an improvement of 3.8% on average in accuracy compared to GBLUP, and the accuracy of BayesHE was close to that of GBLUP. In genomic prediction of younger individuals, RF and Adaboost.R2_KRR performed better than GBLUP and BayesHE, while ssGBLUP performed comparably with RF. ssGBLUP yielded slightly higher accuracy and lower MSE than Adaboost.R2_KRR in the prediction of the total number of piglets born, whereas for the number of piglets born alive, Adaboost.R2_KRR performed significantly better than ssGBLUP. Among the ML methods, Adaboost.R2_KRR consistently performed well in our study. Our findings also demonstrated that tuning hyperparameters is essential for ML methods: after tuning, the average improvement over default hyperparameters was 14.3% in CV and 21.8% in predicting genomic outcomes of younger individuals.
Conclusion: Our findings demonstrated that ML methods had better overall prediction performance than conventional genomic selection methods and could be new options for genomic prediction. Among the ML methods, Adaboost.R2_KRR consistently performed well in our study, and tuning hyperparameters is necessary for ML methods; the optimal hyperparameters depend on the trait, the dataset, and other factors.

Keywords: Genomic prediction, Machine learning, Pig, Prediction accuracy

* Correspondence: xding@cau.edu.cn
Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China. Full list of author information is available at the end of the article.

© The Author(s) 2022. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).

Wang et al.
Journal of Animal Science and Biotechnology (2022) 13:60

Background

Genomic selection (GS) has been widely recognized and successfully implemented in animal and plant breeding programs [1–3]. It has been reported that the breeding costs of dairy cattle using GS were 92% lower than those of traditional progeny testing [4]. At present, the rate of annual genetic gain for yield traits of US Holstein dairy cattle has increased from approximately 50% to 100% [5]. The accuracy of GS is impacted by a number of factors, such as the analytical method of genomic prediction, reference population size, marker density, and heritability. Currently, parametric methods are most commonly used for livestock and poultry genomic selection, mainly including genomic BLUP (GBLUP) [6], single-step GBLUP (ssGBLUP) [7, 8], ridge regression (RR) [9], least absolute shrinkage and selection operator (LASSO) [10], and Bayesian regression models [11, 12], which differ mainly in the prior distribution assumed for marker effects.

Nevertheless, these linear models usually take into account only additive inheritance and ignore the complex nonlinear relationships that may exist between markers and phenotypes (e.g. epistasis, dominance, or genotype-by-environment interactions). In addition, parametric methods usually provide limited flexibility for handling nonlinear effects in high-dimensional genomic data, resulting in large computational demands [13]. However, studies have shown that considering nonlinearity may enhance the genomic prediction ability of complex traits [14]. Therefore, new strategies should be explored to estimate genomic breeding values more accurately.

Driven by applications in intelligent robots, self-driving cars, automatic translation, face recognition, artificial-intelligence games and medical services, machine learning (ML) has gained considerable attention in the past decade. Some characteristics of ML methods make them potentially attractive for dealing with high-order nonlinear relationships in high-dimensional genomic data: they allow the number of variables to be larger than the sample size [15], can capture the hidden relationship between genotype and phenotype in an adaptive manner, and impose little or no specific distribution assumption on the predictor variables, unlike GBLUP and Bayesian methods [16, 17].

Studies have shown that random forest (RF), support vector regression (SVR), kernel ridge regression (KRR) and other machine learning methods have advantages over GBLUP and BayesB [18–20]. Ornella et al. compared the genomic prediction performance of support vector regression, random forest regression, reproducing kernel Hilbert space (RKHS) regression, ridge regression, and Bayesian Lasso in maize and wheat datasets with different trait-environment combinations, and found that RKHS and random forest regression were the best [21]. González-Camacho et al. reported that the support vector machine (SVM) with a linear kernel performed best in comparison with other ML methods and linear models in genomic prediction of the rust resistance of wheat [20]. Additionally, ML methods have also been widely used in the fields of gene screening, genotype imputation, and protein structure and function prediction [22–25], demonstrating their superiority as well. However, one challenge for ML is choosing the optimal ML method, as a series of ML methods have been proposed, each with its own characteristics and with different prediction abilities in different datasets and traits.

Therefore, the objectives of this study were to 1) assess the performance of ML methods in genomic prediction in comparison with the existing prevailing methods GBLUP, ssGBLUP, and BayesHE, and 2) evaluate the efficiency of different ML methods to explore the ideal ML method for genomic prediction.

Materials and methods

Ethics statement

The whole procedure for blood sample collection was carried out in strict accordance with the protocol approved by the Animal Care and Use Committee of China Agricultural University (Permit Number: DK996).

Population and phenotypes

A purebred Yorkshire pig population from DHHS, a breeding farm in Hebei Province, China, was studied. Animals from this farm were descendants of Canadian Yorkshires, and they were reared under the same feeding conditions. A total of 2566 animals born between 2016 and 2020 were sampled, their 4274 reproduction records of the total number of piglets born (TNB) and the number of piglets born alive (NBA), with delivery dates ranging from 2017 to 2021, were available, and 3893 animals were traced back to construct the pedigree relationship matrix (A matrix). The numbers of full-sib and half-sib families were 339 and 301, respectively.

A single-trait repeatability model was used to estimate the heritability. The fixed effect included herd-year-season, and the random effects included additive genetic effects, random residuals, and permanent environment effects of sows (environmental effects affecting litter size across parities of sows). The information on animals, phenotypes and genetic components, as well as the estimated heritabilities, is listed in Table 1. The estimated heritabilities of TNB and NBA were both 0.12.

Derivation of corrected phenotypes

To avoid double counting of parental information, corrected phenotypes (y_c) derived from the estimated breeding values (EBVs) were used as response variables in genomic prediction. Pedigree-based BLUP with the single-trait repeatability model was performed to estimate the breeding values for each trait separately.
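The corrected phenotypes used here (y_c, computed in this study as the EBV plus the sow's average estimated residual over her parities) can be sketched in a few lines of Python; the data layout, column names and toy numbers below are illustrative assumptions, not the authors' actual DMU output format.

```python
import pandas as pd

def corrected_phenotypes(ebv: pd.Series, residuals: pd.DataFrame) -> pd.Series:
    """Corrected phenotype y_c = EBV + mean estimated residual over a sow's parities.

    ebv:       estimated breeding value per animal, indexed by animal ID
    residuals: one row per parity record, columns ['animal', 'residual']
    """
    mean_resid = residuals.groupby("animal")["residual"].mean()
    # Align on animal ID; animals without residual records keep y_c = EBV.
    return ebv.add(mean_resid, fill_value=0.0).rename("y_c")

# Toy example: sow "A" has two parity records, sow "B" has one.
ebv = pd.Series({"A": 0.50, "B": -0.20})
resid = pd.DataFrame({"animal": ["A", "A", "B"],
                      "residual": [0.10, 0.30, -0.05]})
yc = corrected_phenotypes(ebv, resid)
print(yc)  # A: 0.50 + 0.20 = 0.70, B: -0.20 - 0.05 = -0.25
```

Using y_c rather than raw phenotypes as the response avoids double counting parental information once relatives enter the reference population.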
Journal of Animal Science and Biotechnology (2022) 13:60 Page 3 of 12 Table 1 Summary of two reproduction traits of Yorkshire pigs a 2 2 2 Trait Number of records Birth year Genotyped animals Mean SD Minimum Maximum σ σ h (SE) a e TNB 4274 2016–2020 2566 13 3.38 3 24 1.26 8.95 0.12 (0.034) NBA 4274 2016–2020 2566 12 3.13 3 24 0.98 7.13 0.12 (0.032) TNB: total number of piglets born; NBA: number of piglets born alive SE standard error control of the genotype was carried out using PLINK y ¼ Xb þ Z a þ Z pe þ e; ð1Þ a pe software [29]. SNPs with a minor allele frequency (MAF) where y was the vector of raw phenotypic values; b was lower than 0.01 and call rate lower than 0.90 were re- the vector of fixed effects including herd-year-season, in moved, and individuals with call rates lower than 0.90 which season consisted of four levels (1st = December to were excluded. Finally, all animals and 44,922 SNPs on February; 2nd = March to May; 3rd = June to August; autosomes remained for further analysis. 4th = September to November); a was the vector of addi- tive genetic effects; pe was the vector of permanent en- Statistical models vironment effects of sows; and e was the vector of GBLUP, ssGBLUP, Bayesian Horseshoe (BayesHE) and random error. X, Z ,and Z are the incidence matrices four ML regression methods, support vector regression a pe linking b, a and pe to y. The random effects were as- (SVR), kernel ridge regression (KRR), random forest sumed to be normally distributed as follows: a ~ N (0, (RF), and Adaboost.R2 were used to perform genomic 2 2 2 Aσ ), pe ~ N (0, Iσ ), and e ~ N (0, Iσ ), where A was prediction. a pe e the pedigree-based relationship matrix; I was the identity 2 2 2 matrix; and σ , σ , and σ were the variances of addi- GBLUP a pe e tive genetic effects, permanent environment effects of y ¼1μþZgþe sows, and residuals, respectively. A total of 3893 individ- uals were traced to construct matrix A. 
Their EBVs were in which y is the vector of corrected phenotypes of ge- calculated using the DMUAI procedure of the DMU notyped individuals. μ is the overall mean, 1 is a vector software [26]. The y were calculated as EBV plus the of 1 s, g is the vector of genomic breeding values, e is average estimated residuals for multiple parties of a sow the vector of random errors, and Z is an incidence following Guo et al. [27]. matrix allocating records to g. The distributions of ran- 2 2 dom effects were: g ~N(0, G σ ) and e ~N(0, I σ ), g e Genotype data and imputation where G was the genomic relationship matrix (G Two kinds of 50 K define SNP panels, PorcineSNP50 2 2 matrix), and σ and σ were the additive genetic vari- g e BeadChip (Illumina, CA, USA) and GenoBaits Porcine ance and the residual variance, respectively. SNP 50 K (Molbreeding, China) were used for genotyp- ing. A total of 1189 sows were genotyped with the Porci- ssGBLUP neSNP50 BeadChip, which included 50,697 SNPs across ssGBLUP had the same expression as GBLUP, except that the genome, and 1978 individuals were genotyped using it used y of both genotyped and nongenotyped individ- the GenoBaits Porcine SNP 50 K with 52,000 SNPs. uals by combining the G matrix and A matrix. It was as- There were 30,998 common SNPs between these two sumed that g followed a normal distribution N (0, H σ ). SNP panels, and 601 individuals were genotyped with The inverse of matrix H was: both SNP panels; therefore, 2566 genotyped individuals were finally used for further analysis, including 1189 ani- −1 −1 G −A 0 −1 −1 w 22 H ¼ þ A mals with the PorcineSNP50 BeadChip and 1377 pigs with the GenoBaits Porcine SNP 50 K. The animals ge- notyped with GenoBaits Porcine SNP 50 K were imputed To prevent the problem that the singular matrix can- to the PorcineSNP50 BeadChip using Beagle 5.0 [28]. not be inverted, G = (1-w) G +wA , and w was equal w a 22 The reference population size for genotype imputation to 0.05 [30]. was 3720. 
Imputation accuracy was assessed by the dos- age R-squared measure (DR2), which is the estimated BayesHE squared correlation between the estimated allele dose BayesHE was developed by Shi. et al. [31], it was based and the true allele dose. The genotype correlation (COR) on global-local priors to increase the flexibility and and the genotype concordance rate (CR) were also cal- adaptability of the Bayesian model. In this study, the first culated based on the 601 overlapped animals to evaluate form of BayesHE (BayesHE1) was used [31], and the the imputation accuracy. After imputation, quality Markov chain Monte Carlo (MCMC) chain was run for Wang et al. Journal of Animal Science and Biotechnology (2022) 13:60 Page 4 of 12 50,000 cycles, with the first 20,000 cycles being discarded space, and then builds a ridge regression model to make as burn-in and every 50 samples of the remaining 30,000 the data linearly separable in this kernel space. The lin- iterations saved to infer posterior statistics. In-house ear function in the kernel space was selected according scripts written in Fortran 95 were used for BayesHE ana- to the mean squared error loss of ridge regularization lyses [31], and the DMUAI procedure implemented in [33]. The final KRR prediction model can be written as: DMU software [26] was used for GBLUP and ssGBLUP −1 yxðÞ¼ kðÞ K þ λI ^y ð6Þ analyses. where λ is the regularization constant, and K is the Gram Support vector regression T matrix with entries K = K(x , x )= ϕ(x )· ϕ(x ) ;thus, forn ij i j i j Support vector machine (SVM) was based on statistical training samples, the obtained kernel matrix is: learning theory. 
SVR was the application of SVM in re- 2 3 gression for dealing with quantitative responses, which Kðx ; x Þ Kðx ; x Þ ⋯ Kðx ; x Þ 1 1 1 2 1 n 6 7 used a linear or nonlinear kernel function to map the in- 6 Kðx ; x Þ Kðx ; x Þ ⋯ Kðx ; x Þ7 2 1 2 2 2 n 6 7 K ¼ put space (the marker dataset) to a higher dimensional 6 7 ⋮⋮ ⋮ ⋮ 4 5 feature space [32], and performed modelling and predic- Kðx ; x Þ Kðx ; x Þ ⋯ Kðx ; x Þ tion on the feature space. In other words, we can build a n 1 n 2 n n nn linear model in the feature space to deal with regression ð7Þ problems. The model formulation of SVR can be I is the identity matrix, k = K(x , x ) with j = 1,2,3, …,n, i j expressed as: n is the number of training samples, and x is the test sample. In the expanded form, fxðÞ ¼ β þ hxðÞ β ð2Þ 2 3 Kðx ; x Þ in which h(x) β is the kernel function, β is the vector of i 1 6 7 weights, and β is the bias. Generally, the formalized 0 6 7 Kðx ; x Þ i 2 6 7 k ¼ ð8Þ SVR was given by minimizing the following restricted 6 7 4 5 loss function: Kðx ; x Þ i n 1 n min kk β þ C VyðÞ −fxðÞ ; ð3Þ i¼1 The grid search was used to find the most suitable β ;β 2 kernel function and λ in this study, and an internal 5- in which fold CV strategy was used for tuning the hyperparameters. 0; ifjj r < ε V ðÞ r ¼ : ð4Þ jj r −ε; otherwise Random forest Random forest (RF) is an ML method that uses voting V (r) is the ε-insensitive loss and C (“cost parameter”) or the average of multiple decision trees to determine is the regularization constant that controls the trade-off the classification or predicted values of new instances between prediction error and model complexity. y is a [34]. Random forest was essentially a collection of deci- quantitative response, and ||·|| is the norm in Hilbert sion trees, and each decision tree was slightly different space. After optimization, the final form of SVR can be from other trees. 
Random forest reduced the risk of written as: overfitting by averaging the prediction results of many decision trees [20]. Random forest regression can be fxðÞ ¼ ðÞ a ^ −a kxðÞ ; x ; ð5Þ i i i i¼1 written in the following form: in which k(x , x )= ϕ(x ) ϕ(x ) is the kernel function. In i j i j 1 M this research, grid search was used to find the best ker- y ¼ tðÞ ψ ðÞ y : X ð9Þ m¼1 nel function and the optimal hyperparameters of C and gamma. An internal fivefold cross-validation (5-fold CV) in which y is the predicted value of random forest re- strategy was performed to tune the hyperparameters gression, t (ψ (y : X)) is an individual regression tree, m m when performing a grid search. and M is the number of decision trees in the forest. The prediction was obtained by passing down the predictor Kernel ridge regression variables in the flowchart of each tree, and the corre- Kernel ridge regression (KRR) is a nonlinear regression sponding estimated value at the terminal node was used method that can effectively discover the nonlinear struc- as the predicted value. Finally, the predictions of each ture of the data [33]. KRR uses a nonlinear kernel func- tree in RF were averaged to calculate the final prediction tion to map the data to a higher dimensional kernel of unobserved data. The grid search was used to find the Wang et al. Journal of Animal Science and Biotechnology (2022) 13:60 Page 5 of 12 most suitable hyperparameter M and the maximum Table 2 The optimal hyperparameters of each ML model obtained through a grid search for TNB and NBA traits in 20 depth of the tree, and the inner 5-fold CV was per- replicates of 5-fold CV formed to tune the hyperparameters. Method Optimal hyperparameters SVR kernel = ‘rbf’, C = 7, gamma = 0.0001 Adaboost.R2 KRR kernel = ‘rbf’, λ =0.1, gamma = 0.0001 Adaboost.R2 is an ad hoc modification of Adaboost. 
R and an extension of Adaboost.M2 created to deal with RF n_estimators = 250, max_depth = None regression problems, which repeatedly used a regression Adaboost.R2_SVR n_estimators = 50, kernel = ‘rbf’, C = 7, gamma = 0.0001 tree as a weak learner followed by increasing the weights Adaboost.R2_KRR n_estimators = 50, kernel = ‘rbf’, λ =0.01, gamma = 0.0001 of incorrectly predicted samples and decreasing the a Optimal hyperparameters: The optimal hyperparameters of each machine learning method obtained by using a grid search weights of correctly predicted samples. It builds a “com- mittee” by integrating multiple weak learners [35], mak- ing its prediction effect better than those of weak remaining group was treated as the validation popula- learners. Adaboost.R2 regression model can be written tion. The genotyped reference and validation sets in each as: replicate of 5-fold CV were the same for all methods, and it should be noted that nongenotyped individuals X X 1 1 1 were added to the reference population in ssGBLUP. For y ¼ inf y∈Y : log ≥ log ; t: f ðÞ x ≤ y t t ε 2 ε all methods, the accuracy of genomic prediction was cal- t t culated as the Pearson correlation of y (corrected phe- ð10Þ c notypes) and PV (predicted values). In addition, the where y is the predicted value, f (x) is the predicted value prediction unbiasedness was also calculated as the re- of the t-th weak learner, ε is the error rate of f (x) and ε gression of y on PV of the validation population. The 5- t t t fold CV scheme was repeated 20 times, and the overall ¼ L ð1−L Þ, L is the average loss and L ¼ L ðiÞ t t t t t i¼1 prediction accuracy and unbiasedness were the averages D ðiÞ; L (i) is the error between the actual observation t t of 20 replicates. The Hotelling-Williams Test [36] was value and the predicted value of the i-th predicted indi- performed to compare the prediction accuracy of differ- vidual, and D (i) is the weight distribution of f (x). 
After t t ent methods after parameter optimization. f (x) is trained, the weight distribution D (i) becomes D t t t + Meanwhile, prediction ability metrics, e.g., mean (i), squared error (MSE) and mean absolute error (MAE), ðÞ 1−L ðÞ i were also used to evaluate the performance of regression D ðÞ i β D ðÞ¼ i ; ð11Þ tþ1 models in the present study. MSE can take both predic- tion accuracy and bias into account [37], and the smaller in which Z is a normalization factor chosen such that the value of MSE is, the better the accuracy of the model D (i) will be a distribution. In the current study, SVR to describe the experimental data. The MAE could bet- t +1 and KRR were used as weak learners of Adaboost.R2. ter reflect the actual situation of the predicted value For these four ML methods, the vectors of genotypes error. Their formulas can be written as follows. (coded as 0, 1, 2) were the input independent variables, X X 1 m 1 m corrected phenotypes y was used as the response vari- MSE ¼ ðÞ f −y ; and MAE ¼ jj f −y ð12Þ c i i i i i¼1 i¼1 m m able, and the Sklearn package for Python (V0.22) was used for genomic prediction. We sought the optimal where m represents the number of animals in each CV hyperparameter combination from a grid of values with test fold of 5-fold CV, f is the vector of predicted values different hyperparameter combinations, and the combin- (PV) and y is the vector of observed values (y ). The final ation in the grid with the highest Pearson correlation MSE and MAE were the average of 20 replicates. was selected as the optimal hyper-parameter in each fold In addition, to be more in line with the actual situation (grid search). Meanwhile, the optimal hyperparameters of genomic selection, we compared ML methods and for SVR, KRR, RF and Adaboost.R2 in CV according to traditional genomic selection methods in using early- the grid search are shown in Table 2. generation animals to predict the performance of ani- mals of later generations. 
Therefore, the younger animals born after January 2020 were chosen as the validation population, and the sizes of the reference and validation populations were 2222 and 344, respectively. The accuracy of genomic prediction was evaluated as r(y_c, PV), the Pearson correlation between the corrected phenotypes y_c and the predicted values PV.

Wang et al. Journal of Animal Science and Biotechnology (2022) 13:60, Page 6 of 12

Results

Genotype imputation accuracy

Figure 1 illustrates the accuracy of imputing GenoBaits Porcine SNP 50 K to the PorcineSNP50 BeadChip across minor allele frequency (MAF) intervals and chromosomes. DR2, CR and COR were not sensitive to MAF, except that COR was lower when the MAF was below 0.05 or in the range of 0.45 to 0.5 (Fig. 1a). DR2, CR and COR on each chromosome were 0.978~0.988, 0.984~0.988 and 0.957~0.972, respectively, and no significant differences in DR2, CR and COR were observed between chromosomes (Fig. 1b). In the same scenarios, the COR values were smaller than those of DR2 and CR. The averaged DR2, CR and COR across all variants were 0.984, 0.985 and 0.964, respectively, indicating that the imputation was sufficiently accurate to analyse the two SNP panels together.

Fig. 1 Imputation accuracy. Imputation accuracy of GenoBaits Porcine SNP 50 K to PorcineSNP50 BeadChip at different minor allele frequency (MAF) intervals (a) and chromosomes (b). DR2, the estimated squared correlation between the estimated and the true allele dose; genotype concordance rate (CR), the ratio of correctly imputed genotypes; genotype correlation (COR), the correlation coefficient between the imputed and the true variants

Accuracy of genomic prediction in cross-validation

Comparison of ML methods with (ss)GBLUP and BayesHE

Table 3 shows the prediction accuracies and unbiasedness of the ML methods, (ss)GBLUP and BayesHE for traits TNB and NBA in 20 replicates of 5-fold CV. The accuracies of the ML methods after tuning the hyperparameters were significantly (P < 0.05) higher than those of (ss)GBLUP and BayesHE. The improvements of the ML methods over GBLUP, ssGBLUP and BayesHE were 19.3%, 15.0% and 20.8% on average, ranging from 8.9% to 24.0%, 7.6% to 17.5% and 11.1% to 24.6%, respectively. For trait TNB, the average accuracy of all ML methods was higher than that of GBLUP: support vector regression (SVR) showed an improvement of 19.0%, similar to kernel ridge regression (KRR) and to Adaboost.R2 based on SVR and on KRR, which obtained improvements of 18.1% and 17.7%, respectively; random forest (RF) yielded the lowest improvement, 8.9%. Similar advantages of ML held over ssGBLUP, with improvements of 17.5%, 17.5%, 7.6%, 16.7% and 16.3% for SVR, KRR, RF, Adaboost.R2_SVR and Adaboost.R2_KRR, respectively. The ML methods gained the largest advantage over BayesHE: the accuracies of SVR, KRR, RF, Adaboost.R2_SVR and Adaboost.R2_KRR were improved by 21.4%, 21.4%, 11.1%, 20.6% and 20.2%, respectively, compared with BayesHE. For trait NBA, the ML methods again performed better than GBLUP, ssGBLUP and BayesHE; Adaboost.R2_KRR gained the largest improvement in all comparisons, and KRR the second largest. SVR and Adaboost.R2 based on SVR yielded the same improvements over GBLUP, ssGBLUP and BayesHE. RF again showed the lowest improvement among the ML methods.

Meanwhile, GBLUP, ssGBLUP and BayesHE had similar performance, with no significant differences in prediction accuracy among them. Nevertheless, ssGBLUP produced an average improvement of 3.7% compared with GBLUP (1.2% for TNB; 6.3% for NBA), while less bias was observed for GBLUP in all scenarios. BayesHE yielded accuracy similar to GBLUP (0.243 vs. 0.248 for TNB; 0.207 vs. 0.208 for NBA), but the unbiasedness of BayesHE was much closer to 1 (1.015 for TNB; 1.009 for NBA).

Table 3 Accuracies and unbiasedness of genomic prediction on TNB and NBA from seven methods in 20 replicates of 5-fold CV

  Hyperparameters  Method           TNB accuracy        TNB unbiasedness  NBA accuracy        NBA unbiasedness
  -                GBLUP            0.248 ± 0.026 (a)   0.958 ± 0.132     0.208 ± 0.025 (a)   0.931 ± 0.142
  -                ssGBLUP          0.251 ± 0.026 (a)   0.901 ± 0.121     0.221 ± 0.026 (ab)  0.844 ± 0.113
  -                BayesHE          0.243 ± 0.025 (a)   1.015 ± 0.148     0.207 ± 0.026 (a)   1.009 ± 0.171
  Tuning           SVR              0.295 ± 0.025 (b)   1.230 ± 0.119     0.254 ± 0.023 (b)   1.106 ± 0.110
  Tuning           KRR              0.295 ± 0.025 (b)   1.266 ± 0.125     0.256 ± 0.023 (b)   1.151 ± 0.113
  Tuning           RF               0.270 ± 0.029 (ab)  1.229 ± 0.152     0.248 ± 0.028 (ab)  1.188 ± 0.147
  Tuning           Adaboost.R2_SVR  0.293 ± 0.025 (b)   1.363 ± 0.138     0.254 ± 0.024 (b)   1.256 ± 0.131
  Tuning           Adaboost.R2_KRR  0.292 ± 0.025 (b)   1.344 ± 0.136     0.258 ± 0.024 (b)   1.249 ± 0.129
  Default          SVR              0.255 ± 0.027       1.275 ± 0.147     0.224 ± 0.023       1.098 ± 0.126
  Default          KRR              0.264 ± 0.025       1.007 ± 0.108     0.222 ± 0.024       0.879 ± 0.101
  Default          RF               0.246 ± 0.028       1.064 ± 0.142     0.225 ± 0.027       1.002 ± 0.128
  Default          Adaboost.R2_SVR  0.273 ± 0.024       0.998 ± 0.106     0.228 ± 0.026       0.822 ± 0.099
  Default          Adaboost.R2_KRR  0.254 ± 0.024       0.759 ± 0.085     0.209 ± 0.027       0.636 ± 0.085

  TNB: total number of piglets born; NBA: number of piglets born alive. Accuracy: the correlation between corrected phenotypes and predicted values in the validation population; Unbiasedness: the regression of corrected phenotypes on the predicted values. Accuracies not sharing a superscript letter (in parentheses) differ significantly by the Hotelling-Williams test.

On the other hand, the mean squared error (MSE) and mean absolute error (MAE) were also used to assess the performance of the different methods. As shown in Table 4, after tuning the hyperparameters, the ML methods were generally superior to GBLUP, ssGBLUP and BayesHE in terms of MSE and MAE: for both reproduction traits, all ML methods yielded lower MSE and MAE than GBLUP, ssGBLUP and BayesHE. The performance of GBLUP, ssGBLUP and BayesHE was very close; among these three methods, ssGBLUP produced slightly lower MSE (5.26 for TNB; 3.95 for NBA) and MAE (1.748 for TNB; 1.532 for NBA), but these values were still higher than those from RF, which performed worst among the four ML methods with MSE of 5.212 and 3.901 and MAE of 1.747 and 1.527 for TNB and NBA, respectively.

Table 4 Mean squared error (MSE) and mean absolute error (MAE) of seven methods for TNB and NBA as assessed with 20 replicates of 5-fold CV

  Hyperparameters  Method           TNB MSE  TNB MAE  NBA MSE  NBA MAE
  -                GBLUP            5.259    1.749    4.168    1.606
  -                ssGBLUP          5.260    1.748    3.950    1.532
  -                BayesHE          5.320    1.763    4.023    1.556
  Tuning           SVR              5.129    1.730    3.880    1.521
  Tuning           KRR              5.134    1.731    3.876    1.521
  Tuning           RF               5.212    1.747    3.901    1.527
  Tuning           Adaboost.R2_SVR  5.158    1.739    3.892    1.528
  Tuning           Adaboost.R2_KRR  5.153    1.737    3.883    1.526
  Default          SVR              5.271    1.748    3.956    1.522
  Default          KRR              5.210    1.743    3.944    1.531
  Default          RF               5.266    1.756    3.930    1.531
  Default          Adaboost.R2_SVR  5.202    1.750    3.950    1.541
  Default          Adaboost.R2_KRR  5.309    1.771    4.040    1.566

Comparison among ML methods

Tables 3 and 4 indicate that the ML methods performed better than GBLUP, ssGBLUP and BayesHE. They also show that RF had the lowest accuracy, even though no significant differences were observed among the ML methods in this study. After tuning the parameters, the accuracies of SVR, KRR, Adaboost.R2_SVR and Adaboost.R2_KRR were improved by an average of 5.8%, 6.2%, 5.5% and 6.1% compared with RF, ranging from 8.1% to 9.3% for TNB and from 2.4% to 4.0% for NBA. For TNB, SVR and KRR showed the highest accuracies (0.295 for both), and Adaboost.R2_KRR yielded the highest accuracy for NBA (0.258). In the comparison of unbiasedness, SVR produced the lowest genomic prediction bias, with a regression coefficient close to 1.0, while Adaboost.R2 with either base learner (SVR or KRR) produced larger bias. As a trade-off between accuracy and unbiasedness, SVR and KRR had the most robust prediction ability, which was also confirmed by the MSE and MAE results: SVR and KRR had the smallest MSE and MAE in all scenarios.

It should be noted that the better performance of the ML methods was obtained by tuning the hyperparameters (Tables 2, 3). Compared with the default hyperparameters, the accuracy of the ML methods with optimal hyperparameters was improved by 14.3% on average; the optimal hyperparameters improved the genomic prediction accuracies of SVR, KRR, RF and Adaboost.R2 by 15.7%, 11.7%, 9.8% and 15.0% for TNB, and by 13.4%, 15.3%, 10.2% and 23.4% for NBA, respectively. For unbiasedness, except for SVR on TNB, the unbiasedness of all ML methods using the default parameters was lower than that using the optimal parameters. On the other hand, Tables 3 and 4 indicate that the ML methods with default hyperparameters did not yield advantages over GBLUP, ssGBLUP and BayesHE.

Accuracy of genomic prediction in predicting younger animals

Table 5 presents the accuracy and MSE of genomic prediction of TNB and NBA when the different methods were used to predict younger animals. On the one hand, a trend similar to the CV was obtained for GBLUP, BayesHE and ssGBLUP: GBLUP performed comparably with BayesHE, while ssGBLUP yielded higher accuracies and lower MSE than GBLUP and BayesHE for both traits. On the other hand, different from the results in CV, the superiority of the ML methods with optimal hyperparameters was not significant in predicting younger animals, although they still improved the accuracies and reduced the MSE compared with the outcomes using the default hyperparameters. Table 5 indicates that Adaboost.R2_KRR and RF still outperformed GBLUP and BayesHE, as demonstrated in the CV; ssGBLUP performed comparably with RF and yielded slightly higher accuracy and lower MSE than Adaboost.R2_KRR in the prediction of TNB, whereas for NBA, Adaboost.R2_KRR performed significantly better than ssGBLUP. Meanwhile, after tuning the parameters, RF and KRR obtained higher accuracies and lower MSE than GBLUP and BayesHE, respectively. The performance of RF was significantly improved, and it performed better than KRR and SVR. In the prediction of younger animals, SVR performed the worst with either default or optimal hyperparameters, which was different from its performance in the CV.

Table 5 Accuracy and mean squared error (MSE) of genomic prediction of TNB and NBA in younger individuals from seven methods

  Hyperparameters  Method           TNB accuracy  TNB MSE  TNB optimal hyperparameters                                  NBA accuracy  NBA MSE  NBA optimal hyperparameters
  -                GBLUP            0.355 (ab)    11.598   -                                                            0.264 (ab)    10.203   -
  -                ssGBLUP          0.408 (b)     11.221   -                                                            0.288 (ab)    9.974    -
  -                BayesHE          0.357 (ab)    11.566   -                                                            0.262 (ab)    10.143   -
  Tuning           SVR              0.307 (a)     11.488   kernel = 'rbf'; gamma = 0.00005; C = 14                      0.229 (a)     10.235   kernel = 'rbf'; gamma = 0.00005; C = 13
  Tuning           KRR              0.362 (ab)    11.367   kernel = 'rbf'; gamma = 0.000001; λ = 0.07                   0.266 (ab)    10.121   kernel = 'rbf'; gamma = 0.000001; λ = 0.12
  Tuning           RF               0.385 (ab)    11.337   n_estimators = 430; max_depth = None                         0.285 (ab)    10.116   n_estimators = 400; max_depth = None
  Tuning           Adaboost.R2_KRR  0.395 (b)     11.254   n_estimators = 70; kernel = 'rbf'; gamma = 0.00001; λ = 1    0.328 (b)     9.794    n_estimators = 60; kernel = 'rbf'; gamma = 0.00001; λ = 0.9
  Default          SVR              0.271         11.858   -                                                            0.170         10.370   -
  Default          KRR              0.346         11.538   -                                                            0.259         10.158   -
  Default          RF               0.260         11.867   -                                                            0.179         10.335   -
  Default          Adaboost.R2_KRR  0.360         11.392   -                                                            0.322         9.797    -

  TNB: total number of piglets born; NBA: number of piglets born alive. Accuracy: the correlation between corrected phenotypes and predicted values in the validation population; Optimal hyperparameters: the optimal hyperparameters of each machine learning method obtained by grid search. Accuracies not sharing a superscript letter (in parentheses) differ significantly by the Hotelling-Williams test.

Computing time

The average computation time to complete each fold of CV for each genomic prediction method is shown in Table 6. Running time was measured in minutes on an HP server (CentOS Linux 7.9.2009, 2.5 GHz Intel Xeon processor, 515 GB total memory). Among all methods, KRR was the fastest algorithm: it took an average of 1.16 min per fold of CV, considerably less than GBLUP (2.07 min) and ssGBLUP (3.23 min). The computing efficiency of SVR (5.28 min) and Adaboost.R2_KRR (5.16 min) was comparable to that of KRR, GBLUP and ssGBLUP. However, RF (53.45 min) and Adaboost.R2_SVR (85.34 min) ran slowly among the ML methods; Adaboost.R2 based on KRR (Adaboost.R2_KRR) was much more time-saving than Adaboost.R2_SVR.
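The superscript letters in Tables 3 and 5 come from the Hotelling-Williams test [36], which compares two dependent correlations sharing one variable, here r(y_c, PV_method1) versus r(y_c, PV_method2). A minimal sketch of the Williams t statistic as presented by Steiger [36]; this is our own illustrative implementation on simulated data, not code from the study:

```python
import numpy as np

def williams_t(r13, r23, r12, n):
    """Williams' t for H0: rho(1,3) == rho(2,3), where variables 1 and 2 are the
    two methods' predictions and variable 3 is the shared phenotype; df = n - 3."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23  # |R| of the 3x3 correlation matrix
    rbar = (r13 + r23) / 2.0
    num = (r13 - r23) * np.sqrt((n - 1) * (1 + r12))
    den = np.sqrt(2 * ((n - 1) / (n - 3)) * det + rbar**2 * (1 - r12) ** 3)
    return num / den

# Toy example: one strong and one weak predictor of the same simulated phenotype.
rng = np.random.default_rng(1)
n = 344                                   # size of a validation set, for illustration
y = rng.normal(size=n)
pv1 = y + 0.5 * rng.normal(size=n)        # more accurate method
pv2 = 0.3 * y + rng.normal(size=n)        # less accurate method
t = williams_t(np.corrcoef(y, pv1)[0, 1],
               np.corrcoef(y, pv2)[0, 1],
               np.corrcoef(pv1, pv2)[0, 1], n)
# t is referred to a t distribution with n - 3 degrees of freedom.
```

The statistic is antisymmetric in the two compared correlations and exactly zero when they are equal, which makes the pairwise letter groupings in the tables straightforward to derive.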
Since the MCMC algorithm required more iterations to reach convergence, BayesHE was, as expected, the slowest method, taking 226.12 min per fold of CV.

Table 6 Average computing time to complete each fold of 5-fold CV according to different genomic prediction methods

  Method           TNB              NBA
  GBLUP            2 min 6 s        2 min 2 s
  ssGBLUP          3 min 12 s       3 min 16 s
  BayesHE          3 h 57 min 1 s   3 h 35 min 13 s
  SVR              5 min 27 s       5 min 7 s
  KRR              1 min 4 s        1 min 16 s
  RF               50 min 38 s      56 min 16 s
  Adaboost.R2_SVR  1 h 35 min 13 s  1 h 15 min 28 s
  Adaboost.R2_KRR  5 min 3 s        5 min 16 s

Discussion

Our results elucidated that ssGBLUP performed better than GBLUP in terms of accuracy in all scenarios investigated, which is consistent with previous studies [27, 38-40]. This can be explained by the fact that GBLUP uses phenotypic information only from genotyped individuals, whereas ssGBLUP simultaneously uses information from both genotyped and nongenotyped individuals to construct a combined genotype-pedigree relationship matrix (H matrix). Since the nongenotyped individuals were related, through the pedigree, to individuals in the validation population, ssGBLUP took advantage of the phenotypic information of the whole population to obtain better predictions.

However, in our research, in both the 5-fold CV and the prediction of younger animals, ssGBLUP produced only slightly higher accuracies for the two reproduction traits. The lower improvement of ssGBLUP may be due to the following reasons. (I) A weak pedigree relationship between the nongenotyped reference population and the genotyped candidates. In our study, only 143 of the 789 nongenotyped reference animals used by ssGBLUP had pedigree information, and only 46 sires and 40 dams were represented among the 2566 genotyped individuals, indicating that the relationship between nongenotyped reference animals and genotyped candidates was weak and contributed little to the genomic prediction. Li et al. [39] showed that the improvement of ssGBLUP over GBLUP in accuracy was almost entirely contributed by nongenotyped close relatives of the candidates. It can also be observed in Additional file 1: Fig. S1 that the greater the weight of the A matrix, the lower the accuracy, indicating that the information obtained from the pedigree was limited, preventing ssGBLUP from fully exerting its advantages. (II) The low heritability of TNB and NBA. In this study, the heritability of both traits was 0.12, generally consistent with other reports [27, 41, 42]; therefore, sufficient accuracy could not be achieved from the pedigree information alone. This is also confirmed by other studies showing that a certain improvement can be achieved by adding a smaller reference population for traits with medium or high heritability [2, 43].

In this study, we investigated the performance of ML methods in genomic prediction and demonstrated their superiority over the classical methods GBLUP, ssGBLUP and Bayesian methods. Generally, the following characteristics make ML methods potentially attractive for genomic prediction. (I) Although ML methods generally require moderate fine-tuning of hyperparameters, the default hyperparameters usually do not perform poorly [34]. According to our results, the ML methods gained advantages after tuning compared with using the default hyperparameters; in addition, even without tuning, almost all ML methods in CV, and Adaboost.R2_KRR in predicting younger animals, performed better than GBLUP and BayesHE (Tables 3, 4, 5). (II) ML methods can handle situations where the number of parameters is larger than the sample size, and they are very efficient in the case of high-density genetic markers for GS [44]. (III) ML methods make no distribution assumptions about the genetic determinism underlying the trait, enabling possible nonlinear relationships between genotype and phenotype to be captured in a flexible way [44]; this differs from GBLUP and the Bayesian methods, which assume that all marker effects follow the same normal distribution or apply different classes of shrinkage to different SNP effects. In addition, ML methods can take the correlation and interaction of markers into account, whereas linear models based on pedigree and genomic relationships may not provide a sufficient approximation of the genetic signals generated by complex genetic systems [16]. Consequently, for traits with a fully additive architecture, conventional linear models outperform ML models [45], but when traits are affected by nonadditive effects, especially epistasis, ML methods can achieve more accurate predictions [25]. These properties allowed the ML methods to gain a large advantage over GBLUP and BayesHE even though they used only genotyped animals.

In our experiments with 5-fold CV, the ML methods improved the prediction accuracy of the reproduction traits in the Chinese Yorkshire pig population: SVR, KRR, RF and Adaboost.R2 reflected the superiority of the ML methods, with average improvements over GBLUP of 20.5%, 21.0%, 14.1% and 20.5%, respectively. In predicting younger animals, RF and Adaboost.R2_KRR gained 8.45% and 11.3% on TNB and 7.95% and 24.2% on NBA, respectively, over GBLUP. However, SVR and KRR did not perform as well as in CV, and ssGBLUP performed comparably with RF.
Among ML methods, Adaboost.R2 performed consistently well in all situations and generally outperformed ssGBLUP. Our findings on the ML methods are also confirmed by other studies. Liang et al. [46] likewise pointed out that, compared with SVR, KRR and RF, Adaboost possessed the most potent prediction ability in the genomic prediction of economic traits in Chinese Simmental beef cattle. Abdollahi-Arpanahi et al. [47] reported that the gradient boosting method yielded the best prediction performance in comparison with GBLUP, BayesB, RF, convolutional neural networks (CNN) and multilayer perceptrons (MLP) in the genomic prediction of the sire conception rate (SCR) of Holstein bulls. Azodi et al. [48] compared the performance of six linear and five nonlinear ML models using data on 18 traits from six plant species and found that no single algorithm performed best across all traits, while ensemble learning performed consistently well.

In 5-fold CV, Adaboost.R2 and RF did not show the advantages of ensemble learning over the single-learner methods (SVR and KRR). For Adaboost.R2, this is mainly because the current SVR and KRR already exert sufficient prediction ability, which may limit the benefit of boosting. In addition, because of the slow tuning process of Adaboost.R2, we did not tune its hyperparameters precisely, resulting in lower prediction accuracy than SVR and KRR. For RF, prediction accuracy is mainly affected by the number and maximum depth of the decision trees [46], but to keep RF practically feasible, it was impractical to tune the number of trees precisely, given the slow tuning process. We obtained only approximate hyperparameters, so the most ideal RF model was not trained, further compromising its performance. In predicting younger animals, by contrast, Adaboost.R2 and RF (particularly RF) were tuned precisely based on the hyperparameter ranges from CV, resulting in their dramatic improvement compared with SVR and KRR.

Moreover, our results indicated that the optimal hyperparameters may reduce the risk of overfitting (Tables 3, 4 and 5), which is a key element for the quality of the final predictions [50]. In this study, different ML models control overfitting with different parameters. For example, SVR mainly adjusts the fault tolerance of the model through the regularization parameter C, achieving a regularization effect that reduces the degree of overfitting. KRR mainly tunes the hyperparameter λ, which controls the amount of shrinkage used to reduce noise, thereby controlling overfitting. For RF, the tendency to overfit can be reduced by adding decision trees, owing to bagging and random feature selection, and the bias can be reduced by increasing the depth of the decision trees. Adaboost is an iterative algorithm in which each iteration reweights the samples according to the results of the previous iteration; thus, as the iterations continue, the bias of the model decreases continuously. Accordingly, the tuning process highlights the flexibility of ML and increases the advantages of ML methods over conventional genomic selection methods.
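The shrinkage role of λ described above can be seen in a small worked example: in (kernel) ridge regression, increasing the penalty monotonically shrinks the fitted coefficients, trading flexibility for stability. A toy NumPy sketch with simulated data (plain ridge used as a stand-in for KRR; SVR's C and RF's tree number and depth play the analogous roles discussed above):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 30))            # toy predictors
y = X[:, 0] + rng.normal(size=100)        # one informative variable plus noise

def ridge_coef(X, y, lam):
    # Closed-form ridge solution; lam is the shrinkage parameter (KRR's lambda).
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

light = ridge_coef(X, y, 0.1)             # little shrinkage: fits noise more freely
heavy = ridge_coef(X, y, 100.0)           # strong shrinkage: smaller, more stable coefficients
shrinkage = np.linalg.norm(heavy) / np.linalg.norm(light)   # below 1: the regularization effect
```

The same trade-off is what the grid search over λ (and C, n_estimators, max_depth) navigates in each CV fold.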
Our results implied that ensemble learning is helpful for improving genomic prediction. Recently, another type of ensemble learning, based on a hierarchical model, has also demonstrated advantages in genomic selection: Liang et al. [49] developed a stacking ensemble learning framework (SELF) that integrated SVR, KRR and ENET to perform genomic prediction and showed excellent performance.

Our results indicated that tuning hyperparameters is necessary for ML methods, confirming that ML algorithms are sensitive to user-defined parameters during the training phase [37]. After tuning the hyperparameters, the average improvement over the default hyperparameters was 14.3% in CV and 21.8% in the genomic prediction of younger individuals. The ML methods with optimal hyperparameters generally outperformed GBLUP and the Bayesian methods, while with default hyperparameters they performed only comparably with GBLUP and BayesHE. On the other hand, our results also showed that the optimal hyperparameters depend on the characteristics of the traits, the datasets, etc. When the optimal hyperparameters obtained in CV were used to predict younger animals, the prediction accuracies of all ML methods decreased compared with their performance with default parameters (Additional file 1: Table S1). In CV, many replicates were used for tuning, and the optimal hyperparameters were easily obtained for SVR and KRR owing to their fast computing; in predicting younger individuals, by contrast, the hyperparameters were tuned on the basis of only one genomic prediction and may not have sufficed to exert the generalization performance of SVR and KRR, leading to their relatively poorer prediction ability.

Therefore, it is crucial to fine-tune the hyperparameters during the training phase when the dataset changes [16, 37, 48]. Meanwhile, it should be noted that the default hyperparameters usually did not perform poorly, as discussed above, whereas failure to find suitable hyperparameters may greatly reduce the prediction performance of ML methods [46]. If hyperparameter optimization could be automated during ML operation, it would greatly improve the efficiency of hyperparameter tuning and broaden the application of ML methods in genomic prediction.

Conclusions

In this study, we compared four ML methods with GBLUP, ssGBLUP and BayesHE to explore their efficiency in the genomic prediction of reproduction traits in pigs. We compared the prediction accuracy, unbiasedness, MSE, MAE and computation time of the different methods through 20 replicates of 5-fold CV and through the genomic prediction of younger animals. Our results showed that ML methods possess significant potential to improve genomic prediction over that obtained with GBLUP and BayesHE. In 5-fold CV, the ML methods outperformed the conventional methods in all scenarios, yielding higher accuracy and smaller MSE and MAE, while in the genomic prediction of younger animals, RF and Adaboost.R2 performed better than GBLUP and BayesHE. ssGBLUP was comparable with RF, and Adaboost.R2_KRR was overall better than ssGBLUP. Among the ML methods, Adaboost.R2_KRR consistently performed well in our study. Our findings also demonstrated that tuning hyperparameters is necessary for ML methods and that the optimal hyperparameters depend on the characteristics of the traits, the datasets, etc.

Abbreviations

GS: Genomic selection; GBLUP: Genomic BLUP; ssGBLUP: Single-step GBLUP; ML: Machine learning; RF: Random forest; SVR: Support vector regression; KRR: Kernel ridge regression; RKHS: Reproducing kernel Hilbert space; SVM: Support vector machine; TNB: Total number of piglets born; NBA: Number of piglets born alive; A matrix: Pedigree relationship matrix; EBV: Estimated breeding values; yc: Corrected phenotypes; DR2: Dosage R-squared measure; COR: Genotype correlation; CR: Genotype concordance rate; MAF: Minor allele frequency; BayesHE: Bayesian Horseshoe; CV: Cross-validation; MSE: Mean squared error; MAE: Mean absolute error; KcRR: Cosine kernel-based KRR; SELF: Stacking ensemble learning framework; MCMC: Markov chain Monte Carlo

Supplementary Information

The online version contains supplementary material available at https://doi.org/10.1186/s40104-022-00708-0.

Additional file 1: Fig. S1. Accuracy of genomic prediction obtained from the ssGBLUP method with different weighting factors, averaged over TNB and NBA and assessed by 20 replicates of 5-fold CV. Table S1. Accuracy and mean squared error (MSE) of genomic prediction of TNB and NBA from seven methods in predicting younger individuals using the hyperparameters of CV.

Acknowledgements

The authors gratefully acknowledge the constructive comments from reviewers.

Authors' contributions

XDD designed the experiments. XW performed statistical analysis and wrote the manuscript. SLS provided help on BayesHE. XDD revised the manuscript. All authors read and approved the final manuscript.

Funding

This work was supported by grants from the National Key Research and Development Project (2019YFE0106800), the Modern Agriculture Science and Technology Key Project of Hebei Province (19226376D), and the China Agriculture Research System of MOF and MARA.

Availability of data and materials

The datasets used or analyzed during the present study are available from the corresponding author on reasonable request.

Declarations

Ethics approval and consent to participate

Animal samples used in this study were approved by the Animal Care and Use Committee of China Agricultural University. There was no use of human participants, data or tissues.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no conflict of interest.

Author details

1 Key Laboratory of Animal Genetics and Breeding of Ministry of Agriculture and Rural Affairs, National Engineering Laboratory of Animal Breeding, College of Animal Science and Technology, China Agricultural University, Beijing, China. 2 Hebei Province Animal Husbandry and Improved Breeds Work Station, Shijiazhuang, Hebei, China. 3 Zhangjiakou Dahao Heshan New Agricultural Development Co., Ltd, Zhangjiakou, Hebei, China.

Received: 15 November 2021. Accepted: 13 March 2022.

References

1. de Roos AP, Schrooten C, Veerkamp RF, van Arendonk JA. Effects of genomic selection on genetic improvement, inbreeding, and merit of young versus proven bulls. J Dairy Sci. 2011;94(3):1559-67.
2. Hayes BJ, Bowman PJ, Chamberlain AJ, Goddard ME. Invited review: genomic selection in dairy cattle: progress and challenges. J Dairy Sci. 2009;92(2):433-43.
3. Heffner EL, Jannink JL, Sorrells ME. Genomic selection accuracy using multifamily prediction models in a wheat breeding program. Plant Genome. 2011;4(1):65-75.
4. Schaeffer LR. Strategy for applying genome-wide selection in dairy cattle. J Anim Breed Genet. 2006;123(4):218-23.
5. García-Ruiz A, Cole JB, VanRaden PM, Wiggans GR, Ruiz-López FJ, Van Tassell CP. Changes in genetic selection differentials and generation intervals in US Holstein dairy cattle as a result of genomic selection. Proc Natl Acad Sci U S A. 2016;113(28):E3995-4004.
6. VanRaden PM. Efficient methods to compute genomic predictions. J Dairy Sci. 2008;91(11):4414-23.
7. Misztal I, Legarra A, Aguilar I. Computing procedures for genetic evaluation including phenotypic, full pedigree, and genomic information. J Dairy Sci. 2009;92(9):4648-55.
8. Christensen OF, Lund MS. Genomic prediction when some animals are not genotyped. Genet Sel Evol. 2010;42:2.
9. Whittaker JC, Thompson R, Denham MC. Marker-assisted selection using ridge regression. Genet Res. 2000;75(2):249-52.
10. Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc. 2011;73(3):273-82.
11. Meuwissen THE, Hayes BJ, Goddard ME. Prediction of total genetic value using genome-wide dense marker maps. Genetics. 2001;157(4):1819-29.
12. Habier D, Fernando RL, Kizilkaya K, Garrick DJ. Extension of the bayesian alphabet for genomic selection. BMC Bioinform. 2011;12:186.
13. Varona L, Legarra A, Toro MA, Vitezica ZG. Non-additive effects in genomic selection. Front Genet. 2018;9:78.
14. Gianola D, Campos G, González-Recio O, Long N, Wu XL. Statistical learning methods for genome-based analysis of quantitative traits. In: Proceedings of the 9th World Congress on Genetics Applied to Livestock Production. Leipzig: CD-ROM Communication 0014; 2010.
15. An B, Liang M, Chang T, Duan X, Du L, Xu L, et al. KCRR: a nonlinear machine learning with a modified genomic similarity matrix improved the genomic prediction efficiency. Brief Bioinform. 2021;22(6):bbab132.
16. Gianola D, Okut H, Weigel KA, Rosa GJ. Predicting complex quantitative traits with Bayesian neural networks: a case study with Jersey cows and wheat. BMC Genet. 2011;12:87.
17. González-Recio O, Rosa GJM, Gianola D. Machine learning methods and predictive ability metrics for genome-wide prediction of complex traits. Livest Sci. 2014;166:217-31.
18. Montesinos-Lopez OA, Martin-Vallejo J, Crossa J, Gianola D, Hernandez-Suarez CM, Montesinos-Lopez A, et al. A benchmarking between deep learning, support vector machine and Bayesian threshold best linear unbiased prediction for predicting ordinal traits in plant breeding. G3 (Bethesda). 2019;9(2):601-18.
19. Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinform. 2008;9:319.
20. González-Camacho JM, Ornella L, Pérez-Rodríguez P, Gianola D, Dreisigacker S, Crossa J. Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome. 2018;11(2):170104.
21. Ornella L, Perez P, Tapia E, Gonzalez-Camacho JM, Burgueno J, Zhang X, et al. Genomic-enabled prediction with classification algorithms. Heredity (Edinb). 2014;112(6):616-26.
22. Noe F, De Fabritiis G, Clementi C. Machine learning for protein folding and dynamics. Curr Opin Struct Biol. 2020;60:77-84.
23. Kojima K, Tadaka S, Katsuoka F, Tamiya G, Yamamoto M, Kinoshita K. A genotype imputation method for de-identified haplotype reference information by using recurrent neural network. PLoS Comput Biol. 2020;16(10):e1008207.
24. Fa R, Cozzetto D, Wan C, Jones DT. Predicting human protein function with multi-task deep neural networks. PLoS One. 2018;13(6):e0198216.
25. Long N, Gianola D, Rosa GJ, Weigel KA. Application of support vector regression to genome-assisted prediction of quantitative traits. Theor Appl Genet. 2011;123(7):1065-74.
26. Madsen P, Jensen J, Labouriau R, Christensen O, Sahana G. DMU - a package for analyzing multivariate mixed models in quantitative genetics and genomics. In: Proceedings of the 10th World Congress of Genetics Applied to Livestock Production; August 17-22, 2014; Canada.
27. Guo X, Christensen OF, Ostersen T, Wang Y, Lund MS, Su G. Improving genetic evaluation of litter size and piglet mortality for both genotyped and nongenotyped individuals using a single-step method. J Anim Sci. 2015;93(2):503-12.
28. Browning BL, Browning SR. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. Am J Hum Genet. 2009;84(2):210-23.
29. Chang CC, Chow CC, Tellier LC, Vattikuti S, Purcell SM, Lee JJ. Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience. 2015;4:7.
30. Forni S, Aguilar I, Misztal I. Different genomic relationship matrices for single-step analysis using phenotypic, pedigree and genomic information. Genet Sel Evol. 2011;43:1.
31. Shi S, Li X, Fang L, Liu A, Su G, Zhang Y, et al. Genomic prediction using Bayesian regression models with global-local prior. Front Genet. 2021;12:
32. Müller AC, Guido S. Introduction to machine learning with Python: a guide for data scientists. Sebastopol: O'Reilly Media, Inc; 2017.
33. Exterkate P, Groenen PJF, Heij C, van Dijk D. Nonlinear forecasting with many predictors using kernel ridge regression. Int J Forecast. 2016;32(3):736-53.
34. Breiman L. Random forests. Mach Learn. 2001;45(1):5-32.
35. Shrestha DL, Solomatine DP. Experiments with AdaBoost.RT, an improved boosting scheme for regression. Neural Comput. 2006;18(7):1678-710.
36. Steiger JH. Tests for comparing elements of a correlation matrix. Psychol Bull. 1980;87(2):245-51.
37. Alves AAC, Espigolan R, Bresolin T, Costa RM, Fernandes Junior GA, Ventura RV, et al. Genome-enabled prediction of reproductive traits in Nellore cattle using parametric models and machine learning methods. Anim Genet. 2021;52(1):32-46.
38. Song H, Ye S, Jiang Y, Zhang Z, Zhang Q, Ding X. Using imputation-based whole-genome sequencing data to improve the accuracy of genomic prediction for combined populations in pigs. Genet Sel Evol. 2019;51(1):58.
42. Song H, Zhang J, Jiang Y, Gao H, Tang S, Mi S, et al. Genomic prediction for growth and reproduction traits in pig using an admixed reference population. J Anim Sci. 2017;95(8):3415-24.
43. Goddard ME, Hayes BJ. Mapping genes for complex traits in domestic animals and their use in breeding programmes. Nat Rev Genet. 2009;10(6):381-91.
44. Piles M, Bergsma R, Gianola D, Gilbert H, Tusell L. Feature selection stability and accuracy of prediction models for genomic prediction of residual feed intake in pigs using machine learning. Front Genet. 2021;12:611506.
45. Zingaretti LM, Gezan SA, Ferrao LFV, Osorio LF, Monfort A, Munoz PR, et al. Exploring deep learning for complex trait genomic prediction in polyploid outcrossing species. Front Plant Sci. 2020;11:25.
46. Liang M, Miao J, Wang X, Chang T, An B, Duan X, et al. Application of ensemble learning to genomic selection in Chinese Simmental beef cattle. J Anim Breed Genet. 2021;138(3):291-9.
47. Abdollahi-Arpanahi R, Gianola D, Penagaricano F. Deep learning versus parametric and ensemble methods for genomic prediction of complex phenotypes. Genet Sel Evol. 2020;52(1):12.
48. Azodi CB, Bolger E, McCarren A, Roantree M, de Los CG, Shiu SH. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3 (Bethesda). 2019;9(11):3691-702.
49. Liang M, Chang T, An B, Duan X, Du L, Wang X, et al. A stacking ensemble learning framework for genomic prediction. Front Genet. 2021;12:600040.
50. Montesinos-Lopez OA, Montesinos-Lopez A, Perez-Rodriguez P, Barron-Lopez JA, Martini JWR, Fajardo-Flores SB, et al. A review of deep learning applications for genomic selection. BMC Genomics. 2021;22(1):19.
39. Li X, Wang S, Huang J, Li L, Zhang Q, Ding X.
Improving the accuracy of genomic prediction in Chinese Holstein cattle by using one-step blending. Genet Sel Evol. 2014;46:66. 40. Su G, Madsen P, Nielsen US, Mantysaari EA, Aamand GP, Christensen OF, et al. Genomic prediction for Nordic red cattle using one-step and selection index blending. J Dairy Sci. 2012;95(2):909–17. 41. Song H, Zhang Q, Ding X. The superiority of multi-trait models with genotype-by-environment interactions in a limited number of environments for genomic prediction in pigs. J Anim Sci Biotechnol. 2020;11:88.
Journal of Animal Science and Biotechnology – Springer Journals
Published: May 17, 2022
Keywords: Genomic prediction; Machine learning; Pig; Prediction accuracy