Background Colorectal cancer (CRC) is the third most commonly diagnosed cancer worldwide. Active health screening for CRC has yielded detection in increasingly younger adults. However, current machine learning algorithms, which are trained on older adults and smaller datasets, may not perform well in practice for large populations.

Aim To evaluate machine learning algorithms using large datasets accounting for both younger and older adults from multiple regions and diverse sociodemographics.

Methods A large dataset including 109,343 participants in a dietary-based colorectal cancer case study from Canada, India, Italy, South Korea, Mexico, Sweden, and the United States was collected by the Centers for Disease Control and Prevention. This global dietary database was augmented with other publicly accessible information from multiple sources. Nine supervised and unsupervised machine learning algorithms were evaluated on the aggregated dataset.

Results Both supervised and unsupervised models performed well in predicting CRC and non-CRC phenotypes. A prediction model based on an artificial neural network (ANN) was found to be the optimal algorithm, with CRC misclassification of 1% and non-CRC misclassification of 3%.

Conclusions ANN models trained on large heterogeneous datasets may be applicable for both younger and older adults. Such models provide a solid foundation for building effective clinical decision support systems that assist healthcare providers in dietary-related, non-invasive screening applicable to large studies. Using optimal algorithms coupled with high compliance to cancer screening is expected to significantly improve early diagnoses and boost the success rate of timely and appropriate cancer interventions.

Keywords Colorectal cancer, Machine learning, Dietary information
Introduction
In the twenty-first century, the re-emergence of machine learning (ML) and the advancement of artificial intelligence (AI) through data science provide unique opportunities to go beyond traditional statistical and research limitations, and to advance health data analytics in solving healthcare challenges and ultimately improve the delivery of health services [1, 2].

One of the contemporary healthcare challenges is colorectal cancer (CRC). CRC is the third most commonly diagnosed malignancy after breast and lung cancers, and is also the second leading cause of cancer-related mortality worldwide [3, 4]. In 2020, an estimated 1.93 million new CRC cases were diagnosed, which accounts for 10% of the global cancer incidence [5]. The increasing number of global CRC cases could be attributed to successful population-based screening and surveillance programs that have been rapidly and actively implemented [6, 7]. Nonetheless, CRC mortality is still high: 0.94 million deaths were recorded in 2020, accounting for 9.4% of cancer deaths globally [5]. Active health screening and CRC prevention activities have yielded an increasingly younger generation (below 50 years) of early-onset CRC in developed countries and an overall increase in CRC incidence detection in developing and emerging economic nations [8, 9]. Increased pathophysiological understanding of CRC progression and the advancement of treatment options, including endoscopic and surgical interventions, radiotherapy, immunotherapy, and targeted chemotherapy, have effectively prolonged survival years and improved the quality of life of CRC patients [9, 10]. The prognosis after CRC therapy is generally good when CRC is detected at a younger age; however, there are still huge public health challenges and financial burdens associated with CRC [9]. In 2015, the economic cost of CRC in Europe due to hospital-care costs, loss of productivity, premature death, and costs of informal care was estimated at 19 billion euros [11]. Furthermore, the underlying mechanisms and risk factors of early-onset CRC pathological features are sporadic, not fully understood, and require more research [9].

In this era of digital technology, the vast amount of high-quality CRC data (owing to an increase in the number of patients) can be rigorously collected through health information systems. This has enabled data science to offer a new avenue for enhancing knowledge of CRC through research and development. Currently, the extant evidence using machine-learning models has made great strides in predicting CRC based on available genetic data, which has shown that some CRC cases have a component of hereditary predisposition [12, 13]. However, a genetic disorder is a permanent and non-modifiable risk factor. In contrast, dietary control is one of the most effective protective measures against CRC that the population can modify [4, 14], especially because CRC susceptibility mainly results from adopting the dietary lifestyle associated with globalization [15, 16]. With the globalization of the food industry and supply chain, it is thus important for data science research to look into global diet features in relation to CRC prediction. In this study, we obtained global dietary-based data from publicly accessible databases and investigated the important dietary factors for predicting CRC labels using exploratory unsupervised and supervised ML-based models.

*Correspondence: Hanif Abdul Rahman, hanifr@umich.edu; hanif.rahman@ubd.edu.bn. University of Michigan, Ann Arbor, USA; PAPRSB Institute of Health Sciences, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei; Yarmouk University, Irbid, Jordan.

© The Author(s) 2023. Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Abdul Rahman et al. BMC Cancer (2023) 23:144

Methods
Dataset and data preprocessing
Several end-to-end procedures were systematically performed, as illustrated in Fig. 1. Dietary-related colorectal cancer data were obtained from the Centers for Disease Control and Prevention, the Global Dietary Database, and publicly accessible institutional sites [17-23]. The initial combined data covered 25 countries: Argentina, Bangladesh, Bulgaria, Canada, China, Korea, Ecuador, Estonia, Ethiopia, Finland, Germany, India, Iran, Israel, Kenya, Malaysia, Mexico, Mozambique, Philippines, Portugal, Sweden, Tanzania, Italy, Japan, and the United States.

Fig. 1 A schematic of the procedures undertaken in this study to classify CRC labels
The data collection methodology of these data sets was similar, i.e., cross-sectional, employing dietary questionnaires. The different sets of data were then merged and extrapolated based on the same dietary characteristics. Features that were not common across the data sets were excluded. This study only includes data sets that are in the English language. Features with different units of measurement were converted for standardization. A cleaning procedure was employed, including removal of ineligible cases, duplicate characteristics, and features with more than 50% missing values (listwise deletion). At this stage, a total of 3,520,586 valid data points remained. Due to computational limitations, a multi-stage, proportionate random sample of 109,342 was extracted for analysis, maintaining the percentage by country and CRC distribution, of which 7,326 (6.7%) cases were positive colorectal cancer labels derived from seven countries: Canada, India, Italy, South Korea, Mexico, Sweden, and the United States. A sample size of 5,000 cases was sufficient to achieve a power of 0.8 [24]. Considering that the computational capacity of our machine could handle up to 110,000 data points, we randomly selected the maximum data load for this study. Table 1 presents the characteristics of the data.

Missing data in these valid cases were handled using multiple imputation, MICE (Multivariate Imputation via Chained Equations), set at 10 multiple imputations to replace missing values with predicted values, using the R package *mice* [25]. The data set also consists of textual elements that describe the ingredients used, such as milk, salt, chicken, and so on. Texts were converted into corpus objects and processed for standardization, such as using English stop words, lower casing, and removal of punctuation. The corpus items were then converted to a document-term matrix to enable counting of the most frequently occurring terms (Fig. 2), which are illustrated as a word cloud (Fig. 3). The important terms were converted into a data frame that was subsequently merged with the full data set. The dataset also has an unbalanced binary CRC outcome, which was re-balanced using the Synthetic Minority Oversampling Technique (SMOTE) [26].

Feature selection
A two-step feature selection method was employed. Step one involves three separate procedures: logistic regression (LR), Boruta, and knockoff selection. LR was used to screen each single index to reduce redundant features by computing a stepwise iterative process of forward addition (adding important features to a null set of features) and backward elimination (removing worst-performing features from the list of complete features) using the stepAIC function in the MASS package [27]. Variable selection was determined by the most significant features (p < 0.05) in the most parsimonious model with the lowest Akaike Information Criterion (AIC). Next, a randomized wrapper method, Boruta, which iteratively removes features that are not statistically significant and less relevant than random probes, was employed [28]. Finally, knockoff selection based on the Benjamini-Hochberg False Discovery Rate (FDR) method was implemented, which controls the expected proportion of falsely rejected features in multiple significance testing [29].

Table 1 Data characteristics and sample statistics

                          Positive              Negative              Total
                          n         %           n         %           n         %
Overall                   7326      6.7         102,016   93.3        109,342   100
Country
  Canada                  6014      5.5         103,328   94.5        31,381    28.7
  India                   4702      4.3         104,640   95.7        18,807    17.2
  Italy                   14,652    13.4        94,690    86.6        8966      8.2
  South Korea             2406      2.2         106,936   97.8        16,292    14.9
  Mexico                  2406      2.2         106,936   97.8        10,387    9.5
  Sweden                  17,604    16.1        91,738    83.9        10,497    9.6
  United States           11,153    10.2        98,189    89.8        12,902    11.8
Gender
  Male                    7763      7.1         101,579   92.9        51,172    46.8
  Female                  6998      6.4         102,344   93.6        58,170    53.2
Age (years) [Mean (SD)]   48.9 (16.7)           36.4 (23.3)           41.6 (21.7)
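The study implemented these steps in R; as an illustrative sketch (not the authors' code), the Benjamini-Hochberg step-up rule underlying the knockoff selection can be written in a few lines of Python. The p-values below are hypothetical.

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure:
# reject the k smallest p-values, where k is the largest rank
# such that p_(k) <= (k / m) * q. Hypothetical p-values only.
def benjamini_hochberg(pvals, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank  # largest rank passing the step-up criterion
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
rejected = benjamini_hochberg(pvals, q=0.05)
# only the two smallest p-values survive the step-up thresholds
```

Note that the step-up rule scans all ranks and keeps the largest passing one, so a small p-value is never "rescued" by an even smaller one further down the list.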
Fig. 2 Frequent text items (1,000 occurrences) of the data

The FDR could be expressed as follows:

$$\mathrm{FDR} = \mathbb{E}\!\left[\frac{\#\,\text{false positives}}{\text{total number of selected features}}\right],$$

i.e., the expectation of the false discovery proportion. The final selection was determined based on variable importance using the Gini index, which is expressed as follows:

$$GI = \sum_{k} p_k (1 - p_k) = 1 - \sum_{k} p_k^2,$$

where $k$ is the number of classes.

The data set with the finalized features was then further processed using data normalization, to avoid the effects of extreme numeric ranges and to help obtain higher classification accuracy [30, 31]. The features were scaled as follows:

$$V' = \frac{V - \mathrm{Min}}{\mathrm{Max} - \mathrm{Min}},$$

where $V'$ is the scaled value corresponding to the original value $V$, and $\mathrm{Min}$ and $\mathrm{Max}$ are the lower and upper range limits. Finally, the features that intersected between the two-step variable selection procedures were selected as the most salient features to be used for unsupervised and supervised classification.

Fig. 3 A word cloud of the most frequent text items in the data set

Unsupervised techniques
Five types of unsupervised machine learning for nonlinear relationships were used to explore the dimensions of the data: t-distributed stochastic neighbor embedding (t-SNE), uniform manifold approximation and projection (UMAP), Apriori association rules, principal component analysis (PCA), and factor analysis (FA) [31, 32].

The t-SNE technique is a machine learning strategy for nonlinear dimensionality reduction that is useful for embedding high-dimensional data into lower-dimensional spaces. If the high-dimensional data ($N \times D$) are $x_1, x_2, ..., x_N$, then, for each pair $(x_i, x_j)$, t-SNE estimates the probabilities $p_{i,j}$ that are proportional to their corresponding similarities, $p_{j|i}$:
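The two scalar formulas above (min-max scaling and the Gini index) are straightforward to sketch; the following is an illustrative Python version with made-up values, not the authors' R implementation.

```python
# Sketch of the min-max normalization and Gini index formulas above.
def min_max_scale(values):
    """Scale values to [0, 1] via V' = (V - Min) / (Max - Min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def gini_index(class_probs):
    """GI = 1 - sum(p_k^2) over class probabilities p_k."""
    return 1.0 - sum(p * p for p in class_probs)

scaled = min_max_scale([10.0, 15.0, 20.0])  # -> [0.0, 0.5, 1.0]
gi = gini_index([0.5, 0.5])                 # maximal impurity for 2 classes
```

For a binary outcome, the Gini index peaks at 0.5 when both classes are equally likely and falls to 0 for a pure node, which is why it serves as a variable-importance criterion.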
$$p_{j|i} = \frac{\exp\!\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\!\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}.$$

t-SNE performs a binary search for the value $\sigma_i$ that produces a predefined perplexity value $\mathrm{perp}$. The perplexity of a discrete probability distribution $p$ is defined as an exponential function of the entropy $H(p)$ over all discrete events: $\mathrm{perp}(x) = 2^{H(p)} = 2^{-\sum_x p(x)\log_2 p(x)}$.

UMAP relies on local approximations of patches on the manifold to construct local fuzzy simplicial complex (topological) representations of the high-dimensional data. For example, if $S_1$ is the set of all possible 1-simplexes, denote by $\omega(e)$ and $\omega'(e)$ the weight functions of the 1-simplex $e$ in the high-dimensional space and in the corresponding lower-dimensional counterpart. Then the cross-entropy measure for the 1-simplexes is:

$$\sum_{e \in S_1} \Bigg[\underbrace{\omega(e)\,\log\!\left(\frac{\omega(e)}{\omega'(e)}\right)}_{\text{attractive force}} + \underbrace{\left(1 - \omega(e)\right)\log\!\left(\frac{1 - \omega(e)}{1 - \omega'(e)}\right)}_{\text{repulsive force}}\Bigg].$$

The iterative optimization process minimizes the objective function composed of all cross-entropies for all simplicial complexes using a strategy like stochastic gradient descent. The optimization balances the push-pull between the attractive forces between points, favoring larger values of $\omega'(e)$ (corresponding to small distances between the points), and the repulsive forces between the ends of $e$ when $\omega(e)$ is small (corresponding to small values of $\omega'(e)$).

The Apriori algorithm is based on a simple a priori belief that all subsets of a frequent item-set must also be frequent. We can measure a rule's importance by computing its support and confidence metrics. The support and confidence represent two criteria useful in deciding whether a pattern is "valuable." By setting thresholds for these two criteria, we can easily limit the number of interesting rules or item-sets reported. For item-sets $X$ and $Y$, the support of an item-set measures how (relatively) frequently it appears in the data:

$$\mathrm{support}(X) = \frac{\mathrm{count}(X)}{N},$$

where $N$ is the total number of transactions in the database and $\mathrm{count}(X)$ is the number of observations (transactions) containing the item-set $X$. In a set-theoretic sense, the union of item-sets is an item-set itself; that is, if $Z = X \cup Y$, then $\mathrm{support}(Z) = \mathrm{support}(X, Y)$.

For a given rule $X \to Y$, the rule's confidence measures the relative accuracy of the rule:

$$\mathrm{confidence}(X \to Y) = \frac{\mathrm{support}(X, Y)}{\mathrm{support}(X)}.$$

The confidence measures the joint occurrence of $X$ and $Y$ over the $X$ domain. If, whenever $X$ appears, $Y$ tends to also be present, then we will have a high $\mathrm{confidence}(X \to Y)$. Note that the ranges of the support and the confidence are $0 \le \mathrm{support}, \mathrm{confidence} \le 1$.

PCA (principal component analysis) is a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables through a process known as orthogonal transformation. In general, the formula for the first PC is $pc_1 = a_1^T X = \sum_{i=1}^{N} a_{i,1} X_i$, where $X_i$ is an $n \times 1$ vector representing a column of the matrix $X$ (representing a total of $n$ observations and $N$ features). The weights $a_1 = \{a_{1,1}, a_{2,1}, ..., a_{N,1}\}$ are chosen to maximize the variance of $pc_1$. According to this rule, the $k$-th PC is $pc_k = a_k^T X = \sum_{i=1}^{N} a_{i,k} X_i$, where $a_k = \{a_{1,k}, a_{2,k}, ..., a_{N,k}\}$ has to be constrained by more conditions:

- the variance of $pc_k$ is maximized;
- $\mathrm{Cov}(pc_k, pc_l) = 0, \ \forall\, 1 \le l < k$;
- $a_k^T a_k = 1$ (the weight vectors are unitary).

FA optimization relies on iterative perturbations with full-dimensional Gaussian noise and maximum-likelihood estimation, where every observation in the data represents a sample point in a higher-dimensional space. Whereas PCA assumes the noise is spherical, factor analysis allows the noise to have an arbitrary diagonal covariance matrix, and it estimates the subspace as well as the noise covariance matrix. Under FA, the centered data can be expressed in the following form:

$$x_i - \mu_i = l_{i,1} F_1 + ... + l_{i,k} F_k + \epsilon_i = L F + \epsilon,$$

where $i \in \{1, ..., p\}$, $j \in \{1, ..., k\}$, $k < p$, and the $\epsilon_i$ are independently distributed error terms with zero mean and finite variance.

Supervised classifiers
The data were split into 80% for training and 20% for testing. The data were trained using machine learning (ML) algorithms including a neural network (Neuralnet), k-nearest neighbors (kNN), a generalized linear model (GLM), and recursive partitioning (Rpart).

The Neuralnet model mimics the biological brain's response to multisource stimuli (inputs). When we have three signals (or inputs) $x_1$, $x_2$, and $x_3$, the first step is weighting the features ($w$'s) according to their importance. Then, the weighted signals are summed by the "neuron cell," and this sum is passed on according to an activation function denoted by $f$. The last step is generating an output $y$ at the end of the process. A typical output has the following mathematical relationship to the inputs:

$$y(x) = f\!\left(\sum_{i=1}^{n} w_i x_i\right).$$

The generalized linear model, specifically logistic regression, is a linear probabilistic classifier. It takes in the probability values for binary classification, in this case positive (1) and negative (0) CRC labels, and estimates class probabilities directly using the logit transform function [33].

Recursive partitioning (Rpart) is a decision tree classification technique that works well with variables that have a definite ordering and unequal distances. The tree is built similarly to a random forest, with a resultant complex model; however, the Rpart procedure also includes a cross-validation stage to trim the full tree back into nested terminals. The final sub-tree model provides the decision with the "best" or lowest estimated error [34].
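As a toy illustration of the neuron's weighted-sum-and-activation output $y(x) = f(\sum_i w_i x_i)$ described above, the following Python sketch uses a logistic activation and made-up weights (the study's actual networks were fitted with the R Neuralnet package).

```python
import math

# Illustrative single-neuron forward pass: weighted sum of inputs
# passed through a logistic activation f. Weights are hypothetical.
def neuron_output(weights, inputs):
    """y(x) = f(sum_i w_i * x_i), with f the logistic function."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

# s = 0.4*1.0 - 0.2*2.0 + 0.1*3.0 = 0.3, so y ≈ 0.574
y = neuron_output([0.4, -0.2, 0.1], [1.0, 2.0, 3.0])
```

A full network simply chains such units: each hidden node applies the same weighted-sum-plus-activation step to the outputs of the previous layer.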
The kNN classifier performs a two-step calculation. For a given $k$, a specific similarity metric $d$, and a new testing case $x$, it:

- runs through the whole training dataset ($y$), computing $d(x, y)$; let $A$ represent the $k$ closest points to $x$ in the training data $y$;
- estimates the conditional probability for each class, which corresponds to the fraction of points in $A$ with that given class label. If $I(z)$ is an indicator function with $I(z) = 1$ when $z$ is true and $0$ otherwise, then the testing data input $x$ gets assigned to the class with the largest probability, $P(y = j \mid X = x)$:

$$P(y = j \mid X = x) = \frac{1}{k} \sum_{i \in A} I\!\left(y^{(i)} = j\right).$$

Model validation and performance assessment
Unsupervised techniques were evaluated based on model visualization, as the best way to determine the suitability of the models, whereas the ML classifiers used specific parameters. The caret package was used for automated parameter tuning with the repeatedcv method, set at 15-fold cross-validation re-sampling repeated over 10 iterations [35]. The k-fold validation results and values were then used to calculate the confusion matrix that determines the measures of sensitivity, specificity, kappa, and accuracy. These measures were used to evaluate the performance of the ML-model classifiers and were calculated as follows:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN},$$

$$\mathrm{specificity} = \frac{TN}{TN + FP},$$

$$\mathrm{kappa} = \frac{P(a) - P(e)}{1 - P(e)},$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \frac{TP + TN}{\text{total number of observations}},$$

where a true positive ($TP$) is an observation correctly classified as "yes" or "success," a true negative ($TN$) is an observation correctly classified as "no" or "failure," a false positive ($FP$) is an observation incorrectly classified as "yes" or "success," and a false negative ($FN$) is an observation incorrectly classified as "no" or "failure" [31].

Fig. 4 Variable importance plot showing contribution of features to predicting colorectal cancer

Fig. 5 Stability of t-SNE 3D embedding (perplexity = 50) with six repeated (Rep) computations of the classification of no- and yes-colorectal cancer (CRC) labels

Fig. 6 Uniform Manifold Approximation (UMAP) 2D embedding model (n-neighbor = 5) (L) and UMAP prediction on testing data (R) in the classification of no- and yes-colorectal cancer labels

Results
Feature importance
The common features derived from the variable selection procedures yielded ten salient variables (Fig. 4) that are important contributors to CRC, including, in order of importance, fiber, total fat, cholesterol, age, vitamin E, saturated fats, monounsaturated fats, carbohydrates, and vitamin B12. These features were used in the next step of machine learning modeling.

Unsupervised learning
Among the unsupervised classifiers, t-SNE (Fig. 5) was the best performer. By visual inspection, t-SNE maintained good stability in classifying positive CRC labels over several repeated computations. UMAP (Fig. 6) prediction also appears able to distinguish positive and negative CRC labels. Apriori association rules (Fig. 7) were able to map the textual features correlated with positive CRC labels; the text items, in order of count, are listed in Table 2. PCA (Fig. 8) and FA (Table 3) showed that the data could be reduced to two dimensions, where CRC is negatively correlated with fiber and carbohydrates, and positively correlated with the rest of the features.

Fig. 7 Apriori association rules of text features associated with the labelled yes-colorectal cancer ("colrec_ca") (see Supplementary 1 for this interactive HTML widget)
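The confusion-matrix metrics defined in the Methods above can be sketched directly from the four cell counts; this is an illustrative Python version (the study computed them via caret in R), using hypothetical counts rather than the study's results.

```python
# Sketch: sensitivity, specificity, accuracy, and Cohen's kappa
# from confusion-matrix counts. The counts below are hypothetical.
def confusion_metrics(tp, tn, fp, fn):
    total = tp + tn + fp + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / total           # P(a), observed agreement
    # Chance agreement P(e): product of marginal "yes" rates plus
    # product of marginal "no" rates.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e)
    return sensitivity, specificity, accuracy, kappa

sens, spec, acc, kap = confusion_metrics(tp=90, tn=85, fp=15, fn=10)
# sens = 0.90, spec = 0.85, acc = 0.875
```

Kappa corrects the raw accuracy for the agreement expected by chance, which matters for imbalanced labels such as the CRC outcome here.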
Table 2 Summary description of Apriori association rules for the colorectal cancer ("colrec_ca") label

lhs => rhs | Support | Count
{Shortening, household, unspecified vegetable oil} => {colrec_ca} | 0.062 | 3870
{Margarine, tub, composite} => {colrec_ca} | 0.045 | 2818
{Egg, chicken, whole, fresh or frozen, raw} => {colrec_ca} | 0.036 | 2277
{Cheese, cheddar} => {colrec_ca} | 0.030 | 1903
{Salad dressing, mayonnaise, commercial, regular} => {colrec_ca} | 0.029 | 1821
{HARD CHEESE FETT 28%} => {colrec_ca} | 0.029 | 1804
{Butter, regular} => {colrec_ca} | 0.028 | 1777
{FAT BLEND FAT 75% FORTIFIED BREGOTT} => {colrec_ca} | 0.022 | 1360
{Beef, ground, medium, broiled} => {colrec_ca} | 0.018 | 1150
{Vegetable oil, canola and soybean} => {colrec_ca} | 0.018 | 1126
{Egg, whole, fried, with fat (Scrambled egg, no milk added)} => {colrec_ca} | 0.016 | 998
{Vegetable oil, olive} => {colrec_ca} | 0.015 | 971
{Lettuce, salad with assorted vegetables including tomatoes and/or carrots, no dressing (Lettuce salad, NFS)} => {colrec_ca} | 0.014 | 905
{Egg, whole, fried, with fat (Scrambled egg, no milk added)} => {Egg, chicken, whole, fresh or frozen, raw} | 0.012 | 751
{Egg, chicken, whole, fresh or frozen, raw} => {Egg, whole, fried, with fat (Scrambled egg, no milk added)} | 0.012 | 751
{colrec_ca, Egg, whole, fried, with fat (Scrambled egg, no milk added)} => {Egg, chicken, whole, fresh or frozen, raw} | 0.012 | 751
{colrec_ca, Egg, chicken, whole, fresh or frozen, raw} => {Egg, whole, fried, with fat (Scrambled egg, no milk added)} | 0.012 | 751
{Egg, chicken, whole, fresh or frozen, raw; Egg, whole, fried, with fat (Scrambled egg, no milk added)} => {colrec_ca} | 0.012 | 751
{Salad dressing, oil and vinegar, homemade} => {colrec_ca} | 0.011 | 674

Fig. 8 A bi-plot of principal component analysis on the most optimal number of dimensions in the data, where Group 1 is the no-cancer label and Group 2 is the cancer label
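The support and count columns of Table 2 follow directly from the support/confidence definitions in the Methods; as an illustrative sketch (not the study's arules code), they can be computed from a toy transaction list with hypothetical items.

```python
# Sketch: support and confidence of association rules over a tiny,
# hypothetical transaction database.
def support(transactions, items):
    """Fraction of transactions containing all of `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs ∪ rhs) / support(lhs)."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

tx = [
    {"egg", "cheese", "colrec_ca"},
    {"egg", "butter"},
    {"egg", "cheese", "colrec_ca"},
    {"fiber"},
]
s = support(tx, {"egg", "cheese"})                    # 2 of 4 transactions
c = confidence(tx, {"egg", "cheese"}, {"colrec_ca"})  # both co-occurrences lead to colrec_ca
```

Multiplying a rule's support by the number of transactions recovers the count column reported in Table 2.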
Table 3 Two-factor model in the dimensionality reduction procedure of the colorectal cancer data

                          Factor1     Factor2
Age                       0.178
Energy                    0.433       0.525
Carbohydrates             -0.121      0.972
Fiber                     -0.118      0.703
Total fat                 0.990       0.123
Monounsaturated fats      0.946       0.103
Omega-6                   0.512
Cholesterol               0.483
Vitamin B12               0.164       0.204
Linoleic acid             0.566
Colorectal cancer         0.655

Supervised learning
Model evaluation
In supervised classifiers, all techniques performed very well, with accuracy, kappa, sensitivity, and specificity above 0.90 (Fig. 9). It appeared that the neural network performed better than the rest. Accounting for the weight decay, the neural network model was optimal with a single layer of three hidden nodes, and we mapped out the schematic of the network, illustrated in Fig. 10.
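One way to read Table 3, following the factor model $x_i - \mu_i = LF + \epsilon$ from the Methods, is through each variable's communality, the variance it shares with the two factors, i.e., the sum of its squared loadings. A small illustrative sketch (blank cells treated as zero):

```python
# Sketch: communality of each variable under a two-factor model is
# the sum of squared loadings across the factors. Loadings are taken
# from Table 3; missing (blank) loadings are treated as 0.
loadings = {
    "total_fat": (0.990, 0.123),
    "carbohydrates": (-0.121, 0.972),
    "fiber": (-0.118, 0.703),
    "age": (0.178, 0.0),
}

def communality(l1, l2):
    return l1 * l1 + l2 * l2

comm = {name: communality(*ls) for name, ls in loadings.items()}
# total fat is almost fully explained by the two factors,
# while age is mostly unique variance.
```

High-communality variables such as total fat are well captured by the two-dimensional reduction; low-communality ones such as age retain mostly unique variance.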
Sensitivity analysis also revealed seven features in the neural network model for future consideration (Fig. 11).

Discussion
Key findings
In this study, we show that colorectal cancer can be predicted based on a list of important dietary data using supervised and unsupervised machine learning approaches. The excellent level of prediction in the present study is congruent with previous findings, where misclassification only ranged from 1 to 2% [36, 37]. These machine learning models can be used both as an early tool to identify individuals at risk and to predict the clinical outcomes of colorectal cancer [38, 39].

Dietary control is one of the most effective protective and modifiable measures that the population can adopt for cancer prevention. Dietary features can signal clues to the likelihood of early onset of specific types of colorectal cancer, such as in the distal colon and rectum [40]. In fact, a systematic review of studies over a period of 17 years concluded that there is strong evidence linking dietary factors with CRC risk; however, evidence on specific food group components in this relationship was limited [41]. The present study identified total fat, monounsaturated fats, linoleic acid, cholesterol, and omega-6 as dietary features moderately to highly correlated with positive colorectal cancer. In contrast, fiber and carbohydrates have a negative correlation with colorectal cancer cases. These features reflect the evidence from precision nutrition that a combination of dietary parameters, particularly those in the healthy eating index (such as whole fruit, saturated fats, grains), is more accurate than a single dietary index (such as the glycemic index) and is important in the modifiable behavior for cancer prevention [39, 42]. In addition, our text mining and Apriori algorithm also indicated that vegetables, eggs, margarine, and cheese have great impacts on colorectal cancer.

Although all classifiers were very good predictors of CRC labels, artificial neural networks had the best accuracy, true positives, and true negatives. The advantage of using neural networks over, for example, general linear models in cancer prediction is much lower uncertainty and better generalizability of the model [36, 43]. This is an important consideration, since machine learning algorithms have increasingly been used in many medical domains with varied success rates [44]. In addition, most or all data sets will have a clear imbalance between CRC and non-CRC labels. We used the SMOTE technique to balance the data set; otherwise, the machine learning models would predict all cases as non-CRC.

Fig. 9 A schematic of a neural network with a single hidden layer with three hidden nodes (L) and weight decay of the optimal hidden node parameter using repeated cross-validation (R)

Fig. 10 Sensitivity analysis of the three hidden node neural network model in relation to the mean and standard deviation (top), mean square difference among the input variables (middle), and density plots (bottom)
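The SMOTE balancing mentioned above interpolates synthetic minority samples between a minority point and one of its minority neighbours. The study used an R implementation; the following is only a minimal Python sketch of the idea, with hypothetical points, not the actual SMOTE algorithm's k-neighbour sampling.

```python
import random

# Minimal SMOTE-style sketch: each synthetic sample is a random
# interpolation between a minority point and its nearest minority
# neighbour. Points below are hypothetical.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def smote_like(minority, n_synthetic, rng):
    synthetic = []
    for _ in range(n_synthetic):
        x = rng.choice(minority)
        neighbour = min((m for m in minority if m is not x),
                        key=lambda m: euclidean(x, m))
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(xi + gap * (ni - xi)
                               for xi, ni in zip(x, neighbour)))
    return synthetic

rng = random.Random(0)
minority = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8)]
new_points = smote_like(minority, 5, rng)
# all synthetic points lie between existing minority samples
```

Because each synthetic point lies on a segment between two real minority samples, oversampling enriches the minority region without simply duplicating observations.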
Future that were not common had to be excluded from model work may need to consider controlling the sampling pro- development, which may result in confounding effects. cess to allow similar distribution of the two categories to The outcome label of CRC is based on detected cases minimize effects of down- or up- sampling. Another con - and may not reflect early onset, new onset, or delayed sideration is the age group of which this model is appli- onset of CRC as well as stratification of risk in different cable. Unlike previous studies that account only for older stages and types of CRC. Nevertheless, this study has people, this study includes younger adults in model train- narrowed down salient features that future researchers ing as well, therefore the models developed in this study could consider in a more holistic approach, particularly, may work well from young to older adults’ CRC predic- multi-dimensional that simultaneously accounts for diet, tion. With early and regular screening assisted by an opti- lifestyle, genetics, and related factors for CRC prediction. mal machine learning algorithm, the incidence of CRC can be reduced even further. Conclusion In this study, we concluded that a combination of unsu- Limitations pervised and supervised machine learning approaches The strength of this study lies in the large datasets con - can be used to explore the key dietary features for colo- sisting of cases from seven major countries. Due to rectal cancer prediction. To help with feasibility and computational constraints, we randomly sampled obser- practicality, the artificial neural network was found to vations to induce almost real-time estimates, model Abdul Rahman et al. BMC Cancer (2023) 23:144 Page 12 of 13 Fig. 
11 Box plot evaluates the performance metrics of different classifiers in the prediction of colorectal cancer based on dietary data Funding be the optimal algorithm with misclassification of CRC This study was partially supported by grants from NSF (1916425, 1734853, of 1% and misclassification of non-CRC of 3%, for more 1636840, 1416953)and NIH (UL1TR002240, R01CA233487, R01MH121079, effective cancer screening procedures. Furthermore, R01MH126137, T32GM141746). screening through dietary information can be used as Availability of data and materials a non-invasive procedure that can be applied in large The datasets generated and/or analyzed during the current study are available populations. Using optimal algorithms coupled with high upon reasonable request from the lead author, Dr Hanif Abdul Rahman. compliance to cancer screening will therefore signifi - cantly boost the success rate of cancer prevention. Declarations Ethics approval and consent to participate Supplementary Information Not applicable. The online version contains supplementary material available at https:// doi. org/ 10. 1186/ s12885- 023- 10587-x. Consent for publication Not applicable. Additional file 1. Competing interests The authors declare no competing interests. Acknowledgements The authors expressed their sincere thanks to all databases used in this study. Received: 5 September 2022 Accepted: 27 January 2023 Ethical guidelines Not applicable. Authors’ contributions HAR, MO, ID contributed to the conception or design of the paper. HAR References conducted the data analysis. All authors contributed to data interpretation 1. K. Hassibi, Machine learning vs. traditional statistics: different philoso - and drafting/editing the manuscript. All authors were involved in revising the phies, different approaches, (2016). Data Science Central. manuscript, providing critical comments, and agreed to be accountable for all 2. Stewart M. 
The actual difference between statistics and machine learning. Towards Data Science. 2019;24:19.
3. Bray F, Ferlay J, Soerjomataram I, Siegel RL, Torre LA, Jemal A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68:394–424.
4. Xi Y, Xu P. Global colorectal cancer burden in 2020 and projections to 2040. Transl Oncol. 2021;14:101174.
5. World Health Organization. Cancer. 2022. Retrieved 20 April 2022 from https://www.who.int/news-room/fact-sheets/detail/cancer.
6. Bénard F, Barkun AN, Martel M, von Renteln D. Systematic review of colorectal cancer screening guidelines for average-risk adults: summarizing the current global recommendations. World J Gastroenterol. 2018;24:124.
7. Schreuders EH, Ruco A, Rabeneck L, Schoen RE, Sung JJY, Young GP, Kuipers EJ. Colorectal cancer screening: a global overview of existing programmes. Gut. 2015;64:1637–49.
8. Araghi M, Soerjomataram I, Bardot A, Ferlay J, Cabasag CJ, Morrison DS, De P, Tervonen H, Walsh PM, Bucher O. Changes in colorectal cancer incidence in seven high-income countries: a population-based study. Lancet Gastroenterol Hepatol. 2019;4:511–8.
9. Guren MG. The global challenge of colorectal cancer. Lancet Gastroenterol Hepatol. 2019;4:894–5.
10. Dekker E, Tanis PJ, Vleugels JLA, Kasi PM, Wallace MB. Colorectal cancer. Lancet. 2019;394:1467–80.
11. Henderson RH, French D, Maughan T, Adams R, Allemani C, Minicozzi P, Coleman MP, McFerran E, Sullivan R, Lawler M. The economic burden of colorectal cancer across Europe: a population-based cost-of-illness study. Lancet Gastroenterol Hepatol. 2021;6:709–22.
12. Hossain MJ, Chowdhury UN, Islam MB, Uddin S, Ahmed MB, Quinn JMW, Moni MA. Machine learning and network-based models to identify genetic risk factors to the progression and survival of colorectal cancer. Comput Biol Med. 2021;135:104539.
13. Zhao D, Liu H, Zheng Y, He Y, Lu D, Lyu C. A reliable method for colorectal cancer prediction based on feature selection and support vector machine. Med Biol Eng Comput. 2019;57:901–12.
14. Bingham SA, Day NE, Luben R, Ferrari P, Slimani N, Norat T, Clavel-Chapelon F, Kesse E, Nieters A, Boeing H. Dietary fibre in food and protection against colorectal cancer in the European Prospective Investigation into Cancer and Nutrition (EPIC): an observational study. Lancet. 2003;361:1496–501.
15. Keum N, Giovannucci E. Global burden of colorectal cancer: emerging trends, risk factors and prevention strategies. Nat Rev Gastroenterol Hepatol. 2019;16:713–32.
16. Murphy N, Moreno V, Hughes DJ, Vodicka L, Vodicka P, Aglago EK, Gunter MJ, Jenab M. Lifestyle and dietary environmental factors in colorectal cancer susceptibility. Mol Aspects Med. 2019;69:2–9.
17. Centers for Disease Control and Prevention. National Health and Nutrition Examination Survey. 2022. Retrieved 20 April 2022 from https://www.cdc.gov/nchs/nhanes/index.htm.
18. Global Dietary Database. Microdata Surveys. 2018. Retrieved March 2022 from https://www.globaldietarydatabase.org/management/microdata-surveys.
19. U.S. National Library of Medicine, National Center for Biotechnology Information. dbGaP data. 2022. Retrieved March 2022 from https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/collection.cgi?study_id=phs001991.v1.p1.
20. Inter-university Consortium for Political and Social Research. Find Data. 2022. Retrieved March 2022 from https://www.icpsr.umich.edu/web/pages/.
21. China Health and Nutrition Survey. 2015. Retrieved March 2022 from https://www.cpc.unc.edu/projects/china.
22. Government of Canada. Canadian Community Health Survey. 2018. Retrieved March 2022 from https://www.canada.ca/en/health-canada/services/food-nutrition/food-nutrition-surveillance/health-nutrition-surveys/canadian-community-health-survey-cchs.html.
23. Data.world. 2022. Retrieved March 2022 from https://ourworldindata.org.
24. Naing L, Bin Nordin R, Abdul Rahman H, Naing YT. Sample size calculation for prevalence studies using Scalex and ScalaR calculators. BMC Med Res Methodol. 2022;22:209. https://doi.org/10.1186/s12874-022-01694-7.
25. Zhang Z. Multiple imputation with multivariate imputation by chained equation (MICE) package. Ann Transl Med. 2016;4:30.
26. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.
27. Ripley B, Venables B, Bates DM, Hornik K, Gebhardt A, Firth D, Ripley MB. Package 'MASS'. CRAN R. 2013;538:113–20.
28. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36:1–13.
29. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995;57:289–300.
30. Zhao M, Fu C, Ji L, Tang K, Zhou M. Feature selection and parameter optimization for support vector machines: a new approach based on genetic algorithm with feature chromosomes. Expert Syst Appl. 2011;38:5197–204.
31. Dinov ID. Data science and predictive analytics: biomedical and health applications using R. Springer; 2018.
32. Dinov ID. Data science and predictive analytics: biomedical and health applications using R. 2nd ed. Springer Series in Applied Machine Learning, ISBN 978-3-031-17482-7. Cham, Switzerland: Springer; 2023.
33. Myers RH, Montgomery DC. A tutorial on generalized linear models. J Qual Technol. 1997;29:274–91.
34. Therneau TM, Atkinson EJ. An introduction to recursive partitioning using the RPART routines. Technical report, Mayo Foundation. 1997;61:452.
35. Kuhn M, Wing J, Weston S, Williams A, Keefer C, Engelhardt A, Cooper T, Mayer Z, Kenkel B, R Core Team. Package 'caret'. 2020.
36. Nartowt BJ, Hart GR, Muhammad W, Liang Y, Stark GF, Deng J. Robust machine learning for colorectal cancer risk prediction and stratification. Front Big Data. 2020;3:6.
37. Hornbrook MC, Goshen R, Choman E, O'Keeffe-Rosetti M, Kinar Y, Liles EG, Rust KC. Early colorectal cancer detected by machine learning model using gender, age, and complete blood count data. Dig Dis Sci. 2017;62:2719–27.
38. Gründner J, Prokosch H-U, Stürzl M, Croner R, Christoph J, Toddenroth D. Predicting clinical outcomes in colorectal cancer using machine learning. In: MIE. 2018. p. 101–5.
39. Shiao SPK, Grayson J, Lie A, Yu CH. Personalized nutrition—genes, diet, and related interactive parameters as predictors of cancer in multiethnic colorectal cancer families. Nutrients. 2018;10:795.
40. Hofseth LJ, Hebert JR, Chanda A, Chen H, Love BL, Pena MM, Murphy EA, Sajish M, Sheth A, Buckhaults PJ. Early-onset colorectal cancer: initial clues and current views. Nat Rev Gastroenterol Hepatol. 2020;17:352–64.
41. Tabung FK, Brown LS, Fung TT. Dietary patterns and colorectal cancer risk: a review of 17 years of evidence (2000–2016). Curr Colorectal Cancer Rep. 2017;13:440–54. https://doi.org/10.1007/s11888-017-0390-5.
42. Li T, Zheng C, Zhang L, Zhou Z, Li R. Exploring the risk dietary factors for the colorectal cancer. In: IEEE International Conference on Progress in Informatics and Computing. IEEE; 2015. p. 570–3.
43. Abu Zuhri MAZ, Awad M, Najjar S, El Sharif N, Ghrouz I. Colorectal cancer risk factor assessment in Palestine using machine learning models. 2022.
44. Zheng L, Eniola E, Wang J. Machine learning for colorectal cancer risk prediction. In: 2021 International Conference on Cyber-Physical Social Intelligence. IEEE; 2021. p. 1–6.

Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
BMC Cancer – Springer Journals
Published: Feb 10, 2023