A Neural Network-Based Multi-Label Classifier for Protein Function Prediction

Engineering, Technology & Applied Science Research, Vol. 12, No. 1, 2022, pp. 7974-7981, https://doi.org/10.48084/etasr.4597

Shahab Tahzeeb (corresponding author), Dept. of Computer & Information Systems Engineering, NED University of Engineering & Technology, Karachi, Pakistan, stahzeeb@neduet.edu.pk
Shehzad Hasan, Dept. of Computer & Information Systems Engineering, NED University of Engineering & Technology, Karachi, Pakistan, shasan@neduet.edu.pk

Abstract—Knowledge of the functions of proteins plays a vital role in gaining a deep insight into many biological studies. However, wet lab determination of protein function is prohibitively laborious, time-consuming, and costly. These challenges have created opportunities for automated prediction of protein functions, and many computational techniques have been explored. These techniques entail excessive computational resources and turnaround times. The current study compares the performance of various neural networks on predicting protein function. These networks were trained and tested on a large dataset of reviewed protein entries from nine bacterial phyla, obtained from the Universal Protein Resource Knowledgebase (UniProtKB). Each protein instance was associated with multiple molecular function terms of the Gene Ontology (GO), making the problem a multilabel classification one. The results on this dataset showed the superior performance of single-layer neural networks having a modest number of neurons. Moreover, a useful set of features that can be deployed for efficient protein function prediction was discovered.

Keywords—gene ontology; molecular function term; multi-label classification; neural network; protein function prediction

I. INTRODUCTION

Understanding proteins' functions plays a vital role in acquiring insights into the molecular mechanisms operating in both physiological and diseased conditions. As a result, this understanding substantiates the discovery of drugs for different diseases [1]. However, predicting protein functions is an arduous task. The fact is markedly implied by the incredibly large number of unannotated protein entries hosted by the most comprehensive protein database, the Universal Protein Resource Knowledgebase (UniProtKB) [2]. This is mainly due to the reliance on traditional experimental annotation techniques carried out by molecular biologists. The gap between reviewed and unreviewed protein sequences is widening due to the data deluge from high-throughput state-of-the-art sequencing techniques [1, 3-5]. The pressing demand for computational methods for the functional annotation of proteins has paved the way for significant contributions by computer science researchers. Many computational techniques employing machine learning for the functional annotation of proteins have been utilized in the literature. The principal difference between the various approaches lies in the set of features pursued by different investigators. This section presents a brief summary of some of the most prominent efforts in this area.
An ensemble of Deep Neural Networks (DNNs) was proposed in [1], where each DNN worked on a different set of features from the dataset. The predictions of the different DNNs were then voted to arrive at the final protein function prediction. A DNN for the hierarchical multilabel classification of protein functions, designed to perform well even with a limited number of training samples, was presented in [3]. In [4], a DNN was introduced to learn features from word embeddings of protein sequences, based on the concept of Natural Language Processing (NLP), using sequence similarity profiles as additional features. Authors in [5] established the efficacy of exploiting interrelationships among different functional terms. For instance, different functional classes were found to coexist in some proteins, suggesting a mutual relationship. Furthermore, a quantification model of these relations was proposed, using a functional similarity measure and a framework to capitalize on it for the eventual prediction of protein functions. A classification technique based on a neural network coupled with a Support Vector Machine (SVM) was demonstrated in [6], utilizing a bi-directional Long Short-Term Memory (LSTM) network to generate fixed-length protein vectors out of variable-sized sequences and deal with the challenges posed by the variable length of protein sequences. In [7], protein sequence motifs were used to build a deep convolutional network and predict protein function, while the authors claimed to have built the best performing model for the cellular component classes. The significance of Protein-Protein Interaction (PPI) and time-course gene expression data as powerful predictors of protein function was shown in [8]. A method called Dynamic Weighted Interactome Network (DWIN) was proposed which, in addition to PPI and gene expression data, also took into account information related to protein domains and complexes to improve the prediction performance. In [9], clustering was applied on a PPI network for the prediction of protein function. A protein graph model was shown in [10], constructed from the protein structure, with each node representing a cluster of amino acid residues. However, the idea of using an accuracy metric for evaluation is generally misleading. In [11], an active learning approach was explored for the prediction of protein function using a PPI network. This method operated in two phases: spectral clustering was used to cluster the PPI network, followed by the application of the betweenness centrality measure for labeling within each cluster, and then the labeled protein data were used by a classification algorithm. Associations between functions in a PPI network were used in [12], stating that multiple function labels assigned to proteins are not independent and that their coexistence can be used effectively to predict protein function. A deep semantic text representation was presented in [13], with various pieces of information extracted from protein sequences such as homology, motifs, and domains. Protein function prediction was carried out using a consensus between text-based and sequence-based methods. In [14], a classifier using cumulative iterations was proposed, based on semantic similarity between Gene Ontology (GO) terms. Each prediction was followed by updating and optimizing the scores of characteristic terms in the set of GO annotations, which, in turn, led to improved future predictions. The dissimilarity of protein functions, rather than conventional similarity measures, was used in [15] to segregate rare and frequently occurring classes of functions. This technique worked well for imbalanced datasets.

The notable contributions cited above are just a handful of numerous praiseworthy efforts towards the prediction of protein function. These endeavors differ in terms of the protein information utilized by the corresponding systems and the computational or time complexities of the classification models. The current paper presents a neural network-based multi-label classifier for the prediction of protein function by training and testing several neural networks on a large dataset [16]. The results indicate that a neural network with a single hidden layer achieved remarkable prediction performance with nominal computational complexity. This makes its implementation viable on systems with modest hardware capabilities, and consequently the time required for the classification task is in the order of seconds.
II. MATERIALS AND METHODS

A. Dataset
The dataset adopted from [16] includes 121,378 protein instances. These labeled protein examples were extracted from UniProtKB [2], a comprehensive worldwide repository of protein information. The protein entries pertain to 9 bacterial phyla, namely Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes, and Tenericutes. Each instance in the dataset has 9,890 features. These features include the sequence of amino acids making up the corresponding protein; the compositions of amino acids, dipeptides, and tripeptides; the compositions of five groups of amino acids, i.e. aliphatic, aromatic, positively charged, negatively charged, and uncharged; and various structural and physiochemical properties derived from the amino acid sequence. In addition, some features quantify conjoint triads. A conjoint triad is a unit of three successive amino acids such that each amino acid in the unit belongs to one of seven groups formed on the basis of the dipole and volume scale [17]. These characteristic values indicate the strength of interaction between the amino acids of these 7 groups. The feature set also contains pseudo amino acid compositions for the corresponding protein. As suggested in [18], these numbers overcome the loss of the sequence order effect in a protein caused by considering just plain amino acid compositions. Moreover, 541 motifs are also included in the features. These are small segments in proteins' tertiary structure that are frequently found in different proteins, and these similar patterns are associated with the structural or functional roles of proteins.

There are 1,739 binary labels associated with each protein instance. These labels correspond to GO terms belonging to the Molecular Function (MF) category. The GO is a categorization of biological functions using three broad classes, i.e. Molecular Function (MF), Cellular Component (CC), and Biological Process (BP), generally referred to as GO terms [19]. A molecular function term specifies a biochemical activity performed by a gene product, without taking into account the time and space dimensions of this activity; an enzyme is an example of an MF term. The CC refers to the location in the cell of the biochemical activity of a gene product; ribosome and nuclear membrane are two such examples. BP, an all-encompassing term, defines a biological objective to which the activities of various gene products contribute; cell growth and maintenance serve as examples of the BP term.

B. Data Preprocessing
The Comma Separated Values (CSV) files for the 9 bacterial phyla were combined to obtain a single data frame object using the Pandas data analysis library in Python [20]. Duplicate rows were removed from the data frame, which was then converted to an array using the scientific computing library NumPy [21]. The feature values were then scaled using the standard scaler available in the scikit-learn library [22]. Data scaling was also investigated using normalization and the robust scaler, but these scaling techniques proved inferior to the standard scaling technique.
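The preprocessing steps above can be expressed compactly with the cited libraries [20-22]. The following is a minimal sketch rather than the authors' actual script: the CSV file pattern, the assumption that the 1,739 GO labels occupy the last columns, and the name of the raw sequence column are all hypothetical.

```python
import glob

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Combine the per-phylum CSV files into one data frame (file pattern is hypothetical).
frames = [pd.read_csv(path) for path in sorted(glob.glob("phylum_*.csv"))]
data = pd.concat(frames, ignore_index=True)

# Remove duplicate rows, as described in subsection B.
data = data.drop_duplicates().reset_index(drop=True)

# Split into features and the 1,739 binary GO labels; the label columns are assumed
# to be the last ones, and the raw sequence string column (if present) is dropped.
label_cols = data.columns[-1739:]
feature_cols = data.columns.difference(label_cols).drop(["sequence"], errors="ignore")

X = data[feature_cols].to_numpy(dtype=float)   # NumPy array of feature values
Y = data[label_cols].to_numpy(dtype=int)       # NumPy array of 0/1 labels

# Standardize the features to zero mean and unit variance, the scaling the authors kept.
X = StandardScaler().fit_transform(X)
```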
C. Features Partitioning
The neural networks were trained on 3 sets of features. The objective of partitioning the features into subsets was to test the hypothesis that the compositions of amino acids, dipeptides, and tripeptides are sufficient to predict protein functions. F = {F1, F2, F3} represents the feature sets used to train the different models, where F1 is the entire set of 9,890 features and F2 is the set of 8,420 features that contains only the compositions of amino acids, dipeptides, and tripeptides. The set F3 = F1 − F2 contains the 1,470 features consisting of the various properties and characteristics derived from the proteins, as described in subsection A.
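As a consistency check, the 20 amino acid, 400 dipeptide, and 8,000 tripeptide composition features add up exactly to the 8,420 features of F2. The sketch below illustrates the partitioning on the scaled matrix X from the previous snippet; the assumption that the composition columns are contiguous and come first is purely illustrative, since the actual column order of the dataset [16] is not stated here.

```python
import numpy as np

n_features = X.shape[1]              # 9,890 features in the full set F1
composition_idx = np.arange(8420)    # hypothetical: composition columns come first

X_F1 = X                                                             # F1: entire feature set
X_F2 = X[:, composition_idx]                                         # F2: 8,420 composition features
X_F3 = X[:, np.setdiff1d(np.arange(n_features), composition_idx)]    # F3 = F1 - F2: 1,470 derived features
```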
D. Neural Networks
A variety of neural networks was selected, differing in the number of hidden layers and the number of neurons in each layer, to train the protein function classification system on the datasets corresponding to each feature set F1, F2, and F3. The experimental results are given in Section III. It was observed that the simplest neural network, containing a single hidden layer, demonstrated better performance on this dataset than neural networks having more hidden layers. The optimal number of neurons in this single hidden layer was experimentally determined to be just 5% of the total input and output neurons for the best performing networks on the F1 and F2 feature sets. For the F3 feature set, however, the optimal number of neurons in the single hidden layer of the best performing network turned out to be 50% of the total input and output neurons.

Once the optimal number of neurons in a single hidden layer was determined, another hidden layer was added to observe any potential boost in performance. The number of neurons in the second hidden layer was chosen to be 50% of those in the first hidden layer. This was done to ensure that the network captured the most important features for prediction.

Table I summarizes the single-hidden-layer neural networks trained and tested on the F1 feature set, i.e. the entire set of features from the dataset. The reference computer for all time and memory size measurements presented here is a 6-core Core i7-8700 processor at 3.2 GHz.

TABLE I. SINGLE-HIDDEN LAYER MODELS TRAINED ON THE F1 FEATURE SET
Model   Neurons (1)   Size (MB)   Training time (sec)   Prediction time (sec)
M1      1             16          1950                  2.25
M2      5             77          2655                  6.03
M3      10            154         3920                  7.51
M4      15            232         4860                  9.75
M5      25            386         8820                  13.6
M6      50            773         12175                 21.8
M7      75            1130        15400                 31.6
(1) Expressed as a percentage of input + output neurons.

Table II presents the M8 neural network with two hidden layers trained on F1. This model was constructed by adding another hidden layer to the best performing single-hidden-layer network M2 to explore any performance gain. The second hidden layer had 50% of the neurons of the first hidden layer in an attempt to capture the features best suited for the prediction task.

TABLE II. TWO-LAYER MODEL TRAINED ON THE F1 FEATURE SET
Model   Size (MB)   Training time (sec)   Prediction time (sec)
M8      74          2120                  5.52

Table III summarizes the single-hidden-layer neural networks trained and tested on the F2 feature set, i.e. the compositions of amino acids, dipeptides, and tripeptides in the protein sequence.

TABLE III. SINGLE-HIDDEN LAYER MODELS TRAINED ON THE F2 FEATURE SET
Model   Neurons (1)   Size (MB)   Training time (sec)   Prediction time (sec)
M9      5             60          2760                  5
M10     10            118         3480                  6.31
M11     25            295         7920                  11.1
M12     50            590         15000                 17.8
M13     60            708         15855                 20.4
(1) Expressed as a percentage of input + output neurons.

Table IV presents the M14 neural network containing two hidden layers and trained on F2. This model was developed by adding another hidden layer to the best performing single-hidden-layer network M9 to investigate any improvement in classifier performance. The second hidden layer had 50% of the neurons of the first hidden layer to exploit the predictors best suited for the prediction task.

TABLE IV. TWO-LAYER MODEL TRAINED ON THE F2 FEATURE SET
Model   Size (MB)   Training time (sec)   Prediction time (sec)
M14     56          2050                  4.76

Table V summarizes the single-hidden-layer neural networks trained and tested on the F3 feature set, i.e. the features consisting of the various properties and characteristics derived from the proteins.

TABLE V. SINGLE-HIDDEN LAYER MODELS TRAINED ON THE F3 FEATURE SET
Model   Neurons (1)   Size (MB)   Training time (sec)   Prediction time (sec)
M15     5             6           2750                  1.33
M16     10            12          3200                  1.58
M17     25            30          5200                  2.92
M18     30            35          5270                  3.05
M19     50            60          6630                  4.20
M20     60            71          7320                  4.59
(1) Expressed as a percentage of input + output neurons.

Table VI presents the M21 neural network having two hidden layers and trained on F3. This model was generated by adding another hidden layer to the best performing single-hidden-layer network M19 to discover any potential performance enhancement. The number of neurons in the second hidden layer was chosen to be 50% of the first hidden layer to capitalize on the features best suited for the classification.

TABLE VI. TWO-LAYER MODEL TRAINED ON THE F3 FEATURE SET
Model   Size (MB)   Training time (sec)   Prediction time (sec)
M21     58          5720                  4.64

For each network, we employed the relu activation for the hidden layers, the sigmoid activation for the output layer, the he_uniform kernel initializer for the hidden layers, and the Adaptive moment estimation (Adam) optimizer with a learning rate of 0.00001.
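Those hyperparameters map naturally onto a Keras definition. The sketch below builds a single-hidden-layer model of the M2 type for F1; the use of Keras/TensorFlow and of binary cross-entropy as the loss are assumptions, since the paper specifies only the activations, the initializer, and the optimizer.

```python
import tensorflow as tf

n_inputs, n_outputs = 9890, 1739                      # F1 features and GO labels
hidden_units = round(0.05 * (n_inputs + n_outputs))   # 5% of input + output neurons (M2)

model = tf.keras.Sequential([
    # Single hidden layer with relu activation and he_uniform initialization.
    tf.keras.layers.Dense(hidden_units, activation="relu",
                          kernel_initializer="he_uniform",
                          input_shape=(n_inputs,)),
    # Sigmoid output layer: one independent probability per GO term (multi-label setting).
    tf.keras.layers.Dense(n_outputs, activation="sigmoid"),
])

# Binary cross-entropy is assumed here as the loss for independent binary labels.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss="binary_crossentropy")
```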
E. Performance Evaluation
Since a protein example in this dataset can be mapped to more than one binary label, the prediction of protein function is a multilabel classification problem. The dataset is also highly imbalanced due to the overwhelming number of negative examples for each label. The evaluation of such a classification model cannot simply rely on the accuracy of prediction [23, 24]. For example, if a negative class is abundantly prevalent among all examples in an imbalanced dataset, then a naive classifier predicting this class for all examples will easily achieve very high accuracy. The challenge of this inflated accuracy measure becomes aggravated in the case of multilabel classification of imbalanced datasets. This problem was addressed by defining more meaningful performance measures, namely precision, recall, F1 score, zero-one loss, hamming loss, and the Matthews Correlation Coefficient. These measures are defined below.

1) Precision
Precision is defined as the fraction of positively classified instances that are, in effect, positive. This gives a clear picture of a classifier's strength in predicting positive classes. Letting TP and FP respectively denote the counts of true and false positives, Precision is calculated as:

P = \frac{TP}{TP + FP}    (1)

The precision of a predict-the-majority-class-for-all classifier is thus 0, which judiciously penalizes its failure to predict the positive minority class. However, any classifier that makes just one positive prediction and ensures its correctness would have 100% precision despite its failure to predict the other positive examples. This calls for another classification metric, called Recall, also known as sensitivity.

2) Recall
Recall is defined as the fraction of positive examples in the dataset that are classified as positive. Letting FN denote the number of false negatives, Recall is given by:

R = \frac{TP}{TP + FN}    (2)

This measure penalizes a classifier that attempts to achieve high precision simply by making a few correct positive predictions.

3) F1 Score
Precision and recall are combined in a single performance measure called the F1 score, which is their harmonic mean:

F1 = \frac{2 \cdot P \cdot R}{P + R}    (3)

As the harmonic mean is biased towards lower values, the F1 score can be high only when both precision and recall are high.
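As a purely hypothetical numeric illustration of (1)-(3): a classifier that makes 10 positive predictions of which 8 are correct (TP = 8, FP = 2) while missing 4 actual positives (FN = 4) obtains P = 8/10 = 0.80, R = 8/12 ≈ 0.67, and F1 = 2(0.80)(0.67)/(0.80 + 0.67) ≈ 0.73, showing how the harmonic mean is pulled towards the weaker of the two scores.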
In multi-label classification, there are several ways to average the aforementioned performance metrics over all labels [25, 26]: the micro average, the macro average, the weighted average, and the samples average, as defined below. In each case, the F1 score is, as usual, the harmonic mean of the corresponding precision and recall.

4) Micro Average
This is calculated by counting the number of True Positives (TPs) across the entire set of target labels. If there are N samples in the dataset and each sample has L binary target labels, then the micro averages of Precision and Recall are calculated as:

P_{micro} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{L} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{i=1}^{N} \sum_{j=1}^{L} Y^{pred}_{ij}}    (4)

R_{micro} = \frac{\sum_{i=1}^{N} \sum_{j=1}^{L} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{i=1}^{N} \sum_{j=1}^{L} Y^{true}_{ij}}    (5)

where Y_pred and Y_true are the predicted and actual target labels, respectively. The conjunction operator \wedge ensures the inclusion of only those label instances that are positive in both Y_pred and Y_true, i.e. the TPs.

5) Macro Average
This averages the Precision and Recall scores of the individual target labels, giving equal weight to all of them:

P_{macro} = \frac{1}{L} \sum_{j=1}^{L} \frac{\sum_{i=1}^{N} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{i=1}^{N} Y^{pred}_{ij}}    (6)

R_{macro} = \frac{1}{L} \sum_{j=1}^{L} \frac{\sum_{i=1}^{N} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{i=1}^{N} Y^{true}_{ij}}    (7)

6) Weighted Average
This averages the Precision and Recall scores of the individual target labels, using the number of positive instances of each label in the set Y_true as its weight:

P_{weighted} = \frac{1}{\sum_{j=1}^{L} w_j} \sum_{j=1}^{L} w_j \, \frac{\sum_{i=1}^{N} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{i=1}^{N} Y^{pred}_{ij}}    (8)

R_{weighted} = \frac{1}{\sum_{j=1}^{L} w_j} \sum_{j=1}^{L} w_j \, \frac{\sum_{i=1}^{N} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{i=1}^{N} Y^{true}_{ij}}    (9)

where w_j denotes the weight, also known as the support, of the j-th label.

7) Samples Average
This averages the Precision and Recall scores across the samples:

P_{samples} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{L} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{j=1}^{L} Y^{pred}_{ij}}    (10)

R_{samples} = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{L} (Y^{pred}_{ij} \wedge Y^{true}_{ij})}{\sum_{j=1}^{L} Y^{true}_{ij}}    (11)

This is the most faithful as well as the most conservative performance indicator of the multi-label classifier, as it reflects, on average, how well the classifier performed on each sample. Therefore, the samples averages were used to gauge the performance of the models.

8) Zero-One Loss
For a multi-label classification problem, this measure credits a prediction as correctly classified only when all labels are correctly classified. The loss is zero for a correct prediction. However, if the classifier fails to make a correct prediction for even just one target label, the corresponding loss is 1. It follows that the zero-one loss is a conservative and highly penalizing performance measure:

ZOL = \frac{1}{N} \sum_{i=1}^{N} \left[ 1 - \prod_{j=1}^{L} \left( 1 - (Y^{pred}_{ij} \oplus Y^{true}_{ij}) \right) \right]    (12)

The combination of the product operator \prod and the exclusive-OR operator \oplus ensures that any mismatch between predicted and target labels generates a loss of 1 for the given sample; otherwise, the loss is zero for a complete match between all predicted and target labels of that sample.

9) Hamming Loss
This gives the fraction of all incorrectly predicted labels, quantifying the number of incorrect label predictions rather than penalizing whole examples. Hence, if a multi-label classifier incorrectly predicts 1 out of 10 labels for a given instance, the hamming loss for that example is just 1/10, compared to 1 in the case of the zero-one loss. It follows that the hamming loss is lenient compared to the stringent zero-one loss:

HL = \frac{1}{N \times L} \sum_{i=1}^{N} \sum_{j=1}^{L} (Y^{pred}_{ij} \oplus Y^{true}_{ij})    (13)

10) Matthews Correlation Coefficient
The Matthews Correlation Coefficient (MCC) is a binary version of Pearson's correlation coefficient [27]. However, multiclass classification problems can also benefit from its extended version [28]. The MCC compares the ground truth and predicted vectors, considering all possibilities of prediction, i.e. True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Therefore, it gives a balanced evaluation of the performance of the classifier. The coefficient lies in the range [-1, +1], with -1 for completely wrong prediction, 0 for random prediction, and +1 for perfect prediction:

MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}    (14)

The MCC was calculated for every example, and its average was used to assess the performance of the classifier on the entire dataset.

11) Consolidated Performance Metric
For the sake of an all-encompassing and more realistic comparison of performance, the aforementioned metrics were combined into a single Consolidated Performance Metric (CPM) as follows:

CPM = \frac{F1_{samples} \times MCC}{ZOL \times HL}    (15)

The CPM is constructed so that higher values indicate better performance.
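All of the above measures are available in, or easily assembled from, scikit-learn [22]. The sketch below computes the samples-averaged metrics and the CPM of (15) for 0/1 arrays Y_true and Y_pred of shape (N, L); it is an illustration rather than the authors' code, and the per-sample MCC loop follows the description in 10) rather than a single library call. The zero-one loss is expressed as a percentage, which appears to be the convention used for the ZOL column of Table VII.

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             zero_one_loss, hamming_loss, matthews_corrcoef)

def evaluate(Y_true, Y_pred):
    """Samples-averaged metrics plus the Consolidated Performance Metric of (15)."""
    P = precision_score(Y_true, Y_pred, average="samples", zero_division=0)
    R = recall_score(Y_true, Y_pred, average="samples", zero_division=0)
    F1 = f1_score(Y_true, Y_pred, average="samples", zero_division=0)
    ZOL = 100 * zero_one_loss(Y_true, Y_pred)   # fraction of samples with any wrong label, in percent
    HL = hamming_loss(Y_true, Y_pred)           # fraction of all label predictions that are wrong
    # MCC computed per sample and then averaged, as described in subsection 10).
    MCC = float(np.mean([matthews_corrcoef(t, p) for t, p in zip(Y_true, Y_pred)]))
    CPM = (F1 * MCC) / (ZOL * HL)               # equation (15): higher is better
    return {"P": P, "R": R, "F1": F1, "ZOL": ZOL, "HL": HL, "MCC": MCC, "CPM": CPM}
```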
III. RESULTS

Table VII shows the performance details of the neural networks M1 to M21 as the samples averages of Precision (P), Recall (R), F1 score (F1), Zero-One Loss (ZOL), Hamming Loss (HL), MCC, and CPM. A wide variety of neural networks was trained and tested on a large dataset of proteins. For the feature sets F1 and F2, single-hidden-layer networks with 5% of the total input and output neurons exhibited better prediction performance than the other single-layer models; these models were designated as M2 and M9, respectively. However, for the feature set F3, the optimal count of neurons in the hidden layer emerged as 50% of the total input and output neurons; this model was designated as M19. The bar graphs in Figures 1-3 compare the CPM of the various neural networks working on a specific feature set. In each case, the best performing single-layer network was extended by adding a second hidden layer to assess any performance edge. The neurons in the second hidden layer were 50% of those in the first hidden layer, to ensure that the most relevant features for prediction play their due role. The blue bars in Figures 1-3 represent the performance of the best performing single-layer networks, while the yellow bars show the performance of the 2-layer neural networks.

TABLE VII. PERFORMANCE OF NEURAL NETWORKS
Model   P      R      F1     ZOL (%)   HL       MCC      CPM
M1      0.96   0.95   0.95   12.25     0.0159   0.9518   4.6423
M2      0.96   0.96   0.96   11.36     0.0120   0.9576   6.7437
M3      0.96   0.96   0.95   11.64     0.0122   0.9571   6.4028
M4      0.96   0.96   0.95   11.76     0.0122   0.9566   6.3341
M5      0.96   0.96   0.95   11.90     0.0123   0.9568   6.2100
M6      0.96   0.96   0.95   12.55     0.0129   0.9546   5.6016
M7      0.96   0.96   0.95   12.61     0.0132   0.9539   5.4442
M8      0.96   0.95   0.95   12.21     0.0128   0.9535   5.7959
M9      0.94   0.94   0.93   15.85     0.0173   0.9341   3.1681
M10     0.94   0.94   0.93   16.47     0.0179   0.9329   2.9429
M11     0.93   0.94   0.93   16.70     0.0180   0.9318   2.8828
M12     0.93   0.94   0.93   17.30     0.0186   0.9298   2.6873
M13     0.93   0.93   0.93   17.07     0.0183   0.9297   2.7678
M14     0.93   0.93   0.92   17.17     0.0188   0.9279   2.6446
M15     0.96   0.95   0.95   12.43     0.0129   0.9535   5.6492
M16     0.96   0.96   0.96   11.34     0.0117   0.9591   6.9396
M17     0.97   0.96   0.96   10.43     0.0108   0.9614   8.1935
M18     0.97   0.96   0.96   10.66     0.0110   0.9614   7.8709
M19     0.97   0.96   0.96   10.34     0.0107   0.9625   8.3516
M20     0.97   0.96   0.96   10.39     0.0108   0.9621   8.2310
M21     0.97   0.96   0.96   10.95     0.0114   0.9593   7.3775

Fig. 1. Neural networks' comparison on the F1 feature set.
Fig. 2. Neural networks' comparison on the F2 feature set.
Fig. 3. Neural networks' comparison on the F3 feature set.

F. Optimal Feature Set
The experiments also focused on exploring an optimal set of features for the prediction of protein function. The F3 feature set proved to be the best predictor for this multi-label classification. Figure 4 shows a comparison of the best-performing models for each feature set, where M19 on F3 achieved the best performance.

Fig. 4. Performance comparison of the best performing single-layer models on the different feature sets. F3 proves to be the best predictor set.
G. Classification Threshold
The impact of the classification threshold on the performance of a classifier was also examined. The models predict the probability of each target label associated with every instance. These probability values quantify the chance that a given instance belongs to a particular class, and they must be translated into binary labels 0 and 1 before the final evaluation of the model. This conversion requires a threshold, or probability cutoff, below which values are classified as class 0, while equal or greater values are classified as class 1. Classifier performance metrics are profoundly influenced by the choice of this threshold, and the impact is more pronounced for imbalanced datasets. As the examined dataset is skewed towards more negative examples for each label, the performance of the models was evaluated for various threshold values. Figures 5 and 6 show two example plots of the samples averages of P, R, and F1 score for models M2 and M19, respectively, against a range of classification thresholds.

Fig. 5. Performance curves of M2 for the F1 feature set. The choice of classification threshold plays an important role in improving the performance of the classifier.
Fig. 6. Performance curves of model M19 for the F3 feature set.
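Curves like those in Figures 5 and 6 can be produced by sweeping the cutoff over the predicted probabilities. This is an illustrative sketch only; probs stands for the per-label probabilities returned by the trained model on the test set, and the threshold grid is arbitrary.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def sweep_thresholds(Y_true, probs, thresholds=np.linspace(0.05, 0.95, 19)):
    """Samples-averaged P, R, and F1 at each probability cutoff (cf. Figures 5 and 6)."""
    curves = []
    for t in thresholds:
        Y_pred = (probs >= t).astype(int)   # values >= threshold are assigned class 1
        curves.append((t,
                       precision_score(Y_true, Y_pred, average="samples", zero_division=0),
                       recall_score(Y_true, Y_pred, average="samples", zero_division=0),
                       f1_score(Y_true, Y_pred, average="samples", zero_division=0)))
    return curves
```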
H. Confusion Matrix
The confusion matrix is a visualization of a classifier's performance, as it gives the counts of TP, FP, TN, and FN class predictions. Figure 7 presents the confusion matrices of selected labels in the test dataset for the model M19, which performed best on the F3 feature set. To highlight the strength of the classifier, only labels whose support exceeded 1,000 in the dataset were chosen.

Fig. 7. Confusion matrices of the best-performing model M19 for F3. Only labels having support of more than 1,000 are included.
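Per-label confusion matrices such as those in Figure 7 can be obtained with scikit-learn's multilabel_confusion_matrix; the support filter mirrors the one described above. The snippet is a sketch, not the authors' plotting code.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# One 2x2 matrix per GO term, laid out as [[TN, FP], [FN, TP]].
cms = multilabel_confusion_matrix(Y_true, Y_pred)

# Keep only the labels whose support (positive instances in Y_true) exceeds 1,000,
# as was done for Figure 7.
support = Y_true.sum(axis=0)
for j in np.where(support > 1000)[0]:
    tn, fp, fn, tp = cms[j].ravel()
    print(f"label {j}: TP={tp}  FP={fp}  FN={fn}  TN={tn}")
```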
Guan, "Active learning for protein This study culminated in two significant findings regarding function prediction in protein–protein interaction networks," Neurocomputing, vol. 145, pp. 44–52, Dec. 2014, https://doi.org/ the examined protein dataset. The first one pertains to the 10.1016/j.neucom.2014.05.075. exceptional performance of single-layer neural networks on [12] P. Sun et al., "Protein Function Prediction Using Function Associations this dataset, alhough the number of neurons in this single in Protein–Protein Interaction Network," IEEE Access, vol. 6, pp. hidden layer must be empirically determined as a percentage of 30892–30902, 2018, https://doi.org/10.1109/ACCESS.2018.2806478. the total input and output neurons in the network. The simple [13] R. You, X. Huang, and S. Zhu, "DeepText2GO: Improving large-scale design of this single-layer model requires minimal computing protein function prediction with deep semantic text representation," resources. This model showed a performance improvement of www.etasr.com Tahzeeb & Hasan: A Neural Network-Based Multi-Label Classifier for Protein Function Prediction Engineering, Technology & Applied Science Research Vol. 12, No. 1, 2022, 7974-7981 Methods, vol. 145, pp. 82–90, Aug. 2018, https://doi.org/10.1016/ j.ymeth.2018.05.026. [14] K. Taha, P. D. Yoo, and M. Alzaabi, "iPFPi: A System for Improving Protein Function Prediction through Cumulative Iterations," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 4, pp. 825–836, Jul. 2015, https://doi.org/10.1109/TCBB.2014.2344681. [15] M. Frasca and N. C. Bianchi, "Multitask Protein Function Prediction through Task Dissimilarity," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, no. 5, pp. 1550–1560, Sep. 2019, https://doi.org/10.1109/TCBB.2017.2684127. [16] S. Mishra, Y. P. Rastogi, S. Jabin, P. Kaur, M. Amir, and S. Khatoon, "A bacterial phyla dataset for protein function prediction," Data in Brief, vol. 28, Feb. 2020, Art. no. 105002, https://doi.org/10.1016/j.dib.2019. [17] J. Shen et al., "Predicting protein–protein interactions based only on sequences information," Proceedings of the National Academy of Sciences, vol. 104, no. 11, pp. 4337–4341, Mar. 2007, https://doi.org/ 10.1073/pnas.0607879104. [18] K. C. Chou, "Prediction of protein cellular attributes using pseudo-amino acid composition," Proteins: Structure, Function, and Bioinformatics, vol. 43, no. 3, pp. 246–255, 2001, https://doi.org/10.1002/prot.1035. [19] M. Ashburner et al., "Gene Ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25–29, May 2000, https://doi.org/10.1038/75556. [20] "pandas - Python Data Analysis Library." https://pandas.pydata.org/ (accessed Dec. 04, 2021). [21] "NumPy - The fundamental package for scientific computing with Python." https://numpy.org/ (accessed Dec. 04, 2021). [22] "scikit-learn: machine learning in Python — scikit-learn 1.0.1 documentation." https://scikit-learn.org/stable/ (accessed Dec. 04, 2021). [23] D. Virmani, N. Jain, A. Srivastav, M. Mittal, and S. Mittal, "An Enhanced Binary Classifier Incorporating Weighted Scores," Engineering, Technology & Applied Science Research, vol. 8, no. 2, pp. 2853–2858, Apr. 2018, https://doi.org/10.48084/etasr.1962. [24] M. Alghobiri, "A Comparative Analysis of Classification Algorithms on Diverse Datasets," Engineering, Technology & Applied Science Research, vol. 8, no. 2, pp. 2790–2795, Apr. 2018, https://doi.org/ 10.48084/etasr.1952. [25] X. Z. Wu and Z. H. 
Zhou, "A Unified View of Multi-Label Performance Measures," in Proceedings of the 34th International Conference on Machine Learning, Jul. 2017, pp. 3780–3788, Accessed: Dec. 04, 2021. [Online]. Available: https://proceedings.mlr.press/v70/wu17a.html. [26] T. Li, C. Zhang, and S. Zhu, "Empirical Studies on Multi-label Classification," in 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI’06), Nov. 2006, pp. 86–92, https://doi.org/10.1109/ICTAI.2006.55. [27] J. Gorodkin, "Comparing two K-category assignments by a K-category correlation coefficient," Computational Biology and Chemistry, vol. 28, no. 5, pp. 367–374, Dec. 2004, https://doi.org/10.1016/j.compbiolchem. 2004.09.006. [28] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen, "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics, vol. 16, no. 5, pp. 412–424, May 2000, https://doi.org/10.1093/bioinformatics/16.5.412. www.etasr.com Tahzeeb & Hasan: A Neural Network-Based Multi-Label Classifier for Protein Function Prediction http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Engineering, Technology & Applied Science Research Unpaywall

A Neural Network-Based Multi-Label Classifier for Protein Function Prediction

Engineering, Technology & Applied Science ResearchFeb 12, 2022

Loading next page...
 
/lp/unpaywall/a-neural-network-based-multi-label-classifier-for-protein-function-Jex4e5HkAW

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

Publisher
Unpaywall
ISSN
1792-8036
DOI
10.48084/etasr.4597
Publisher site
See Article on Publisher Site

Abstract

Engineering, Technology & Applied Science Research Vol. 12, No. 1, 2022, 7974-7981 A Neural Network-Based Multi-Label Classifier for Protein Function Prediction Shahab Tahzeeb Shehzad Hasan Dept. of Computer & Information Systems Engineering Dept. of Computer & Information Systems Engineering NED University of Engineering & Technology NED University of Engineering & Technology Karachi, Pakistan Karachi, Pakistan stahzeeb@neduet.edu.pk shasan@neduet.edu.pk Abstract-Knowledge of the functions of proteins plays a vital role features pursued by different investigators. This section in gaining a deep insight into many biological studies. However, presents a brief summary of some of the most prominent efforts wet lab determination of protein function is prohibitively in this area. laborious, time-consuming, and costly. These challenges have An ensemble of Deep Neural Networks (DNNs) was created opportunities for automated prediction of protein proposed in [1], where each DNN worked on a different set of functions, and many computational techniques have been features from the dataset. The predictions of different DNNs explored. These techniques entail excessive computational were then voted to arrive at the final protein function resources and turnaround times. The current study compares the performance of various neural networks on predicting protein prediction. A DNN for the hierarchical multilabel classification function. These networks were trained and tested on a large of protein functions designed to perform well even with a dataset of reviewed protein entries from nine bacterial phyla, limited number of training samples was presented in [3]. In [4], obtained from the Universal Protein Resource Knowledgebase a DNN was introduced to learn features from word embedding (UniProtKB). Each protein instance was associated with multiple of protein sequences, based on the concept of Natural terms of the molecular function of Gene Ontology (GO), making Language Processing (NLP), using sequence similarity profiles the problem a multilabel classification one. The results in this as additional features to locate proteins. Authors in [5] dataset showed the superior performance of single-layer neural established the efficacy of exploiting any interrelationships networks having a modest number of neurons. Moreover, a among different functional terms. For instance, different useful set of features that can be deployed for efficient protein functional classes were found to coexist with some proteins function prediction was discovered. suggesting a mutual relationship. Furthermore, a quantification model of these relations was proposed, using a functional Keywords-gene ontology; molecular function term; multi-label similarity measure and a framework to capitalize on it for the classification; neural network; protein function prediction eventual prediction of protein functions. A classification I. INTRODUCTION technique based on a neural network coupled with a Support Vector Machine (SVM) was demonstrated in [6], utilizing a bi- Understanding proteins’ functions plays a vital role in directional Long Short-Term Memory (LSTM) network to acquiring insights of the molecular mechanisms operating in generate fixed-length protein vectors out of variable-sized both physiological and ailing medical conditions. As a result, sequences and deal with the challenges posed by the variable this understanding substantiates the discovery of drugs in length of protein sequences. 
In [7], protein sequence motifs different diseases [1]. However, predicting protein functions is were used to build a deep convolutional network and predict an arduous task. The fact is markedly implied by the incredibly protein function, while the authors claimed to have built the large number of unannotated protein entries hosted by the most best performing model for the cellular component classes. The comprehensive protein database, the Universal Protein significance of Protein-Protein Interaction (PPI) and time- Resource Knowledgebase (UniProtKB) [2]. This is mainly due course gene expression data as powerful predictors for the to the reliance on traditional experimental annotation prediction of protein function was shown in [8]. A method, techniques carried out by molecular biologists. The gap called Dynamic Weighted Interactome Network (DWIN), was between reviewed and unreviewed protein sequences is proposed, that in addition to PPI and gene expression data, took widening due to the data deluge from high-throughput state-of- also into account information related to protein domains and the-art sequencing techniques [1, 3-5]. The pressing demands complexes to improve the prediction performance. In [9], for computational methods on the functional annotation of clustering was applied on a PPI network for the prediction of proteins have paved the way for significant contributions by protein function. A protein graph model was shown in [10], computer science researchers. Many computational techniques constructed of protein structure, with each node representing a employing machine learning for functional annotation of cluster of amino acid residues. However, the idea of using an proteins have been utilized in the literature. The principal accuracy metric for evaluation is generally misleading. In [11], difference between various approaches lies in the set of an active learning approach was explored for the prediction of Corresponding author: Shahab Tahzeeb www.etasr.com Tahzeeb & Hasan: A Neural Network-Based Multi-Label Classifier for Protein Function Prediction Engineering, Technology & Applied Science Research Vol. 12, No. 1, 2022, 7974-7981 protein function using a PPI network. This method operated in protein. As suggested in [18], these numbers overcome the loss two phases: Spectral clustering was used to cluster the PPI of sequence order effect in a protein caused by considering just network followed by the application of the betweenness plain amino acid compositions. Moreover, there are also 541 centrality measure for labeling within each cluster, and then the motifs included in the features. These are small segments in labeled protein data were used by a classification algorithm. proteins’ tertiary structure that are frequently found in different Associations between functions in a PPI network were used in proteins. These similar patterns are associated with the [12], stating that multiple function labels assigned to proteins structural or functional roles of proteins. were not independent and their coexistence could be used There are 1,739 binary labels associated with each protein effectively to predict protein function. A deep semantic text instance. These labels correspond to GO terms belonging to the representation was presented in [13], with various pieces of Molecular Function (MF) category. The GO is a categorization information extracted from protein sequences such as of biological functions using three broad classes, i.e. Molecular homology, motifs, and domains. 
Protein function prediction Function (MF), Cellular Component (CC), and Biological was carried out using a consensus between text-based and Process (BP), generally referred to as GO terms [19]. The sequence-based methods. In [14], a classifier using cumulative molecular function term specifies a biochemical activity iterations was proposed, based on its semantic similarity with performed by a gene product, without taking into account the the term Gene Ontology (GO). Each prediction was followed time and space dimensions of this activity. The enzyme is an by updating and optimizing scores of characteristic terms in the example of the MF term. The CC refers to the location of the set of GO annotations, which, in turn, led to improved future biochemical activity of a gene product in the cell. Ribosome predictions. The dissimilarity of protein functions, rather than and nuclear membrane are two such examples. BP, an all- conventional similarity measures, was used in [15] to segregate encompassing term, defines a biological objective to which rare and frequently occurring classes of functions. This activities of various gene products contribute. Cell growth and technique worked well for imbalanced datasets. maintenance serve as examples the BP term. The notable contributions cited above are just a handful of B. Data Preprocessing numerous praiseworthy efforts towards the prediction of The Comma Separated Values (CSV) files for 9 different protein function. These endeavors differ in terms of the protein bacterial phyla were combined to obtain a single Pandas’ data information utilized by the corresponding systems and the frame object using the Pandas data analysis library in Python computational or time complexities of the classification [20]. Duplicate rows were removed from the data frame, which models. The current paper presents a neural network-based was then converted to an array using the scientific computing multi-label classifier for the prediction of protein function by library NumPy in Python [21]. The feature values were then training and testing several neural networks on a large dataset scaled using the standard scaler available in the scikit-learn [16]. The results indicate that a neural network with a single library in Python [22]. Data scaling was investigated using hidden layer achieved remarkable prediction performance with normalization and robust scaler, but these data scaling nominal computational complexity. This makes its techniques proved inferior to the standard scaling technique. implementation viable on systems with modest hardware capabilities. Consequently, the time required for the C. Features Partitioning classification task is in the order of seconds. The neural networks were trained on 3 sets of features. The II. MATERIALS AND METHODS objective of partitioning features into various subsets was to test the hypothesis that compositions of amino acids, A. Dataset dipeptides, and tripeptides are sufficient to predict protein The dataset adopted from [16] includes 121,378 protein functions. F = {F , F , F } represented the set of features used 1 2 3 instances. These labeled protein examples were extracted from to train different models, where F was the entire set of 9,890 UniProtKB [2], a comprehensive worldwide repository of features, and F was the set of 8,420 features that contained protein information. These protein entries pertain to 9 bacterial only compositions of amino acids, dipeptides, and tripeptides. 
phyla, namely Actinobacteria, Bacteroidetes, Chlamydiae, The set F =F –F contained 1,470 features consisting of various 3 1 2 Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, properties and characteristics derived from proteins as Spirochaetes, and Tenericutes. Each instance in the dataset had described in subsection A. 9,890 features. These features included the sequence of amino D. Neural Networks acids making up the corresponding protein, compositions of A variety of neural networks was selected, differing in the amino acids, dipeptides and tripeptides; compositions of five groups of amino acids, i.e. aliphatic, aromatic, positively number of hidden layers and neurons in each layer to train the charged, negatively charged, uncharged, and various structural protein function classification system on datasets corresponding to each feature set F , F , and F . The and physiochemical properties derived from the amino acid 1 2 3 sequence. In addition, some features quantify conjoint triads. A experimental results are given in Section III. It was observed conjoint triad is a unit of three successive amino acids such that that the simplest neural network containing a single hidden each amino acid in the unit belongs to one of the seven groups layer demonstrated better performance on this dataset formed on the basis of the dipole and volume scale [17]. These compared to neural networks having more hidden layers. The optimal number of neurons in this single hidden layer was characteristic values indicate the strength of interaction between the amino acids of these 7 groups. The feature set also experimentally determined to be just 5% of the total input and contains pseudo amino acid compositions for the corresponding output neurons for feature sets F and F for the best performing 1 2 www.etasr.com Tahzeeb & Hasan: A Neural Network-Based Multi-Label Classifier for Protein Function Prediction Engineering, Technology & Applied Science Research Vol. 12, No. 1, 2022, 7974-7981 neural network. However, for the F feature set, the optimal hidden layer neural network M9, to investigate any number of neurons in the single hidden layer of the best improvements in the classifier performance. The second hidden performing neural network turned out to be 50% of the total of layer had 50% neurons of the first hidden layer to exploit the input and output neurons. predictors best suited for the prediction task. Once the optimal number of neurons in a single hidden TABLE IV. TWO -LAYER MODEL TRAINED ON THE F FEATURE SET layer was determined, the addition of another hidden layer was Training Prediction utilized to observe any potential boost in performance. The Model Size (MB) time (sec) time (sec) number of neurons in the second hidden layer was chosen to be 56 2050 4.76 M14 50% of the first hidden layer. This was done to ensure that the network captured the most important features for prediction. Table V summarizes several single-hidden layer neural Table I summarizes various single-hidden layer neural networks trained and tested on the F feature set, i.e. features networks trained and tested on the F feature set, i.e. the entire consisting of various properties and characteristics derived set of features from the dataset. The reference computer for all from proteins. time and memory size measurements presented is a Core i7- 8700 at 3.2 GHz 6-core processor. TABLE V. SINGLE-HIDDEN LAYER MODELS TRAIN ED ON F FEATURE SET TABLE I. 
SINGLE-H IDDEN LAYER MODELS TRAINED ON THE F FEATURE SET Training time Predictiontime Model Neurons Size (MB) (sec) (sec) Training time Predictiontime Model Neurons Size (MB) M15 5 6 2750 1.33 (sec) (sec) M16 10 12 3200 1.58 M1 1 16 1950 2.25 25 30 5200 2.92 M17 M2 5 77 2655 6.03 M18 30 35 5270 3.05 M3 10 154 3920 7.51 50 60 6630 4.20 M19 15 232 4860 9.75 M4 M20 60 71 7320 4.59 M5 25 386 8820 13.6 1. Expressed as a percentage of input + output neurons M6 50 773 12175 21.8 M7 75 1130 15400 31.6 Table VI presents the M21 neural network having two 1. Expressed as percentage of input + output neurons hidden layers and trained on F . This model was generated by adding another hidden layer to the best performing single Table II presents the M8 neural network with two hidden hidden layer neural network M19 to discover any potential layers trained on F . This model was constructed by adding performance enhancement. The number of neurons in the another hidden layer to the best performing single-hidden layer second hidden layer was chosen to be 50% of the first hidden neural network M2 to explore any performance gain. The layer to capitalize on the features best suited for the second hidden layer had 50% neurons of the first hidden layer classification. in an attempt to capture the optimal features best suited for the prediction task. TABLE VI. TWO -LAYER MODEL TRAINED ON THE F3 FEATURE SET TABLE II. TWO -LAYER MODEL TRAINED ON THE F FEATURE SET Training Prediction Model Size (MB) time (sec) time (sec) Training Prediction Model Size (MB) M21 58 5720 4.64 time (sec) time (sec) M8 74 2120 5.52 For each network, we employed the relu activation for the hidden layers, the sigmoid activation for the output layer, the Table III summarizes various single-hidden layer neural he_uniform kernel initializer for the hidden layers, and the networks trained and tested on the F feature set, i.e. the Adaptive moment estimation (Adam) optimizer with a learning compositions of amino acids, dipeptides, and tripeptides in the rate of 0.00001. protein sequence. E. Performance Evaluation TABLE III. SINGLE-H IDDEN LAYER MODELS TRAINED ON THE F FEATURE SET Since a protein example in this dataset can be mapped to more than one binary label, the prediction of protein function is Training time Prediction time Model Neurons Size (MB) a multilabel classification problem. The dataset is also highly (sec) (sec) imbalanced due to the overwhelming number of negative 5 60 2760 5 M9 M10 10 118 3480 6.31 examples for each label. Evaluation of such a classification 25 295 7920 11.1 M11 model cannot simply rely on the accuracy of prediction [23, M12 50 590 15000 17.8 24]. For example, if a negative class is abundantly prevalent M13 60 708 15855 20.4 among all examples in an imbalanced dataset, then a naive 1. Expressed as a percentage of input + output neurons classifier predicting this class for all examples will easily achieve very high accuracy. The challenge of this inflated Table IV presents the M14 neural network containing two accuracy measure becomes aggravated in the case of multilabel hidden layers and trained on F . This model was developed by classification of imbalanced datasets. This problem was adding another hidden layer to the best performing single www.etasr.com Tahzeeb & Hasan: A Neural Network-Based Multi-Label Classifier for Protein Function Prediction Engineering, Technology & Applied Science Research Vol. 12, No. 
E. Performance Evaluation

Since a protein example in this dataset can be mapped to more than one binary label, the prediction of protein function is a multilabel classification problem. The dataset is also highly imbalanced due to the overwhelming number of negative examples for each label. The evaluation of such a classification model cannot simply rely on the accuracy of prediction [23, 24]. For example, if a negative class is abundantly prevalent among all examples in an imbalanced dataset, then a naive classifier predicting this class for all examples will easily achieve very high accuracy. The challenge of this inflated accuracy measure becomes aggravated in the case of multilabel classification of imbalanced datasets. This problem was addressed by defining more meaningful performance measures, namely precision, recall, F1 score, zero-one loss, hamming loss, and the Matthews Correlation Coefficient. These measures are defined below.

1) Precision

Precision is defined as the fraction of positively classified instances that are, in effect, positive. This gives a clear picture of a classifier's strength in predicting positive classes. Letting TP and FP respectively denote the count of true and false positives, Precision is calculated as:

Precision = TP / (TP + FP)    (1)

The precision of a predict-majority-class-for-all classifier is thus 0, judiciously penalizing it for its shortcoming at predicting the positive minority class. However, any classifier that makes just one positive prediction and ensures its correctness would have 100% precision despite its failure to predict other positive examples. This calls for another classification metric, called Recall, also known as sensitivity.

2) Recall

Recall is defined as the fraction of positive examples in the dataset classified as positive. Letting FN denote the number of false negatives, Recall is given by:

Recall = TP / (TP + FN)    (2)

This measure penalizes a classifier that attempts to achieve high precision simply by making a few correct positive predictions.

3) F1 Score

Precision and recall are combined in a single performance measure called the F1 score, which is their harmonic mean:

F1 = 2 · (Precision · Recall) / (Precision + Recall)    (3)

As the harmonic mean is biased towards lower values, the F1 score can only be high when both precision and recall are high. In multi-label classification, there are several ways to average the aforementioned performance metrics over all labels [25, 26]: the micro average, macro average, weighted average, and samples average, as defined below. In each case, the F1 score is the harmonic mean of the corresponding precision and recall.

4) Micro Average

This is calculated by counting the number of True Positives (TPs) across the entire set of target labels. If there are N samples in the dataset and each sample has L binary target labels, then the micro averages of Precision and Recall are calculated as:

Precision_micro = Σ_{i=1}^{N·L} (Y_pred_{i} ∧ Y_true_{i}) / Σ_{i=1}^{N·L} Y_pred_{i}    (4)

Recall_micro = Σ_{i=1}^{N·L} (Y_pred_{i} ∧ Y_true_{i}) / Σ_{i=1}^{N·L} Y_true_{i}    (5)

where Y_pred and Y_true are the predicted and actual target labels, respectively. The conjunction operator ∧ ensures the inclusion of only those label instances that are positive in both Y_pred and Y_true, i.e. the TPs.

5) Macro Average

This averages the Precision and Recall scores of the individual target labels, giving equal weight to all of them:

Precision_macro = (1/L) Σ_{j=1}^{L} [ Σ_{i=1}^{N} (Y_pred_{i,j} ∧ Y_true_{i,j}) / Σ_{i=1}^{N} Y_pred_{i,j} ]    (6)

Recall_macro = (1/L) Σ_{j=1}^{L} [ Σ_{i=1}^{N} (Y_pred_{i,j} ∧ Y_true_{i,j}) / Σ_{i=1}^{N} Y_true_{i,j} ]    (7)

6) Weighted Average

This averages the Precision and Recall scores of the individual target labels, using the number of positive instances of each label in the set Y_true as its weight:

Precision_weighted = (1 / Σ_{j=1}^{L} w_{j}) Σ_{j=1}^{L} w_{j} · [ Σ_{i=1}^{N} (Y_pred_{i,j} ∧ Y_true_{i,j}) / Σ_{i=1}^{N} Y_pred_{i,j} ]    (8)

Recall_weighted = (1 / Σ_{j=1}^{L} w_{j}) Σ_{j=1}^{L} w_{j} · [ Σ_{i=1}^{N} (Y_pred_{i,j} ∧ Y_true_{i,j}) / Σ_{i=1}^{N} Y_true_{i,j} ]    (9)

where w_{j} denotes the weight, also known as the support, of the j-th label.

7) Samples Average

This averages the Precision and Recall scores across the samples:

Precision_samples = (1/N) Σ_{i=1}^{N} [ Σ_{j=1}^{L} (Y_pred_{i,j} ∧ Y_true_{i,j}) / Σ_{j=1}^{L} Y_pred_{i,j} ]    (10)

Recall_samples = (1/N) Σ_{i=1}^{N} [ Σ_{j=1}^{L} (Y_pred_{i,j} ∧ Y_true_{i,j}) / Σ_{j=1}^{L} Y_true_{i,j} ]    (11)

This is the most faithful as well as the most conservative performance indicator of the multi-label classifier, as it reflects, on average, how well the classifier performed on each sample. Therefore, the samples averages were used to gauge the performance of the models.
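All four averaging modes are implemented in scikit-learn [22]. The following minimal sketch, run on a toy 3-sample, 4-label indicator matrix rather than the real data, shows how they would be computed; as in this study, average="samples" is the variant used to score the models.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy multi-label indicator matrices: rows = samples (proteins), columns = binary labels
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 1, 1, 1]])

for avg in ("micro", "macro", "weighted", "samples"):
    p = precision_score(Y_true, Y_pred, average=avg, zero_division=0)
    r = recall_score(Y_true, Y_pred, average=avg, zero_division=0)
    f = f1_score(Y_true, Y_pred, average=avg, zero_division=0)
    print(f"{avg:>8}: P={p:.3f}  R={r:.3f}  F1={f:.3f}")
```

The samples average scores each protein on its own predicted label set and only then averages over proteins, which is why it is the most conservative of the four.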
8) Zero-One Loss

For a multi-label classification problem, this measure credits a prediction as correctly classified only when all of its labels are correctly classified; the loss is then zero. However, if the classifier fails to make a correct prediction for even one target label, the corresponding loss is 1. It follows that the zero-one loss is a truly conservative and highly penalizing performance measure.

ZOL = (1/N) Σ_{i=1}^{N} [ 1 − Π_{j=1}^{L} (1 − (Y_pred_{i,j} ⊕ Y_true_{i,j})) ]    (12)

The combination of the product operator Π and the exclusive-OR operator ⊕ ensures that any mismatch between predicted and target labels generates a loss of 1 for the given sample; otherwise, the loss is zero for a complete match between all predicted and target labels of that sample.

9) Hamming Loss

This gives the fraction of all incorrectly predicted labels, quantifying the number of incorrect label predictions rather than penalizing whole examples. Hence, if a multi-label classifier incorrectly predicts 1 out of 10 labels for a given instance, the hamming loss for that example is just 1/10, compared to 1 in the case of the zero-one loss. It follows that the hamming loss is lenient compared to the stringent zero-one loss.

HL = (1/(N·L)) Σ_{i=1}^{N} Σ_{j=1}^{L} (Y_pred_{i,j} ⊕ Y_true_{i,j})    (13)

10) Matthews Correlation Coefficient

The Matthews Correlation Coefficient (MCC) is a binary version of Pearson's correlation coefficient [27]. However, multiclass classification problems can also benefit from its extended version [28]. MCC compares the ground truth and predicted vectors, considering all possibilities of prediction, i.e. True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN). Therefore, it gives a balanced evaluation of the performance of the classifier. The correlation coefficient lies in the range [-1, +1], with -1 for false prediction, 0 for random prediction, and +1 for correct prediction.

MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))    (14)

The MCC was calculated for every example, and its average was used to assess the performance of the classifier on the entire dataset.

11) Consolidated Performance Metric

For the sake of an all-encompassing and more realistic comparison of performance, the aforementioned metrics were combined in a single Consolidated Performance Metric (CPM):

CPM = (F1 · MCC) / (ZOL · HL)    (15)

The CPM was constructed in a higher-the-better way.
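The losses and correlation measures above can be assembled in the same way. The sketch below is an illustration rather than the authors' code: zero_one_loss and hamming_loss accept multi-label indicator matrices directly, the MCC is computed per example and averaged as described above, and the zero-one loss is scaled to a percentage so that (15) reproduces the scale of the CPM values reported in Table VII; the helper name consolidated_performance_metric is ours.

```python
import numpy as np
from sklearn.metrics import zero_one_loss, hamming_loss, matthews_corrcoef, f1_score

def consolidated_performance_metric(Y_true: np.ndarray, Y_pred: np.ndarray) -> float:
    """CPM = (F1 * MCC) / (ZOL * HL), higher is better (eq. 15)."""
    f1 = f1_score(Y_true, Y_pred, average="samples", zero_division=0)
    zol = 100.0 * zero_one_loss(Y_true, Y_pred)   # percentage of samples with any label wrong
    hl = hamming_loss(Y_true, Y_pred)             # fraction of wrong labels overall
    # per-sample MCC over each protein's label vector, then averaged over samples
    mcc = np.mean([matthews_corrcoef(t, p) for t, p in zip(Y_true, Y_pred)])
    return (f1 * mcc) / (zol * hl)
```

As a check, plugging in the scores later reported for model M1 in Table VII (F1 = 0.95, MCC = 0.9518, ZOL = 12.25, HL = 0.0159) gives 0.95 × 0.9518 / (12.25 × 0.0159) ≈ 4.64, matching the tabulated CPM of 4.6423.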
III. RESULTS

Table VII shows the performance details of the neural networks (M1 to M21) as the samples averages of Precision (P), Recall (R), F1 score (F1), Zero-One Loss (ZOL), Hamming Loss (HL), MCC, and CPM.

TABLE VII. PERFORMANCE OF NEURAL NETWORKS
Model  P     R     F1    ZOL    HL      MCC     CPM
M1     0.96  0.95  0.95  12.25  0.0159  0.9518  4.6423
M2     0.96  0.96  0.96  11.36  0.0120  0.9576  6.7437
M3     0.96  0.96  0.95  11.64  0.0122  0.9571  6.4028
M4     0.96  0.96  0.95  11.76  0.0122  0.9566  6.3341
M5     0.96  0.96  0.95  11.90  0.0123  0.9568  6.2100
M6     0.96  0.96  0.95  12.55  0.0129  0.9546  5.6016
M7     0.96  0.96  0.95  12.61  0.0132  0.9539  5.4442
M8     0.96  0.95  0.95  12.21  0.0128  0.9535  5.7959
M9     0.94  0.94  0.93  15.85  0.0173  0.9341  3.1681
M10    0.94  0.94  0.93  16.47  0.0179  0.9329  2.9429
M11    0.93  0.94  0.93  16.70  0.0180  0.9318  2.8828
M12    0.93  0.94  0.93  17.30  0.0186  0.9298  2.6873
M13    0.93  0.93  0.93  17.07  0.0183  0.9297  2.7678
M14    0.93  0.93  0.92  17.17  0.0188  0.9279  2.6446
M15    0.96  0.95  0.95  12.43  0.0129  0.9535  5.6492
M16    0.96  0.96  0.96  11.34  0.0117  0.9591  6.9396
M17    0.97  0.96  0.96  10.43  0.0108  0.9614  8.1935
M18    0.97  0.96  0.96  10.66  0.0110  0.9614  7.8709
M19    0.97  0.96  0.96  10.34  0.0107  0.9625  8.3516
M20    0.97  0.96  0.96  10.39  0.0108  0.9621  8.2310
M21    0.97  0.96  0.96  10.95  0.0114  0.9593  7.3775

A wide variety of neural networks was trained and tested on a large dataset of proteins. For the feature sets F1 and F2, single-hidden-layer networks whose hidden layer contained 5% of the total input and output neurons exhibited better prediction performance than the other single-layer models; these models were designated M2 and M9, respectively. However, for the feature set F3, the optimal count of neurons in the hidden layer emerged as 50% of the total input and output neurons; this model was designated M19. The bar graphs in Figures 1-3 compare the CPM of the various neural networks operating on each specific feature set. In each case, the best performing single-layer network was extended by adding a second hidden layer to assess any performance edge. The neurons in the second hidden layer were 50% of those of the first hidden layer to ensure that the most relevant features for prediction play their due role. The blue bars in Figures 1-3 represent the performance of the best performing single-layer networks, while the yellow bars show the performance of the 2-layer neural networks.

Fig. 1. Neural networks' comparison on the F1 feature set.
Fig. 2. Neural networks' comparison on the F2 feature set.
Fig. 3. Neural networks' comparison on the F3 feature set.

F. Optimal Feature Set

The experiments also focused on exploring an optimal set of features for the prediction of protein function. The F3 feature set proved to be the best predictor for this multi-label classification. Figure 4 shows a comparison of the best-performing models for each feature set, where M19 on F3 achieved the best performance.

Fig. 4. Performance comparison of the best performing single-layer models on different feature sets. F3 proves to be the best predictor set.

G. Classification Threshold

The impact of the classification threshold on the performance of a classifier was also examined. The models predict the probability of each target label associated with every instance. These probability values quantify the chance of a given instance belonging to a particular class, and they must be translated into the binary labels 0 and 1 before the final evaluation of the model. This conversion requires a threshold, or probability cutoff, below which all values are classified as class 0, while equal or greater values are classified as class 1. Classifier performance metrics are profoundly influenced by the choice of this threshold, and its impact is more pronounced for imbalanced datasets. As the examined dataset is skewed towards negative examples of each label, the performance of the models was evaluated for various threshold values. Figures 5 and 6 show two example plots of the samples averages of P, R, and F1 score for models M2 and M19, respectively, against a range of classification thresholds.

Fig. 5. Performance curves of M2 for the F1 feature set. The choice of classification threshold plays an important role in improving the performance of the classifier.
Fig. 6. Performance curves of model M19 for the F3 feature set.
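A minimal sketch of the threshold sweep behind Figures 5 and 6 is given below; Y_prob stands for the raw sigmoid outputs of a trained model, and the function name is illustrative rather than taken from the paper.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

def sweep_thresholds(Y_true, Y_prob, thresholds=np.arange(0.1, 0.91, 0.1)):
    """Binarize sigmoid outputs at each cutoff and report sample-averaged P, R, F1."""
    rows = []
    for t in thresholds:
        Y_pred = (Y_prob >= t).astype(int)   # values >= t become class 1, the rest class 0
        rows.append((
            t,
            precision_score(Y_true, Y_pred, average="samples", zero_division=0),
            recall_score(Y_true, Y_pred, average="samples", zero_division=0),
            f1_score(Y_true, Y_pred, average="samples", zero_division=0),
        ))
    return rows
```

Lowering the cutoff trades precision for recall, so on a label set this imbalanced the best operating point may well differ from the default of 0.5.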
H. Confusion Matrix

The confusion matrix is a visualization of a classifier's performance, as it gives the count of the TP, FP, TN, and FN class predictions. Figure 7 presents the confusion matrices of selected labels in the test dataset for the model M19, which performed best on the F3 feature set. To highlight the strength of the classifier, only labels whose support exceeded 1,000 in the dataset were chosen.

Fig. 7. Confusion matrices of the best-performing model M19 for F3. Only labels having a support of more than 1,000 are included.
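Per-label confusion matrices such as those in Figure 7 can be obtained with scikit-learn's multilabel_confusion_matrix; the sketch below also applies the support filter of 1,000 described above (the function and variable names are ours).

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

def confusion_matrices_for_frequent_labels(Y_true, Y_pred, min_support=1000):
    """Return {label index: 2x2 matrix [[TN, FP], [FN, TP]]} for well-supported labels."""
    cms = multilabel_confusion_matrix(Y_true, Y_pred)  # shape: (n_labels, 2, 2)
    support = Y_true.sum(axis=0)                       # positive examples per label
    return {j: cms[j] for j in np.flatnonzero(support > min_support)}
```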
IV. DISCUSSION

The performance comparison of different neural networks on a large protein dataset showed that neural networks having a single hidden layer and a modest number of neurons achieved superior performance on this specific dataset compared to relatively more complex networks. The number of neurons in the single hidden layer was determined empirically. Rigorous experimentation revealed that only 5% of the total input and output neurons were adequate in the single-hidden-layer models operating on the F1 and F2 feature sets, whereas this count was 50% for the single-layer model on F3. This disparity arises because the models for the F1 and F2 feature sets had 9,890 and 8,420 neurons, respectively, in their input layers, so even 5% of the total input and output neurons was adequate to effectively train the model for 1,739 labels. For F3, however, the number of input neurons was barely 1,470, and consequently relatively more neurons were needed for better prediction performance. This justifies the 50% neuron count in the single hidden layer of this network. In any case, the training time, prediction time, and model size of these models were much better than those of the other competing models. These models also showed much better performance (F1 score: 0.96) than the deep learning ensemble of [1] (F1 score: 0.79) on the same dataset, and this was achieved with a much lower computational complexity.

A. Best Predictors

The findings highlight the impressive role of the physiochemical properties and motifs in proteins, pseudo amino acid compositions, and other properties derived from the protein sequences in predicting protein functions. The proposed model for this feature set was extremely efficient, as it had better performance and lower computational complexity.

B. Sufficiency of Amino Acid, Dipeptide, and Tripeptide Compositions

The results were suggestive of the sufficiency of amino acid, dipeptide, and tripeptide compositions in predicting protein functions. Although the performance metrics for this particular feature set had lower values than the other feature sets, it can be used for a sufficient and tolerable approximation. This could save the time spent in engineering features beyond the bare compositions of amino acids, dipeptides, and tripeptides already present in the dataset.

V. CONCLUSIONS

This study culminated in two significant findings regarding the examined protein dataset. The first pertains to the exceptional performance of single-layer neural networks on this dataset, although the number of neurons in the single hidden layer must be empirically determined as a percentage of the total input and output neurons in the network. The simple design of this single-layer model requires minimal computing resources. This model showed a performance improvement of more than 16% over two-layer neural networks operating on the F1 feature set; the corresponding performance improvements for the F2 and F3 feature sets were 20% and 13%, respectively. The second finding concerns the predictive power of the features themselves: this study could play a substantial role in the prediction of protein function due to the tremendous predictive power of some physiochemical properties of proteins, their pseudo-amino acid compositions, motifs in proteins, and some other significant characteristics. The bare compositions of amino acids, dipeptides, and tripeptides provide a reasonably high level of approximation of protein functions. This could be useful in cases where researchers want an approximate idea of protein functions from the amino acid sequence alone, rather than extracting and relying on many other properties of proteins.

REFERENCES

[1] S. Mishra, Y. P. Rastogi, S. Jabin, P. Kaur, M. Amir, and S. Khatun, "A deep learning ensemble for function prediction of hypothetical proteins from pathogenic bacterial species," Computational Biology and Chemistry, vol. 83, Dec. 2019, Art. no. 107147, https://doi.org/10.1016/j.compbiolchem.2019.107147.
[2] "UniProtKB - UniProt Knowledgebase." https://www.uniprot.org/help/uniprotkb (accessed Dec. 04, 2021).
[3] X. Yuan, W. Li, K. Lin, and J. Hu, "A Deep Neural Network Based Hierarchical Multi-Label Classifier for Protein Function Prediction," in 2019 International Conference on Computer, Information and Telecommunication Systems (CITS), Aug. 2019, pp. 1–5, https://doi.org/10.1109/CITS.2019.8862034.
[4] Z. Du, Y. He, J. Li, and V. N. Uversky, "DeepAdd: Protein function prediction from k-mer embedding and additional features," Computational Biology and Chemistry, vol. 89, Dec. 2020, Art. no. 107379, https://doi.org/10.1016/j.compbiolchem.2020.107379.
[5] X. F. Zhang and D. Q. Dai, "A Framework for Incorporating Functional Interrelationships into Protein Function Prediction Algorithms," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 3, pp. 740–753, May 2012, https://doi.org/10.1109/TCBB.2011.148.
[6] A. Ranjan, M. S. Fahad, D. Fernández-Baca, A. Deepak, and S. Tripathi, "Deep Robust Framework for Protein Function Prediction Using Variable-Length Protein Sequences," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 17, no. 5, pp. 1648–1659, Sep. 2020, https://doi.org/10.1109/TCBB.2019.2911609.
[7] M. Kulmanov and R. Hoehndorf, "DeepGOPlus: improved protein function prediction from sequence," Bioinformatics, vol. 36, no. 2, pp. 422–429, Jan. 2020, https://doi.org/10.1093/bioinformatics/btz595.
[8] B. Zhao et al., "A New Method for Predicting Protein Functions From Dynamic Weighted Interactome Networks," IEEE Transactions on NanoBioscience, vol. 15, no. 2, pp. 131–139, Mar. 2016, https://doi.org/10.1109/TNB.2016.2536161.
[9] M. Modi, N. G. Jadeja, and K. Zala, "FMFinder: A Functional Module Detector for PPI Networks," Engineering, Technology & Applied Science Research, vol. 7, no. 5, pp. 2022–2025, Oct. 2017, https://doi.org/10.48084/etasr.1347.
[10] M. A. Alvarez and C. Yan, "A new protein graph model for function prediction," Computational Biology and Chemistry, vol. 37, pp. 6–10, Apr. 2012, https://doi.org/10.1016/j.compbiolchem.2012.01.003.
[11] W. Xiong, L. Xie, S. Zhou, and J. Guan, "Active learning for protein function prediction in protein–protein interaction networks," Neurocomputing, vol. 145, pp. 44–52, Dec. 2014, https://doi.org/10.1016/j.neucom.2014.05.075.
[12] P. Sun et al., "Protein Function Prediction Using Function Associations in Protein–Protein Interaction Network," IEEE Access, vol. 6, pp. 30892–30902, 2018, https://doi.org/10.1109/ACCESS.2018.2806478.
[13] R. You, X. Huang, and S. Zhu, "DeepText2GO: Improving large-scale protein function prediction with deep semantic text representation," Methods, vol. 145, pp. 82–90, Aug. 2018, https://doi.org/10.1016/j.ymeth.2018.05.026.
[14] K. Taha, P. D. Yoo, and M. Alzaabi, "iPFPi: A System for Improving Protein Function Prediction through Cumulative Iterations," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 4, pp. 825–836, Jul. 2015, https://doi.org/10.1109/TCBB.2014.2344681.
[15] M. Frasca and N. C. Bianchi, "Multitask Protein Function Prediction through Task Dissimilarity," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 16, no. 5, pp. 1550–1560, Sep. 2019, https://doi.org/10.1109/TCBB.2017.2684127.
[16] S. Mishra, Y. P. Rastogi, S. Jabin, P. Kaur, M. Amir, and S. Khatoon, "A bacterial phyla dataset for protein function prediction," Data in Brief, vol. 28, Feb. 2020, Art. no. 105002, https://doi.org/10.1016/j.dib.2019.105002.
[17] J. Shen et al., "Predicting protein–protein interactions based only on sequences information," Proceedings of the National Academy of Sciences, vol. 104, no. 11, pp. 4337–4341, Mar. 2007, https://doi.org/10.1073/pnas.0607879104.
[18] K. C. Chou, "Prediction of protein cellular attributes using pseudo-amino acid composition," Proteins: Structure, Function, and Bioinformatics, vol. 43, no. 3, pp. 246–255, 2001, https://doi.org/10.1002/prot.1035.
[19] M. Ashburner et al., "Gene Ontology: tool for the unification of biology," Nature Genetics, vol. 25, no. 1, pp. 25–29, May 2000, https://doi.org/10.1038/75556.
[20] "pandas - Python Data Analysis Library." https://pandas.pydata.org/ (accessed Dec. 04, 2021).
[21] "NumPy - The fundamental package for scientific computing with Python." https://numpy.org/ (accessed Dec. 04, 2021).
[22] "scikit-learn: machine learning in Python — scikit-learn 1.0.1 documentation." https://scikit-learn.org/stable/ (accessed Dec. 04, 2021).
[23] D. Virmani, N. Jain, A. Srivastav, M. Mittal, and S. Mittal, "An Enhanced Binary Classifier Incorporating Weighted Scores," Engineering, Technology & Applied Science Research, vol. 8, no. 2, pp. 2853–2858, Apr. 2018, https://doi.org/10.48084/etasr.1962.
[24] M. Alghobiri, "A Comparative Analysis of Classification Algorithms on Diverse Datasets," Engineering, Technology & Applied Science Research, vol. 8, no. 2, pp. 2790–2795, Apr. 2018, https://doi.org/10.48084/etasr.1952.
[25] X. Z. Wu and Z. H. Zhou, "A Unified View of Multi-Label Performance Measures," in Proceedings of the 34th International Conference on Machine Learning, Jul. 2017, pp. 3780–3788. [Online]. Available: https://proceedings.mlr.press/v70/wu17a.html
[26] T. Li, C. Zhang, and S. Zhu, "Empirical Studies on Multi-label Classification," in 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06), Nov. 2006, pp. 86–92, https://doi.org/10.1109/ICTAI.2006.55.
[27] J. Gorodkin, "Comparing two K-category assignments by a K-category correlation coefficient," Computational Biology and Chemistry, vol. 28, no. 5, pp. 367–374, Dec. 2004, https://doi.org/10.1016/j.compbiolchem.2004.09.006.
[28] P. Baldi, S. Brunak, Y. Chauvin, C. A. F. Andersen, and H. Nielsen, "Assessing the accuracy of prediction algorithms for classification: an overview," Bioinformatics, vol. 16, no. 5, pp. 412–424, May 2000, https://doi.org/10.1093/bioinformatics/16.5.412.
