Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes

An improved hidden Markov model for transmembrane protein detection and topology prediction and... Vol. 21 no. 9 2005, pages 1853–1858 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bti303 Sequence analysis An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes 1 1 1,2,∗ Robel Y. Kahsay , Guang Gao and Li Liao 1 2 Delaware Biotechnology Institute, Newark, DE 19715, USA and Department of Computer and Information Sciences, University of Delaware, 103 Smith Hall, Newark, DE 19716, USA Received on November 15, 2004; revised on January 7, 2005; accepted on January 27, 2005 Advance Access publication February 2, 2005 ABSTRACT Unfortunately, membrane proteins are hard to solubilize and purify Motivation: Knowledge of the transmembrane helical topology can in their native conformation because they have both hydrophobic and help identify binding sites and infer functions for membrane proteins. hydrophilic regions on their surfaces, and thus it is difficult to determ- However, because membrane proteins are hard to solubilize and ine their structure experimentally. Such a situation has motivated purify, only a very small amount of membrane proteins have structure the development of various computational methods for predicting and topology experimentally determined. This has motivated vari- the topology of membrane proteins. Most of these computational ous computational methods for predicting the topology of membrane approaches rely on the compositional bias of amino acids at differ- proteins. ent regions of the sequence (von Heijne, 1994). For example, there is Results: We present an improved hidden Markov model, TMMOD, for a high propensity of hydrophobic residues in transmembrane alpha the identification and topology prediction of transmembrane proteins. helices due to the hydrophobic environment in lipid membranes. Our model uses TMHMM as a prototype, but differs from TMHMM by Because such a bias is quite noticeable and consistent, the location the architecture of the submodels for loops on both sides of the mem- of transmembrane domains can often be easily identified with high brane and also by the model training procedure. In cross-validation accuracy even by a simple method such as applying a threshold on experiments using a set of 83 transmembrane proteins with known the hydrophobic propensity curve. topology, TMMOD outperformed TMHMM and other existing methods, Another compositional signal in membrane proteins is the abund- with an accuracy of 89% for both topology and locations. In another ance of positively charged residues in the segments (loops) that are experiment using a separate set of 160 transmembrane proteins, located on the cytoplasmic side of the membrane and therefore is TMMOD had 84% for topology and 89% for locations. When util- referred to as the ‘positive inside rule’ for predicting the orienta- ized for identifying transmembrane proteins from non-transmembrane tion of a transmembrane helix (von Heijne, 1986, 1992). Unlike the proteins, particularly signal peptides, TMMOD has consistently fewer hydrophobicity signal for transmembrane helices, the ‘positive inside false positives than TMHMM does. Application of TMMOD to a col- rule’ is a weaker signal and often confused by significant presence of lection of complete genomes shows that the number of predicted positively charged residues in globular domains of the protein on the membrane proteins accounts for ∼20–30% of all genes in those gen- non-cytoplasmic side. Consequently, it is more difficult to correctly omes, and that the topology where both the N- and C-termini are in the predict the overall topology of a given protein, i.e. the orientation of cytoplasm is dominant in these organisms except for Caenorhabditis all transmembrane segments. elegans. There are basically two ways for improving the prediction accur- Availability: http://liao.cis.udel.edu/website/servers/TMMOD/ acy of any given model: by enhancing the signal/noise ratio for those Contact: lliao@cis.udel.edu weak signals or by identifying new signals and associating them with the topology. For example, significant improvements of prediction accuracy were reported (Persson and Argos, 1994) by applying mul- INTRODUCTION tiple sequence alignment to proteins with similar topology so that Membrane proteins serve as highly active mediators between the the positive residue content in the cytoplasmic loops may become cell and its environment or between the interior of an organelle and obvious in the aligned motifs. A more recent work along this line is the cytosol. They catalyze specific metabolites and ions across the PRODIV–TMHMM, a profile-based hidden Markov model (Viklund membrane barriers, convert the energy of sunlight into chemical and and Elofsson, 2004), where a 10% increase in performance is repor- electrical energy and couple the flow of electrons to the synthesis of ted with the use of homologous sequences. However, it shall be noted ATP. Furthermore, they act as signal receptors and transduce signals that multiple sequence alignment may not always be suitable, either such as neurotransmitters, growth factors and hormones across the due to insufficient number of homologs or due to the length variations membrane. Because of their vast functional roles, membrane proteins in these cytoplasmic loops. Other methods have been attempted at are important targets of pharmacological agents. exploring more subtle signals such as correlation of compositional bias at different positions. The best performance attained so far is by using artificial neural networks (Rost et al., 1996), a method known To whom correspondence should be addressed. © The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org 1853 R.Y.Kahsay et al. for its capability of capturing complex nonlinear signals. Despite its improvement at prediction accuracy, the artificial neural network method, well known for its black-box property, provides little insight into those signals that the network is designed to capture. A hidden Markov model, TMHMM, has recently been used for transmembrane topology prediction (Sonnhammer et al., 1998; Krogh et al., 2001). Hidden Markov models, as a probabilistic framework, have been widely applied in computational biology with remarkable success (Durbin et al., 1998). Unlike artificial neural networks, the architecture of hidden Markov models corresponds closely to the biological entities being simulated by the model. In TMHMM, the model comprises seven sets of states, with each set corresponding to a type of regions in the protein sequence. Each set of states has an associated probability distribution over the 20 amino acids characterizing the compositional bias in the corresponding regions. In addition, the model architecture specifies the intercon- nection of states within each set or submodel and also specifies how these submodels are connected to one another. Transitions among Fig. 1. Architecture of the four submodels: (A) transmembrane submodel states within a given submodel determine the length distribution of (C, M and X state types), (B) cytoplasmic loop submodel (G and I state the corresponding regions whereas transitions from one submodel types), (C) non-cytoplasmic short loop submodel (S and G state types), (D) to another reflect how the different regions are arranged to form non-cytoplasmic long loop submodel (L and G state types). To assemble the the entire protein. The transition probabilities, along with emission model, panel (B) shall be attached to the left of panel (A), with UI1 and LI1 in (B) pointing to C1 in (A), and C5 in (A) pointing to LI1 and TI1 in B. Both frequencies, enable the model to capture correlations among signals. panels (C) and (D) shall be attached to the right of panel (A). In this work, we present an improved hidden Markov model, TMMOD, for predicting transmembrane topology and identifying transmembrane proteins from soluble proteins. The TMMOD differs amino acid compositions at near the border between loops and helices, the first from TMHMM in both model architecture and training procedure. and last 10 residues of a loop region are explicitly modeled, i.e. each residue The architectural differences are on the cytoplasmic and the non- corresponds to an individual state in the model. These 20 states are marked as cytoplasmic loop submodels. For the training procedure, we adopt LI1–LI10 and UI1–UI10 in Figure 1B for loops inside the cytoplasm and as LS1–LS10 and US1–US10 in Figure 1C for short loops in the non-cytoplasm. the Bayesian based approach where the model parameters are set As shown in Figure 1B and C, a ladder-like structure is formed to allow for by posterior mean estimator (PME). In cross-validation experiments loop length to vary from just one residue, by traversing only state LI1 or LS1, using the datasets which were used by TMHMM, our model outper- to 20 residues, by traversing all 20 states. All other residues in the middle formed TMHMM in both topology prediction and identification of of a loop longer than 20 are collectively represented by one ‘globular’ state membrane proteins. which has a transition back to itself and thus can repeat as many times as the loop length dictates. Following TMHMM, since non-cytoplasmic loops MATERIALS AND METHODS longer than 100 appear to have compositional characteristics different from those of the short non-cytoplasmic loops, two separate non-cytoplasmic loop The TMMOD model architecture submodels are used for representing them, as depicted in Figure 1C and D. The overall skeleton of the TMMOD’s architecture is a linear structure of Inspired by TMHMM’s design of using a separate submodel for long three two-way connected submodels for cytoplasmic loop, transmembrane loops, we studied the length distribution of the loops in the training sequences helix and non-cytoplasmic loop. The two-way connections between the cyto- (Fig. 2). The length distribution shows that, ∼90% of the loops are shorter plasmic loop and the transmembrane helix and between the transmembrane than 40 residues, and the rest are quite spread out, as indicated by a long tail. helix and the non-cytoplasmic loop, plus a self return connection in the loop Similar findings with respect to the loop length distribution were reported in submodels, allow a path cycling through the three components of transmem- Wallin and von Heijne (1998) and Liu and Rost (2001). To capture this distri- brane proteins: cyto-loop, helix and noncyto-loop. A path can start with either bution more effectively, we introduced a separate chain of states (Figure 1B a cyto-loop or a noncyto-loop, reflecting the fact that a transmembrane protein and C) in parallel to the ladder-like structure in cytoplasmic and short non- can have its N-terminus either inside or outside the cell. The architectures of cytoplasmic loop submodels. As such, we want the transition parameters in the submodels for these three components are illustrated in Figure 1. the ladder-like part of the submodels to explicitly model the length distribu- The submodel for transmembrane helix, identical to that of TMHMM, has tion of the loops that are <40 amino acids, while the longer loops are directed two cap regions each of five residues surrounding a core region of variable through the bypass. More specifically, the transition probability LI → LI k k+1 length 5–25 residues (Fig. 1A). Therefore, the model can represent helices or LSI → LS now reflects the likelihood of loops with length >2k but k k+1 of size 15–35 residues long, a range that covers the actual sizes observed for <40; whereas the same transition parameter in a submodel without the bypass transmembrane domains. This submodel contains two chains of transmem- would have to reflect the likelihood of all loops with lengths >2k. We expect brane states, with one chain going inwards and the other going outwards, that such an effective estimation of the distribution of loop lengths would as a mechanism to enforce the structural constraint, i.e. a transmembrane enhance the signal-to-noise ratio of the topogenic signal. helix has to span the membrane. Since there are no observed differences in Another architectural difference from TMHMM is the use of a simpler amino acid composition and length distributions between ‘inwards’ helices submodel, as shown in Figure 1D, for non-cytoplasmic loops with lengths and ‘outwards’ helices, the emission and transition parameters for these two >100 amino acids. These long loops do not require a ladder-like structure chains are estimated collectively. since all of them are >100 amino acids and thus there is no need for an early The architecture of TMMOD differs from that of TMHMM by how the exit. Overall, TMHMM reported 83 transition and 133 emission parameters, loops are modeled (Fig. 1B, C and D). In order to capture the known biases of whereas our model has 66 transition and 133 emission parameters. 1854 An improved HMM predicting transmembrane topology The topology of a membrane protein is predicted using Viterbi algorithm. cyto We also compute the three posterior probabilities that a given residue is in a transmembrane helix, on the cytoplasmic side or on the periplasmic side. This ac yto additional information, which at times can be even more informative than the single most probable state path, shows where the prediction is certain and what alternatives there might be. Datasets The two datasets used to validate the model on topology prediction were downloaded from the TMHMM website (http://www.cbs.dtu.dk/ services/TMHMM). The first dataset contains 83 transmembrane sequences of known topology, with 45 of them being single spanning. The second data- set has 160 transmembrane sequences, with 52 of them being single spanning. The topology of most proteins in these datasets is determined experimentally. We adopted the same 10-fold cross-validation for topology prediction as 0 20 40 60 80 100 in Sonnhammer et al. (1998). Both datasets are divided into 10 subsets. The subsets from the first dataset contain either eight or nine sequences, and all length the subsets of the second dataset have exactly 16 sequences each. To make the learning task more challenging, the subsets are prepared in such a way that Fig. 2. Length distribution of cytoplasmic and short non-cytoplasmic loops. sequences from different subsets are no more than 25% identical to each other. The model is trained on nine subsets and then is used to make predictions on the remaining subset. This is repeated 10 times, and each time a different The TMMOD model training subset is selected as the test set. The prediction accuracy is the average over Model parameters are estimated by Bayesian approach (PME) using single the 10 runs. Dirichlet and substitution matrix mixtures priors (Durbin et al., 1998). For For discrimination or identification experiments, test datasets contain the each of the seven types of states as shown in Figure 1, the substitution matrix set of 160 transmembrane proteins (positives) and other non-transmembrane mixtures is given by proteins (negatives). The test datasets for discrimination experiments are β = A c P(a|b) (1) the same as used in Krogh et al. (2001). These datasets include 645 sol- ja jb b uble proteins, six porins and a set of signal peptides from different classes of organisms. For whole genome analysis, all genes annotated in the where β is the pseudocount for amino acid ‘a’ in state type j , c is the ja jb Genbank entry of the genomes and chromosomes used were downloaded observed frequency (or count) for amino acid ‘b’ in state type j , P(a|b) is from ftp://ncbi.nlm.nih.gov/genbank/genomes/, except for Caenorhabditis the conditional probability of amino acid ‘a’ given amino acid ‘b’ (derived elegans, which was downloaded from the URL: ftp://genome.wustl.edu/pub/ from BLOSUM50 matrix), and A is a constant. The technique of using Dirichlet prior assumes that the observed frequen- cies of 20 amino acids in each of the seven types of states were stochastically RESULTS generated from a distributionp  = (p , ... , p ), which itself is chosen from 1 20 a distribution specified by a parametric Dirichlet density ρ(p)  , Topology prediction 20 α −1 a=1 i The accuracy is measured by the number of sequences from ρ(p)  = (2) the test sets whose topology and location of all transmembrane where Z is the normalizing constant. Each of the training sequences, with helices are correctly predicted. Following the same criterion used their topology known, is partitioned into segments according to the state in Sonnhammer et al. (1998), a predicted helix is counted as correct types such that all residues in a segment are emitted from the same type of if it overlaps by at least five residues with a true helix. The perform- states. An observed count vector over amino acids is found for each of these ance is also measured by the sensitivity and specificity for identifying segments, and these count vectors are grouped into seven classes according individual transmembrane domains. to the state types. For each class, the parametric Dirichlet density function To help us understand and assess how the model architecture (α parameters) is estimated from the observed count vectors by following a and use of different regularizers have contributed to the perform- procedure outlined in Brown et al. (1993) and Sjolander et al. (1996). Then, a pseudocount for amino acid ‘a’ in states of type j is given as ance, three variations of the architecture, including the one shown α in Figure 1, and three regularization schemes are tested. Model M1 ja σ = A (3) ja α  is the architecture of TMHMM with our training. Model M2 has ja two bypassed ladder-like submodels on each side of the membrane, The above equation for deriving pseudocounts differs from the standard by a design intended to see if differentiating long and short loops on the constant A, which is introduced to tighten the Dirichlet density without the cytoplasmic side as well will perform better. Model M3 is the affecting its mean (Durbin et al., 1998). The final emission frequency of amino acid ‘a’ from state type j after adding both types of pseudocounts is architecture shown in Figure 1. Regularizer scheme (a) uses Dirichlet then given as follows: prior based pseudocounts; (b) uses substitution matrix based pseudo- counts; and (c) uses both. The performance for these variations on the c + σ + β ja ja ja e =  (4) ja two datasets is given in Table 1. It is shown that the model depicted (c  + σ  + β  ) ja ja ja in Figure 1, using both Dirichlet and substitute matrix based regu- We also produce a single component Dirichlet pseudocount vector to regu- larizers, has achieved the best performance: 89% accuracy for both larize transitions in the ladder-like part of the submodels by taking the three topology and location on the first dataset (83 sequences), and 84% outgoing transition counts from each of the lower chain of states as vectors accuracy for topology and 89% accuracy for locations on the second in three dimensions. A detailed description of our model training procedure is described in Kahsay et al. (2004). dataset (160 sequences). The performance improvement of TMMOD No. loops R.Y.Kahsay et al. Table 1. Prediction accuracy for the cross-validation experiments Model Regularizer scheme Dataset Correct topology Correct location Sensitivity (%) Specificity (%) M1 (a) S-83 65 (78.3%) 67 (80.7%) 97.4 97.4 (b) 51 (61.4%) 52 (62.7%) 71.3 71.3 (c) 64 (77.1%) 65 (78.3%) 97.1 97.1 M2 (a) S-83 61 (73.5%) 65 (78.3%) 99.4 97.4 (b) 54 (65.1%) 61 (73.5%) 93.8 71.3 (c) 54 (65.1%) 66 (79.5%) 99.7 97.1 M3 (a) S-83 70 (84.3%) 71 (85.5%) 98.2 97.4 (b) 64 (77.1%) 65 (78.3%) 95.3 71.3 (c) 74 (89.2%) 74 (89.2%) 99.1 97.1 TMHMM S-83 64 (77.1%) 69 (83.1%) 96.2 96.2 PHDtm S-83 (85.5%) (88.0%) 98.8 95.2 M1 (a) S-160 117 (73.1%) 128 (80.0%) 97.4 97.0 (b) 92 (57.5%) 103 (64.4%) 77.4 80.8 (c) 117 (73.1%) 126 (78.8%) 96.1 96.7 M2 (a) S-160 120 (75.0%) 132 (82.5%) 98.4 97.2 (b) 97 (60.6%) 121 (75.6%) 97.7 95.6 (c) 118 (73.8%) 135 (84.4%) 98.4 97.2 M3 (a) S-160 120 (75.0%) 133 (83.1%) 97.8 97.6 (b) 110 (68.8%) 124 (77.5%) 94.5 98.1 (c) 135 (84.4%) 143 (89.4%) 98.3 98.1 TMHMM S-160 123 (76.9%) 134 (83.8%) 97.1 97.7 Numbers in bold represent the best results from different methods. over that of TMHMM (77% topology and 83% locations on the first TMHMM that differentiation of short and long loops only applies to dataset, and 77% topology and 84% locations on the second dataset) the non-cytoplasmic side. is significant. It is noted that, on the first dataset where the results Discrimination between non-membrane and for PHDhtm are also available, the TMMOD’s performance even membrane proteins slightly exceeds the performance (86% for topology and 88% for locations) of PHDhtm, the best existing method, which utilizes mul- In addition to predicting the transmembrane protein topology, tiple alignments—a data source that carries extra information. It is TMMOD can also be used for identifying/discriminating helical also worth noting that, because proteins with known topologies (ones membrane proteins from other proteins. In general, this can be done applied for the training) constitute a biased set, the expected accur- by using Forward algorithm to calculate the model likelihood for a acy when applied to entire proteomes may be significantly lower, as given sequence (Durbin et al., 1998). For comparison reasons, we was shown by Melen et al. (2004). adopted the three more refined measures proposed in Krogh et al. In addition to the outstanding performance for TMMOD, several (2001). The first measure, abbreviated as ‘pred. no. tmh’, is simply other observations can also be made from Table 1 about the effect of the number of helices in the most likely structure found by the model. different variations on model architecture and regularization. First, The other two measures are the expected number of residues in trans- we notice that the Dirichlet prior based regularizer is consistently membrane (‘exp. no. aa’) and the expected number of transmembrane more effective than the substitution matrix mixture based regular- helices (‘exp. no. tmh’), which are computed from the posterior prob- izer for all three different architectures. Second, we noticed that abilities. The probability that a given sequence is a membrane protein combining the Dirichlet and the substitute matrix mixture based pri- is higher when the expected number of residues in any of the pre- ors enhanced the model performance, but not always; indeed the dicted helices is high. Since the shortest transmembrane helices are performance was even decreased in some cases. In the contrast, ∼18 residues long, a cut-off value should be set at ∼18. If the expec- we notice that M3 attained the best performance among the three ted number of transmembrane helices in a protein is ≥1, the protein architectures in all three variations of regularizers, suggesting that is likely to be a helical transmembrane protein. the model architecture played a more decisive role for better per- In a discrimination experiment designed to identify the 160 mem- formance. Another observation is that M2, which has two bypassed brane proteins from the 645 water-soluble proteins, the measures ladder-like loop submodels on each side of the membrane, has better described above were calculated using the 10-fold cross-validation performance than M1 which is the original TMHMM architecture; models. This means, the measures for the 16 sequences in a given it is reasonable to believe that the better performance is probably subset are calculated using a model that was trained on the remain- due to the bypass introduced. However, the best performance is ing nine subsets (144 sequences). For the non-membrane proteins, achieved by model M3, which has the bypassed ladder-like loop the averages over the 10 cross-validation models were calculated. submodels on both sides of the membrane, but has an extra, simple Even though all the three measures give discrimination with high submodel for loops (longer that 100) only on the non-cytoplasmic accuracy, we have used the ‘exp. no. aa’ as our standard measure. side. This observation further validates the hypothesis made in Figure 3 shows the fraction of false positives and negatives at different 1856 An improved HMM predicting transmembrane topology 0.01 Table 3. The number of signal peptides mistakenly predicted as transmem- fp_r brane proteins fn_r 0.008 Class No. of signal No. predicted as No. predicted as 0.006 peptides tm by TMHMM tm by TMMOD 0.004 Eukaryotes 1011 209 (21%) 87 (9%) Gram-negatives 266 60 (23%) 33 (12%) 0.002 Gram-positives 141 85 (60%) 60 (43%) The first column is the organism type, and the second column is the total number of 0 6 12 18 24 30 signal peptides in that class. The last two columns are the number of false positives for TMHMM and TMMOD respectively. cutoff Table 4. The number of predicted transmembrane proteins in complete Fig. 3. Discrimination between transmembrane proteins and soluble pro- genomes teins. A decision is made based on the expected number of residues in transmembrane helices (‘exp. no. aa’). The fraction of false negative (con- tinuous line) and false positive predictions (broken line) as a function of the Organism Number Exp. no. aa Pred. tmh cut-off value of ‘exp. no. aa’. of genes ≥18 (%) ≥1 (%) Table 2. False positives by TMMOD and TMHMM Treponema pallidum 1031 20.37 (23.4) 20.5 (23.7) Borrelia burgdorferi 850 25.5 (28.7) 27.8 (28.7) Model PDB entries Description Expected SD Chlamydia pneumoniae 1052 25.9 (27.9) 26.1 (27.8) number aa Chlamydia trachomatis 894 21.7 (23.3) 22.4 (24.5) in membrane Aquifex aeolicus 1522 18.9 (20.3) 19.9 (20.7) Synechococcus sp 3169 23.98 (25.8) 24.1 (25.8) TMHMM 1RDZ (A) Fructose 1,6-bisphosphatase 24.3 3.6 Thermotaga maritima 1846 21.99 (22.9) 22.6 (24.1) 1KVD (A) Smk toxin 24.7 0.5 Methanococcus jannaschii 1715 18.1 (18.5) 18.4 (18.9) 1NOX Nadh oxidase 21.0 1.1 Methanobacterium 1869 19.8 (21.8) 20.4 (21.8) 1CIY CryIA (a) insecticidal toxin 20.6 1.6 thermoautotrophicum 1ENO Enoyl acyl carrier protein 18.9 5.7 Archaeoglobus fulgidus 2407 19.1 (20.3) 19.3 (20.4) reductase Pyrococcus abyssi 1765 21.4 (22.6) 22.2 (22.9) TMMOD 1KVD (A) Smk toxin 27.9 0.2 Pyrococcus horikoshii 2064 23.21 (27.5) 24.4 (25.9) 1CIY_ CryIA (a) insecticidal toxin 29.9 0.1 For each organism, the number of annotated genes, the percentages of membrane proteins predicted using the two measures described in text. The PDB entries all have a known 3D structure. The PDB identifiers with the chain in parenthesis are listed in Column 2. Column 4 contains the expected number of residues in helices (‘exp. no. aa’) averaged over the ten cross-validation models, and the last column is the standard deviation. Genome-wide analysis of membrane proteins For genome annotation purpose, it is desirable to have an accurate cut-off values of this measure. At a cut-off value of 18, TMMOD has estimate of the number of membrane proteins as well as an accur- two false positives (∼0.3%) whereas TMHMM reported five false ate estimate of the frequency for proteins of different topologies to positives (∼0.6%) (Table 2). As it was the case with TMHMM, the be expected in a given genome. The outstanding performance of chlorophyll a–b binding protein ab96 (Swiss-Prot entry CB21_PEA) TMMOD on both discriminating membrane proteins from soluble is the only membrane protein that is classified as a non-membrane proteins and predicting the transmembrane topology has motivated protein (i.e. as false negative). us to apply it to estimate the number of membrane proteins in a Signal peptides and porins collection of organisms with fully sequenced genomes. A similar The signal peptides that target a protein for export contain a hydro- work of using TMHMM is reported in Krogh et al. (2001). Due to phobic region and can easily be mistaken as a transmembrane region. space limitation, we only report estimates for the genomes that do TMMOD was tested on a set of signal peptides by measuring how not develop signal peptides. The complete results for 21 genomes many of the signal peptides were predicted to be membrane pro- are available at the TMMOD webserver. teins by using the measure as described above. As shown in Table 3, A model (M3) was trained using all the 160 sequences in the second TMMOD performed much better than TMHMM at identifying signal training set as described above. For each genome, transmembrane peptides as non-membrane proteins. A substantial improvement over proteins were predicted based on the ‘pred. no. tmh’ and ‘exp. no. TMHMM at discriminating signal peptides from TM is also reported aa’ measures described earlier. For these predicted transmembrane in Phobius, an integrated hidden Markov model that can model both proteins, their topology was also predicted. transmembrane topology and signal peptide (Kall et al., 2004). We Table 4 summarizes the predictions by TMMOD on 12 complete also tested TMMOD on the six porins from Krogh et al. (2001), all genomes, including the number of genes that are predicted to encode of which were correctly predicted as not containing transmembrane integral membrane proteins. In general, the number of predicted helices. integral membrane proteins comprises between ∼20 and ∼30% of fraction R.Y.Kahsay et al. the total number genes for a genome. For almost all organisms in ACKNOWLEDGEMENTS Table 4, TMHMM, whose results are listed in parentheses, has pre- We thank the anonymous reviewers for the useful comments. dicted more proteins as integral membrane proteins than TMMOD This publication was made possible by NIH Grant Number P20 does. This finding, together with the results from the previous dis- RR-15588 from the COBRE Program of the National Center for crimination experiments where TMHMM was shown to have had Research Resources, and by a DuPont Science and Engineering more false positives, leads us to speculate that TMHMM may suffer grant. from problems of over-prediction. As for the occurrence frequencies of different topologies, we found that multispanning proteins with REFERENCES both N- and C-termini inside cytoplasm are strongly preferred in all Arai,M. et al. (2004) ConPred II: a consensus prediction method for obtaining organisms with the exception of C.elegans. This is in agreement with transmembrane topology models with high reliability. Nucleic Acids Res., 32, the predictions from TMHMM. W390–W393. Brown,M., Hughey,R., Krogh,A., Mian,I.S., Sjolander,K. and Haussler,D. (1993) Using dirichlet mixture priors to derive hidden markov models for protein families. Pro- ceedings of the First International Conference on Intelligent Systems for Molecular Biology, AAAI Press, pp. 47–55. DISCUSSION Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998). Biological Sequence Analysis. We presented here an improved hidden Markov model TMMOD Cambridge University Press, Cambridge, UK. for transmembrane topology prediction. In the cross-validation Kahsay,R., Liao,L. and Gao,G. (2004) An improved hidden markov model for trans- experiments on membrane proteins with known topology, TMMOD membrane topology prediction. In The Proceedings of the 16th IEEE Interna- tional Conference for Tools with Artificial Intelligence, IEEE Computer Society, outperforms not only a similar method, TMHMM, on which our pp. 634–639. model is prototyped, but also the previously best method (PHDhtm) Kall,L. et al. (2004) A combined transmembrane topology and signal peptide prediction which utilizes presumably more information from multiple align- method. J. Mol. Biol., 338, 1027–1036. ments. TMMOD also surpassed TMHMM in identifying integral Krogh,A. et al. (2001) Prediction transmembrane protein topology with a hid- den markov model: application to complete genomes. J. Mol. Biol., 305, membrane proteins from other proteins, particularly, signal 567–580. peptides. Liu,J. and Rost,B. (2001) Comparing function and structure between entire proteomes. By running TMMOD on a group of 21 complete genomes, we Protein Sci., 10, 1970–1979. estimate that integral membrane proteins account for ∼20–30% of Melen,K. et al. (2004) Reliability measures for membrane protein topology prediction all genes in all genomes, and that N –C topology of transmembrane algorithms. J. Mol. Biol., 327, 735–744. in in Persson,B. and Argos,P. (1997) Prediction of transmembrane protein topology utilizing proteins, namely with both the N- and C-termini inside cytoplasm, is multiple sequence alignments. J. Protein Chem., 16, 453–457. preferred in all organisms except C.elegans. This result is in general Rost,B. et al. (1996) Topology prediction for helical transmembrane proteins at 86% agreement with what is reported in Krogh et al. (2001). accuracy. Protein Sci., 5, 1704–1718. By experimenting with different variations of model architecture Sjolander,K. et al. (1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci., 12, and training regularizers, we concluded that the model architecture is 327–345. a more decisive factor for better performance. It is of further interest Sonnhammer,E. et al. (1998) A hidden Markov model for predicting transmembrane to refine the model architecture, particularly in such a way that helices in protein sequences. Proc. ISMB 6, 175–182. long range correlations across different regions of integral membrane Viklund,H. and Elofsson,A. (2004) Best α-helical transmembrane protein topology pre- proteins can be better captured. dictions are achieved using hidden Markov models and evolutionary information. Protein Sci., 13, 1908–1917. It is worth noting that substantial improvements in accuracy over von Heijne,G. (1986) The distribution of positively charged residues in bacterial inner TMHMM are also reported in some recent works (Arai et al., 2004; membrane proteins correlates with the transmembrane topology. EMBO J., 5, Kall et al., 2004). These methods achieved better performance either 3021–3027. via a ‘consensus’ of various individual methods (Arai et al., 2004) von Heijne,G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol., 225, 487–494. or by a more integrated way (Kall et al., 2004). Although it is dif- von Heijne,G. (1994) Membrane proteins: from sequence to structure. Annu. Rev. ficult to directly compare TMMOD to these methods due to the use Biophys. Biomol. Struct., 23, 167–192. of different datasets, these methods can benefit from TMMOD by Wallin,E. and von Heijne,G. (1998) Genome-wide analysis of integral membrane either weighing in TMMOD for the ‘consensus’ or incorporating proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci., 7, 1029–1038. TMMOD’s loop treatments into the architecture. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Bioinformatics Oxford University Press

An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes

Bioinformatics , Volume 21 (9): 6 – Feb 2, 2005

Loading next page...
 
/lp/oxford-university-press/an-improved-hidden-markov-model-for-transmembrane-protein-detection-N8DgVEXdlj

References (4)

Publisher
Oxford University Press
Copyright
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org
ISSN
1367-4803
eISSN
1460-2059
DOI
10.1093/bioinformatics/bti303
pmid
15691854
Publisher site
See Article on Publisher Site

Abstract

Vol. 21 no. 9 2005, pages 1853–1858 BIOINFORMATICS ORIGINAL PAPER doi:10.1093/bioinformatics/bti303 Sequence analysis An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes 1 1 1,2,∗ Robel Y. Kahsay , Guang Gao and Li Liao 1 2 Delaware Biotechnology Institute, Newark, DE 19715, USA and Department of Computer and Information Sciences, University of Delaware, 103 Smith Hall, Newark, DE 19716, USA Received on November 15, 2004; revised on January 7, 2005; accepted on January 27, 2005 Advance Access publication February 2, 2005 ABSTRACT Unfortunately, membrane proteins are hard to solubilize and purify Motivation: Knowledge of the transmembrane helical topology can in their native conformation because they have both hydrophobic and help identify binding sites and infer functions for membrane proteins. hydrophilic regions on their surfaces, and thus it is difficult to determ- However, because membrane proteins are hard to solubilize and ine their structure experimentally. Such a situation has motivated purify, only a very small amount of membrane proteins have structure the development of various computational methods for predicting and topology experimentally determined. This has motivated vari- the topology of membrane proteins. Most of these computational ous computational methods for predicting the topology of membrane approaches rely on the compositional bias of amino acids at differ- proteins. ent regions of the sequence (von Heijne, 1994). For example, there is Results: We present an improved hidden Markov model, TMMOD, for a high propensity of hydrophobic residues in transmembrane alpha the identification and topology prediction of transmembrane proteins. helices due to the hydrophobic environment in lipid membranes. Our model uses TMHMM as a prototype, but differs from TMHMM by Because such a bias is quite noticeable and consistent, the location the architecture of the submodels for loops on both sides of the mem- of transmembrane domains can often be easily identified with high brane and also by the model training procedure. In cross-validation accuracy even by a simple method such as applying a threshold on experiments using a set of 83 transmembrane proteins with known the hydrophobic propensity curve. topology, TMMOD outperformed TMHMM and other existing methods, Another compositional signal in membrane proteins is the abund- with an accuracy of 89% for both topology and locations. In another ance of positively charged residues in the segments (loops) that are experiment using a separate set of 160 transmembrane proteins, located on the cytoplasmic side of the membrane and therefore is TMMOD had 84% for topology and 89% for locations. When util- referred to as the ‘positive inside rule’ for predicting the orienta- ized for identifying transmembrane proteins from non-transmembrane tion of a transmembrane helix (von Heijne, 1986, 1992). Unlike the proteins, particularly signal peptides, TMMOD has consistently fewer hydrophobicity signal for transmembrane helices, the ‘positive inside false positives than TMHMM does. Application of TMMOD to a col- rule’ is a weaker signal and often confused by significant presence of lection of complete genomes shows that the number of predicted positively charged residues in globular domains of the protein on the membrane proteins accounts for ∼20–30% of all genes in those gen- non-cytoplasmic side. Consequently, it is more difficult to correctly omes, and that the topology where both the N- and C-termini are in the predict the overall topology of a given protein, i.e. the orientation of cytoplasm is dominant in these organisms except for Caenorhabditis all transmembrane segments. elegans. There are basically two ways for improving the prediction accur- Availability: http://liao.cis.udel.edu/website/servers/TMMOD/ acy of any given model: by enhancing the signal/noise ratio for those Contact: lliao@cis.udel.edu weak signals or by identifying new signals and associating them with the topology. For example, significant improvements of prediction accuracy were reported (Persson and Argos, 1994) by applying mul- INTRODUCTION tiple sequence alignment to proteins with similar topology so that Membrane proteins serve as highly active mediators between the the positive residue content in the cytoplasmic loops may become cell and its environment or between the interior of an organelle and obvious in the aligned motifs. A more recent work along this line is the cytosol. They catalyze specific metabolites and ions across the PRODIV–TMHMM, a profile-based hidden Markov model (Viklund membrane barriers, convert the energy of sunlight into chemical and and Elofsson, 2004), where a 10% increase in performance is repor- electrical energy and couple the flow of electrons to the synthesis of ted with the use of homologous sequences. However, it shall be noted ATP. Furthermore, they act as signal receptors and transduce signals that multiple sequence alignment may not always be suitable, either such as neurotransmitters, growth factors and hormones across the due to insufficient number of homologs or due to the length variations membrane. Because of their vast functional roles, membrane proteins in these cytoplasmic loops. Other methods have been attempted at are important targets of pharmacological agents. exploring more subtle signals such as correlation of compositional bias at different positions. The best performance attained so far is by using artificial neural networks (Rost et al., 1996), a method known To whom correspondence should be addressed. © The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org 1853 R.Y.Kahsay et al. for its capability of capturing complex nonlinear signals. Despite its improvement at prediction accuracy, the artificial neural network method, well known for its black-box property, provides little insight into those signals that the network is designed to capture. A hidden Markov model, TMHMM, has recently been used for transmembrane topology prediction (Sonnhammer et al., 1998; Krogh et al., 2001). Hidden Markov models, as a probabilistic framework, have been widely applied in computational biology with remarkable success (Durbin et al., 1998). Unlike artificial neural networks, the architecture of hidden Markov models corresponds closely to the biological entities being simulated by the model. In TMHMM, the model comprises seven sets of states, with each set corresponding to a type of regions in the protein sequence. Each set of states has an associated probability distribution over the 20 amino acids characterizing the compositional bias in the corresponding regions. In addition, the model architecture specifies the intercon- nection of states within each set or submodel and also specifies how these submodels are connected to one another. Transitions among Fig. 1. Architecture of the four submodels: (A) transmembrane submodel states within a given submodel determine the length distribution of (C, M and X state types), (B) cytoplasmic loop submodel (G and I state the corresponding regions whereas transitions from one submodel types), (C) non-cytoplasmic short loop submodel (S and G state types), (D) to another reflect how the different regions are arranged to form non-cytoplasmic long loop submodel (L and G state types). To assemble the the entire protein. The transition probabilities, along with emission model, panel (B) shall be attached to the left of panel (A), with UI1 and LI1 in (B) pointing to C1 in (A), and C5 in (A) pointing to LI1 and TI1 in B. Both frequencies, enable the model to capture correlations among signals. panels (C) and (D) shall be attached to the right of panel (A). In this work, we present an improved hidden Markov model, TMMOD, for predicting transmembrane topology and identifying transmembrane proteins from soluble proteins. The TMMOD differs amino acid compositions at near the border between loops and helices, the first from TMHMM in both model architecture and training procedure. and last 10 residues of a loop region are explicitly modeled, i.e. each residue The architectural differences are on the cytoplasmic and the non- corresponds to an individual state in the model. These 20 states are marked as cytoplasmic loop submodels. For the training procedure, we adopt LI1–LI10 and UI1–UI10 in Figure 1B for loops inside the cytoplasm and as LS1–LS10 and US1–US10 in Figure 1C for short loops in the non-cytoplasm. the Bayesian based approach where the model parameters are set As shown in Figure 1B and C, a ladder-like structure is formed to allow for by posterior mean estimator (PME). In cross-validation experiments loop length to vary from just one residue, by traversing only state LI1 or LS1, using the datasets which were used by TMHMM, our model outper- to 20 residues, by traversing all 20 states. All other residues in the middle formed TMHMM in both topology prediction and identification of of a loop longer than 20 are collectively represented by one ‘globular’ state membrane proteins. which has a transition back to itself and thus can repeat as many times as the loop length dictates. Following TMHMM, since non-cytoplasmic loops MATERIALS AND METHODS longer than 100 appear to have compositional characteristics different from those of the short non-cytoplasmic loops, two separate non-cytoplasmic loop The TMMOD model architecture submodels are used for representing them, as depicted in Figure 1C and D. The overall skeleton of the TMMOD’s architecture is a linear structure of Inspired by TMHMM’s design of using a separate submodel for long three two-way connected submodels for cytoplasmic loop, transmembrane loops, we studied the length distribution of the loops in the training sequences helix and non-cytoplasmic loop. The two-way connections between the cyto- (Fig. 2). The length distribution shows that, ∼90% of the loops are shorter plasmic loop and the transmembrane helix and between the transmembrane than 40 residues, and the rest are quite spread out, as indicated by a long tail. helix and the non-cytoplasmic loop, plus a self return connection in the loop Similar findings with respect to the loop length distribution were reported in submodels, allow a path cycling through the three components of transmem- Wallin and von Heijne (1998) and Liu and Rost (2001). To capture this distri- brane proteins: cyto-loop, helix and noncyto-loop. A path can start with either bution more effectively, we introduced a separate chain of states (Figure 1B a cyto-loop or a noncyto-loop, reflecting the fact that a transmembrane protein and C) in parallel to the ladder-like structure in cytoplasmic and short non- can have its N-terminus either inside or outside the cell. The architectures of cytoplasmic loop submodels. As such, we want the transition parameters in the submodels for these three components are illustrated in Figure 1. the ladder-like part of the submodels to explicitly model the length distribu- The submodel for transmembrane helix, identical to that of TMHMM, has tion of the loops that are <40 amino acids, while the longer loops are directed two cap regions each of five residues surrounding a core region of variable through the bypass. More specifically, the transition probability LI → LI k k+1 length 5–25 residues (Fig. 1A). Therefore, the model can represent helices or LSI → LS now reflects the likelihood of loops with length >2k but k k+1 of size 15–35 residues long, a range that covers the actual sizes observed for <40; whereas the same transition parameter in a submodel without the bypass transmembrane domains. This submodel contains two chains of transmem- would have to reflect the likelihood of all loops with lengths >2k. We expect brane states, with one chain going inwards and the other going outwards, that such an effective estimation of the distribution of loop lengths would as a mechanism to enforce the structural constraint, i.e. a transmembrane enhance the signal-to-noise ratio of the topogenic signal. helix has to span the membrane. Since there are no observed differences in Another architectural difference from TMHMM is the use of a simpler amino acid composition and length distributions between ‘inwards’ helices submodel, as shown in Figure 1D, for non-cytoplasmic loops with lengths and ‘outwards’ helices, the emission and transition parameters for these two >100 amino acids. These long loops do not require a ladder-like structure chains are estimated collectively. since all of them are >100 amino acids and thus there is no need for an early The architecture of TMMOD differs from that of TMHMM by how the exit. Overall, TMHMM reported 83 transition and 133 emission parameters, loops are modeled (Fig. 1B, C and D). In order to capture the known biases of whereas our model has 66 transition and 133 emission parameters. 1854 An improved HMM predicting transmembrane topology The topology of a membrane protein is predicted using Viterbi algorithm. cyto We also compute the three posterior probabilities that a given residue is in a transmembrane helix, on the cytoplasmic side or on the periplasmic side. This ac yto additional information, which at times can be even more informative than the single most probable state path, shows where the prediction is certain and what alternatives there might be. Datasets The two datasets used to validate the model on topology prediction were downloaded from the TMHMM website (http://www.cbs.dtu.dk/ services/TMHMM). The first dataset contains 83 transmembrane sequences of known topology, with 45 of them being single spanning. The second data- set has 160 transmembrane sequences, with 52 of them being single spanning. The topology of most proteins in these datasets is determined experimentally. We adopted the same 10-fold cross-validation for topology prediction as 0 20 40 60 80 100 in Sonnhammer et al. (1998). Both datasets are divided into 10 subsets. The subsets from the first dataset contain either eight or nine sequences, and all length the subsets of the second dataset have exactly 16 sequences each. To make the learning task more challenging, the subsets are prepared in such a way that Fig. 2. Length distribution of cytoplasmic and short non-cytoplasmic loops. sequences from different subsets are no more than 25% identical to each other. The model is trained on nine subsets and then is used to make predictions on the remaining subset. This is repeated 10 times, and each time a different The TMMOD model training subset is selected as the test set. The prediction accuracy is the average over Model parameters are estimated by Bayesian approach (PME) using single the 10 runs. Dirichlet and substitution matrix mixtures priors (Durbin et al., 1998). For For discrimination or identification experiments, test datasets contain the each of the seven types of states as shown in Figure 1, the substitution matrix set of 160 transmembrane proteins (positives) and other non-transmembrane mixtures is given by proteins (negatives). The test datasets for discrimination experiments are β = A c P(a|b) (1) the same as used in Krogh et al. (2001). These datasets include 645 sol- ja jb b uble proteins, six porins and a set of signal peptides from different classes of organisms. For whole genome analysis, all genes annotated in the where β is the pseudocount for amino acid ‘a’ in state type j , c is the ja jb Genbank entry of the genomes and chromosomes used were downloaded observed frequency (or count) for amino acid ‘b’ in state type j , P(a|b) is from ftp://ncbi.nlm.nih.gov/genbank/genomes/, except for Caenorhabditis the conditional probability of amino acid ‘a’ given amino acid ‘b’ (derived elegans, which was downloaded from the URL: ftp://genome.wustl.edu/pub/ from BLOSUM50 matrix), and A is a constant. The technique of using Dirichlet prior assumes that the observed frequen- cies of 20 amino acids in each of the seven types of states were stochastically RESULTS generated from a distributionp  = (p , ... , p ), which itself is chosen from 1 20 a distribution specified by a parametric Dirichlet density ρ(p)  , Topology prediction 20 α −1 a=1 i The accuracy is measured by the number of sequences from ρ(p)  = (2) the test sets whose topology and location of all transmembrane where Z is the normalizing constant. Each of the training sequences, with helices are correctly predicted. Following the same criterion used their topology known, is partitioned into segments according to the state in Sonnhammer et al. (1998), a predicted helix is counted as correct types such that all residues in a segment are emitted from the same type of if it overlaps by at least five residues with a true helix. The perform- states. An observed count vector over amino acids is found for each of these ance is also measured by the sensitivity and specificity for identifying segments, and these count vectors are grouped into seven classes according individual transmembrane domains. to the state types. For each class, the parametric Dirichlet density function To help us understand and assess how the model architecture (α parameters) is estimated from the observed count vectors by following a and use of different regularizers have contributed to the perform- procedure outlined in Brown et al. (1993) and Sjolander et al. (1996). Then, a pseudocount for amino acid ‘a’ in states of type j is given as ance, three variations of the architecture, including the one shown α in Figure 1, and three regularization schemes are tested. Model M1 ja σ = A (3) ja α  is the architecture of TMHMM with our training. Model M2 has ja two bypassed ladder-like submodels on each side of the membrane, The above equation for deriving pseudocounts differs from the standard by a design intended to see if differentiating long and short loops on the constant A, which is introduced to tighten the Dirichlet density without the cytoplasmic side as well will perform better. Model M3 is the affecting its mean (Durbin et al., 1998). The final emission frequency of amino acid ‘a’ from state type j after adding both types of pseudocounts is architecture shown in Figure 1. Regularizer scheme (a) uses Dirichlet then given as follows: prior based pseudocounts; (b) uses substitution matrix based pseudo- counts; and (c) uses both. The performance for these variations on the c + σ + β ja ja ja e =  (4) ja two datasets is given in Table 1. It is shown that the model depicted (c  + σ  + β  ) ja ja ja in Figure 1, using both Dirichlet and substitute matrix based regu- We also produce a single component Dirichlet pseudocount vector to regu- larizers, has achieved the best performance: 89% accuracy for both larize transitions in the ladder-like part of the submodels by taking the three topology and location on the first dataset (83 sequences), and 84% outgoing transition counts from each of the lower chain of states as vectors accuracy for topology and 89% accuracy for locations on the second in three dimensions. A detailed description of our model training procedure is described in Kahsay et al. (2004). dataset (160 sequences). The performance improvement of TMMOD No. loops R.Y.Kahsay et al. Table 1. Prediction accuracy for the cross-validation experiments Model Regularizer scheme Dataset Correct topology Correct location Sensitivity (%) Specificity (%) M1 (a) S-83 65 (78.3%) 67 (80.7%) 97.4 97.4 (b) 51 (61.4%) 52 (62.7%) 71.3 71.3 (c) 64 (77.1%) 65 (78.3%) 97.1 97.1 M2 (a) S-83 61 (73.5%) 65 (78.3%) 99.4 97.4 (b) 54 (65.1%) 61 (73.5%) 93.8 71.3 (c) 54 (65.1%) 66 (79.5%) 99.7 97.1 M3 (a) S-83 70 (84.3%) 71 (85.5%) 98.2 97.4 (b) 64 (77.1%) 65 (78.3%) 95.3 71.3 (c) 74 (89.2%) 74 (89.2%) 99.1 97.1 TMHMM S-83 64 (77.1%) 69 (83.1%) 96.2 96.2 PHDtm S-83 (85.5%) (88.0%) 98.8 95.2 M1 (a) S-160 117 (73.1%) 128 (80.0%) 97.4 97.0 (b) 92 (57.5%) 103 (64.4%) 77.4 80.8 (c) 117 (73.1%) 126 (78.8%) 96.1 96.7 M2 (a) S-160 120 (75.0%) 132 (82.5%) 98.4 97.2 (b) 97 (60.6%) 121 (75.6%) 97.7 95.6 (c) 118 (73.8%) 135 (84.4%) 98.4 97.2 M3 (a) S-160 120 (75.0%) 133 (83.1%) 97.8 97.6 (b) 110 (68.8%) 124 (77.5%) 94.5 98.1 (c) 135 (84.4%) 143 (89.4%) 98.3 98.1 TMHMM S-160 123 (76.9%) 134 (83.8%) 97.1 97.7 Numbers in bold represent the best results from different methods. over that of TMHMM (77% topology and 83% locations on the first TMHMM that differentiation of short and long loops only applies to dataset, and 77% topology and 84% locations on the second dataset) the non-cytoplasmic side. is significant. It is noted that, on the first dataset where the results Discrimination between non-membrane and for PHDhtm are also available, the TMMOD’s performance even membrane proteins slightly exceeds the performance (86% for topology and 88% for locations) of PHDhtm, the best existing method, which utilizes mul- In addition to predicting the transmembrane protein topology, tiple alignments—a data source that carries extra information. It is TMMOD can also be used for identifying/discriminating helical also worth noting that, because proteins with known topologies (ones membrane proteins from other proteins. In general, this can be done applied for the training) constitute a biased set, the expected accur- by using Forward algorithm to calculate the model likelihood for a acy when applied to entire proteomes may be significantly lower, as given sequence (Durbin et al., 1998). For comparison reasons, we was shown by Melen et al. (2004). adopted the three more refined measures proposed in Krogh et al. In addition to the outstanding performance for TMMOD, several (2001). The first measure, abbreviated as ‘pred. no. tmh’, is simply other observations can also be made from Table 1 about the effect of the number of helices in the most likely structure found by the model. different variations on model architecture and regularization. First, The other two measures are the expected number of residues in trans- we notice that the Dirichlet prior based regularizer is consistently membrane (‘exp. no. aa’) and the expected number of transmembrane more effective than the substitution matrix mixture based regular- helices (‘exp. no. tmh’), which are computed from the posterior prob- izer for all three different architectures. Second, we noticed that abilities. The probability that a given sequence is a membrane protein combining the Dirichlet and the substitute matrix mixture based pri- is higher when the expected number of residues in any of the pre- ors enhanced the model performance, but not always; indeed the dicted helices is high. Since the shortest transmembrane helices are performance was even decreased in some cases. In the contrast, ∼18 residues long, a cut-off value should be set at ∼18. If the expec- we notice that M3 attained the best performance among the three ted number of transmembrane helices in a protein is ≥1, the protein architectures in all three variations of regularizers, suggesting that is likely to be a helical transmembrane protein. the model architecture played a more decisive role for better per- In a discrimination experiment designed to identify the 160 mem- formance. Another observation is that M2, which has two bypassed brane proteins from the 645 water-soluble proteins, the measures ladder-like loop submodels on each side of the membrane, has better described above were calculated using the 10-fold cross-validation performance than M1 which is the original TMHMM architecture; models. This means, the measures for the 16 sequences in a given it is reasonable to believe that the better performance is probably subset are calculated using a model that was trained on the remain- due to the bypass introduced. However, the best performance is ing nine subsets (144 sequences). For the non-membrane proteins, achieved by model M3, which has the bypassed ladder-like loop the averages over the 10 cross-validation models were calculated. submodels on both sides of the membrane, but has an extra, simple Even though all the three measures give discrimination with high submodel for loops (longer that 100) only on the non-cytoplasmic accuracy, we have used the ‘exp. no. aa’ as our standard measure. side. This observation further validates the hypothesis made in Figure 3 shows the fraction of false positives and negatives at different 1856 An improved HMM predicting transmembrane topology 0.01 Table 3. The number of signal peptides mistakenly predicted as transmem- fp_r brane proteins fn_r 0.008 Class No. of signal No. predicted as No. predicted as 0.006 peptides tm by TMHMM tm by TMMOD 0.004 Eukaryotes 1011 209 (21%) 87 (9%) Gram-negatives 266 60 (23%) 33 (12%) 0.002 Gram-positives 141 85 (60%) 60 (43%) The first column is the organism type, and the second column is the total number of 0 6 12 18 24 30 signal peptides in that class. The last two columns are the number of false positives for TMHMM and TMMOD respectively. cutoff Table 4. The number of predicted transmembrane proteins in complete Fig. 3. Discrimination between transmembrane proteins and soluble pro- genomes teins. A decision is made based on the expected number of residues in transmembrane helices (‘exp. no. aa’). The fraction of false negative (con- tinuous line) and false positive predictions (broken line) as a function of the Organism Number Exp. no. aa Pred. tmh cut-off value of ‘exp. no. aa’. of genes ≥18 (%) ≥1 (%) Table 2. False positives by TMMOD and TMHMM Treponema pallidum 1031 20.37 (23.4) 20.5 (23.7) Borrelia burgdorferi 850 25.5 (28.7) 27.8 (28.7) Model PDB entries Description Expected SD Chlamydia pneumoniae 1052 25.9 (27.9) 26.1 (27.8) number aa Chlamydia trachomatis 894 21.7 (23.3) 22.4 (24.5) in membrane Aquifex aeolicus 1522 18.9 (20.3) 19.9 (20.7) Synechococcus sp 3169 23.98 (25.8) 24.1 (25.8) TMHMM 1RDZ (A) Fructose 1,6-bisphosphatase 24.3 3.6 Thermotaga maritima 1846 21.99 (22.9) 22.6 (24.1) 1KVD (A) Smk toxin 24.7 0.5 Methanococcus jannaschii 1715 18.1 (18.5) 18.4 (18.9) 1NOX Nadh oxidase 21.0 1.1 Methanobacterium 1869 19.8 (21.8) 20.4 (21.8) 1CIY CryIA (a) insecticidal toxin 20.6 1.6 thermoautotrophicum 1ENO Enoyl acyl carrier protein 18.9 5.7 Archaeoglobus fulgidus 2407 19.1 (20.3) 19.3 (20.4) reductase Pyrococcus abyssi 1765 21.4 (22.6) 22.2 (22.9) TMMOD 1KVD (A) Smk toxin 27.9 0.2 Pyrococcus horikoshii 2064 23.21 (27.5) 24.4 (25.9) 1CIY_ CryIA (a) insecticidal toxin 29.9 0.1 For each organism, the number of annotated genes, the percentages of membrane proteins predicted using the two measures described in text. The PDB entries all have a known 3D structure. The PDB identifiers with the chain in parenthesis are listed in Column 2. Column 4 contains the expected number of residues in helices (‘exp. no. aa’) averaged over the ten cross-validation models, and the last column is the standard deviation. Genome-wide analysis of membrane proteins For genome annotation purpose, it is desirable to have an accurate cut-off values of this measure. At a cut-off value of 18, TMMOD has estimate of the number of membrane proteins as well as an accur- two false positives (∼0.3%) whereas TMHMM reported five false ate estimate of the frequency for proteins of different topologies to positives (∼0.6%) (Table 2). As it was the case with TMHMM, the be expected in a given genome. The outstanding performance of chlorophyll a–b binding protein ab96 (Swiss-Prot entry CB21_PEA) TMMOD on both discriminating membrane proteins from soluble is the only membrane protein that is classified as a non-membrane proteins and predicting the transmembrane topology has motivated protein (i.e. as false negative). us to apply it to estimate the number of membrane proteins in a Signal peptides and porins collection of organisms with fully sequenced genomes. A similar The signal peptides that target a protein for export contain a hydro- work of using TMHMM is reported in Krogh et al. (2001). Due to phobic region and can easily be mistaken as a transmembrane region. space limitation, we only report estimates for the genomes that do TMMOD was tested on a set of signal peptides by measuring how not develop signal peptides. The complete results for 21 genomes many of the signal peptides were predicted to be membrane pro- are available at the TMMOD webserver. teins by using the measure as described above. As shown in Table 3, A model (M3) was trained using all the 160 sequences in the second TMMOD performed much better than TMHMM at identifying signal training set as described above. For each genome, transmembrane peptides as non-membrane proteins. A substantial improvement over proteins were predicted based on the ‘pred. no. tmh’ and ‘exp. no. TMHMM at discriminating signal peptides from TM is also reported aa’ measures described earlier. For these predicted transmembrane in Phobius, an integrated hidden Markov model that can model both proteins, their topology was also predicted. transmembrane topology and signal peptide (Kall et al., 2004). We Table 4 summarizes the predictions by TMMOD on 12 complete also tested TMMOD on the six porins from Krogh et al. (2001), all genomes, including the number of genes that are predicted to encode of which were correctly predicted as not containing transmembrane integral membrane proteins. In general, the number of predicted helices. integral membrane proteins comprises between ∼20 and ∼30% of fraction R.Y.Kahsay et al. the total number genes for a genome. For almost all organisms in ACKNOWLEDGEMENTS Table 4, TMHMM, whose results are listed in parentheses, has pre- We thank the anonymous reviewers for the useful comments. dicted more proteins as integral membrane proteins than TMMOD This publication was made possible by NIH Grant Number P20 does. This finding, together with the results from the previous dis- RR-15588 from the COBRE Program of the National Center for crimination experiments where TMHMM was shown to have had Research Resources, and by a DuPont Science and Engineering more false positives, leads us to speculate that TMHMM may suffer grant. from problems of over-prediction. As for the occurrence frequencies of different topologies, we found that multispanning proteins with REFERENCES both N- and C-termini inside cytoplasm are strongly preferred in all Arai,M. et al. (2004) ConPred II: a consensus prediction method for obtaining organisms with the exception of C.elegans. This is in agreement with transmembrane topology models with high reliability. Nucleic Acids Res., 32, the predictions from TMHMM. W390–W393. Brown,M., Hughey,R., Krogh,A., Mian,I.S., Sjolander,K. and Haussler,D. (1993) Using dirichlet mixture priors to derive hidden markov models for protein families. Pro- ceedings of the First International Conference on Intelligent Systems for Molecular Biology, AAAI Press, pp. 47–55. DISCUSSION Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998). Biological Sequence Analysis. We presented here an improved hidden Markov model TMMOD Cambridge University Press, Cambridge, UK. for transmembrane topology prediction. In the cross-validation Kahsay,R., Liao,L. and Gao,G. (2004) An improved hidden markov model for trans- experiments on membrane proteins with known topology, TMMOD membrane topology prediction. In The Proceedings of the 16th IEEE Interna- tional Conference for Tools with Artificial Intelligence, IEEE Computer Society, outperforms not only a similar method, TMHMM, on which our pp. 634–639. model is prototyped, but also the previously best method (PHDhtm) Kall,L. et al. (2004) A combined transmembrane topology and signal peptide prediction which utilizes presumably more information from multiple align- method. J. Mol. Biol., 338, 1027–1036. ments. TMMOD also surpassed TMHMM in identifying integral Krogh,A. et al. (2001) Prediction transmembrane protein topology with a hid- den markov model: application to complete genomes. J. Mol. Biol., 305, membrane proteins from other proteins, particularly, signal 567–580. peptides. Liu,J. and Rost,B. (2001) Comparing function and structure between entire proteomes. By running TMMOD on a group of 21 complete genomes, we Protein Sci., 10, 1970–1979. estimate that integral membrane proteins account for ∼20–30% of Melen,K. et al. (2004) Reliability measures for membrane protein topology prediction all genes in all genomes, and that N –C topology of transmembrane algorithms. J. Mol. Biol., 327, 735–744. in in Persson,B. and Argos,P. (1997) Prediction of transmembrane protein topology utilizing proteins, namely with both the N- and C-termini inside cytoplasm, is multiple sequence alignments. J. Protein Chem., 16, 453–457. preferred in all organisms except C.elegans. This result is in general Rost,B. et al. (1996) Topology prediction for helical transmembrane proteins at 86% agreement with what is reported in Krogh et al. (2001). accuracy. Protein Sci., 5, 1704–1718. By experimenting with different variations of model architecture Sjolander,K. et al. (1996) Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci., 12, and training regularizers, we concluded that the model architecture is 327–345. a more decisive factor for better performance. It is of further interest Sonnhammer,E. et al. (1998) A hidden Markov model for predicting transmembrane to refine the model architecture, particularly in such a way that helices in protein sequences. Proc. ISMB 6, 175–182. long range correlations across different regions of integral membrane Viklund,H. and Elofsson,A. (2004) Best α-helical transmembrane protein topology pre- proteins can be better captured. dictions are achieved using hidden Markov models and evolutionary information. Protein Sci., 13, 1908–1917. It is worth noting that substantial improvements in accuracy over von Heijne,G. (1986) The distribution of positively charged residues in bacterial inner TMHMM are also reported in some recent works (Arai et al., 2004; membrane proteins correlates with the transmembrane topology. EMBO J., 5, Kall et al., 2004). These methods achieved better performance either 3021–3027. via a ‘consensus’ of various individual methods (Arai et al., 2004) von Heijne,G. (1992) Membrane protein structure prediction. Hydrophobicity analysis and the positive-inside rule. J. Mol. Biol., 225, 487–494. or by a more integrated way (Kall et al., 2004). Although it is dif- von Heijne,G. (1994) Membrane proteins: from sequence to structure. Annu. Rev. ficult to directly compare TMMOD to these methods due to the use Biophys. Biomol. Struct., 23, 167–192. of different datasets, these methods can benefit from TMMOD by Wallin,E. and von Heijne,G. (1998) Genome-wide analysis of integral membrane either weighing in TMMOD for the ‘consensus’ or incorporating proteins from eubacterial, archaean, and eukaryotic organisms. Protein Sci., 7, 1029–1038. TMMOD’s loop treatments into the architecture.

Journal

BioinformaticsOxford University Press

Published: Feb 2, 2005

There are no references for this article.