Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

MUSCLE: multiple sequence alignment with high accuracy and high throughput

MUSCLE: multiple sequence alignment with high accuracy and high throughput Published online March 19, 2004 1792±1797 Nucleic Acids Research, 2004, Vol. 32, No. 5 DOI: 10.1093/nar/gkh340 MUSCLE: multiple sequence alignment with high accuracy and high throughput Robert C. Edgar* 195 Roque Moraes Drive, Mill Valley, CA 94941, USA Received January 19, 2004; Revised January 30, 2004; Accepted February 24, 2004 ABSTRACT variant on this strategy is used by T-Coffee (5), which aligns pro®les by optimizing a score derived from local and global We describe MUSCLE, a new computer program for alignments of all pairs of input sequences. Misalignments by creating multiple alignments of protein sequences. progressive methods are sometimes readily apparent (Fig. 1), Elements of the algorithm include fast distance motivating further processing (re®nement). For a recent estimation using kmer counting, progressive align- review of multiple alignment methods, see Notredame (6). ment using a new pro®le function we call the log- Here we describe MUSCLE (multiple sequence comparison by log-expectation), a new computer program for multiple expectation score, and re®nement using tree- protein sequence alignment. dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of refer- MUSCLE ALGORITHM ence alignments: BAliBASE, SABmark, SMART and Here we give an overview of the algorithm; a more detailed a new benchmark, PREFAB. MUSCLE achieves the discussion is given in Edgar (submitted). Following guide tree highest, or joint highest, rank in accuracy on each construction, the fundamental step is pairwise pro®le align- of these sets. Without re®nement, MUSCLE ment, which is used ®rst for progressive alignment and then achieves average accuracy statistically indistin- for re®nement. This is similar to the strategies used by PRRP guishable from T-Coffee and MAFFT, and is the (7) and MAFFT (8). fastest of the tested methods for large numbers of Distance measures and guide tree estimation sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. MUSCLE uses two distance measures for a pair of sequences: The MUSCLE program, source code and PREFAB a kmer distance (for an unaligned pair) and the Kimura test data are freely available at http://www.drive5. distance (for an aligned pair). A kmer is a contiguous subsequence of length k, also known as a word or k-tuple. com/muscle. Related sequences tend to have more kmers in common than expected by chance. The kmer distance is derived from the fraction of kmers in common in a compressed alphabet, which INTRODUCTION we have previously shown to correlate well with fractional Multiple alignments of protein sequences are important in identity (9). This measure does not require an alignment, many applications, including phylogenetic tree estimation, giving a signi®cant speed advantage. Given an aligned pair of structure prediction and critical residue identi®cation. The sequences, we compute the pairwise identity and convert to an most natural formulation of the computational problem is to additive distance estimate, applying the Kimura correction for de®ne a model of sequence evolution that assigns probabilities multiple substitutions at a single site (10). Distance matrices to elementary sequence edits and seeks a most probable are clustered using UPGMA (11), which we ®nd to give directed graph in which edges represent edits and terminal slightly improved results over neighbor-joining (12), despite nodes are the observed sequences. No tractable method for the expectation that neighbor-joining will give a more reliable ®nding such a graph is known. A heuristic alternative is to estimate of the evolutionary tree. This can be explained by seek a multiple alignment that optimizes the sum of pairs (SP) assuming that in progressive alignment, the best accuracy score, i.e. the sum of pairwise alignment scores. Optimizing is obtained at each node by aligning the two pro®les that the SP score is NP complete (1) and can be achieved by have fewest differences, even if they are not evolutionary dynamic programming with time and space complexity O(L ) neighbors. in the sequence length L and number of sequences N (2). A Pro®le alignment more popular strategy is the progressive method (3,4), which ®rst estimates a tree and then constructs a pairwise alignment In order to apply pairwise alignment to pro®les, a scoring of the subtrees found at each internal node. A subtree is function must be de®ned on an aligned pair of pro®le represented by its pro®le, a multiple alignment treated as a positions, i.e. a pair of multiple alignment columns [see, for sequence by regarding each column as an alignable symbol. A example Edgar and Sjolander (13)]. Let i and j be amino acid *Email: bob@drive5.com Nucleic Acids Research, Vol. 32 No. 5 ã Oxford University Press 2004; all rights reserved Nucleic Acids Research, 2004, Vol. 32, No. 5 1793 Figure 1. Motifs misaligned by a progressive method. A set of 41 sequences containing SH2 domains (44) were aligned by the progressive method T-Coffee (above), and by MUSCLE (below). The N-terminal region of a subset of ®ve sequences is shown. The highlighted columns (upper case) are conserved within this family but are misaligned by T-Coffee. It should be noted that T-Coffee aligns these motifs correctly when given these ®ve sequences alone; the problem arises in the context of the other sequences. Figure 2. This diagram summarizes the ¯ow of the MUSCLE algorithm. Complete alignments are available at http://www.drive5.com/muscle. There are three main stages: Stage 1 (draft progressive), Stage 2 (improved progressive) and Stage 3 (re®nement). A multiple alignment is available at the completion of each stage, at which point the algorithm may types, p the background probability of i, p the joint terminate. i ij probability of i and j being aligned to each other, f the observed frequency of i in column x of the ®rst pro®le, and f internal node, a pairwise alignment is constructed of the two the observed frequency of gaps in that column at position x in child pro®les, giving a new pro®le which is assigned to that the family (similarly for position y in the second pro®le). The node. This produces a multiple alignment of all input estimated probability a of observing amino acid i in position sequences, MSA1, at the root. x can be derived from f , typically by adding heuristic pseudo- counts or by using Bayesian methods such as Dirichlet mixture Stage 2, Improved progressive. The main source of error in the priors (14). MUSCLE uses a new pro®le function we call the draft progressive stage is the approximate kmer distance log-expectation (LE) score: measure, which results in a suboptimal tree. MUSCLE xy x y x y therefore re-estimates the tree using the Kimura distance, LE =(1± f )(1± f ) log S S f f p /p p 1 G G i j i j ij i j which is more accurate but requires an alignment. 2.1 The Kimura distance for each pair of input sequences is This is a modi®ed version of the log-average function (15): computed from MSA1, giving distance matrix D2. xy x y LA = log S S a a p /p p 2 2.2 Matrix D2 is clustered by UPGMA, producing binary i j i j ij i j tree TREE2. MUSCLE uses probabilities p and p derived from the i ij 2.3 A progressive alignment is produced following TREE2 240 PAM VTML matrix (16). Frequencies f are normalized to (similar to 1.3), producing multiple alignment MSA2. This is sum to 1 when indels are present (otherwise the logarithm optimized by computing alignments only for subtrees whose becomes increasingly negative with increasing numbers of branching orders changed relative to TREE1. gaps even when aligning conserved or similar residues). The factor (1 ± f ) is the occupancy of a column, introduced to Stage 3, Re®nement. encourage more highly occupied columns to align. Position- 3.1 An edge is chosen from TREE2 (edges are visited in speci®c gap penalties are used, employing heuristics similar to order of decreasing distance from the root). those found in MAFFT and LAGAN (17). 3.2 TREE2 is divided into two subtrees by deleting the edge. The pro®le of the multiple alignment in each subtree is Algorithm computed. The high-level ¯ow is depicted in Figure 2. 3.3 A new multiple alignment is produced by re-aligning the two pro®les. Stage 1, Draft progressive. The goal of the ®rst stage is to 3.4 If the SP score is improved, the new alignment is kept, produce a multiple alignment, emphasizing speed over otherwise it is discarded. accuracy. Steps 3.1±3.4 are repeated until convergence or until a user- 1.1 The kmer distance is computed for each pair of input de®ned limit is reached. This is a variant of tree-dependent sequences, giving distance matrix D1. restricted partitioning (18). 1.2 Matrix D1 is clustered by UPGMA, producing binary Complete multiple alignments are available at steps 1.3, 2.3 tree TREE1. and 3.4, at which points the algorithm may be terminated. We 1.3 A progressive alignment is constructed by following the refer to the ®rst two stages alone as MUSCLE-p, which branching order of TREE1. At each leaf, a pro®le is produces MSA2. MUSCLE-p has time complexity O(N L + 2 2 2 constructed from an input sequence. Nodes in the tree are NL ) and space complexity O(N + NL + L ). Re®nement adds visited in pre®x order (children before their parent). At each an O(N L) term to the time complexity. 1794 Nucleic Acids Research, 2004, Vol. 32, No. 5 Eidhammer et al. (31). With this in mind, we constructed a ASSESSMENT new test set, PREFAB (protein reference alignment bench- We assessed the performance of MUSCLE on four sets of mark) which exploits methodology (21,32,33), test data reference alignments: BAliBASE (19,20), SABmark (21), (13,34,35) and statistical methods (19) that have previously SMART (22±24) and a new benchmark, PREFAB. We been applied to alignment accuracy assessment. The protocol compared these with four other methods: CLUSTALW (25), is as follows. Two proteins are aligned by a structural method probably the most widely used program at the time of writing; that does not incorporate sequence similarity. Each sequence T-Coffee, which has the best BAliBASE score reported to is used to query a database, from which high-scoring hits are date; and two MAFFT scripts: FFTNS1, the fastest previously collected. The queries and their hits are combined and aligned published method known to the author (in which diagonal by a multiple sequence method. Accuracy is assessed on the ®nding by fast Fourier transform is enabled and a progressive original pair alone, by comparison with their structural alignment constructed), and NWNSI, the slowest but most alignment. Three test sets selected from the FSSP database accurate of the MAFFT methods (in which fast Fourier (36) were used as described in Sadreyev and Grishin (34) (data transform is disabled and re®nement is enabled). Tested kindly provided by Ruslan Sadreyev), and Edgar and versions were MUSCLE 3.2, CLUSTALW 1.82, T-Coffee Sjolander (13,35), which we call SG, PP1 and PP2, respect- 1.37 and MAFFT 3.82. We also evaluated MUSCLE-p, in ively. These three sets vary mainly in their selection criteria. which the re®nement stage is omitted. We also tried Align-m PP1 and PP2 contain pairs with sequence identity <30%. PP1 1.0 (21), but found in many cases that the program either was designed to select pairs that have high structural aborted or was impractically slow on the larger alignments similarity, requiring a z-score of >15 and a root mean square found in SMART and PREFAB. deviation (r.m.s.d.) of <2.5 A. PP2 selected more diverged pairs with a z-score of >8 and <12, and an r.m.s.d. of <3.5 A. SG contains pairs sampled from three ranges of sequence BAliBASE. We used version 2 of the BAliBASE benchmark, identity: 0±15, 15±30 and 30±97%, with no z-score or r.m.s.d. reference sets Ref 1±Ref 5. Other reference sets contain limits. We re-aligned each pair of structures using the CE repeats, inversions and transmembrane helices, for which none aligner (29), and retained only those pairs for which FSSP and of the tested algorithms is designed. CE agreed on 50 or more positions. This was designed to minimize questionable and ambiguous structural alignments SABmark. We used version 1.63 of the SABmark reference as done in SABmark and MaxBench (33). We used the full- alignments, which consists of two subsets: Superfamily and chain sequence of each structure to make a PSI-BLAST Twilight. All sequences have known structure. The Twilight (37,38) search of the NCBI non-redundant protein sequence set contains 1994 domains from the Astral database (26) with database (39), keeping locally aligned regions of hits with pairwise sequence similarity e-values <1, divided into 236 e-values below 0.01. Hits were ®ltered to 80% maximum folds according to the SCOP classi®cation (27). The identity (including the query), and 24 selected at random. Superfamily set contains sequences of pairwise identity Finally, each pair of structures and their remaining hits were <50%, divided into 462 SCOP superfamilies. Each pair of structures was aligned with two structural aligners: SOFI (28) combined to make sets of <50 sequences. The limit of 50 was and CE (29), producing a sequence alignment from the arbitrarily chosen to make the test tractable on a desktop consensus in which only high-con®dence regions are retained. computer for some of the more resource-intensive methods, in Input sets range from three to 25 sequences, with an average of particular T-Coffee (which needed 10 CPU days, as noted in eight and an average sequence length of 179. Table 4). The ®nal set, PREFAB version 3.0, has 1932 alignments averaging 49 sequences of length 240, of which SMART. SMART contains multiple alignments re®ned by 178 positions in the structure pair are found in the consensus experts, focusing primarily on signaling domains. While of FSSP and CE. structures were considered where known, sequence methods Accuracy measurement were also used to aid construction of the database, so SMART is not suitable as a de®nitive benchmark. However, conven- We used three accuracy measures: Q, TC and APDB. Q tional wisdom [e.g. Fischer et al. (30)] holds that machine- (quality) is the number of correctly aligned residue pairs assisted experts can produce superior alignments to automated divided by the number of residue pairs in the reference methods, so performance on this set is of interest for alignment. This has previously been termed the developer comparison. We used a version of SMART downloaded in score (32) and SPS (40). TC (total column score) is the number July 2000, before the ®rst version of MUSCLE was made of correctly aligned columns divided by the number of available; eliminating the possibility that MUSCLE was used columns in the reference alignment; this is Thompson et al.'s to aid construction. We discarded alignments of more than 100 CS and is equivalent to Q in the case of two sequences (as in sequences in order to make the test tractable for T-Coffee, PREFAB). APDB (41) is derived from structures alone; no leaving 267 alignments averaging 31 sequences of length 175. reference alignment of the sequences or structures is needed. For BAliBASE, we use Q and TC, measured only on core PREFAB. The methods used to create databases such as blocks as annotated in the database. For PREFAB, we use Q, BAliBASE and SMART are time-consuming and demand including only those positions on which CE and FSSP agree, signi®cant expertise, making a fully automated protocol and also APDB. For SMART, we use Q and TC computed for desirable. Perhaps the most obvious approach is to generate all columns. For SABmark, we average the Q score over each sequence alignments from automated alignments of multiple pair of sequences. TC score is not applicable to SABmark as structures, but this is fraught with dif®culties; see for example the reference alignments are pairwise. Nucleic Acids Research, 2004, Vol. 32, No. 5 1795 Statistical analysis assumptions are made about the population distribution. In particular, the Wilcoxon test assumes a symmetrical differ- Following Thompson et al. (19), statistical signi®cance is ence between two methods, but in practice we sometimes measured by a Friedman rank test (42), which is more observe a signi®cant skew. PREFAB and SABmark use conservative than the Wilcoxon test that has also been used automated structure alignment methods, which sometimes for alignment accuracy discrimination (5,7,8) as fewer produce questionable results. Many low-quality regions are eliminated by taking the consensus between two independent aligners, but some may remain. In PREFAB, assessment of a Table 1. BAliBASE scores and times multiple alignment is made on a single pair of sequences, Method Q TC CPU which may be more or less accurately aligned than the average over all pairs. In SABmark, the upper bound on Q is less than 1 MUSCLE 0.896 0.747 97 to a varying degree because the pairwise reference alignments MUSCLE-p 0.883 0.727 52 T-Coffee 0.882 0.731 1500 may not be mutually consistent. These effects can be viewed NWNSI 0.881 0.722 170 as introducing noise into the experiment, and a single accuracy CLUSTALW 0.860 0.690 170 measurement may be subject to error. However, as the FFTNS1 0.844 0.646 16 structural aligners do not use primary sequence, these errors Average Q and TC scores for each method on BAliBASE are shown, are unbiased with respect to sequence methods. A difference together with the total CPU time in seconds. Align-m aborted on two in accuracy between two sequence alignment methods can alignments; average scores on the remainder were Q = 0.852 and TC = therefore be established by the Friedman test, and the 0.670, requiring 2202 s. measured difference in average accuracy will be approxim- ately correct when measured over a suf®cient number of Table 2. BAliBASE Q scores on subsets samples. Method Ref1 Ref2 Ref3 Ref4 Ref5 MUSCLE 0.887 0.935 0.823 0.876 0.968 RESULTS MUSCLE-p 0.871 0.928 0.813 0.857 0.974 T-Coffee 0.866 0.934 0.787 0.917 0.957 Quality scores and CPU times are summarized in Tables 1±7; NWNSI 0.867 0.923 0.787 0.904 0.963 rankings and statistical signi®cance on PREFAB and CLUSTALW 0.861 0.932 0.751 0.823 0.859 BAliBASE for all pairs of methods are given in Table 8. On FFTNS1 0.838 0.908 0.708 0.793 0.947 all test sets and quality measures, MUSCLE achieves the The average Q score for each method on each BAliBASE subset is shown. highest ranking (in some cases jointly with other methods due Ref1 is the largest subset with 81 test sets, comprising almost 60% of the to lack of statistical signi®cance), and MUSCLE-p is database. Other subsets are smaller. For example, Ref4 and Ref5 have 12 statistically indistinguishable from T-Coffee and NWNSI. alignments each, and there are large variances in the individual scores from MUSCLE achieves the highest BAliBASE score reported to which the averages are computed. In our opinion, it is not possible to draw meaningful conclusions about the relative performance of different methods date, but the improvement of 1.6% in Q and 2.2% in TC over on these subsets. T-Coffee has low signi®cance (P = 0.15). A similar result is found on SABmark, where MUSCLE achieves a 1.5% Table 3. BAliBASE TC scores on subsets improvement over T-Coffee in Q with P = 0.14. The Q score on PREFAB is best able to distinguish between methods, Method Ref1 Ref2 Ref3 Ref4 Ref5 giving statistically signi®cant rankings to MUSCLE > MUSCLE 0.815 0.574 0.577 0.627 0.902 MUSCLE-p, MUSCLE > T-Coffee, MUSCLE > NWNSI MUSCLE-p 0.795 0.558 0.550 0.598 0.891 and MUSCLE-p > NWNSI. SMART also ranks MUSCLE T-Coffee 0.780 0.573 0.510 0.751 0.903 highest. SMART cannot be considered de®nitive due to the NWNSI 0.788 0.514 0.514 0.742 0.859 use of sequence methods in construction of the database, CLUSTALW 0.782 0.579 0.470 0.542 0.638 although any bias from this source is likely to favor methods FFTNS1 0.732 0.496 0.350 0.451 0.831 that were available to the SMART developers (i.e. to be The average TC score for each method on each BAliBASE subset is shown. against MUSCLE). The SMART results could be interpreted Table 4. Q scores and times on PREFAB Method All 0±20% 20±40% 40±70% 70±100% CPU MUSCLE 0.645 0.473 0.813 0.937 0.980 1.7 3 10 MUSCLE-p 0.634 0.460 0.802 0.942 0.985 2.0 3 10 T-Coffee 0.615 0.464 0.795 0.935 0.976 1.0 3 10 NWNSI 0.615 0.448 0.772 0.930 0.939 1.4 3 10 FFTNS1 0.591 0.423 0.756 0.931 0.938 1.0 3 10 CLUSTALW 0.563 0.382 0.732 0.916 0.930 3.3 3 10 The average Q score for each method over all PREFAB alignments (All), and the total CPU time in seconds are given. The remaining columns show average Q scores on subsets in which the structure pairs fall within the given pairwise identity ranges. Note that T-Coffee required 10 CPU days to complete the test, compared with <5 h for MUSCLE and ~30 min for MUSCLE-p. 1796 Nucleic Acids Research, 2004, Vol. 32, No. 5 Table 5. APDB scores on PREFAB as suggesting that MUSCLE alignments are more consistent with re®nements made by human experts. The APDB score Method APDB appears to be relatively insensitive, showing no signi®cant NWNSI 62.0 improvement due to the re®nement stage of MUSCLE MUSCLE 61.9 (similarly for MAFFT; not shown), and is not able to T-Coffee 61.9 distinguish between the four highest scoring methods. We MUSCLE-p 61.4 speculate that the scatter observed in the correlation between FFTNS1 60.8 CLUSTALW 59.1 APDB and more conventional measures such as TC (40) injects suf®cient noise to obscure meaningful differences in The average APDB score of each method on the PREFAB reference accuracy that can be resolved using Q. The low rank of alignments is given. There is no statistically signi®cant difference between Align-m on SABmark differs from results quoted by Van the four best methods. The top four are signi®cantly better than FFTNS1 (MUSCLE-p > FFTNS1 with P = 0.009), and FFTNS1 is signi®cantly Walle et al. (21), who assessed pairwise alignments produced ±5 better than CLUSTALW (P =3 3 10 ). by an intermediate step in the algorithm, whereas we used the ®nal multiple alignment. Table 6. Q scores and CPU times on SABmark Resource requirements for large numbers of sequences Method All Superfamily Twilight CPU To investigate resource requirements for increasing number of MUSCLE 0.430 0.523 0.249 1886 sequences N, we used the Rose sequence generator (43) T-Coffee 0.424 0.519 0.237 5615 MUSCLE-p 0.416 0.511 0.230 304 (complete results not shown). In agreement with other studies, NWNSI 0.410 0.506 0.223 629 [e.g. Katoh et al. (8)], we found that T-Coffee was unable to CLUSTALW 0.404 0.498 0.220 206 align more than approximately 10 sequences of typical length FFSNT1 0.373 0.467 0.190 75 on a current desktop computer. CLUSTALW was able to align Align-m 0.348 0.445 0.172 8902 a few hundred sequences, with a practical limit around N =10 All gives the average Q score over all SABmark alignments, Superfamily where CPU time begins to scale approximately as N . The and Twilight are average Q scores on the two subsets. These are computed largest set had 5000 sequences of average length 350. ®rst by averaging Q for each pair in a single multiple alignment, then MUSCLE-p completed this test in 7 min, compared with averaging over multiple alignments. This corrects for the lack of 10 min for FFTNS1; we estimate that CLUSTALW would independence between pairs in a given multiple alignment. Align-m aborted in nine cases; quoted averages for this program are for completed need approximately 1 year. alignments. Selected P-values are: MUSCLE > T-Coffee P = 0.14, ±5 ±6 MUSCLE > MUSCLE-p P =4310 , MUSCLE > NWNSI P =6 3 10 , MUSCLE-p > NWNSI P = 0.03, T-Coffee > MUSCLE-p P = 0.1, T-Coffee DISCUSSION ±10 > Align-m P <10 . We have described a new multiple sequence alignment Table 7. Q and TC scores on SMART algorithm, MUSCLE, and presented evidence that it creates Method Q TC Signi®cance alignments with average accuracy comparable with or super- ior to the best current methods. It should be emphasized that MUSCLE 0.855 0.537 0.07 performance differences between the better methods emerge NWNSI 0.848 0.546 0.03 only when averaged over a large number of test cases, even MUSCLE-p 0.836 0.505 0.54 T-Coffee 0.835 0.503 0.16 when alignments are considered trustworthy. For example, on CLUSTALW 0.823 0.504 0.07 BAliBASE, the lowest scoring of the tested methods FFTNS1 0.817 (FFTNS1) achieved a higher Q than the highest scoring (MUSCLE) in 21 out of 141 alignments and tied in 19 more; The average Q and TC accuracy scores over the 267 reference alignments compared with T-Coffee, MUSCLE scored higher or tied in 95 in SMART that have no more than 100 sequences are given. The last column is the P-value of the difference between the method in a row and cases, but lower in 24. This suggests the use of multiple the method in the next row, measured on the Q score. The P-value for algorithms and careful inspection of the results. MUSCLE is MUSCLE > T-Coffee is 0.0004 on Q and 0.01 on TC; the P-value for NSI comparable in speed with CLUSTALW, completing a test set > T-Coffee is 0.19 on Q and 0.0002 on TC. The difference between (PREFAB) averaging 49 sequences of length 240 in about half MUSCLE and NWNSI is only weakly signi®cant on the Q score (P = 0.07) and is not signi®cant on the TC score (P = 0.3). the time. The progressive method MUSCLE-p, which has Table 8. Ranks and statistical signi®cance on BAliBASE and PREFAB MUSCLE MUSCLE-p T-Coffee NWNSI FFTNS1 CLUSTALW ±6 MUSCLE +0.001 (0.15) + 0.005 +2 3 10 +0.0002 ±6 MUSCLE-p ±6 3 10 (0.3) (0.7) +0.0002 +0.02 ±5 T-Coffee ±0.0002 (0.4) (0.55) +7 3 10 +0.01 ±10 ±10 ±7 NWNSI ±<10 ±<10 ±10 +0.0001 (0.06) ±10 ±10 ±10 ±10 FFTNS1 ±<10 ±<10 ±<10 ±<10 ±0.04 ±10 ±10 ±10 ±10 CLUSTALW ±<10 ±<10 ±<10 ±<10 ±0.008 Each entry in the table contains the P-value assigned by a Friedman rank test to the difference between a pair of methods. The upper-right corner of the matrix is obtained from Q scores on BAliBASE, the lower-left corner from Q scores on PREFAB. If the method to the left is ranked higher than the method above, the P-value is preceded by +. If the method to the left is ranked lower, the P-value is preceded by ±. If the P-value is >0.05, the difference is not considered signi®cant and is shown in parentheses. So, for example, MUSCLE ranks higher than T-Coffee on PREFAB with P = 0.0002 and MUSCLE-p higher than CLUSTALW on BAliBASE with P = 0.02. Nucleic Acids Research, 2004, Vol. 32, No. 5 1797 22. Schultz,J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) SMART, a average accuracy statistically indistinguishable from T-Coffee simple modular architecture research tool: identi®cation of signaling and the most accurate MAFFT script, is the fastest algorithm domains. Proc. Natl Acad. Sci. USA, 95, 5857±5864. known to the author for large numbers of sequences, able to 23. Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) align 5000 sequences of average length 350 in 7 min on a SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231±234. current desktop computer. The MUSCLE software, source 24. Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) SMART: code and test data are freely available at: http://www.drive5. identi®cation and annotation of domains from signalling and com/muscle. extracellular protein sequences. Nucleic Acids Res., 27, 229±332. 25. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment REFERENCES through sequence weighting, position-speci®c gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673±4680. 1. Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence 26. Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium alignment. J. Comput. Biol., 1, 337±348. for protein structure and sequence analysis. Nucleic Acids Res., 28, 2. Waterman,M.S., Smith,T.F. and Beyer,W.A. (1976) Some biological 254±256. sequence metrics. Adv. Math., 20, 367±387. 27. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a 3. Hogeweg,P. and Hesper,B. (1984) The alignment of sets of sequences structural classi®cation of proteins database for the investigation of and the construction of phyletic trees: an integrated method. J. Mol. sequences and structures. J. Mol. Biol., 247, 536±540. Evol., 20, 175±186. 28. Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and 4. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a Wodak,S.J. (1995) Optimal protein structure alignments by multiple prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351±360. linkage clustering: application to distantly related proteins. Protein Eng., 5. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel 8, 647±662. method for fast and accurate multiple sequence alignment. J. Mol. Biol., 29. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by 302, 205±217. incremental combinatorial extension (CE) of the optimal path. Protein 6. Notredame,C. (2002) Recent progress in multiple sequence alignment: a Eng., 11, 739±747. survey. Pharmacogenomics, 3, 131±144. 30. Fischer,D., Barret,C., Bryson,K., Elofsson,A., Godzik,A., Jones,D., 7. Gotoh,O. (1996) Signi®cant improvement in accuracy of multiple protein Karplus,K.J., Kelley,L.A., MacCallum,R.M., Pawowski,K., Rost,B., sequence alignments by iterative re®nement as assessed by reference to Rychlewski,L. and Sternberg,M. (1999) CAFASP-1: critical assessment structural alignments. J. Mol. Biol., 264, 823±838. of fully automated structure prediction methods. Proteins, Suppl. 3, 8. Katoh,K., Misawa,K., Kuma,K. and Miyata,T. (2002) MAFFT: a novel 209±217. method for rapid multiple sequence alignment based on fast Fourier 31. Eidhammer,I., Jonassen,I. and Taylor,W.R. (2000) Structure comparison transform. Nucleic Acids Res., 30, 3059±3066. and structure patterns. J. Comput. Biol., 7, 685±716. 9. Edgar,R.C. (2004) Local homology recognition and distance measures in 32. Sauder,J.M., Arthur,J.W. and Dunbrack,R.L.,Jr (2000) Large-scale linear time using compressed amino acid alphabets. Nucleic Acids Res., comparison of protein sequence alignment algorithms with structure 32, 380±385. alignments. Proteins, 40, 6±22. 10. Kimura,M. (1983) The Neutral Theory of Molecular Evolution. 33. Leplae,R. and Hubbard,T.J. (2002) MaxBench: evaluation of sequence Cambridge University Press. and structure comparison methods. Bioinformatics, 18, 494±495. 11. Sneath,P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Freeman, 34. Sadreyev,R. and Grishin,N. (2003) COMPASS: a tool for comparison of San Francisco. multiple protein alignments with assessment of statistical signi®cance. 12. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new J. Mol. Biol., 326, 317±336. method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 35. Edgar,R.C. and Sjolander,K. (2004) COACH: pro®le-pro®le alignment 406±425. of protein families using hidden Markov models. Bioinformatics, DOI: 13. Edgar,R.C. and Sjolander,K. (2004) A comparison of scoring functions 10.1093/bioinformatics/bth091. for protein sequence pro®le alignment. Bioinformatics, DOI: 10.1093/ 36. Holm,L. and Sander,C. (1998) Touring protein fold space with Dali/ bioinformatics/bth090. FSSP. Nucleic Acids Res., 26, 316±319. 14. Sjolander,K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Mian,I.S. 37. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., and Haussler,D. (1996) Dirichlet mixtures: a method for improved Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a detection of weak but signi®cant protein sequence homology. CABIOS, new generation of protein database search programs. Nucleic Acids Res., 12, 327±345. 25, 3389±3402. 15. von Ohsen,N. and Zimmer,R. (2001) Improving pro®le±pro®le alignment 38. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., via log average scoring. In Gascuel,O. and Moret,B.M.E. (eds), Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the Algorithms in Bioinformatics, First International Workshop, WABI 2001. accuracy of PSI-BLAST protein database searches with composition- Springer-Verlag, Berlin, Germany, pp. 11±26. based statistics and other re®nements. Nucleic Acids Res., 29, 16. Muller,T., Spang,R. and Vingron,M. (2002) Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent 2994±3005. approach and a maximum likelihood method. Mol. Biol. Evol., 19, 8±13. 39. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2003) NCBI Reference 17. Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sequence project: update and current status. Nucleic Acids Res., 31, Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: ef®cient 34±37. tools for large-scale multiple alignment of genomic DNA. Genome Res., 40. Thompson,J.D., Plewniak,F. and Poch,O. (1999b) A comprehensive 13, 721±731. comparison of multiple sequence alignment programs. Nucleic Acids 18. Hirosawa,M., Totoki,Y., Hoshida,M. and Ishikawa,M. (1995) Res., 27, 2682±2690. Comprehensive study on iterative algorithms of multiple sequence 41. O'Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and alignment. CABIOS, 11, 13±18. Notredame,C. (2003) APDB: a novel measure for benchmarking 19. Thompson,J.D., Plewniak,F. and Poch,O. (1999a) BAliBASE: a sequence alignment methods without reference alignments. benchmark alignment database for the evaluation of multiple alignment Bioinformatics, 19 Suppl. 1, I215±I221. programs. Bioinformatics, 15, 87±88. 42. Friedman,M. (1937) The use of ranks to avoid the assumption of 20. Bahr,A., Thompson,J.D., Thierry,J.C. and Poch,O. (2001) BAliBASE normality implicit in the analysis of variance. J. Am. Stat. Assoc., 32, (Benchmark Alignment dataBASE): enhancements for repeats, 675±701. transmembrane sequences and circular permutations. Nucleic Acids Res., 43. Stoye,J., Evers,D. and Meyer,F. (1998) Rose: generating sequence 29, 323±326. families. Bioinformatics, 14, 157±163. 21. Van Walle,I., Lasters,I. and Wyns,L. (2004) Align-mÐa new algorithm 44. Sjolander,K. (1998) Phylogenetic inference in protein superfamilies: for multiple alignment of highly divergent sequences. Bioinformatics, analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, DOI: 10.1093/bioinformatics/bth116. 165±174. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nucleic Acids Research Oxford University Press

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Nucleic Acids Research , Volume 32 (5) – Mar 1, 2004

Loading next page...
 
/lp/oxford-university-press/muscle-multiple-sequence-alignment-with-high-accuracy-and-high-LQgB6009Um

References (44)

Publisher
Oxford University Press
Copyright
Oxford University Press
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gkh340
pmid
15034147
Publisher site
See Article on Publisher Site

Abstract

Published online March 19, 2004 1792±1797 Nucleic Acids Research, 2004, Vol. 32, No. 5 DOI: 10.1093/nar/gkh340 MUSCLE: multiple sequence alignment with high accuracy and high throughput Robert C. Edgar* 195 Roque Moraes Drive, Mill Valley, CA 94941, USA Received January 19, 2004; Revised January 30, 2004; Accepted February 24, 2004 ABSTRACT variant on this strategy is used by T-Coffee (5), which aligns pro®les by optimizing a score derived from local and global We describe MUSCLE, a new computer program for alignments of all pairs of input sequences. Misalignments by creating multiple alignments of protein sequences. progressive methods are sometimes readily apparent (Fig. 1), Elements of the algorithm include fast distance motivating further processing (re®nement). For a recent estimation using kmer counting, progressive align- review of multiple alignment methods, see Notredame (6). ment using a new pro®le function we call the log- Here we describe MUSCLE (multiple sequence comparison by log-expectation), a new computer program for multiple expectation score, and re®nement using tree- protein sequence alignment. dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of refer- MUSCLE ALGORITHM ence alignments: BAliBASE, SABmark, SMART and Here we give an overview of the algorithm; a more detailed a new benchmark, PREFAB. MUSCLE achieves the discussion is given in Edgar (submitted). Following guide tree highest, or joint highest, rank in accuracy on each construction, the fundamental step is pairwise pro®le align- of these sets. Without re®nement, MUSCLE ment, which is used ®rst for progressive alignment and then achieves average accuracy statistically indistin- for re®nement. This is similar to the strategies used by PRRP guishable from T-Coffee and MAFFT, and is the (7) and MAFFT (8). fastest of the tested methods for large numbers of Distance measures and guide tree estimation sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. MUSCLE uses two distance measures for a pair of sequences: The MUSCLE program, source code and PREFAB a kmer distance (for an unaligned pair) and the Kimura test data are freely available at http://www.drive5. distance (for an aligned pair). A kmer is a contiguous subsequence of length k, also known as a word or k-tuple. com/muscle. Related sequences tend to have more kmers in common than expected by chance. The kmer distance is derived from the fraction of kmers in common in a compressed alphabet, which INTRODUCTION we have previously shown to correlate well with fractional Multiple alignments of protein sequences are important in identity (9). This measure does not require an alignment, many applications, including phylogenetic tree estimation, giving a signi®cant speed advantage. Given an aligned pair of structure prediction and critical residue identi®cation. The sequences, we compute the pairwise identity and convert to an most natural formulation of the computational problem is to additive distance estimate, applying the Kimura correction for de®ne a model of sequence evolution that assigns probabilities multiple substitutions at a single site (10). Distance matrices to elementary sequence edits and seeks a most probable are clustered using UPGMA (11), which we ®nd to give directed graph in which edges represent edits and terminal slightly improved results over neighbor-joining (12), despite nodes are the observed sequences. No tractable method for the expectation that neighbor-joining will give a more reliable ®nding such a graph is known. A heuristic alternative is to estimate of the evolutionary tree. This can be explained by seek a multiple alignment that optimizes the sum of pairs (SP) assuming that in progressive alignment, the best accuracy score, i.e. the sum of pairwise alignment scores. Optimizing is obtained at each node by aligning the two pro®les that the SP score is NP complete (1) and can be achieved by have fewest differences, even if they are not evolutionary dynamic programming with time and space complexity O(L ) neighbors. in the sequence length L and number of sequences N (2). A Pro®le alignment more popular strategy is the progressive method (3,4), which ®rst estimates a tree and then constructs a pairwise alignment In order to apply pairwise alignment to pro®les, a scoring of the subtrees found at each internal node. A subtree is function must be de®ned on an aligned pair of pro®le represented by its pro®le, a multiple alignment treated as a positions, i.e. a pair of multiple alignment columns [see, for sequence by regarding each column as an alignable symbol. A example Edgar and Sjolander (13)]. Let i and j be amino acid *Email: bob@drive5.com Nucleic Acids Research, Vol. 32 No. 5 ã Oxford University Press 2004; all rights reserved Nucleic Acids Research, 2004, Vol. 32, No. 5 1793 Figure 1. Motifs misaligned by a progressive method. A set of 41 sequences containing SH2 domains (44) were aligned by the progressive method T-Coffee (above), and by MUSCLE (below). The N-terminal region of a subset of ®ve sequences is shown. The highlighted columns (upper case) are conserved within this family but are misaligned by T-Coffee. It should be noted that T-Coffee aligns these motifs correctly when given these ®ve sequences alone; the problem arises in the context of the other sequences. Figure 2. This diagram summarizes the ¯ow of the MUSCLE algorithm. Complete alignments are available at http://www.drive5.com/muscle. There are three main stages: Stage 1 (draft progressive), Stage 2 (improved progressive) and Stage 3 (re®nement). A multiple alignment is available at the completion of each stage, at which point the algorithm may types, p the background probability of i, p the joint terminate. i ij probability of i and j being aligned to each other, f the observed frequency of i in column x of the ®rst pro®le, and f internal node, a pairwise alignment is constructed of the two the observed frequency of gaps in that column at position x in child pro®les, giving a new pro®le which is assigned to that the family (similarly for position y in the second pro®le). The node. This produces a multiple alignment of all input estimated probability a of observing amino acid i in position sequences, MSA1, at the root. x can be derived from f , typically by adding heuristic pseudo- counts or by using Bayesian methods such as Dirichlet mixture Stage 2, Improved progressive. The main source of error in the priors (14). MUSCLE uses a new pro®le function we call the draft progressive stage is the approximate kmer distance log-expectation (LE) score: measure, which results in a suboptimal tree. MUSCLE xy x y x y therefore re-estimates the tree using the Kimura distance, LE =(1± f )(1± f ) log S S f f p /p p 1 G G i j i j ij i j which is more accurate but requires an alignment. 2.1 The Kimura distance for each pair of input sequences is This is a modi®ed version of the log-average function (15): computed from MSA1, giving distance matrix D2. xy x y LA = log S S a a p /p p 2 2.2 Matrix D2 is clustered by UPGMA, producing binary i j i j ij i j tree TREE2. MUSCLE uses probabilities p and p derived from the i ij 2.3 A progressive alignment is produced following TREE2 240 PAM VTML matrix (16). Frequencies f are normalized to (similar to 1.3), producing multiple alignment MSA2. This is sum to 1 when indels are present (otherwise the logarithm optimized by computing alignments only for subtrees whose becomes increasingly negative with increasing numbers of branching orders changed relative to TREE1. gaps even when aligning conserved or similar residues). The factor (1 ± f ) is the occupancy of a column, introduced to Stage 3, Re®nement. encourage more highly occupied columns to align. Position- 3.1 An edge is chosen from TREE2 (edges are visited in speci®c gap penalties are used, employing heuristics similar to order of decreasing distance from the root). those found in MAFFT and LAGAN (17). 3.2 TREE2 is divided into two subtrees by deleting the edge. The pro®le of the multiple alignment in each subtree is Algorithm computed. The high-level ¯ow is depicted in Figure 2. 3.3 A new multiple alignment is produced by re-aligning the two pro®les. Stage 1, Draft progressive. The goal of the ®rst stage is to 3.4 If the SP score is improved, the new alignment is kept, produce a multiple alignment, emphasizing speed over otherwise it is discarded. accuracy. Steps 3.1±3.4 are repeated until convergence or until a user- 1.1 The kmer distance is computed for each pair of input de®ned limit is reached. This is a variant of tree-dependent sequences, giving distance matrix D1. restricted partitioning (18). 1.2 Matrix D1 is clustered by UPGMA, producing binary Complete multiple alignments are available at steps 1.3, 2.3 tree TREE1. and 3.4, at which points the algorithm may be terminated. We 1.3 A progressive alignment is constructed by following the refer to the ®rst two stages alone as MUSCLE-p, which branching order of TREE1. At each leaf, a pro®le is produces MSA2. MUSCLE-p has time complexity O(N L + 2 2 2 constructed from an input sequence. Nodes in the tree are NL ) and space complexity O(N + NL + L ). Re®nement adds visited in pre®x order (children before their parent). At each an O(N L) term to the time complexity. 1794 Nucleic Acids Research, 2004, Vol. 32, No. 5 Eidhammer et al. (31). With this in mind, we constructed a ASSESSMENT new test set, PREFAB (protein reference alignment bench- We assessed the performance of MUSCLE on four sets of mark) which exploits methodology (21,32,33), test data reference alignments: BAliBASE (19,20), SABmark (21), (13,34,35) and statistical methods (19) that have previously SMART (22±24) and a new benchmark, PREFAB. We been applied to alignment accuracy assessment. The protocol compared these with four other methods: CLUSTALW (25), is as follows. Two proteins are aligned by a structural method probably the most widely used program at the time of writing; that does not incorporate sequence similarity. Each sequence T-Coffee, which has the best BAliBASE score reported to is used to query a database, from which high-scoring hits are date; and two MAFFT scripts: FFTNS1, the fastest previously collected. The queries and their hits are combined and aligned published method known to the author (in which diagonal by a multiple sequence method. Accuracy is assessed on the ®nding by fast Fourier transform is enabled and a progressive original pair alone, by comparison with their structural alignment constructed), and NWNSI, the slowest but most alignment. Three test sets selected from the FSSP database accurate of the MAFFT methods (in which fast Fourier (36) were used as described in Sadreyev and Grishin (34) (data transform is disabled and re®nement is enabled). Tested kindly provided by Ruslan Sadreyev), and Edgar and versions were MUSCLE 3.2, CLUSTALW 1.82, T-Coffee Sjolander (13,35), which we call SG, PP1 and PP2, respect- 1.37 and MAFFT 3.82. We also evaluated MUSCLE-p, in ively. These three sets vary mainly in their selection criteria. which the re®nement stage is omitted. We also tried Align-m PP1 and PP2 contain pairs with sequence identity <30%. PP1 1.0 (21), but found in many cases that the program either was designed to select pairs that have high structural aborted or was impractically slow on the larger alignments similarity, requiring a z-score of >15 and a root mean square found in SMART and PREFAB. deviation (r.m.s.d.) of <2.5 A. PP2 selected more diverged pairs with a z-score of >8 and <12, and an r.m.s.d. of <3.5 A. SG contains pairs sampled from three ranges of sequence BAliBASE. We used version 2 of the BAliBASE benchmark, identity: 0±15, 15±30 and 30±97%, with no z-score or r.m.s.d. reference sets Ref 1±Ref 5. Other reference sets contain limits. We re-aligned each pair of structures using the CE repeats, inversions and transmembrane helices, for which none aligner (29), and retained only those pairs for which FSSP and of the tested algorithms is designed. CE agreed on 50 or more positions. This was designed to minimize questionable and ambiguous structural alignments SABmark. We used version 1.63 of the SABmark reference as done in SABmark and MaxBench (33). We used the full- alignments, which consists of two subsets: Superfamily and chain sequence of each structure to make a PSI-BLAST Twilight. All sequences have known structure. The Twilight (37,38) search of the NCBI non-redundant protein sequence set contains 1994 domains from the Astral database (26) with database (39), keeping locally aligned regions of hits with pairwise sequence similarity e-values <1, divided into 236 e-values below 0.01. Hits were ®ltered to 80% maximum folds according to the SCOP classi®cation (27). The identity (including the query), and 24 selected at random. Superfamily set contains sequences of pairwise identity Finally, each pair of structures and their remaining hits were <50%, divided into 462 SCOP superfamilies. Each pair of structures was aligned with two structural aligners: SOFI (28) combined to make sets of <50 sequences. The limit of 50 was and CE (29), producing a sequence alignment from the arbitrarily chosen to make the test tractable on a desktop consensus in which only high-con®dence regions are retained. computer for some of the more resource-intensive methods, in Input sets range from three to 25 sequences, with an average of particular T-Coffee (which needed 10 CPU days, as noted in eight and an average sequence length of 179. Table 4). The ®nal set, PREFAB version 3.0, has 1932 alignments averaging 49 sequences of length 240, of which SMART. SMART contains multiple alignments re®ned by 178 positions in the structure pair are found in the consensus experts, focusing primarily on signaling domains. While of FSSP and CE. structures were considered where known, sequence methods Accuracy measurement were also used to aid construction of the database, so SMART is not suitable as a de®nitive benchmark. However, conven- We used three accuracy measures: Q, TC and APDB. Q tional wisdom [e.g. Fischer et al. (30)] holds that machine- (quality) is the number of correctly aligned residue pairs assisted experts can produce superior alignments to automated divided by the number of residue pairs in the reference methods, so performance on this set is of interest for alignment. This has previously been termed the developer comparison. We used a version of SMART downloaded in score (32) and SPS (40). TC (total column score) is the number July 2000, before the ®rst version of MUSCLE was made of correctly aligned columns divided by the number of available; eliminating the possibility that MUSCLE was used columns in the reference alignment; this is Thompson et al.'s to aid construction. We discarded alignments of more than 100 CS and is equivalent to Q in the case of two sequences (as in sequences in order to make the test tractable for T-Coffee, PREFAB). APDB (41) is derived from structures alone; no leaving 267 alignments averaging 31 sequences of length 175. reference alignment of the sequences or structures is needed. For BAliBASE, we use Q and TC, measured only on core PREFAB. The methods used to create databases such as blocks as annotated in the database. For PREFAB, we use Q, BAliBASE and SMART are time-consuming and demand including only those positions on which CE and FSSP agree, signi®cant expertise, making a fully automated protocol and also APDB. For SMART, we use Q and TC computed for desirable. Perhaps the most obvious approach is to generate all columns. For SABmark, we average the Q score over each sequence alignments from automated alignments of multiple pair of sequences. TC score is not applicable to SABmark as structures, but this is fraught with dif®culties; see for example the reference alignments are pairwise. Nucleic Acids Research, 2004, Vol. 32, No. 5 1795 Statistical analysis assumptions are made about the population distribution. In particular, the Wilcoxon test assumes a symmetrical differ- Following Thompson et al. (19), statistical signi®cance is ence between two methods, but in practice we sometimes measured by a Friedman rank test (42), which is more observe a signi®cant skew. PREFAB and SABmark use conservative than the Wilcoxon test that has also been used automated structure alignment methods, which sometimes for alignment accuracy discrimination (5,7,8) as fewer produce questionable results. Many low-quality regions are eliminated by taking the consensus between two independent aligners, but some may remain. In PREFAB, assessment of a Table 1. BAliBASE scores and times multiple alignment is made on a single pair of sequences, Method Q TC CPU which may be more or less accurately aligned than the average over all pairs. In SABmark, the upper bound on Q is less than 1 MUSCLE 0.896 0.747 97 to a varying degree because the pairwise reference alignments MUSCLE-p 0.883 0.727 52 T-Coffee 0.882 0.731 1500 may not be mutually consistent. These effects can be viewed NWNSI 0.881 0.722 170 as introducing noise into the experiment, and a single accuracy CLUSTALW 0.860 0.690 170 measurement may be subject to error. However, as the FFTNS1 0.844 0.646 16 structural aligners do not use primary sequence, these errors Average Q and TC scores for each method on BAliBASE are shown, are unbiased with respect to sequence methods. A difference together with the total CPU time in seconds. Align-m aborted on two in accuracy between two sequence alignment methods can alignments; average scores on the remainder were Q = 0.852 and TC = therefore be established by the Friedman test, and the 0.670, requiring 2202 s. measured difference in average accuracy will be approxim- ately correct when measured over a suf®cient number of Table 2. BAliBASE Q scores on subsets samples. Method Ref1 Ref2 Ref3 Ref4 Ref5 MUSCLE 0.887 0.935 0.823 0.876 0.968 RESULTS MUSCLE-p 0.871 0.928 0.813 0.857 0.974 T-Coffee 0.866 0.934 0.787 0.917 0.957 Quality scores and CPU times are summarized in Tables 1±7; NWNSI 0.867 0.923 0.787 0.904 0.963 rankings and statistical signi®cance on PREFAB and CLUSTALW 0.861 0.932 0.751 0.823 0.859 BAliBASE for all pairs of methods are given in Table 8. On FFTNS1 0.838 0.908 0.708 0.793 0.947 all test sets and quality measures, MUSCLE achieves the The average Q score for each method on each BAliBASE subset is shown. highest ranking (in some cases jointly with other methods due Ref1 is the largest subset with 81 test sets, comprising almost 60% of the to lack of statistical signi®cance), and MUSCLE-p is database. Other subsets are smaller. For example, Ref4 and Ref5 have 12 statistically indistinguishable from T-Coffee and NWNSI. alignments each, and there are large variances in the individual scores from MUSCLE achieves the highest BAliBASE score reported to which the averages are computed. In our opinion, it is not possible to draw meaningful conclusions about the relative performance of different methods date, but the improvement of 1.6% in Q and 2.2% in TC over on these subsets. T-Coffee has low signi®cance (P = 0.15). A similar result is found on SABmark, where MUSCLE achieves a 1.5% Table 3. BAliBASE TC scores on subsets improvement over T-Coffee in Q with P = 0.14. The Q score on PREFAB is best able to distinguish between methods, Method Ref1 Ref2 Ref3 Ref4 Ref5 giving statistically signi®cant rankings to MUSCLE > MUSCLE 0.815 0.574 0.577 0.627 0.902 MUSCLE-p, MUSCLE > T-Coffee, MUSCLE > NWNSI MUSCLE-p 0.795 0.558 0.550 0.598 0.891 and MUSCLE-p > NWNSI. SMART also ranks MUSCLE T-Coffee 0.780 0.573 0.510 0.751 0.903 highest. SMART cannot be considered de®nitive due to the NWNSI 0.788 0.514 0.514 0.742 0.859 use of sequence methods in construction of the database, CLUSTALW 0.782 0.579 0.470 0.542 0.638 although any bias from this source is likely to favor methods FFTNS1 0.732 0.496 0.350 0.451 0.831 that were available to the SMART developers (i.e. to be The average TC score for each method on each BAliBASE subset is shown. against MUSCLE). The SMART results could be interpreted Table 4. Q scores and times on PREFAB Method All 0±20% 20±40% 40±70% 70±100% CPU MUSCLE 0.645 0.473 0.813 0.937 0.980 1.7 3 10 MUSCLE-p 0.634 0.460 0.802 0.942 0.985 2.0 3 10 T-Coffee 0.615 0.464 0.795 0.935 0.976 1.0 3 10 NWNSI 0.615 0.448 0.772 0.930 0.939 1.4 3 10 FFTNS1 0.591 0.423 0.756 0.931 0.938 1.0 3 10 CLUSTALW 0.563 0.382 0.732 0.916 0.930 3.3 3 10 The average Q score for each method over all PREFAB alignments (All), and the total CPU time in seconds are given. The remaining columns show average Q scores on subsets in which the structure pairs fall within the given pairwise identity ranges. Note that T-Coffee required 10 CPU days to complete the test, compared with <5 h for MUSCLE and ~30 min for MUSCLE-p. 1796 Nucleic Acids Research, 2004, Vol. 32, No. 5 Table 5. APDB scores on PREFAB as suggesting that MUSCLE alignments are more consistent with re®nements made by human experts. The APDB score Method APDB appears to be relatively insensitive, showing no signi®cant NWNSI 62.0 improvement due to the re®nement stage of MUSCLE MUSCLE 61.9 (similarly for MAFFT; not shown), and is not able to T-Coffee 61.9 distinguish between the four highest scoring methods. We MUSCLE-p 61.4 speculate that the scatter observed in the correlation between FFTNS1 60.8 CLUSTALW 59.1 APDB and more conventional measures such as TC (40) injects suf®cient noise to obscure meaningful differences in The average APDB score of each method on the PREFAB reference accuracy that can be resolved using Q. The low rank of alignments is given. There is no statistically signi®cant difference between Align-m on SABmark differs from results quoted by Van the four best methods. The top four are signi®cantly better than FFTNS1 (MUSCLE-p > FFTNS1 with P = 0.009), and FFTNS1 is signi®cantly Walle et al. (21), who assessed pairwise alignments produced ±5 better than CLUSTALW (P =3 3 10 ). by an intermediate step in the algorithm, whereas we used the ®nal multiple alignment. Table 6. Q scores and CPU times on SABmark Resource requirements for large numbers of sequences Method All Superfamily Twilight CPU To investigate resource requirements for increasing number of MUSCLE 0.430 0.523 0.249 1886 sequences N, we used the Rose sequence generator (43) T-Coffee 0.424 0.519 0.237 5615 MUSCLE-p 0.416 0.511 0.230 304 (complete results not shown). In agreement with other studies, NWNSI 0.410 0.506 0.223 629 [e.g. Katoh et al. (8)], we found that T-Coffee was unable to CLUSTALW 0.404 0.498 0.220 206 align more than approximately 10 sequences of typical length FFSNT1 0.373 0.467 0.190 75 on a current desktop computer. CLUSTALW was able to align Align-m 0.348 0.445 0.172 8902 a few hundred sequences, with a practical limit around N =10 All gives the average Q score over all SABmark alignments, Superfamily where CPU time begins to scale approximately as N . The and Twilight are average Q scores on the two subsets. These are computed largest set had 5000 sequences of average length 350. ®rst by averaging Q for each pair in a single multiple alignment, then MUSCLE-p completed this test in 7 min, compared with averaging over multiple alignments. This corrects for the lack of 10 min for FFTNS1; we estimate that CLUSTALW would independence between pairs in a given multiple alignment. Align-m aborted in nine cases; quoted averages for this program are for completed need approximately 1 year. alignments. Selected P-values are: MUSCLE > T-Coffee P = 0.14, ±5 ±6 MUSCLE > MUSCLE-p P =4310 , MUSCLE > NWNSI P =6 3 10 , MUSCLE-p > NWNSI P = 0.03, T-Coffee > MUSCLE-p P = 0.1, T-Coffee DISCUSSION ±10 > Align-m P <10 . We have described a new multiple sequence alignment Table 7. Q and TC scores on SMART algorithm, MUSCLE, and presented evidence that it creates Method Q TC Signi®cance alignments with average accuracy comparable with or super- ior to the best current methods. It should be emphasized that MUSCLE 0.855 0.537 0.07 performance differences between the better methods emerge NWNSI 0.848 0.546 0.03 only when averaged over a large number of test cases, even MUSCLE-p 0.836 0.505 0.54 T-Coffee 0.835 0.503 0.16 when alignments are considered trustworthy. For example, on CLUSTALW 0.823 0.504 0.07 BAliBASE, the lowest scoring of the tested methods FFTNS1 0.817 (FFTNS1) achieved a higher Q than the highest scoring (MUSCLE) in 21 out of 141 alignments and tied in 19 more; The average Q and TC accuracy scores over the 267 reference alignments compared with T-Coffee, MUSCLE scored higher or tied in 95 in SMART that have no more than 100 sequences are given. The last column is the P-value of the difference between the method in a row and cases, but lower in 24. This suggests the use of multiple the method in the next row, measured on the Q score. The P-value for algorithms and careful inspection of the results. MUSCLE is MUSCLE > T-Coffee is 0.0004 on Q and 0.01 on TC; the P-value for NSI comparable in speed with CLUSTALW, completing a test set > T-Coffee is 0.19 on Q and 0.0002 on TC. The difference between (PREFAB) averaging 49 sequences of length 240 in about half MUSCLE and NWNSI is only weakly signi®cant on the Q score (P = 0.07) and is not signi®cant on the TC score (P = 0.3). the time. The progressive method MUSCLE-p, which has Table 8. Ranks and statistical signi®cance on BAliBASE and PREFAB MUSCLE MUSCLE-p T-Coffee NWNSI FFTNS1 CLUSTALW ±6 MUSCLE +0.001 (0.15) + 0.005 +2 3 10 +0.0002 ±6 MUSCLE-p ±6 3 10 (0.3) (0.7) +0.0002 +0.02 ±5 T-Coffee ±0.0002 (0.4) (0.55) +7 3 10 +0.01 ±10 ±10 ±7 NWNSI ±<10 ±<10 ±10 +0.0001 (0.06) ±10 ±10 ±10 ±10 FFTNS1 ±<10 ±<10 ±<10 ±<10 ±0.04 ±10 ±10 ±10 ±10 CLUSTALW ±<10 ±<10 ±<10 ±<10 ±0.008 Each entry in the table contains the P-value assigned by a Friedman rank test to the difference between a pair of methods. The upper-right corner of the matrix is obtained from Q scores on BAliBASE, the lower-left corner from Q scores on PREFAB. If the method to the left is ranked higher than the method above, the P-value is preceded by +. If the method to the left is ranked lower, the P-value is preceded by ±. If the P-value is >0.05, the difference is not considered signi®cant and is shown in parentheses. So, for example, MUSCLE ranks higher than T-Coffee on PREFAB with P = 0.0002 and MUSCLE-p higher than CLUSTALW on BAliBASE with P = 0.02. Nucleic Acids Research, 2004, Vol. 32, No. 5 1797 22. Schultz,J., Milpetz,F., Bork,P. and Ponting,C.P. (1998) SMART, a average accuracy statistically indistinguishable from T-Coffee simple modular architecture research tool: identi®cation of signaling and the most accurate MAFFT script, is the fastest algorithm domains. Proc. Natl Acad. Sci. USA, 95, 5857±5864. known to the author for large numbers of sequences, able to 23. Schultz,J., Copley,R.R., Doerks,T., Ponting,C.P. and Bork,P. (2000) align 5000 sequences of average length 350 in 7 min on a SMART: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res., 28, 231±234. current desktop computer. The MUSCLE software, source 24. Ponting,C.P., Schultz,J., Milpetz,F. and Bork,P. (1999) SMART: code and test data are freely available at: http://www.drive5. identi®cation and annotation of domains from signalling and com/muscle. extracellular protein sequences. Nucleic Acids Res., 27, 229±332. 25. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment REFERENCES through sequence weighting, position-speci®c gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673±4680. 1. Wang,L. and Jiang,T. (1994) On the complexity of multiple sequence 26. Brenner,S.E., Koehl,P. and Levitt,M. (2000) The ASTRAL compendium alignment. J. Comput. Biol., 1, 337±348. for protein structure and sequence analysis. Nucleic Acids Res., 28, 2. Waterman,M.S., Smith,T.F. and Beyer,W.A. (1976) Some biological 254±256. sequence metrics. Adv. Math., 20, 367±387. 27. Murzin,A.G., Brenner,S.E., Hubbard,T. and Chothia,C. (1995) SCOP: a 3. Hogeweg,P. and Hesper,B. (1984) The alignment of sets of sequences structural classi®cation of proteins database for the investigation of and the construction of phyletic trees: an integrated method. J. Mol. sequences and structures. J. Mol. Biol., 247, 536±540. Evol., 20, 175±186. 28. Boutonnet,N.S., Rooman,M.J., Ochagavia,M.E., Richelle,J. and 4. Feng,D.F. and Doolittle,R.F. (1987) Progressive sequence alignment as a Wodak,S.J. (1995) Optimal protein structure alignments by multiple prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351±360. linkage clustering: application to distantly related proteins. Protein Eng., 5. Notredame,C., Higgins,D.G. and Heringa,J. (2000) T-Coffee: a novel 8, 647±662. method for fast and accurate multiple sequence alignment. J. Mol. Biol., 29. Shindyalov,I.N. and Bourne,P.E. (1998) Protein structure alignment by 302, 205±217. incremental combinatorial extension (CE) of the optimal path. Protein 6. Notredame,C. (2002) Recent progress in multiple sequence alignment: a Eng., 11, 739±747. survey. Pharmacogenomics, 3, 131±144. 30. Fischer,D., Barret,C., Bryson,K., Elofsson,A., Godzik,A., Jones,D., 7. Gotoh,O. (1996) Signi®cant improvement in accuracy of multiple protein Karplus,K.J., Kelley,L.A., MacCallum,R.M., Pawowski,K., Rost,B., sequence alignments by iterative re®nement as assessed by reference to Rychlewski,L. and Sternberg,M. (1999) CAFASP-1: critical assessment structural alignments. J. Mol. Biol., 264, 823±838. of fully automated structure prediction methods. Proteins, Suppl. 3, 8. Katoh,K., Misawa,K., Kuma,K. and Miyata,T. (2002) MAFFT: a novel 209±217. method for rapid multiple sequence alignment based on fast Fourier 31. Eidhammer,I., Jonassen,I. and Taylor,W.R. (2000) Structure comparison transform. Nucleic Acids Res., 30, 3059±3066. and structure patterns. J. Comput. Biol., 7, 685±716. 9. Edgar,R.C. (2004) Local homology recognition and distance measures in 32. Sauder,J.M., Arthur,J.W. and Dunbrack,R.L.,Jr (2000) Large-scale linear time using compressed amino acid alphabets. Nucleic Acids Res., comparison of protein sequence alignment algorithms with structure 32, 380±385. alignments. Proteins, 40, 6±22. 10. Kimura,M. (1983) The Neutral Theory of Molecular Evolution. 33. Leplae,R. and Hubbard,T.J. (2002) MaxBench: evaluation of sequence Cambridge University Press. and structure comparison methods. Bioinformatics, 18, 494±495. 11. Sneath,P.H.A. and Sokal,R.R. (1973) Numerical Taxonomy. Freeman, 34. Sadreyev,R. and Grishin,N. (2003) COMPASS: a tool for comparison of San Francisco. multiple protein alignments with assessment of statistical signi®cance. 12. Saitou,N. and Nei,M. (1987) The neighbor-joining method: a new J. Mol. Biol., 326, 317±336. method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 35. Edgar,R.C. and Sjolander,K. (2004) COACH: pro®le-pro®le alignment 406±425. of protein families using hidden Markov models. Bioinformatics, DOI: 13. Edgar,R.C. and Sjolander,K. (2004) A comparison of scoring functions 10.1093/bioinformatics/bth091. for protein sequence pro®le alignment. Bioinformatics, DOI: 10.1093/ 36. Holm,L. and Sander,C. (1998) Touring protein fold space with Dali/ bioinformatics/bth090. FSSP. Nucleic Acids Res., 26, 316±319. 14. Sjolander,K., Karplus,K., Brown,M., Hughey,R., Krogh,A., Mian,I.S. 37. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., and Haussler,D. (1996) Dirichlet mixtures: a method for improved Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a detection of weak but signi®cant protein sequence homology. CABIOS, new generation of protein database search programs. Nucleic Acids Res., 12, 327±345. 25, 3389±3402. 15. von Ohsen,N. and Zimmer,R. (2001) Improving pro®le±pro®le alignment 38. Schaffer,A.A., Aravind,L., Madden,T.L., Shavirin,S., Spouge,J.L., via log average scoring. In Gascuel,O. and Moret,B.M.E. (eds), Wolf,Y.I., Koonin,E.V. and Altschul,S.F. (2001) Improving the Algorithms in Bioinformatics, First International Workshop, WABI 2001. accuracy of PSI-BLAST protein database searches with composition- Springer-Verlag, Berlin, Germany, pp. 11±26. based statistics and other re®nements. Nucleic Acids Res., 29, 16. Muller,T., Spang,R. and Vingron,M. (2002) Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent 2994±3005. approach and a maximum likelihood method. Mol. Biol. Evol., 19, 8±13. 39. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2003) NCBI Reference 17. Brudno,M., Do,C.B., Cooper,G.M., Kim,M.F., Davydov,E., Green,E.D., Sequence project: update and current status. Nucleic Acids Res., 31, Sidow,A. and Batzoglou,S. (2003) LAGAN and Multi-LAGAN: ef®cient 34±37. tools for large-scale multiple alignment of genomic DNA. Genome Res., 40. Thompson,J.D., Plewniak,F. and Poch,O. (1999b) A comprehensive 13, 721±731. comparison of multiple sequence alignment programs. Nucleic Acids 18. Hirosawa,M., Totoki,Y., Hoshida,M. and Ishikawa,M. (1995) Res., 27, 2682±2690. Comprehensive study on iterative algorithms of multiple sequence 41. O'Sullivan,O., Zehnder,M., Higgins,D., Bucher,P., Grosdidier,A. and alignment. CABIOS, 11, 13±18. Notredame,C. (2003) APDB: a novel measure for benchmarking 19. Thompson,J.D., Plewniak,F. and Poch,O. (1999a) BAliBASE: a sequence alignment methods without reference alignments. benchmark alignment database for the evaluation of multiple alignment Bioinformatics, 19 Suppl. 1, I215±I221. programs. Bioinformatics, 15, 87±88. 42. Friedman,M. (1937) The use of ranks to avoid the assumption of 20. Bahr,A., Thompson,J.D., Thierry,J.C. and Poch,O. (2001) BAliBASE normality implicit in the analysis of variance. J. Am. Stat. Assoc., 32, (Benchmark Alignment dataBASE): enhancements for repeats, 675±701. transmembrane sequences and circular permutations. Nucleic Acids Res., 43. Stoye,J., Evers,D. and Meyer,F. (1998) Rose: generating sequence 29, 323±326. families. Bioinformatics, 14, 157±163. 21. Van Walle,I., Lasters,I. and Wyns,L. (2004) Align-mÐa new algorithm 44. Sjolander,K. (1998) Phylogenetic inference in protein superfamilies: for multiple alignment of highly divergent sequences. Bioinformatics, analysis of SH2 domains. Proc. Int. Conf. Intell. Syst. Mol. Biol., 6, DOI: 10.1093/bioinformatics/bth116. 165±174.

Journal

Nucleic Acids ResearchOxford University Press

Published: Mar 1, 2004

There are no references for this article.