The CLUSTAL_X Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by Quality Analysis Tools

The CLUSTAL_X Windows Interface: Flexible Strategies for Multiple Sequence Alignment Aided by... 4876–4882 Nucleic Acids Research, 1997, Vol. 25, No. 24  1997 Oxford University Press The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools Julie D. Thompson, Toby J. Gibson , Frédéric Plewniak, François Jeanmougin* and Desmond G. Higgins Institut de Genetique et de Biologie Moleculaire et Cellulaire (CNRS/INSERM/ULP), BP 163, 67404 Illkirch Cedex, 1 3 France, European Molecular Biology Laboratory, Postfach 10.2209, 69012 Heidelberg, Germany and Department of Biochemistry, University College, Cork, Ireland Received September 24, 1997; Revised and Accepted October 28, 1997 ABSTRACT sequences have <30% residue identity, this automatic method becomes less reliable. Any misaligned regions introduced in CLUSTAL X is a new windows interface for the previous stages of the progressive alignment are not corrected later widely-used progressive multiple sequence alignment as new information from other sequences is added. In such cases, program CLUSTAL W. The new system is easy to use, the automatic alignments need to be refined, either manually or providing an integrated system for performing multiple automatically. sequence and profile alignments and analysing the Numerous sequence editors have been developed which allow results. CLUSTAL X displays the sequence alignment in the user to display and manually make or modify an alignment a window on the screen. A versatile sequence colouring (eg. 3–7). These programs are useful for making small refinements scheme allows the user to highlight conserved features to an alignment, but the totally manual alignment of large numbers in the alignment. Pull-down menus provide all the of sequences is not feasible. 
Manual alignment is also highly options required for traditional multiple sequence and subjective hence is at least as likely as the automatic alignment profile alignment. New features include: the ability to process to result in errors in the alignment. cut-and-paste sequences to change the order of the The CLUSTAL X interface has been written to provide a single alignment, selection of a subset of the sequences to be environment in which the user can perform multiple alignments, realigned, and selection of a sub-range of the alignment view the results and, if necessary, refine and improve the to be realigned and inserted back into the original alignment. Tools for alignment quality analysis have been alignment. Alignment quality analysis can be performed developed which allow the user to highlight low-scoring regions and low-scoring segments or exceptional residues can in the alignment. Options are available for automatically be highlighted. Quality analysis and realignment of correcting these low-scoring regions by realigning a misaligned selected residue ranges provide the user with a sequence or a selected region of an alignment. powerful tool to improve and refine difficult alignments In earlier CLUSTAL programs (8–11), nested text menus and to trap errors in input sequences. CLUSTAL X has provided all the options to do multiple sequence/profile alignments been compiled on SUN Solaris, IRIX5.3 on Silicon and simple phylogenetic tree generation. The output alignments Graphics, Digital UNIX on DECstations, Microsoft were written to file for display, printing or further manipulation. Windows (32 bit) for PCs, Linux ELF for x86 PCs, and With these simple menus, the CLUSTAL programs could be highly Macintosh PowerMac. portable, and run on essentially all computers. Portability has been a major factor in the widespread usage of the CLUSTAL series for sequence alignment. 
On the other hand, much more attractive and INTRODUCTION powerful user interfaces can be built using non-portable windows The most widely used method in molecular biology to align sets of systems. nucleotide or amino acid sequences, is to build up a multiple The NCBI Software Development Toolkit (Version 1.8, National alignment progressively (1–2). The most closely related groups of Center for Biotechnology Information, Bethesda, MD) provides one sequences are aligned first and then these groups are gradually solution to the windows portability problem. It interfaces between aligned together, keeping the early alignments fixed. This approach the application code and various host windowing systems including works well when the sequences are sufficiently closely related. the X Window System, Macintosh windows and Microsoft However, a globally optimal solution (or a biologically significant Windows. We have made use of the toolkit to provide a portable one) cannot be guaranteed. In more difficult cases, where many windows interface, termed CLUSTAL X. *To whom correspondence should be addressed. Tel: +33 388 65 32 71; Fax: +33 388 65 32 01; Email: jeanmougin@igbmc.u-strasbg.fr 4877 Nucleic Acids Research, 1997, Vol. 25, No. 24 4877 Nucleic Acids Research, 1994, Vol. 22, No. 1 CLUSTAL X is a new graphical interface to the CLUSTAL W Highlighted residues may be expected to occur at a moderate program which displays the sequence alignment in a window on frequency in all the sequences because of their steady divergence the screen, allowing the user to move easily between different parts due to the natural processes of evolution, although the most of the alignment. Pull-down menus provide all the options familiar divergent sequences are likely to have the most outliers. However, to users of the text-menu-driven CLUSTAL W, plus several new the highlighted residues are especially useful in pointing to features. 
A versatile, configurable, colouring system is used to sequence misalignments. These can arise due to various reasons: highlight conserved residue features in the alignment. Options to (i) partial or total misalignments caused by a failure in the mark suspect regions and realign selected residue ranges give the alignment algorithm, (ii) partial or total misalignments because at user more information and control over the alignment process and least one of the sequences in the given set is partly or completely allow difficult alignments to be gradually built and refined. unrelated to the other sequences, (iii) frameshift translation errors in a protein sequence causing local mismatched regions to be heavily highlighted (see Discussion for more details). MATERIALS AND METHODS Occasionally, highlighted residues may point to regions of We make use of the NCBI VIBRANT (Virtual Interface for some biological significance. This might happen for example if Biological Research and Technology) development library which a protein alignment contains a sequence which has acquired new acts as an interface between the application code of CLUSTAL functions relative to the main sequence set. X and the host windowing system. The NCBI libraries are linked Quality scores. Suppose we have an alignment of M sequences of to the CLUSTAL X code providing mechanisms for displaying length N. Then, the alignment can be written as: windows, menus, buttons etc. In this way, the CLUSTAL X code A A A .......... A 1,1 1,2 1,3 1,N remains independent of the underlying operating system and A A A .......... A 2,1 2,2 2,3 2,N computer. The CLUSTAL X code is written in ANSI C, and should be portable to any machine capable of supporting the NCBI Vibrant toolkit. A A A .......... 
A M,1 M,2 M,3 M,N CLUSTAL X is available for a number of platforms including Suppose we also define a residue comparison matrix C of size R*R, SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on where R is the number of residues. C(a,b) is the score for aligning DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for residue a with residue b (all scores in this matrix are positive). x86 PCs, and Macintosh PowerMac. The source code is provided The problem is to calculate a score for the conservation of the jth for anyone wishing to port to any other platform supported by the position in the alignment. Vingron and Sibbald (12) used a Vibrant project. geometric analysis based on a continuous sequence space in order The source code for CLUSTAL X and several executable to compare sequence weighting methods. The method defines an versions for different machines are freely available by anonymous N-dimensional space, where N is the length of the alignment. Each ftp to ftp-igbmc.u-strasbg.fr. Hypertext documentation can be sequence can be placed in the space, and the distance between two viewed at www-igbmc.u-strasbg.fr/BioInfo/clustalx/. The NCBI sequences is defined as the euclidean distance between the Vibrant Toolkit is available by anonymous ftp from sequences in the space. We have applied an analogous approach to ncbi.nlm.nih.gov. each position in the alignment. An R-dimensional space is defined, in which each column of the alignment can be considered. For a Installation specified position j in the alignment, each sequence consists of a single residue which is assigned a point S in the space. For The CLUSTAL X program is easily installed by copying the sequence i, position j, the point S is defined as: executable file to a system directory which can be seen by all users. Several parameter files (named *.par), and an on-line help C(1, A ) ij text file (clustalx.hlp for MS Windows, otherwise clustalx_help) C(2, A ) ij are also required by CLUSTAL X. 
These files should be copied to one of the following directories: (i) the user’s current directory, . (ii) the user’s home directory, (iii) any of the directories specified C(R, A ) ij by the PATH environment variable. We then calculate a consensus value X for the jth position in the alignment. X is defined as: RESULTS Algorithms checking alignment quality F * C(i,1) i,j Three methods of alignment quality analysis are implemented. i1 (i) A ‘quality’ score is calculated for each column in the alignment, which depends on the amino acid variability in the column. A high R score indicates a highly conserved column, a low score indicating F * C(i,2) i,j a less well-conserved position. The scores are automatically plotted i1 in the window display (Fig. 1). (ii) The residues which get M exceptionally low scores in the above quality calculation can be highlighted in the alignment display (Fig. 1). (iii) Low-scoring F * C(i, R) i,j segments in each sequence of the alignment can be highlighted. i1 These are found by summing negative scores in the profile built from the sequence alignment (Figs 1 and 2). 4878 Nucleic Acids Research, 1997, Vol. 25, No. 24 Figure 1. The CLUSTAL X window in multiple alignment mode. An alignment of some EFTU proteins is displayed. Low-scoring segments are highlighted using a white character on a black background. Exceptional residues are shown as a white character on a grey background. The quality analysis reveals two anomalously low scoring regions, ruler positions 16–25 in EFTU_ODOSI and 61–71 in EFTU_MYCPN. These were found to be caused by frameshift errors. Two more sequences (EFTU_RICPR and EFTU_SPIPL), not shown here, have 4-residue sequencing errors in this region which CLUSTAL X will also highlight. where F is the count of residues i at position j in the alignment. in order of size from smallest to largest. 
We can find the upper and i,j Now, if S is the position of sequence i in the R-dimensional lower quartiles (the distances lying one-quarter of the way from space, we can calculate the distance D between each sequence the top and bottom of the array, respectively) and the inter quartile residue i and the consensus position X. range (the difference between the two quartiles). A residue A is considered as an exception in measure (ii) above ij if the sequence distance D is greater than (upper quartile + inter D  (X  S ) i r r quartile range × scaling factor). The scaling factor can be adjusted r1 by the user to select the proportion of residue exceptions that will be highlighted in the alignment display. Exceptional residues in where X is the rth dimension of position X, and S is the rth r r an EFTU alignment are shown in Figure 1. dimension of position S. This calculation runs automatically, in a very short time, each We define the quality score for the jth position in the alignment time the screen is updated. as the mean of the sequence distances D : Low-scoring segments. Given the above alignment of M sequences of length N and a residue exchange matrix, we can build a profile i1 Quality Score which is weighted for sequence divergence. Methods for calculat- ing sequence weights are discussed by Henikoff and Henikoff (13). Finally the scores are normalised by multiplying by the percen- Here we calculate sequence weights directly from a neighbour- tage of sequences which have residues (and not gaps) at this joining tree, using the ‘branch-proportional’ method which position. These scores are used in measure (i) above, as an corrects for unequal representation by downweighting similar estimate of the conservation of each alignment column (Fig. 1). sequences and upweighting divergent ones (14). Each sequence is Exceptional residues. It would be useful for each column in the assigned a weight W . 
In the residue comparison matrix C, the alignment to identify those sequences in the above calculations scores for common residue substitutions are positive while rarer which are found a long way from the consensus point (i.e., which substitutions are scored negatively. By default, the Gonnet PAM have a large distance D ), thus lowering the quality score for the 250 matrix (15) is used, but the user may supply a different matrix column. For the jth position in the alignment, we take the set of e.g. a lower PAM value is appropriate if the sequences are closely sequences which have a residue at this position (and not a gap). related. The profile P has a column of scores for each position in The distances D for this set of sequences are arranged in an array the alignment. The column is of height R and consists of a score i 4879 Nucleic Acids Research, 1997, Vol. 25, No. 24 4879 Nucleic Acids Research, 1994, Vol. 22, No. 1 Figure 2. Detection and correction of misaligned segments with CLUSTAL X. (A) A set of EFTUs, tested for low scoring regions, highlights a part of the EFTU_ECOLI sequence (which we deliberately misaligned by incorrect gap insertion). The range selected to be realigned is marked above the alignment. (B) After removal of gaps and realignment of the selected residue range, the sequence EFTU_ECOLI is now correctly aligned, and the erroneous gaps have been removed. The low-scoring segment check, the column conservation indicators above the alignment, and the quality graph below it, all reflect the improvement in the alignment. for each residue in the matrix C. The profile score for residue r at F  S if F  Sij  0 j–1 ij j–1 position j in the alignment is defined as: 0 if F  S  0 j–1 ijh 0 if j  0 C(r, A )* W ij i Having found the regions in the sequence which have negative F i1 P(r, j) scores, these regions are then refined by removing those positions at the end of each segment which have a positive profile score S . 
ij The F scores for these positions are reset to zero. i1 j Similarly, the backward phase can be described as: For the jth position in the ith sequence the score S is defined as: ij B  S if B  S  0 j1 ij j1 ij 0 if B  S  0 S  P(A , j) B  j1 ij ij ij j 0 if j  N  1 The low-scoring regions in the ith sequence are found by summing the scores S along the alignment in both the forward The regions in the sequence which have negative B scores are ij j and backward directions. If the sum is found to be positive, it is again refined by removing those positions at the beginning of reset to zero. The forward phase can be described by the following each segment which have a positive profile score S . The B ij j recurrence relations: scores for these positions are reset to zero. 4880 Nucleic Acids Research, 1997, Vol. 25, No. 24 The calculation is repeated for each sequence compared with DISCUSSION a profile for all aligned sequences, except itself. The low-scoring segments, defined as those positions for which both F and B are j j Ideally, methods for multiple sequence alignment should guarantee negative, are then highlighted in the display (Figs 1 and 2). to find the biologically correct alignment for a set of sequences. In The low-scoring segment calculation is done when the user practice, this is difficult to achieve. Firstly, it is difficult to define selects the ‘Calculate low scoring segments’ option. It takes only an optimal alignment between divergent nucleotide or protein a few seconds to perform unless the alignment is very large, sequences, even given tertiary structural information. Secondly, making it a practical tool for interactive use. methods that find an optimal multiple sequence alignment have been impractical to implement, mainly due to their computational Implementation cost. As computer performance improves, methods which iterate toward an optimal alignment are likely to become useful (16,17). 
CLUSTAL X displays a window on the screen, including a set of Meanwhile the heuristical approach of progressive alignment is pull-down menus. On-line help is available. The exact format of most often used, as the algorithm is reasonably fast and minimises the screen will depend on the host computer and the operating error in alignments of moderate difficulty. However, because the system. The user may select one of two modes: (i) multiple full information in the sequence set is not used to align each alignment mode which displays a single display area for multiple sequence, it can be possible to see one or more misaligned sequence alignment, or (ii) profile alignment mode which has two sequence segments in the output alignment. In such cases, the display areas, allowing the user to use previously aligned sequences would be expected to align correctly if the full sequences for alignment. Figure 1 shows a CLUSTAL X window information was used, or if alignment parameters such as gap in multiple alignment mode. penalties were adjusted. Alignments or individual sequences can be loaded into the When we developed CLUSTAL W, we gave the user the ability display areas displayed on the screen using the menu options. to iterate the alignment process by realigning an alignment, or by Scroll bars allow easy movement to different parts of the profile aligning sequences to an alignment. In this way, the user alignment. Extra lines are added to the sequence data displaying a could choose to iterate the alignment process, thereby overcoming ruler, an indicator of alignment conservation, plus any secondary some of the defects of progressive alignment. With CLUSTAL X, structure data which was found in the input alignment file. 
we have taken this capability further, by building in algorithms to The order of the sequences in the display can be changed by target the problem regions of an alignment and letting the user clicking on the sequence names, and selecting the cut and paste realign solely the suspect residue ranges. Using these tools, high options from the menus. In profile alignment mode, these options quality alignments of divergent sequence sets are produced more also allow the user to move sequences from one profile to the other. quickly and with greater confidence than has previously been The sequences can be saved at any time to an alignment file in possible by progressive alignment. one of a number of file formats. The sequence display can also be Many programs have been developed which allow, to a greater saved in a colour postscript file for printing on a postscript printer. or lesser degree, manual intervention in the automatic alignment process. For example, SOMAP (18) was designed to run under Colouring the alignment display. The sequences are automatically the DEC VMS operating system. The program allows the user to coloured to highlight conserved regions of the alignment. The manually build up a multiple sequence alignment. It can accept colours used, and the specification of the conservation of residues automatic alignments created by the original CLUSTAL program can be configured by the user. The ‘rules’ governing the colouring (8) to provide a starting-point for the manual editing process. of residues are read from a colour parameter file, which can be SEAVIEW (7) is a UNIX X Window-based multiple sequence loaded at any time. Two types of colouring ‘rules’ are defined. (i) A alignment editor which is interfaced to the CLUSTAL W residue can be assigned a specific colour regardless of its position program. SEQPUP (Don Gilbert, Biology Department, Indiana in the alignment. 
In this case, all occurrences of the residue will be University, Bloomington, IN 47405) is a sequence editor and coloured in the alignment display. (ii) A residue can be assigned analysis program which can launch external applications such as different colours depending on the consensus of the alignment at CLUSTAL W to perform sequence alignment. SEQLAB each position. [Wisconsin Package Version 9.0, Genetics Computer Group In this way, for example, conserved hydrophobic or hydrophilic (GCG), Madison, WI] is a graphical user interface based on the positions in the alignment can be highlighted (Fig. 1). OSF/Motif windowing system. It displays sequence alignments Realigning divergent regions. In difficult cases, with a family of on the screen and includes powerful sequence editing facilities. highly divergent sequences, it is possible that misalignments are The PILEUP program is interfaced to the SEQLAB editor to introduced during the multiple alignment process. CLUSTAL X perform automatic multiple sequence alignments. Numerous provides two simple mechanisms for realigning the most divergent Mac and PC alignment editors have also been developed. Most of regions. (i) Misaligned sequences may be selected by clicking on these editors will accept alignment output from CLUSTAL the sequence names. A single menu option then removes these programs. However, using CLUSTAL X, the amount of time sequences from the alignment set and realigns them to the remaining spent editing alignments by hand should be minimised, while a sequences. (ii) The second option allows the user to specify a range hand-edited alignment can itself be returned for error checking. of the alignment to be realigned. In this case, the selected sub-range CLUSTAL X is not confined to either VMS or UNIX of the alignment is removed and multiply aligned using the standard work-stations but also runs on Macintosh and PC computers. The progressive multiple sequence alignment method. 
The sub-range is program provides a flexible approach to the problem of the then fitted back into the full alignment. multiple alignment of large numbers of sequences. The methods Using these two options, the original multiple sequence used can be applied equally well to both nucleotide and amino acid alignment may be iteratively improved and refined. sequences. An initial automatic alignment using the traditional 4881 Nucleic Acids Research, 1997, Vol. 25, No. 24 4881 Nucleic Acids Research, 1994, Vol. 22, No. 1 progressive, pairwise approach provides a good starting point for is false but, by checking them over, the major errors are almost further refinement. The alignments are displayed on the screen, and always uncovered. Nevertheless, there are situations where the the user can move around easily between different parts of the test may give a false sense of alignment accuracy. This could alignment. A versatile residue colouring scheme based on the happen when aligning sequences with strong amino acid residue conservation of each position in the alignment automatically biases (‘reduced sequence complexity’). Tandem repeats are highlights conserved or special features. another case, since superposition of the wrong repeats could still give a high scoring alignment. Alignments of highly divergent membrane proteins are tricky on both counts since there are many Alignment analysis and error detection transmembrane helices with hydrophobic amino acid biases. More specialised, detailed alignment analysis programs are Tools for alignment quality analysis have been developed and available (24–27). The advantages of CLUSTAL X are that the incorporated into the package. A ‘quality’ estimate for each quality analyses are very fast as well as being integrated into the position in the alignment is plotted on the screen (Fig. 1). 
Highly alignment package and the results are displayed graphically on conserved positions in the alignment will get a high ‘quality’ score, the screen, with any low-scoring regions highlighted by shading whilst either low conservation, or exceptional residues at a partially the alignment background. This interactive system provides an conserved position, will lower the score for the column. The efficient and flexible approach to alignment analysis and exceptional residues, which may be due to misalignment of the correction. sequences or simply divergence can be highlighted in the alignment (Fig. 1). Sometimes these may be of biological interest, Correcting misaligned regions although most divergence is due to neutral evolutionary processes. Several methods for calculating the conservation of an In Figure 2, a ‘model’ protein misalignment has been set up. For alignment column have been developed. Zvelebil et al. (19) used clarity, the closely related EFTU sequences have been deliberately physico-chemical properties of amino acids to quantify the misaligned. Genuine misalignments would normally be highly conservation of a position in an alignment, in order to predict divergent with only a few identities in particularly conserved protein secondary structure. Smith and Smith (20) define the columns. In such cases, if the correct alignment can be ascertained, ‘information density’ of a sub-region, assuming that all amino this may be by matches between residue similarities rather than acids are informationally equivalent. Sander and Schneider (21) identities. calculate a variation entropy for each column. Brouillet et al. (22) In the example, a misaligned segment of EFTU_ECOLI is first calculate the mean and standard deviation of the pairwise distances detected and marked by applying the low-scoring segments between amino acids in each column of an alignment, using a 20×20 algorithm (Fig. 2A). Next, a region of the alignment spanning the distance matrix. 
None of these methods were found to be ideal for error is selected using the cursor. The menu option ‘reset all gaps incorporation into Clustal X. Apart from the latter, none of the before alignment’ is toggled on: in this example there are falsely methods use a standard residue exchange matrix, as is needed for inserted gaps that must be deleted. This is not always the case, and consistency with the alignment process, as well as providing a if the existing gaps seem correct, the option can stay switched off. natural way to allow the user to customise the quality analysis by Now the ‘realign selected residue range’ option is invoked. The varying the matrix. The advantage of the geometric interpretion misaligned region is now rapidly and correctly aligned again, and developed for Clustal X is that statistical methods can then be the false gaps are deleted (Fig. 2B). This time the low-scoring applied to define a mean value for the column and distances can be segments algorithm finds only short segments ascribable to measured between each sequence and the mean: upper limits for the natural sequence divergence. Realignments in which the gaps are expected distance between any residue and the mean value can be left in may result in columns with nothing but padding characters, defined and thus exceptional residues can be identified. in which case there is a menu option available to delete these. Low-scoring segments in the sequences can also be highlighted in The realignment process uses the alignment parameter default the alignment (Figs 1 and 2). Low-scoring segments most often settings, or as they are set up by the user. Misaligned regions are result from one of three major causes: high divergence between the often more divergent than other regions of the alignment, which sequences; errors in input sequences, most notably frameshifts; and means that the alignment score may not be much higher than misalignments. 
If the cause can be ascribed to high divergence, the alignment may not be wrong, but should be regarded as unreliable in the low-scoring segment. In particularly unreliable segments, CLUSTAL X may mark out every sequence! The alignment in such a region is likely to be meaningless. Frameshift errors are more frequent than usually realised. In the alignment of EFTUs taken from Swiss-Prot release 34, four sequences have short frameshifts within the region shown in Figure 1. Suspect sequences can be investigated with frameshifting alignment programs such as PairWise in WiseTools (24), or Framesearch in the GCG package. It is important to detect and remove sequences containing errors, as they confound many types of inferences based on multiple alignments, and may themselves also cause the propagation of further alignment errors.

We have found the low scoring segments test to be remarkably powerful, picking up a number of frameshifts and leading to the correction of many misalignments. Not every highlighted region

misaligned alternatives. Therefore it may be necessary to lower gap penalties to allow the sequences to align: this is tested by trial and error. However, the user should be aware of two factors that already affect the gap penalties in the local realignment. There is no gap penalty at the ends of a selected region, so it is free to put new gaps there: judicious selection of the range boundaries can direct gaps to desired sites. Gap penalties are also lowered at existing gaps if these are retained. These factors mean that the selected range may give a better alignment without having to lower the gap penalties.

Further uses for the low scoring segments

In CLUSTAL X, the new algorithm for marking low-scoring segments has been implemented for visual interaction. However, the algorithm has the potential for wider usage. There are currently many projects to automatically produce databases of multiple sequence alignments. The alignments tend not to be of high quality as it has been difficult to distinguish good and bad aligned regions rapidly and reliably. Removing sequences with low-scoring segments below a cut-off score should dramatically improve these alignments, as all sequences that contain major errors, or are too divergent to align, can be trapped.

The algorithm also has the potential to automatically establish the domain boundaries in sets of partially-related multi-domain proteins. In this case the Smith–Waterman best local alignment algorithm, finding the approximate regions encompassing the homologous domains, would be harnessed to the forwards-backwards approach, summing both the positive and negative scoring segments in order to define sharp boundaries. A simpler application would be end-trimming in an alignment, since the termini of proteins are often poorly conserved.

ACKNOWLEDGEMENTS

J.T. was supported by institute funds from INSERM, CNRS and the Ministère de la Recherche et Technologie and the EMBL. We thank the many users of CLUSTAL W who have reported bugs/suggestions, and those who beta tested CLUSTAL X. We would also like to thank Dino Moras, Kevin Leonard, Matti Saraste and Frank Gannon for support during this work.

REFERENCES

1 Feng,D.F. and Doolittle,R.F. (1987) J. Mol. Evol., 25, 351–360.
2 Taylor,W.R. (1988) J. Mol. Evol., 28, 161–169.
3 Stockwell,P.A. and Peterson,G.B. (1987) Comput. Applic. Biosci., 3, 37–43.
4 Thirup,S. and Larsen,N.E. (1990) Proteins, 7, 291–295.
5 Clark,S.P. (1992) Comput. Applic. Biosci., 8, 535–538.
6 De Rijk,P. and De Wachter,R. (1993) Comput. Applic. Biosci., 9, 735–740.
7 Galtier,N., Gouy,M. and Gautier,C. (1996) Comput. Applic. Biosci., 12, 543–548.
8 Higgins,D.G. and Sharp,P.M. (1988) Gene, 73, 237–244.
9 Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) Comput. Applic. Biosci., 8, 189–191.
10 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
11 Higgins,D.G., Thompson,J.D. and Gibson,T.J. (1996) Methods Enzymol., 266, 383–402.
12 Vingron,M. and Sibbald,P.R. (1993) Proc. Natl. Acad. Sci. USA, 90, 8777–8781.
13 Henikoff,S. and Henikoff,J.G. (1994) J. Mol. Biol., 243, 574–578.
14 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Comput. Applic. Biosci., 10, 19–29.
15 Benner,S.A., Cohen,M.A. and Gonnet,G.H. (1994) Protein Engng, 7, 1323–1332.
16 Notredame,C. and Higgins,D.G. (1996) Nucleic Acids Res., 24, 1515–1524.
17 Gotoh,O. (1996) J. Mol. Biol., 264, 823–838.
18 Parry-Smith,D.J. and Attwood,T.K. (1991) Comput. Applic. Biosci., 7, 233–235.
19 Zvelebil,M.J.J.M., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.
20 Smith,R.F. and Smith,T.F. (1990) Proc. Natl. Acad. Sci. USA, 87, 118–122.
21 Sander,C. and Schneider,R. (1991) Proteins Struct. Funct. Genet., 9, 56–68.
22 Brouillet,S., Risler,J.L., Henaut,A. and Slonimski,P.P. (1992) Biochimie, 74, 571–580.
23 Birney,E., Thompson,J.D. and Gibson,T.J. (1996) Nucleic Acids Res., 24, 2730–2739.
24 Schuler,G.D., Altschul,S.F. and Lipman,D.J. (1991) Proteins Struct. Funct. Genet., 9, 180–190.
25 Vingron,M. and Argos,P. (1991) J. Mol. Biol., 218, 33–43.
26 Friemann,A. and Schmitz,S. (1992) Comput. Applic. Biosci., 8, 261–265.
27 Livingstone,C.D. and Barton,G.J. (1993) Comput. Applic. Biosci., 9, 745–756.
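The database-filtering use suggested under "Further uses for the low scoring segments" can be illustrated with a short Python sketch. It is not the CLUSTAL X code: it assumes the per-position profile scores for each sequence have already been computed, it finds low-scoring segments as the positions where both the forward and the backward zero-clamped sums are negative, and it omits the separate end-refinement step. The function names, the sequence names and the cut-off value are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Sequence, Tuple


def clamped_sums(scores: Sequence[float]) -> List[float]:
    """Running sum that is reset to zero whenever it would become non-negative."""
    sums, run = [], 0.0
    for s in scores:
        run = min(0.0, run + s)
        sums.append(run)
    return sums


def low_scoring_segments(scores: Sequence[float]) -> List[Tuple[int, int, float]]:
    """Return (start, end, total) for each run of positions where both the
    forward and the backward clamped sums are negative (end is inclusive)."""
    scores = list(scores)
    forward = clamped_sums(scores)
    backward = clamped_sums(scores[::-1])[::-1]
    segments, start = [], None
    for j, (f, b) in enumerate(zip(forward, backward)):
        low = f < 0 and b < 0
        if low and start is None:
            start = j
        elif not low and start is not None:
            segments.append((start, j - 1, sum(scores[start:j])))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1, sum(scores[start:])))
    return segments


def flag_sequences(per_sequence_scores: Dict[str, Sequence[float]],
                   cutoff: float) -> List[str]:
    """Names of sequences whose worst low-scoring segment totals below cutoff."""
    flagged = []
    for name, scores in per_sequence_scores.items():
        worst = min((t for _, _, t in low_scoring_segments(scores)), default=0.0)
        if worst < cutoff:
            flagged.append(name)
    return flagged


# Hypothetical profile scores for two sequences; the cut-off is arbitrary.
example = {"seq_good": [2, 1, -1, 3, 2], "seq_bad": [2, -4, -3, 1, -5, 3]}
print(flag_sequences(example, cutoff=-5))  # -> ['seq_bad']
```

In practice a suitable cut-off would have to be chosen empirically for the scoring matrix in use. The same segment list could also serve the end-trimming application, by discarding segments that touch the first or last alignment position.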


Publisher: Oxford University Press
Copyright: © 1997 Oxford University Press
ISSN: 0305-1048
eISSN: 1362-4962
DOI: 10.1093/nar/25.24.4876

Abstract

4876–4882 Nucleic Acids Research, 1997, Vol. 25, No. 24  1997 Oxford University Press The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools Julie D. Thompson, Toby J. Gibson , Frédéric Plewniak, François Jeanmougin* and Desmond G. Higgins Institut de Genetique et de Biologie Moleculaire et Cellulaire (CNRS/INSERM/ULP), BP 163, 67404 Illkirch Cedex, 1 3 France, European Molecular Biology Laboratory, Postfach 10.2209, 69012 Heidelberg, Germany and Department of Biochemistry, University College, Cork, Ireland Received September 24, 1997; Revised and Accepted October 28, 1997 ABSTRACT sequences have <30% residue identity, this automatic method becomes less reliable. Any misaligned regions introduced in CLUSTAL X is a new windows interface for the previous stages of the progressive alignment are not corrected later widely-used progressive multiple sequence alignment as new information from other sequences is added. In such cases, program CLUSTAL W. The new system is easy to use, the automatic alignments need to be refined, either manually or providing an integrated system for performing multiple automatically. sequence and profile alignments and analysing the Numerous sequence editors have been developed which allow results. CLUSTAL X displays the sequence alignment in the user to display and manually make or modify an alignment a window on the screen. A versatile sequence colouring (eg. 3–7). These programs are useful for making small refinements scheme allows the user to highlight conserved features to an alignment, but the totally manual alignment of large numbers in the alignment. Pull-down menus provide all the of sequences is not feasible. Manual alignment is also highly options required for traditional multiple sequence and subjective hence is at least as likely as the automatic alignment profile alignment. New features include: the ability to process to result in errors in the alignment. 
cut-and-paste sequences to change the order of the The CLUSTAL X interface has been written to provide a single alignment, selection of a subset of the sequences to be environment in which the user can perform multiple alignments, realigned, and selection of a sub-range of the alignment view the results and, if necessary, refine and improve the to be realigned and inserted back into the original alignment. Tools for alignment quality analysis have been alignment. Alignment quality analysis can be performed developed which allow the user to highlight low-scoring regions and low-scoring segments or exceptional residues can in the alignment. Options are available for automatically be highlighted. Quality analysis and realignment of correcting these low-scoring regions by realigning a misaligned selected residue ranges provide the user with a sequence or a selected region of an alignment. powerful tool to improve and refine difficult alignments In earlier CLUSTAL programs (8–11), nested text menus and to trap errors in input sequences. CLUSTAL X has provided all the options to do multiple sequence/profile alignments been compiled on SUN Solaris, IRIX5.3 on Silicon and simple phylogenetic tree generation. The output alignments Graphics, Digital UNIX on DECstations, Microsoft were written to file for display, printing or further manipulation. Windows (32 bit) for PCs, Linux ELF for x86 PCs, and With these simple menus, the CLUSTAL programs could be highly Macintosh PowerMac. portable, and run on essentially all computers. Portability has been a major factor in the widespread usage of the CLUSTAL series for sequence alignment. On the other hand, much more attractive and INTRODUCTION powerful user interfaces can be built using non-portable windows The most widely used method in molecular biology to align sets of systems. nucleotide or amino acid sequences, is to build up a multiple The NCBI Software Development Toolkit (Version 1.8, National alignment progressively (1–2). 
The most closely related groups of Center for Biotechnology Information, Bethesda, MD) provides one sequences are aligned first and then these groups are gradually solution to the windows portability problem. It interfaces between aligned together, keeping the early alignments fixed. This approach the application code and various host windowing systems including works well when the sequences are sufficiently closely related. the X Window System, Macintosh windows and Microsoft However, a globally optimal solution (or a biologically significant Windows. We have made use of the toolkit to provide a portable one) cannot be guaranteed. In more difficult cases, where many windows interface, termed CLUSTAL X. *To whom correspondence should be addressed. Tel: +33 388 65 32 71; Fax: +33 388 65 32 01; Email: jeanmougin@igbmc.u-strasbg.fr 4877 Nucleic Acids Research, 1997, Vol. 25, No. 24 4877 Nucleic Acids Research, 1994, Vol. 22, No. 1 CLUSTAL X is a new graphical interface to the CLUSTAL W Highlighted residues may be expected to occur at a moderate program which displays the sequence alignment in a window on frequency in all the sequences because of their steady divergence the screen, allowing the user to move easily between different parts due to the natural processes of evolution, although the most of the alignment. Pull-down menus provide all the options familiar divergent sequences are likely to have the most outliers. However, to users of the text-menu-driven CLUSTAL W, plus several new the highlighted residues are especially useful in pointing to features. A versatile, configurable, colouring system is used to sequence misalignments. These can arise due to various reasons: highlight conserved residue features in the alignment. 
Options to (i) partial or total misalignments caused by a failure in the mark suspect regions and realign selected residue ranges give the alignment algorithm, (ii) partial or total misalignments because at user more information and control over the alignment process and least one of the sequences in the given set is partly or completely allow difficult alignments to be gradually built and refined. unrelated to the other sequences, (iii) frameshift translation errors in a protein sequence causing local mismatched regions to be heavily highlighted (see Discussion for more details). MATERIALS AND METHODS Occasionally, highlighted residues may point to regions of We make use of the NCBI VIBRANT (Virtual Interface for some biological significance. This might happen for example if Biological Research and Technology) development library which a protein alignment contains a sequence which has acquired new acts as an interface between the application code of CLUSTAL functions relative to the main sequence set. X and the host windowing system. The NCBI libraries are linked Quality scores. Suppose we have an alignment of M sequences of to the CLUSTAL X code providing mechanisms for displaying length N. Then, the alignment can be written as: windows, menus, buttons etc. In this way, the CLUSTAL X code A A A .......... A 1,1 1,2 1,3 1,N remains independent of the underlying operating system and A A A .......... A 2,1 2,2 2,3 2,N computer. The CLUSTAL X code is written in ANSI C, and should be portable to any machine capable of supporting the NCBI Vibrant toolkit. A A A .......... A M,1 M,2 M,3 M,N CLUSTAL X is available for a number of platforms including Suppose we also define a residue comparison matrix C of size R*R, SUN Solaris, IRIX 5.3 on Silicon Graphics, Digital UNIX on where R is the number of residues. C(a,b) is the score for aligning DECstations, Microsoft Windows (32 bit) for PCs, Linux ELF for residue a with residue b (all scores in this matrix are positive). 
x86 PCs, and Macintosh PowerMac. The source code is provided The problem is to calculate a score for the conservation of the jth for anyone wishing to port to any other platform supported by the position in the alignment. Vingron and Sibbald (12) used a Vibrant project. geometric analysis based on a continuous sequence space in order The source code for CLUSTAL X and several executable to compare sequence weighting methods. The method defines an versions for different machines are freely available by anonymous N-dimensional space, where N is the length of the alignment. Each ftp to ftp-igbmc.u-strasbg.fr. Hypertext documentation can be sequence can be placed in the space, and the distance between two viewed at www-igbmc.u-strasbg.fr/BioInfo/clustalx/. The NCBI sequences is defined as the euclidean distance between the Vibrant Toolkit is available by anonymous ftp from sequences in the space. We have applied an analogous approach to ncbi.nlm.nih.gov. each position in the alignment. An R-dimensional space is defined, in which each column of the alignment can be considered. For a Installation specified position j in the alignment, each sequence consists of a single residue which is assigned a point S in the space. For The CLUSTAL X program is easily installed by copying the sequence i, position j, the point S is defined as: executable file to a system directory which can be seen by all users. Several parameter files (named *.par), and an on-line help C(1, A ) ij text file (clustalx.hlp for MS Windows, otherwise clustalx_help) C(2, A ) ij are also required by CLUSTAL X. These files should be copied to one of the following directories: (i) the user’s current directory, . (ii) the user’s home directory, (iii) any of the directories specified C(R, A ) ij by the PATH environment variable. We then calculate a consensus value X for the jth position in the alignment. 
X is defined as: RESULTS Algorithms checking alignment quality F * C(i,1) i,j Three methods of alignment quality analysis are implemented. i1 (i) A ‘quality’ score is calculated for each column in the alignment, which depends on the amino acid variability in the column. A high R score indicates a highly conserved column, a low score indicating F * C(i,2) i,j a less well-conserved position. The scores are automatically plotted i1 in the window display (Fig. 1). (ii) The residues which get M exceptionally low scores in the above quality calculation can be highlighted in the alignment display (Fig. 1). (iii) Low-scoring F * C(i, R) i,j segments in each sequence of the alignment can be highlighted. i1 These are found by summing negative scores in the profile built from the sequence alignment (Figs 1 and 2). 4878 Nucleic Acids Research, 1997, Vol. 25, No. 24 Figure 1. The CLUSTAL X window in multiple alignment mode. An alignment of some EFTU proteins is displayed. Low-scoring segments are highlighted using a white character on a black background. Exceptional residues are shown as a white character on a grey background. The quality analysis reveals two anomalously low scoring regions, ruler positions 16–25 in EFTU_ODOSI and 61–71 in EFTU_MYCPN. These were found to be caused by frameshift errors. Two more sequences (EFTU_RICPR and EFTU_SPIPL), not shown here, have 4-residue sequencing errors in this region which CLUSTAL X will also highlight. where F is the count of residues i at position j in the alignment. in order of size from smallest to largest. We can find the upper and i,j Now, if S is the position of sequence i in the R-dimensional lower quartiles (the distances lying one-quarter of the way from space, we can calculate the distance D between each sequence the top and bottom of the array, respectively) and the inter quartile residue i and the consensus position X. range (the difference between the two quartiles). 
A residue A is considered as an exception in measure (ii) above ij if the sequence distance D is greater than (upper quartile + inter D  (X  S ) i r r quartile range × scaling factor). The scaling factor can be adjusted r1 by the user to select the proportion of residue exceptions that will be highlighted in the alignment display. Exceptional residues in where X is the rth dimension of position X, and S is the rth r r an EFTU alignment are shown in Figure 1. dimension of position S. This calculation runs automatically, in a very short time, each We define the quality score for the jth position in the alignment time the screen is updated. as the mean of the sequence distances D : Low-scoring segments. Given the above alignment of M sequences of length N and a residue exchange matrix, we can build a profile i1 Quality Score which is weighted for sequence divergence. Methods for calculat- ing sequence weights are discussed by Henikoff and Henikoff (13). Finally the scores are normalised by multiplying by the percen- Here we calculate sequence weights directly from a neighbour- tage of sequences which have residues (and not gaps) at this joining tree, using the ‘branch-proportional’ method which position. These scores are used in measure (i) above, as an corrects for unequal representation by downweighting similar estimate of the conservation of each alignment column (Fig. 1). sequences and upweighting divergent ones (14). Each sequence is Exceptional residues. It would be useful for each column in the assigned a weight W . In the residue comparison matrix C, the alignment to identify those sequences in the above calculations scores for common residue substitutions are positive while rarer which are found a long way from the consensus point (i.e., which substitutions are scored negatively. By default, the Gonnet PAM have a large distance D ), thus lowering the quality score for the 250 matrix (15) is used, but the user may supply a different matrix column. 
For the jth position in the alignment, we take the set of e.g. a lower PAM value is appropriate if the sequences are closely sequences which have a residue at this position (and not a gap). related. The profile P has a column of scores for each position in The distances D for this set of sequences are arranged in an array the alignment. The column is of height R and consists of a score i 4879 Nucleic Acids Research, 1997, Vol. 25, No. 24 4879 Nucleic Acids Research, 1994, Vol. 22, No. 1 Figure 2. Detection and correction of misaligned segments with CLUSTAL X. (A) A set of EFTUs, tested for low scoring regions, highlights a part of the EFTU_ECOLI sequence (which we deliberately misaligned by incorrect gap insertion). The range selected to be realigned is marked above the alignment. (B) After removal of gaps and realignment of the selected residue range, the sequence EFTU_ECOLI is now correctly aligned, and the erroneous gaps have been removed. The low-scoring segment check, the column conservation indicators above the alignment, and the quality graph below it, all reflect the improvement in the alignment. for each residue in the matrix C. The profile score for residue r at F  S if F  Sij  0 j–1 ij j–1 position j in the alignment is defined as: 0 if F  S  0 j–1 ijh 0 if j  0 C(r, A )* W ij i Having found the regions in the sequence which have negative F i1 P(r, j) scores, these regions are then refined by removing those positions at the end of each segment which have a positive profile score S . ij The F scores for these positions are reset to zero. i1 j Similarly, the backward phase can be described as: For the jth position in the ith sequence the score S is defined as: ij B  S if B  S  0 j1 ij j1 ij 0 if B  S  0 S  P(A , j) B  j1 ij ij ij j 0 if j  N  1 The low-scoring regions in the ith sequence are found by summing the scores S along the alignment in both the forward The regions in the sequence which have negative B scores are ij j and backward directions. 
If the sum is found to be positive, it is again refined by removing those positions at the beginning of reset to zero. The forward phase can be described by the following each segment which have a positive profile score S . The B ij j recurrence relations: scores for these positions are reset to zero. 4880 Nucleic Acids Research, 1997, Vol. 25, No. 24 The calculation is repeated for each sequence compared with DISCUSSION a profile for all aligned sequences, except itself. The low-scoring segments, defined as those positions for which both F and B are j j Ideally, methods for multiple sequence alignment should guarantee negative, are then highlighted in the display (Figs 1 and 2). to find the biologically correct alignment for a set of sequences. In The low-scoring segment calculation is done when the user practice, this is difficult to achieve. Firstly, it is difficult to define selects the ‘Calculate low scoring segments’ option. It takes only an optimal alignment between divergent nucleotide or protein a few seconds to perform unless the alignment is very large, sequences, even given tertiary structural information. Secondly, making it a practical tool for interactive use. methods that find an optimal multiple sequence alignment have been impractical to implement, mainly due to their computational Implementation cost. As computer performance improves, methods which iterate toward an optimal alignment are likely to become useful (16,17). CLUSTAL X displays a window on the screen, including a set of Meanwhile the heuristical approach of progressive alignment is pull-down menus. On-line help is available. The exact format of most often used, as the algorithm is reasonably fast and minimises the screen will depend on the host computer and the operating error in alignments of moderate difficulty. However, because the system. 
The user may select one of two modes: (i) multiple full information in the sequence set is not used to align each alignment mode which displays a single display area for multiple sequence, it can be possible to see one or more misaligned sequence alignment, or (ii) profile alignment mode which has two sequence segments in the output alignment. In such cases, the display areas, allowing the user to use previously aligned sequences would be expected to align correctly if the full sequences for alignment. Figure 1 shows a CLUSTAL X window information was used, or if alignment parameters such as gap in multiple alignment mode. penalties were adjusted. Alignments or individual sequences can be loaded into the When we developed CLUSTAL W, we gave the user the ability display areas displayed on the screen using the menu options. to iterate the alignment process by realigning an alignment, or by Scroll bars allow easy movement to different parts of the profile aligning sequences to an alignment. In this way, the user alignment. Extra lines are added to the sequence data displaying a could choose to iterate the alignment process, thereby overcoming ruler, an indicator of alignment conservation, plus any secondary some of the defects of progressive alignment. With CLUSTAL X, structure data which was found in the input alignment file. we have taken this capability further, by building in algorithms to The order of the sequences in the display can be changed by target the problem regions of an alignment and letting the user clicking on the sequence names, and selecting the cut and paste realign solely the suspect residue ranges. Using these tools, high options from the menus. In profile alignment mode, these options quality alignments of divergent sequence sets are produced more also allow the user to move sequences from one profile to the other. 
quickly and with greater confidence than has previously been The sequences can be saved at any time to an alignment file in possible by progressive alignment. one of a number of file formats. The sequence display can also be Many programs have been developed which allow, to a greater saved in a colour postscript file for printing on a postscript printer. or lesser degree, manual intervention in the automatic alignment process. For example, SOMAP (18) was designed to run under Colouring the alignment display. The sequences are automatically the DEC VMS operating system. The program allows the user to coloured to highlight conserved regions of the alignment. The manually build up a multiple sequence alignment. It can accept colours used, and the specification of the conservation of residues automatic alignments created by the original CLUSTAL program can be configured by the user. The ‘rules’ governing the colouring (8) to provide a starting-point for the manual editing process. of residues are read from a colour parameter file, which can be SEAVIEW (7) is a UNIX X Window-based multiple sequence loaded at any time. Two types of colouring ‘rules’ are defined. (i) A alignment editor which is interfaced to the CLUSTAL W residue can be assigned a specific colour regardless of its position program. SEQPUP (Don Gilbert, Biology Department, Indiana in the alignment. In this case, all occurrences of the residue will be University, Bloomington, IN 47405) is a sequence editor and coloured in the alignment display. (ii) A residue can be assigned analysis program which can launch external applications such as different colours depending on the consensus of the alignment at CLUSTAL W to perform sequence alignment. SEQLAB each position. [Wisconsin Package Version 9.0, Genetics Computer Group In this way, for example, conserved hydrophobic or hydrophilic (GCG), Madison, WI] is a graphical user interface based on the positions in the alignment can be highlighted (Fig. 1). 
OSF/Motif windowing system. It displays sequence alignments Realigning divergent regions. In difficult cases, with a family of on the screen and includes powerful sequence editing facilities. highly divergent sequences, it is possible that misalignments are The PILEUP program is interfaced to the SEQLAB editor to introduced during the multiple alignment process. CLUSTAL X perform automatic multiple sequence alignments. Numerous provides two simple mechanisms for realigning the most divergent Mac and PC alignment editors have also been developed. Most of regions. (i) Misaligned sequences may be selected by clicking on these editors will accept alignment output from CLUSTAL the sequence names. A single menu option then removes these programs. However, using CLUSTAL X, the amount of time sequences from the alignment set and realigns them to the remaining spent editing alignments by hand should be minimised, while a sequences. (ii) The second option allows the user to specify a range hand-edited alignment can itself be returned for error checking. of the alignment to be realigned. In this case, the selected sub-range CLUSTAL X is not confined to either VMS or UNIX of the alignment is removed and multiply aligned using the standard work-stations but also runs on Macintosh and PC computers. The progressive multiple sequence alignment method. The sub-range is program provides a flexible approach to the problem of the then fitted back into the full alignment. multiple alignment of large numbers of sequences. The methods Using these two options, the original multiple sequence used can be applied equally well to both nucleotide and amino acid alignment may be iteratively improved and refined. sequences. An initial automatic alignment using the traditional 4881 Nucleic Acids Research, 1997, Vol. 25, No. 24 4881 Nucleic Acids Research, 1994, Vol. 22, No. 
1 progressive, pairwise approach provides a good starting point for is false but, by checking them over, the major errors are almost further refinement. The alignments are displayed on the screen, and always uncovered. Nevertheless, there are situations where the the user can move around easily between different parts of the test may give a false sense of alignment accuracy. This could alignment. A versatile residue colouring scheme based on the happen when aligning sequences with strong amino acid residue conservation of each position in the alignment automatically biases (‘reduced sequence complexity’). Tandem repeats are highlights conserved or special features. another case, since superposition of the wrong repeats could still give a high scoring alignment. Alignments of highly divergent membrane proteins are tricky on both counts since there are many Alignment analysis and error detection transmembrane helices with hydrophobic amino acid biases. More specialised, detailed alignment analysis programs are Tools for alignment quality analysis have been developed and available (24–27). The advantages of CLUSTAL X are that the incorporated into the package. A ‘quality’ estimate for each quality analyses are very fast as well as being integrated into the position in the alignment is plotted on the screen (Fig. 1). Highly alignment package and the results are displayed graphically on conserved positions in the alignment will get a high ‘quality’ score, the screen, with any low-scoring regions highlighted by shading whilst either low conservation, or exceptional residues at a partially the alignment background. This interactive system provides an conserved position, will lower the score for the column. The efficient and flexible approach to alignment analysis and exceptional residues, which may be due to misalignment of the correction. sequences or simply divergence can be highlighted in the alignment (Fig. 1). 
Sometimes these may be of biological interest, Correcting misaligned regions although most divergence is due to neutral evolutionary processes. Several methods for calculating the conservation of an In Figure 2, a ‘model’ protein misalignment has been set up. For alignment column have been developed. Zvelebil et al. (19) used clarity, the closely related EFTU sequences have been deliberately physico-chemical properties of amino acids to quantify the misaligned. Genuine misalignments would normally be highly conservation of a position in an alignment, in order to predict divergent with only a few identities in particularly conserved protein secondary structure. Smith and Smith (20) define the columns. In such cases, if the correct alignment can be ascertained, ‘information density’ of a sub-region, assuming that all amino this may be by matches between residue similarities rather than acids are informationally equivalent. Sander and Schneider (21) identities. calculate a variation entropy for each column. Brouillet et al. (22) In the example, a misaligned segment of EFTU_ECOLI is first calculate the mean and standard deviation of the pairwise distances detected and marked by applying the low-scoring segments between amino acids in each column of an alignment, using a 20×20 algorithm (Fig. 2A). Next, a region of the alignment spanning the distance matrix. None of these methods were found to be ideal for error is selected using the cursor. The menu option ‘reset all gaps incorporation into Clustal X. Apart from the latter, none of the before alignment’ is toggled on: in this example there are falsely methods use a standard residue exchange matrix, as is needed for inserted gaps that must be deleted. This is not always the case, and consistency with the alignment process, as well as providing a if the existing gaps seem correct, the option can stay switched off. 
natural way to allow the user to customise the quality analysis by varying the matrix. The advantage of the geometric interpretation developed for Clustal X is that statistical methods can then be applied to define a mean value for the column and distances can be measured between each sequence and the mean: upper limits for the expected distance between any residue and the mean value can be defined and thus exceptional residues can be identified.

Low-scoring segments in the sequences can also be highlighted in the alignment (Figs 1 and 2). Low-scoring segments most often result from one of three major causes: high divergence between the sequences; errors in input sequences, most notably frameshifts; and misalignments. If the cause can be ascribed to high divergence, the alignment may not be wrong, but should be regarded as unreliable in the low-scoring segment. In particularly unreliable segments, CLUSTAL X may mark out every sequence! The alignment in such a region is likely to be meaningless.

Frameshift errors are more frequent than usually realised. In the alignment of EFTUs taken from Swiss-Prot release 34, four sequences have short frameshifts within the region shown in Figure 1. Suspect sequences can be investigated with frameshifting alignment programs such as PairWise in WiseTools (24), or Framesearch in the GCG package. It is important to detect and remove sequences containing errors, as they confound many types of inferences based on multiple alignments, and may themselves also cause the propagation of further alignment errors.

We have found the low scoring segments test to be remarkably powerful, picking up a number of frameshifts and leading to the correction of many misalignments. Not every highlighted region

Now the ‘realign selected residue range’ option is invoked. The misaligned region is now rapidly and correctly aligned again, and the false gaps are deleted (Fig. 2B). This time the low-scoring segments algorithm finds only short segments ascribable to natural sequence divergence. Realignments in which the gaps are left in may result in columns with nothing but padding characters, in which case there is a menu option available to delete these.

The realignment process uses the alignment parameter default settings, or as they are set up by the user. Misaligned regions are often more divergent than other regions of the alignment, which means that the alignment score may not be much higher than misaligned alternatives. Therefore it may be necessary to lower gap penalties to allow the sequences to align: this is tested by trial and error. However, the user should be aware of two factors that already affect the gap penalties in the local realignment. There is no gap penalty at the ends of a selected region, so it is free to put new gaps there: judicious selection of the range boundaries can direct gaps to desired sites. Gap penalties are also lowered at existing gaps if these are retained. These factors mean that the selected range may give a better alignment without having to lower the gap penalties.

Further uses for the low scoring segments

In CLUSTAL X, the new algorithm for marking low-scoring segments has been implemented for visual interaction. However, the algorithm has the potential for wider usage. There are currently many projects to automatically produce databases of multiple sequence alignments. The alignments tend not to be of high quality as it has been difficult to distinguish good and bad aligned regions rapidly and reliably. Removing sequences with low-scoring segments below a cut-off score should dramatically improve these alignments, as all sequences that contain major errors, or are too divergent to align, can be trapped.

The algorithm also has the potential to automatically establish the domain boundaries in sets of partially-related multi-domain proteins. In this case the Smith–Waterman best local alignment algorithm, finding the approximate regions encompassing the homologous domains, would be harnessed to the forwards-backwards approach, summing both the positive and negative scoring segments in order to define sharp boundaries. A simpler application would be end-trimming in an alignment, since the termini of proteins are often poorly conserved.

ACKNOWLEDGEMENTS

J.T. was supported by institute funds from INSERM, CNRS and the Ministère de la Recherche et Technologie and the EMBL. We thank the many users of CLUSTAL W who have reported bugs/suggestions, and those who beta tested CLUSTAL X. We would also like to thank Dino Moras, Kevin Leonard, Matti Saraste and Frank Gannon for support during this work.

REFERENCES

1 Feng,D.F. and Doolittle,R.F. (1987) J. Mol. Evol., 25, 351–360.
2 Taylor,W.R. (1988) J. Mol. Evol., 28, 161–169.
3 Stockwell,P.A. and Peterson,G.B. (1987) Comput. Applic. Biosci., 3, 37–43.
4 Thirup,S. and Larsen,N.E. (1990) Proteins, 7, 291–295.
5 Clark,S.P. (1992) Comput. Applic. Biosci., 8, 535–538.
6 De Rijk,P. and De Wachter,R. (1993) Comput. Applic. Biosci., 9, 735–740.
7 Galtier,N., Gouy,M. and Gautier,C. (1996) Comput. Applic. Biosci., 12, 543–548.
8 Higgins,D.G. and Sharp,P.M. (1988) Gene, 1, 237–244.
9 Higgins,D.G., Bleasby,A.J. and Fuchs,R. (1992) Comput. Applic. Biosci., 8, 189–191.
10 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Nucleic Acids Res., 22, 4673–4680.
11 Higgins,D.G., Thompson,J.D. and Gibson,T.J. (1996) Methods Enzymol., 266, 383–402.
12 Vingron,M. and Sibbald,P.R. (1993) Proc. Natl. Acad. Sci. USA, 90, 8777–8781.
13 Henikoff,S. and Henikoff,J.G. (1994) J. Mol. Biol., 243, 574–578.
14 Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Comput. Applic. Biosci., 10, 19–29.
15 Benner,S.A., Cohen,M.A. and Gonnet,G.H. (1994) Protein Engng, 7, 1323–1332.
16 Notredame,C. and Higgins,D.G. (1996) Nucleic Acids Res., 24, 1515–1524.
17 Gotoh,O. (1996) J. Mol. Biol., 264, 823–838.
18 Parry-Smith,D.J. and Attwood,T.K. (1991) Comput. Applic. Biosci., 7, 233–235.
19 Zvelebil,M.J.J.M., Barton,G.J., Taylor,W.R. and Sternberg,M.J.E. (1987) J. Mol. Biol., 195, 957–961.
20 Smith,R.F. and Smith,T.F. (1990) Proc. Natl. Acad. Sci. USA, 87, 118–122.
21 Sander,C. and Schneider,R. (1991) Proteins Struct. Funct. Genet., 9, 56–68.
22 Brouillet,S., Risler,J.L., Henaut,A. and Slonimski,P.P. (1992) Biochimie, 74, 571–580.
23 Birney,E., Thompson,J.D. and Gibson,T.J. (1996) Nucleic Acids Res., 24, 2730–2739.
24 Schuler,G.D., Altschul,S.F. and Lipman,D.J. (1991) Proteins Struct. Funct. Genet., 9, 180–190.
25 Vingron,M. and Argos,P. (1991) J. Mol. Biol., 218, 33–43.
26 Friemann,A. and Schmitz,S. (1992) Comput. Applic. Biosci., 8, 261–265.
27 Livingstone,C.D. and Barton,G.J. (1993) Comput. Applic. Biosci., 9, 745–756.
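
Supplementary sketch. The end-trimming application suggested under 'Further uses for the low scoring segments' can be illustrated with a minimal Python example. It is only a sketch under stated assumptions and not the CLUSTAL X algorithm: the column score used here (the fraction of sequences sharing the most common residue) is a simple stand-in for the low-scoring segments measure, and the names `column_conservation`, `trim_ends` and the `threshold` parameter are hypothetical.

```python
from collections import Counter

GAP = "-"


def column_conservation(alignment):
    """Per-column fraction of sequences that share the most common residue.

    Assumption: this simple identity-based score stands in for the
    low-scoring segments measure; it is NOT the CLUSTAL X calculation.
    Gaps never count as the shared residue, so gappy columns score low.
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != GAP]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / n_seqs)
    return scores


def trim_ends(alignment, threshold=0.6):
    """Remove poorly conserved columns from both termini of an alignment.

    Only leading and trailing columns are removed; low-scoring columns
    in the interior of the alignment are left untouched.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    scores = column_conservation(alignment)
    keep = [i for i, score in enumerate(scores) if score >= threshold]
    if not keep:
        # No column reaches the threshold: nothing reliable remains.
        return ["" for _ in alignment]
    start, end = keep[0], keep[-1] + 1
    return [seq[start:end] for seq in alignment]


# Toy alignment (hypothetical data): divergent termini, conserved core.
example = [
    "MKAVLGTEFK",
    "-RAVLGTEW-",
    "QTAVLGSEY-",
]
trimmed = trim_ends(example, threshold=0.6)
for seq in trimmed:
    print(seq)
```

With the toy data above, the two leading and two trailing columns fall below the threshold and are removed, while the partially conserved interior column (T/T/S) is kept. A real application would substitute the per-sequence low-scoring segment scores for this column score and choose the cut-off empirically.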

Journal

Nucleic Acids Research, Oxford University Press

Published: Dec 1, 1997
