ARB: a software environment for sequence data

Published online February 25, 2004
Nucleic Acids Research, 2004, Vol. 32, No. 4 1363–1371
DOI: 10.1093/nar/gkh293

Wolfgang Ludwig*, Oliver Strunk, Ralf Westram, Lothar Richter, Harald Meier¹, Yadhukumar, Arno Buchner, Tina Lai, Susanne Steppi, Gangolf Jobb¹, Wolfram Förster¹, Igor Brettske, Stefan Gerber, Anton W. Ginhart¹, Oliver Gross, Silke Grumann¹, Stefan Hermann¹, Ralf Jost¹, Andreas König¹, Thomas Liss¹, Ralph Lüßmann¹, Michael May¹, Björn Nonhoff¹, Boris Reichel¹, Robert Strehlow¹, Alexandros Stamatakis¹, Norbert Stuckmann¹, Alexander Vilbig¹, Michael Lenke¹, Thomas Ludwig², Arndt Bode¹ and Karl-Heinz Schleifer

Lehrstuhl für Mikrobiologie, Technische Universität München, D-853530 Freising, Germany, ¹Lehrstuhl für Rechnertechnik und Rechnerorganisation, Parallelrechnerarchitektur, Technische Universität München, D-85748 Garching, Germany and ²Institut für Informatik, Ruprecht-Karls-Universität Heidelberg, D-69120 Heidelberg, Germany

Received January 13, 2004; Revised and Accepted January 28, 2004

*To whom correspondence should be addressed. Tel: +8161 71 5451; Fax: +8161 71 5475; Email: ludwig@mikro.biologie.tu-muenchen.de

Nucleic Acids Research, Vol. 32 No. 4 © Oxford University Press 2004; all rights reserved

ABSTRACT

The ARB (from Latin arbor, tree) project was initiated almost 10 years ago. The ARB program package comprises a variety of directly interacting software tools for sequence database maintenance and analysis which are controlled by a common graphical user interface. Although it was initially designed for ribosomal RNA data, it can be used for any nucleic and amino acid sequence data as well. A central database contains processed (aligned) primary structure data. Any additional descriptive data can be stored in database fields assigned to the individual sequences or linked via local or worldwide networks. A phylogenetic tree visualized in the main window can be used for data access and visualization. The package comprises additional tools for data import and export, sequence alignment, primary and secondary structure editing, profile and filter calculation, phylogenetic analyses, specific hybridization probe design and evaluation and other components for data analysis. Currently, the package is used by numerous working groups worldwide.

INTRODUCTION

The ARB (from Latin arbor, tree) project was established as an interdisciplinary bioinformatics initiative of the Lehrstuhl für Mikrobiologie and the Lehrstuhl für Rechnertechnik und Rechnerorganisation, Parallelrechnerarchitektur of the Technical University of Munich almost 10 years ago. At that time, comparative sequence analysis of the small subunit rRNAs or the respective genes had already been established as the most commonly applied approach for phylogeny inference as well as microbial taxonomy and identification. Furthermore, improved and automated sequencing techniques promoted a rapid increase in the number of small subunit rRNA primary structure data available from data sources such as GenBank (2) or EBI (European Bioinformatics Institute) (1). However, these databases provide only raw data and additional descriptive information which cannot interactively be extended by the user. Although the Ribosomal Database Project (RDP) (3) and the Antwerpen projects (4,5) offered datasets of aligned sequences, data handling and analysis remained difficult for scientists applying rRNA-based methods. A variety of individual software tools for sequence editing, alignment and phylogenetic analysis were available from the different database projects (1–4) and other sources (6) (http://www.gcg.com). However, a comprehensive package of interacting tools was missing. Furthermore, the number of different input and output formats which had to be used reflected the variety of individual software programs which had to be applied sequentially, and inconveniently, to achieve a comprehensive analysis of molecular data. Unfortunately, a promising initiative, the Genetic Data Environment (GDE) project (http://bimas.dcrt.nih.gov/gde_sw.html), focusing on the development of a common graphical interface for data handling and analysis, was not continued. Consequently, microbiologists and computer scientists at the Technical University of Munich decided to develop their own software package capable of properly managing the upcoming data flood.

The two major tasks according to the ARB concept, formulated in the early days of the project and maintained to the present, are (i) the maintenance of a structured integrative secondary database combining processed primary structures and any type of additional data assigned to the individual sequence entries and (ii) a comprehensive selection of software tools directly interacting with one another as well as the central database which are controlled via a common graphical interface. Software and rRNA databases are publicly accessible (http://www.arb-home.de) and have been in use worldwide for several years.

MATERIALS AND METHODS

Sequence data

The raw data used to establish databases and perform data analysis were taken from our own sequencing projects, provided by other research groups or periodically retrieved from public data sources such as the EBI (1), GenBank (2), the RDP (3) and the Antwerpen databases for small (4) and large (5) subunit RNAs. Complete releases were downloaded from the latter two locations. The search and retrieval tools of the former two institutions were used to select and download the primary structure and additional information on rRNA or other genes.
Furthermore, sequence data determined at the Lehrstuhl für Mikrobiologie of the Technical University of Munich or by other groups were imported and processed.

Figure 1. The interacting components and tools of the ARB software package and database.

Operating systems and programming languages

The ARB software was developed for UNIX systems and their derivatives. Currently, the development is performed using SuSE LINUX (http://www.suse.com) running on PCs. The greater part of the source code was written in C++ and C; some parts were written in Perl and other script languages. The graphical environment is based upon the Open Motif library.

Integrated external software tools

Functionalities from the GDE project (http://bimas.dcrt.nih.gov/gde_sw.html) concerning sequence editing were adopted and implemented in the ARB package. Some programs of the PHYLIP package for phylogeny inference (6) were incorporated as components directly interacting with the central database. Additionally, fastDNAml (7) and protml of the Molphy package (8), components of the Puzzle package (9) and AxML, a new accelerated fastDNAml derivative (10), were included for maximum-likelihood-based phylogenetic analyses of nucleic and amino acid sequence data.

RESULTS AND DISCUSSION

A selection of tools and functionalities of the ARB package will be briefly described in the following sections. The network in Figure 1 schematically visualizes these tools and their interactions with one another and the central database. Most tools developed for ARB directly interact with a copy of the database in the main storage, whereas the integrated second-party tools are provided with data from ARB and their results are written back to the database. Thus any changes or rearrangements are immediately known to the peripheral software components.

The central database

The sequences representing organisms, genes or gene products are stored in individual database fields as the central components and a unique identifier (short_name) is automatically generated and assigned to each of them. Databases created using ARB are hierarchically structured. Following the ARB concept of an integrative database, any type of additional data can be assigned to the individual sequence data entry and stored within default or user-defined database fields. These data can either be kept as intrinsic components of the database or linked to it via local networks or the Internet. In the latter case the path to the respective file or the URL of an external database, optionally including commands and search strings, has to be stored within the respective ARB database fields. The designations and hierarchy of the database fields can be customized by the user. The default structuring is according to the phylogeny of the organisms derived from the respective sequence data. However, it can also be changed according to other criteria defined by database field entries. This hierarchy is used by special algorithms for highly effective data compression. Different protection levels (0–6) can be assigned to the individual database fields. Database as well as security management is facilitated by this tool.

Data access and visualization

A powerful search tool allows simple (strings and combination of strings) and complex (default or user-defined algorithms) searches in one or more of the database fields. The information in all or a user-defined selection of database fields can be visualized on the screen in respective windows (Fig. 2). The layout of the visualization windows, i.e. selection, size and positioning of database field entries, can be customized by the user. Simple algorithms are included.

An alternative method of data access and visualization is provided by the ARB main window. Phylogenetic trees generated by intrinsic ARB tree reconstruction tools or imported from external sources are stored in the database and can be visualized in different formats within the ARB main window (Fig. 3). Any (combination of) database field entries can be visualized at the terminal nodes of the tree currently shown. Selection and order of data entries, the results of data analysis or extractions to be visualized are defined by the node display settings (NDS) tool. Irrespective of the visualization mode used, the ARB search and replacement tool (SRT) and ARB command interpreter (ACI) can be used for extraction of combinations of (sub)strings as well as for analysis of database field entries, respectively.

Figure 2. Example of a data visualization window. Bibliographic data stored in respective database fields are shown. The selection of database fields, extraction of data and the layout of the visualization window can be customized by the user.

Sequence editors

The sequence data can be visualized and modified with a powerful editor (Fig. 4). The original data as well as virtually transformed data (e.g. purine–pyrimidine or simplified amino acid presentation) are displayed in user-defined color codes. Keyboard customization is possible for data entry and modification. Two different editing modes can be selected. The 'Align' mode allows only insertion/removal of alignment gaps and movement of sequence characters. In addition to these functions, character changes can be performed in the 'Edit' mode. The rights to overcome protection of the individual sequence entries can be given for the two modes independently. This helps to prevent unwanted character changes when manually modifying the sequence data or alignment.

Sets of search strings can be defined and optionally stored. Their occurrence can be visualized within the displayed sequences by user-defined background colors. Virtual compression (removal of alignment gaps common to all or a certain fraction of the displayed sequences) is possible. This makes data handling more convenient in the case of large insertions occurring in only part of the selected sequences. Groups of sequences can be interactively defined or are automatically shown if defined in the phylogenetic trees. Consensus sequences are determined for each defined group of sequences according to default or user-defined criteria and optionally visualized along with or instead of the individual sequences. This consensus can be edited and changes made concern any sequence in the group. A special feature of the editor is the simultaneous secondary structure check if rRNA (gene) data are visualized. Symbols indicating the presence or absence as well as the character of base pairing are shown below the individual characters and immediately refreshed during sequence editing. A (three-domain) consensus secondary structure mask established according to commonly accepted secondary structure models (11) functions as a guide for this tool. Thereby the users are strongly supported with regard to the evaluation of sequences, alignment and probe targets.

The ARB secondary structure editor (Fig. 5) fits any sequence into the common consensus model. The particular sequence to be visualized is selected by cursor positioning in the primary structure editor. The layout of the structure, i.e. color coding of base-paired, non-paired and loop positions as well as probe target sites, can be customized according to the user's preferences. Any of the search strings activated in the primary structure editor can be indicated in the secondary structure model. This helps the experts to evaluate probe targets. The evaluation of target position with respect to higher-order rRNA structure is of importance especially when probes are used for in situ cell hybridizations (12–14). The structure can be exported to xfig, a simple open-source graphics program (http://www.xfig.org), for further modification and/or transformation into various formats.

Figure 3. The ARB main window showing part of an ARB parsimony-generated dendrogram. The rectangles represent 'online compressed' monophyletic groups which can be 'unfolded' by mouse click. Database field entries such as taxonomic name, public database accession number and strain designation as reported in EMBL (1), RDP (3) and the European rRNA databases (DEW) (4,5) are visualized at the terminal nodes of the 'unfolded' Desulfohalobiaceae.

Profiles, masks and filters

Conservation or base composition profiles, higher-order structure masks and filters including or excluding particular alignment positions are important tools for sequence data analyses, especially for phylogenetic inference (15). The ARB package provides tools for determining such profiles based upon the full database or user-defined subsets. The underlying methods range from simple character counting to maximum parsimony-based column statistics. These profiles, masks and filters are stored in the central database as so-called sequence associated information (SAI) and can be visualized and modified by the primary structure editor. The filter selection tool allows not only the choice of sets of particular filters but also fine tuning with respect to the inclusion or exclusion of alignment positions in the case of multiple character filters.

Figure 4. The ARB primary structure editor. As an example for highlighting a search string a probe target site is shown by background color. Perfect and mismatched pairing is color coded as well.

Phylogenetic treeing

As mentioned in Materials and Methods, software implementations of several alternative treeing methods are incorporated in the package. They operate as intrinsic tools with all the respective ARB components and database elements such as alignment and filters. The central treeing tool of the package, ARB-parsimony, is a special development for the handling of several thousand sequences (more than 30 000 in the current small subunit rRNA ARB database). New sequences are successively added to an existing tree according to the parsimony criterion. An intrinsic software component superimposes branch length on the parsimony-generated tree topology. These branch lengths reflect the significance of the individual 'tetra-furcations' by expressing the difference of the most parsimonious and the two least parsimonious solutions when performing nearest-neighbor interchange (NNI) of adjacent branches or subtrees. These relative distances are standardized according to a distance matrix deduced from primary structure comparison. Thus branch lengths in ARB-parsimony generated trees in the first instance visualize the significance of topologies, and in the second instance reflect a degree of estimated sequence divergence. A prominent feature of ARB-parsimony is the possibility of adding sequences to an existing tree without allowing any changes in the initial tree. This enables the user to reconstruct and optimize an initial tree based upon the best (full sequences) and most comprehensive (wide variation of phylogenetic levels) sequence data and also to include partial sequences without perturbing the initial tree topology. The second peculiarity of the treeing software concerns the tree optimization performing cycles of NNI and Kernighan–Lin (KL) (16) tree modifications. This optimization can not only be applied to the complete tree but also confined to user-selected subtrees. Thus tree optimization is possible applying the appropriate filters for the respective phylogenetic levels and groups. In this context, it is of interest that, while performing stepwise optimizations, the intermediates are stored until the user defines the version to be permanently stored in the database. Furthermore, different trees generated applying various parameters can be permanently kept in the database and optionally used for data visualization in the ARB main window.

The positional tree server

The ARB positional tree (PT) server, once established, allows rapid finding of sequence identity or peculiarity. Thus it is the central tool for fast search of closest relatives for automated sequence alignment or to define diagnostic sequence stretches for primer and probe design. Establishing a positional tree server of any oligonucleotide sequence up to 20mers occurring in the underlying database and assignment of the individual oligonucleotides to the sequences or organisms containing them is the basis for these procedures. PT-server-based analyses do not rely upon aligned sequences. The PT server is not provided with the ARB program package or ARB database. It has to be established for the respective database locally. The PT server can be used by multiple users on the local machine or via network. The computing time for generating the respective files depends strongly on the size and structure of the individual database as well as the performance of the machine used. The advantages of these logarithmic algorithms over linear ones such as Blast (17) or Fasta (18) are effectiveness and rapidity.

Figure 5. Secondary structure editor. The sequence selected in the primary structure editor (Fig. 4) is automatically fitted into a consensus secondary structure model.

Sequence alignment

As mentioned in Materials and Methods, for de novo generation of a nucleic or amino acid sequence alignment ClustalW (19) as implemented in the ARB package can be used. However, in most cases new sequence entries have to be integrated in an already existing database of aligned sequences. For this purpose the ARB fast aligner was developed and included. This aligner uses a (set of) selected aligned reference sequences as template(s) for rapid integration of a (set of) unaligned sequence(s). Individual entries, i.e. sequences or consensus defined by the user or automatically determined by PT-server-based search for most similar reference sequences, are used as the template.

In the case of protein coding nucleic acid sequences the alignment is usually optimized on the amino acid level. The underlying nucleic acid alignment can then be adapted to the amino acid alignment by a back-translation-based tool taking into consideration all known codon usages.

Figure 6. Results of probe design and evaluation. Part of the primary structure alignment containing the probe target site is shown for the target organism Desulfohalobium retbaense and the non-target organisms containing the most similar sequence stretches.

Probe design and evaluation

Currently, taxon- or gene-specific probe design plays a central role in many molecular biological research and analysis projects, for example the identification and detection of organisms in complex environmental samples or expression studies within the scope of genome projects. Algorithms of the ARB programs 'Probe Design' and 'Probe Match' search the PT server to identify short (10–100 monomers) diagnostic sequence stretches which are evaluated against the background of all full and partial sequences in the respective database the PT server has been built from. In principle, no alignment of the sequence data is needed for specific probe design. However, in the case of taxon-specific probes alignment and phylogenetic analyses are necessary to allow defining groups of phylogenetically (taxonomically) related organisms as the targets of specific probes. The design of taxon-specific oligonucleotide probes with ARB is performed in three steps. First, the user selects the organism or a group of organisms for which he or she wants to design a diagnostic probe. Secondly, the software 'Probe Design' searches the PT server for potential target sites. The results are shown in a ranked list of proposed targets, probes and additional information. The ranking is according to several compositional and thermodynamic criteria (12,20). Thirdly, the proposed oligonucleotide probes are evaluated against the whole database by using the program 'Probe Match'. Local alignments are determined between the probe target sequence(s) and the most similar reference sequences (optionally from no to five mismatches) in the respective database (Fig. 6). Furthermore, these sequence strings can automatically be visualized in the primary and secondary structure editors. The latter information is of particular importance when designing probes for in situ cell hybridization. A tool for visualization of accessibility maps (13,14) in the primary and secondary structure editors is under development.

A special advance is the ARB multiprobe software component. It determines sets of up to five probes optimally identifying the target group (21). These probe sets can be used for multiple fluorescence in situ hybridization experiments.

Data import and export

The sequence as well as additional data can be imported and exported in commonly used flat file formats. The parsing from and to tagged flat files can be customized by advanced users. There is also a tool for automated completion of database submission forms for those users determining sequences on their own.

Availability and documentation

Although so far not officially published, previous versions of the software package and databases have been available for several years and the software has been used worldwide. The ARB package provides a comprehensive set of tools to support the user's work. However, depending upon the user's interests, some knowledge of the basics of sequence alignment, phylogenetic analyses or probe hybridization is needed. Some familiarity with UNIX operating systems is advantageous. During installation, environment variables, paths, permissions and aliases have to be defined. Instructions for installation can be downloaded from http://www.arb-home.de/download/ARB/documentation/ARB_install.pdf. Self-installing versions of the recent program releases are currently available for Linux systems only. The binaries, source code and some documentation are available at the download area of the ARB web site http://www.arb-home.de/download/. An HTTP browser is required because FTP connections are not accepted. Furthermore, there is an email forum of the worldwide ARB users community. Subscription is needed for those interested in joining (subscribe@arb-home.de). Although a comprehensive formal handbook is not yet available, manuals, instructions and problem solutions are available from the ARB homepage and by contacting the ARB staff and user community via the email forum. ARB sequence databases are currently available for small subunit rRNA, and those for other conserved genes will be provided soon. Checking for new releases and updates should be done at http://www.arb-home.de/downloads/databases/.

Systems, hardware and processing time requirements

The ARB group provides tested versions for SuSE LINUX and Sun Solaris systems. According to information provided by users, the LINUX version also runs on Redhat and Mandrake LINUX systems. For running ARB on Mac OSX see http://www.microbiol.unimelb.edu.au/micro/staff/mds/ARB_OSX/ARB_to_MacOSX.html. With respect to hardware requirements, main memory is more important than processor performance. The users among the wet laboratory partners of the ARB group are performing their analyses on dual Pentium III PCs with 1 GB memory and 1 GB swap space. The background storage requirements depend mainly upon the number and size of the user ARB databases and PT servers. The sizes of the installed program package, the current small subunit rRNA database and the respective PT server files are about 25 MB, 80 MB and 350 MB, respectively. Twenty-one-inch monitors at 1600 × 1200 are recommended. However, ARB is also routinely used on laptops or older PCs and workstations with less memory and monitors with lower resolution.

Table 1 gives some information on processing times for some of the major functions in ARB: generation of the PT server, automated sequence alignment and phylogenetic treeing. Run-time measurements were performed on a dual-processor (Intel Xeon, 2.6 GHz) PC equipped with 2 GB of RAM running SuSE Linux 8.2. The ARB databases ssu_jan03.arb (25 743 almost complete small subunit rRNA sequences, 40 000 alignment positions) and ssu_10k.arb, ssu_1k.arb, ssu_100.arb (subsets comprising 10 000, 1000 and 100 sequences) used for these PT server building studies are available at http://www.arb-home.de. For adding 10, 100 and 1000 sequences to the database alignment or tree, ssu_jan03.arb was used as basis. The ARB aligner was used in combination with a PT-server-directed search for most similar reference sequences. For phylogenetic treeing the respective sequences/organisms were added to a tree comprising all entries of the database applying the ARB parsimony tool.

Table 1. Run time studies for PT server generation, automated alignment and parsimony-based treeing

No. of sequences    10       100           1000     10 000   25 743
PT server           –        5 s           22 s     3 min    7 min 30 s
Add to alignment    4 s      38 s          6 min
Add to tree         1 min    9 min 15 s    2 h

Datasets varying with respect to the number of sequence entries were used for PT server generation. The most comprehensive of these datasets comprising 25 743 entries and 40 000 alignment positions was used as a template for inserting the specified numbers of sequences into the database alignment or tree. For treeing, 2141 alignment columns were included.

Future developments

The ongoing developments are focusing on two major tasks. First, a web tool is being developed that provides all potential probe target sites which can be derived from the current database version and which make sense phylogenetically (taxonomically). Users can not only search for hierarchical and multiple probes submitting names, strain designations or accession numbers of organisms as search strings, but can also send their own probe sequences for in silico evaluation. Second, the package is being adapted for handling and analysing databases of completed and annotated genomes. All ARB functionalities can be applied and genome maps can be used for visualization and data access. In accordance with the ARB concept of integrative databases experimental parameters and data can be stored and assigned to the individual genomes or genes.

Many users ask for a Windows-compatible ARB version. Although comprehensive software redesign would be desirable, the current capacity and funding of the ARB group do not allow doing this in reasonable time with a source code developed by many individual scientists and programmers.

ACKNOWLEDGEMENTS

The authors gratefully acknowledge F. O. Glöckner and R. Amann (Max Planck Institute for Marine Microbiology, Bremen, Germany) for redistributing the ARB database and software and answering user queries, as well as hosting and organizing ARB workshops. The authors thank Laura Schulz (Ludwig Maximilian University, Munich, Germany) for critical reading of the manuscript. ARB software development and database maintenance was partly supported by the European Union within the HRAMI project, by the German Research Foundation and by the German Ministry of Research and Education BIOLOG project.

REFERENCES

1. Stoesser,G., Baker,W., van den Broek,A., Camon,E., Garcia-Pastor,M., Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V. et al. (2002) The EMBL nucleotide sequence database. Nucleic Acids Res., 30, 21–26.
2. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.
3. Maidak,B.L., Cole,J.R., Lilburn,T.G., Parker,C.T.,Jr, Saxman,P.R., Farris,R.J., Garrity,G.M., Olsen,G.J., Schmidt,T.M. and Tiedje,J.M. (2001) The RDP-II (Ribosomal Database Project). Nucleic Acids Res., 29, 173–174.
4. Wuyts,J., Van de Peer,Y., Winkelmans,T. and De Wachter,R. (2002) The European database on small subunit ribosomal RNA. Nucleic Acids Res., 30, 183–185.
5. Wuyts,J., De Rijk,P., Van de Peer,Y., Winkelmans,T. and De Wachter,R. (2001) The European Large Subunit Ribosomal RNA Database. Nucleic Acids Res., 29, 175–177.
6. Felsenstein,J. (1989) PHYLIP - Phylogeny Inference Package (version 3.2). Cladistics, 5, 164–166.
7. Olsen,G.J., Matsuda,H., Hagstrom,R. and Overbeek,R. (1994) FastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci., 10, 41–48.
8. Adachi,J. and Hasegawa,M. (1996) Molphy Version 2.3, Programs for Molecular Phylogenetics Based on Maximum Likelihood. Technical Report, The Institute of Statistical Mathematics, Tokyo.
9. Strimmer,K. and von Haeseler,A. (1996) Quartet Puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol., 13, 964–969.
10. Stamatakis,A.P., Ludwig,T., Meier,H. and Wolf,M.J. Accelerating parallel maximum likelihood-based phylogenetic tree calculations using subtree equality vectors. Proc. Supercomputing Conference (SC2002), Baltimore, MD, Nov. 2002, IEEE Computer Society.
11. Cannone,J.J., Subramanian,S., Schnare,M.N., Collett,J.R., D'Souza,L.M., Du,Y., Feng,B., Lin,N., Madabusi,L.V., Müller,K.M. et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron and other RNAs. BioMed Central Bioinform., 3, 2.
12. Amann,R., Ludwig,W. and Schleifer,K.H. (1995) Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol. Rev., 59, 143–169.
13. Fuchs,B.M., Wallner,G., Beisker,W., Schwippl,L., Ludwig,W. and Amann,R. (1998) Flow cytometric analysis of the in situ accessibility of Escherichia coli 16S rRNA for fluorescently labeled oligonucleotide probes. Appl. Environ. Microbiol., 64, 4973–4982.
14. Fuchs,B.M., Syutsubo,K., Ludwig,W. and Amann,R. (2001) In situ accessibility of the Escherichia coli 23S rRNA for fluorescently labeled oligonucleotide probes. Appl. Environ. Microbiol., 67, 961–968.
15. Ludwig,W. and Klenk,H.P. (2001) Overview: a phylogenetic backbone and taxonomic framework for prokaryotic systematics. In Garrity,G. (ed.) Bergey's Manual of Systematic Bacteriology (2nd edn). Springer, New York, pp. 49–65.
16. Kernighan,B.W. and Lin,S. (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J., 49, 291–307.
17. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
18. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
19. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680.
20. Amann,R. and Ludwig,W. (2000) Ribosomal RNA-targeted nucleic acid probes for studies in microbial ecology. FEMS Microbiol. Rev., 24, 555–565.
21. Ludwig,W., Amann,R., Martinez-Romero,E., Schönhuber,W., Bauer,S., Neef,A. and Schleifer,K.H. (1998) rRNA based identification systems for Rhizobia and other bacteria. Plant Soil, 204, 1–9.



Publisher
Oxford University Press
Copyright
Oxford University Press
ISSN
0305-1048
eISSN
1362-4962
DOI
10.1093/nar/gkh293
pmid
14985472

Abstract

Published online February 25, 2004
Nucleic Acids Research, 2004, Vol. 32, No. 4 1363–1371
DOI: 10.1093/nar/gkh293

Wolfgang Ludwig*, Oliver Strunk, Ralf Westram, Lothar Richter, Harald Meier¹, Yadhukumar, Arno Buchner, Tina Lai, Susanne Steppi, Gangolf Jobb¹, Wolfram Förster¹, Igor Brettske, Stefan Gerber, Anton W. Ginhart¹, Oliver Gross, Silke Grumann¹, Stefan Hermann¹, Ralf Jost¹, Andreas König¹, Thomas Liss¹, Ralph Lüßmann¹, Michael May¹, Björn Nonhoff¹, Boris Reichel¹, Robert Strehlow¹, Alexandros Stamatakis¹, Norbert Stuckmann¹, Alexander Vilbig¹, Michael Lenke¹, Thomas Ludwig², Arndt Bode¹ and Karl-Heinz Schleifer

Lehrstuhl für Mikrobiologie, Technische Universität München, D-853530 Freising, Germany
¹Lehrstuhl für Rechnertechnik und Rechnerorganisation, Parallelrechnerarchitektur, Technische Universität München, D-85748 Garching, Germany
²Institut für Informatik, Ruprecht-Karls-Universität Heidelberg, D-69120 Heidelberg, Germany

Received January 13, 2004; Revised and Accepted January 28, 2004

*To whom correspondence should be addressed. Tel: +8161 71 5451; Fax: +8161 71 5475; Email: ludwig@mikro.biologie.tu-muenchen.de

Nucleic Acids Research, Vol. 32 No. 4 © Oxford University Press 2004; all rights reserved

ABSTRACT

The ARB (from Latin arbor, tree) project was initiated almost 10 years ago. The ARB program package comprises a variety of directly interacting software tools for sequence database maintenance and analysis which are controlled by a common graphical user interface. Although it was initially designed for ribosomal RNA data, it can be used for any nucleic and amino acid sequence data as well. A central database contains processed (aligned) primary structure data. Any additional descriptive data can be stored in database fields assigned to the individual sequences or linked via local or worldwide networks. A phylogenetic tree visualized in the main window can be used for data access and visualization. The package comprises additional tools for data import and export, sequence alignment, primary and secondary structure editing, profile and filter calculation, phylogenetic analyses, specific hybridization probe design and evaluation and other components for data analysis. Currently, the package is used by numerous working groups worldwide.

INTRODUCTION

The ARB (from Latin arbor, tree) project was established as an interdisciplinary bioinformatics initiative of the Lehrstuhl für Mikrobiologie and the Lehrstuhl für Rechnertechnik und Rechnerorganisation, Parallelrechnerarchitektur of the Technical University of Munich almost 10 years ago. At that time, comparative sequence analysis of the small subunit rRNAs or the respective genes had already been established as the most commonly applied approach for phylogeny inference as well as microbial taxonomy and identification. Furthermore, improved and automated sequencing techniques promoted a rapid increase in the number of small subunit rRNA primary structure data available from data sources such as GenBank (2) or EBI (European Bioinformatics Institute) (1). However, these databases provide only raw data and additional descriptive information which cannot interactively be extended by the user.

Although the Ribosomal Database Project (RDP) (3) and the Antwerpen projects (4,5) offered datasets of aligned sequences, data handling and analysis remained difficult for scientists applying rRNA-based methods. A variety of individual software tools for sequence editing, alignment and phylogenetic analysis were available from the different database projects (1–4) and other sources (6) (http://www.gcg.com). However, a comprehensive package of interacting tools was missing. Furthermore, the number of different input and output formats which had to be used reflected the variety of individual software programs which had to be applied sequentially, and inconveniently, to achieve a comprehensive analysis of molecular data. Unfortunately, a promising initiative, the Genetic Data Environment (GDE) project (http://bimas.dcrt.nih.gov/gde_sw.html), focusing on the development of a common graphical interface for data handling and analysis was not continued. Consequently, microbiologists and computer scientists at the Technical University of Munich decided to develop their own software package capable of properly managing the upcoming data flood.

The two major tasks according to the ARB concept, formulated in the early days of the project and maintained to the present, are (i) the maintenance of a structured integrative secondary database combining processed primary structures and any type of additional data assigned to the individual sequence entries and (ii) a comprehensive selection of software tools directly interacting with one another as well as the central database which are controlled via a common graphical interface. Software and rRNA databases are publicly accessible (http://www.arb-home.de) and have been in use worldwide for several years.

MATERIALS AND METHODS

Sequence data

The raw data used to establish databases and perform data analysis were taken from our own sequencing projects, provided by other research groups or periodically retrieved from public data sources such as the EBI (1), GenBank (2), the RDP (3) and the Antwerpen databases for small (4) and large (5) subunit RNAs. Complete releases were downloaded from the latter two locations. The search and retrieval tools of the former two institutions were used to select and download the primary structure and additional information on rRNA or other genes.
Furthermore, sequence data determined at the Lehrstuhl für Mikrobiologie of the Technical University of Munich or by other groups were imported and processed.

Operating systems and programming languages

The ARB software was developed for UNIX systems and their derivatives. Currently, the development is performed using SuSE LINUX (http://www.suse.com) running on PCs. The greater part of the source code was written in C++ and C; some parts were written in Perl and other script languages. The graphical environment is based upon the Open Motif library.

Integrated external software tools

Functionalities from the GDE project (http://bimas.dcrt.nih.gov/gde_sw.html) concerning sequence editing were adopted and implemented in the ARB package. Some programs of the PHYLIP package for phylogeny inference (6) were incorporated as components directly interacting with the central database. Additionally, fastDNAml (7) and protml of the Molphy package (8), components of the Puzzle package (9) and AxML, a new accelerated fastDNAml derivative (10), were included for maximum-likelihood-based phylogenetic analyses of nucleic and amino acid sequence data.

RESULTS AND DISCUSSION

A selection of tools and functionalities of the ARB package will be briefly described in the following sections. The network in Figure 1 schematically visualizes these tools and their interactions with one another and the central database. Most tools developed for ARB directly interact with a copy of the database in the main storage, whereas the integrated second-party tools are provided with data from ARB and their results are written back to the database. Thus any changes or rearrangements are immediately known to the peripheral software components.

Figure 1. The interacting components and tools of the ARB software package and database.

The central database

The sequences representing organisms, genes or gene products are stored in individual database fields as the central components and a unique identifier (short_name) is automatically generated and assigned to each of them. Databases created using ARB are hierarchically structured. Following the ARB concept of an integrative database, any type of additional data can be assigned to the individual sequence data entry and stored within default or user-defined database fields. These data can either be kept as intrinsic components of the database or linked to it via local networks or the Internet. In the latter case the path to the respective file or the URL of an external database, optionally including commands and search strings, has to be stored within the respective ARB database fields. The designations and hierarchy of the database fields can be customized by the user. The default structuring is according to the phylogeny of the organisms derived from the respective sequence data. However, it can also be changed according to other criteria defined by database field entries. This hierarchy is used by special algorithms for highly effective data compression. Different protection levels (0–6) can be assigned to the individual database fields. Database as well as security management is facilitated by this tool.

Data access and visualization

A powerful search tool allows simple (strings and combination of strings) and complex (default or user-defined algorithms) searches in one or more of the database fields. The information in all or a user-defined selection of database fields can be visualized on the screen in respective windows (Fig. 2). The layout of the visualization windows, i.e. selection, size and positioning of database field entries, can be customized by the user. Simple algorithms are included.

Figure 2. Example of a data visualization window. Bibliographic data stored in respective database fields are shown. The selection of database fields, extraction of data and the layout of the visualization window can be customized by the user.

An alternative method of data access and visualization is provided by the ARB main window. Phylogenetic trees generated by intrinsic ARB tree reconstruction tools or imported from external sources are stored in the database and can be visualized in different formats within the ARB main window (Fig. 3). Any (combination of) database field entries can be visualized at the terminal nodes of the tree currently shown. Selection and order of data entries, the results of data analysis or extractions to be visualized are defined by the node display settings (NDS) tool. Irrespective of the visualization mode used, the ARB search and replacement tool (SRT) and ARB command interpreter (ACI) can be used for extraction of combinations of (sub)strings as well as for analysis of database field entries, respectively.

Figure 3. The ARB main window showing part of an ARB parsimony-generated dendrogram. The rectangles represent 'online compressed' monophyletic groups which can be 'unfolded' by mouse click. Database field entries such as taxonomic name, public database accession number and strain designation as reported in EMBL (1), RDP (3) and the European rRNA databases (DEW) (4,5) are visualized at the terminal nodes of the 'unfolded' Desulfohalobiaceae.

Sequence editors

The sequence data can be visualized and modified with a powerful editor (Fig. 4). The original data as well as virtually transformed data (e.g. purine–pyrimidine or simplified amino acid presentation) are displayed in user-defined color codes. Keyboard customization is possible for data entry and modification. Two different editing modes can be selected. The 'Align' mode allows only insertion/removal of alignment gaps and movement of sequence characters. In addition to these functions, character changes can be performed in the 'Edit' mode. The rights to overcome protection of the individual sequence entries can be given for the two modes independently. This helps to prevent unwanted character changes when manually modifying the sequence data or alignment.

Figure 4. The ARB primary structure editor. As an example for highlighting a search string a probe target site is shown by background color. Perfect and mismatched pairing is color coded as well.

Sets of search strings can be defined and optionally stored. Their occurrence can be visualized within the displayed sequences by user-defined background colors. Virtual compression (removal of alignment gaps common to all or a certain fraction of the displayed sequences) is possible. This makes data handling more convenient in the case of large insertions occurring in only part of the selected sequences. Groups of sequences can be interactively defined or are automatically shown if defined in the phylogenetic trees. Consensus sequences are determined for each defined group of sequences according to default or user-defined criteria and optionally visualized along with or instead of the individual sequences. This consensus can be edited and changes made concern any sequence in the group.

A special feature of the editor is the simultaneous secondary structure check if rRNA (gene) data are visualized. Symbols indicating the presence or absence as well as the character of base pairing are shown below the individual characters and immediately refreshed during sequence editing. A (three-domain) consensus secondary structure mask established according to commonly accepted secondary structure models (11) functions as a guide for this tool. Thereby the users are strongly supported with regard to the evaluation of sequences, alignment and probe targets.

The ARB secondary structure editor (Fig. 5) fits any sequence into the common consensus model. The particular sequence to be visualized is selected by cursor positioning in the primary structure editor. The layout of the structure, i.e. color coding of base-paired, non-paired and loop positions as well as probe target sites, can be customized according to the user's preferences. Any of the search strings activated in the primary structure editor can be indicated in the secondary structure model. This helps the experts to evaluate probe targets. The evaluation of target position with respect to higher-order rRNA structure is of importance especially when probes are used for in situ cell hybridizations (12–14). The structure can be exported to xfig, a simple open-source graphics program (http://www.xfig.org), for further modification and/or transformation into various formats.

Figure 5. Secondary structure editor. The sequence selected in the primary structure editor (Fig. 4) is automatically fitted into a consensus secondary structure model.

Profiles, masks and filters

Conservation or base composition profiles, higher-order structure masks and filters including or excluding particular alignment positions are important tools for sequence data analyses, especially for phylogenetic inference (15). The ARB package provides tools for determining such profiles based upon the full database or user-defined subsets. The underlying methods range from simple character counting to maximum parsimony-based column statistics. These profiles, masks and filters are stored in the central database as so-called sequence associated information (SAI) and can be visualized and modified by the primary structure editor. The filter selection tool allows not only choice of sets of particular filters but also fine tuning with respect to the inclusion or exclusion of alignment positions in the case of multiple character filters.

Phylogenetic treeing

As mentioned in Materials and Methods, software implementations of several alternative treeing methods are incorporated in the package. They operate as intrinsic tools with all the respective ARB components and database elements such as alignment and filters. The central treeing tool of the package, ARB-parsimony, is a special development for the handling of several thousand sequences (more than 30 000 in the current small subunit rRNA ARB database). New sequences are successively added to an existing tree according to the parsimony criterion. An intrinsic software component superimposes branch length on the parsimony-generated tree topology. These branch lengths reflect the significance of the individual 'tetra-furcations' by expressing the difference of the most parsimonious and the two least parsimonious solutions when performing nearest-neighbor interchange (NNI) of adjacent branches or subtrees. These relative distances are standardized according to a distance matrix deduced from primary structure comparison. Thus branch lengths in ARB-parsimony generated trees in the first instance visualize the significance of topologies, and in the second instance reflect a degree of estimated sequence divergence.

A prominent feature of ARB-parsimony is the possibility of adding sequences to an existing tree without allowing any changes in the initial tree. This enables the user to reconstruct and optimize an initial tree based upon the best (full sequences) and most comprehensive (wide variation of phylogenetic levels) sequence data and also to include partial sequences without perturbing the initial tree topology. The second peculiarity of the treeing software concerns the tree optimization performing cycles of NNI and Kernighan–Lin (KL) (16) tree modifications. This optimization can not only be applied to the complete tree but also confined to user-selected subtrees. Thus tree optimization is possible applying the appropriate filters for the respective phylogenetic levels and groups. In this context, it is of interest that, while performing stepwise optimizations, the intermediates are stored until the user defines the version to be permanently stored in the database. Furthermore, different trees generated applying various parameters can be permanently kept in the database and optionally used for data visualization in the ARB main window.

The positional tree server

The ARB positional tree (PT) server, once established, allows rapid finding of sequence identity or peculiarity. Thus it is the central tool for fast search of closest relatives for automated sequence alignment or to define diagnostic sequence stretches for primer and probe design. Establishing a positional tree server of any oligonucleotide sequence up to 20mers occurring in the underlying database and assignment of the individual oligonucleotides to the sequences or organisms containing them is the basis for these procedures. PT-server-based analyses do not rely upon aligned sequences. The PT server is not provided with the ARB program package or ARB database. It has to be established for the respective database locally. The PT server can be used by multiple users on the local machine or via network. The computing time for generating the respective files depends strongly on the size and structure of the individual database as well as the performance of the machine used. The advantages of these logarithmic algorithms over linear ones such as Blast (17) or Fasta (18) are effectiveness and rapidity.

Sequence alignment

As mentioned in Materials and Methods, for de novo generation of a nucleic or amino acid sequence alignment ClustalW (19) as implemented in the ARB package can be used. However, in most cases new sequence entries have to be integrated in an already existing database of aligned sequences. For this purpose the ARB fast aligner was developed and included. This aligner uses a (set of) selected aligned reference sequences as template(s) for rapid integration of a (set of) unaligned sequence(s). Individual entries, i.e. sequences or consensus defined by the user or automatically determined by PT-server-based search for most similar reference sequences, are used as the template.

In the case of protein coding nucleic acid sequences the alignment is usually optimized on the amino acid level. The underlying nucleic acid alignment can then be adapted to the amino acid alignment by a back-translation-based tool taking into consideration all known codon usages.

Probe design and evaluation

Currently, taxon- or gene-specific probe design certainly plays a central role in many molecular biological research and analysis projects, for example the identification and detection of organisms in complex environmental samples or expression studies within the scope of genome projects. Algorithms of the ARB programs 'Probe Design' and 'Probe Match' search the PT server to identify short (10–100 monomers) diagnostic sequence stretches which are evaluated against the background of all full and partial sequences in the respective database the PT server has been built from. In principle, no alignment of the sequence data is needed for specific probe design. However, in the case of taxon-specific probes alignment and phylogenetic analyses are necessary to allow defining groups of phylogenetically (taxonomically) related organisms as the targets of specific probes.

The design of taxon-specific oligonucleotide probes with ARB is performed in three steps. First, the user selects the organism or a group of organisms for which he or she wants to design a diagnostic probe. Secondly, the software 'Probe Design' searches the PT server for potential target sites. The results are shown in a ranked list of proposed targets, probes and additional information. The ranking is according to several compositional and thermodynamic criteria (12,20). Thirdly, the proposed oligonucleotide probes are evaluated against the whole database by using the program 'Probe Match'. Local alignments are determined between the probe target sequence(s) and the most similar reference sequences (optionally from no to five mismatches) in the respective database (Fig. 6). Furthermore, these sequence strings can automatically be visualized in the primary and secondary structure editors. The latter information is of particular importance when designing probes for in situ cell hybridization. A tool for visualization of accessibility maps (13,14) in the primary and secondary structure editors is under development.

Figure 6. Results of probe design and evaluation. Part of the primary structure alignment containing the probe target site is shown for the target organism Desulfohalobium retbaense and the non-target organisms containing the most similar sequence stretches.

A special advance is the ARB multiprobe software component. It determines sets of up to five probes optimally identifying the target group (21). These probe sets can be used for multiple fluorescence in situ hybridization experiments.

Data import and export

The sequence as well as additional data can be imported and exported in commonly used flat file formats. The parsing from and to tagged flat files can be customized by advanced users. There is also a tool for automated completion of database submission forms for those users determining sequences on their own.

Availability and documentation

Although so far not officially published, previous versions of the software package and databases have been available for several years and the software has been used worldwide. The ARB package provides a comprehensive set of tools to support the user's work. However, depending upon the user's interests, some knowledge of the basics of sequence alignment, phylogenetic analyses or probe hybridization is needed. Some familiarity with UNIX operating systems is advantageous. During installation, environment variables, paths, permissions and aliases have to be defined. Instructions for installation can be downloaded from http://www.arb-home.de/download/ARB/documentation/ARB_install.pdf. Self-installing versions of the recent program releases are currently available for Linux systems only. The binaries, source code and some documentation are available at the download area of the ARB web site http://www.arb-home.de/download/. An HTTP browser is required as ftp connections are not accepted. Furthermore, there is an email forum of the worldwide ARB user community. Subscription is needed for those interested in joining (subscribe@arb-home.de). Although a comprehensive formal handbook is not yet available, manuals, instructions and problem solutions are available from the ARB homepage and by contacting the ARB staff and user community via the email forum. ARB sequence databases are currently available for small subunit rRNA, and those for other conserved genes will be provided soon. Checking for new releases and updates should be done at http://www.arb-home.de/downloads/databases/.

Systems, hardware and processing time requirements

The ARB group provides tested versions for SuSE LINUX and Sun Solaris systems. According to information provided by users, the LINUX version also runs on Redhat and Mandrake LINUX systems. For running ARB on Mac OSX see http://www.microbiol.unimelb.edu.au/micro/staff/mds/ARB_OSX/ARB_to_MacOSX.html. With respect to hardware requirements, main memory is more important than processor performance. The users among the wet laboratory partners of the ARB group are performing their analyses on dual Pentium III PCs with 1 GB memory and 1 GB swap space. The background storage requirements depend mainly upon the number and size of the user ARB databases and PT servers. The sizes of the installed program package, the current small subunit rRNA database and the respective PT server files are about 25 MB, 80 MB and 350 MB, respectively. Twenty-one-inch monitors at 1600 × 1200 are recommended. However, ARB is also routinely used on laptops or older PCs and workstations with less memory and monitors with lower resolution.

Table 1 gives some information on processing times for some of the major functions in ARB: generation of the PT server, automated sequence alignment and phylogenetic treeing. Run-time measurements were performed on a dual-processor (Intel Xeon™, 2.6 GHz) PC equipped with 2 GB RAM running SuSE Linux 8.2. The ARB databases ssu_jan03.arb (25 743 almost complete small subunit rRNA sequences, 40 000 alignment positions) and ssu_10k.arb, ssu_1k.arb, ssu_100.arb (subsets comprising 10 000, 1000 and 100 sequences) used for these PT server building studies are available at http://www.arb-home.de. For adding 10, 100 and 1000 sequences to the database alignment or tree, ssu_jan03.arb was used as basis. The ARB aligner was used in combination with a PT-server-directed search for most similar reference sequences. For phylogenetic treeing the respective sequences/organisms were added to a tree comprising all entries of the database applying the ARB parsimony tool.

Table 1. Run time studies for PT server generation, automated alignment and parsimony-based treeing

No. of sequences    10       100          1000     10 000   25 743
PT server           –        5 s          22 s     3 min    7 min 30 s
Add to alignment    4 s      38 s         6 min
Add to tree         1 min    9 min 15 s   2 h

Datasets varying with respect to the number of sequence entries were used for PT server generation. The most comprehensive of these datasets comprising 25 743 entries and 40 000 alignment positions was used as a template for inserting the specified numbers of sequences into the database alignment or tree. For treeing, 2141 alignment columns were included.

Future developments

The ongoing developments are focusing on two major tasks. First, a web tool is being developed that provides all potential probe target sites which can be derived from the current database version and make sense phylogenetically (taxonomically). Users can not only search for hierarchical and multiple probes by submitting names, strain designations or accession numbers of organisms as search strings, but can also send their own probe sequences for in silico evaluation. Second, the package is being adapted for handling and analysing databases of completed and annotated genomes. All ARB functionalities can be applied and genome maps can be used for visualization and data access. In accordance with the ARB concept of integrative databases, experimental parameters and data can be stored and assigned to the individual genomes or genes.

Many users ask for a Windows-compatible ARB version. Although comprehensive software redesign would be desirable, the current capacity and funding of the ARB group does not allow doing this in reasonable time with a source code developed by many individual scientists and programmers.

ACKNOWLEDGEMENTS

The authors highly acknowledge F. O. Glöckner and R. Amann (Max Planck Institute for Marine Microbiology, Bremen, Germany) for redistributing the ARB database and software and answering user queries, as well as hosting and organizing ARB workshops. The authors thank Laura Schulz (Ludwig Maximilian University, Munich, Germany) for critical reading of the manuscript. ARB software development and database maintenance was partly supported by the European Union within the HRAMI project, by the German Research Foundation and by the German Ministry of Research and Education BIOLOG project.

REFERENCES

1. Stoesser,G., Baker,W., van den Broek,A., Camon,E., Garcia-Pastor,M., Kanz,C., Kulikova,T., Leinonen,R., Lin,Q., Lombard,V. et al. (2002) The EMBL nucleotide sequence database. Nucleic Acids Res., 30, 21–26.
2. Benson,D.A., Karsch-Mizrachi,I., Lipman,D.J., Ostell,J., Rapp,B.A. and Wheeler,D.L. (2002) GenBank. Nucleic Acids Res., 30, 17–20.
3. Maidak,B.L., Cole,J.R., Lilburn,T.G., Parker,C.T.,Jr, Saxman,P.R., Farris,R.J., Garrity,G.M., Olsen,G.J., Schmidt,T.M. and Tiedje,J.M. (2001) The RDP-II (Ribosomal Database Project). Nucleic Acids Res., 29, 173–174.
4. Wuyts,J., Van de Peer,Y., Winkelmans,T. and De Wachter,R. (2002) The European database on small subunit ribosomal RNA. Nucleic Acids Res., 30, 183–185.
5. Wuyts,J., De Rijk,P., Van de Peer,Y., Winkelmans,T. and De Wachter,R. (2001) The European Large Subunit Ribosomal RNA Database. Nucleic Acids Res., 29, 175–177.
6. Felsenstein,J. (1989) PHYLIP - Phylogeny Inference Package (version 3.2). Cladistics, 5, 164–166.
7. Olsen,G.J., Matsuda,H., Hagstrom,R. and Overbeek,R. (1994) FastDNAml: a tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci., 10, 41–48.
8. Adachi,J. and Hasegawa,M. (1996) Molphy Version 2.3, Programs for Molecular Phylogenetics Based on Maximum Likelihood. Technical Report, The Institute of Statistical Mathematics, Tokyo.
9. Strimmer,K. and von Haeseler,A. (1996) Quartet Puzzling: a quartet maximum likelihood method for reconstructing tree topologies. Mol. Biol. Evol., 13, 964–969.
10. Stamatakis,A.P., Ludwig,T., Meier,H. and Wolf,M.J. Accelerating parallel maximum likelihood-based phylogenetic tree calculations using subtree equality vectors. Proc. Supercomputing Conference (SC2002), Baltimore, MD, Nov. 2002, IEEE Computer Society.
11. Cannone,J.J., Subramanian,S., Schnare,M.N., Collett,J.R., D'Souza,L.M., Du,Y., Feng,B., Lin,N., Madabusi,L.V., Müller,K.M. et al. (2002) The comparative RNA Web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron and other RNAs. BioMed Central Bioinform., 3, 2.
12. Amann,R., Ludwig,W. and Schleifer,K.H. (1995) Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol. Rev., 59, 143–169.
13. Fuchs,B.M., Wallner,G., Beisker,W., Schwippl,L., Ludwig,W. and Amann,R. (1998) Flow cytometric analysis of the in situ accessibility of Escherichia coli 16S rRNA for fluorescently labeled oligonucleotide probes. Appl. Environ. Microbiol., 64, 4973–4982.
14. Fuchs,B.M., Syutsubo,K., Ludwig,W. and Amann,R. (2001) In situ accessibility of the Escherichia coli 23S rRNA for fluorescently labeled oligonucleotide probes. Appl. Environ. Microbiol., 67, 961–968.
15. Ludwig,W. and Klenk,H.P. (2001) Overview: a phylogenetic backbone and taxonomic framework for prokaryotic systematics. In Garrity,G. (ed.) Bergey's Manual of Systematic Bacteriology (2nd edn). Springer, New York, pp. 49–65.
16. Kernighan,B.W. and Lin,S. (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst. Tech. J., 49, 291–307.
17. Altschul,S.F., Madden,T.L., Schaffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.
18. Pearson,W.R. and Lipman,D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.
19. Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment. Comput. Appl. Biosci., 8, 189–191.
20. Amann,R. and Ludwig,W. (2000) Ribosomal RNA-targeted nucleic acid probes for studies in microbial ecology. FEMS Microbiol. Rev., 24, 555–565.
21. Ludwig,W., Amann,R., Martinez-Romero,E., Schönhuber,W., Bauer,S., Neef,A. and Schleifer,K.H. (1998) rRNA based identification systems for Rhizobia and other bacteria. Plant Soil, 204, 1–9.
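
Illustrative sketch. The sections on the positional tree server and on probe design describe assigning every oligonucleotide in the database to the sequences that contain it, then looking for stretches that are diagnostic for a target group and checking them against all other entries with a limited number of mismatches. The following Python fragment is a minimal sketch of that idea only. It uses a plain dictionary rather than the PT server's actual data structure, and the function names, the toy sequences and the choice of 6-mers are assumptions made for illustration; they are not part of ARB.

```python
from collections import defaultdict


def build_oligo_index(sequences, k):
    """Map every k-mer to the set of sequence names that contain it.

    `sequences` is a dict of name -> unaligned sequence string.
    """
    index = defaultdict(set)
    for name, seq in sequences.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].add(name)
    return index


def diagnostic_oligos(index, targets):
    """Return k-mers found in every target sequence and in no other sequence."""
    targets = set(targets)
    return sorted(kmer for kmer, names in index.items() if names == targets)


def match_with_mismatches(index, probe_target, max_mismatches):
    """Return name -> smallest mismatch count for all sequences that hold a
    k-mer within `max_mismatches` substitutions of `probe_target`."""
    best = {}
    for kmer, names in index.items():
        if len(kmer) != len(probe_target):
            continue
        mismatches = sum(a != b for a, b in zip(kmer, probe_target))
        if mismatches <= max_mismatches:
            for name in names:
                best[name] = min(mismatches, best.get(name, mismatches))
    return best


# Hypothetical toy data: two target organisms and one non-target organism.
toy_sequences = {
    "orgA": "ACGTACGGTCA",
    "orgB": "TTGTACGGTCC",
    "orgC": "ACGTTCGGACA",
}
toy_index = build_oligo_index(toy_sequences, k=6)
candidates = diagnostic_oligos(toy_index, {"orgA", "orgB"})
hits = match_with_mismatches(toy_index, "GTACGG", max_mismatches=1)
```

In this toy case the candidate list holds the 6-mers shared by orgA and orgB only, and the mismatch search shows that orgC carries a site one substitution away from the candidate GTACGG, which is the kind of non-target near-match that the evaluation step described above is meant to expose. Scanning all distinct k-mers keeps the sketch short; it does not reproduce the search speed attributed to the PT server.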

Journal

Nucleic Acids Research, Oxford University Press

Published: Mar 15, 2004
