Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu. sequence a total of about six trillion base pairs of human DNA Rationale Improvements in the efficiency of DNA sequencing have both [9]. broadened the applications for sequencing and dramatically increased the size of sequencing datasets. Technologies from With existing methods, the computational cost of aligning Illumina (San Diego, CA, USA) and Applied Biosystems (Fos- many short reads to a mammalian genome is very large. For ter City, CA, USA) have been used to profile methylation pat- example, extrapolating from the results presented here in terns (MeDIP-Seq) [1], to map DNA-protein interactions Tables 1 and 2, one can see that Maq would require more than (ChIP-Seq) [2], and to identify differentially expressed genes 5 central processing unit (CPU)-months and SOAP more than (RNA-Seq) [3] in the human genome and other species. The 3 CPU-years to align the 140 billion bases from the study by Illumina instrument was recently used to re-sequence three Ley and coworkers [5]. Although using Maq or SOAP for this human genomes, one from a cancer patient and two from pre- purpose has been shown to be feasible by using multiple viously unsequenced ethnic groups [4-6]. Each of these stud- CPUs, there is a clear need for new tools that consume less ies required the alignment of large numbers of short DNA time and computational resources. sequences ('short reads') onto the human genome. For exam- ple, two of the studies [4,5] used the short read alignment tool Maq and SOAP take the same basic algorithmic approach as Maq [7] to align more than 130 billion bases (about 45× cov- other recent read mapping tools such as RMAP [10], ZOOM erage) of short Illumina reads to a human reference genome [11], and SHRiMP [12]. Each tool builds a hash table of short in order to detect genetic variations. The third human re- oligomers present in either the reads (SHRiMP, Maq, RMAP, sequencing study [6] used the SOAP program [8] to align and ZOOM) or the reference (SOAP). Some employ recent more than 100 billion bases to the reference genome. In addi- theoretical advances to align reads quickly without sacrificing tion to these projects, the 1,000 Genomes project is in the sensitivity. For example, ZOOM uses 'spaced seeds' to signif- process of using high-throughput sequencing instruments to icantly outperform RMAP, which is based on a simpler algo- Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.2 Table 1 Bowtie alignment performance versus SOAP and Maq Platform CPU time Wall clock time Reads mapped per Peak virtual memory Bowtie speed-up Reads aligned (%) hour (millions) footprint (megabytes) Bowtie -v 2 Server 15 m 7 s 15 m 41 s 33.8 1,149 - 67.4 SOAP 91 h 57 m 35 s 91 h 47 m 46 s 0.10 13,619 351× 67.3 Bowtie PC 16 m 41 s 17 m 57 s 29.5 1,353 - 71.9 Maq 17 h 46 m 35 s 17 h 53 m 7 s 0.49 804 59.8× 74.7 Bowtie Server 17 m 58 s 18 m 26 s 28.8 1,353 - 71.9 Maq 32 h 56 m 53 s 32 h 58 m 39 s 0.27 804 107× 74.7 The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit. rithm developed by Baeza-Yaetes and Perleberg [13]. Spaced a typical desktop computer with 2 GB of RAM. The index is seeds have been shown to yield higher sensitivity than contig- small enough to be distributed over the internet and to be uous seeds of the same length [14,15]. SHRiMP employs a stored on disk and re-used. Multiple processor cores can be combination of spaced seeds and the Smith-Waterman [16] used simultaneously to achieve even greater alignment speed. algorithm to align reads with high sensitivity at the expense of We have used Bowtie to align 14.3× coverage worth of human speed. Eland is a commercial alignment program available Illumina reads from the 1,000 Genomes project in about 14 from Illumina that uses a hash-based algorithm to align hours on a single desktop computer with four processor cores. reads. Bowtie makes a number of compromises to achieve this Bowtie uses a different and novel indexing strategy to create speed, but these trade-offs are reasonable within the context an ultrafast, memory-efficient short read aligner geared of mammalian re-sequencing projects. If one or more exact toward mammalian re-sequencing. In our experiments using matches exist for a read, then Bowtie is guaranteed to report reads from the 1,000 Genomes project, Bowtie aligns 35-base one, but if the best match is an inexact one then Bowtie is not pair (bp) reads at a rate of more than 25 million reads per guaranteed in all cases to find the highest quality alignment. CPU-hour, which is more than 35 times faster than Maq and With its highest performance settings, Bowtie may fail to 300 times faster than SOAP under the same conditions (see align a small number of reads with valid alignments, if those Tables 1 and 2). Bowtie employs a Burrows-Wheeler index reads have multiple mismatches. If the stronger guarantees based on the full-text minute-space (FM) index, which has a are desired, Bowtie supports options that increase accuracy at memory footprint of only about 1.3 gigabytes (GB) for the the cost of some performance. For instance, the '--best' option human genome. The small footprint allows Bowtie to run on will guarantee that all alignments reported are best in terms Table 2 Bowtie alignment performance versus Maq with filtered read set Platform CPU time Wall clock time Reads mapped per hour Peak virtual memory Bowtie speed up Reads aligned (%) (millions) footprint (megabytes) Bowtie PC 16 m 39 s 17 m 47 s 29.8 1,353 - 74.9 Maq 11 h 15 m 58 s 11 h 22 m 2 s 0.78 804 38.4× 78.0 Bowtie Server 18 m 20 s 18 m 46 s 28.3 1,352 - 74.9 Maq 18 h 49 m 7 s 18 h 50 m 16 s 0.47 804 60.2× 78.0 Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.3 of minimizing mismatches in the seed portion of the read, avoid excessive backtracking. The Bowtie aligner follows a although this option incurs additional computational cost. policy similar to Maq's, in that it allows a small number of mismatches within the high-quality end of each read, and it With its default options, Bowtie's sensitivity measured in places an upper limit on the sum of the quality values at mis- terms of reads aligned is equal to SOAP's and somewhat less matched alignment positions. than Maq's. Command line options allow the user to increase sensitivity at the cost of greater running time, and to enable Burrows-Wheeler indexing Bowtie to report multiple hits for a read. Bowtie can align The BWT is a reversible permutation of the characters in a reads as short as four bases and as long as 1,024 bases. The text. Although originally developed within the context of data input to a single run of Bowtie may comprise a mixture of compression, BWT-based indexing allows large texts to be reads with different lengths. searched efficiently in a small memory footprint. It has been applied to bioinformatics applications, including oligomer counting [20], whole-genome alignment [21], tiling microar- ray probe design [22], and Smith-Waterman alignment to a Bowtie description and results Bowtie indexes the reference genome using a scheme based human-sized reference [23]. on the Burrows-Wheeler transform (BWT) [17] and the FM index [18,19]. A Bowtie index for the human genome fits in The Burrows-Wheeler transformation of a text T, BWT(T), is 2.2 GB on disk and has a memory footprint of as little as 1.3 constructed as follows. The character $ is appended to T, GB at alignment time, allowing it to be queried on a worksta- where $ is not in T and is lexicographically less than all char- tion with under 2 GB of RAM. acters in T. The Burrows-Wheeler matrix of T is constructed as the matrix whose rows comprise all cyclic rotations of T$. The common method for searching in an FM index is the The rows are then sorted lexicographically. BWT(T) is the exact-matching algorithm of Ferragina and Manzini [18]. sequence of characters in the rightmost column of the Bur- Bowtie does not simply adopt this algorithm because exact rows-Wheeler matrix (Figure 1a). BWT(T) has the same matching does not allow for sequencing errors or genetic var- length as the original text T. iations. We introduce two novel extensions that make the technique applicable to short read alignment: a quality-aware This matrix has a property called 'last first (LF) mapping'. The th backtracking algorithm that allows mismatches and favors i occurrence of character X in the last column corresponds to th high-quality alignments; and 'double indexing', a strategy to the same text character as the i occurrence of X in the first (a) (c) (b) Burro Figure ws-Wheeler 1 transform Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column). Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.4 column. This property underlies algorithms that use the BWT ment encountered by Bowtie will not necessarily be the 'best' index to navigate or search the text. Figure 1b illustrates in terms of number of mismatches or in terms of quality. The UNPERMUTE, an algorithm that applies the LF mapping user may instruct Bowtie to continue searching until it can repeatedly to re-create T from BWT(T). prove that any alignment it reports is 'best' in terms of number of mismatches (using the option --best). In our expe- The LF mapping is also used in exact matching. Because the rience, this mode is two to three times slower than the default matrix is sorted lexicographically, rows beginning with a mode. We expect that the faster default mode will be pre- given sequence appear consecutively. In a series of steps, the ferred for large re-sequencing projects. EXACTMATCH algorithm (Figure 1c) calculates the range of matrix rows beginning with successively longer suffixes of the The user may also opt for Bowtie to report all alignments up query. At each step, the size of the range either shrinks or to a specified number (option -k) or all alignments with no remains the same. When the algorithm completes, rows limit on the number (option -a) for a given read. If in the (the entire query) correspond to exact beginning with S course of its search Bowtie finds N possible alignments for a occurrences of the query in the text. If the range is empty, the given set of substitutions, but the user has requested only K text does not contain the query. UNPERMUTE is attributable alignments where K < N, Bowtie will report K of the N align- to Burrows and Wheeler [17] and EXACTMATCH to Ferra- ments selected at random. Note that these modes can be gina and Manzini [18]. See Additional data file 1 (Supplemen- much slower than the default. In our experience, for example, tary Discussion 1) for details. -k 1 is more than twice as fast as -k 2. Searching for inexact alignments Excessive backtracking EXACTMATCH is insufficient for short read alignment The aligner as described so far can, in some cases, encounter because alignments may contain mismatches, which may be sequences that cause excessive backtracking. This occurs due to sequencing errors, genuine differences between refer- when the aligner spends most of its effort fruitlessly back- ence and query organisms, or both. We introduce an align- tracking to positions close to the 3' end of the query. Bowtie ment algorithm that conducts a backtracking search to mitigates excessive backtracking with the novel technique of quickly find alignments that satisfy a specified alignment pol- 'double indexing'. Two indices of the genome are created: one icy. Each character in a read has a numeric quality value, with containing the BWT of the genome, called the 'forward' index, lower values indicating a higher likelihood of a sequencing and a second containing the BWT of the genome with its char- error. Our alignment policy allows a limited number of mis- acter sequence reversed (not reverse complemented) called matches and prefers alignments where the sum of the quality the 'mirror' index. To see how this helps, consider a matching values at all mismatched positions is low. policy that allows one mismatch in the alignment. A valid alignment with one mismatch falls into one of two cases The search proceeds similarly to EXACTMATCH, calculating according to which half of the read contains the mismatch. matrix ranges for successively longer query suffixes. If the Bowtie proceeds in two phases corresponding to those two range becomes empty (a suffix does not occur in the text), cases. Phase 1 loads the forward index into memory and then the algorithm may select an already-matched query invokes the aligner with the constraint that it may not substi- position and substitute a different base there, introducing a tute at positions in the query's right half. Phase 2 uses the mismatch into the alignment. The EXACTMATCH search mirror index and invokes the aligner on the reversed query, resumes from just after the substituted position. The algo- with the constraint that the aligner may not substitute at posi- rithm selects only those substitutions that are consistent with tions in the reversed query's right half (the original query's the alignment policy and which yield a modified suffix that left half). The constraints on backtracking into the right half occurs at least once in the text. If there are multiple candidate prevent excessive backtracking, whereas the use of two substitution positions, then the algorithm greedily selects a phases and two indices maintains full sensitivity. position with a minimal quality value. Unfortunately, it is not possible to avoid excessive backtrack- Backtracking scenarios play out within the context of a stack ing fully when alignments are permitted to have two or more structure that grows when a new substitution is introduced mismatches. In our experiments, we have observed that and shrinks when the aligner rejects all candidate alignments excessive backtracking is significant only when a read has for the substitutions currently on the stack. See Figure 2 for many low-quality positions and does not align or aligns an illustration of how the search might proceed. poorly to the reference. These cases can trigger in excess of 200 backtracks per read because there are many legal combi- In short, Bowtie conducts a quality-aware, greedy, rand- nations of low-quality positions to be explored before all pos- omized, depth-first search through the space of possible sibilities are exhausted. We mitigate this cost by enforcing a alignments. If a valid alignment exists, then Bowtie will find limit on the number of backtracks allowed before a search is it (subject to the backtrack ceiling discussed in the following terminated (default: 125). The limit prevents some legitimate, section). Because the search is greedy, the first valid align- Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.5 Exact Inexact g a a a E Fi xg au ct m re 2 atching versus inexact alignment Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'. Boxed pairs of numbers denote ranges of matrix rows beginning with the suffix observed up to that point. A red X marks where the algorithm encounters an empty range and either aborts (as in EXACTMATCH) or backtracks (as in the inexact algorithm). A green check marks where the algorithm finds a nonempty range delimiting one or more occurrences of a reportable alignment for the query. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.6 low-quality alignments from being reported, but we expect that this is a desirable trade-off for most applications. Phased Maq-like search Bowtie allows the user to select the number of mismatches permitted (default: two) in the high-quality end of a read (default: the first 28 bases) as well as the maximum accepta- ble quality distance of the overall alignment (default: 70). Quality values are assumed to follow the definition in PHRED [24], where p is the probability of error and Q = -10log p. Both the read and its reverse complement are candidates for alignment to the reference. For clarity, this discussion consid- ers only the forward orientation. See Additional data file 1 (Supplementary Discussion 2) for an explanation of how the reverse complement is incorporated. The first 28 bases on the high-quality end of the read are termed the 'seed'. The seed consists of two halves: the 14 bp on the high-quality end (usually the 5' end) and the 14 bp on the low-quality end, termed the 'hi-half' and the 'lo-half', respectively. Assuming the default policy (two mismatches permitted in the seed), a reportable alignment will fall into one of four cases: no mismatches in seed (case 1); no mis- matches in hi-half, one or two mismatches in lo-half (case 2); no mismatches in lo-half, one or two mismatches in hi-half The thre Figure 3e phases of the Bowtie algorithm for the Maq-like policy (case 3); and one mismatch in hi-half, one mismatch in lo- The three phases of the Bowtie algorithm for the Maq-like policy. A three- half (case 4). phase approach finds alignments for two-mismatch cases 1 to 4 while minimizing backtracking. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to All cases allow any number of mismatches in the nonseed part find alignments for case 3: Phase 2 finds partial alignments with mismatches of the read and all cases are also subject to the quality distance only in the hi-half, and phase 3 attempts to extend those partial alignments constraint. into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices, as illustrated in Figure 3. Phase 1 uses the mirror index and invokes the All runs were performed on a single CPU. Bowtie speedups aligner to find alignments for cases 1 and 2. Phases 2 and 3 were calculated as a ratio of wall-clock alignment times. Both cooperate to find alignments for case 3: Phase 2 finds partial wall-clock and CPU times are given to demonstrate that alignments with mismatches only in the hi-half and phase 3 input/output load and CPU contention are not significant fac- attempts to extend those partial alignments into full align- tors. ments. Finally, phase 3 invokes the aligner to find alignments for case 4. The time required to build the Bowtie index was not included in the Bowtie running times. Unlike competing tools, Bowtie Performance results can reuse a pre-computed index for the reference genome We evaluated the performance of Bowtie using reads from the across many alignment runs. We anticipate most users will 1,000 Genomes project pilot (National Center for Biotechnol- simply download such indices from a public repository. The ogy Information [NCBI] Short Read Archive:SRR001115). A Bowtie site [25] provides indices for current builds of the total of 8.84 million reads, about one lane of data from an human, chimp, mouse, dog, rat, and Arabidopsis thaliana Illumina instrument, were trimmed to 35 bp and aligned to genomes, as well as many others. the human reference genome [NCBI build 36.3]. Unless spec- ified otherwise, read data are not filtered or modified (besides Results were obtained on two hardware platforms: a desktop trimming) from how they appear in the archive. This leads to workstation with 2.4 GHz Intel Core 2 processor and 2 GB of about 70% to 75% of reads aligning somewhere to the RAM; and a large-memory server with a four-core 2.4 GHz genome. In our experience, this is typical for raw data from AMD Opteron processor and 32 GB of RAM. These are the archive. More aggressive filtering leads to higher align- denoted 'PC' and 'server', respectively. Both PC and server ment rates and faster alignment. run Red Hat Enterprise Linux AS release 4. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.7 Comparison to SOAP and Maq shows performance results when the three tools are each used Maq is a popular aligner [1,4,5,26,27] that is among the fast- to align three sets of 2 M untrimmed reads, a 36-bp set, a 50- est competing open source tools for aligning millions of Illu- bp set and a 76-bp set, to the human genome on the server mina reads to the human genome. SOAP is another open platform. Each set of 2 M is randomly sampled from a larger source tool that has been reported and used in short-read set (NCBI Short Read Archive: SRR003084 for 36-bp, projects [6,28]. SRR003092 for 50-bp, SRR003196 for 76-bp). Reads were sampled such that the three sets of 2 M have uniform per-base Table 1 presents the performance and sensitivity of Bowtie error rate, as calculated from per-base Phred qualities. All v0.9.6, SOAP v1.10, and Maq v0.6.6. SOAP could not be run reads pass through Maq's 'catfilter'. on the PC because SOAP's memory footprint exceeds the PC's physical memory. The 'soap.contig' version of the SOAP Bowtie is run both in its Maq-like default mode and in its binary was used. For comparison with SOAP, Bowtie was SOAP-like '-v 2' mode. Bowtie is also given the '-z' option to invoked with '-v 2' to mimic SOAP's default matching policy ensure that only the forward or mirror index is resident in (which allows up to two mismatches in the alignment and dis- memory at one time. Maq v0.7.1 was used instead of Maq regards quality values), and with '--maxns 5' to simulate v0.6.6 for the 76-bp set because v0.6.6 cannot align reads SOAP's default policy of filtering out reads with five or more longer than 63 bp. SOAP was not run on the 76-bp set because no-confidence bases. For the Maq comparison Bowtie is run it does not support reads longer than 60 bp. with its default policy, which mimics Maq's default policy of allowing up to two mismatches in the first 28 bases and The results show that Maq's algorithm scales better overall to enforcing an overall limit of 70 on the sum of the quality val- longer read lengths than Bowtie or SOAP. However, Bowtie in ues at all mismatched read positions. To make Bowtie's mem- SOAP-like '-v 2' mode also scales very well. Bowtie in its ory footprint more comparable to Maq's, Bowtie is invoked default Maq-like mode scales well from 36-bp to 50-bp reads with the '-z' option in all experiments to ensure that only the but is substantially slower for 76-bp reads, although it is still forward or mirror index is resident in memory at one time. more than an order of magnitude faster than Maq. The number of reads aligned indicates that SOAP (67.3%) and Parallel performance Bowtie -v 2 (67.4%) have comparable sensitivity. Of the reads Alignment can be parallelized by distributing reads across aligned by either SOAP or Bowtie, 99.7% were aligned by concurrent search threads. Bowtie allows the user to specify a both, 0.2% were aligned by Bowtie but not SOAP, and 0.1% desired number of threads (option -p); Bowtie then launches were aligned by SOAP but not Bowtie. Maq (74.7%) and Bow- the specified number of threads using the pthreads library. tie (71.9%) also have roughly comparable sensitivity, Bowtie threads synchronize with each other when fetching although Bowtie lags by 2.8%. Of the reads aligned by either reads, outputting results, switching between indices, and per- Maq or Bowtie, 96.0% were aligned by both, 0.1% were forming various forms of global bookkeeping, such as mark- aligned by Bowtie but not Maq, and 3.9% were aligned by ing a read as 'done'. Otherwise, threads are free to operate in Maq but not Bowtie. Of the reads mapped by Maq but not parallel, substantially speeding up alignment on computers Bowtie, almost all are due to a flexibility in Maq's alignment with multiple processor cores. The memory image of the algorithm that allows some alignments to have three mis- index is shared by all threads, and so the footprint does not matches in the seed. The remainder of the reads mapped by increase substantially when multiple threads are used. Table Maq but not Bowtie are due to Bowtie's backtracking ceiling. 4 shows performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Maq's documentation mentions that reads containing 'poly-A artifacts' can impair Maq's performance. Table 2 presents Index building performance and sensitivity of Bowtie and Maq when the read Bowtie uses a flexible indexing algorithm [29] that can be set is filtered using Maq's 'catfilter' command to eliminate configured to trade off between memory usage and running poly-A artifacts. The filter eliminates 438,145 out of time. Table 5 illustrates this trade-off when indexing the 8,839,010 reads. Other experimental parameters are identi- entire human reference genome (NCBI build 36.3, contigs). cal to those of the experiments in Table 1, and the same obser- Runs were performed on the server platform. The indexer was vations about the relative sensitivity of Bowtie and Maq apply run four times with different upper limits on memory usage. here. The reported times compare favorably with alignment times Read length and performance of competing tools that perform indexing during alignment. As sequencing technology improves, read lengths are growing Less than 5 hours is required for Bowtie to both build and beyond the 30-bp to 50-bp commonly seen in public data- query a whole-human index with 8.84 million reads from the bases today. Bowtie, Maq, and SOAP support reads of lengths 1,000 Genome project (NCBI Short Read up to 1,024, 63, and 60 bp, respectively, and Maq versions Archive:SRR001115) on a server, more than sixfold faster 0.7.0 and later support read lengths up to 127 bp. Table 3 than the equivalent Maq run. The bottom-most row illus- Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.8 Table 3 Varying read length using Bowtie, Maq and SOAP Length Program CPU time Wall clock time Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) 36 bp Bowtie 6 m 15 s 6 m 21 s 1,305 - 62.2 Maq 3 h 52 m 26 s 3 h 52 m 54 s 804 36.7× 65.0 Bowtie -v 2 4 m 55 s 5 m 00 s 1,138 - 55.0 SOAP 16 h 44 m 3 s 18 h 1 m 38 s 13,619 216× 55.1 50 bp Bowtie 7 m 11 s 7 m 20 s 1,310 - 67.5 Maq 2 h 39 m 56 s 2 h 40 m 9 s 804 21.8× 67.9 Bowtie -v 2 5 m 32 s 5 m 46 s 1,138 - 56.2 SOAP 48 h 42 m 4 s 66 h 26 m 53 s 13,619 691× 56.2 76 bp Bowtie 18 m 58 s 19 m 6 s 1,323 - 44.5 Maq 0.7.1 4 h 45 m 7 s 4 h 45 m 17 s 1,155 14.9× 44.9 Bowtie -v 2 7 m 35 s 7 m 40 s 1,138 - 31.7 The performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2 M untrimmed reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For each read length, the 2 M reads were randomly sampled from the FASTQ file downloaded from the Archive such that the average per-base error rate as measured by quality values was uniform across the three sets. All reads pass through Maq's "catfilter". Maq v0.7.1 was used for the 76-bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76-bp experiment because it does not support reads longer than 60 bp. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. trates that the Bowtie indexer, with appropriate arguments, is Discussion memory-efficient enough to run on a typical workstation with Bowtie exhibits a large performance advantage over both Maq and SOAP when mapping reads to the human genome. Bow- 2 GB of RAM. Additional data file 1 (Supplementary discus- sions 3 and 4) explains the algorithm and the contents of the tie's sensitivity in terms of reads aligned is comparable to that resulting index. of SOAP and slightly less than Maq's, although the user may use command-line options to trade slower running time for Software greater sensitivity. Unlike SOAP, Bowtie's 1.3 GB memory Bowtie is written in C++ and uses the SeqAn library [30]. The footprint allows it to run on a typical PC with 2 GB of RAM. converter to the Maq mapping format uses code from Maq. Bowtie aligns Illumina reads to the human genome at a rate of over 25 million reads per hour. Multiple processor cores can run parallel Bowtie threads to achieve even greater align- Table 4 Bowtie parallel alignment performance CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Speedup Bowtie, one thread 18 m 19 s 18 m 46 s 28.3 1,353 - Bowtie, two threads 20 m 34 s 10 m 35 s 50.1 1,363 1.77× Bowtie, four threads 23 m 9 s 6 m 1 s 88.1 1,384 3.12× Performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.9 Table 5 Bowtie index building performance Physical memory target (GB) Actual peak memory footprint (GB) Wall clock time 16 14.4 4 h 36 m 8 5.84 5 h 5 m 4 3.39 7 h 40 m 2 1.39 21 h 30 m Performance results and memory footprints of running the Bowtie v0.9.6 indexer on the whole human genome (National Center for Biotechnology Information build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. See Additional data file 1 (Supplementary Discussion 3 and Supplementary Table 1) for details. ment speed; experiments show a speed up of 3.12 for four bytes; LF: last first; NCBI: National Center for Biotechnology threads on a typical Opteron server. Information. Unlike many other short-read aligners, Bowtie creates a per- manent index of the reference that may be re-used across Authors' contributions alignment runs. Building the index is fast - Bowtie outper- BL developed the algorithms, collected results, and wrote forms competing tools when aligning lanes of Illumina reads most of the software. CT wrote some of the software. CT and even with index construction time included. At 2.2 GB for the MP contributed to discussions on algorithms. BL, CT, MP, human genome, the on-disk size of a Bowtie index is small and SLS wrote the manuscript. enough to distribute over the internet. The Bowtie website hosts pre-built indices for the human genome and several other model organisms including chimp, dog, rat, mouse, and Additional data files chicken. The following additional data are included with the online version of this article: a document containing supplementary Bowtie's speed and small memory footprint are due chiefly to discussions, tables, and figures pertaining to algorithms for its use of the Burrows-Wheeler index in combination with the navigating the Burrows-Wheeler transform, the full four- novel, quality-aware, backtracking algorithm introduced phase version of the alignment algorithm that incorporates here. Double indexing is used to avoid the performance pen- the reverse-complement, index construction, and compo- alty of excessive backtracking. nents of the index (Additional data file 1). Additio Supp Pres tai for inc components of the Clic n m ok he ing t rpor ente , the leme n re fo o ate d are al data file ful algori ntar sl r the f supp fil y dis our-p thms for n e rev c index. leme 1 uss hase ersi e-c ons, ntar vers o ample vigat table y di ion of scuss ment ing t s, an the h ions , index d e Burrow ali figur , tabl gnm cons esent es, a s-Whe truc algori nd fi tion, ele gur thm that r tr and es p ans- er - Bowtie supports standard FASTQ and FASTA input formats, and comes with a conversion program that allows Bowtie out- Acknowledgements The authors would like to thank Arthur Delcher for his thorough review put to be used with Maq's consensus generator and single of the manuscript, and Michael Schatz for helpful discussion of algorithms. nucleotide polymorphism caller. This research was supported in part by NIH grants R01-LM006845 and R01-GM083873 to SLS. Bowtie does not yet support paired-end alignment or align- ments with insertions or deletions, although both improve- ments are planned for the future. Paired-end alignment is not References 1. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Graf S, difficult to implement in Bowtie's framework, and we expect Johnson N, Herrero J, Tomazou EM, Thorne NP, Backdahl L, Her- that Bowtie's performance advantage will be comparable to, berth M, Howe KL, Jackson DK, Miretti MM, Marioni JC, Birney E, though perhaps somewhat less than, that of unpaired align- Hubbard TJ, Durbin R, Tavare S, Beck S: A Bayesian deconvolu- tion strategy for immunoprecipitation-based DNA methyl- ment mode. Support for insertions and deletions is also a con- ome analysis. Nat Biotechnol 2008, 26:779-785. ceptually straightforward addition. 2. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide map- ping of in vivo protein-DNA interactions. Science 2007, 316:1497-1502. Bowtie is free, open source software available from the Bowtie 3. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an website [25]. assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008, 18:1509-1517. 4. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bry- ant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Abbreviations Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, bp: base pair; BWT: Burrows-Wheeler transform; CPU: cen- Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, tral processing unit; FM: full-text minute-space; GB: giga- Pratt MR, et al.: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456:53-59. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.10 5. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling 30. Doring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott generic C++ library for sequence analysis. BMC Bioinformatics R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, 2008, 9:11. Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, et al.: DNA sequencing of a cytogenetically normal acute myeloid leu- kaemia genome. Nature 2008, 456:66-72. 6. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, et al.: The diploid genome sequence of an Asian individual. Nature 2008, 456:60-65. 7. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18:1851-1858. 8. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24:713-714. 9. Kaiser J: DNA sequencing. A plan to capture human diversity in 1000 genomes. Science 2008, 319:395. 10. Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioin- formatics 2008, 9:128. 11. Lin H, Zhang Z, Zhang MQ, Ma B, Li M: ZOOM! Zillions Of Oligos Mapped. Bioinformatics 2008, 24:2431-2437. 12. SHRiMP - SHort Read Mapping Package [http://comp bio.cs.toronto.edu/shrimp/] 13. Baeza-Yates RA, Perleberg CH: Fast and practical approximate string matching. Inf Process Lett 1996, 59:21-27. 14. Burkhardt S, Kärkkäinen J: Better Filtering with Gapped q- Grams. Fundam Inf 2003, 56:51-70. 15. Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics 2002, 18:440-445. 16. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. 17. Burrows M, Wheeler DJ: A Block Sorting Lossless Data Compression Algorithm. Technical Report 124 Palo Alto, CA: Digital Equipment Cor- poration; 1994. 18. Ferragina P, Manzini G: Opportunistic data structures with applications. [http://web.unipmn.it/~manzini/papers/ focs00draft.pdf]. 19. Ferragina P, Manzini G: An experimental study of an opportun- istic index. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete algorithms Washington, DC: Society for Industrial and Applied Mathematics; 2001:269-278. 20. Healy J, Thomas EE, Schwartz JT, Wigler M: Annotating large genomes with exact word matches. Genome Res 2003, 13:2306-2315. 21. Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comput Biol 2005, 12:407-415. 22. Graf S, Nielsen FG, Kurtz S, Huynen MA, Birney E, Stunnenberg H, Flicek P: Optimized design and assessment of whole genome tiling arrays. Bioinformatics 2007, 23:i195-i204. 23. Lam TW, Sung WK, Tam SL, Wong CK, Yiu SM: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24:791-797. 24. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8:186-194. 25. Bowtie: An ultrafast memory-efficient short read aligner [http://bowtie.cbcb.umd.edu/] 26. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PA, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40:722-729. 27. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, Dolecek C, Achtman M, Dougan G: High-throughput sequencing provides insights into genome variation and evolution in Salmonella typhi. Nat Genet 2008, 40:987-993. 28. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Sny- der M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008, 320:1344-1349. 29. Kärkkäinen J: Fast BWT in small space by blockwise suffix sort- ing. Theor Comput Sci 2007, 387:249-257. Genome Biology 2009, 10:R25 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Genome Biology Springer Journals

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Loading next page...
 
/lp/springer-journals/ultrafast-and-memory-efficient-alignment-of-short-dna-sequences-to-the-wVT90cjYK9

References (36)

Publisher
Springer Journals
Copyright
Copyright © 2009 by Langmead et al.; licensee BioMed Central Ltd.
Subject
Life Sciences; Animal Genetics and Genomics; Human Genetics; Plant Genetics & Genomics; Microbial Genetics and Genomics; Bioinformatics; Evolutionary Biology
eISSN
1474-760X
DOI
10.1186/gb-2009-10-3-r25
pmid
19261174
Publisher site
See Article on Publisher Site

Abstract

Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu. sequence a total of about six trillion base pairs of human DNA Rationale Improvements in the efficiency of DNA sequencing have both [9]. broadened the applications for sequencing and dramatically increased the size of sequencing datasets. Technologies from With existing methods, the computational cost of aligning Illumina (San Diego, CA, USA) and Applied Biosystems (Fos- many short reads to a mammalian genome is very large. For ter City, CA, USA) have been used to profile methylation pat- example, extrapolating from the results presented here in terns (MeDIP-Seq) [1], to map DNA-protein interactions Tables 1 and 2, one can see that Maq would require more than (ChIP-Seq) [2], and to identify differentially expressed genes 5 central processing unit (CPU)-months and SOAP more than (RNA-Seq) [3] in the human genome and other species. The 3 CPU-years to align the 140 billion bases from the study by Illumina instrument was recently used to re-sequence three Ley and coworkers [5]. Although using Maq or SOAP for this human genomes, one from a cancer patient and two from pre- purpose has been shown to be feasible by using multiple viously unsequenced ethnic groups [4-6]. Each of these stud- CPUs, there is a clear need for new tools that consume less ies required the alignment of large numbers of short DNA time and computational resources. sequences ('short reads') onto the human genome. For exam- ple, two of the studies [4,5] used the short read alignment tool Maq and SOAP take the same basic algorithmic approach as Maq [7] to align more than 130 billion bases (about 45× cov- other recent read mapping tools such as RMAP [10], ZOOM erage) of short Illumina reads to a human reference genome [11], and SHRiMP [12]. Each tool builds a hash table of short in order to detect genetic variations. The third human re- oligomers present in either the reads (SHRiMP, Maq, RMAP, sequencing study [6] used the SOAP program [8] to align and ZOOM) or the reference (SOAP). Some employ recent more than 100 billion bases to the reference genome. In addi- theoretical advances to align reads quickly without sacrificing tion to these projects, the 1,000 Genomes project is in the sensitivity. For example, ZOOM uses 'spaced seeds' to signif- process of using high-throughput sequencing instruments to icantly outperform RMAP, which is based on a simpler algo- Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.2 Table 1 Bowtie alignment performance versus SOAP and Maq Platform CPU time Wall clock time Reads mapped per Peak virtual memory Bowtie speed-up Reads aligned (%) hour (millions) footprint (megabytes) Bowtie -v 2 Server 15 m 7 s 15 m 41 s 33.8 1,149 - 67.4 SOAP 91 h 57 m 35 s 91 h 47 m 46 s 0.10 13,619 351× 67.3 Bowtie PC 16 m 41 s 17 m 57 s 29.5 1,353 - 71.9 Maq 17 h 46 m 35 s 17 h 53 m 7 s 0.49 804 59.8× 74.7 Bowtie Server 17 m 58 s 18 m 26 s 28.8 1,353 - 71.9 Maq 32 h 56 m 53 s 32 h 58 m 39 s 0.27 804 107× 74.7 The performance and sensitivity of Bowtie v0.9.6, SOAP v1.10, and Maq v0.6.6 when aligning 8.84 M reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR001115) trimmed to 35 base pairs. The 'soap.contig' version of the SOAP binary was used. SOAP could not be run on the PC because SOAP's memory footprint exceeds the PC's physical memory. For the SOAP comparison, Bowtie was invoked with '-v 2' to mimic SOAP's default matching policy (which allows up to two mismatches in the alignment and disregards quality values). For the Maq comparison Bowtie is run with its default policy, which mimics Maq's default policy of allowing up to two mismatches during the first 28 bases and enforcing an overall limit of 70 on the sum of the quality values at all mismatched positions. To make Bowtie's memory footprint more comparable to Maq's, Bowtie is invoked with the '-z' option in all experiments to ensure only the forward or mirror index is resident in memory at one time. CPU, central processing unit. rithm developed by Baeza-Yaetes and Perleberg [13]. Spaced a typical desktop computer with 2 GB of RAM. The index is seeds have been shown to yield higher sensitivity than contig- small enough to be distributed over the internet and to be uous seeds of the same length [14,15]. SHRiMP employs a stored on disk and re-used. Multiple processor cores can be combination of spaced seeds and the Smith-Waterman [16] used simultaneously to achieve even greater alignment speed. algorithm to align reads with high sensitivity at the expense of We have used Bowtie to align 14.3× coverage worth of human speed. Eland is a commercial alignment program available Illumina reads from the 1,000 Genomes project in about 14 from Illumina that uses a hash-based algorithm to align hours on a single desktop computer with four processor cores. reads. Bowtie makes a number of compromises to achieve this Bowtie uses a different and novel indexing strategy to create speed, but these trade-offs are reasonable within the context an ultrafast, memory-efficient short read aligner geared of mammalian re-sequencing projects. If one or more exact toward mammalian re-sequencing. In our experiments using matches exist for a read, then Bowtie is guaranteed to report reads from the 1,000 Genomes project, Bowtie aligns 35-base one, but if the best match is an inexact one then Bowtie is not pair (bp) reads at a rate of more than 25 million reads per guaranteed in all cases to find the highest quality alignment. CPU-hour, which is more than 35 times faster than Maq and With its highest performance settings, Bowtie may fail to 300 times faster than SOAP under the same conditions (see align a small number of reads with valid alignments, if those Tables 1 and 2). Bowtie employs a Burrows-Wheeler index reads have multiple mismatches. If the stronger guarantees based on the full-text minute-space (FM) index, which has a are desired, Bowtie supports options that increase accuracy at memory footprint of only about 1.3 gigabytes (GB) for the the cost of some performance. For instance, the '--best' option human genome. The small footprint allows Bowtie to run on will guarantee that all alignments reported are best in terms Table 2 Bowtie alignment performance versus Maq with filtered read set Platform CPU time Wall clock time Reads mapped per hour Peak virtual memory Bowtie speed up Reads aligned (%) (millions) footprint (megabytes) Bowtie PC 16 m 39 s 17 m 47 s 29.8 1,353 - 74.9 Maq 11 h 15 m 58 s 11 h 22 m 2 s 0.78 804 38.4× 78.0 Bowtie Server 18 m 20 s 18 m 46 s 28.3 1,352 - 74.9 Maq 18 h 49 m 7 s 18 h 50 m 16 s 0.47 804 60.2× 78.0 Performance and sensitivity of Bowtie v0.9.6 and Maq v0.6.6 when the read set is filtered using Maq's 'catfilter' command to eliminate poly-A artifacts. The filter eliminates 438,145 out of 8,839,010 reads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.3 of minimizing mismatches in the seed portion of the read, avoid excessive backtracking. The Bowtie aligner follows a although this option incurs additional computational cost. policy similar to Maq's, in that it allows a small number of mismatches within the high-quality end of each read, and it With its default options, Bowtie's sensitivity measured in places an upper limit on the sum of the quality values at mis- terms of reads aligned is equal to SOAP's and somewhat less matched alignment positions. than Maq's. Command line options allow the user to increase sensitivity at the cost of greater running time, and to enable Burrows-Wheeler indexing Bowtie to report multiple hits for a read. Bowtie can align The BWT is a reversible permutation of the characters in a reads as short as four bases and as long as 1,024 bases. The text. Although originally developed within the context of data input to a single run of Bowtie may comprise a mixture of compression, BWT-based indexing allows large texts to be reads with different lengths. searched efficiently in a small memory footprint. It has been applied to bioinformatics applications, including oligomer counting [20], whole-genome alignment [21], tiling microar- ray probe design [22], and Smith-Waterman alignment to a Bowtie description and results Bowtie indexes the reference genome using a scheme based human-sized reference [23]. on the Burrows-Wheeler transform (BWT) [17] and the FM index [18,19]. A Bowtie index for the human genome fits in The Burrows-Wheeler transformation of a text T, BWT(T), is 2.2 GB on disk and has a memory footprint of as little as 1.3 constructed as follows. The character $ is appended to T, GB at alignment time, allowing it to be queried on a worksta- where $ is not in T and is lexicographically less than all char- tion with under 2 GB of RAM. acters in T. The Burrows-Wheeler matrix of T is constructed as the matrix whose rows comprise all cyclic rotations of T$. The common method for searching in an FM index is the The rows are then sorted lexicographically. BWT(T) is the exact-matching algorithm of Ferragina and Manzini [18]. sequence of characters in the rightmost column of the Bur- Bowtie does not simply adopt this algorithm because exact rows-Wheeler matrix (Figure 1a). BWT(T) has the same matching does not allow for sequencing errors or genetic var- length as the original text T. iations. We introduce two novel extensions that make the technique applicable to short read alignment: a quality-aware This matrix has a property called 'last first (LF) mapping'. The th backtracking algorithm that allows mismatches and favors i occurrence of character X in the last column corresponds to th high-quality alignments; and 'double indexing', a strategy to the same text character as the i occurrence of X in the first (a) (c) (b) Burro Figure ws-Wheeler 1 transform Burrows-Wheeler transform. (a) The Burrows-Wheeler matrix and transformation for 'acaacg'. (b) Steps taken by EXACTMATCH to identify the range of rows, and thus the set of reference suffixes, prefixed by 'aac'. (c) UNPERMUTE repeatedly applies the last first (LF) mapping to recover the original text (in red on the top line) from the Burrows-Wheeler transform (in black in the rightmost column). Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.4 column. This property underlies algorithms that use the BWT ment encountered by Bowtie will not necessarily be the 'best' index to navigate or search the text. Figure 1b illustrates in terms of number of mismatches or in terms of quality. The UNPERMUTE, an algorithm that applies the LF mapping user may instruct Bowtie to continue searching until it can repeatedly to re-create T from BWT(T). prove that any alignment it reports is 'best' in terms of number of mismatches (using the option --best). In our expe- The LF mapping is also used in exact matching. Because the rience, this mode is two to three times slower than the default matrix is sorted lexicographically, rows beginning with a mode. We expect that the faster default mode will be pre- given sequence appear consecutively. In a series of steps, the ferred for large re-sequencing projects. EXACTMATCH algorithm (Figure 1c) calculates the range of matrix rows beginning with successively longer suffixes of the The user may also opt for Bowtie to report all alignments up query. At each step, the size of the range either shrinks or to a specified number (option -k) or all alignments with no remains the same. When the algorithm completes, rows limit on the number (option -a) for a given read. If in the (the entire query) correspond to exact beginning with S course of its search Bowtie finds N possible alignments for a occurrences of the query in the text. If the range is empty, the given set of substitutions, but the user has requested only K text does not contain the query. UNPERMUTE is attributable alignments where K < N, Bowtie will report K of the N align- to Burrows and Wheeler [17] and EXACTMATCH to Ferra- ments selected at random. Note that these modes can be gina and Manzini [18]. See Additional data file 1 (Supplemen- much slower than the default. In our experience, for example, tary Discussion 1) for details. -k 1 is more than twice as fast as -k 2. Searching for inexact alignments Excessive backtracking EXACTMATCH is insufficient for short read alignment The aligner as described so far can, in some cases, encounter because alignments may contain mismatches, which may be sequences that cause excessive backtracking. This occurs due to sequencing errors, genuine differences between refer- when the aligner spends most of its effort fruitlessly back- ence and query organisms, or both. We introduce an align- tracking to positions close to the 3' end of the query. Bowtie ment algorithm that conducts a backtracking search to mitigates excessive backtracking with the novel technique of quickly find alignments that satisfy a specified alignment pol- 'double indexing'. Two indices of the genome are created: one icy. Each character in a read has a numeric quality value, with containing the BWT of the genome, called the 'forward' index, lower values indicating a higher likelihood of a sequencing and a second containing the BWT of the genome with its char- error. Our alignment policy allows a limited number of mis- acter sequence reversed (not reverse complemented) called matches and prefers alignments where the sum of the quality the 'mirror' index. To see how this helps, consider a matching values at all mismatched positions is low. policy that allows one mismatch in the alignment. A valid alignment with one mismatch falls into one of two cases The search proceeds similarly to EXACTMATCH, calculating according to which half of the read contains the mismatch. matrix ranges for successively longer query suffixes. If the Bowtie proceeds in two phases corresponding to those two range becomes empty (a suffix does not occur in the text), cases. Phase 1 loads the forward index into memory and then the algorithm may select an already-matched query invokes the aligner with the constraint that it may not substi- position and substitute a different base there, introducing a tute at positions in the query's right half. Phase 2 uses the mismatch into the alignment. The EXACTMATCH search mirror index and invokes the aligner on the reversed query, resumes from just after the substituted position. The algo- with the constraint that the aligner may not substitute at posi- rithm selects only those substitutions that are consistent with tions in the reversed query's right half (the original query's the alignment policy and which yield a modified suffix that left half). The constraints on backtracking into the right half occurs at least once in the text. If there are multiple candidate prevent excessive backtracking, whereas the use of two substitution positions, then the algorithm greedily selects a phases and two indices maintains full sensitivity. position with a minimal quality value. Unfortunately, it is not possible to avoid excessive backtrack- Backtracking scenarios play out within the context of a stack ing fully when alignments are permitted to have two or more structure that grows when a new substitution is introduced mismatches. In our experiments, we have observed that and shrinks when the aligner rejects all candidate alignments excessive backtracking is significant only when a read has for the substitutions currently on the stack. See Figure 2 for many low-quality positions and does not align or aligns an illustration of how the search might proceed. poorly to the reference. These cases can trigger in excess of 200 backtracks per read because there are many legal combi- In short, Bowtie conducts a quality-aware, greedy, rand- nations of low-quality positions to be explored before all pos- omized, depth-first search through the space of possible sibilities are exhausted. We mitigate this cost by enforcing a alignments. If a valid alignment exists, then Bowtie will find limit on the number of backtracks allowed before a search is it (subject to the backtrack ceiling discussed in the following terminated (default: 125). The limit prevents some legitimate, section). Because the search is greedy, the first valid align- Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.5 Exact Inexact g a a a E Fi xg au ct m re 2 atching versus inexact alignment Exact matching versus inexact alignment. Illustration of how EXACTMATCH (top) and Bowtie's aligner (bottom) proceed when there is no exact match for query 'ggta' but there is a one-mismatch alignment when 'a' is replaced by 'g'. Boxed pairs of numbers denote ranges of matrix rows beginning with the suffix observed up to that point. A red X marks where the algorithm encounters an empty range and either aborts (as in EXACTMATCH) or backtracks (as in the inexact algorithm). A green check marks where the algorithm finds a nonempty range delimiting one or more occurrences of a reportable alignment for the query. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.6 low-quality alignments from being reported, but we expect that this is a desirable trade-off for most applications. Phased Maq-like search Bowtie allows the user to select the number of mismatches permitted (default: two) in the high-quality end of a read (default: the first 28 bases) as well as the maximum accepta- ble quality distance of the overall alignment (default: 70). Quality values are assumed to follow the definition in PHRED [24], where p is the probability of error and Q = -10log p. Both the read and its reverse complement are candidates for alignment to the reference. For clarity, this discussion consid- ers only the forward orientation. See Additional data file 1 (Supplementary Discussion 2) for an explanation of how the reverse complement is incorporated. The first 28 bases on the high-quality end of the read are termed the 'seed'. The seed consists of two halves: the 14 bp on the high-quality end (usually the 5' end) and the 14 bp on the low-quality end, termed the 'hi-half' and the 'lo-half', respectively. Assuming the default policy (two mismatches permitted in the seed), a reportable alignment will fall into one of four cases: no mismatches in seed (case 1); no mis- matches in hi-half, one or two mismatches in lo-half (case 2); no mismatches in lo-half, one or two mismatches in hi-half The thre Figure 3e phases of the Bowtie algorithm for the Maq-like policy (case 3); and one mismatch in hi-half, one mismatch in lo- The three phases of the Bowtie algorithm for the Maq-like policy. A three- half (case 4). phase approach finds alignments for two-mismatch cases 1 to 4 while minimizing backtracking. Phase 1 uses the mirror index and invokes the aligner to find alignments for cases 1 and 2. Phases 2 and 3 cooperate to All cases allow any number of mismatches in the nonseed part find alignments for case 3: Phase 2 finds partial alignments with mismatches of the read and all cases are also subject to the quality distance only in the hi-half, and phase 3 attempts to extend those partial alignments constraint. into full alignments. Finally, phase 3 invokes the aligner to find alignments for case 4. The Bowtie algorithm consists of three phases that alternate between using the forward and mirror indices, as illustrated in Figure 3. Phase 1 uses the mirror index and invokes the All runs were performed on a single CPU. Bowtie speedups aligner to find alignments for cases 1 and 2. Phases 2 and 3 were calculated as a ratio of wall-clock alignment times. Both cooperate to find alignments for case 3: Phase 2 finds partial wall-clock and CPU times are given to demonstrate that alignments with mismatches only in the hi-half and phase 3 input/output load and CPU contention are not significant fac- attempts to extend those partial alignments into full align- tors. ments. Finally, phase 3 invokes the aligner to find alignments for case 4. The time required to build the Bowtie index was not included in the Bowtie running times. Unlike competing tools, Bowtie Performance results can reuse a pre-computed index for the reference genome We evaluated the performance of Bowtie using reads from the across many alignment runs. We anticipate most users will 1,000 Genomes project pilot (National Center for Biotechnol- simply download such indices from a public repository. The ogy Information [NCBI] Short Read Archive:SRR001115). A Bowtie site [25] provides indices for current builds of the total of 8.84 million reads, about one lane of data from an human, chimp, mouse, dog, rat, and Arabidopsis thaliana Illumina instrument, were trimmed to 35 bp and aligned to genomes, as well as many others. the human reference genome [NCBI build 36.3]. Unless spec- ified otherwise, read data are not filtered or modified (besides Results were obtained on two hardware platforms: a desktop trimming) from how they appear in the archive. This leads to workstation with 2.4 GHz Intel Core 2 processor and 2 GB of about 70% to 75% of reads aligning somewhere to the RAM; and a large-memory server with a four-core 2.4 GHz genome. In our experience, this is typical for raw data from AMD Opteron processor and 32 GB of RAM. These are the archive. More aggressive filtering leads to higher align- denoted 'PC' and 'server', respectively. Both PC and server ment rates and faster alignment. run Red Hat Enterprise Linux AS release 4. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.7 Comparison to SOAP and Maq shows performance results when the three tools are each used Maq is a popular aligner [1,4,5,26,27] that is among the fast- to align three sets of 2 M untrimmed reads, a 36-bp set, a 50- est competing open source tools for aligning millions of Illu- bp set and a 76-bp set, to the human genome on the server mina reads to the human genome. SOAP is another open platform. Each set of 2 M is randomly sampled from a larger source tool that has been reported and used in short-read set (NCBI Short Read Archive: SRR003084 for 36-bp, projects [6,28]. SRR003092 for 50-bp, SRR003196 for 76-bp). Reads were sampled such that the three sets of 2 M have uniform per-base Table 1 presents the performance and sensitivity of Bowtie error rate, as calculated from per-base Phred qualities. All v0.9.6, SOAP v1.10, and Maq v0.6.6. SOAP could not be run reads pass through Maq's 'catfilter'. on the PC because SOAP's memory footprint exceeds the PC's physical memory. The 'soap.contig' version of the SOAP Bowtie is run both in its Maq-like default mode and in its binary was used. For comparison with SOAP, Bowtie was SOAP-like '-v 2' mode. Bowtie is also given the '-z' option to invoked with '-v 2' to mimic SOAP's default matching policy ensure that only the forward or mirror index is resident in (which allows up to two mismatches in the alignment and dis- memory at one time. Maq v0.7.1 was used instead of Maq regards quality values), and with '--maxns 5' to simulate v0.6.6 for the 76-bp set because v0.6.6 cannot align reads SOAP's default policy of filtering out reads with five or more longer than 63 bp. SOAP was not run on the 76-bp set because no-confidence bases. For the Maq comparison Bowtie is run it does not support reads longer than 60 bp. with its default policy, which mimics Maq's default policy of allowing up to two mismatches in the first 28 bases and The results show that Maq's algorithm scales better overall to enforcing an overall limit of 70 on the sum of the quality val- longer read lengths than Bowtie or SOAP. However, Bowtie in ues at all mismatched read positions. To make Bowtie's mem- SOAP-like '-v 2' mode also scales very well. Bowtie in its ory footprint more comparable to Maq's, Bowtie is invoked default Maq-like mode scales well from 36-bp to 50-bp reads with the '-z' option in all experiments to ensure that only the but is substantially slower for 76-bp reads, although it is still forward or mirror index is resident in memory at one time. more than an order of magnitude faster than Maq. The number of reads aligned indicates that SOAP (67.3%) and Parallel performance Bowtie -v 2 (67.4%) have comparable sensitivity. Of the reads Alignment can be parallelized by distributing reads across aligned by either SOAP or Bowtie, 99.7% were aligned by concurrent search threads. Bowtie allows the user to specify a both, 0.2% were aligned by Bowtie but not SOAP, and 0.1% desired number of threads (option -p); Bowtie then launches were aligned by SOAP but not Bowtie. Maq (74.7%) and Bow- the specified number of threads using the pthreads library. tie (71.9%) also have roughly comparable sensitivity, Bowtie threads synchronize with each other when fetching although Bowtie lags by 2.8%. Of the reads aligned by either reads, outputting results, switching between indices, and per- Maq or Bowtie, 96.0% were aligned by both, 0.1% were forming various forms of global bookkeeping, such as mark- aligned by Bowtie but not Maq, and 3.9% were aligned by ing a read as 'done'. Otherwise, threads are free to operate in Maq but not Bowtie. Of the reads mapped by Maq but not parallel, substantially speeding up alignment on computers Bowtie, almost all are due to a flexibility in Maq's alignment with multiple processor cores. The memory image of the algorithm that allows some alignments to have three mis- index is shared by all threads, and so the footprint does not matches in the seed. The remainder of the reads mapped by increase substantially when multiple threads are used. Table Maq but not Bowtie are due to Bowtie's backtracking ceiling. 4 shows performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Maq's documentation mentions that reads containing 'poly-A artifacts' can impair Maq's performance. Table 2 presents Index building performance and sensitivity of Bowtie and Maq when the read Bowtie uses a flexible indexing algorithm [29] that can be set is filtered using Maq's 'catfilter' command to eliminate configured to trade off between memory usage and running poly-A artifacts. The filter eliminates 438,145 out of time. Table 5 illustrates this trade-off when indexing the 8,839,010 reads. Other experimental parameters are identi- entire human reference genome (NCBI build 36.3, contigs). cal to those of the experiments in Table 1, and the same obser- Runs were performed on the server platform. The indexer was vations about the relative sensitivity of Bowtie and Maq apply run four times with different upper limits on memory usage. here. The reported times compare favorably with alignment times Read length and performance of competing tools that perform indexing during alignment. As sequencing technology improves, read lengths are growing Less than 5 hours is required for Bowtie to both build and beyond the 30-bp to 50-bp commonly seen in public data- query a whole-human index with 8.84 million reads from the bases today. Bowtie, Maq, and SOAP support reads of lengths 1,000 Genome project (NCBI Short Read up to 1,024, 63, and 60 bp, respectively, and Maq versions Archive:SRR001115) on a server, more than sixfold faster 0.7.0 and later support read lengths up to 127 bp. Table 3 than the equivalent Maq run. The bottom-most row illus- Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.8 Table 3 Varying read length using Bowtie, Maq and SOAP Length Program CPU time Wall clock time Peak virtual memory footprint (megabytes) Bowtie speed-up Reads aligned (%) 36 bp Bowtie 6 m 15 s 6 m 21 s 1,305 - 62.2 Maq 3 h 52 m 26 s 3 h 52 m 54 s 804 36.7× 65.0 Bowtie -v 2 4 m 55 s 5 m 00 s 1,138 - 55.0 SOAP 16 h 44 m 3 s 18 h 1 m 38 s 13,619 216× 55.1 50 bp Bowtie 7 m 11 s 7 m 20 s 1,310 - 67.5 Maq 2 h 39 m 56 s 2 h 40 m 9 s 804 21.8× 67.9 Bowtie -v 2 5 m 32 s 5 m 46 s 1,138 - 56.2 SOAP 48 h 42 m 4 s 66 h 26 m 53 s 13,619 691× 56.2 76 bp Bowtie 18 m 58 s 19 m 6 s 1,323 - 44.5 Maq 0.7.1 4 h 45 m 7 s 4 h 45 m 17 s 1,155 14.9× 44.9 Bowtie -v 2 7 m 35 s 7 m 40 s 1,138 - 31.7 The performance of Bowtie v0.9.6, SOAP v1.10, and Maq versions v0.6.6 and v0.7.1 on the server platform when aligning 2 M untrimmed reads from the 1,000 Genome project (National Center for Biotechnology Information Short Read Archive: SRR003084 for 36 base pairs [bp], SRR003092 for 50 bp, and SRR003196 for 76 bp). For each read length, the 2 M reads were randomly sampled from the FASTQ file downloaded from the Archive such that the average per-base error rate as measured by quality values was uniform across the three sets. All reads pass through Maq's "catfilter". Maq v0.7.1 was used for the 76-bp reads because v0.6.6 does not support reads longer than 63 bp. SOAP is excluded from the 76-bp experiment because it does not support reads longer than 60 bp. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. trates that the Bowtie indexer, with appropriate arguments, is Discussion memory-efficient enough to run on a typical workstation with Bowtie exhibits a large performance advantage over both Maq and SOAP when mapping reads to the human genome. Bow- 2 GB of RAM. Additional data file 1 (Supplementary discus- sions 3 and 4) explains the algorithm and the contents of the tie's sensitivity in terms of reads aligned is comparable to that resulting index. of SOAP and slightly less than Maq's, although the user may use command-line options to trade slower running time for Software greater sensitivity. Unlike SOAP, Bowtie's 1.3 GB memory Bowtie is written in C++ and uses the SeqAn library [30]. The footprint allows it to run on a typical PC with 2 GB of RAM. converter to the Maq mapping format uses code from Maq. Bowtie aligns Illumina reads to the human genome at a rate of over 25 million reads per hour. Multiple processor cores can run parallel Bowtie threads to achieve even greater align- Table 4 Bowtie parallel alignment performance CPU time Wall clock time Reads mapped per hour (millions) Peak virtual memory footprint (megabytes) Speedup Bowtie, one thread 18 m 19 s 18 m 46 s 28.3 1,353 - Bowtie, two threads 20 m 34 s 10 m 35 s 50.1 1,363 1.77× Bowtie, four threads 23 m 9 s 6 m 1 s 88.1 1,384 3.12× Performance results for running Bowtie v0.9.6 on the four-core server with one, two, and four threads. Other experimental parameters are identical to those of the experiments in Table 1. CPU, central processing unit. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.9 Table 5 Bowtie index building performance Physical memory target (GB) Actual peak memory footprint (GB) Wall clock time 16 14.4 4 h 36 m 8 5.84 5 h 5 m 4 3.39 7 h 40 m 2 1.39 21 h 30 m Performance results and memory footprints of running the Bowtie v0.9.6 indexer on the whole human genome (National Center for Biotechnology Information build 36.3, contigs). Runs were performed on the server platform. The indexer was run four times with different upper limits on memory usage. See Additional data file 1 (Supplementary Discussion 3 and Supplementary Table 1) for details. ment speed; experiments show a speed up of 3.12 for four bytes; LF: last first; NCBI: National Center for Biotechnology threads on a typical Opteron server. Information. Unlike many other short-read aligners, Bowtie creates a per- manent index of the reference that may be re-used across Authors' contributions alignment runs. Building the index is fast - Bowtie outper- BL developed the algorithms, collected results, and wrote forms competing tools when aligning lanes of Illumina reads most of the software. CT wrote some of the software. CT and even with index construction time included. At 2.2 GB for the MP contributed to discussions on algorithms. BL, CT, MP, human genome, the on-disk size of a Bowtie index is small and SLS wrote the manuscript. enough to distribute over the internet. The Bowtie website hosts pre-built indices for the human genome and several other model organisms including chimp, dog, rat, mouse, and Additional data files chicken. The following additional data are included with the online version of this article: a document containing supplementary Bowtie's speed and small memory footprint are due chiefly to discussions, tables, and figures pertaining to algorithms for its use of the Burrows-Wheeler index in combination with the navigating the Burrows-Wheeler transform, the full four- novel, quality-aware, backtracking algorithm introduced phase version of the alignment algorithm that incorporates here. Double indexing is used to avoid the performance pen- the reverse-complement, index construction, and compo- alty of excessive backtracking. nents of the index (Additional data file 1). Additio Supp Pres tai for inc components of the Clic n m ok he ing t rpor ente , the leme n re fo o ate d are al data file ful algori ntar sl r the f supp fil y dis our-p thms for n e rev c index. leme 1 uss hase ersi e-c ons, ntar vers o ample vigat table y di ion of scuss ment ing t s, an the h ions , index d e Burrow ali figur , tabl gnm cons esent es, a s-Whe truc algori nd fi tion, ele gur thm that r tr and es p ans- er - Bowtie supports standard FASTQ and FASTA input formats, and comes with a conversion program that allows Bowtie out- Acknowledgements The authors would like to thank Arthur Delcher for his thorough review put to be used with Maq's consensus generator and single of the manuscript, and Michael Schatz for helpful discussion of algorithms. nucleotide polymorphism caller. This research was supported in part by NIH grants R01-LM006845 and R01-GM083873 to SLS. Bowtie does not yet support paired-end alignment or align- ments with insertions or deletions, although both improve- ments are planned for the future. Paired-end alignment is not References 1. Down TA, Rakyan VK, Turner DJ, Flicek P, Li H, Kulesha E, Graf S, difficult to implement in Bowtie's framework, and we expect Johnson N, Herrero J, Tomazou EM, Thorne NP, Backdahl L, Her- that Bowtie's performance advantage will be comparable to, berth M, Howe KL, Jackson DK, Miretti MM, Marioni JC, Birney E, though perhaps somewhat less than, that of unpaired align- Hubbard TJ, Durbin R, Tavare S, Beck S: A Bayesian deconvolu- tion strategy for immunoprecipitation-based DNA methyl- ment mode. Support for insertions and deletions is also a con- ome analysis. Nat Biotechnol 2008, 26:779-785. ceptually straightforward addition. 2. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide map- ping of in vivo protein-DNA interactions. Science 2007, 316:1497-1502. Bowtie is free, open source software available from the Bowtie 3. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y: RNA-seq: an website [25]. assessment of technical reproducibility and comparison with gene expression arrays. Genome Res 2008, 18:1509-1517. 4. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bry- ant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Abbreviations Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, bp: base pair; BWT: Burrows-Wheeler transform; CPU: cen- Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, tral processing unit; FM: full-text minute-space; GB: giga- Pratt MR, et al.: Accurate whole human genome sequencing using reversible terminator chemistry. Nature 2008, 456:53-59. Genome Biology 2009, 10:R25 http://genomebiology.com/2009/10/3/R25 Genome Biology 2009, Volume 10, Issue 3, Article R25 Langmead et al. R25.10 5. Ley TJ, Mardis ER, Ding L, Fulton B, McLellan MD, Chen K, Dooling 30. Doring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, D, Dunford-Shore BH, McGrath S, Hickenbotham M, Cook L, Abbott generic C++ library for sequence analysis. BMC Bioinformatics R, Larson DE, Koboldt DC, Pohl C, Smith S, Hawkins A, Abbott S, 2008, 9:11. Locke D, Hillier LW, Miner T, Fulton L, Magrini V, Wylie T, Glasscock J, Conyers J, Sander N, Shi X, Osborne JR, Minx P, et al.: DNA sequencing of a cytogenetically normal acute myeloid leu- kaemia genome. Nature 2008, 456:66-72. 6. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, et al.: The diploid genome sequence of an Asian individual. Nature 2008, 456:60-65. 7. Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18:1851-1858. 8. Li R, Li Y, Kristiansen K, Wang J: SOAP: short oligonucleotide alignment program. Bioinformatics 2008, 24:713-714. 9. Kaiser J: DNA sequencing. A plan to capture human diversity in 1000 genomes. Science 2008, 319:395. 10. Smith AD, Xuan Z, Zhang MQ: Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioin- formatics 2008, 9:128. 11. Lin H, Zhang Z, Zhang MQ, Ma B, Li M: ZOOM! Zillions Of Oligos Mapped. Bioinformatics 2008, 24:2431-2437. 12. SHRiMP - SHort Read Mapping Package [http://comp bio.cs.toronto.edu/shrimp/] 13. Baeza-Yates RA, Perleberg CH: Fast and practical approximate string matching. Inf Process Lett 1996, 59:21-27. 14. Burkhardt S, Kärkkäinen J: Better Filtering with Gapped q- Grams. Fundam Inf 2003, 56:51-70. 15. Ma B, Tromp J, Li M: PatternHunter: faster and more sensitive homology search. Bioinformatics 2002, 18:440-445. 16. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-197. 17. Burrows M, Wheeler DJ: A Block Sorting Lossless Data Compression Algorithm. Technical Report 124 Palo Alto, CA: Digital Equipment Cor- poration; 1994. 18. Ferragina P, Manzini G: Opportunistic data structures with applications. [http://web.unipmn.it/~manzini/papers/ focs00draft.pdf]. 19. Ferragina P, Manzini G: An experimental study of an opportun- istic index. In Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete algorithms Washington, DC: Society for Industrial and Applied Mathematics; 2001:269-278. 20. Healy J, Thomas EE, Schwartz JT, Wigler M: Annotating large genomes with exact word matches. Genome Res 2003, 13:2306-2315. 21. Lippert RA: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. J Comput Biol 2005, 12:407-415. 22. Graf S, Nielsen FG, Kurtz S, Huynen MA, Birney E, Stunnenberg H, Flicek P: Optimized design and assessment of whole genome tiling arrays. Bioinformatics 2007, 23:i195-i204. 23. Lam TW, Sung WK, Tam SL, Wong CK, Yiu SM: Compressed indexing and local alignment of DNA. Bioinformatics 2008, 24:791-797. 24. Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res 1998, 8:186-194. 25. Bowtie: An ultrafast memory-efficient short read aligner [http://bowtie.cbcb.umd.edu/] 26. Campbell PJ, Stephens PJ, Pleasance ED, O'Meara S, Li H, Santarius T, Stebbings LA, Leroy C, Edkins S, Hardy C, Teague JW, Menzies A, Goodhead I, Turner DJ, Clee CM, Quail MA, Cox A, Brown C, Durbin R, Hurles ME, Edwards PA, Bignell GR, Stratton MR, Futreal PA: Identification of somatically acquired rearrangements in cancer using genome-wide massively parallel paired-end sequencing. Nat Genet 2008, 40:722-729. 27. Holt KE, Parkhill J, Mazzoni CJ, Roumagnac P, Weill FX, Goodhead I, Rance R, Baker S, Maskell DJ, Wain J, Dolecek C, Achtman M, Dougan G: High-throughput sequencing provides insights into genome variation and evolution in Salmonella typhi. Nat Genet 2008, 40:987-993. 28. Nagalakshmi U, Wang Z, Waern K, Shou C, Raha D, Gerstein M, Sny- der M: The transcriptional landscape of the yeast genome defined by RNA sequencing. Science 2008, 320:1344-1349. 29. Kärkkäinen J: Fast BWT in small space by blockwise suffix sort- ing. Theor Comput Sci 2007, 387:249-257. Genome Biology 2009, 10:R25

Journal

Genome BiologySpringer Journals

Published: Mar 4, 2009

There are no references for this article.