Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection

Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection ARTICLE DOI: 10.1038/s41467-018-04217-5 OPEN Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection 1,14 2 3 1 4 2 Hongshuo Song , Elena E. Giorgi , Vitaly V. Ganusov , Fangping Cai , Gayathri Athreya , Hyejin Yoon , 5 1 2 2 1,6 1 Oana Carja , Bhavna Hora , Peter Hraber , Ethan Romero-Severson , Chunlai Jiang , Xiaojun Li , 7 7 8,15 8 9 10 Shuyi Wang , Hui Li , Jesus F. Salazar-Gonzalez , Maria G. Salazar , Nilu Goonetilleke , Brandon F. Keele , 1 9 7,11 7,11 12 David C. Montefiori , Myron S. Cohen , George M. Shaw , Beatrice H. Hahn , Andrew J. McMichael , 1 2 2,13 1,6 Barton F. Haynes , Bette Korber , Tanmoy Bhattacharya & Feng Gao Recombination in HIV-1 is well documented, but its importance in the low-diversity setting of within-host diversification is less understood. Here we develop a novel computational tool (RAPR (Recombination Analysis PRogram)) to enable a detailed view of in vivo viral recombination during early infection, and we apply it to near-full-length HIV-1 genome sequences from longitudinal samples. Recombinant genomes rapidly replace transmitted/ founder (T/F) lineages, with a median half-time of 27 days, increasing the genetic complexity of the viral population. We identify recombination hot and cold spots that differ from those observed in inter-subtype recombinants. Furthermore, RAPR analysis of longitudinal samples from an individual with well-characterized neutralizing antibody responses shows that recombination helps carry forward resistance-conferring mutations in the diversifying qua- sispecies. These findings provide insight into molecular mechanisms by which viral recom- bination contributes to HIV-1 persistence and immunopathogenesis and have implications for studies of HIV transmission and evolution in vivo. 1 2 Duke Human Vaccine Institute and Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA. Theoretical Division, Los Alamos 3 4 National Laboratory, Los Alamos, NM 87544, USA. Department of Microbiology, University of Tennessee, Knoxville, TN 37996, USA. Office for Research 5 6 & Discovery, University of Arizona, Tucson, AZ 85721, USA. Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA. National Engineering Laboratory For AIDS Vaccine, College of Life Science, Jilin University, Changchun, Jilin 130012, China. Department of Medicine, University of 8 9 Pennsylvania, Philadelphia, PA 19104, USA. Department of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA. Departments of Microbiology and Immunology & Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA. AIDS and Cancer Virus Program, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA. Department of Microbiology, University of Pennsylvania, Philadelphia, PA 12 13 19104, USA. Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK. Santa Fe Institute, Santa Fe, NM 87501, USA. 14 15 Present address: United States Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA. Present address: MRC/UVRI and LSHTM Uganda Research Unit, Plot 51-57, Nakiwogo Road, Entebbe, Uganda. These authors contributed equally: Hongshuo Song, Elena E. Giorgi. These authors jointly supervised this work: Bette Korber, Tanmoy Bhattacharya, Feng Gao. Correspondence and requests for materials should be addressed to F.G. (email: fgao@duke.edu) NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 1 | | | 1234567890():,; ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 recombinant is a genetic sequence that carries regions rates and confounds relationships in standard phylogenetic tree 26 27 from two genetically distinct parental strains. That reconstructions . Currently, the bioinformatics tool GARD Arecombination contributed to HIV-1 quasispecies evolu- addresses this issue by finding likely breakpoints of recombina- tion in vivo was observed early after the discovery of HIV , and, tion within the alignment and reconstructing a series of phylo- soon after, recombination’s important role in global HIV-1 genetic trees of genetic segments within the breakpoints. 2,3 diversification was established . Ninety inter-subtype recombi- However, this program does not provide a list of recombinants or nants have been shown to be recurrent among circulating HIV-1 sequence relationships found within the alignment, and therefore viruses (for a current listing of these circulating recombinant is not helpful when analyzing low-diversity and/or shorter forms (CRFs), see https://www.hiv.lanl.gov/content/sequence/ regions, where eliminating a few sequences yields more statistical HIV/CRFs/CRFs.html). Some are very common epidemic lina- power than breaking up the alignment. ges like CRF01_AE, which is predominant in Thailand and Our bioinformatics tool, called Recombination Analysis PRo- 3 28 China, and CRF02_AG, which is common in West Africa . gram (RAPR), is based on the Wald–Wolfowitz Runs Test and Recombination increases overall genetic complexity of viral allows for computationally efficient detection of recombinants. populations more than just the accumulation of site mutations While variations on the Runs Test have been employed in the alone, thereby raising the likelihood for recombinants to find past to detect recombination in the context of gene conversion , favorable genetic configurations and facilitating faster adapta- this method has not been implemented on more recent and faster tion . Recombination is believed to contribute to viral diversity computational platforms. 5–7 8,9 10,11 and fitness , drug resistance , immunological escape , and Here, we show that RAPR is sensitive to identification of 7,12 disease progression . recombinant stretches in low-diversity settings, where only a Estimates of recombination rates in vivo vary among different small number of distinguishing bases are involved, and has the −5 −4 studies—1.4 × 10 to 2 × 10 breakpoints per site per genera- capability to discern parental strains from recombinants, in part 13–15 tion —comparable to the point mutation rate of 2.2–5.4 × 10 by accounting for longitudinal sampling times. Moreover, it can −5 16,17 per base per generation . In vitro studies have found HIV-1 address the lineage structure of the alignment by differentiating 18,19 to be a highly recombinogenic virus . However, in order to between the initiating recombinant event and its mutated des- detect recombinants, these studies utilize “foreign” gene inserts or cendants, when these are known for analysis. As a result, RAPR is sequences from divergent subtypes, unlike in vivo infections. able to reconstruct a hierarchy of serial recombination events as Schlub et al. describe an in vitro system that better mimics the recombinants of recombinants emerge over time. We first applied typical in vivo scenario, and observe that recombination occurs RAPR to a dataset of 3260 5′ and 3′ half genome sequences −3 non-randomly, with a frequency of the order of 10 breakpoints derived by SGA from 9 participants infected with multiple per nucleotide per round of infection, with multiple template transmitted/founder (T/F) viruses and sampled longitudinally switches between parental strains. over time. Because all infections were initiated by multiple, A number of factors have limited a better understanding of the genetically distinct T/Fs, recombinants between the infecting role of recombination in studies that looked at in vivo HIV-1 strains could be readily identified and tracked. This offered a infection. PCR-derived recombination artifacts are common unique opportunity to investigate how recombination impacts among sequences generated by bulk PCR products and, when not in vivo viral evolution and immune escape. We next applied excluded, may have affected older recombination studies that did RAPR to sequences from an individual (CH0505) whose infection not use single-genome amplification (SGA) . Similarly, studies was established by a single T/F virus and whose immune response that utilize sequences sampled only in chronic infection may not . We used this example to illustrate has been extensively studied be able to resolve ambiguities between putative parents and how to apply RAPR in a single T/F setting and illustrate the role 14,15 products of recombination . Other studies analyzed only a of recombination in carrying forward resistance mutations in a small portion of the viral genome. Such shortcomings can bias complex quasispecies. While heterogeneous infections have been estimates of the frequencies and evolutionary dynamics of analyzed before to estimate recombination rates , and long- recombinants in vivo. itudinal samples have been used to estimate recombination Many recombination detection tools currently available focus breakpoints , this is the first study to estimate how both on detecting recombinants between genetic subtypes and are best recombination rates and the accumulation of breakpoints con- applied to highly diverse sequences. Based on a sliding window tribute to the evolution of the HIV-1 quasispecies over time approach, RIP was the first bioinformatics tool to automatically during natural infection. screen for inter-subtype recombinants. More recent and advanced approaches are available today for detection of recombinants and their breakpoints within a single alignment. RECCO finds Results nominally significant recombinants by comparing the cost of Study participants. Previous studies have shown that approxi- obtaining each sequence either by mutation or by recombination. mately 80% of heterosexually transmitted HIV-1 infections are It does not, however, list parental strains or enumerate distinct initiated by a single T/F virus and only 20% are due to multiple, recombination events. Among the programs that provide a list of genetically distinct T/F viruses . In the latter case, due to the detected recombinants, including their potential breakpoints and genetically distinct quasispecies coevolving in the host, it becomes parental strains, is the suite RDP4 , which, in addition to easier to follow the history of recombination from the beginning implementing the recombination detection program (RDP) of the infection. To this purpose, we distinguished participants approach, includes several other existing tools, providing corro- productively infected with more than one T/F virus (hetero- boration of findings across the different methods. While these geneous infection) from those infected with a single T/F virus tools range in methods and strategies, the RDP test itself is based (homogeneous infection) by characterizing patterns of sequence on phylogenetic methods, and, as Posada and Crandall showed, diversity at the earliest time point (Fig. 1, Table 1, and Supple- in low-divergence scenarios like the recent infections presented in mentary Figs. 1 and 2) using statistical modeling, phylogenetic this study, these methods have less power to detect recombinants. trees, and highlighter plots, as previously described .Itis Defining the frequency of recombinants and identifying important to note that the number of infecting strains, the inci- breakpoints is particularly important in phylogenetic analyses dence of superinfection, and the estimated number of days since because recombination significantly biases estimates of mutation infection play no role in the detection of recombination, except 2 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE CH0010 (day 9) Table 1 Fiebig stage, transmission route, days since infection, and number of T/F viruses of heterogeneously infected individuals Subject Fiebig Transmission Days from No. of T/F stage route infection viruses (95% CI) Gaps CH0010 I/II Heterosexual 9 (7, 11) 3 6 CH0078 I/II Heterosexual 11 (7, 14) 2 2 CH0200 I/II Heterosexual 15 (13, 17) 3 5 CH0047 III Heterosexual 25 (21, 29) 2 2 CH0228 III Heterosexual 24 (21, 28) 2 3 CH0275 I/II Heterosexual 12 (8, 16) 1 2 CH0654 I/II MSM 15 (13, 17) 9 8 CH1244 IV Heterosexual 12 (11, 13) NA 2 CH1754 III Heterosexual 14 (12, 15) 6 7 Days from infection were calculated using previously published methods . These methods generally strongly correlate with Fiebig stage, though occasional deviations still happen, especially when under strong selection. In our present dataset, CH1244 is one such exception, 0 2000 4000 0.002 with a Fiebig stage IV yet diversity that would be expected—under a model of random tat accumulation of mutation—to be roughly 2 weeks into the infection vpu rev nef vif MSM men who have sex with men pol vpr env Fig. 1 Highlighter plot and phylogenetic tree of 3′ half genome sequences at when sequences are less diversified. Recombinant sequences screening time for CH0010. Six distinct T/F viruses (labeled a–f) were accumulated more breakpoints over time as a result of continuous identified. Each line represents a 3′ half genome sequence, and mutations serial recombination between earlier recombinants (Fig. 2b). from the major T/F (first sequence at the top) are color coded according to Similar results were obtained on both 5′ and 3′ half genome nucleotide sequences from all 9 participants (Supplementary Figs. 3–11). The frequency of recombinants and their descendants increased in all that we do not count the putative founder strains as recombinants individuals, eventually replacing all T/F lineages. Complete since we are focusing on recombination in the recipient rather replacement with recombinants of all distinctive T/F lineages than in the donor. However, calculating the time since infection that were evident in the earliest available time point was observed allows for aligning dynamics of recombination and viral loads to as early as 35 days, and typically was evident within 100 days facilitate comparisons. Participant selection and characteristics, (Fig. 3). The exception was participant CH0275, who was sparsely T/F identification, longitudinal sampling, and timing of infection sampled, with no time points available between 12 and 187 days; are described in the Methods. all CH0275 sequences sampled at day 187 were, however, recombinant forms (Fig. 3a). The RAPR program. To characterize recombination events and determine the frequency of recombinants over time, we developed Estimations of number of breakpoints. Template switches the RAPR tool, available as a web-based interface in the LANL occurring in nearly identical regions of the genome are likely to database (http://www.hiv.lanl.gov/content/sequence/RAP2017/ go undetected. Cromer et al. modeled this phenomenon and rap.html). The program relies on the fact that recombination estimated between 5 and 14 template switches per recombinant events can be identified through the inheritance of blocks of DNA genome. In this study, we observed a high frequency of recom- that carry distinguishing bases. Therefore, it finds recombinants bination breakpoints in our data when both significant and non- by comparing every set of three sequences in the alignment and significant but potential breakpoints were considered, consistent 28 31 applying the Wald–Wolfowitz Runs Test to check for ran- with what was reported by Cromer et al. . However, because of domness in the distribution of variable positions, identifying regions of near identity between the parents, the significant regions in a sequence that are significantly more similar to one breakpoint counts of RAPR are inevitably lower than the true candidate parent than another. It then classifies the viral popu- number of template switches, as illustrated in Supplementary lation into clusters of T/F viruses and their descendants—which Fig. 12. In order to quantify how many undetected breakpoints present mutations explainable by randomly distributed base resulted from template switches in regions of high homology changes—or identified recombinants and their descendants. In between parental strains, we used a recombination simulation the latter case, RAPR also determines the likeliest parents, based strategy (see Methods for details). As expected, RAPR con- both on the clustered distribution of sequence differences and on sistently underestimated the true number of breakpoints (Sup- their time of sampling, and delineates regions in which the plementary Figs. 13 and 14), and yielded a distribution of breakpoints are likely to lie. significant breakpoints lower than the actual number of break- As an example of RAPR performance and output, we illustrate points introduced in the simulation. This is not surprising given detailed results for participant CH0010, infected with 6 sampled that all sets of sequences used in the simulation had very low T/F viruses, labeled a–f (Figs 1 and 2a). Out of 103 unique within-alignment diversity (the within T/F mean Hamming dis- sequences from 4 time points (days 9, 26, 72, and 188 post tance range per nucleotide per sequence was 0.76–2.93% across infection), RAPR detected 52 distinct de novo recombinants, 4 of all parental pairs), and many of the random recombination which gave rise to 11 descendants (Fig. 2). By day 72 onward, all breakpoints in our simulation occur in sequence stretches where sequences were either recombinants or descendants of recombi- both parents were identical or nearly identical. As a result, this nants (Fig. 2). Estimated breakpoint intervals (illustrated as open study alone cannot bound the recombination rate from above, boxes in Fig. 2, shaded in gray when the breakpoint is statistically and Cromer et al. values are consistent with the lower bounds significant, see Methods) tend to be larger and sparser earlier, we obtain. NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 3 | | | ORF ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ab Recombination breakpoints T/F a T/F a T/F b T/F c T/F d T/F e T/F f Day 9 Day 26 Day 72 Day 188 T/Fs Rec cluster 1 T/F d T/F c T/F b Rec cluster 2 T/F f f * * Rec cluster 3 * Rec cluster 4 T/F e 0.002 tat vpu rev +2 vif nef pol vpr env Fig. 2 Identification of recombinants by RAPR. a Phylogenetic tree of 3′ half genome sequences from longitudinal samples from CH0010. Shaded regions show clonal expansion events and are color coded to indicate their closest T/F lineage. In each cluster, solid circle represents the originating (de novo) recombinant for that cluster and open circles represent its recombinant descendants. b Recombination breakpoints. Each line represents a sequence, and colored lines represent recombinants. Colored intervals in each recombinant sequence indicate the parental T/F lineage. Black boxes indicate the interval where the breakpoints are most likely to have occurred, and when the breakpoints are statistically significant, the box is shaded in gray. Shaded regions show clonal expansion events and are color coded to indicate their closest T/F lineage. Dashed lines indicate recombination descendants derived froma particular de novo recombinant as shown as a solid line in the same shaded group. The black stars on the right indicate the sequence lineage that RAPR identified as descendants of a single recombination event, while the program RECCO failed to recognize it as such To obtain a posterior distribution on the number of true biologically sound simulated datasets where the exact recombi- breakpoints, we also implemented a simulation based on an nants and their breakpoints are known. To realize this, we ran- approximate Bayesian computation method and applied it to domly generated sets of 100 artificial recombinants with known the same early time samples mentioned above (see Methods for crossover points, each from three different pairs of natural strains details). We conclude that for all recombinants for which RAPR that carried 0.6%, 1%, and 1.2% relative diversity, respectively (see detected 2 significant breakpoints, we are 95% confident that the Methods). In the lowest (0.6%) diversity setting, RAPR identified true breakpoints were no less than 2. A similar statement holds 77 out of 100 unique recombination events and 16 recombination when RAPR detected 4 breakpoints, with one exception descendants, missing 6 out of 100 known events (Supplementary (Supplementary Table 1). As the number of breakpoints Table 2). The other computational tools we used to test on the increases, they are more likely to go undetected: see for example same simulated datasets detected overall less recombination two of the recombinants from CH0228 3′ half sequences, for events than RAPR (Supplementary Table 2). However, we should which RAPR detects 5 and 6 significant breakpoints, but the 95% note that selection could impact our results when a few nearby confidence limit (CL) is 6 and 9 respectively. Supplementary mutations localized proximally are co-selected. Therefore, it is Fig. 15 illustrates how, as the number of template switches possible that generating recombinants in other ways, e.g., by increases, more breakpoints fall within regions of parental including some type of selection when sampling sequences, could homology, making it harder to accurately detect the true number reduce the efficiency of RAPR at detecting recombinants. of breakpoints. In conclusion, our tests indicate that RAPR For further comparisons, we tested different recombination provides a reasonable, though negatively biased, estimate for the detection tools on CH0010, the heterogeneous participant we true number of breakpoints; and the bias is small when the true used above to illustrate the RAPR output. RAPR identified 52 number of breakpoints is small. recombination events in this dataset, while the other tested tools detected between 2 and 18 (Fig. 2b) when accounting for sampling times. One feature of RAPR is that it can identify Comparison with existing recombination detection tools. clusters of sequences that likely descended from a single Despite the bias described above, our program proves useful in recombination event. When a single recombination event leads low-diversity scenarios. A good way to test this is using to a lineage, i.e., a cluster of sequences that differ from one 4 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | ORF NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE 3′ Half genome CH0010 CH0078 CH0200 CH0047 CH0228 100 100 100 100 100 80 80 80 80 80 60 60 60 60 40 40 40 40 20 20 20 20 0 0 0 0 50 100 150 200 0 100 200 300 400 500 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 CH1754 CH0654 CH1244 CH0275 T/F a T/F g 100 100 100 100 T/F b T/F h 80 80 80 80 60 60 60 60 T/F c Super-infected 40 40 40 40 T/F d Rec (all) 20 20 20 20 T/F e Rec (de novo) 0 0 0 0 T/F f Rec (descendant) 0 50 100 150 200 0 100 200 300 400 500 0 100 200 300 400 500 0 50 100 150 200 Days from infection 5′ Half genome CH0010 CH0078 CH0200 CH0047 100 100 100 100 80 80 80 80 60 60 60 60 40 40 40 40 20 20 20 20 0 0 0 0 0 50 100 150 200 0 100 200 300 400 500 0 200 400 600 800 0 200 400 600 800 CH0228 CH1754 CH0654 100 100 100 T/F a T/F g 80 80 80 T/F b T/F i T/F c T/F j 60 60 60 T/F d Rec (all) 40 40 40 T/F e Rec (de novo) 20 20 20 T/F f Rec (descendant) 0 0 0 0 200 400 600 800 0 50 100 150 200 0 100 200 300 400 500 Days from infection Fig. 3 Proportion of recombinants and their descendants in the viral population over time. The 3′ half genome (a) and 5′ half genome (b) sequences were analyzed separately. At each time point, colored dots indicate the percentages of T/F sequences and their lineages, red dots indicate all recombinants, red squares de novo recombinants, red circles descendants of recombinants, and open diamonds superinfection variants another by a small number of random mutations, many sampled at this or earlier time points. Furthermore, excluding recombination detection tools classify each resulting sequence later time points as parentals vastly reduces the multiple-testing as a recombinant rather than identifying the entire cluster as a corrections needed. lineage from a single recombinant founder strain. As a result, recombination rates can be overestimated. RAPR avoids over- counting descendants as independent recombination events by In vivo recombination rate. To investigate potential mechanisms breaking the alignment into clusters that are likely to have for the rapid loss of T/F viruses and their lineages during early originated either via mutation from one of the original T/Fs or infection, we developed a mathematical model of HIV-1 evolu- from a single recombination event and assigning a single tion that includes recombination. The model considers cells that “founder” (either a T/F or the original recombination event) for can be uninfected, infected with one viral variant, or coinfected each cluster (Fig. 4). with multiple variants (Supplementary Fig. 16), and explores the RAPR assumes that parental sequences are not sampled later impact of coinfection and selection on the prevalence of recom- than the recombinant, and therefore when resolving parental binants over time. In the absence of selective advantage of strains versus recombinants, the program does not allow later recombinants, a moderate frequency of cell coinfection by two time point sequences to be parents of earlier recombinants. different viral variants (on average 3.3%, range 0.7–8.0% among However, one needs to be careful when interpreting results in that the 9 participants) was sufficient to explain the rapid accumula- some earlier lineages may in fact be latent or escape sampling and tion of recombinants and the loss of T/F viruses when 3′ half then reappear at later time points. When in doubt, the user genome sequences were analyzed (Table 2 and Supplementary should consider multiple runs, with and without specifying Fig. 17a). Modeling 5′ half genome sequences showed similar sequence time points. In our study, we make the simplifying results (Supplementary Table 3 and Supplementary Fig. 17b). assumption that the sampling at each time point is adequate, so Alternatively, accumulation of recombinants could also be that sequences close to the parents are expected to be already explained by their moderate selective advantage over T/F viruses, NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 5 | | | % ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ranging from 0.034 per day to 0.5 per day. Either mechanism recombinants, one or few mutational clusters arising from could predict the time at which 50% of the viral population had recombinants would be highly represented. Given that selective been replaced by recombinants; for example, the “coinfection” forces due to immune pressure are subject to change over time, model estimated that the median half-time (T ) across 9 parti- and that new resistant forms of the virus can arise over time, such 1/2 cipants was 27 days for 3′ half genome sequences (Table 2). dominance may be transient. While the replacement of the T/F Similar results were obtained with 5′ half genome sequences lineages was most frequently due to accumulation of novel (de (Supplementary Table 3). novo) recombinants (Fig. 3), there were some exceptional time Deeper understanding of the relative contribution of these two points in some participants where a newly arisen recombinant mechanisms in accumulation of recombinants can be gleaned lineage dominated the sample, indicative of positive selection from a detailed analysis of recombination: if the replacements acting on a recombinant (Fig. 3). The most striking examples of were driven by a selective advantage of a few particular this were in the 3′ half from CH0078 at day 453, with one recombinant form having 26 descendants dominating the sample (Fig. 3a, Supplementary Fig. 4a), and in CH0654 at day 84, with a T/F a cluster recombinant form with 16 descendants dominating the sample T/F b cluster T/F c cluster (Fig. 3b, Supplementary Fig. 9b). Furthermore, all participants T/F d cluster had recombinant forms that were found to be repeated multiple T/F e cluster times, with descendent lineages sometimes persisting for weeks T/F f Rec (de novo) (Fig. 3, Supplementary Figs. 3–11). These observations suggest Rec (descendant) selection may have played a role in transient emergence of some Rec cluster 4 recombinant lineages. T/F c The number of de novo breakpoints observed correlated better T/F e with the number of novel mutations than with viral load (VL), consistent with the hypothesis that periods of more intense T/F d positive selection may influence both patterns of emergent recombinants and mutation. The coincidence in the timing of T/F f Rec cluster 3 periods of peak accumulation of new base substitutions and recombination breakpoints (Fig. 5) was statistically significant for the 3′ half genomes (p < 0.015 by one-sided sign test) but not for Rec cluster 1 the 5′ halves. While recombination requires coinfection, novel T/F a mutations are independent of it. This suggests that neither of the T/F b two simple models described above captured the full in vivo dynamics observed in our data, and that there may have been periods of increased selective pressure driving changes in the quasispecies, evident as coincident peaks in de novo recombina- tion and base mutation events in the longitudinal sample. Thus, Rec cluster 2 while occasionally selection can favor the dominance of one or Fig. 4 Clustering graph of the 3′ half genome sequences from CH0010. few recombinant forms, diversifying selection can also periodi- Each circle represents a sequence, and circles of the same color represent a cally act to increase the prevalence of both new mutations and T/F lineage. Gray circles represent de novo recombinant sequences and recombinants. Therefore, both factors—high VL that enhances open circles represent descendants of recombinant sequences. Clusters are coinfection of cells and selection—may have contributed to the calculated using a greedy algorithm and they originate either from a T/F rapid appearance of recombinants. lineage or through recombination across lineages. Recombination clusters are shaded and color coded to indicate their closest T/F lineage. At most Comparison with mutation rates. When combining the RAPR one sequence per recombination cluster is called a de novo recombinant: output across all 9 participants, we observed that while the rates the rest are assumed to be descendants of this recombinant sequences. of accumulation of new mutations and new observed recombi- −5 Light blue lines point to recombinant parents and green lines to the nation events were of the same order (~10 per sequence per recombinant child nucleotide per day; Fig. 5), the rate of new base mutations at any Table 2 Estimated rates of coinfection, the percentage of coinfected cells, and the half replacement time by recombinants during the infection (3′ half genome) Subject Coinfection rate (/day) % Coinfected cells T (days) 1/2 CH0010 0.067 (0.051, 0.093) 3.3 23.3 (19.3, 27.6) CH0078 0.088 (0.081, 0.095) 4.2 47.3 (44.6, 50.5) CH0200 0.056 (0.033, 0.094) 2.7 29.5 (21.9, 41.8) CH0047 0.014 (0.008, 0.022) 0.7 39.2 (31.2, 55.1) CH0228 0.103 (0.068, 0.153) 4.9 27.0 (24.3, 31.0) CH1754 0.149 (0.123, 0.177) 6.9 19.8 (18.9, 21.0) CH0654 0.035 (0.027, 0.044) 1.7 27.4 (24.8, 30.6) CH1244 0.175 (0.121, 0.27) 8.0 19.5 (16.8, 22.8) CH0275 0.024 (0.014, 0.039) 1.2 114.0 (73.9, 172.0) Median 0.067 3.3 27.0 The rate of coinfection is given as γβI, and the percentage of coinfected cells during the infection is F = γβI/(γβI + γβT) with γβT=2/day. T indicates the predicted time when the frequency of c 1/2 recombinants reaches 50% of the viral population. Coinfection rate was estimated by fitting 3′ and 5′ data simultaneously using maximum likelihood (see Methods for more detail). In CH0654, all sequences were recombinants at day 84 before the initiation of ART at day 112. Therefore, ART in CH0654 did not affect the analysis 6 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE New mutations/seq/day/bp New breakpoints/seq/day/bp Viral load CH0010 3′ half CH0010 5′ half CH0047 3′ half CH0047 5′ half 10^7 8e−05 10^6 6e−05 10^5 4e−05 10^4 2e−05 926 72 188 9 26 72 188 19 205 708 19 205 708 CH0078 3′ half CH0078 5′ half CH0200 3′ half CH0200 5′ half 10^7 8e−05 10^6 6e−05 10^5 4e−05 10^4 2e−05 11 46 88 145 208 453 11 46 88 145 208 453 11 74 409 696 11 74 409 696 CH0228 3′ half CH0228 5′ half CH0654 3′ half CH0654 5′ half 8e−05 10^7 6e−05 10^6 10^5 4e−05 10^4 2e−05 19 112 196 707 19 112 196 707 15 47 84 432 15 47 84 432 CH1754 3′ half CH1754 5′ half CH1244 3′ half CH0275 3′ half 10^7 8e−05 10^6 6e−05 10^5 4e−05 10^4 2e−05 14 77 189 357 441 609 776 888 14 77 189 357 441 609 776 888 12 40 108 188 277 375 435 12 187 Days from infection Fig. 5 Comparison of new breakpoint and mutation rates. Each graph shows the accumulation rate of new breakpoints per sequence per nucleotide per day (blue lines), new mutations per sequence per nucleotide per day (red lines), and log-scale viral loads (green lines) in both half genomes over time for all nine participants. Upper VL detection limit is 750,000 copies/ml. The gray vertical line in CH0200 3′ half highlights the time point at which a superinfecting T/F was observed. The sequences from the last time point (day 432) in CH0654 were excluded from the analysis due to the initiation of ART at day 112 time point (per sequence, per nucleotide, per day) was always Recombination hotspots. One open question when studying higher than the rate of new breakpoints (per sequence, per recombination is whether breakpoints are uniformly distributed nucleotide, per day) in both genome halves across all participants, across the HIV genome or whether instead they cluster pre- with the only exception of the 5′ half genome in CH0228 (Fig. 5). ferentially in certain regions (called hotspots) while leaving others Both recombination and mutation rates tended to decrease with relatively intact (cold spots). To address this, we calculated hot time. The difference in rates between recombination and muta- and cold spots of recombination across the 3′ half genomes of the −8 −4 tion was significant (p = 1.9 × 10 and p = 1.4 × 10 , respec- 9 participants using a sliding window of 20 nucleotides, and tively, one-sided paired Wilcoxon test), but they pertain only to found that hotspots were mostly clustered in Env, before or after detected recombination breakpoints. As discussed previously, the variable loops, whereas regions where two genes overlapped breakpoints are detectable only when the parental strains are (vif/tat and vpu/env) carried several cold spots (Fig. 6). distinctive enough that exchanged blocks carry within them dis- tinctive patterns of bases, and therefore the detected recombi- Increased genetic complexity among recombinants. Across all nation rate is an underestimate of the true recombination rate. nine participants, diversity within recombinant sequences was higher than intra-T/F lineage diversity but lower than inter-T/F diversity (Fig. 7). Previously, we observed that in homogeneous NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 7 | | | Viral loads (copies/ml) Rate (per seq per day per bp) ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 Combined recombination breakpoints Hotspots Cold spots Nef Vif Tat2 Tat1 Vpu Rev2 Vpr V1 V2 LoopD V4 V5 Rev1 Env 5000 6000 7000 8000 9000 HXB2 positions Fig. 6 Identification of recombination hotpots in HV-1 genome. Recombination hotspots are shown in dark orange and cold spots in blue across the 3′ half genome. Position numbering is relative to HXB2. Each line represents a position, and the thickness of the colored regions represents consecutive positions. The sequences from the last time point (day 432) in CH0654 were excluded from the analysis due to the initiation of ART at day 112 infections recombination lowers the rate at which diversity where the parental strains were discordant. The glycosylation site accumulates in the viral population . In these nine hetero- conferring mutation T234N was also found to be preferentially geneous infections, recombination not only tends to lower the carried forward by recombination events in CH505. It was first average diversity, but it also broadens the spectrum of genetic detected at day 202 (Fig. 8 and Supplementary Table 4) and by diversity distribution (Fig. 7). We speculate that this mechanism the time there is enough overall diversity for RAPR to detect allows the virus to explore a wider range of phenotypes, and recombinants (day 538), most sequences carry the glycosylation therefore potentially more likely escapes from immune responses. site. There were only 6 recombinants from day 538 where the Similar patterns of recombination were observed in vitro (Sup- parental strains are heterogeneous at this position (i.e., only one plementary Fig. 18). of the parents carries the glycosylation site), and in all events the recombinant always carries the resistance-conferring genotype (Supplementary Fig. 19b), indicating that recombination con- Recombination analysis in a homogeneous infection. Because tributes to the rate at which this mutation reached fixation. We of its sensitivity in detecting recombination even in low-diversity hypothesized that the glycosylation site mutation at position 234 scenarios, RAPR also proves useful when studying viral evolution could confer resistance to autologous antibodies, and confirmed by recombination in homogeneous infections, once enough experimentally that it indeed confers resistance to early lineage diversity has accrued to enable its detection. We show one CD4bs antibodies isolated from CH0505 (Supplementary example of this application, and use it to address the question of Table 4). Overall, RAPR found 17 unique recombination events whether recombination contributes to the emergence of resis- and 11 descendants, all in the last two time points from CH0505. tance. We used viral and neutralization data from CH0505 who 38 24 23 27 The other tested tools (RAT , RDP4 , RECCO , and GARD ) developed broadly neutralizing antibodies (bnAbs) over the identified between 0 and 18 recombinants. course of several years, and whose antibody lineages and neu- 30,36,37 tralization activity have been well characterized . The first Increased resistance of recombinants to neutralization. mutations conferring resistance arose at day 90 (Fig. 8a). New CH0010 also developed potent neutralizing antibodies against all mutations continued to accumulate, and the ones conferring five T/F Env pseudoviruses over time (Supplementary Table 5). resistance persisted at later time points. Two CD4 binding site To determine if recombinants would allow viruses to escape from (CD4bs) targeting B-cell lineages were found in this individual, later developed autologous neutralizing antibodies, we looked at and by week 100 (day 692) into the infection, the virus developed the sequences sampled at day 72 and generated Env pseudo- complete resistance to the CH235 antibody lineage and to half of viruses from two non-recombinants and three recombinants the antibodies from the CH103 lineage (Fig. 8b). As expected, (Supplementary Fig. 20 and Supplementary Table 5). The two given the homogeneity of a single T/F virus infection, RAPR does non-recombinant viruses could only be neutralized by the later not find any evidence of recombination at the early time points. time point plasmas (days 188 and 275). However, none of the By day 538, however, it finds that 25 sequences out of 31 are three different recombinants from day 72 were neutralized by all recombinants, and by day 692, all sampled sequences are either plasmas, ignoring the weak neutralization of one day 72 Env novel recombinants or descendants of recombination events pseudovirus by the plasma from day 188 (Supplementary (Fig. 8c). Table 5). This indicated that recombinants were more likely be Interestingly, the mutations that confer CH505 autologous resistant to later autologous neutralizing antibodies than non- antibody resistance were maintained in the diversifying quasis- recombinants. pecies through recombination events. For example, the pair of mutations V281G and N279D in the D loop appeared first at day 202 and day 55, respectively, and each mutation confers partial Discussion resistance to the CH235-lineage antibodies (Supplementary We developed a new analytical tool, RAPR, that allows for Table 4), with a strengthening of such resistance when they computationally efficient detection of recombinants in long- 30,36 appear together . RAPR finds that when parental strains are itudinal samples from recently infected individuals. RAPR clas- heterogeneous at position 279, the recombinant carries the sifies sequences depending on whether they originated from the resistant D genotype 10 out of 12 times. When the parental T/Fs via mutation or from a recombination event between T/F strains are heterogeneous at position 281, the recombinant carries lineages. Compared to existing programs like RDP4 and the G-resistant form 9 out of 10 times. Seven out of 8 times both GARD , our simple simulations indicate that RAPR is more mutations are present in one of the parents, and both are sensitive in detecting recombination events particularly in low- inherited by the recombinant (Supplementary Fig. 19a). There- diversity settings (e.g., about 1%). RAPR has a built-in mechan- fore, whether at position 279, 281, or both, recombination carries ism to constrain later sequences from being considered as par- the resistance mutations forward in the vast majority of cases ental strains of earlier sequences. RAPR sensitivity, detailed 8 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE 3′ Half genome CH0010 CH0078 CH0200 n =20 n =51 n =68 n =159 2.0 3.0 2.5 n =54 n =59 2.0 1.5 2.0 1.5 1.0 1.0 n =151 n =8 1.0 0.5 0.5 n =33 n =17 n =26 n =21 0.0 0.0 0.0 CH0047 CH0228 CH1754 2.5 1.5 5.0 n =38 n =29 n =138 n =182 n =35 n =29 2.0 4.0 1.0 1.5 3.0 1.0 2.0 0.5 n =21 n =8 n =20 n =6 0.5 1.0 n =63 n =84 n =15 n =9 n =9 0.0 0.0 0.0 CH0654 CH1244 CH0275 n =119 6.0 2.0 1.5 n =99 n =35 n =74 n =23 1.5 n =7 4.0 1.0 1.0 2.0 0.5 0.5 n =58 n =16 n =22 n =4 n =31 n =27 n =9 n =15 n =25 n =6 0.0 0.0 0.0 5′ Half genome CH0010 CH0078 CH0200 1.5 1.0 1.5 n =45 n =51 n =24 n =19 0.8 n =15 n =15 1.0 1.0 0.6 0.4 0.5 0.5 n =48 n =3 n =8 n =6 n =14 n =4 0.2 0.0 0.0 0.0 CH0047 CH0228 CH1754 n =17 n =21 1.0 1.5 2.5 n =75 n =15 n =9 n =48 0.8 2.0 1.0 0.6 1.5 0.4 1.0 0.5 0.2 n =5 n =4 0.5 n =32 n =17 n =6 n =7 n =11 n =11 n =10 0.0 0.0 0.0 CH0654 2.5 n =118 n =111 2.0 1.5 1.0 n =20 n =27 n =4 n =11 n =42 n =7 n =4 0.5 0.0 Fig. 7 Genetic p-distances within lineages, inter-lineages, and within recombinants. Pairwise genetic distances (p-distance) were calculated among recombinant viruses, variants evolved from the same T/F virus (intra-T/F), as well as variants from different T/F viruses (inter-T/F) for 3′ half (a) and 5′ half (b) genomes. For intra-T/F diversity, T/F lineages with less than three sequences were excluded. Pairwise genetic distances are shown in black dots, and their means in blue lines. All data were calculated by combining the viral sequences from all time points up to the time that all of the viruses were recombined. The middle line indicates the median of the diversity; the box shows 25 to 75 percentile of the data; and the whisker shows 10 to 90 percentile of the data. The number of sequences for each pairwise comparison group is indicated at the top of the plot NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 9 | | | Intra-T/Fa Intra-T/Fb Intra-T/Fc Intra-T/Fd Intra-T/Fe Intra-T/Ff Intra-T/Fg Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Rec Intra-T/Fa Inter-T/Fs Intra-T/Fb Intra-T/Fa Intra-T/Fc Intra-T/Fd Intra-T/Fb Intra-T/Fe Intra-T/Fa Rec Rec Inter-T/Fs Intra-T/Fb Inter-T/Fs Intra-T/Fb Intra-T/Fc Rec Intra-T/Fd Intra-T/Fe Inter-T/Fs Intra-T/Ff Intra-T/Fg Intra-T/Fa Intra-T/Fi Rec Intra-T/Fb Inter-T/Fs Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Intra-T/Fc Intra-T/Fd Intra-T/Fe Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Rec Inter-T/Fs Intra-T/Fa Rec Inter-T/Fs Genetic p -distance (%) Genetic p -distance (%) ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ab c T/F w4.3 Color key/histogram w4.26 w4.10 w4.27 1000 w4.29 w4.8 w4.37 w4.51 w4.56 w4.16 w4.13 w4.14 w4.11 600 w4.15 w14.21 w14.3 w14.4 400 w14.6 w14.19 w14.17 w14.20 w14.8 w14.2 w14.29 w14.34 w14.12 w14.10 w14.16 w14.30 w14.31 w14.32 w14.39 w20.27 Value w20.4 w20.19 w20.25 w20.14 w20.7 w20.9 w20.23 w20.24 Week 4/day 19 w20.26 w20.11 w20.22 Week 14/day 90 w20.8 w20.2 Week 20/day 132 w20.3 w20.21 w20.15 Week 30/day 202 w30.28 w30.13 Week 53/day 362 w30.24 w30.5 w30.34 Week 78/day 538 w30.37 w30.12 w30.10 Week 100/day 692 w30.19 w30.21 w30.18 w30.32 w30.6 w30.15 w30.17 w30.31 w30.36 w30.25 w30.8 w30.23 w30.26 w30.20 w30.27 w30.9 w53.31 w53.13 w53.28 w53.22 w53.6 w53.11 w53.19 w53.14 w53.32 w53.3 w53.25 w53.16 w53.17 w53.15 w53.27 PNG 279–281 w53.8 w53.10 T234N (D Loop) w53.29 w53.9 w78.15 w78.1 w78.3 w78.10 w78.8 w78.7 w78.33 w78.6 w78.14 w78.17 w78.38 w78.25 w78.4 w78.9 w78.5 w78.16 w100.B4 w100.A3 w100.A5 w100.B6 w100.A10 w100.B8 w100.A11 w100.B3 w100.B7 w100.A2 w100.A13 w100.A6 w100.A4 w100.A12 234 279–281 Fig. 8 Sequence and neutralization analysis of viruses from homogeneously infected CH0505. a Syn/nonsyn highlighter plot shows the synonymous (green) and nonsynonymous (red) mutations compared to the T/F sequences. b Autologous neutralization analysis. Multiple Env pseudoviruses were generated from seven time points from CH0505 over 2 years of infection. Neutralization susceptibility of 123 pseudoviruses were determined with 13 CH103-lineage antibodies (right panel) and neutralization susceptibility of 33 pseudoviruses were determined were determined with 16 CH235-lineage antibodies (right panel). Neutralization magnitude is color coded by warm colors, with red being the strongest. Blue color indicates that no neutralization activities were detected with neutralizing antibodies at the 50 µg/ml concentration. c Recombinant env genes from days 538 and 692 after infection. Recombinants and their breakpoints, as determined by RAPR, are shown. Recombination breakpoints are shown as black boxes, and they are filled when the breakpoints are significant output, and unique time-of-sampling features enabled an used to evaluate the effectiveness of a prevention or vaccination informed exploration of the timing of replacement of the T/F strategy . This is compatible with previous studies, which have lineages by either recombinants or their descendants. This is found that by reassembling genotypes that have appeared through important for understanding the continuous evolution of the viral independent events, recombination can generate a wider range of population and the rapid exploration for new immunological new phenotypes that would either take much longer or be less escapes. RAPR thus nicely adds to an existing set of other tools likely to appear through mutation alone . This wider spectrum of for detecting and characterizing recombination in vivo that genetic variants generated by recombination might allow viruses results from continuous interplay among distinct viruses during to rapidly adapt to the host environment and outcompete par- 39,40 infection . ental viruses during infection. We applied RAPR to SGA-derived sequences sampled long- We also showed how RAPR can be applied to longitudinal itudinally from 9 heterogeneously infected participants, starting sequences sampled from homogeneous infections. Once adequate from very early in infection. We calculated frequencies of accu- diversity has accumulated in the HIV-1 quasispecies, all triplet mulation of new mutations and new breakpoints, as well as hot combinations of sequences can be tested, and recombination and cold spots for recombination. The accumulation rate of new events framing resistance mutation can be tracked as they emerge mutations was higher than the accumulation rate of new break- over time. The data can then be used to answer the question of points across all participants, although this may be due to the how recombination affects the establishment of key antibody- potential underestimation of recombination events in homo- resistant mutations. logous regions. Even then, recombinants quickly replaced the T/F Recombination rates calculated in vitro tend to generally be 18–20 lineages with a median half-time of 27 days, and the recombi- higher than the ones observed in this study . Bioinformatics nation significantly increased the genetic complexity of a viral recombination tools cannot detect recombination events between population. The rapid loss of T/F lineages should be taken into highly related parental strains within the same lineage or that account if a reduction in numbers of T/F viruses is a parameter occur in highly similar regions of the parental genome. To avoid 10 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | UCA IA8 IA7 IA6 IA5 IA4 IA3 IA2 IA1 CH103 CH104 CH105 CH106 CH31 UCA.1 IA4.1 IA3.1 CH235.11 IA2.1 IA1.1 CH240 CH236 CH235 CH239 CH241 CH235.7 CH235.13 CH235.10 CH235.9 CH235.12 Count −1.72 −1.34 −0.95 −0.57 −0.18 1200 0 0.2 0.59 0.97 1.36 1.74 2400 NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE this limitation, Cromer et al. developed a statistical method that Recombination between different nAb escape mutations in a accounts for template switching modeled on in vitro results. In homogenously infected host likely rendered recombinants to our study, we used a simulation approach in which we introduced more easily escape early maturation members of the bnAbs known numbers of breakpoints in simulated data to estimate detected in the same individual. Taken together, these results strand switching events in vivo. Another caveat in the inter- suggest that extensive in vivo recombination may allow the pretation of our results is that as the sampling times increases and complex recombinant population to have better chances to adjust sequences become more divergent, the chance of a convergent to host immune selection pressure. selection event increases as well, thus dampening RAPR ability to Finally we would like to note that intra-subtype recombination distinguish independent recombination events. The new muta- in HIV was first detected due to the high diversity of the virus, tion and new breakpoint rates were found to peak at coordinated but it has also been observed in less diverse viral species such as 51–53 times (Fig. 5)—a phenomenon particularly striking in the 3′ half hepatitis B viruses, enterovirus, and norovirus . Because of its genomes. This suggests that there may be transient periods of higher sensitivity in low-diversity scenarios, RAPR may prove stronger diversifying selective pressure acting on both recombi- useful in detecting recombination in other viral species where it nation and base-substitution rates during the course of an has been more challenging to detect recombination. infection. The breakpoints detected by RAPR, provided in parallel to the Methods phylogenetic trees, can help clarify lineage relationships. When Study participants. To study HIV-1 recombination longitudinally, we identified recently infected participants whose estimated time from HIV-1 infection to blood combining 3′ half genome data from all nine participants, we sampling was generally less than 1 month. Of 36 individuals from a CHAVI001 found that new breakpoints tended to accumulate around (either infection cohort , 27 were infected with a single virus (homogeneous infection), before or after) the variable regions of Env (hotspots). In contrast, and 9 were infected with multiple ones (heterogeneous infection). The first viral in vitro studies of inter-subtype and CRFs have shown that RNA-positive plasma samples from these 9 heterogeneous individuals were col- lected 9–27 days post infection, as estimated using our tool Poisson Fitter sequence homology drives recombination, with breakpoints (Table 1). This early into the infection, the within-lineage diversity was sufficiently clustering in more conserved regions between the parental low to allow us to infer infecting T/F lineages and to estimate the time since 42,43 44 strains . More recently, Jia et al. analyzed breakpoints in infection using these previously described methods . The number of T/F viruses CRF sequences from the Los Alamos HIV Sequence Database and we were able to identify in the 9 participants ranged from 2 to 9 (Fig. 1, Table 1, and Supplementary Figs. 1 and 2). Given the close similarity of the within-host T/ found hotspots to be clustered at protein overlapping regions, but Fs (the within-host T/F mean base pair distance ranged from 0.76% to 2.93%), we not within the Env protein. The fact that we observe recombi- inferred that each participant had been infected by a single donor and at one single nation hotspots in Env around variable regions during within- time point, with one notable exception: in CH0200 we noticed the appearance of a host evolution could be due to a combination of factors. For new, distinct lineage at day 74 that persisted at later time points. After excluding example, recombination events mechanistically favor regions of the possibility of sample contamination, we concluded that this was an incidence of superinfection. Near-full-length sequences of overlapping half genomes were local homology of greater than 30 bases (maximum efficiency is obtained by SGA for 8 of 9 participants over multiple time points, while only the 3′ seen at 60 bases or longer) to facilitate strand switching during half genome sequences were obtained for CH1244. reverse transcription , although strand transfers were frequently Twelve homogeneously infected and 7 heterogeneously infected individuals had detected at limited homologous regions (<2 bases) in our recent no known protective or disease susceptible HLA alleles (Supplementary Table 6). In order to have enough diversity to detect recombination from early infection, as study . Between highly diverse viruses representing different matter of necessity, we focused on heterogeneously infected participants. As a subtypes, this may severely constrain the rate of recombination result, this study population may have biases relative to single T/F infections in events in the variable env gene. In contrast, the degree of terms of persistence of viral diversity. In addition to these 9 heterogeneous homology throughout the genome in our setting of highly related participants, we also included 341 env sequences from the homogeneously infected CH0505, spanning day 19 through day 692 since infection . T/F variants is much greater, including homology in Env, Longitudinal plasma samples within 2 years of infection were obtained from enabling recombination throughout the genome. This, combined nine heterogeneously infected participants as well as one homogenously infected with the fact that Env is under a particularly high degree of participant. All participants were male. CH0010, CH0047, CH0078, CH0200, selective pressure in vivo due to antibody-mediated selection for CH0228, CH0275, CH1244, and CH1754 were from Malawi, CH0078 from South 47,48 Africa, and CH0654 from the United States. None of these nine individuals immune escape , could explain why we see an enrichment for developed AIDS-like symptoms within 2 years of infection. No individuals were on recombination in variable regions in particular. In turn, the antiretroviral therapy (ART), except CH0654 who was on and off ART since day spectrum of genetic variants generated by recombination during 112 after infection. Therefore, sequences from the last time point (day 432) of infection contributes to the viral diversification that precedes the CH0654 was excluded from all analyses. The first screening plasma samples were collected within weeks from the infection for eight individuals, on or before Fiebig development of antibodies with neutralization breadth during 36,37 stage III, while the screening sample for CH1244 was collected at Fiebig stage IV . the course of an infection . All viruses were subtype C except CH0654, which was subtype B. Written informed The extensive recombination observed in each of these nine consent was obtained from all study participants and the study was approved by participants highlights an important caveat on the interpretation the Duke University Institutional Review Board. of phylogenetic trees that has long been known , but generally underappreciated. Recombination violates the primary assump- Amplification of near-full-length viral genome by SGA. Viral RNA was tions on which bifurcating evolutionary trees are built, yet such extracted from plasma samples or culture supernatant using the PureLink Viral RNA/DNA Mini Kit (Invitrogen, Carlsbad, CA). Complementary DNA was syn- trees are frequently used to represent HIV-1 evolution within a thesized using the SuperScript III reverse transcriptase (Invitrogen, Carlsbad, CA) host (including in this paper), and in the population as a whole. with the primers 07Rev8 (5′-CCTARTGGGATGTGTACTTCTGAACTT-3′;nt The trees provide a useful visualization of the clustering of related 5193–5219 in HXB2) for 5′ half genome SGA, or primer 1.R3.B3R 5′- sequences and a portrait of the acquisition of diversity over time. ACTACTTGAAGCACTCAAGGCAAGCTTTATTG-3′ (nt 9611–9642) for the 3′ half or near-full-length genomes. SGA was performed to obtain the 5′ half, 3′ half, However, in light of the extensive and early recombination or near-full-length HIV-1 genome as described previously . For the 5′ half gen- demonstrated here, it is important to consider such HIV-1 trees ome amplification, the first round PCR was carried out using the primers 1.U5.B1F as representations of clustering patterns of similar sequences, and 5′-CCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGT-3′ (nt 538–571) and not direct evolutionary trajectories. 07Rev8 5′-CCTARTGGGATGTGTACTTCTGAACTT-3′ (nt 5193–5219), and the 10,11 second round PCR with primers Upper1A 5′-AGTGGCGCCCGAACAGG-3′ (nt Recombination plays a role in escape from T-cell responses 634–650) and Rev11 5′-ATCATCACCTGCCATCTGTTTTCCAT-3′ (nt and bnAbs . We now show that recombination between differ- 5041–5066). To amplify the 3′ half genome, the first round PCR was performed ent T/F env genes in a heterogeneously infected individual were using the primers 07For7 5′-CAAATTAYAAAAATTCAAAATTTTCGGGTT- more resistant to the later autologous neutralizing antibodies than TATTACAG-3′ (nt 4875–4912) and 2.R3.B6R 5′-TGAAGCACTCAAGG- the T/F lineage variants from the same time point. CAAGCTTTATTGAGGC-3′ (nt 9636–9607), and the second round PCR with NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 11 | | | ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 primers VIF1 5′-GGGTTTATTACAGGGACAGCAGAG-3′ (nt 4900–4923) and estimates. In order to account for the heterogeneity of the infection, we divided Low2c 5′-TGAGGCTTAAGCAGTGGGTTCC-3′ (nt 9591–9612). Then, 2 µl of the each first-time-point sequence alignment into single-founder lineages. Sequences first round PCR products were used for the second round PCR. The PCR ther- significantly enriched for hypermutation (p < 0.1 by Fisher's exact test) and mocycling conditions were as follows: one cycle at 94 °C for 2 min; 35 cycles of a sequences with ambiguous nucleotides were excluded from the analysis. Our denaturing step at 94 °C for 15 s, an annealing step at 55 °C for 30 s, an extension method neglects differences between the putative ancestor and the consensus, an step at 68 °C for 5 min; and one cycle of an additional extension at 68 °C for 10 assumption that breaks down for very small sample sizes. For this reason, lineages min. with less than three sequences were excluded, and also because the goodness-of-fit For amplification of near-full-length HIV-1 genome, the first round PCR was test in the Poisson fitter is based on the chi square test, which is unreliable for small carried out using the primers Upper1A and 2.R3.B6R, and the second round PCR samples (less than 4 data points). For each remaining lineage, we approximated the was done with the primers Upper2 5′-CTCTCTCGACGCAGGACTCGGCTT-3′ T/F sequence by the lineage consensus, calculated the Hamming distance (HD) of (nt 681–704) and LTRD 5′-CTGGAAAGTCCCCAGCGGAAAGTC-3′ (nt each sequence in the lineage from this T/F sequence, and then merged all HDs 9460–9437). Then, 2 µl of the first round PCR products were used for the second from lineages of the same alignment into one distribution. Assuming that all T/Fs round PCR. The PCR thermocycling conditions were as follows: one cycle at 94 °C evolved from the same time point, were all equally fit, and accumulated random for 2 min; 10 cycles of a denaturing step at 94 °C for 15 s, an annealing step at 55 °C mutations at the same rate, the combined HDs follow a Poisson distribution. We for 30 s, an extension step at 68 °C for 8 min; 20 cycles of a denaturing step at 94 °C proceeded to estimate the time since infection using this combined distribution for 15 s, an annealing step at 55 °C for 30 s, an extension step at 68 °C for 8 min following methods previously described . In order to estimate the dates post with an incremental 20 s for each successive cycle; and one cycle of an additional infection for participants for which we had alignments covering both the 5′ half extension at 68 °C for 10 min. genome and the 3′ half genome, we first timed each half genome alignment separately, and then averaged the two estimates using a harmonic mean weighted by the inverse of the number of sequences in each genome half, a method that Sequence analysis. The PCR amplicons were directly sequenced by the cycle minimizes the asymptotic sampling variance. sequencing and dye terminator methods on an ABI 3730xl genetic analyzer (Applied Biosystems, Foster City, CA). Individual sequences were assembled and Wald–Wolfowitz Runs Test. Recombination results in contiguous stretches of edited using Sequencher 4.7 (Gene Codes, Ann Arbor, MI). The sequences were aligned using CLUSTAL W, and the manual adjustment for optimal alignment was genetic information inherited, with random mutations, from each parent. Our program, RAPR, checks for patterns of inheritance in every set of three sequences performed using Seaview. Neighbor-joining trees were constructed using the Kimura 2-parameter model with 1000 bootstrap replications. GenBank Access in a given alignment. Specifically, RAPR identifies the sites that differ in exactly one numbers for the newly obtained sequences in this study are: of the sequences and applies the Wald–Wolfowitz Runs Test to check if these MF499156–MF502416. These 3260 new sequences augment preexisting data to sites cluster more than expected of random mutations. This is done as follows. Let provide a unique dataset, including extensive sets of longitudinally sampled A and B be the putative parental strains and C the recombinant one. Each position sequences from individuals infected with multiple T/F viruses. All sequences from in C is either identical to A but not B, identical to B but not A, identical to both, or this study, alignments, and auxiliary data are also available in the HIV special different to either one. In the two latter scenarios we say that the site is non- interest alignments (https://www.hiv.lanl.gov/content/sequence/HIV/ informative. Define an “A run” as the maximal non-empty segment of adjacent SI_alignments/datasets.html). positions in C that are either non-informative or identical to A but not B. Likewise, a “B run” is the set of contiguous sites that are either non-informative or identical to B but not A. Let m be the total number of positions where C is more similar to Generation of infectious molecular clones. The infectious molecular clones A; n the total number of sites where it is more similar to B; and r the total number (IMCs) representing T/F viruses from CH0200 (a, b, and c) and CH0228 (a and b) of “A” or “B” runs. The probability that there are at most r runs just by a chance were chemically synthesized and cloned in pUC57 as described previously . The ordering of independent mutations is given by: virus stocks were prepared from the supernatants of 293T cells (NIH AIDS Reagent Program, cat. number: 103) transfected with IMCs . All cell lines were tested free X PRðÞ  r¼ f ðÞ k ; of mycoplasma contamination. k¼2 Virus culture. Primary CD4 T cells were purified from peripheral blood where the probability of a total of k runs is given by: mononuclear cells from a healthy donor under the protocols approved by the Duke ! ! ! ! University Institutional Review Board. Written informed consent was obtained m  1 n  1 m  1 n  1 from all study participants. Cryopreserved CD4 T cells were thawed and sti- k3 k1 k1 k3 d e b c b c d e 2 2 2 2 mulated for 3 days in RPMI-1640 containing 10% fetal bovine serum (FBS), f ðÞ k ¼ PRðÞ ¼ k¼ m þ n interleukin 2 (IL-2) (32 U/ml; Advanced Biotechnologies, Columbia, MD, cat. number: 03-001-050), soluble anti-CD3 (0.2 µg/ml; eBioscience, San Diego, CA, cat. number: 16-0037-81), and anti-CD28 (0.2 µg/ml; BD Bioscience, San Diego, CA, cat. number: 16-0281-81). The stimulated cells (1 × 10 cells) was seeded into with a and a standing for the least integer not less than and the greatest integer not each well of a 96-well plate, and infected with a mixture of multiple T/F viruses greater than a, respectively (see Supplementary Fig. 21 as an example). (0.001 m.o.i. (multiplicity of infection) for each virus) from each participant (3 T/F viruses for CH0200, and 2 T/F viruses for CH0228) . After absorption at 37 °C for Similarity clusters and correcting for circularity. In cases where multiple viruses 4 h, the cells were washed 3 times with RPMI-1640. The infected cells were cultured in a 24-well plate with 600 µl of RPMI-1640 containing 10% FBS and IL-2 (32 U/ establish an infection, and likely transmitted founder strains (T/Fs) are resolved, T/ Fs can be constrained to be parents in a recombination triplet, thus avoiding ml). The supernatant (200 µl) were harvested at day 4 and then passaged onto fresh CD4 T cells four times. The virus replication was monitored by determination of detection of recombination events that happened in the donor, prior to trans- mission. When the input alignment contains data from multiple time points, RAPR the p24 concentration in the supernatant using the p24 ELISA kit (PerkinElmer, Waltham, MA). The near-full-length genomes from passages 1, 2, and 4 were can assign likely parent–daughter relationships by factoring in the time of sam- pling. As a result, sequences from future time points do not affect the recombi- obtained by SGA to determine the frequency of recombination in vitro. nation inferences at earlier time points. When sampling is sparse, this constraint may be broken by the fact that the actual parental strains may have escaped Neutralization assay. Neutralization activity was measured in a luciferase reporter sampling. In this case RAPR chooses the sequences that are closest to the actual system in TZM-bl cells (NIH AIDS Reagent Program, cat. number: 8129). Plasma parent even though they may have been sampled at a later time point. Multiple samples were heat inactivated at 56 °C for 1 h and were then diluted at a 1:3 serial testing is addressed through a false discovery rate approach applied to each time dilution starting at 1:20. The diluted plasma samples in duplicate were incubated point individually. Using a greedy algorithm, sequences are grouped into closely with viruses for 1 h at 37 °C and then used to infect TZM-bl cells. The 50% related clusters such that differences within each cluster are likely to have arisen by inhibitory dose (ID ) was defined as the plasma dilution at which relative lumi- mutation. Some clusters will have evolved directly from a TF. Others will represent nescence units (RLUs) were reduced by 50% compared with RLUs in virus control recombination events that were indicated in the initial comparison of all sets of wells after subtraction of background RLUs in cell control wells. A response was three sequences. For each recombinant cluster, RAPR calculates the minimum HD considered positive for neutralization if the ID titer was >1:20 dilution. from the closest cluster, and compares it to the minimum number of mutations diverging from the parental strains. Based on which of the two numbers is lower, Determination of dates post infection. Highlighter plots and neighbor-joining the cluster is “explained” as having arisen from another cluster by mutation, or as a phylogenetic trees were generated using the Highlighter tool at the Los Alamos de novo recombination of sequences belonging to other clusters. When a recom- HIV sequence database (https://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/ bination event is called, the sequence in the cluster with the least number of highlighter_top.html). Based on visual inspection of these plots we determined that mutations from the parental strains, the earliest parents, or with the lowest p value all nine participants in this study had been infected by two or more genetically is deemed the initial recombinant, and all other sequences in that cluster that share distinct T/Fs. Days since infection were estimated by applying the Poisson fitter the same breakpoints are considered “descendants.” 33,54 model on the first time point alignments from both the 3′ half and 5′ half In the realistic case of multiple recombination events, or when one of the genome sequences and then taking the harmonic average between the two time parental strains escapes sampling, a triplet will yield a low p value when parents 12 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE and child swap roles. When such parent–daughter cycles are found, RAPR replaces can be estimated as Σ f (b≥m) p (β). We can use this to find simulated CL on the β β r the parents from the least likely configuration in the cycle with an alternate set, rate of recombination: in particular, the least value of r for which p (b≥m) becomes and, if no such choice is admissible, merges the current cluster to a similar one and greater than 1−α provides a CL for the rate. deem the recombinants “descendants”. For more details see our tool help page All simulations were run using R . The simulated recombinants, together with (https://www.hiv.lanl.gov/content/sequence/RAP2017/help.html). the parental sequences used to generate them, are available at the following link: (https://www.hiv.lanl.gov/repository/hivdb/ Simulated_Data_for_RAPR_NatComm_Paper). RAPR is available for public use Breakpoint determination and frequency estimation. For each recombinant, on the LANL HIV database (www.hiv.lanl.gov/content/sequence/RAP2017/rap. RAPR identifies intervals in which the breakpoints are most likely to have hap- html). All software is developed in C++, R, and cgi. The R code utilizes functions 61 62 pened. Each breakpoint interval is examined and the run-test p value is recalculated from the packages ape and seqinr . after removing the run next to it. If this new p value is less significant than the overall p value, we conclude that the breakpoint is contributing to the significance Neutralization analysis of viruses from CH0505. To determine the impact of of the recombination. For each time point in the alignment, RAPR calculates two recombination on the emergence of antibody resistance, we ran RAPR on 341 env frequencies: the number of de novo breakpoints per sequence and the number of sequences from CH0505 , spanning day 19 through day 692 since infection. Of new mutations per sequence, where by “new mutation” we mean any nucleotide these, 123 of these env genes were used to generate pseudoviruses which were tested change that is not present in any of the T/Fs. Each mutation and breakpoint is against a panel of 13 autologous CH103-lineage antibodies, and 33 of the 123 were counted only at the time it appears for the first time. These frequencies are then additionally tested against a panel of 16 autologous CH235-lineage antibodies. converted to rates by dividing by the sample time interval over which these mutations or breakpoints arose. Kendall correlations between these rates and VL were compared using a one-sided sign test. The VL was averaged between con- Identification of recombination hotspots. To explore possible hotspots of secutive time points to account for the fact that mutations and recombination recombination across the 3′ half genomes, we devised the following strategy. For events occurred in the time period between the two consecutive time points when every participant, we counted the appearance of a new breakpoint only once. VL were sampled. We then calculated Kendall correlations τ between the break- Because the program outputs an interval where the recombination breakpoint is point rate (calculated as described above) and VL, and separately between the most likely to have occurred, we took the midpoint of such interval, and then breakpoint rate and the mutation rate. CH0275 was excluded from this analysis considered the cumulative distribution of the positions at which these breakpoints because of too few time points. For the rest of the participants, we observed that 6 occurred across all nine participants. Regions where the cumulative distribution out of 8 times the Kendall τ was higher for the correlation of the breakpoint rate had a steep step upward indicated a high accumulation of breakpoints and with the mutation rate compared to that with the average VL, and the other two therefore hotspots. Conversely, regions where the cumulative distribution was flat times they were equal. This was statistically significant (p < 0.015) when applying a indicated potential cold spots. To assess this with statistical significance, we con- one-sided sign test. For the 5′, we had only seven sets, and five of these showed the sidered a sliding window of 20 nucleotides and for each window we calculated the same trend, but in the other two, the breakpoint rate was more positively correlated slope of the line that best fitted the cumulative distribution within said window. If with the VL. This was not statistically significant (p > 0.23). there were hotspots of recombination within the given window, we expected the slope of the best fitting line to rise sharply, or to lower in areas of cold spots. We initially explored various window sizes, between 10 and 50 nucleotides, and we saw Breakpoint simulation. In order to estimate the potential bias introduced by that decreasing to smaller sizes from 20 nucleotides did not add any additional missing breakpoints that fall in regions of homology between parental strains, we information, while increasing the size caused several potentially interesting regions devised the following simulation. First, to encompass the time when recombination to go undetected. To smoothen out the differences across participants, we repeated happened, we chose participants for whom RAPR had not detected recombinants the sliding window algorithm 9 times, each time removing one of the participants. of recombinants at the first time points. These were the 5ʹ half from CH0047 and For each iteration we considered all positions corresponding to the upper quartile CH0200 (day 19 and 11 respectively), and the 3ʹ half from CH0228 (day 19) and of the slopes and all positions corresponding to the lower quartile, and then took all CH1244 (days 12 and 15). To avoid overestimating the frequency of template the positions that fell either in the upper or the lower quartile every time we ran the switching, only sequences from the early time points of 4 participants (CH0047, algorithm. CH0200, CH0228, and CH1244) were selected because they were sampled very early and had no evidence yet of recombination of recombinants. Cromer et al. Genetic distances. Genetic complexity among lineages was measured as the mean estimated 5 to 14 template switches per full genome, so for our half genome pairwise p-distances. The p-distances (the proportion of sites at which two samples we considered a range of breakpoints from 1 to 8 and for each value we sequences differ) were calculated among recombinant viruses, variants evolved randomly generated 100 artificial recombinants using the sequences from the above from the same T/F virus (intra-TFs), as well as variants from different T/F viruses participants. Each recombinant was generated randomly selecting two parents, one (inter-TFs) for 3′ half and 5′ half genomes using the software MEGA6 . from each of two different T/F clusters. For each set of 100 recombinants, we counted how many breakpoints (significant and not) RAPR detected. We then accumulated, for each true breakpoint between 1 and 8, the total number of Dynamical modeling. We extended the basic mathematical model for HIV breakpoints and the number of statistically significant breakpoints RAPR detected 64 dynamics to include the coinfection of target cells by two different viruses and the (Supplementary Figs. 13 and 14). resulting generation of recombinant viruses. Assuming n T/Fs, let V be the density We also used the same simulations to estimate frequentist 95% confidence of the ith (i = 1… n) T/F virus, and V the density of recombinant viruses. n+1 interval on the true number of breakpoints as follows. For each recombinant Uninfected target cells T become infected with at a rate β’ and produce virus at a i, observed in the four samples mentioned above, we randomly recombined the rate , for i = 1… n + 1. Free virus is cleared at a rate c. Cells infected with the T/F pi respective parents 5000 times, assigning breakpoints randomly drawn from a viruses can be coinfected with other viral strains, and such coinfection leads to the uniform distribution between 1 and 8. For every set of artificial recombinants from generation of recombinant viruses. It is well known that early after infection, the the same parental pair, we took all the recombinants for which RAPR detected the amount of CD4, the main receptor for HIV binding, on the surface of infected cells same amount of breakpoints as in the observed recombinant, and looked at the is downregulated due to the action of the Nef and/or Vpu proteins . Although the distribution of their true breakpoints (Supplementary Fig. 15). We then calculated kinetics of CD4 downregulation in vivo are not known, in our model we assume the lower bound on the 95% CL on the true number of breakpoints. Similarly, we that downregulation is slow and incomplete, and that a cell infected with one viral calculated the lower bound on the 95% CL on the true rate of recombination, i.e., variant can be reinfected with another viral strain. Coinfection occurs at a reduced the minimum rate of recombination necessary to explain the RAPR-observed rate γβ’ . Uninfected and virus-infected cells die at rates d and δ , respectively. The i i recombinant. In particular, the lower level-α CL l on an estimate using statistic m α dynamics of n T/F viruses and a recombinant strain during acute HIV-1 infection (for example, the observed number of breakpoints) for a population parameter μ is then described by the system of differential equations (for example, the true number of breakpoints) is defined as a statistic such that p (l ≤μ)≥α, where p is the probability distribution under the assumption that the nþ1 μ α μ dT ¼ dTðÞ  T T β′ V ; population parameter is μ and the relation holds independent of μ. To calculate 0 i i dt i¼1 such an l when the allowed values of μ are discrete, and the statistic m is sufficient nþ1 for μ, we can proceed to simulate the sampling process for various proposals π for dI ¼ β′ V T  δ I  γI θ β V ; i ¼ 1::n; dt i i i i i ij j j the population parameter μ, and obtain the fraction of samples for which the j¼1 simulated estimate s is greater than the observed estimate m, i.e., f (s≥m). n nþ1 P P dI nþ1 Assuming the simulation is good enough to ignore the variance of f (s), and this ¼ β′ V T  δ I þ γ θ β I V ; nþ1 nþ1 nþ1 nþ1 ij i j dt j i¼1 j¼1 fraction is a non-decreasing function of π, the least value of π at which this fraction dV becomes greater than 1−α defines such an l . We have no reason to believe that the i ¼ p I  cV ; i i i dt RAPR estimate of the number of breakpoints is a sufficient statistic, so the confidence interval determined in this fashion is likely to be overly conservative. Alternatively, if we assume a uniform rate r of recombination events where I and I is the density of cells infected with the T/F viruses and the i n+1 per sequence position, then the true number of recombination breakpoints s in a recombinant virus, respectively, and θ =1if i≠j and θ =0 otherwise. It should be ij ij sequence of length L are expected to be distributed as a binomial with size L−1 and noted that in this simple model we assumed that coinfection with two different probability r. Then, the probability of observing m or more breakpoints p (b≥m) viruses leads to a generation of the recombinant virus. NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 13 | | | ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 Previous studies have indicated that the virion clearance rate c ranges from 9 to 4. Simon-Loriere, E. & Holmes, E. C. Why do RNA viruses recombine? Nat. Rev. 36 per day which is much faster than the estimated lifespan of infected cells (δ =1 Microbiol. 9, 617–626 (2011). per day) . Therefore, given fast dynamics of the virus, the density of a given viral 5. Brown, R. J. et al. Intercompartmental recombination of HIV-1 contributes to variant is approximately proportional to the density of infected cells producing this env intrahost diversity and modulates viral tropism and sensitivity to entry p β′ p i i i variant, V ¼ I which simplifies the model. By replacing β ¼ and r =β T−δ , i i i i i i c P c inhibitors. J. Virol. 85, 6024–6037 (2011). I nþ1 and denoting the frequency of the ith variant as f ¼ and I ¼ I as the total i I i¼1 i 6. Charpentier, C., Nora, T., Tenaillon, O., Clavel, F. & Hance, A. J. Extensive number of infected cells we obtain the simplified model recombination among human immunodeficiency virus type 1 quasispecies makes an important contribution to viral diversity in individual patients. J. nþ1 df ¼ f r  r f  γβIfðÞ 1  f ; i ¼ 1¼ n; Virol. 80, 2472–2482 (2006). i i j j i i dt j¼1 7. Nishimura, Y. et al. Recombination-mediated changes in coreceptor usage nþ1 n nþ1 confer an augmented pathogenic phenotype in a nonhuman primate model of P P P df nþ1 ¼ f r  r f þ γβI θ f f ; nþ1 nþ1 j j ij i j HIV-1-induced AIDS. J. Virol. 85, 10617–10626 (2011). dt j¼1 i¼1 j¼1 8. Nora, T. et al. Contribution of recombination to the evolution of human nþ1 immunodeficiency viruses expressing resistance to antiretroviral treatment. J. dI ¼ I r f : dt i i Virol. 81, 7620–7628 (2007). i¼1 9. Moutouh, L., Corbeil, J. & Richman, D. D. Recombination leads to the rapid emergence of HIV-1 dually resistant mutants under selective drug pressure. To further simplify the analysis, we assume the infectivity of all variants to be Proc. Natl. Acad. Sci. USA 93, 6106–6111 (1996). 10. Ritchie, A. J. et al. Recombination-mediated escape from primary CD8+ the same (β =β=const), while allowing for potential differences in the rate of virus production from cells infected by variant V . T cells in acute HIV-1 infection. Retrovirology 11, 69 (2014). pi i The dynamics of the recombinant viruses in the population are determined by 11. Streeck, H. et al. Immune-driven recombination and loss of control after HIV two processes: accumulation due to selective advantage (s ¼ r  r where superinfection. J. Exp. Med. 205, 1789–1796 (2008). nþ1 P P n n r ¼ r f = f ) and accumulation due to high degree of cell coinfection with 12. Liu, S. L. et al. Selection for human immunodeficiency virus type 1 j j j j¼1 j¼1 two different viruses (γβI). The rate of coinfection of cells with 2 different viruses is recombinants in a patient with rapid progression to AIDS. J. Virol. 76, γβI approximately given by F ¼ (derivation not shown). To estimate the rate of c 10674–10684 (2002). γβIþβT coinfection of cells with different viruses we assume that the density of target cells 13. Batorsky, R. et al. Estimate of effective recombination rate and average remains constant after peak infection, and the total number of infected cells is selection coefficient for HIV in chronic infection. Proc. Natl. Acad. Sci. USA constant. Extensions of the model relaxing this assumption will be presented 108, 5661–5666 (2011). elsewhere. We fit the model predicting the dynamics of the frequency of the 14. Neher, R. A. & Leitner, T. Recombination rate and selection strength in HIV recombinant viruses in each individual to the experimentally measured frequency, intra-patient evolution. PLoS Comput. Biol. 6, e1000660 (2010). where the initial frequency of the different viral variants is taken directly from the 15. Shriner, D., Rodrigo, A. G., Nickle, D. C. & Mullins, J. I. Pervasive genomic data. Alternatively, we assumed that recombinants were present at some frequency recombination of HIV-1 in vivo. Genetics 167, 1573–1583 (2004). and their accumulation was due to selective advantage s which was estimated by 16. Gao, F. et al. Unselected mutations in the human immunodeficiency virus type fitting the model (reduced to a logistic equation) to data. The model is fitted using 1 genome are mostly nonsynonymous and often deleterious. J. Virol. 78, maximum likelihood method as was recently described elsewhere . Confidence 2426–2433 (2004). intervals for estimated parameters were calculated by resampling the data using 17. Mansky, L. M. & Temin, H. M. Lower in vivo mutation rate of human Jefferey’s intervals for binomial proportions with 1000 simulation runs. Because immunodeficiency virus type 1 than that predicted from the fidelity of purified of the limited data, two parameters (the coinfection rate and selection coefficient) reverse transcriptase. J. Virol. 69, 5087–5094 (1995). could not be accurately estimated from the data. Therefore, in a set of analyses we 18. Onafuwa, A., An, W., Robson, N. D. & Telesnitsky, A. Human fixed the selection coefficient to various values and estimated the coinfection rate. immunodeficiency virus type 1 genetic recombination is more frequent than Because of two half genomes available for some of the time points, for those that of Moloney murine leukemia virus despite similar template switching individuals we have two independent measurements of the frequency of different rates. J. Virol. 77, 4577–4587 (2003). viral variants, and these provide two estimates of the coinfection rate based on the 19. Zhuang, J. et al. Human immunodeficiency virus type 1 recombination: rate, dynamics of recombination at the 3′ and 5′ half genomes. Consistency of the fidelity, and putative hot spots. J. Virol. 76, 11273–11282 (2002). estimated coinfection rates varied by patient. In some cases (e.g., CH0654, 20. Schlub, T. E., Smyth, R. P., Grimm, A. J., Mak, J. & Davenport, M. P. CH0200), we found nearly identical estimates for coinfection rates using 3′ and 5′ Accurately measuring recombination between closely related HIV-1 genomes. data (Table 2 and Supplementary Table 3). However, in other instances, estimates PLoS Comput. Biol. 6, e1000766 (2010). for predicted coinfection rate were different between 3′ and 5′ half data (e.g., 21. Salazar-Gonzalez, J. F. et al. Deciphering human immunodeficiency virus type CH0228, CH0078; Table 2 and Supplementary Table 3). Underlying reasons for 1 transmission and early envelope diversification by single-genome such difference are not entirely clear but may be related to the different numbers of sequences analyzed. All models were implemented and run in Mathematica (ver. amplification and sequencing. J. Virol. 82, 3952–3970 (2008). 22. Siepel, A. C., Halpern, A. L., Macken, C. & Korber, B. T. A computer program 11.2). designed to screen rapidly for HIV type 1 intersubtype recombinant sequences. AIDS Res. Hum. Retroviruses 11, 1413–1416 Data availability. All newly generated nucleic acid sequences in the current study (1995). were deposited in GenBank with accession numbers MF499156–MF502416. RAPR 23. Maydt, J. & Lengauer, T. Recco: recombination analysis using cost is available for public use on the LANL HIV database (http://www.hiv.lanl.gov/ optimization. Bioinformatics 22, 1064–1071 (2006). content/sequence/RAP2017/rap.html). The simulated recombinants are available at 24. Martin, D. P., Murrell, B., Golden, M., Khoosal, A. & Muhire, B. RDP4: the following link: (https://www.hiv.lanl.gov/repository/hivdb/ detection and analysis of recombination patterns in virus genomes. Virus Evol. Simulated_Data_for_RAPR_NatComm_Paper). All relevant data are available 1, vev003 (2015). upon request. 25. Posada, D. & Crandall, K. A. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc. Natl. Acad. Received: 31 July 2017 Accepted: 10 April 2018 Sci. USA 98, 13757–13762 (2001). 26. Posada, D. & Crandall, K. A. The effect of recombination on the accuracy of phylogeny estimation. J. Mol. Evol. 54, 396–402 (2002). 27. Kosakovsky Pond, S. L. et al. An evolutionary model-based algorithm for accurate phylogenetic breakpoint mapping and subtype prediction in HIV-1. PLoS Comput. Biol. 5, e1000581 (2009). 28. Bradley, J. Distribution-Free Statistical Tests, Chapter 12 (Prentice-Hall, References Englewood Cliffs, 1968). 1. Simmonds, P. et al. Discontinuous sequence change of human 29. Takahata, N. Comments on the detection of reciprocal recombination or gene immunodeficiency virus (HIV) type 1 env sequences in plasma viral and conversion. Immunogenetics 39, 146–149 (1994). lymphocyte-associated proviral populations in vivo: implications for models of 30. Gao, F. et al. Cooperation of B cell lineages in induction of HIV-1-broadly HIV pathogenesis. J. Virol. 65, 6266–6276 (1991). neutralizing antibodies. Cell 158, 481–491 (2014). 2. Sabino, E. C. et al. Identification of human immunodeficiency virus type 1 31. Cromer, D., Grimm, A. J., Schlub, T. E., Mak, J. & Davenport, M. P. envelope genes recombinant between subtypes B and F in two Estimating the in-vivo HIV template switching and recombination rate. AIDS epidemiologically linked individuals from Brazil. J. Virol. 68, 6340–6346 30, 185–192 (2016). (1994). 3. Zhang, M. et al. The role of recombination in the emergence of a complex and dynamic HIV epidemic. Retrovirology 7, 25 (2010). 14 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE 32. Pacold, M. E. et al. Clinical, virologic, and immunologic correlates of HIV-1 62. Charif, D., Lobry, J. R. SeqinR 1.0-2: A Contributed Package to the R Project intraclade B dual infection among men who have sex with men. AIDS 26, for Statistical Computing Devoted to Biological Sequences Retrieval and 157–165 (2012). Analysis. in Structural Approaches to Sequence Evolution: Molecules, Networks, 33. Keele, B. F. et al. Identification and characterization of transmitted and early Populations (eds Bastolla, U. et al.) 207–232 (Springer Verlag, New York, founder virus envelopes in primary HIV-1 infection. Proc. Natl. Acad. Sci. 2007). USA 105, 7552–7557 (2008). 63. Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6: 34. Sunnaker, M. et al. Approximate Bayesian computation. PLoS Comput. Biol. 9, Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30, e1002803 (2013). 2725–2729 (2013). 35. Giorgi, E. E., Korber, B. T., Perelson, A. S. & Bhattacharya, T. Modeling 64. Perelson, A. S. Modelling viral and immune system dynamics. Nat. Rev. sequence evolution in HIV-1 infection with recombination. J. Theor. Biol. 329, Immunol. 2,28–36 (2002). 82–93 (2013). 65. Delviks-Frankenberry, K. et al. Mechanisms and factors that influence high 36. Liao, H. X. et al. Co-evolution of a broadly neutralizing HIV-1 antibody and frequency retroviral recombination. Viruses 3, 1650–1680 (2011). founder virus. Nature 496, 469–476 (2013). 66. Ramratnam, B. et al. Rapid production and clearance of HIV-1 and hepatitis C 37. Bonsignori, M. et al. Maturation pathway from germline to broad HIV-1 virus assessed by large volume plasma apheresis. Lancet 354, 1782–1785 neutralizer of a CD4-mimic antibody. Cell 165, 449–463 (2016). (1999). 38. Etherington, G. J., Dicks, J. & Roberts, I. N. Recombination Analysis Tool 67. Perelson, A. S., Neumann, A. U., Markowitz, M., Leonard, J. M. & Ho, D. D. (RAT): a program for the high-throughput detection of recombination. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral Bioinformatics 21, 278–281 (2005). generation time. Science 271, 1582–1586 (1996). 39. Liu, M. K. et al. Vertical T cell immunodominance and epitope entropy 68. Ganusov, V. V., Neher, R. A. & Perelson, A. S. Mathematical modeling of determine HIV-1 escape. J. Clin. Invest. 123, 380–393 (2013). escape of HIV from cytotoxic T lymphocyte responses. J. Stat. Mech. 2013, 40. Bailey, J. R. et al. Broadly neutralizing antibodies with few somatic mutations P01010 (2013). and hepatitis C virus clearance. JCI Insight 2, e92872 (2017). 69. Brown, L. D., Cai, T. T. & DasGupta, A. Interval estimation for a binomial 41. Santra, S. et al. Human non-neutralizing HIV-1 envelope monoclonal proportion. Stat. Sci. 16, 101–133 (2001). antibodies limit the number of founder viruses during SHIV mucosal infection 70. Wolfram Research Inc. Mathematica, Version 11.2 (Champaign, 2017). in rhesus macaques. PLoS Pathog. 11, e1005042 (2015). 42. Archer, J. et al. Identifying the important HIV-1 recombination breakpoints. Acknowledgements PLoS Comput. Biol. 4, e1000178 (2008). We thank Yi Yang and Sheri Chen for technical assistance; and Jennifer Kirchherr and 43. Baird, H. A. et al. Sequence determinants of breakpoint location during HIV-1 Caroline Cockrell for compilation of clinic data. This work was supported by NIH grant intersubtype recombination. Nucleic Acids Res. 34, 5203–5216 (2006). NIH R01AI087520, NIH grants to the Center for HIV/AIDS Vaccine Immunology 44. Jia, L. et al. Analysis of HIV-1 intersubtype recombination breakpoints (AI067854), the Center for HIV/AIDS Vaccine Immunology and Immunogen Discovery suggests region with high pairing probability may be a more fundamental (AI100645), the HIV/SIV Database and Analysis Unit (AAI 12007-0000-01000), the factor than sequence similarity affecting HIV-1 recombination. Virol. J. 13, American Heart Foundation grant to V.V.G., and with Federal funds from the National 156 (2016). Cancer Institute, National Institutes of Health, under Contract No. 45. Sakuragi, S., Shioda, T. & Sakuragi, J. Properties of human immunodeficiency HHSN261200800001E. The content of this publication does not necessarily reflect the virus type 1 reverse transcriptase recombination upon infection. J. Gen. Virol. views or policies of the Department of Health and Human Services, nor does mention of 96, 3382–3388 (2015). trade names, commercial products, or organizations imply endorsement by the US 46. Li, X. et al. High-frequency illegitimate strand transfers of nascent DNA Government. fragments during reverse transcription result in defective retrovirus genomes. J. Acquir. Immune Defic. Syndr. 72, 353–362 (2016). 47. Bonsignori, M. et al. Staged induction of HIV-1 glycan-dependent broadly Author contributions neutralizing antibodies. Sci. Transl. Med. 9, eaai7514 (2017). F.G., B.K., T.B., H.S., and E.E.G. conceived and designed the study. H.S., F.C., B.H., C.J., 48. Moore, P. L. et al. Multiple pathways of escape from HIV broadly cross- X.L., S.W., H.L., J.F.S., M.G.S., N.G., and B.F.K. performed experiments. E.E.G., T.B., O. neutralizing V2-dependent antibodies. J. Virol. 87, 4882–4894 (2013). C., G.A., H.Y., E.R., and B.K. developed the computational algorithm for recombination 49. Huson, D. H. & Bryant, D. Application of phylogenetic networks in analysis. V.V.G. developed the mathematical model for recombination and viral load evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006). dynamics analysis. M.S.C. directed the clinical trial and supplied the clinical samples. H. 50. Doria-Rose, N. A. et al. Developmental pathway for potent V1V2-directed S., E.E.G., V.V.G., F.C., B.H., P.H., X.L., J.F.S., N.G., B.F.K., D.C.M., M.S.C., G.M.S., B.H. HIV-neutralizing antibodies. Nature 509,55–62 (2014). H., A.J.M., B.F.H., T.B., B.K., and F.G. analyzed the data. H.S., E.E.G., V.V.G., P.H., N.G., 51. Munshi, S. U. et al. Molecular characterization of hepatitis B virus in A.J.M., G.M.S., B.F.H., T.B., B.K., and F.G. wrote and edited the manuscript. Bangladesh reveals a highly recombinant population. PLoS ONE 12, e0188944 (2017). 52. Harvala, H. et al. Recommendations for enterovirus diagnostics and Additional information characterisation within and beyond Europe. J. Clin. Virol. 101,11–17 (2018). Supplementary Information accompanies this paper at https://doi.org/10.1038/s41467- 53. van Beek, J. et al. Molecular surveillance of norovirus, 2005-16: an 018-04217-5. epidemiological analysis of data collected from the NoroNet network. Lancet Infect. Dis.18, 545–553 (2018). Competing interests: The authors declare no competing interests. 54. Giorgi, E. E. et al. Estimating time since infection in early homogeneous HIV- 1 samples using a Poisson model. BMC Bioinforma. 11, 532 (2010). Reprints and permission information is available online at http://npg.nature.com/ 55. Fiebig, E. W. et al. Dynamics of HIV viremia and antibody seroconversion in reprintsandpermissions/ plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17, 1871–1879 (2003). Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in 56. Salazar-Gonzalez, J. F. et al. Genetic identity, biological phenotype, and published maps and institutional affiliations. evolutionary pathways of transmitted/founder viruses in acute and early HIV- 1 infection. J. Exp. Med. 206, 1273–1289 (2009). 57. Song, H. et al. Transmission of multiple HIV-1 subtype C transmitted/founder viruses into the same recipients was not determined by modest phenotypic Open Access This article is licensed under a Creative Commons differences. Sci. Rep. 6, 38130 (2016). Attribution 4.0 International License, which permits use, sharing, 58. Rose, P. P. & Korber, B. T. Detecting hypermutations in viral sequences with adaptation, distribution and reproduction in any medium or format, as long as you give an emphasis on G--> A hypermutation. Bioinformatics 16, 400–401 (2000). appropriate credit to the original author(s) and the source, provide a link to the Creative 59. Benjamini, Y. H. & Controlling, Y. The false discovery rate: a practical and Commons license, and indicate if changes were made. The images or other third party powerful approach to multiple testing. J. Roy. Stat. Soc. Ser. B (Methodol.) 57, material in this article are included in the article’s Creative Commons license, unless 289–300 (1995). indicated otherwise in a credit line to the material. If material is not included in the 60. R Core Team. R: A Language and Environment for Statistical Computing (R article’s Creative Commons license and your intended use is not permitted by statutory Foundation for Statistical Computing, Vienna, 2013). http://www.R-project. regulation or exceeds the permitted use, you will need to obtain permission directly from org. the copyright holder. To view a copy of this license, visit http://creativecommons.org/ 61. Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of Phylogenetics and licenses/by/4.0/. Evolution in R language. Bioinformatics 20, 289–290 (2004). © The Author(s) 2018 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 15 | | | http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Nature Communications Springer Journals

Loading next page...
 
/lp/springer-journals/tracking-hiv-1-recombination-to-resolve-its-contribution-to-hiv-1-8NaCeYsTAv

References (71)

Publisher
Springer Journals
Copyright
Copyright © 2018 by The Author(s)
Subject
Science, Humanities and Social Sciences, multidisciplinary; Science, Humanities and Social Sciences, multidisciplinary; Science, multidisciplinary
eISSN
2041-1723
DOI
10.1038/s41467-018-04217-5
Publisher site
See Article on Publisher Site

Abstract

ARTICLE DOI: 10.1038/s41467-018-04217-5 OPEN Tracking HIV-1 recombination to resolve its contribution to HIV-1 evolution in natural infection 1,14 2 3 1 4 2 Hongshuo Song , Elena E. Giorgi , Vitaly V. Ganusov , Fangping Cai , Gayathri Athreya , Hyejin Yoon , 5 1 2 2 1,6 1 Oana Carja , Bhavna Hora , Peter Hraber , Ethan Romero-Severson , Chunlai Jiang , Xiaojun Li , 7 7 8,15 8 9 10 Shuyi Wang , Hui Li , Jesus F. Salazar-Gonzalez , Maria G. Salazar , Nilu Goonetilleke , Brandon F. Keele , 1 9 7,11 7,11 12 David C. Montefiori , Myron S. Cohen , George M. Shaw , Beatrice H. Hahn , Andrew J. McMichael , 1 2 2,13 1,6 Barton F. Haynes , Bette Korber , Tanmoy Bhattacharya & Feng Gao Recombination in HIV-1 is well documented, but its importance in the low-diversity setting of within-host diversification is less understood. Here we develop a novel computational tool (RAPR (Recombination Analysis PRogram)) to enable a detailed view of in vivo viral recombination during early infection, and we apply it to near-full-length HIV-1 genome sequences from longitudinal samples. Recombinant genomes rapidly replace transmitted/ founder (T/F) lineages, with a median half-time of 27 days, increasing the genetic complexity of the viral population. We identify recombination hot and cold spots that differ from those observed in inter-subtype recombinants. Furthermore, RAPR analysis of longitudinal samples from an individual with well-characterized neutralizing antibody responses shows that recombination helps carry forward resistance-conferring mutations in the diversifying qua- sispecies. These findings provide insight into molecular mechanisms by which viral recom- bination contributes to HIV-1 persistence and immunopathogenesis and have implications for studies of HIV transmission and evolution in vivo. 1 2 Duke Human Vaccine Institute and Department of Medicine, Duke University Medical Center, Durham, NC 27710, USA. Theoretical Division, Los Alamos 3 4 National Laboratory, Los Alamos, NM 87544, USA. Department of Microbiology, University of Tennessee, Knoxville, TN 37996, USA. Office for Research 5 6 & Discovery, University of Arizona, Tucson, AZ 85721, USA. Department of Biology, University of Pennsylvania, Philadelphia, PA 19104, USA. National Engineering Laboratory For AIDS Vaccine, College of Life Science, Jilin University, Changchun, Jilin 130012, China. Department of Medicine, University of 8 9 Pennsylvania, Philadelphia, PA 19104, USA. Department of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA. Departments of Microbiology and Immunology & Medicine, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA. AIDS and Cancer Virus Program, Frederick National Laboratory for Cancer Research, Frederick, MD 21702, USA. Department of Microbiology, University of Pennsylvania, Philadelphia, PA 12 13 19104, USA. Weatherall Institute of Molecular Medicine, University of Oxford, Oxford OX3 9DS, UK. Santa Fe Institute, Santa Fe, NM 87501, USA. 14 15 Present address: United States Military HIV Research Program, Walter Reed Army Institute of Research, Silver Spring, MD 20910, USA. Present address: MRC/UVRI and LSHTM Uganda Research Unit, Plot 51-57, Nakiwogo Road, Entebbe, Uganda. These authors contributed equally: Hongshuo Song, Elena E. Giorgi. These authors jointly supervised this work: Bette Korber, Tanmoy Bhattacharya, Feng Gao. Correspondence and requests for materials should be addressed to F.G. (email: fgao@duke.edu) NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 1 | | | 1234567890():,; ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 recombinant is a genetic sequence that carries regions rates and confounds relationships in standard phylogenetic tree 26 27 from two genetically distinct parental strains. That reconstructions . Currently, the bioinformatics tool GARD Arecombination contributed to HIV-1 quasispecies evolu- addresses this issue by finding likely breakpoints of recombina- tion in vivo was observed early after the discovery of HIV , and, tion within the alignment and reconstructing a series of phylo- soon after, recombination’s important role in global HIV-1 genetic trees of genetic segments within the breakpoints. 2,3 diversification was established . Ninety inter-subtype recombi- However, this program does not provide a list of recombinants or nants have been shown to be recurrent among circulating HIV-1 sequence relationships found within the alignment, and therefore viruses (for a current listing of these circulating recombinant is not helpful when analyzing low-diversity and/or shorter forms (CRFs), see https://www.hiv.lanl.gov/content/sequence/ regions, where eliminating a few sequences yields more statistical HIV/CRFs/CRFs.html). Some are very common epidemic lina- power than breaking up the alignment. ges like CRF01_AE, which is predominant in Thailand and Our bioinformatics tool, called Recombination Analysis PRo- 3 28 China, and CRF02_AG, which is common in West Africa . gram (RAPR), is based on the Wald–Wolfowitz Runs Test and Recombination increases overall genetic complexity of viral allows for computationally efficient detection of recombinants. populations more than just the accumulation of site mutations While variations on the Runs Test have been employed in the alone, thereby raising the likelihood for recombinants to find past to detect recombination in the context of gene conversion , favorable genetic configurations and facilitating faster adapta- this method has not been implemented on more recent and faster tion . Recombination is believed to contribute to viral diversity computational platforms. 5–7 8,9 10,11 and fitness , drug resistance , immunological escape , and Here, we show that RAPR is sensitive to identification of 7,12 disease progression . recombinant stretches in low-diversity settings, where only a Estimates of recombination rates in vivo vary among different small number of distinguishing bases are involved, and has the −5 −4 studies—1.4 × 10 to 2 × 10 breakpoints per site per genera- capability to discern parental strains from recombinants, in part 13–15 tion —comparable to the point mutation rate of 2.2–5.4 × 10 by accounting for longitudinal sampling times. Moreover, it can −5 16,17 per base per generation . In vitro studies have found HIV-1 address the lineage structure of the alignment by differentiating 18,19 to be a highly recombinogenic virus . However, in order to between the initiating recombinant event and its mutated des- detect recombinants, these studies utilize “foreign” gene inserts or cendants, when these are known for analysis. As a result, RAPR is sequences from divergent subtypes, unlike in vivo infections. able to reconstruct a hierarchy of serial recombination events as Schlub et al. describe an in vitro system that better mimics the recombinants of recombinants emerge over time. We first applied typical in vivo scenario, and observe that recombination occurs RAPR to a dataset of 3260 5′ and 3′ half genome sequences −3 non-randomly, with a frequency of the order of 10 breakpoints derived by SGA from 9 participants infected with multiple per nucleotide per round of infection, with multiple template transmitted/founder (T/F) viruses and sampled longitudinally switches between parental strains. over time. Because all infections were initiated by multiple, A number of factors have limited a better understanding of the genetically distinct T/Fs, recombinants between the infecting role of recombination in studies that looked at in vivo HIV-1 strains could be readily identified and tracked. This offered a infection. PCR-derived recombination artifacts are common unique opportunity to investigate how recombination impacts among sequences generated by bulk PCR products and, when not in vivo viral evolution and immune escape. We next applied excluded, may have affected older recombination studies that did RAPR to sequences from an individual (CH0505) whose infection not use single-genome amplification (SGA) . Similarly, studies was established by a single T/F virus and whose immune response that utilize sequences sampled only in chronic infection may not . We used this example to illustrate has been extensively studied be able to resolve ambiguities between putative parents and how to apply RAPR in a single T/F setting and illustrate the role 14,15 products of recombination . Other studies analyzed only a of recombination in carrying forward resistance mutations in a small portion of the viral genome. Such shortcomings can bias complex quasispecies. While heterogeneous infections have been estimates of the frequencies and evolutionary dynamics of analyzed before to estimate recombination rates , and long- recombinants in vivo. itudinal samples have been used to estimate recombination Many recombination detection tools currently available focus breakpoints , this is the first study to estimate how both on detecting recombinants between genetic subtypes and are best recombination rates and the accumulation of breakpoints con- applied to highly diverse sequences. Based on a sliding window tribute to the evolution of the HIV-1 quasispecies over time approach, RIP was the first bioinformatics tool to automatically during natural infection. screen for inter-subtype recombinants. More recent and advanced approaches are available today for detection of recombinants and their breakpoints within a single alignment. RECCO finds Results nominally significant recombinants by comparing the cost of Study participants. Previous studies have shown that approxi- obtaining each sequence either by mutation or by recombination. mately 80% of heterosexually transmitted HIV-1 infections are It does not, however, list parental strains or enumerate distinct initiated by a single T/F virus and only 20% are due to multiple, recombination events. Among the programs that provide a list of genetically distinct T/F viruses . In the latter case, due to the detected recombinants, including their potential breakpoints and genetically distinct quasispecies coevolving in the host, it becomes parental strains, is the suite RDP4 , which, in addition to easier to follow the history of recombination from the beginning implementing the recombination detection program (RDP) of the infection. To this purpose, we distinguished participants approach, includes several other existing tools, providing corro- productively infected with more than one T/F virus (hetero- boration of findings across the different methods. While these geneous infection) from those infected with a single T/F virus tools range in methods and strategies, the RDP test itself is based (homogeneous infection) by characterizing patterns of sequence on phylogenetic methods, and, as Posada and Crandall showed, diversity at the earliest time point (Fig. 1, Table 1, and Supple- in low-divergence scenarios like the recent infections presented in mentary Figs. 1 and 2) using statistical modeling, phylogenetic this study, these methods have less power to detect recombinants. trees, and highlighter plots, as previously described .Itis Defining the frequency of recombinants and identifying important to note that the number of infecting strains, the inci- breakpoints is particularly important in phylogenetic analyses dence of superinfection, and the estimated number of days since because recombination significantly biases estimates of mutation infection play no role in the detection of recombination, except 2 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE CH0010 (day 9) Table 1 Fiebig stage, transmission route, days since infection, and number of T/F viruses of heterogeneously infected individuals Subject Fiebig Transmission Days from No. of T/F stage route infection viruses (95% CI) Gaps CH0010 I/II Heterosexual 9 (7, 11) 3 6 CH0078 I/II Heterosexual 11 (7, 14) 2 2 CH0200 I/II Heterosexual 15 (13, 17) 3 5 CH0047 III Heterosexual 25 (21, 29) 2 2 CH0228 III Heterosexual 24 (21, 28) 2 3 CH0275 I/II Heterosexual 12 (8, 16) 1 2 CH0654 I/II MSM 15 (13, 17) 9 8 CH1244 IV Heterosexual 12 (11, 13) NA 2 CH1754 III Heterosexual 14 (12, 15) 6 7 Days from infection were calculated using previously published methods . These methods generally strongly correlate with Fiebig stage, though occasional deviations still happen, especially when under strong selection. In our present dataset, CH1244 is one such exception, 0 2000 4000 0.002 with a Fiebig stage IV yet diversity that would be expected—under a model of random tat accumulation of mutation—to be roughly 2 weeks into the infection vpu rev nef vif MSM men who have sex with men pol vpr env Fig. 1 Highlighter plot and phylogenetic tree of 3′ half genome sequences at when sequences are less diversified. Recombinant sequences screening time for CH0010. Six distinct T/F viruses (labeled a–f) were accumulated more breakpoints over time as a result of continuous identified. Each line represents a 3′ half genome sequence, and mutations serial recombination between earlier recombinants (Fig. 2b). from the major T/F (first sequence at the top) are color coded according to Similar results were obtained on both 5′ and 3′ half genome nucleotide sequences from all 9 participants (Supplementary Figs. 3–11). The frequency of recombinants and their descendants increased in all that we do not count the putative founder strains as recombinants individuals, eventually replacing all T/F lineages. Complete since we are focusing on recombination in the recipient rather replacement with recombinants of all distinctive T/F lineages than in the donor. However, calculating the time since infection that were evident in the earliest available time point was observed allows for aligning dynamics of recombination and viral loads to as early as 35 days, and typically was evident within 100 days facilitate comparisons. Participant selection and characteristics, (Fig. 3). The exception was participant CH0275, who was sparsely T/F identification, longitudinal sampling, and timing of infection sampled, with no time points available between 12 and 187 days; are described in the Methods. all CH0275 sequences sampled at day 187 were, however, recombinant forms (Fig. 3a). The RAPR program. To characterize recombination events and determine the frequency of recombinants over time, we developed Estimations of number of breakpoints. Template switches the RAPR tool, available as a web-based interface in the LANL occurring in nearly identical regions of the genome are likely to database (http://www.hiv.lanl.gov/content/sequence/RAP2017/ go undetected. Cromer et al. modeled this phenomenon and rap.html). The program relies on the fact that recombination estimated between 5 and 14 template switches per recombinant events can be identified through the inheritance of blocks of DNA genome. In this study, we observed a high frequency of recom- that carry distinguishing bases. Therefore, it finds recombinants bination breakpoints in our data when both significant and non- by comparing every set of three sequences in the alignment and significant but potential breakpoints were considered, consistent 28 31 applying the Wald–Wolfowitz Runs Test to check for ran- with what was reported by Cromer et al. . However, because of domness in the distribution of variable positions, identifying regions of near identity between the parents, the significant regions in a sequence that are significantly more similar to one breakpoint counts of RAPR are inevitably lower than the true candidate parent than another. It then classifies the viral popu- number of template switches, as illustrated in Supplementary lation into clusters of T/F viruses and their descendants—which Fig. 12. In order to quantify how many undetected breakpoints present mutations explainable by randomly distributed base resulted from template switches in regions of high homology changes—or identified recombinants and their descendants. In between parental strains, we used a recombination simulation the latter case, RAPR also determines the likeliest parents, based strategy (see Methods for details). As expected, RAPR con- both on the clustered distribution of sequence differences and on sistently underestimated the true number of breakpoints (Sup- their time of sampling, and delineates regions in which the plementary Figs. 13 and 14), and yielded a distribution of breakpoints are likely to lie. significant breakpoints lower than the actual number of break- As an example of RAPR performance and output, we illustrate points introduced in the simulation. This is not surprising given detailed results for participant CH0010, infected with 6 sampled that all sets of sequences used in the simulation had very low T/F viruses, labeled a–f (Figs 1 and 2a). Out of 103 unique within-alignment diversity (the within T/F mean Hamming dis- sequences from 4 time points (days 9, 26, 72, and 188 post tance range per nucleotide per sequence was 0.76–2.93% across infection), RAPR detected 52 distinct de novo recombinants, 4 of all parental pairs), and many of the random recombination which gave rise to 11 descendants (Fig. 2). By day 72 onward, all breakpoints in our simulation occur in sequence stretches where sequences were either recombinants or descendants of recombi- both parents were identical or nearly identical. As a result, this nants (Fig. 2). Estimated breakpoint intervals (illustrated as open study alone cannot bound the recombination rate from above, boxes in Fig. 2, shaded in gray when the breakpoint is statistically and Cromer et al. values are consistent with the lower bounds significant, see Methods) tend to be larger and sparser earlier, we obtain. NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 3 | | | ORF ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ab Recombination breakpoints T/F a T/F a T/F b T/F c T/F d T/F e T/F f Day 9 Day 26 Day 72 Day 188 T/Fs Rec cluster 1 T/F d T/F c T/F b Rec cluster 2 T/F f f * * Rec cluster 3 * Rec cluster 4 T/F e 0.002 tat vpu rev +2 vif nef pol vpr env Fig. 2 Identification of recombinants by RAPR. a Phylogenetic tree of 3′ half genome sequences from longitudinal samples from CH0010. Shaded regions show clonal expansion events and are color coded to indicate their closest T/F lineage. In each cluster, solid circle represents the originating (de novo) recombinant for that cluster and open circles represent its recombinant descendants. b Recombination breakpoints. Each line represents a sequence, and colored lines represent recombinants. Colored intervals in each recombinant sequence indicate the parental T/F lineage. Black boxes indicate the interval where the breakpoints are most likely to have occurred, and when the breakpoints are statistically significant, the box is shaded in gray. Shaded regions show clonal expansion events and are color coded to indicate their closest T/F lineage. Dashed lines indicate recombination descendants derived froma particular de novo recombinant as shown as a solid line in the same shaded group. The black stars on the right indicate the sequence lineage that RAPR identified as descendants of a single recombination event, while the program RECCO failed to recognize it as such To obtain a posterior distribution on the number of true biologically sound simulated datasets where the exact recombi- breakpoints, we also implemented a simulation based on an nants and their breakpoints are known. To realize this, we ran- approximate Bayesian computation method and applied it to domly generated sets of 100 artificial recombinants with known the same early time samples mentioned above (see Methods for crossover points, each from three different pairs of natural strains details). We conclude that for all recombinants for which RAPR that carried 0.6%, 1%, and 1.2% relative diversity, respectively (see detected 2 significant breakpoints, we are 95% confident that the Methods). In the lowest (0.6%) diversity setting, RAPR identified true breakpoints were no less than 2. A similar statement holds 77 out of 100 unique recombination events and 16 recombination when RAPR detected 4 breakpoints, with one exception descendants, missing 6 out of 100 known events (Supplementary (Supplementary Table 1). As the number of breakpoints Table 2). The other computational tools we used to test on the increases, they are more likely to go undetected: see for example same simulated datasets detected overall less recombination two of the recombinants from CH0228 3′ half sequences, for events than RAPR (Supplementary Table 2). However, we should which RAPR detects 5 and 6 significant breakpoints, but the 95% note that selection could impact our results when a few nearby confidence limit (CL) is 6 and 9 respectively. Supplementary mutations localized proximally are co-selected. Therefore, it is Fig. 15 illustrates how, as the number of template switches possible that generating recombinants in other ways, e.g., by increases, more breakpoints fall within regions of parental including some type of selection when sampling sequences, could homology, making it harder to accurately detect the true number reduce the efficiency of RAPR at detecting recombinants. of breakpoints. In conclusion, our tests indicate that RAPR For further comparisons, we tested different recombination provides a reasonable, though negatively biased, estimate for the detection tools on CH0010, the heterogeneous participant we true number of breakpoints; and the bias is small when the true used above to illustrate the RAPR output. RAPR identified 52 number of breakpoints is small. recombination events in this dataset, while the other tested tools detected between 2 and 18 (Fig. 2b) when accounting for sampling times. One feature of RAPR is that it can identify Comparison with existing recombination detection tools. clusters of sequences that likely descended from a single Despite the bias described above, our program proves useful in recombination event. When a single recombination event leads low-diversity scenarios. A good way to test this is using to a lineage, i.e., a cluster of sequences that differ from one 4 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | ORF NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE 3′ Half genome CH0010 CH0078 CH0200 CH0047 CH0228 100 100 100 100 100 80 80 80 80 80 60 60 60 60 40 40 40 40 20 20 20 20 0 0 0 0 50 100 150 200 0 100 200 300 400 500 0 200 400 600 800 0 200 400 600 800 0 200 400 600 800 CH1754 CH0654 CH1244 CH0275 T/F a T/F g 100 100 100 100 T/F b T/F h 80 80 80 80 60 60 60 60 T/F c Super-infected 40 40 40 40 T/F d Rec (all) 20 20 20 20 T/F e Rec (de novo) 0 0 0 0 T/F f Rec (descendant) 0 50 100 150 200 0 100 200 300 400 500 0 100 200 300 400 500 0 50 100 150 200 Days from infection 5′ Half genome CH0010 CH0078 CH0200 CH0047 100 100 100 100 80 80 80 80 60 60 60 60 40 40 40 40 20 20 20 20 0 0 0 0 0 50 100 150 200 0 100 200 300 400 500 0 200 400 600 800 0 200 400 600 800 CH0228 CH1754 CH0654 100 100 100 T/F a T/F g 80 80 80 T/F b T/F i T/F c T/F j 60 60 60 T/F d Rec (all) 40 40 40 T/F e Rec (de novo) 20 20 20 T/F f Rec (descendant) 0 0 0 0 200 400 600 800 0 50 100 150 200 0 100 200 300 400 500 Days from infection Fig. 3 Proportion of recombinants and their descendants in the viral population over time. The 3′ half genome (a) and 5′ half genome (b) sequences were analyzed separately. At each time point, colored dots indicate the percentages of T/F sequences and their lineages, red dots indicate all recombinants, red squares de novo recombinants, red circles descendants of recombinants, and open diamonds superinfection variants another by a small number of random mutations, many sampled at this or earlier time points. Furthermore, excluding recombination detection tools classify each resulting sequence later time points as parentals vastly reduces the multiple-testing as a recombinant rather than identifying the entire cluster as a corrections needed. lineage from a single recombinant founder strain. As a result, recombination rates can be overestimated. RAPR avoids over- counting descendants as independent recombination events by In vivo recombination rate. To investigate potential mechanisms breaking the alignment into clusters that are likely to have for the rapid loss of T/F viruses and their lineages during early originated either via mutation from one of the original T/Fs or infection, we developed a mathematical model of HIV-1 evolu- from a single recombination event and assigning a single tion that includes recombination. The model considers cells that “founder” (either a T/F or the original recombination event) for can be uninfected, infected with one viral variant, or coinfected each cluster (Fig. 4). with multiple variants (Supplementary Fig. 16), and explores the RAPR assumes that parental sequences are not sampled later impact of coinfection and selection on the prevalence of recom- than the recombinant, and therefore when resolving parental binants over time. In the absence of selective advantage of strains versus recombinants, the program does not allow later recombinants, a moderate frequency of cell coinfection by two time point sequences to be parents of earlier recombinants. different viral variants (on average 3.3%, range 0.7–8.0% among However, one needs to be careful when interpreting results in that the 9 participants) was sufficient to explain the rapid accumula- some earlier lineages may in fact be latent or escape sampling and tion of recombinants and the loss of T/F viruses when 3′ half then reappear at later time points. When in doubt, the user genome sequences were analyzed (Table 2 and Supplementary should consider multiple runs, with and without specifying Fig. 17a). Modeling 5′ half genome sequences showed similar sequence time points. In our study, we make the simplifying results (Supplementary Table 3 and Supplementary Fig. 17b). assumption that the sampling at each time point is adequate, so Alternatively, accumulation of recombinants could also be that sequences close to the parents are expected to be already explained by their moderate selective advantage over T/F viruses, NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 5 | | | % ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ranging from 0.034 per day to 0.5 per day. Either mechanism recombinants, one or few mutational clusters arising from could predict the time at which 50% of the viral population had recombinants would be highly represented. Given that selective been replaced by recombinants; for example, the “coinfection” forces due to immune pressure are subject to change over time, model estimated that the median half-time (T ) across 9 parti- and that new resistant forms of the virus can arise over time, such 1/2 cipants was 27 days for 3′ half genome sequences (Table 2). dominance may be transient. While the replacement of the T/F Similar results were obtained with 5′ half genome sequences lineages was most frequently due to accumulation of novel (de (Supplementary Table 3). novo) recombinants (Fig. 3), there were some exceptional time Deeper understanding of the relative contribution of these two points in some participants where a newly arisen recombinant mechanisms in accumulation of recombinants can be gleaned lineage dominated the sample, indicative of positive selection from a detailed analysis of recombination: if the replacements acting on a recombinant (Fig. 3). The most striking examples of were driven by a selective advantage of a few particular this were in the 3′ half from CH0078 at day 453, with one recombinant form having 26 descendants dominating the sample (Fig. 3a, Supplementary Fig. 4a), and in CH0654 at day 84, with a T/F a cluster recombinant form with 16 descendants dominating the sample T/F b cluster T/F c cluster (Fig. 3b, Supplementary Fig. 9b). Furthermore, all participants T/F d cluster had recombinant forms that were found to be repeated multiple T/F e cluster times, with descendent lineages sometimes persisting for weeks T/F f Rec (de novo) (Fig. 3, Supplementary Figs. 3–11). These observations suggest Rec (descendant) selection may have played a role in transient emergence of some Rec cluster 4 recombinant lineages. T/F c The number of de novo breakpoints observed correlated better T/F e with the number of novel mutations than with viral load (VL), consistent with the hypothesis that periods of more intense T/F d positive selection may influence both patterns of emergent recombinants and mutation. The coincidence in the timing of T/F f Rec cluster 3 periods of peak accumulation of new base substitutions and recombination breakpoints (Fig. 5) was statistically significant for the 3′ half genomes (p < 0.015 by one-sided sign test) but not for Rec cluster 1 the 5′ halves. While recombination requires coinfection, novel T/F a mutations are independent of it. This suggests that neither of the T/F b two simple models described above captured the full in vivo dynamics observed in our data, and that there may have been periods of increased selective pressure driving changes in the quasispecies, evident as coincident peaks in de novo recombina- tion and base mutation events in the longitudinal sample. Thus, Rec cluster 2 while occasionally selection can favor the dominance of one or Fig. 4 Clustering graph of the 3′ half genome sequences from CH0010. few recombinant forms, diversifying selection can also periodi- Each circle represents a sequence, and circles of the same color represent a cally act to increase the prevalence of both new mutations and T/F lineage. Gray circles represent de novo recombinant sequences and recombinants. Therefore, both factors—high VL that enhances open circles represent descendants of recombinant sequences. Clusters are coinfection of cells and selection—may have contributed to the calculated using a greedy algorithm and they originate either from a T/F rapid appearance of recombinants. lineage or through recombination across lineages. Recombination clusters are shaded and color coded to indicate their closest T/F lineage. At most Comparison with mutation rates. When combining the RAPR one sequence per recombination cluster is called a de novo recombinant: output across all 9 participants, we observed that while the rates the rest are assumed to be descendants of this recombinant sequences. of accumulation of new mutations and new observed recombi- −5 Light blue lines point to recombinant parents and green lines to the nation events were of the same order (~10 per sequence per recombinant child nucleotide per day; Fig. 5), the rate of new base mutations at any Table 2 Estimated rates of coinfection, the percentage of coinfected cells, and the half replacement time by recombinants during the infection (3′ half genome) Subject Coinfection rate (/day) % Coinfected cells T (days) 1/2 CH0010 0.067 (0.051, 0.093) 3.3 23.3 (19.3, 27.6) CH0078 0.088 (0.081, 0.095) 4.2 47.3 (44.6, 50.5) CH0200 0.056 (0.033, 0.094) 2.7 29.5 (21.9, 41.8) CH0047 0.014 (0.008, 0.022) 0.7 39.2 (31.2, 55.1) CH0228 0.103 (0.068, 0.153) 4.9 27.0 (24.3, 31.0) CH1754 0.149 (0.123, 0.177) 6.9 19.8 (18.9, 21.0) CH0654 0.035 (0.027, 0.044) 1.7 27.4 (24.8, 30.6) CH1244 0.175 (0.121, 0.27) 8.0 19.5 (16.8, 22.8) CH0275 0.024 (0.014, 0.039) 1.2 114.0 (73.9, 172.0) Median 0.067 3.3 27.0 The rate of coinfection is given as γβI, and the percentage of coinfected cells during the infection is F = γβI/(γβI + γβT) with γβT=2/day. T indicates the predicted time when the frequency of c 1/2 recombinants reaches 50% of the viral population. Coinfection rate was estimated by fitting 3′ and 5′ data simultaneously using maximum likelihood (see Methods for more detail). In CH0654, all sequences were recombinants at day 84 before the initiation of ART at day 112. Therefore, ART in CH0654 did not affect the analysis 6 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE New mutations/seq/day/bp New breakpoints/seq/day/bp Viral load CH0010 3′ half CH0010 5′ half CH0047 3′ half CH0047 5′ half 10^7 8e−05 10^6 6e−05 10^5 4e−05 10^4 2e−05 926 72 188 9 26 72 188 19 205 708 19 205 708 CH0078 3′ half CH0078 5′ half CH0200 3′ half CH0200 5′ half 10^7 8e−05 10^6 6e−05 10^5 4e−05 10^4 2e−05 11 46 88 145 208 453 11 46 88 145 208 453 11 74 409 696 11 74 409 696 CH0228 3′ half CH0228 5′ half CH0654 3′ half CH0654 5′ half 8e−05 10^7 6e−05 10^6 10^5 4e−05 10^4 2e−05 19 112 196 707 19 112 196 707 15 47 84 432 15 47 84 432 CH1754 3′ half CH1754 5′ half CH1244 3′ half CH0275 3′ half 10^7 8e−05 10^6 6e−05 10^5 4e−05 10^4 2e−05 14 77 189 357 441 609 776 888 14 77 189 357 441 609 776 888 12 40 108 188 277 375 435 12 187 Days from infection Fig. 5 Comparison of new breakpoint and mutation rates. Each graph shows the accumulation rate of new breakpoints per sequence per nucleotide per day (blue lines), new mutations per sequence per nucleotide per day (red lines), and log-scale viral loads (green lines) in both half genomes over time for all nine participants. Upper VL detection limit is 750,000 copies/ml. The gray vertical line in CH0200 3′ half highlights the time point at which a superinfecting T/F was observed. The sequences from the last time point (day 432) in CH0654 were excluded from the analysis due to the initiation of ART at day 112 time point (per sequence, per nucleotide, per day) was always Recombination hotspots. One open question when studying higher than the rate of new breakpoints (per sequence, per recombination is whether breakpoints are uniformly distributed nucleotide, per day) in both genome halves across all participants, across the HIV genome or whether instead they cluster pre- with the only exception of the 5′ half genome in CH0228 (Fig. 5). ferentially in certain regions (called hotspots) while leaving others Both recombination and mutation rates tended to decrease with relatively intact (cold spots). To address this, we calculated hot time. The difference in rates between recombination and muta- and cold spots of recombination across the 3′ half genomes of the −8 −4 tion was significant (p = 1.9 × 10 and p = 1.4 × 10 , respec- 9 participants using a sliding window of 20 nucleotides, and tively, one-sided paired Wilcoxon test), but they pertain only to found that hotspots were mostly clustered in Env, before or after detected recombination breakpoints. As discussed previously, the variable loops, whereas regions where two genes overlapped breakpoints are detectable only when the parental strains are (vif/tat and vpu/env) carried several cold spots (Fig. 6). distinctive enough that exchanged blocks carry within them dis- tinctive patterns of bases, and therefore the detected recombi- Increased genetic complexity among recombinants. Across all nation rate is an underestimate of the true recombination rate. nine participants, diversity within recombinant sequences was higher than intra-T/F lineage diversity but lower than inter-T/F diversity (Fig. 7). Previously, we observed that in homogeneous NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 7 | | | Viral loads (copies/ml) Rate (per seq per day per bp) ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 Combined recombination breakpoints Hotspots Cold spots Nef Vif Tat2 Tat1 Vpu Rev2 Vpr V1 V2 LoopD V4 V5 Rev1 Env 5000 6000 7000 8000 9000 HXB2 positions Fig. 6 Identification of recombination hotpots in HV-1 genome. Recombination hotspots are shown in dark orange and cold spots in blue across the 3′ half genome. Position numbering is relative to HXB2. Each line represents a position, and the thickness of the colored regions represents consecutive positions. The sequences from the last time point (day 432) in CH0654 were excluded from the analysis due to the initiation of ART at day 112 infections recombination lowers the rate at which diversity where the parental strains were discordant. The glycosylation site accumulates in the viral population . In these nine hetero- conferring mutation T234N was also found to be preferentially geneous infections, recombination not only tends to lower the carried forward by recombination events in CH505. It was first average diversity, but it also broadens the spectrum of genetic detected at day 202 (Fig. 8 and Supplementary Table 4) and by diversity distribution (Fig. 7). We speculate that this mechanism the time there is enough overall diversity for RAPR to detect allows the virus to explore a wider range of phenotypes, and recombinants (day 538), most sequences carry the glycosylation therefore potentially more likely escapes from immune responses. site. There were only 6 recombinants from day 538 where the Similar patterns of recombination were observed in vitro (Sup- parental strains are heterogeneous at this position (i.e., only one plementary Fig. 18). of the parents carries the glycosylation site), and in all events the recombinant always carries the resistance-conferring genotype (Supplementary Fig. 19b), indicating that recombination con- Recombination analysis in a homogeneous infection. Because tributes to the rate at which this mutation reached fixation. We of its sensitivity in detecting recombination even in low-diversity hypothesized that the glycosylation site mutation at position 234 scenarios, RAPR also proves useful when studying viral evolution could confer resistance to autologous antibodies, and confirmed by recombination in homogeneous infections, once enough experimentally that it indeed confers resistance to early lineage diversity has accrued to enable its detection. We show one CD4bs antibodies isolated from CH0505 (Supplementary example of this application, and use it to address the question of Table 4). Overall, RAPR found 17 unique recombination events whether recombination contributes to the emergence of resis- and 11 descendants, all in the last two time points from CH0505. tance. We used viral and neutralization data from CH0505 who 38 24 23 27 The other tested tools (RAT , RDP4 , RECCO , and GARD ) developed broadly neutralizing antibodies (bnAbs) over the identified between 0 and 18 recombinants. course of several years, and whose antibody lineages and neu- 30,36,37 tralization activity have been well characterized . The first Increased resistance of recombinants to neutralization. mutations conferring resistance arose at day 90 (Fig. 8a). New CH0010 also developed potent neutralizing antibodies against all mutations continued to accumulate, and the ones conferring five T/F Env pseudoviruses over time (Supplementary Table 5). resistance persisted at later time points. Two CD4 binding site To determine if recombinants would allow viruses to escape from (CD4bs) targeting B-cell lineages were found in this individual, later developed autologous neutralizing antibodies, we looked at and by week 100 (day 692) into the infection, the virus developed the sequences sampled at day 72 and generated Env pseudo- complete resistance to the CH235 antibody lineage and to half of viruses from two non-recombinants and three recombinants the antibodies from the CH103 lineage (Fig. 8b). As expected, (Supplementary Fig. 20 and Supplementary Table 5). The two given the homogeneity of a single T/F virus infection, RAPR does non-recombinant viruses could only be neutralized by the later not find any evidence of recombination at the early time points. time point plasmas (days 188 and 275). However, none of the By day 538, however, it finds that 25 sequences out of 31 are three different recombinants from day 72 were neutralized by all recombinants, and by day 692, all sampled sequences are either plasmas, ignoring the weak neutralization of one day 72 Env novel recombinants or descendants of recombination events pseudovirus by the plasma from day 188 (Supplementary (Fig. 8c). Table 5). This indicated that recombinants were more likely be Interestingly, the mutations that confer CH505 autologous resistant to later autologous neutralizing antibodies than non- antibody resistance were maintained in the diversifying quasis- recombinants. pecies through recombination events. For example, the pair of mutations V281G and N279D in the D loop appeared first at day 202 and day 55, respectively, and each mutation confers partial Discussion resistance to the CH235-lineage antibodies (Supplementary We developed a new analytical tool, RAPR, that allows for Table 4), with a strengthening of such resistance when they computationally efficient detection of recombinants in long- 30,36 appear together . RAPR finds that when parental strains are itudinal samples from recently infected individuals. RAPR clas- heterogeneous at position 279, the recombinant carries the sifies sequences depending on whether they originated from the resistant D genotype 10 out of 12 times. When the parental T/Fs via mutation or from a recombination event between T/F strains are heterogeneous at position 281, the recombinant carries lineages. Compared to existing programs like RDP4 and the G-resistant form 9 out of 10 times. Seven out of 8 times both GARD , our simple simulations indicate that RAPR is more mutations are present in one of the parents, and both are sensitive in detecting recombination events particularly in low- inherited by the recombinant (Supplementary Fig. 19a). There- diversity settings (e.g., about 1%). RAPR has a built-in mechan- fore, whether at position 279, 281, or both, recombination carries ism to constrain later sequences from being considered as par- the resistance mutations forward in the vast majority of cases ental strains of earlier sequences. RAPR sensitivity, detailed 8 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE 3′ Half genome CH0010 CH0078 CH0200 n =20 n =51 n =68 n =159 2.0 3.0 2.5 n =54 n =59 2.0 1.5 2.0 1.5 1.0 1.0 n =151 n =8 1.0 0.5 0.5 n =33 n =17 n =26 n =21 0.0 0.0 0.0 CH0047 CH0228 CH1754 2.5 1.5 5.0 n =38 n =29 n =138 n =182 n =35 n =29 2.0 4.0 1.0 1.5 3.0 1.0 2.0 0.5 n =21 n =8 n =20 n =6 0.5 1.0 n =63 n =84 n =15 n =9 n =9 0.0 0.0 0.0 CH0654 CH1244 CH0275 n =119 6.0 2.0 1.5 n =99 n =35 n =74 n =23 1.5 n =7 4.0 1.0 1.0 2.0 0.5 0.5 n =58 n =16 n =22 n =4 n =31 n =27 n =9 n =15 n =25 n =6 0.0 0.0 0.0 5′ Half genome CH0010 CH0078 CH0200 1.5 1.0 1.5 n =45 n =51 n =24 n =19 0.8 n =15 n =15 1.0 1.0 0.6 0.4 0.5 0.5 n =48 n =3 n =8 n =6 n =14 n =4 0.2 0.0 0.0 0.0 CH0047 CH0228 CH1754 n =17 n =21 1.0 1.5 2.5 n =75 n =15 n =9 n =48 0.8 2.0 1.0 0.6 1.5 0.4 1.0 0.5 0.2 n =5 n =4 0.5 n =32 n =17 n =6 n =7 n =11 n =11 n =10 0.0 0.0 0.0 CH0654 2.5 n =118 n =111 2.0 1.5 1.0 n =20 n =27 n =4 n =11 n =42 n =7 n =4 0.5 0.0 Fig. 7 Genetic p-distances within lineages, inter-lineages, and within recombinants. Pairwise genetic distances (p-distance) were calculated among recombinant viruses, variants evolved from the same T/F virus (intra-T/F), as well as variants from different T/F viruses (inter-T/F) for 3′ half (a) and 5′ half (b) genomes. For intra-T/F diversity, T/F lineages with less than three sequences were excluded. Pairwise genetic distances are shown in black dots, and their means in blue lines. All data were calculated by combining the viral sequences from all time points up to the time that all of the viruses were recombined. The middle line indicates the median of the diversity; the box shows 25 to 75 percentile of the data; and the whisker shows 10 to 90 percentile of the data. The number of sequences for each pairwise comparison group is indicated at the top of the plot NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 9 | | | Intra-T/Fa Intra-T/Fb Intra-T/Fc Intra-T/Fd Intra-T/Fe Intra-T/Ff Intra-T/Fg Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Rec Intra-T/Fa Inter-T/Fs Intra-T/Fb Intra-T/Fa Intra-T/Fc Intra-T/Fd Intra-T/Fb Intra-T/Fe Intra-T/Fa Rec Rec Inter-T/Fs Intra-T/Fb Inter-T/Fs Intra-T/Fb Intra-T/Fc Rec Intra-T/Fd Intra-T/Fe Inter-T/Fs Intra-T/Ff Intra-T/Fg Intra-T/Fa Intra-T/Fi Rec Intra-T/Fb Inter-T/Fs Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Intra-T/Fc Intra-T/Fd Intra-T/Fe Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Rec Inter-T/Fs Intra-T/Fa Intra-T/Fb Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Intra-T/Fa Rec Intra-T/Fb Inter-T/Fs Rec Inter-T/Fs Intra-T/Fa Rec Inter-T/Fs Genetic p -distance (%) Genetic p -distance (%) ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ab c T/F w4.3 Color key/histogram w4.26 w4.10 w4.27 1000 w4.29 w4.8 w4.37 w4.51 w4.56 w4.16 w4.13 w4.14 w4.11 600 w4.15 w14.21 w14.3 w14.4 400 w14.6 w14.19 w14.17 w14.20 w14.8 w14.2 w14.29 w14.34 w14.12 w14.10 w14.16 w14.30 w14.31 w14.32 w14.39 w20.27 Value w20.4 w20.19 w20.25 w20.14 w20.7 w20.9 w20.23 w20.24 Week 4/day 19 w20.26 w20.11 w20.22 Week 14/day 90 w20.8 w20.2 Week 20/day 132 w20.3 w20.21 w20.15 Week 30/day 202 w30.28 w30.13 Week 53/day 362 w30.24 w30.5 w30.34 Week 78/day 538 w30.37 w30.12 w30.10 Week 100/day 692 w30.19 w30.21 w30.18 w30.32 w30.6 w30.15 w30.17 w30.31 w30.36 w30.25 w30.8 w30.23 w30.26 w30.20 w30.27 w30.9 w53.31 w53.13 w53.28 w53.22 w53.6 w53.11 w53.19 w53.14 w53.32 w53.3 w53.25 w53.16 w53.17 w53.15 w53.27 PNG 279–281 w53.8 w53.10 T234N (D Loop) w53.29 w53.9 w78.15 w78.1 w78.3 w78.10 w78.8 w78.7 w78.33 w78.6 w78.14 w78.17 w78.38 w78.25 w78.4 w78.9 w78.5 w78.16 w100.B4 w100.A3 w100.A5 w100.B6 w100.A10 w100.B8 w100.A11 w100.B3 w100.B7 w100.A2 w100.A13 w100.A6 w100.A4 w100.A12 234 279–281 Fig. 8 Sequence and neutralization analysis of viruses from homogeneously infected CH0505. a Syn/nonsyn highlighter plot shows the synonymous (green) and nonsynonymous (red) mutations compared to the T/F sequences. b Autologous neutralization analysis. Multiple Env pseudoviruses were generated from seven time points from CH0505 over 2 years of infection. Neutralization susceptibility of 123 pseudoviruses were determined with 13 CH103-lineage antibodies (right panel) and neutralization susceptibility of 33 pseudoviruses were determined were determined with 16 CH235-lineage antibodies (right panel). Neutralization magnitude is color coded by warm colors, with red being the strongest. Blue color indicates that no neutralization activities were detected with neutralizing antibodies at the 50 µg/ml concentration. c Recombinant env genes from days 538 and 692 after infection. Recombinants and their breakpoints, as determined by RAPR, are shown. Recombination breakpoints are shown as black boxes, and they are filled when the breakpoints are significant output, and unique time-of-sampling features enabled an used to evaluate the effectiveness of a prevention or vaccination informed exploration of the timing of replacement of the T/F strategy . This is compatible with previous studies, which have lineages by either recombinants or their descendants. This is found that by reassembling genotypes that have appeared through important for understanding the continuous evolution of the viral independent events, recombination can generate a wider range of population and the rapid exploration for new immunological new phenotypes that would either take much longer or be less escapes. RAPR thus nicely adds to an existing set of other tools likely to appear through mutation alone . This wider spectrum of for detecting and characterizing recombination in vivo that genetic variants generated by recombination might allow viruses results from continuous interplay among distinct viruses during to rapidly adapt to the host environment and outcompete par- 39,40 infection . ental viruses during infection. We applied RAPR to SGA-derived sequences sampled long- We also showed how RAPR can be applied to longitudinal itudinally from 9 heterogeneously infected participants, starting sequences sampled from homogeneous infections. Once adequate from very early in infection. We calculated frequencies of accu- diversity has accumulated in the HIV-1 quasispecies, all triplet mulation of new mutations and new breakpoints, as well as hot combinations of sequences can be tested, and recombination and cold spots for recombination. The accumulation rate of new events framing resistance mutation can be tracked as they emerge mutations was higher than the accumulation rate of new break- over time. The data can then be used to answer the question of points across all participants, although this may be due to the how recombination affects the establishment of key antibody- potential underestimation of recombination events in homo- resistant mutations. logous regions. Even then, recombinants quickly replaced the T/F Recombination rates calculated in vitro tend to generally be 18–20 lineages with a median half-time of 27 days, and the recombi- higher than the ones observed in this study . Bioinformatics nation significantly increased the genetic complexity of a viral recombination tools cannot detect recombination events between population. The rapid loss of T/F lineages should be taken into highly related parental strains within the same lineage or that account if a reduction in numbers of T/F viruses is a parameter occur in highly similar regions of the parental genome. To avoid 10 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | UCA IA8 IA7 IA6 IA5 IA4 IA3 IA2 IA1 CH103 CH104 CH105 CH106 CH31 UCA.1 IA4.1 IA3.1 CH235.11 IA2.1 IA1.1 CH240 CH236 CH235 CH239 CH241 CH235.7 CH235.13 CH235.10 CH235.9 CH235.12 Count −1.72 −1.34 −0.95 −0.57 −0.18 1200 0 0.2 0.59 0.97 1.36 1.74 2400 NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE this limitation, Cromer et al. developed a statistical method that Recombination between different nAb escape mutations in a accounts for template switching modeled on in vitro results. In homogenously infected host likely rendered recombinants to our study, we used a simulation approach in which we introduced more easily escape early maturation members of the bnAbs known numbers of breakpoints in simulated data to estimate detected in the same individual. Taken together, these results strand switching events in vivo. Another caveat in the inter- suggest that extensive in vivo recombination may allow the pretation of our results is that as the sampling times increases and complex recombinant population to have better chances to adjust sequences become more divergent, the chance of a convergent to host immune selection pressure. selection event increases as well, thus dampening RAPR ability to Finally we would like to note that intra-subtype recombination distinguish independent recombination events. The new muta- in HIV was first detected due to the high diversity of the virus, tion and new breakpoint rates were found to peak at coordinated but it has also been observed in less diverse viral species such as 51–53 times (Fig. 5)—a phenomenon particularly striking in the 3′ half hepatitis B viruses, enterovirus, and norovirus . Because of its genomes. This suggests that there may be transient periods of higher sensitivity in low-diversity scenarios, RAPR may prove stronger diversifying selective pressure acting on both recombi- useful in detecting recombination in other viral species where it nation and base-substitution rates during the course of an has been more challenging to detect recombination. infection. The breakpoints detected by RAPR, provided in parallel to the Methods phylogenetic trees, can help clarify lineage relationships. When Study participants. To study HIV-1 recombination longitudinally, we identified recently infected participants whose estimated time from HIV-1 infection to blood combining 3′ half genome data from all nine participants, we sampling was generally less than 1 month. Of 36 individuals from a CHAVI001 found that new breakpoints tended to accumulate around (either infection cohort , 27 were infected with a single virus (homogeneous infection), before or after) the variable regions of Env (hotspots). In contrast, and 9 were infected with multiple ones (heterogeneous infection). The first viral in vitro studies of inter-subtype and CRFs have shown that RNA-positive plasma samples from these 9 heterogeneous individuals were col- lected 9–27 days post infection, as estimated using our tool Poisson Fitter sequence homology drives recombination, with breakpoints (Table 1). This early into the infection, the within-lineage diversity was sufficiently clustering in more conserved regions between the parental low to allow us to infer infecting T/F lineages and to estimate the time since 42,43 44 strains . More recently, Jia et al. analyzed breakpoints in infection using these previously described methods . The number of T/F viruses CRF sequences from the Los Alamos HIV Sequence Database and we were able to identify in the 9 participants ranged from 2 to 9 (Fig. 1, Table 1, and Supplementary Figs. 1 and 2). Given the close similarity of the within-host T/ found hotspots to be clustered at protein overlapping regions, but Fs (the within-host T/F mean base pair distance ranged from 0.76% to 2.93%), we not within the Env protein. The fact that we observe recombi- inferred that each participant had been infected by a single donor and at one single nation hotspots in Env around variable regions during within- time point, with one notable exception: in CH0200 we noticed the appearance of a host evolution could be due to a combination of factors. For new, distinct lineage at day 74 that persisted at later time points. After excluding example, recombination events mechanistically favor regions of the possibility of sample contamination, we concluded that this was an incidence of superinfection. Near-full-length sequences of overlapping half genomes were local homology of greater than 30 bases (maximum efficiency is obtained by SGA for 8 of 9 participants over multiple time points, while only the 3′ seen at 60 bases or longer) to facilitate strand switching during half genome sequences were obtained for CH1244. reverse transcription , although strand transfers were frequently Twelve homogeneously infected and 7 heterogeneously infected individuals had detected at limited homologous regions (<2 bases) in our recent no known protective or disease susceptible HLA alleles (Supplementary Table 6). In order to have enough diversity to detect recombination from early infection, as study . Between highly diverse viruses representing different matter of necessity, we focused on heterogeneously infected participants. As a subtypes, this may severely constrain the rate of recombination result, this study population may have biases relative to single T/F infections in events in the variable env gene. In contrast, the degree of terms of persistence of viral diversity. In addition to these 9 heterogeneous homology throughout the genome in our setting of highly related participants, we also included 341 env sequences from the homogeneously infected CH0505, spanning day 19 through day 692 since infection . T/F variants is much greater, including homology in Env, Longitudinal plasma samples within 2 years of infection were obtained from enabling recombination throughout the genome. This, combined nine heterogeneously infected participants as well as one homogenously infected with the fact that Env is under a particularly high degree of participant. All participants were male. CH0010, CH0047, CH0078, CH0200, selective pressure in vivo due to antibody-mediated selection for CH0228, CH0275, CH1244, and CH1754 were from Malawi, CH0078 from South 47,48 Africa, and CH0654 from the United States. None of these nine individuals immune escape , could explain why we see an enrichment for developed AIDS-like symptoms within 2 years of infection. No individuals were on recombination in variable regions in particular. In turn, the antiretroviral therapy (ART), except CH0654 who was on and off ART since day spectrum of genetic variants generated by recombination during 112 after infection. Therefore, sequences from the last time point (day 432) of infection contributes to the viral diversification that precedes the CH0654 was excluded from all analyses. The first screening plasma samples were collected within weeks from the infection for eight individuals, on or before Fiebig development of antibodies with neutralization breadth during 36,37 stage III, while the screening sample for CH1244 was collected at Fiebig stage IV . the course of an infection . All viruses were subtype C except CH0654, which was subtype B. Written informed The extensive recombination observed in each of these nine consent was obtained from all study participants and the study was approved by participants highlights an important caveat on the interpretation the Duke University Institutional Review Board. of phylogenetic trees that has long been known , but generally underappreciated. Recombination violates the primary assump- Amplification of near-full-length viral genome by SGA. Viral RNA was tions on which bifurcating evolutionary trees are built, yet such extracted from plasma samples or culture supernatant using the PureLink Viral RNA/DNA Mini Kit (Invitrogen, Carlsbad, CA). Complementary DNA was syn- trees are frequently used to represent HIV-1 evolution within a thesized using the SuperScript III reverse transcriptase (Invitrogen, Carlsbad, CA) host (including in this paper), and in the population as a whole. with the primers 07Rev8 (5′-CCTARTGGGATGTGTACTTCTGAACTT-3′;nt The trees provide a useful visualization of the clustering of related 5193–5219 in HXB2) for 5′ half genome SGA, or primer 1.R3.B3R 5′- sequences and a portrait of the acquisition of diversity over time. ACTACTTGAAGCACTCAAGGCAAGCTTTATTG-3′ (nt 9611–9642) for the 3′ half or near-full-length genomes. SGA was performed to obtain the 5′ half, 3′ half, However, in light of the extensive and early recombination or near-full-length HIV-1 genome as described previously . For the 5′ half gen- demonstrated here, it is important to consider such HIV-1 trees ome amplification, the first round PCR was carried out using the primers 1.U5.B1F as representations of clustering patterns of similar sequences, and 5′-CCTTGAGTGCTTCAAGTAGTGTGTGCCCGTCTGT-3′ (nt 538–571) and not direct evolutionary trajectories. 07Rev8 5′-CCTARTGGGATGTGTACTTCTGAACTT-3′ (nt 5193–5219), and the 10,11 second round PCR with primers Upper1A 5′-AGTGGCGCCCGAACAGG-3′ (nt Recombination plays a role in escape from T-cell responses 634–650) and Rev11 5′-ATCATCACCTGCCATCTGTTTTCCAT-3′ (nt and bnAbs . We now show that recombination between differ- 5041–5066). To amplify the 3′ half genome, the first round PCR was performed ent T/F env genes in a heterogeneously infected individual were using the primers 07For7 5′-CAAATTAYAAAAATTCAAAATTTTCGGGTT- more resistant to the later autologous neutralizing antibodies than TATTACAG-3′ (nt 4875–4912) and 2.R3.B6R 5′-TGAAGCACTCAAGG- the T/F lineage variants from the same time point. CAAGCTTTATTGAGGC-3′ (nt 9636–9607), and the second round PCR with NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 11 | | | ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 primers VIF1 5′-GGGTTTATTACAGGGACAGCAGAG-3′ (nt 4900–4923) and estimates. In order to account for the heterogeneity of the infection, we divided Low2c 5′-TGAGGCTTAAGCAGTGGGTTCC-3′ (nt 9591–9612). Then, 2 µl of the each first-time-point sequence alignment into single-founder lineages. Sequences first round PCR products were used for the second round PCR. The PCR ther- significantly enriched for hypermutation (p < 0.1 by Fisher's exact test) and mocycling conditions were as follows: one cycle at 94 °C for 2 min; 35 cycles of a sequences with ambiguous nucleotides were excluded from the analysis. Our denaturing step at 94 °C for 15 s, an annealing step at 55 °C for 30 s, an extension method neglects differences between the putative ancestor and the consensus, an step at 68 °C for 5 min; and one cycle of an additional extension at 68 °C for 10 assumption that breaks down for very small sample sizes. For this reason, lineages min. with less than three sequences were excluded, and also because the goodness-of-fit For amplification of near-full-length HIV-1 genome, the first round PCR was test in the Poisson fitter is based on the chi square test, which is unreliable for small carried out using the primers Upper1A and 2.R3.B6R, and the second round PCR samples (less than 4 data points). For each remaining lineage, we approximated the was done with the primers Upper2 5′-CTCTCTCGACGCAGGACTCGGCTT-3′ T/F sequence by the lineage consensus, calculated the Hamming distance (HD) of (nt 681–704) and LTRD 5′-CTGGAAAGTCCCCAGCGGAAAGTC-3′ (nt each sequence in the lineage from this T/F sequence, and then merged all HDs 9460–9437). Then, 2 µl of the first round PCR products were used for the second from lineages of the same alignment into one distribution. Assuming that all T/Fs round PCR. The PCR thermocycling conditions were as follows: one cycle at 94 °C evolved from the same time point, were all equally fit, and accumulated random for 2 min; 10 cycles of a denaturing step at 94 °C for 15 s, an annealing step at 55 °C mutations at the same rate, the combined HDs follow a Poisson distribution. We for 30 s, an extension step at 68 °C for 8 min; 20 cycles of a denaturing step at 94 °C proceeded to estimate the time since infection using this combined distribution for 15 s, an annealing step at 55 °C for 30 s, an extension step at 68 °C for 8 min following methods previously described . In order to estimate the dates post with an incremental 20 s for each successive cycle; and one cycle of an additional infection for participants for which we had alignments covering both the 5′ half extension at 68 °C for 10 min. genome and the 3′ half genome, we first timed each half genome alignment separately, and then averaged the two estimates using a harmonic mean weighted by the inverse of the number of sequences in each genome half, a method that Sequence analysis. The PCR amplicons were directly sequenced by the cycle minimizes the asymptotic sampling variance. sequencing and dye terminator methods on an ABI 3730xl genetic analyzer (Applied Biosystems, Foster City, CA). Individual sequences were assembled and Wald–Wolfowitz Runs Test. Recombination results in contiguous stretches of edited using Sequencher 4.7 (Gene Codes, Ann Arbor, MI). The sequences were aligned using CLUSTAL W, and the manual adjustment for optimal alignment was genetic information inherited, with random mutations, from each parent. Our program, RAPR, checks for patterns of inheritance in every set of three sequences performed using Seaview. Neighbor-joining trees were constructed using the Kimura 2-parameter model with 1000 bootstrap replications. GenBank Access in a given alignment. Specifically, RAPR identifies the sites that differ in exactly one numbers for the newly obtained sequences in this study are: of the sequences and applies the Wald–Wolfowitz Runs Test to check if these MF499156–MF502416. These 3260 new sequences augment preexisting data to sites cluster more than expected of random mutations. This is done as follows. Let provide a unique dataset, including extensive sets of longitudinally sampled A and B be the putative parental strains and C the recombinant one. Each position sequences from individuals infected with multiple T/F viruses. All sequences from in C is either identical to A but not B, identical to B but not A, identical to both, or this study, alignments, and auxiliary data are also available in the HIV special different to either one. In the two latter scenarios we say that the site is non- interest alignments (https://www.hiv.lanl.gov/content/sequence/HIV/ informative. Define an “A run” as the maximal non-empty segment of adjacent SI_alignments/datasets.html). positions in C that are either non-informative or identical to A but not B. Likewise, a “B run” is the set of contiguous sites that are either non-informative or identical to B but not A. Let m be the total number of positions where C is more similar to Generation of infectious molecular clones. The infectious molecular clones A; n the total number of sites where it is more similar to B; and r the total number (IMCs) representing T/F viruses from CH0200 (a, b, and c) and CH0228 (a and b) of “A” or “B” runs. The probability that there are at most r runs just by a chance were chemically synthesized and cloned in pUC57 as described previously . The ordering of independent mutations is given by: virus stocks were prepared from the supernatants of 293T cells (NIH AIDS Reagent Program, cat. number: 103) transfected with IMCs . All cell lines were tested free X PRðÞ  r¼ f ðÞ k ; of mycoplasma contamination. k¼2 Virus culture. Primary CD4 T cells were purified from peripheral blood where the probability of a total of k runs is given by: mononuclear cells from a healthy donor under the protocols approved by the Duke ! ! ! ! University Institutional Review Board. Written informed consent was obtained m  1 n  1 m  1 n  1 from all study participants. Cryopreserved CD4 T cells were thawed and sti- k3 k1 k1 k3 d e b c b c d e 2 2 2 2 mulated for 3 days in RPMI-1640 containing 10% fetal bovine serum (FBS), f ðÞ k ¼ PRðÞ ¼ k¼ m þ n interleukin 2 (IL-2) (32 U/ml; Advanced Biotechnologies, Columbia, MD, cat. number: 03-001-050), soluble anti-CD3 (0.2 µg/ml; eBioscience, San Diego, CA, cat. number: 16-0037-81), and anti-CD28 (0.2 µg/ml; BD Bioscience, San Diego, CA, cat. number: 16-0281-81). The stimulated cells (1 × 10 cells) was seeded into with a and a standing for the least integer not less than and the greatest integer not each well of a 96-well plate, and infected with a mixture of multiple T/F viruses greater than a, respectively (see Supplementary Fig. 21 as an example). (0.001 m.o.i. (multiplicity of infection) for each virus) from each participant (3 T/F viruses for CH0200, and 2 T/F viruses for CH0228) . After absorption at 37 °C for Similarity clusters and correcting for circularity. In cases where multiple viruses 4 h, the cells were washed 3 times with RPMI-1640. The infected cells were cultured in a 24-well plate with 600 µl of RPMI-1640 containing 10% FBS and IL-2 (32 U/ establish an infection, and likely transmitted founder strains (T/Fs) are resolved, T/ Fs can be constrained to be parents in a recombination triplet, thus avoiding ml). The supernatant (200 µl) were harvested at day 4 and then passaged onto fresh CD4 T cells four times. The virus replication was monitored by determination of detection of recombination events that happened in the donor, prior to trans- mission. When the input alignment contains data from multiple time points, RAPR the p24 concentration in the supernatant using the p24 ELISA kit (PerkinElmer, Waltham, MA). The near-full-length genomes from passages 1, 2, and 4 were can assign likely parent–daughter relationships by factoring in the time of sam- pling. As a result, sequences from future time points do not affect the recombi- obtained by SGA to determine the frequency of recombination in vitro. nation inferences at earlier time points. When sampling is sparse, this constraint may be broken by the fact that the actual parental strains may have escaped Neutralization assay. Neutralization activity was measured in a luciferase reporter sampling. In this case RAPR chooses the sequences that are closest to the actual system in TZM-bl cells (NIH AIDS Reagent Program, cat. number: 8129). Plasma parent even though they may have been sampled at a later time point. Multiple samples were heat inactivated at 56 °C for 1 h and were then diluted at a 1:3 serial testing is addressed through a false discovery rate approach applied to each time dilution starting at 1:20. The diluted plasma samples in duplicate were incubated point individually. Using a greedy algorithm, sequences are grouped into closely with viruses for 1 h at 37 °C and then used to infect TZM-bl cells. The 50% related clusters such that differences within each cluster are likely to have arisen by inhibitory dose (ID ) was defined as the plasma dilution at which relative lumi- mutation. Some clusters will have evolved directly from a TF. Others will represent nescence units (RLUs) were reduced by 50% compared with RLUs in virus control recombination events that were indicated in the initial comparison of all sets of wells after subtraction of background RLUs in cell control wells. A response was three sequences. For each recombinant cluster, RAPR calculates the minimum HD considered positive for neutralization if the ID titer was >1:20 dilution. from the closest cluster, and compares it to the minimum number of mutations diverging from the parental strains. Based on which of the two numbers is lower, Determination of dates post infection. Highlighter plots and neighbor-joining the cluster is “explained” as having arisen from another cluster by mutation, or as a phylogenetic trees were generated using the Highlighter tool at the Los Alamos de novo recombination of sequences belonging to other clusters. When a recom- HIV sequence database (https://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/ bination event is called, the sequence in the cluster with the least number of highlighter_top.html). Based on visual inspection of these plots we determined that mutations from the parental strains, the earliest parents, or with the lowest p value all nine participants in this study had been infected by two or more genetically is deemed the initial recombinant, and all other sequences in that cluster that share distinct T/Fs. Days since infection were estimated by applying the Poisson fitter the same breakpoints are considered “descendants.” 33,54 model on the first time point alignments from both the 3′ half and 5′ half In the realistic case of multiple recombination events, or when one of the genome sequences and then taking the harmonic average between the two time parental strains escapes sampling, a triplet will yield a low p value when parents 12 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE and child swap roles. When such parent–daughter cycles are found, RAPR replaces can be estimated as Σ f (b≥m) p (β). We can use this to find simulated CL on the β β r the parents from the least likely configuration in the cycle with an alternate set, rate of recombination: in particular, the least value of r for which p (b≥m) becomes and, if no such choice is admissible, merges the current cluster to a similar one and greater than 1−α provides a CL for the rate. deem the recombinants “descendants”. For more details see our tool help page All simulations were run using R . The simulated recombinants, together with (https://www.hiv.lanl.gov/content/sequence/RAP2017/help.html). the parental sequences used to generate them, are available at the following link: (https://www.hiv.lanl.gov/repository/hivdb/ Simulated_Data_for_RAPR_NatComm_Paper). RAPR is available for public use Breakpoint determination and frequency estimation. For each recombinant, on the LANL HIV database (www.hiv.lanl.gov/content/sequence/RAP2017/rap. RAPR identifies intervals in which the breakpoints are most likely to have hap- html). All software is developed in C++, R, and cgi. The R code utilizes functions 61 62 pened. Each breakpoint interval is examined and the run-test p value is recalculated from the packages ape and seqinr . after removing the run next to it. If this new p value is less significant than the overall p value, we conclude that the breakpoint is contributing to the significance Neutralization analysis of viruses from CH0505. To determine the impact of of the recombination. For each time point in the alignment, RAPR calculates two recombination on the emergence of antibody resistance, we ran RAPR on 341 env frequencies: the number of de novo breakpoints per sequence and the number of sequences from CH0505 , spanning day 19 through day 692 since infection. Of new mutations per sequence, where by “new mutation” we mean any nucleotide these, 123 of these env genes were used to generate pseudoviruses which were tested change that is not present in any of the T/Fs. Each mutation and breakpoint is against a panel of 13 autologous CH103-lineage antibodies, and 33 of the 123 were counted only at the time it appears for the first time. These frequencies are then additionally tested against a panel of 16 autologous CH235-lineage antibodies. converted to rates by dividing by the sample time interval over which these mutations or breakpoints arose. Kendall correlations between these rates and VL were compared using a one-sided sign test. The VL was averaged between con- Identification of recombination hotspots. To explore possible hotspots of secutive time points to account for the fact that mutations and recombination recombination across the 3′ half genomes, we devised the following strategy. For events occurred in the time period between the two consecutive time points when every participant, we counted the appearance of a new breakpoint only once. VL were sampled. We then calculated Kendall correlations τ between the break- Because the program outputs an interval where the recombination breakpoint is point rate (calculated as described above) and VL, and separately between the most likely to have occurred, we took the midpoint of such interval, and then breakpoint rate and the mutation rate. CH0275 was excluded from this analysis considered the cumulative distribution of the positions at which these breakpoints because of too few time points. For the rest of the participants, we observed that 6 occurred across all nine participants. Regions where the cumulative distribution out of 8 times the Kendall τ was higher for the correlation of the breakpoint rate had a steep step upward indicated a high accumulation of breakpoints and with the mutation rate compared to that with the average VL, and the other two therefore hotspots. Conversely, regions where the cumulative distribution was flat times they were equal. This was statistically significant (p < 0.015) when applying a indicated potential cold spots. To assess this with statistical significance, we con- one-sided sign test. For the 5′, we had only seven sets, and five of these showed the sidered a sliding window of 20 nucleotides and for each window we calculated the same trend, but in the other two, the breakpoint rate was more positively correlated slope of the line that best fitted the cumulative distribution within said window. If with the VL. This was not statistically significant (p > 0.23). there were hotspots of recombination within the given window, we expected the slope of the best fitting line to rise sharply, or to lower in areas of cold spots. We initially explored various window sizes, between 10 and 50 nucleotides, and we saw Breakpoint simulation. In order to estimate the potential bias introduced by that decreasing to smaller sizes from 20 nucleotides did not add any additional missing breakpoints that fall in regions of homology between parental strains, we information, while increasing the size caused several potentially interesting regions devised the following simulation. First, to encompass the time when recombination to go undetected. To smoothen out the differences across participants, we repeated happened, we chose participants for whom RAPR had not detected recombinants the sliding window algorithm 9 times, each time removing one of the participants. of recombinants at the first time points. These were the 5ʹ half from CH0047 and For each iteration we considered all positions corresponding to the upper quartile CH0200 (day 19 and 11 respectively), and the 3ʹ half from CH0228 (day 19) and of the slopes and all positions corresponding to the lower quartile, and then took all CH1244 (days 12 and 15). To avoid overestimating the frequency of template the positions that fell either in the upper or the lower quartile every time we ran the switching, only sequences from the early time points of 4 participants (CH0047, algorithm. CH0200, CH0228, and CH1244) were selected because they were sampled very early and had no evidence yet of recombination of recombinants. Cromer et al. Genetic distances. Genetic complexity among lineages was measured as the mean estimated 5 to 14 template switches per full genome, so for our half genome pairwise p-distances. The p-distances (the proportion of sites at which two samples we considered a range of breakpoints from 1 to 8 and for each value we sequences differ) were calculated among recombinant viruses, variants evolved randomly generated 100 artificial recombinants using the sequences from the above from the same T/F virus (intra-TFs), as well as variants from different T/F viruses participants. Each recombinant was generated randomly selecting two parents, one (inter-TFs) for 3′ half and 5′ half genomes using the software MEGA6 . from each of two different T/F clusters. For each set of 100 recombinants, we counted how many breakpoints (significant and not) RAPR detected. We then accumulated, for each true breakpoint between 1 and 8, the total number of Dynamical modeling. We extended the basic mathematical model for HIV breakpoints and the number of statistically significant breakpoints RAPR detected 64 dynamics to include the coinfection of target cells by two different viruses and the (Supplementary Figs. 13 and 14). resulting generation of recombinant viruses. Assuming n T/Fs, let V be the density We also used the same simulations to estimate frequentist 95% confidence of the ith (i = 1… n) T/F virus, and V the density of recombinant viruses. n+1 interval on the true number of breakpoints as follows. For each recombinant Uninfected target cells T become infected with at a rate β’ and produce virus at a i, observed in the four samples mentioned above, we randomly recombined the rate , for i = 1… n + 1. Free virus is cleared at a rate c. Cells infected with the T/F pi respective parents 5000 times, assigning breakpoints randomly drawn from a viruses can be coinfected with other viral strains, and such coinfection leads to the uniform distribution between 1 and 8. For every set of artificial recombinants from generation of recombinant viruses. It is well known that early after infection, the the same parental pair, we took all the recombinants for which RAPR detected the amount of CD4, the main receptor for HIV binding, on the surface of infected cells same amount of breakpoints as in the observed recombinant, and looked at the is downregulated due to the action of the Nef and/or Vpu proteins . Although the distribution of their true breakpoints (Supplementary Fig. 15). We then calculated kinetics of CD4 downregulation in vivo are not known, in our model we assume the lower bound on the 95% CL on the true number of breakpoints. Similarly, we that downregulation is slow and incomplete, and that a cell infected with one viral calculated the lower bound on the 95% CL on the true rate of recombination, i.e., variant can be reinfected with another viral strain. Coinfection occurs at a reduced the minimum rate of recombination necessary to explain the RAPR-observed rate γβ’ . Uninfected and virus-infected cells die at rates d and δ , respectively. The i i recombinant. In particular, the lower level-α CL l on an estimate using statistic m α dynamics of n T/F viruses and a recombinant strain during acute HIV-1 infection (for example, the observed number of breakpoints) for a population parameter μ is then described by the system of differential equations (for example, the true number of breakpoints) is defined as a statistic such that p (l ≤μ)≥α, where p is the probability distribution under the assumption that the nþ1 μ α μ dT ¼ dTðÞ  T T β′ V ; population parameter is μ and the relation holds independent of μ. To calculate 0 i i dt i¼1 such an l when the allowed values of μ are discrete, and the statistic m is sufficient nþ1 for μ, we can proceed to simulate the sampling process for various proposals π for dI ¼ β′ V T  δ I  γI θ β V ; i ¼ 1::n; dt i i i i i ij j j the population parameter μ, and obtain the fraction of samples for which the j¼1 simulated estimate s is greater than the observed estimate m, i.e., f (s≥m). n nþ1 P P dI nþ1 Assuming the simulation is good enough to ignore the variance of f (s), and this ¼ β′ V T  δ I þ γ θ β I V ; nþ1 nþ1 nþ1 nþ1 ij i j dt j i¼1 j¼1 fraction is a non-decreasing function of π, the least value of π at which this fraction dV becomes greater than 1−α defines such an l . We have no reason to believe that the i ¼ p I  cV ; i i i dt RAPR estimate of the number of breakpoints is a sufficient statistic, so the confidence interval determined in this fashion is likely to be overly conservative. Alternatively, if we assume a uniform rate r of recombination events where I and I is the density of cells infected with the T/F viruses and the i n+1 per sequence position, then the true number of recombination breakpoints s in a recombinant virus, respectively, and θ =1if i≠j and θ =0 otherwise. It should be ij ij sequence of length L are expected to be distributed as a binomial with size L−1 and noted that in this simple model we assumed that coinfection with two different probability r. Then, the probability of observing m or more breakpoints p (b≥m) viruses leads to a generation of the recombinant virus. NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 13 | | | ARTICLE NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 Previous studies have indicated that the virion clearance rate c ranges from 9 to 4. Simon-Loriere, E. & Holmes, E. C. Why do RNA viruses recombine? Nat. Rev. 36 per day which is much faster than the estimated lifespan of infected cells (δ =1 Microbiol. 9, 617–626 (2011). per day) . Therefore, given fast dynamics of the virus, the density of a given viral 5. Brown, R. J. et al. Intercompartmental recombination of HIV-1 contributes to variant is approximately proportional to the density of infected cells producing this env intrahost diversity and modulates viral tropism and sensitivity to entry p β′ p i i i variant, V ¼ I which simplifies the model. By replacing β ¼ and r =β T−δ , i i i i i i c P c inhibitors. J. Virol. 85, 6024–6037 (2011). I nþ1 and denoting the frequency of the ith variant as f ¼ and I ¼ I as the total i I i¼1 i 6. Charpentier, C., Nora, T., Tenaillon, O., Clavel, F. & Hance, A. J. Extensive number of infected cells we obtain the simplified model recombination among human immunodeficiency virus type 1 quasispecies makes an important contribution to viral diversity in individual patients. J. nþ1 df ¼ f r  r f  γβIfðÞ 1  f ; i ¼ 1¼ n; Virol. 80, 2472–2482 (2006). i i j j i i dt j¼1 7. Nishimura, Y. et al. Recombination-mediated changes in coreceptor usage nþ1 n nþ1 confer an augmented pathogenic phenotype in a nonhuman primate model of P P P df nþ1 ¼ f r  r f þ γβI θ f f ; nþ1 nþ1 j j ij i j HIV-1-induced AIDS. J. Virol. 85, 10617–10626 (2011). dt j¼1 i¼1 j¼1 8. Nora, T. et al. Contribution of recombination to the evolution of human nþ1 immunodeficiency viruses expressing resistance to antiretroviral treatment. J. dI ¼ I r f : dt i i Virol. 81, 7620–7628 (2007). i¼1 9. Moutouh, L., Corbeil, J. & Richman, D. D. Recombination leads to the rapid emergence of HIV-1 dually resistant mutants under selective drug pressure. To further simplify the analysis, we assume the infectivity of all variants to be Proc. Natl. Acad. Sci. USA 93, 6106–6111 (1996). 10. Ritchie, A. J. et al. Recombination-mediated escape from primary CD8+ the same (β =β=const), while allowing for potential differences in the rate of virus production from cells infected by variant V . T cells in acute HIV-1 infection. Retrovirology 11, 69 (2014). pi i The dynamics of the recombinant viruses in the population are determined by 11. Streeck, H. et al. Immune-driven recombination and loss of control after HIV two processes: accumulation due to selective advantage (s ¼ r  r where superinfection. J. Exp. Med. 205, 1789–1796 (2008). nþ1 P P n n r ¼ r f = f ) and accumulation due to high degree of cell coinfection with 12. Liu, S. L. et al. Selection for human immunodeficiency virus type 1 j j j j¼1 j¼1 two different viruses (γβI). The rate of coinfection of cells with 2 different viruses is recombinants in a patient with rapid progression to AIDS. J. Virol. 76, γβI approximately given by F ¼ (derivation not shown). To estimate the rate of c 10674–10684 (2002). γβIþβT coinfection of cells with different viruses we assume that the density of target cells 13. Batorsky, R. et al. Estimate of effective recombination rate and average remains constant after peak infection, and the total number of infected cells is selection coefficient for HIV in chronic infection. Proc. Natl. Acad. Sci. USA constant. Extensions of the model relaxing this assumption will be presented 108, 5661–5666 (2011). elsewhere. We fit the model predicting the dynamics of the frequency of the 14. Neher, R. A. & Leitner, T. Recombination rate and selection strength in HIV recombinant viruses in each individual to the experimentally measured frequency, intra-patient evolution. PLoS Comput. Biol. 6, e1000660 (2010). where the initial frequency of the different viral variants is taken directly from the 15. Shriner, D., Rodrigo, A. G., Nickle, D. C. & Mullins, J. I. Pervasive genomic data. Alternatively, we assumed that recombinants were present at some frequency recombination of HIV-1 in vivo. Genetics 167, 1573–1583 (2004). and their accumulation was due to selective advantage s which was estimated by 16. Gao, F. et al. Unselected mutations in the human immunodeficiency virus type fitting the model (reduced to a logistic equation) to data. The model is fitted using 1 genome are mostly nonsynonymous and often deleterious. J. Virol. 78, maximum likelihood method as was recently described elsewhere . Confidence 2426–2433 (2004). intervals for estimated parameters were calculated by resampling the data using 17. Mansky, L. M. & Temin, H. M. Lower in vivo mutation rate of human Jefferey’s intervals for binomial proportions with 1000 simulation runs. Because immunodeficiency virus type 1 than that predicted from the fidelity of purified of the limited data, two parameters (the coinfection rate and selection coefficient) reverse transcriptase. J. Virol. 69, 5087–5094 (1995). could not be accurately estimated from the data. Therefore, in a set of analyses we 18. Onafuwa, A., An, W., Robson, N. D. & Telesnitsky, A. Human fixed the selection coefficient to various values and estimated the coinfection rate. immunodeficiency virus type 1 genetic recombination is more frequent than Because of two half genomes available for some of the time points, for those that of Moloney murine leukemia virus despite similar template switching individuals we have two independent measurements of the frequency of different rates. J. Virol. 77, 4577–4587 (2003). viral variants, and these provide two estimates of the coinfection rate based on the 19. Zhuang, J. et al. Human immunodeficiency virus type 1 recombination: rate, dynamics of recombination at the 3′ and 5′ half genomes. Consistency of the fidelity, and putative hot spots. J. Virol. 76, 11273–11282 (2002). estimated coinfection rates varied by patient. In some cases (e.g., CH0654, 20. Schlub, T. E., Smyth, R. P., Grimm, A. J., Mak, J. & Davenport, M. P. CH0200), we found nearly identical estimates for coinfection rates using 3′ and 5′ Accurately measuring recombination between closely related HIV-1 genomes. data (Table 2 and Supplementary Table 3). However, in other instances, estimates PLoS Comput. Biol. 6, e1000766 (2010). for predicted coinfection rate were different between 3′ and 5′ half data (e.g., 21. Salazar-Gonzalez, J. F. et al. Deciphering human immunodeficiency virus type CH0228, CH0078; Table 2 and Supplementary Table 3). Underlying reasons for 1 transmission and early envelope diversification by single-genome such difference are not entirely clear but may be related to the different numbers of sequences analyzed. All models were implemented and run in Mathematica (ver. amplification and sequencing. J. Virol. 82, 3952–3970 (2008). 22. Siepel, A. C., Halpern, A. L., Macken, C. & Korber, B. T. A computer program 11.2). designed to screen rapidly for HIV type 1 intersubtype recombinant sequences. AIDS Res. Hum. Retroviruses 11, 1413–1416 Data availability. All newly generated nucleic acid sequences in the current study (1995). were deposited in GenBank with accession numbers MF499156–MF502416. RAPR 23. Maydt, J. & Lengauer, T. Recco: recombination analysis using cost is available for public use on the LANL HIV database (http://www.hiv.lanl.gov/ optimization. Bioinformatics 22, 1064–1071 (2006). content/sequence/RAP2017/rap.html). The simulated recombinants are available at 24. Martin, D. P., Murrell, B., Golden, M., Khoosal, A. & Muhire, B. RDP4: the following link: (https://www.hiv.lanl.gov/repository/hivdb/ detection and analysis of recombination patterns in virus genomes. Virus Evol. Simulated_Data_for_RAPR_NatComm_Paper). All relevant data are available 1, vev003 (2015). upon request. 25. Posada, D. & Crandall, K. A. Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc. Natl. Acad. Received: 31 July 2017 Accepted: 10 April 2018 Sci. USA 98, 13757–13762 (2001). 26. Posada, D. & Crandall, K. A. The effect of recombination on the accuracy of phylogeny estimation. J. Mol. Evol. 54, 396–402 (2002). 27. Kosakovsky Pond, S. L. et al. An evolutionary model-based algorithm for accurate phylogenetic breakpoint mapping and subtype prediction in HIV-1. PLoS Comput. Biol. 5, e1000581 (2009). 28. Bradley, J. Distribution-Free Statistical Tests, Chapter 12 (Prentice-Hall, References Englewood Cliffs, 1968). 1. Simmonds, P. et al. Discontinuous sequence change of human 29. Takahata, N. Comments on the detection of reciprocal recombination or gene immunodeficiency virus (HIV) type 1 env sequences in plasma viral and conversion. Immunogenetics 39, 146–149 (1994). lymphocyte-associated proviral populations in vivo: implications for models of 30. Gao, F. et al. Cooperation of B cell lineages in induction of HIV-1-broadly HIV pathogenesis. J. Virol. 65, 6266–6276 (1991). neutralizing antibodies. Cell 158, 481–491 (2014). 2. Sabino, E. C. et al. Identification of human immunodeficiency virus type 1 31. Cromer, D., Grimm, A. J., Schlub, T. E., Mak, J. & Davenport, M. P. envelope genes recombinant between subtypes B and F in two Estimating the in-vivo HIV template switching and recombination rate. AIDS epidemiologically linked individuals from Brazil. J. Virol. 68, 6340–6346 30, 185–192 (2016). (1994). 3. Zhang, M. et al. The role of recombination in the emergence of a complex and dynamic HIV epidemic. Retrovirology 7, 25 (2010). 14 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications | | | NATURE COMMUNICATIONS | DOI: 10.1038/s41467-018-04217-5 ARTICLE 32. Pacold, M. E. et al. Clinical, virologic, and immunologic correlates of HIV-1 62. Charif, D., Lobry, J. R. SeqinR 1.0-2: A Contributed Package to the R Project intraclade B dual infection among men who have sex with men. AIDS 26, for Statistical Computing Devoted to Biological Sequences Retrieval and 157–165 (2012). Analysis. in Structural Approaches to Sequence Evolution: Molecules, Networks, 33. Keele, B. F. et al. Identification and characterization of transmitted and early Populations (eds Bastolla, U. et al.) 207–232 (Springer Verlag, New York, founder virus envelopes in primary HIV-1 infection. Proc. Natl. Acad. Sci. 2007). USA 105, 7552–7557 (2008). 63. Tamura, K., Stecher, G., Peterson, D., Filipski, A. & Kumar, S. MEGA6: 34. Sunnaker, M. et al. Approximate Bayesian computation. PLoS Comput. Biol. 9, Molecular Evolutionary Genetics Analysis version 6.0. Mol. Biol. Evol. 30, e1002803 (2013). 2725–2729 (2013). 35. Giorgi, E. E., Korber, B. T., Perelson, A. S. & Bhattacharya, T. Modeling 64. Perelson, A. S. Modelling viral and immune system dynamics. Nat. Rev. sequence evolution in HIV-1 infection with recombination. J. Theor. Biol. 329, Immunol. 2,28–36 (2002). 82–93 (2013). 65. Delviks-Frankenberry, K. et al. Mechanisms and factors that influence high 36. Liao, H. X. et al. Co-evolution of a broadly neutralizing HIV-1 antibody and frequency retroviral recombination. Viruses 3, 1650–1680 (2011). founder virus. Nature 496, 469–476 (2013). 66. Ramratnam, B. et al. Rapid production and clearance of HIV-1 and hepatitis C 37. Bonsignori, M. et al. Maturation pathway from germline to broad HIV-1 virus assessed by large volume plasma apheresis. Lancet 354, 1782–1785 neutralizer of a CD4-mimic antibody. Cell 165, 449–463 (2016). (1999). 38. Etherington, G. J., Dicks, J. & Roberts, I. N. Recombination Analysis Tool 67. Perelson, A. S., Neumann, A. U., Markowitz, M., Leonard, J. M. & Ho, D. D. (RAT): a program for the high-throughput detection of recombination. HIV-1 dynamics in vivo: virion clearance rate, infected cell life-span, and viral Bioinformatics 21, 278–281 (2005). generation time. Science 271, 1582–1586 (1996). 39. Liu, M. K. et al. Vertical T cell immunodominance and epitope entropy 68. Ganusov, V. V., Neher, R. A. & Perelson, A. S. Mathematical modeling of determine HIV-1 escape. J. Clin. Invest. 123, 380–393 (2013). escape of HIV from cytotoxic T lymphocyte responses. J. Stat. Mech. 2013, 40. Bailey, J. R. et al. Broadly neutralizing antibodies with few somatic mutations P01010 (2013). and hepatitis C virus clearance. JCI Insight 2, e92872 (2017). 69. Brown, L. D., Cai, T. T. & DasGupta, A. Interval estimation for a binomial 41. Santra, S. et al. Human non-neutralizing HIV-1 envelope monoclonal proportion. Stat. Sci. 16, 101–133 (2001). antibodies limit the number of founder viruses during SHIV mucosal infection 70. Wolfram Research Inc. Mathematica, Version 11.2 (Champaign, 2017). in rhesus macaques. PLoS Pathog. 11, e1005042 (2015). 42. Archer, J. et al. Identifying the important HIV-1 recombination breakpoints. Acknowledgements PLoS Comput. Biol. 4, e1000178 (2008). We thank Yi Yang and Sheri Chen for technical assistance; and Jennifer Kirchherr and 43. Baird, H. A. et al. Sequence determinants of breakpoint location during HIV-1 Caroline Cockrell for compilation of clinic data. This work was supported by NIH grant intersubtype recombination. Nucleic Acids Res. 34, 5203–5216 (2006). NIH R01AI087520, NIH grants to the Center for HIV/AIDS Vaccine Immunology 44. Jia, L. et al. Analysis of HIV-1 intersubtype recombination breakpoints (AI067854), the Center for HIV/AIDS Vaccine Immunology and Immunogen Discovery suggests region with high pairing probability may be a more fundamental (AI100645), the HIV/SIV Database and Analysis Unit (AAI 12007-0000-01000), the factor than sequence similarity affecting HIV-1 recombination. Virol. J. 13, American Heart Foundation grant to V.V.G., and with Federal funds from the National 156 (2016). Cancer Institute, National Institutes of Health, under Contract No. 45. Sakuragi, S., Shioda, T. & Sakuragi, J. Properties of human immunodeficiency HHSN261200800001E. The content of this publication does not necessarily reflect the virus type 1 reverse transcriptase recombination upon infection. J. Gen. Virol. views or policies of the Department of Health and Human Services, nor does mention of 96, 3382–3388 (2015). trade names, commercial products, or organizations imply endorsement by the US 46. Li, X. et al. High-frequency illegitimate strand transfers of nascent DNA Government. fragments during reverse transcription result in defective retrovirus genomes. J. Acquir. Immune Defic. Syndr. 72, 353–362 (2016). 47. Bonsignori, M. et al. Staged induction of HIV-1 glycan-dependent broadly Author contributions neutralizing antibodies. Sci. Transl. Med. 9, eaai7514 (2017). F.G., B.K., T.B., H.S., and E.E.G. conceived and designed the study. H.S., F.C., B.H., C.J., 48. Moore, P. L. et al. Multiple pathways of escape from HIV broadly cross- X.L., S.W., H.L., J.F.S., M.G.S., N.G., and B.F.K. performed experiments. E.E.G., T.B., O. neutralizing V2-dependent antibodies. J. Virol. 87, 4882–4894 (2013). C., G.A., H.Y., E.R., and B.K. developed the computational algorithm for recombination 49. Huson, D. H. & Bryant, D. Application of phylogenetic networks in analysis. V.V.G. developed the mathematical model for recombination and viral load evolutionary studies. Mol. Biol. Evol. 23, 254–267 (2006). dynamics analysis. M.S.C. directed the clinical trial and supplied the clinical samples. H. 50. Doria-Rose, N. A. et al. Developmental pathway for potent V1V2-directed S., E.E.G., V.V.G., F.C., B.H., P.H., X.L., J.F.S., N.G., B.F.K., D.C.M., M.S.C., G.M.S., B.H. HIV-neutralizing antibodies. Nature 509,55–62 (2014). H., A.J.M., B.F.H., T.B., B.K., and F.G. analyzed the data. H.S., E.E.G., V.V.G., P.H., N.G., 51. Munshi, S. U. et al. Molecular characterization of hepatitis B virus in A.J.M., G.M.S., B.F.H., T.B., B.K., and F.G. wrote and edited the manuscript. Bangladesh reveals a highly recombinant population. PLoS ONE 12, e0188944 (2017). 52. Harvala, H. et al. Recommendations for enterovirus diagnostics and Additional information characterisation within and beyond Europe. J. Clin. Virol. 101,11–17 (2018). Supplementary Information accompanies this paper at https://doi.org/10.1038/s41467- 53. van Beek, J. et al. Molecular surveillance of norovirus, 2005-16: an 018-04217-5. epidemiological analysis of data collected from the NoroNet network. Lancet Infect. Dis.18, 545–553 (2018). Competing interests: The authors declare no competing interests. 54. Giorgi, E. E. et al. Estimating time since infection in early homogeneous HIV- 1 samples using a Poisson model. BMC Bioinforma. 11, 532 (2010). Reprints and permission information is available online at http://npg.nature.com/ 55. Fiebig, E. W. et al. Dynamics of HIV viremia and antibody seroconversion in reprintsandpermissions/ plasma donors: implications for diagnosis and staging of primary HIV infection. AIDS 17, 1871–1879 (2003). Publisher's note: Springer Nature remains neutral with regard to jurisdictional claims in 56. Salazar-Gonzalez, J. F. et al. Genetic identity, biological phenotype, and published maps and institutional affiliations. evolutionary pathways of transmitted/founder viruses in acute and early HIV- 1 infection. J. Exp. Med. 206, 1273–1289 (2009). 57. Song, H. et al. Transmission of multiple HIV-1 subtype C transmitted/founder viruses into the same recipients was not determined by modest phenotypic Open Access This article is licensed under a Creative Commons differences. Sci. Rep. 6, 38130 (2016). Attribution 4.0 International License, which permits use, sharing, 58. Rose, P. P. & Korber, B. T. Detecting hypermutations in viral sequences with adaptation, distribution and reproduction in any medium or format, as long as you give an emphasis on G--> A hypermutation. Bioinformatics 16, 400–401 (2000). appropriate credit to the original author(s) and the source, provide a link to the Creative 59. Benjamini, Y. H. & Controlling, Y. The false discovery rate: a practical and Commons license, and indicate if changes were made. The images or other third party powerful approach to multiple testing. J. Roy. Stat. Soc. Ser. B (Methodol.) 57, material in this article are included in the article’s Creative Commons license, unless 289–300 (1995). indicated otherwise in a credit line to the material. If material is not included in the 60. R Core Team. R: A Language and Environment for Statistical Computing (R article’s Creative Commons license and your intended use is not permitted by statutory Foundation for Statistical Computing, Vienna, 2013). http://www.R-project. regulation or exceeds the permitted use, you will need to obtain permission directly from org. the copyright holder. To view a copy of this license, visit http://creativecommons.org/ 61. Paradis, E., Claude, J. & Strimmer, K. APE: Analyses of Phylogenetics and licenses/by/4.0/. Evolution in R language. Bioinformatics 20, 289–290 (2004). © The Author(s) 2018 NATURE COMMUNICATIONS (2018) 9:1928 DOI: 10.1038/s41467-018-04217-5 www.nature.com/naturecommunications 15 | | |

Journal

Nature CommunicationsSpringer Journals

Published: May 15, 2018

There are no references for this article.