
Outbreak analytics: a developing data science for informing the response to emerging pathogens

royalsocietypublishing.org/journal/rstb

Review

Jonathan A. Polonsky (1,3,†), Amrish Baidjoe (4,†), Zhian N. Kamvar (4), Anne Cori (4), Kara Durski (2), W. John Edmunds (5,6), Rosalind M. Eggo (5,6), Sebastian Funk (5,6), Laurent Kaiser (3), Patrick Keating (5,8), Olivier le Polain de Waroux (5,8,9), Michael Marks (7), Paula Moraga (10), Oliver Morgan (1), Pierre Nouvellet (4,11), Ruwan Ratnayake (5,6), Chrissy H. Roberts (7), Jimmy Whitworth (5,8) and Thibaut Jombart (4,5,8)

1 Department of Health Emergency Information and Risk Assessment, and 2 Department of Infectious Hazard Management, World Health Organization, Avenue Appia 20, 1211 Geneva, Switzerland
3 Faculty of Medicine, University of Geneva, 1 rue Michel-Servet, 1211 Geneva, Switzerland
4 Department of Infectious Disease Epidemiology, School of Public Health, MRC Centre for Global Infectious Disease Analysis, Imperial College London, Medical School Building, St Mary’s Campus, Norfolk Place London W2 1PG, UK
5 Department of Infectious Disease Epidemiology, 6 Centre for Mathematical Modelling of Infectious Diseases, and 7 Clinical Research Department, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, Keppel St, London WC1E 7HT, UK
8 UK Public Health Rapid Support Team, London School of Hygiene and Tropical Medicine, Keppel St, London WC1E 7HT, UK
9 Public Health England, Wellington House, 133–155 Waterloo Road, London SE1 8UG, UK
10 Centre for Health Informatics, Computing and Statistics (CHICAS), Lancaster Medical School, Lancaster University, Lancaster LA1 4YW, UK
11 School of Life Sciences, University of Sussex, Sussex House, Brighton BN1 9RH, UK

ORCID: JAP, 0000-0002-8634-4255; AB, 0000-0001-5295-5085; ZNK, 0000-0003-1458-7108; AC, 0000-0002-8443-9162; SF, 0000-0002-2842-3406; OM, 0000-0002-9543-3778; PN, 0000-0002-6094-5722; TJ, 0000-0003-2226-8692

Cite this article: Polonsky JA et al. 2019 Outbreak analytics: a developing data science for informing the response to emerging pathogens. Phil. Trans. R. Soc. B 374: 20180276. http://dx.doi.org/10.1098/rstb.2018.0276

Accepted: 4 December 2018

One contribution of 16 to a theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: epidemic forecasting and control’.

Subject Areas: health and disease and epidemiology

Keywords: epidemics, infectious, methods, tools, pipeline, software

Author for correspondence: Thibaut Jombart, e-mail: thibautjombart@gmail.com

Despite continued efforts to improve health systems worldwide, emerging pathogen epidemics remain a major public health concern. Effective response to such outbreaks relies on timely intervention, ideally informed by all available sources of data. The collection, visualization and analysis of outbreak data are becoming increasingly complex, owing to the diversity in types of data, questions and available methods to address them. Recent advances have led to the rise of outbreak analytics, an emerging data science focused on the technological and methodological aspects of the outbreak data pipeline, from collection to analysis, modelling and reporting to inform outbreak response. In this article, we assess the current state of the field. After laying out the context of outbreak response, we critically review the most common analytics components, their inter-dependencies, data requirements and the type of information they can provide to inform operations in real time. We discuss some challenges and opportunities and conclude on the potential role of outbreak analytics for improving our understanding of, and response to outbreaks of emerging pathogens. This article is part of the theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: epidemic forecasting and control’.
This theme issue is linked with the earlier issue ‘Modelling infectious disease outbreaks in humans, animals and plants: approaches and important themes’.

† These authors contributed equally to the study.

© 2019 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

1. Introduction

Emerging infectious diseases are a constant threat to public health worldwide. In the past decade, several major outbreaks, such as the 2009 influenza pandemic [1], the Middle-East Respiratory Syndrome coronavirus (MERS-CoV) [2–4], the emergence of Zika [5,6] and the West African Ebola virus disease (EVD) outbreak [7,8], have been potent reminders of the need for robust surveillance systems and timely responses to nascent epidemics [9]. The West African EVD outbreak, by far the largest such epidemic in recorded history, in particular, had a strong impact on global health security and public health policy and practice [7,8,10]. It highlighted the difficulties of maintaining situational awareness in the absence of standards for surveillance, data collection and analysis, as well as the challenges of mounting and sustaining a large-scale international response [7,8,11,12]. Despite the lessons learnt [9,13,14], the recent (2018) EVD outbreaks in Democratic Republic of the Congo [15,16] are a stark reminder that a large number of these challenges remain.

An important feature of the modern response to epidemics is the increasing focus on exploiting all available data to inform the response in real time and allow evidence-based decision making [3,4,7,8,13,17]. Using data for improving situational awareness is complex, involving a range of inter-connected tasks and skills from point-of-care data collection to the generation of informative situational reports (sitreps). The science underpinning these data pipelines involves a wide range of approaches, including database design and mobile technology [18,19], frequentist statistics and maximum-likelihood estimation [7], interactive data visualization [20,21], geostatistics [22–24], graph theory [20,25,26], Bayesian statistics [8,27,28], mathematical modelling [29–31], genetic analyses [32–36] and evidence synthesis approaches [37]. This accretion of heterogeneous disciplines, which may be best summarized as ‘outbreak analytics’, forms an emerging domain of data science: an ‘interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms’ [38], dedicated to informing outbreak response. Outbreak analytics sits at the crossroads of public health planning, field epidemiology, methodological development and information technologies, opening up exciting opportunities for specialists in these fields to work together to meet the needs for an epidemic response.

In this article, we outline this developing research field and review the current state of outbreak analytics. In particular, we focus on how different analysis components interact within functional workflows, and how each component can be used to inform different stages of an outbreak response. We discuss key challenges and opportunities associated with the deployment of efficient, reliable and informative data analysis pipelines and their potential impact.

2. The outbreak response context

(a) The different phases of an outbreak response

The focus of the public health response shifts during the course of an epidemic or outbreak, and so do the analytics. We identify four main stages (figure 1). The detection stage starts with the first case and ends with the first intervention activities (e.g. patient isolation, contact tracing, vaccination) and involves surveillance systems and mostly qualitative risk assessments. Next, the early response is the initial part of the intervention during which the first simple analytics can take place, essentially centred around estimating transmissibility. This blends into the intervention stage, where more complex analytics may be involved to inform planning (e.g. vaccination strategies), which ends once the last reported case has recovered or died. The post-intervention stage is for lessons to be learned, for improving preparedness for the next epidemic and for training and capacity building [39].

Figure 1. Successive phases of an outbreak response. The histogram along the top represents reported (yellow) and unreported (grey) incidence.

(b) Questions during and after the intervention

During the early response, efforts are dedicated to estimating the likely impact of the outbreak and anticipating the nature, scale and timing of resources needed [7,13,15]. Theoretically, different factors including not only the total number of cases and fatalities but also the morbidity and overall impact on quality of life, as well as societal and economic impact, should ideally be taken into account when attempting to predict disease burden [40–43]. Generally, as the demographic and morbidity data needed by composite measures of health-adjusted life years [40] are lacking in outbreak response contexts, efforts tend to focus on other proxies of impact: assessing transmissibility, predicting future case incidence and associated mortality and investigating risk factors [1,3,7,15].

Analytical needs diversify as the intervention progresses. While investigations of transmissibility, mortality and risk factors remain key throughout [8], new questions may arise to inform the implementation of control and mitigation measures. These may focus on predicting the impact of potential measures including testing (e.g. ‘Could a rapid test help reduce incidence?’ [29]), vaccine development (e.g. ‘Could a candidate vaccine be evaluated in this outbreak?’ [44,45]), vaccination campaigns (e.g. ‘Which is the optimal vaccination strategy?’ [46,47]) or travel restrictions (e.g. ‘Should international travel be restricted?’ [48]), or on estimating the impact of current measures such as improvements in access to care (e.g. ‘Has the delay between symptom onset and hospitalization been reduced?’ [14,15]). As case incidence reduces, statistical modelling can also be useful for assessing or predicting the end of an outbreak [49–51]. At the field operational level, outbreak response analytics may be best focused on informing and monitoring core surveillance activities and performance indicators, such as contact tracing [11], through the use of tools for contact data visualization [52], mapping [53,54] and on analysis pipelines integrating mobile data collection tools [18,19,55,56] with automated reporting systems [57–59]. Finally, the post-intervention phase lends itself to retrospective studies, which can assess further the impact of interventions [60], tease apart finer processes driving the epidemic dynamics such as contact patterns [12,61], study risk factors [54,62], identify avenues for fortifying surveillance [13,36,63] and evaluate, improve and develop modelling techniques [28,64,65].

(c) What are outbreak data?

The term ‘outbreak data’ encompasses different types of information, of which we first distinguish ‘case data’ from ‘background data’. Case data include the description of reported cases gathered in linelists, i.e. flat files where each row is a case and each column a recorded variable (e.g. dates of onset and admission, gender, age, location), thereby fulfilling the definition of ‘tidy data’ in the data science community [66]. Case data also include exposure and contact tracing data, either stored within a linelist or in separate files, pathogen whole genome sequencing (WGS) and data pertaining to outbreak investigations (e.g. case–control and cohort study data). Background data document the underlying characteristics of the affected populations. This includes demographic information (e.g. maps of population densities, age stratification, mixing patterns), movement data (e.g. borders, traveller flows, migration), health infrastructure (e.g. healthcare facilities, drug stockpiles) and epidemiological data themselves (e.g. levels of pre-existing immunity). A final type of data we consider here is ‘intervention data’, which refers to information on decisions made and efforts deployed as part of the intervention, such as vaccination coverage, the extent of active case finding or potential changes in the epidemiological case definition. An in-depth discussion of data needs in outbreaks can be found in Cori et al. [13].

3. Outbreak analytics

(a) An overview of the outbreak analytics toolbox

We use the term ‘outbreak analytics’ to refer to the variety of tools and methods used to collect, curate, visualize, analyse, model and report on outbreak data. These tools and their inter-dependencies are summarized in an exemplary workflow represented in figure 2, derived from analysis pipelines used during recent epidemics of pandemic influenza [1], MERS-CoV [4] and EVD [7,8,17]. Note that workflows may vary substantially in other epidemic contexts. For instance, analyses of food-borne outbreaks may focus on traceback data [67–69], while vector-borne disease analysis may focus heavily on modelling the vector’s ecological niche [70,71].

Figure 2. Example of outbreak analytics workflow. This schematic represents eight general analyses that can be performed from outbreak data. Outputs containing actionable information for the operations are represented as hexagons. Data needed for each analysis are represented as a different colour in the center, using plain and light shading for mandatory and optional data, respectively. (Online version in colour.)

(b) Tools for the collection of epidemiological data

Tools for data capture have become a focus of much discussion in recent years as those involved in outbreak response seek to make use of important technological advances including mobile data collection, cloud computing and built-in automated data analyses and reporting. In resource-limited settings, in particular, epidemiological data are still often collected with pen and paper, the advantages of which are familiarity, simplicity, low cost and reliability where access to Internet and power sources may be limited. However, there are some downsides to using paper as a data management tool, becoming increasingly important with larger outbreaks, as any system for the printing and distribution, collection and storage and digitization of forms becomes overwhelmed. Additionally, two-stage processes involving transcription of data from forms typically introduce additional data entry errors [72–75] and substantial delays from data capture to analysis [72].

Electronic data collection (EDC) is becoming increasingly popular [18,19,55,56]. These tools make use of widely available, low-cost hardware (e.g. smartphones and tablets) [76] that can, when appropriately configured, consume little power and collect data offline, making them suitable for use in resource-poor settings. Some of those may be part of existing surveillance systems or be deployed instead for specific enhanced surveillance and response activities during an outbreak. EDC platforms can also enhance data quality through the use of restriction rules and logical checks, and enforce reporting (even when there are zero cases) and entry of essential variables [72,76]. EDC can decrease the delay between data collection, centralization and analysis, which is critical for data-driven responses. Time can be saved through ‘form logic’ (e.g. automatically skipping sections of a survey not relevant to a participant), while real-time, automated centralization, data analysis and reporting can be directly built into the platform. In addition, mobile-based EDC enables the collection of other types of data including GPS coordinates, photographs and barcodes (useful to link case data and clinical specimens), and can even aid diagnostics by directly interfacing with point-of-care diagnostic devices [77–79].

Maintaining confidentiality and privacy is a legitimate concern whenever data concerning human subjects are collected. While EDC systems provide opportunities for unauthorized interception and access to such information, many systems support end-to-end encryption during data transfer [80], although few provide additional security through encryption at the level of data entry.

(c) Descriptive analyses

The first, and arguably one of the most important steps in data analysis, is exploration, where visualization plays a central role, completed with informative summary statistics [81,82]. The first type of graphics needed for rapid assessment of ongoing dynamics is the epidemic curve (epicurve), which shows case incidence time series as a histogram of new onset dates for a given time interval [83–85]. Cumulative case counts, sometimes used in the absence of a raw linelist, are best avoided in epicurves, as they tend to obscure ongoing dynamics and create statistical dependencies in data points that will result in biases and lead to under-estimating uncertainty in downstream modelling [86].

Maps have been at the core of infectious disease epidemiology from a very early stage [87]. Nowadays, they are typically used to visualize the distribution of disease [88], for representing the ‘ecological niche’ of infectious diseases at large scales [23,24,89] and for assessing the spatial dynamics of an outbreak and strategizing interventions [7,8]. Providers of free and crowd-sourced [90] geographical data like the Humanitarian OpenStreetMap Team (Humanitarian OpenStreetMap Team Home; see https://www.hotosm.org/ (accessed 26 September 2018)), the Missing Maps project (MissingMaps; see https://www.missingmaps.org/ (accessed 26 September 2018)), healthsites.io (see https://healthsites.io/ (accessed 26 September 2018)) and the Radiant Earth Foundation (Radiant Earth Foundation – Earth imagery for impact; see https://www.radiant.earth (accessed 18 November 2018)) provide layers of spatial data that include information on the location of households and health facilities, among other determinants. Several tools including SaTScan and ClusterSeer are routinely applied to surveillance system data for automated outbreak detection and the evaluation of clustering of disease by time and space [91]. Other examples of freely available mapping tools that can help track the spread of infectious diseases include the Spatial Epidemiology of Viral Haemorrhagic Fevers (VHF) disease visualization (see http://www.healthdata.org/datavisualization/spatial-epidemiology-viralhemorrhagic-fevers; accessed 19 September 2018), which maps risks of emergence and spread of VHF diseases, Nextstrain [92] and Microreact [93], which focus on mapping pathogen evolution and epidemic spread, and HealthMap [94], which provides resources for the rapid detection of outbreaks. Geographical locations of reported cases can also be useful for informing more complex modelling approaches [95].

In epidemics driven by person-to-person transmission, a last essential source of data is contact data [20], which includes data on case exposure [12] as well as contact tracing, where appropriate [11,63]. Exposure data document transmission pairs, which can yield precious insights into ‘paired delays’ (figure 2) including the serial interval (time between onsets of a case and their infector) or the generation time (time between the dates of infections of a case and their infector) [7,8], which are in turn useful for estimating transmissibility [27,28,96,97]. Exposure data can also be used to investigate the occurrence and determinants of super-spreading events [12] and help identify introduction events in the case of zoonotic diseases [98]. Contact tracing, through the early detection of new cases and their subsequent isolation and treatment, plays a central role in reducing onward transmission and therefore containing outbreaks [11,63,99], while additionally providing potential information on risk factors [7,11].

Summary statistics are a useful complement to data visualization in the exploratory phase of data analysis. Some metrics, such as transmissibility, require the use of statistical or mathematical models in order to be estimated (see §3d below) and are therefore not readily available as descriptive tools. Other useful statistics can be readily computed from linelists, including different demographic indicators of the reported cases (e.g. gender, age, occupation [7,100,101]), case fatality ratios (the proportion of cases who died of the infection) or case delays such as the times to hospitalization, recovery or death, reported as a whole [1,7,8] or stratified by groups [100,101]. The incubation period (time from infection to symptom onset) is another important delay for informing the intervention (e.g. to define the duration of contact tracing or declare the end of an outbreak), but can be harder to derive as it requires data on case exposure as well. Note that in the case of delays, these are best analysed by characterising the full distribution (e.g. by fitting to an appropriate probability distribution such as discretized Gamma [7]) rather than reported as a single central value [7,8,102,103].

(d) Quantifying transmissibility

The ‘transmissibility’ of a disease is here used to refer to the rate at which new cases arise in the population, resulting either in epidemic growth or decline [1,3,27,28]. Rather than an intrinsic property of a specific disease, transmissibility thus defined quantifies the propagation of a pathogen in a given epidemic setting and is impacted by multiple factors including population demographics, mixing and levels of pre-existing immunity. Importantly, estimates of transmissibility reported in the literature will typically be biased towards higher values, as subcritical outbreaks are by definition less likely to be detected. Several metrics of transmissibility can be used depending on the type of data available and can be estimated using different approaches.

A first measure of transmissibility is the growth rate (r), which is estimated from a simple model where case incidence is either exponentially growing (r > 0) or declining (r < 0). Typically, r is estimated directly from epicurves (figure 2) using a log-linear model, where r is defined as the slope of a linear regression on log-transformed incidence [104,105]. Besides its simplicity and its computational efficiency, this approach has the benefits of being embedded in the linear modelling framework, thereby allowing one to measure the uncertainty associated with a given estimate of r, to test for differences in growth rates, e.g. between different locations, and to derive short-term incidence predictions. Moreover, the growth rate can also be used to estimate the doubling and halving times of the epidemic, i.e. the time during which incidence doubles (respectively is halved), as alternative metrics of transmissibility [103]. Unfortunately, the log-linear model can only fit exponentially growing or decaying outbreaks, which may not always be appropriate in the presence of complex spatial or age structure, or owing to changes in reporting, transmissibility or proportion of susceptible individuals over time. Besides, it cannot readily accommodate time periods with no cases, so that its applicability may in practice be restricted.

While r quantifies the speed at which a disease spreads, it does not contain information on the level of the intervention that is necessary to control a disease [106]. This is better characterized by the reproduction number (here generically noted ‘R’), which measures the average number of secondary cases caused by each primary case. Researchers typically distinguish the basic reproduction number (R0 [104]), which applies in a large, fully susceptible population, without any control measures, from the effective reproduction number (Reff), which is the number of secondary cases after accounting for behavioural changes, interventions and declines in susceptibility [96]. The current reproduction number determines the dynamics of the epidemic in the near future, with values greater than 1 predicting an increase in cases, and values less than 1 predicting control [104]. The value of R can also be used to calculate the fraction of the population that needs to be immunized (typically through vaccination) in order to contain an outbreak [104].

Different methodological approaches have been developed to estimate the reproduction number. R can be approximated using estimates of the growth rate r combined with knowledge of the generation time distribution [97]. R can also be derived from compartmental models [104,107]. The formula will depend on the type of model used, but such estimation will usually require that different rates (e.g. rates of infection, recovery, death) are either known or estimated by fitting the model to data [104,107]. Real-world complexities can be incorporated into this approach; however, fitting such models can be challenging and may require computationally intensive algorithms such as data augmentation, approximate Bayesian computation, or particle filters [108]. Compartmental models also require assumptions about the total population size and the proportion of the population at risk, which may be difficult to estimate in an outbreak. As an alternative, branching process models can be used to estimate R directly from incidence data [27,28,96,109]. This requires a pre-specified distribution of the generation time, or of the serial interval, although recent developments suggest that in some cases, the generation time distribution itself can also be simultaneously estimated [4]. Branching process models are usually much simpler to fit to data than their compartmental counterparts, which facilitates their use in real time [27].

Beyond the mere estimation of transmissibility, it is often essential to forecast future incidence for advocacy and planning purposes, e.g. to compare different interventions and epidemic scenarios [7,8,15,30]. A variety of mathematical and statistical models, including those reviewed here for estimating transmissibility, can also be used for short-term incidence forecasting [65]. Despite the growing body of research focusing on predicting incidence during epidemics [65,110], there are currently no gold standards and the relative performances of forecasting methods largely remain to be assessed. Methods that have been developed and applied in other fields to rigorously assess not just the accuracy of forecasts but also how well models quantify the inherent uncertainty in making predictions, are only rarely applied in infectious disease epidemiology [111,112]. Whether it is to estimate R or predict future incidence, the most appropriate method ultimately depends on the particular epidemiological setting, existing knowledge of the transmission dynamics and data availability. Branching process models, for example, can be used for a quick estimate of the current value of R from the recent trend in case numbers and, by extrapolating this forward, of expected case numbers in the near future [27,28,96]. Mechanistic or simulation models, on the other hand, aim to include a more explicit representation of the different factors that might influence transmission. They can be a more natural choice for assessing the expected impact of possible interventions, but they usually require careful parametrization and often intensive computation [29,30,45,113], both of which can be challenging early in an outbreak when data are scarce and rapid turnaround crucial.

(e) Analytical epidemiological techniques

Analytical epidemiological studies use data to better describe outbreaks and populations at risk and inform real-time and subsequent response efforts. Typically, these are conducted during the intervention and post-intervention phases of an outbreak response (figure 1). They include observational designs such as retrospective cohort and case–control studies to identify risk factors and quantify associations between potential causes and their outcomes (typically, the disease in question), and experimental designs, such as randomized-control studies used to estimate the impact of interventions such as vaccination and treatments [114]. These studies reside outside of the normal scope of outbreak response activities, being inserted ad hoc as functions that are not necessarily routine response activities such as strengthening surveillance. In the case of observational epidemiological studies, data on exposures and outcomes are required, permitting estimations of the increased risk of disease among people exposed to risk factors of interest [54,62,115,116]. In the case of experimental epidemiology, data on outcomes of interest are collected to permit estimations of heterogeneity among groups (e.g. in the presence/absence of intervention).

The usefulness of such studies in informing outbreak response is highly context-dependent. Observational studies may be undertaken early on in the intervention phase to help identify ongoing infection sources of environmental, food-borne or water-borne nature [117] and to stop the outbreak at its source. In longer-running outbreaks, they can provide insights into opportunities for control [53,115,118] and inform global policy decisions that relate to outbreak response [119]. However, the time and expertise needed to prepare and implement these studies may preclude their application in the midst of an ongoing outbreak, so that the cost and benefits of such an undertaking need to be carefully weighed in emergency settings.

(f) Genetic analyses

Whole genome sequencing of pathogens is increasingly affordable and reliable, and therefore more frequent in outbreak investigations [1,120–126]. As technology is making real-time sequencing in the field a developing standard in the coming years [127,128], genetic analysis will likely carve out its own space in the outbreak analytics toolkit.

Different methods can be used to extract information from pathogen WGS. In bacterial genomics, molecular epidemiology methods have been used extensively for defining strains of related isolates [32,129], which can be used to infer various features of the pathogens sampled such as their origins, antimicrobial resistance profiles, virulence or antigenic characteristics [130–132]. These methods usually exploit only a fraction of the information contained within pathogens’ genomes, as they rely on genetic variation in a limited number of housekeeping genes [32,129]. While these methods will likely remain useful in years to come, substantially more information can be extracted by using WGS to reconstruct phylogenetic trees, which represent the evolutionary history of the sampled isolates, assuming the absence of selection or horizontal gene transfers [133]. Different types of phylogenetic reconstruction methods can be used, including fast, scalable distance-based methods [134] or more computer-intensive approaches using a maximum-likelihood [135,136] or Bayesian framework [33,137]. Phylogenies can be used to assess the origins of a set of pathogens [138], patterns of geographical spread [125], host species jumps [139,140], past fluctuations in the pathogen population sizes [141] and even, in some cases, the reproduction number [1]. Importantly, there is a growing tendency to analyse phylogenetic trees in the broader context of other epidemiological data (mainly geographical locations until now), which is facilitated by user-friendly Web applications [92,93].

A further step towards integrating WGS alongside epidemiological data is the reconstruction of transmission trees (who infects whom) using evidence synthesis approaches. This methodological field has been growing fast over the past decade [25,142–148], but most applications of these methods remain within academia and their usefulness in the field in an outbreak response context needs to be critically assessed. A potential benefit of accurately reconstructing transmission trees lies in the identification of multiple introductions, the quantification of the proportion of unreported cases and the detection of heterogeneities in individual transmissibility [145]. Unfortunately, the reconstruction of transmission trees is a difficult and computationally intensive problem. First, most diseases do not accumulate sufficient genetic diversity during the course of an outbreak to allow the accurate reconstruction of transmission chains, so that multiple data sources need to be used [35], making these methods more data-demanding than most other approaches in outbreak analytics (figure 2). In addition, the complex nature of the problem requires the use of Bayesian methods for model fitting, making these approaches difficult to interpret by non-experts [145,146,148].

4. Discussion

In this article, we reviewed methodological and technological resources forming the basis of outbreak analytics, an emerging data science for informing outbreak response. Outbreak analytics is embedded within a broader public health information context that starts with disease surveillance systems, followed by risk assessment and management, the epidemiological response itself, and finishes with the production of actionable information for decision making. Part of the challenge that this new field will face in the coming years pertains to the seamless integration of data analytics pipelines within existing workflows. As responders can allocate only limited time to data analysis, analytics resources should produce simple, interpretable results, highlighting the most pressing issues that need addressing and monitoring all relevant indicators to inform the response.

Outbreak analytics and resulting outputs are central to the surveillance pillar of any outbreak response, yet resources and capacities to ensure data availability and quality are often limited owing to operational constraints [16]. Priorities in terms of data needs should be defined by what actionable information it may give access to through the available analytics pipelines [13]. In this respect, we foresee that typical linelist data such as dates of events (e.g. onset, reporting, hospitalization, discharge), age, gender, disease outcome, geographical locations and exposure data will fulfil most needs, while other data such as WGS may only be useful for specific diseases and contexts [34,35].

tools for data analysis and reporting, and an increasing number of packages for infectious disease epidemiology [20,21,27,84,145] may form a solid starting point for the development of a comprehensive, robust and transparent toolkit for the analysis of epidemic data [151]. Importantly, the use of a common platform for the development and use of outbreak analytics tools will also likely contribute to standardizing data practices, including collection, sharing and analysis.

A final point relates to the use and dissemination of these new resources: how can outbreak analytics best help improve public health? As noted by Bausch & Clougherty [39], health science should not be an entity unto itself, but a means to an end. Insofar as it can help field epidemiologists collect, visualize and analyse data, and subsequently provide decision-makers with actionable information, outbreak analytics will likely occupy an increasing space in field epidemiology over the years to come. We foresee that the dissemination of free train-
Intervention data are rarely collected but should ing resources [152], the modernization of field epidemiology be given more consideration, as they are key to assessing the training programmes and the deployment of applied data impact and effectiveness of control measures, both during scientists to the field with a sustained capacity building in and after the operations. Similarly, data on the fraction of resource-poor and vulnerable countries will be instrumental cases reported (and its variations through time), as well as be- in shaping the future of this emerging field of health science. havioural changes (e.g. care-seeking behaviour) in the affected Data accessibility. This article has no additional data. populations, can be very important sources of information for Authors’ contributions. T.J. drafted the outline of the review and revised modelling [149]. the manuscript. J.A.P., A.B., T.J. wrote the first draft of the manu- Fortunately, what we called ‘background data’ in this script. Z.N.K. produced the figures. A.C., W.J.E., R.M.E., S.F., L.K., article can be gathered and shared outside of the epidemic con- P.K., M.M., P.M., P.N., O.P.W., R.R., J.W. contributed the content. text. Besides maps, population census, sero-surveys or genetic Competing interests. We declare we have no competing interests. databanks, data on the natural histories of diseases derived Funding. This paper was supported with funding from the Global Chal- from past epidemics, such as key delay distributions and trans- lenges Research Fund (GCRF) for the project ‘RECAP—research capacity building and knowledge generation to support preparedness missibility, can form a useful substitute to real-time estimates, and response to humanitarian crises and epidemics’ managed through especially in the early stages of outbreaks when such data may RCUK and ESRC (ES/P010873/1). P.K., O.L.P., J.W. and T.J. receive be lacking. 
While crowd-sourced initiatives are promising and support from the UK Public Health Rapid Support Team, which is have been used successfully in low resource settings [90], more funded by the United Kingdom Department of Health and Social efforts are needed to collate and curate open data sources, Care. We acknowledge the National Institute for Health Research— Health Protection Research Unit for Modelling Methodology (T.J.) assess their quality and make them widely available to the for funding. M.M., C.H.R., receive funding through the National Insti- community. We argue that international public health agencies tute for Health Research (PR-OD-1017-20001). R.M.E. acknowledges and non-governmental agencies should play a central role in funding from an HDR UK Innovation Fellowship (grant no. MR/ orchestrating such background data preparedness. S003975/1). A.C. thanks the Medical Research Council for funding. Outbreak analytics is a developing field, and as such, there S.F. was supported by the Wellcome Trust (210758/Z/18/Z). The authors alone are responsible for the views expressed in this article remain many gaps in terms of data collection, analysis and and they do not necessarily represent the views, decisions or policies reporting tools. Some methodological challenges persist, such of the institutions with which they are affiliated. as better characterising forecasting methods [28,64,65], includ- Acknowledgements. We would like to thank Annick Lenglet and Isidro ing spatial information and population flows into existing Carrion-Martin, Epidemiologists at Medecins Sans Frontieres (MSF, transmission models [95], and improving the integration of Operational Centre Amsterdam) for their additional reflections. The different types of data for reconstructing transmission trees views expressed in this publication are those of the authors and not necessarily those of the National Health Service, the National [35]. 
In order to ensure transparent methods and availability Institute for Health Research or the Department of Health and to analysts in any setting, the implementation must be as Social Care. The authors alone are responsible for the views freely available, open-source software. Among other popular expressed in this article and they do not necessarily represent the programming languages, such as Python, Java, or Julia, the R views, decisions or policies of the institutions with which they are software [150] arguably offers the largest collection of free affiliated. References 1. Fraser C et al. 2009 Pandemic potential of a strain 3. Cauchemez S, Fraser C, Van Kerkhove MD, Donnelly 4. Cauchemez S et al. 2016 Unraveling the drivers of influenza A (H1N1): early findings. Science 324, CA, Riley S, Rambaut A, Enouf V, van der Werf S, of MERS-CoV transmission. Proc. Natl Acad. 1557 – 1561. (doi:10.1126/science.1176062) Ferguson NM. 2014 Middle East respiratory Sci. USA 113, 9081 – 9086. (doi:10.1073/pnas. 2. Assiri A et al. 2013 Hospital outbreak of Middle syndrome coronavirus: quantification of the extent 1519235113) East respiratory syndrome coronavirus. of the epidemic, surveillance biases, and 5. Campos GS, Bandeira AC, Sardi SI. 2015 Zika virus N. Engl. J. Med. 369, 407 – 416. (doi:10.1056/ transmissibility. Lancet Infect. Dis. 14, 50 – 56. outbreak, Bahia, Brazil. Emerg. Infect. Dis. 21, NEJMoa1306742) (doi:10.1016/S1473-3099(13)70304-9) 1885 – 1886. (doi:10.3201/eid2110.150847) royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 374: 20180276 6. European Centre for Disease Prevention and Control. In Proceedings of the 14th Workshop on Mobile evolutionary analysis. PLoS Comput. Biol. 10, 2015 Zika virus epidemic in the Americas: potential Computing Systems and Applications, pp. e1003537. (doi:10.1371/journal.pcbi.1003537) association with microcephaly and Guillain-Barre´ 10:1 – 10:6. New York, NY: ACM. 34. Holmes EC, Rambaut A, Andersen KG. 
2018 syndrome (first update), 21 January 2016. (See 20. Nagraj VP, Randhawa N, Campbell F, Crellen T, Pandemics: spend on surveillance, not prediction. https://ecdc.europa.eu/sites/portal/files/media/en/ Sudre B, Jombart T. 2018 epicontacts: Handling, Nature 558, 180 – 182. (doi:10.1038/d41586-018- publications/Publications/zika-virus-americas- visualisation and analysis of epidemiological 05373-w) association-with-microcephaly-rapid-risk- contacts. F1000Res. 7, 566. (doi:10.12688/ 35. Campbell F, Strang C, Ferguson N, Cori A, Jombart T. assessment.pdf ). f1000research.14492.1) 2018 When are pathogen genome sequences 7. WHO Ebola Response Team. 2014 Ebola virus 21. Moraga P, Dorigatti I, Kamvar ZN, Piatkowski P, informative of transmission events? PLoS disease in West Africa – the first 9 months Toikkanen SE, Nagraj VP, Donnelly CA, Jombart T. Pathog. 14, e1006885. (doi:10.1371/journal. of the epidemic and forward projections. 2018 epiflows: an R package for risk assessment of ppat.1006885) N. Engl. J. Med. 371, 1481 – 1495. (doi:10.1056/ travel-related spread of disease. F1000Res. 7, 1374. 36. Carroll D, Daszak P, Wolfe ND, Gao GF, Morel CM, NEJMoa1411100) (doi:10.12688/f1000research.16032.1) Morzaria S, Pablos-Me´ndez A, Tomori O, Mazet JAK. 8. WHO Ebola Response Team et al. 2015 West African 22. Pigott DM et al. 2017 Local, national, and regional 2018 The Global Virome Project. Science 359, Ebola epidemic after one year – slowing but not viral haemorrhagic fever pandemic potential in 872 – 874. (doi:10.1126/science.aap7463) yet under control. N. Engl. J. Med. 372, 584 – 587. Africa: a multistage analysis. Lancet 390, 37. Birrell PJ, De Angelis D, Presanis AM. 2018 Evidence (doi:10.1056/NEJMc1414992) 2662 – 2672. (doi:10.1016/S0140-6736(17)32092-5) synthesis for stochastic epidemic models. Stat. Sci. 9. Moon S et al. 2015 Will Ebola change the game? 23. Messina JP et al. 2016 Mapping global 33, 34 – 43. 
(doi:10.1214/17-STS631) Ten essential reforms before the next pandemic. The environmental suitability for Zika virus. Elife 5, 38. Wikipedia contributors. 2018 Data science. report of the Harvard-LSHTM independent panel on e15272. (doi:10.7554/eLife.15272) Wikipedia, The Free Encyclopedia. See https://en. the global response to Ebola. Lancet 386, 24. Pigott DM et al. 2014 Mapping the zoonotic niche wikipedia.org/w/index.php?title=Data_ 2204 – 2221. (doi:10.1016/S0140-6736(15)00946-0) of Ebola virus disease in Africa. Elife 3, e04395. science&oldid=868658447 (accessed on 16 10. Van Kerkhove MD, Bento AI, Mills HL, Ferguson NM, (doi:10.7554/eLife.04395) November 2018). Donnelly CA. 2015 A review of epidemiological 25. Jombart T, Eggo RM, Dodd PJ, Balloux F. 2011 39. Bausch DG, Clougherty MM. 2015 Ebola virus: parameters from Ebola outbreaks to inform early Reconstructing disease outbreaks from genetic data: sensationalism, science, and human rights. public health decision-making. Sci Data 2, 150019. a graph approach. Heredity 106, 383 – 390. (doi:10. J. Infect. Dis. 212(Suppl. 2), S79 – S83. (doi:10. (doi:10.1038/sdata.2015.19) 1038/hdy.2010.78) 1093/infdis/jiv359) 11. Senga M et al. 2017 Contact tracing performance 26. Famulare M, Hu H. 2015 Extracting transmission 40. Kwong JC et al. 2012 The impact of infection on during the Ebola virus disease outbreak in Kenema networks from phylogeographic data for epidemic population health: results of the Ontario burden of district, Sierra Leone. Phil. Trans. R. Soc. B 372, and endemic diseases: Ebola virus in Sierra Leone, infectious diseases study. PLoS ONE 7, e44103. 20160300. (doi:10.1098/rstb.2016.0300) 2009 H1N1 pandemic influenza and polio in (doi:10.1371/journal.pone.0044103) 12. International Ebola Response Team. 2016 Exposure Nigeria. Int. Health 7, 130 – 138. (doi:10.1093/ 41. Vos T et al. 
2012 Years lived with disability (YLDs) patterns driving Ebola transmission in West Africa: a inthealth/ihv012) for 1160 sequelae of 289 diseases and injuries retrospective observational study. PLoS Med. 13, 27. Cori A, Ferguson NM, Fraser C, Cauchemez S. 2013 1990 – 2010: a systematic analysis for the Global e1002170. (doi:10.1371/journal.pmed.1002170) A new framework and software to estimate time- Burden of Disease Study 2010. Lancet 380, 13. Cori A et al. 2017 Key data for outbreak evaluation: varying reproduction numbers during epidemics. 2163 – 2196. (doi:10.1016/S0140-6736(12)61729-2) building on the Ebola experience. Phil. Trans. R. Soc. B Am. J. Epidemiol. 178, 1505 – 1512. (doi:10.1093/ 42. Global Burden of Disease Study 2013 Collaborators. 372, 20160371. (doi:10.1098/rstb.2016.0371) aje/kwt133) 2015 Global, regional, and national incidence, 14. Lewnard JA. 2018 Ebola virus disease: 11 323 28. Nouvellet P et al. 2017 A simple approach to prevalence, and years lived with disability for 301 deaths later, how far have we come? Lancet 392, measure transmissibility and forecast incidence. acute and chronic diseases and injuries in 188 189 – 190. (doi:10.1016/S0140-6736(18)31443-0) Epidemics 22, 29 – 35. (doi:10.1016/j.epidem.2017. countries, 1990 – 2013: a systematic analysis for the 15. Ebola Outbreak Epidemiology Team. 2018 Outbreak 02.012) Global Burden of Disease Study 2013. Lancet 386, of Ebola virus disease in the Democratic Republic of 29. Nouvellet P et al. 2015 The role of rapid diagnostics 743 – 800. (doi:10.1016/S0140-6736(15)60692-4) the Congo, April – May, 2018: an epidemiological in managing Ebola epidemics. Nature 528, 43. Pru¨ss-Ustu¨n A et al. 2003 Introduction and methods: study. Lancet 392, 213 – 221. (doi:10.1016/S0140- S109 – S116. (doi:10.1038/nature16041) assessing the environmental burden of disease at 6736(18)31387-4) 30. Finger F, Funk S, White K, Siddiqui R, John national and local levels. WHO Environmental 16. Polonsky J et al. 
2019 Lessons learnt from Ebola Edmunds W, Kucharski AJ. 2018 Real-time analysis Burden of Disease Series, No. 1. Geneva, virus disease surveillance in Equateur Province, of the diphtheria outbreak in forcibly displaced Switzerland: World Health Organization. May – July 2018. Weekly Epidemiological Record 94, Myanmar nationals in Bangladesh. bioRxiv. 388645. 44. Camacho A, Eggo RM, Funk S, Watson CH, Kucharski 23 – 27. (doi:10.1101/388645) AJ, Edmunds WJ. 2015 Estimating the probability of 17. 2017 WHO j Ebola outbreak Democratic Republic of 31. Bausch DG, Edmunds J. 2018 Real-time modeling demonstrating vaccine efficacy in the declining the Congo 2017. See https://www.who.int/ should be routinely integrated into outbreak Ebola epidemic: a Bayesian modelling approach. emergencies/ebola-DRC-2017/en/. response. Am. J. Trop. Med. Hyg. 98, 1214 – 1215. BMJ Open 5, e009346. (doi:10.1136/bmjopen-2015- 18. Hartung C, Lerer A, Anokwa Y, Tseng C, Brunette W, (doi:10.4269/ajtmh.18-0150) 009346) Borriello G. 2010 Open Data Kit: Tools to build 32. Feil EJ, Li BC, Aanensen DM, Hanage WP, Spratt BG. 45. Camacho A et al. 2017 Real-time dynamic information services for developing regions. In 2004 eBURST: inferring patterns of evolutionary modelling for the design of a cluster-randomized Proceedings of the 4th ACM/IEEE International descent among clusters of related bacterial phase 3 Ebola vaccine trial in Sierra Leone. Vaccine Conference on Information and Communication genotypes from multilocus sequence typing data. 35, 544 – 551. (doi:10.1016/j.vaccine.2016.12.019) Technologies and Development, pp. 18:1 – 18:12. J. Bacteriol. 186, 1518 – 1530. (doi:10.1128/JB.186. 46. Garske T, Van Kerkhove MD, Yactayo S, Ronveaux O, New York, NY: ACM. 5.1518-1530.2004) Lewis RF, Staples JE, Perea W, Ferguson NM, Yellow 19. Brunette W, Sundt M, Dell N, Chaudhri R, Breit N, 33. Bouckaert R, Heled J, Ku¨hnert D, Vaughan T, Wu C- Fever Expert Committee. 2014 Yellow fever in Borriello G. 
2013 Open Data Kit 2.0: Expanding and H, Xie D, Suchard MA, Rambaut A, Drummond AJ. Africa: estimating the burden of disease and impact refining information services for developing regions. 2014 BEAST 2: a software platform for Bayesian of mass vaccination from outbreak and serological royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 374: 20180276 data. PLoS Med. 11, e1001638. (doi:10.1371/ 60. Cauchemez S, Valleron A-J, Boe¨lle P-Y, Flahault A, collection tools in the Burden of Obstructive Lung journal.pmed.1001638) Ferguson NM. 2008 Estimating the impact of school Disease (BOLD) study in Gezira state, Sudan. PLoS 47. Kraemer MUG et al. 2017 Spread of yellow fever closure on influenza transmission from Sentinel One 13, e0193917. (doi:10.1371/journal.pone. virus outbreak in Angola and the Democratic data. Nature 452, 750 – 754. (doi:10.1038/ 0193917) Republic of the Congo 2015 – 16: a modelling study. nature06732) 73. Solomon AW et al. 2018 Quality assurance and Lancet Infect. Dis. 17, 330 – 338. (doi:10.1016/ 61. Cauchemez S, Bhattarai A, Marchbanks TL, Fagan quality control in the global trachoma mapping S1473-3099(16)30513-8) RP, Ostroff S, Ferguson NM, Swerdlow D, project. Am. J. Trop. Med. Hyg. 99, 858 – 863. 48. Dorigatti I, Hamlet A, Aguas R, Cattarino L, Cori A, Pennsylvania H1N1 working group. 2011 Role of (doi:10.4269/ajtmh.18-0082) Donnelly CA, Garske T, Imai N, Ferguson NM. 2017 social networks in shaping disease transmission 74. King JD et al. 2013 A novel electronic data collection International risk of yellow fever spread from the during a community outbreak of 2009 H1N1 system for large-scale surveys of neglected tropical ongoing outbreak in Brazil, December 2016 to May pandemic influenza. Proc. Natl Acad. Sci. USA 108, diseases. PLoS One 8, e74570. (doi:10.1371/journal. 2017. Euro Surveill. 22, 30572. (doi:10.2807/1560- 2825 – 2830. (doi:10.1073/pnas.1008895108) pone.0074570) 7917.ES.2017.22.28.30572) 62. 
Gignoux E, Polonsky J, Ciglenecki I, Bichet M, 75. Njuguna HN et al. 2014 A comparison of 49. Brookmeyer R, You X. 2006 A hypothesis test for the Coldiron M, Thuambe Lwiyo E, Akonda I, Serafini M, smartphones to paper-based questionnaires for end of a common source outbreak. Biometrics 62, Porten K. 2018 Risk factors for measles mortality routine influenza sentinel surveillance, Kenya, 61 – 65. (doi:10.1111/j.1541-0420.2005.00421.x) and the importance of decentralized case 2011 – 2012. BMC Med. Inform. Decis. Mak. 14, 107. 50. Nishiura H, Miyamatsu Y, Chowell G, Saitoh M. 2015 management during an unusually large measles (doi:10.1186/s12911-014-0107-5) Assessing the risk of observing multiple generations epidemic in eastern Democratic Republic of Congo 76. Poushter J. 2016 Smartphone ownership and of Middle East respiratory syndrome (MERS) cases in 2013. PLoS One 13, e0194276. (doi:10.1371/ internet usage continues to climb in emerging given an imported case. Euro Surveill. 20, 21181. journal.pone.0194276) economies. Pew Res. Center 22, 1 – 44. (doi:10.2807/1560-7917.ES2015.20.27.21181) 63. Saurabh S, Prateek S. 2017 Role of contact tracing in 77. Bogoch II, Koydemir HC, Tseng D, Ephraim RKD, 51. Nishiura H, Miyamatsu Y, Mizumoto K. 2016 containing the 2014 Ebola outbreak: a review. Afr. Duah E, Tee J, Andrews JR, Ozcan A. 2017 Objective determination of end of MERS outbreak, Health Sci. 17, 225 – 236. (doi:10.4314/ahs.v17i1.28) Evaluation of a mobile phone-based microscope for South Korea, 2015. Emerg. Infect. Dis. 22, 146 – 148. 64. Funk S, Camacho A, Kucharski AJ, Eggo RM, screening of Schistosoma haematobium infection in (doi:10.3201/eid2201.151383) Edmunds WJ. 2018 Real-time forecasting of rural Ghana. Am. J. Trop. Med. Hyg. 96, 52. Fa¨hnrich C et al. 2015 Surveillance and outbreak infectious disease dynamics with a stochastic semi- 1468 – 1471. (doi:10.4269/ajtmh.16-0912) response management system (SORMAS) to support mechanistic model. Epidemics 22, 56 – 61. 
(doi:10. 78. Ku¨hnemund M et al. 2017 Targeted DNA the control of the Ebola virus disease outbreak in 1016/j.epidem.2016.11.003) sequencing and in situ mutation analysis using West Africa. Euro Surveill. 20, 21071. (doi:10.2807/ 65. Viboud C et al. 2018 The RAPIDD Ebola forecasting mobile phone microscopy. Nat. Commun. 8, 13913. 1560-7917.ES2015.20.12.21071) challenge: synthesis and lessons learnt. Epidemics (doi:10.1038/ncomms13913) 53. Polonsky JA, Martı´nez-Pino I, Nackers F, Chonzi P, 22, 13 – 21. (doi:10.1016/j.epidem.2017.08.002) 79. Quesada-Gonza´lez D, Merkoc¸i A. 2017 Mobile Manangazira P, Van Herp M, Maes P, Porten K, 66. Wickham H. 2014 Tidy data. J. Stat. Softw. 59, phone-based biosensing: an emerging ‘diagnostic Luquero FJ. 2014 Descriptive epidemiology of 1 – 23. (doi:10.18637/jss.v059.i10) and communication’ technology. Biosens. typhoid fever during an epidemic in Harare, 67. Dallman T et al. 2016 Phylogenetic structure of Bioelectron. 92, 549 – 562. (doi:10.1016/j.bios. Zimbabwe, 2012. PLoS One 9, e114702. (doi:10. European Salmonella enteritidis outbreak correlates 2016.10.062) 1371/journal.pone.0114702) with national and international egg distribution 80. Macharia P, Dunbar MD, Sambai B, Abuna F, 54. Page A-L et al. 2015 Geographic distribution and network. Microb Genom 2, e000070. (doi:10.1099/ Betz B, Njoroge A, Bukusi D, Cherutich P, mortality risk factors during the cholera outbreak in mgen.0.000070) Farquhar C. 2015 Enhancing data security in open a rural region of Haiti, 2010 – 2011. PLoS Neglect. 68. Jenkins C et al. 2015 Public health investigation of data kit as an mHealth application. In 2015 Trop. Dis. 9, e0003605. (doi:10.1371/journal.pntd. two outbreaks of shiga toxin-producing Escherichia International Conference on Computing, 0003605) coli O157 associated with consumption of Communication and Security (ICCCS), 55. Aanensen DM, Huntley DM, Feil EJ, al-Own F, Spratt watercress. Appl. Environ. Microbiol. 
81, Pamplemousses, Mauritius, 4 – 5 December 2015. BG. 2009 EpiCollect: linking smartphones to web 3946 – 3952. (doi:10.1128/AEM.04188-14) (doi:10.1109/cccs.2015.7374205) applications for epidemiology, ecology and 69. Inns T et al. 2015 A multi-country Salmonella 81. Crawley MJ. 2012 The R book. Hoboken, NJ: John community data collection. PLoS One 4, e6968. enteritidis phage type 14b outbreak associated with Wiley & Sons. (doi:10.1371/journal.pone.0006968) eggs from a German producer: ‘near real-time’ 82. Wickham H. 2016 Ggplot2: elegant graphics for data 56. Tom-Aba D et al. 2015 Innovative technological application of whole genome sequencing and food analysis. Berlin, Germany: Springer. approach to Ebola virus disease outbreak response chain investigations, United Kingdom, May to 83. Ho¨hle M. 2007 surveillance: An R package for the in Nigeria using the open data kit and form hub September 2014. Eurosurveillance 20, 21098. monitoring of infectious diseases. Comput. Stat. 22, technology. PLoS One 10, e0131000. (doi:10.1371/ (doi:10.2807/1560-7917.ES2015.20.16.21098) 571 – 582. (doi:10.1007/s00180-007-0074-8) journal.pone.0131000) 70. Bousema T et al. 2016 The impact of hotspot- 84. Jombart T et al. 2014 OutbreakTools: a new 57. Xie Y, Allaire JJ, Grolemund G. 2018 R markdown: targeted interventions on malaria transmission in platform for disease outbreak analysis using the R The definitive guide. Boca Raton, FL: Chapman and Rachuonyo South District in the Western Kenyan software. Epidemics 7, 28 – 34. (doi:10.1016/j. Hall/CRC. https://bookdown.org/yihui/rmarkdown/. highlands: a cluster-randomized controlled trial. epidem.2014.04.003) 58. Xie Y. 2016 Bookdown: authoring books and PLoS Med. 13, e1001993. (doi:10.1371/journal. 85. Jombart T, Kamvar ZN, FitzJohn R. 2018 Incidence: technical documents with R markdown. Boca Raton, pmed.1001993) compute, handle, plot and model incidence of FL: CRC Press. 71. Baidjoe AY et al. 
2016 Factors associated with high dated events. R package version 1.5.4. https:// 59. Karo B, Haskew C, Khan AS, Polonsky JA, Mazhar heterogeneity of malaria at fine spatial scale in the CRAN.R-project.org/package¼incidence. MKA, Buddha N. 2018 World Health Organization Western Kenyan highlands. Malar. J. 15, 307. 86. King AA, Domenech de Celle`s M, Magpantay FMG, early warning, alert, and response system in the (doi:10.1186/s12936-016-1362-y) Rohani P. 2015 Avoidable errors in the modelling of Rohingya Crisis, Bangladesh, 2017 – 2018. Emerg. 72. Ahmed R, Robinson R, Elsony A, Thomson R, Squire outbreaks of emerging pathogens, with special Infect. Dis. 24, 2074 – 2076. (doi:10.3201/ SB, Malmborg R, Burney P, Mortimer K. 2018 A reference to Ebola. Proc. R. Soc. B 282, 20150347. eid2411.181237) comparison of smartphone and paper data- (doi:10.1098/rspb.2015.0347) royalsocietypublishing.org/journal/rstb Phil. Trans. R. Soc. B 374: 20180276 87. Snow J. 1855 On the mode of communication of 361, 1761 – 1766. (doi:10.1016/S0140- randomised controlled trial design to evaluate cholera. London, UK: John Churchill. 6736(03)13410-1) vaccine efficacy and effectiveness during outbreaks, 88. Wertheim HFL, Horby P, Woodall JP. 2012 Atlas of 103. Anderson RM, Fraser C, Ghani AC, Donnelly CA, Riley with special reference to Ebola. BMJ 351, h3740. human infectious diseases. Hoboken, NJ: John Wiley S, Ferguson NM, Leung GM, Lam TH, Hedley AJ. (doi:10.1136/bmj.h3740) & Sons. 2004 Epidemiology, transmission dynamics and 119. Grais RF, Conlan AJK, Ferrari MJ, Djibo A, Le Menach 89. Nunes MRT et al. 2015 Emergence and potential for control of SARS: the 2002 – 2003 epidemic. Phil. A, Bjørnstad ON, Grenfell BT. 2008 Time is of the spread of Chikungunya virus in Brazil. BMC Med. 13, Trans. R. Soc. Lond. B 359, 1091 – 1105. (doi:10. essence: exploring a measles outbreak response 102. (doi:10.1186/s12916-015-0348-x) 1098/rstb.2004.1490) vaccination in Niamey, Niger. J. R. Soc. Interface 5, 90. 
In press. Radiant Earth Foundation – Earth imagery 104. Anderson RM, May RM. 1991 Infectious diseases of 67 – 74. (doi:10.1098/rsif.2007.1038) for impact. See https://www.radiant.earth (accessed humans, vol. 1. Oxford, UK: Oxford University Press. 120. Harris SR et al. 2013 Whole-genome sequencing for on 18 November 2018). 105. Farrington CP, Andrews NJ, Beale AD, Catchpole MA. analysis of an outbreak of meticillin-resistant 91. In press. Spatial epidemiology of Viral Hemorrhagic 1996 A statistical algorithm for the early detection Staphylococcus aureus: a descriptive study. Lancet Fevers. See http://www.healthdata.org/data- of outbreaks of infectious disease. J. R. Stat. Soc. Infect. Dis. 13, 130 – 136. (doi:10.1016/S1473- visualization/spatial-epidemiology-viral- Ser. A Stat. Soc. 159, 547 – 563. (doi:10.2307/ 3099(12)70268-2) hemorrhagic-fevers (accessed on 19 September 2983331) 121. Gire SK et al. 2014 Genomic surveillance elucidates 2018). 106. Park SW, Champredon D, Weitz J, Dushoff J. 2018 A Ebola virus origin and transmission during the 2014 92. Hadfield J, Megill C, Bell SM, Huddleston J, Potter practical generation interval-based approach to outbreak. Science 345, 1369 – 1372. (doi:10.1126/ B, Callender C, Sagulenko P, Bedford T, Neher RA. inferring the strength of epidemics from their science.1259657) 2018 Nextstrain: real-time tracking of pathogen speed. bioRxiv. 312397. (doi:10.1101/312397) 122. Cotten M et al. 2013 Transmission and evolution of evolution. Bioinformatics 34, 4121 – 4123. (doi:10. 107. Keeling M, Rohani P. 2008 Modeling infectious the Middle East respiratory syndrome coronavirus in 1093/bioinformatics/bty407) diseases in humans and animals. Clin. Infect. Dis. Saudi Arabia: a descriptive genomic study. Lancet 93. Argimo´nS et al. 2016 Microreact: visualizing and 47, 864 – 866. (doi:10.1086/591197) 382, 1993 – 2002. (doi:10.1016/S0140-6736(13) sharing data for genomic epidemiology and 108. McKinley T, Cook AR, Deardon R. 
phylogeography. Microb Genom 2, e000093. (doi:10.1099/mgen.0.000093)
94. Freifeld CC, Mandl KD, Reis BY, Brownstein JS. 2008 HealthMap: global infectious disease monitoring through automated classification and visualization of Internet media reports. J. Am. Med. Inform. Assoc. 15, 150–157. (doi:10.1197/jamia.M2544)
95. Backer JA, Wallinga J. 2016 Spatiotemporal analysis of the 2014 Ebola epidemic in West Africa. PLoS Comput. Biol. 12, e1005210. (doi:10.1371/journal.pcbi.1005210)
96. Wallinga J, Teunis P. 2004 Different epidemic curves for severe acute respiratory syndrome reveal similar impacts of control measures. Am. J. Epidemiol. 160, 509–516. (doi:10.1093/aje/kwh255)
97. Wallinga J, Lipsitch M. 2007 How generation intervals shape the relationship between growth rates and reproductive numbers. Proc. R. Soc. B 274, 599–604. (doi:10.1098/rspb.2006.3754)
98. Cauchemez S, Van Kerkhove MD, Riley S, Donnelly CA, Fraser C, Ferguson NM. 2013 Transmission scenarios for Middle East Respiratory Syndrome Coronavirus (MERS-CoV) and how to tell them apart. Euro Surveill. 18, 20503.
99. Shrivastava SR, Shrivastava PS, Ramasamy J. 2014 Utility of contact tracing in reducing the magnitude of Ebola disease. Germs 4, 97–99. (doi:10.11599/germs.2014.1063)
100. WHO Ebola Response Team et al. 2015 Ebola virus disease among children in West Africa. N. Engl. J. Med. 372, 1274–1277. (doi:10.1056/NEJMc1415318)
101. WHO Ebola Response Team. 2016 Ebola virus disease among male and female persons in West Africa. N. Engl. J. Med. 374, 96–98. (doi:10.1056/NEJMc1510305)
102. Donnelly CA et al. 2003 Epidemiological determinants of spread of causal agent of severe acute respiratory syndrome in Hong Kong. Lancet
2009 Inference in epidemic models without likelihoods. Int. J. Biostat. 5(1): Article 24. (doi:10.2202/1557-4679.1171)
109. Obadia T, Haneef R, Boëlle P-Y. 2012 The R0 package: a toolbox to estimate reproduction numbers for epidemic outbreaks. BMC Med. Inform. Decis. Mak. 12, 147. (doi:10.1186/1472-6947-12-147)
110. Chowell G, Viboud C, Simonsen L, Merler S, Vespignani A. 2017 Perspectives on model forecasts of the 2014–2015 Ebola epidemic in West Africa: lessons and the way forward. BMC Med. 15, 42. (doi:10.1186/s12916-017-0811-y)
111. Held L, Meyer S, Bracher J. 2017 Probabilistic forecasting in infectious disease epidemiology: the 13th Armitage lecture. Stat. Med. 36, 3443–3460. (doi:10.1002/sim.7363)
112. Funk S, Camacho A, Kucharski AJ, Lowe R, Eggo RM, Edmunds WJ. 2017 Assessing the performance of real-time epidemic forecasts. bioRxiv. 177451. (doi:10.1101/177451)
113. Kucharski AJ, Camacho A, Flasche S, Glover RE, Edmunds WJ, Funk S. 2015 Measuring the impact of Ebola control measures in Sierra Leone. Proc. Natl Acad. Sci. USA 112, 14 366–14 371. (doi:10.1073/pnas.1508814112)
114. Buring JE. 1987 Epidemiology in medicine. Philadelphia, PA: Lippincott Williams & Wilkins.
115. Grandesso F et al. 2014 Risk factors for cholera transmission in Haiti during inter-peak periods: insights to improve current control strategies from two case-control studies. Epidemiol. Infect. 142, 1625–1635. (doi:10.1017/S0950268813002562)
116. Gross M. 1976 Oswego County revisited. Public Health Rep. 91, 168–170.
117. Buchholz U et al. 2011 German outbreak of Escherichia coli O104:H4 associated with sprouts. N. Engl. J. Med. 365, 1763–1770. (doi:10.1056/NEJMoa1106482)
118. Ebola ça Suffit Ring Vaccination Trial Consortium. 2015 The ring vaccination trial: a novel cluster
61887-5)
123. Robinson ER, Walker TM, Pallen MJ. 2013 Genomics and outbreak investigation: from sequence to consequence. Genome Med. 5, 36. (doi:10.1186/gm440)
124. Hatherell H-A, Didelot X, Pollock SL, Tang P, Crisan A, Johnston JC, Colijn C, Gardy JL. 2016 Declaring a tuberculosis outbreak over with genomic epidemiology. Microb. Genomics 2, e000060. (doi:10.1099/mgen.0.000060)
125. Dudas G et al. 2017 Virus genomes reveal factors that spread and sustained the Ebola epidemic. Nature 544, 309–315. (doi:10.1038/nature22040)
126. Faria NR et al. 2017 Establishment and cryptic transmission of Zika virus in Brazil and the Americas. Nature 546, 406–410. (doi:10.1038/nature22401)
127. Quick J et al. 2016 Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228–232. (doi:10.1038/nature16996)
128. Pallen MJ, Loman NJ, Penn CW. 2010 High-throughput sequencing and clinical microbiology: progress, opportunities and challenges. Curr. Opin. Microbiol. 13, 625–631. (doi:10.1016/j.mib.2010.08.003)
129. Spratt BG, Hanage WP, Li B, Aanensen DM, Feil EJ. 2004 Displaying the relatedness among isolates of bacterial species – the eBURST approach. FEMS Microbiol. Lett. 241, 129–134. (doi:10.1016/j.femsle.2004.11.015)
130. Enright MC, Robinson DA, Randle G, Feil EJ, Grundmann H, Spratt BG. 2002 The evolutionary history of methicillin-resistant Staphylococcus aureus (MRSA). Proc. Natl Acad. Sci. USA 99, 7687–7692. (doi:10.1073/pnas.122108599)
131. King SJ, Leigh JA, Heath PJ, Luque I, Tarradas C, Dowson CG, Whatmore AM. 2002 Development of a multilocus sequence typing scheme for the pig pathogen Streptococcus suis: identification of virulent clones and potential capsular serotype exchange. J. Clin. Microbiol. 40, 3671–3680. (doi:10.1128/JCM.40.10.3671-3680.2002)
132. Urwin R, Maiden MCJ. 2003 Multi-locus sequence typing: a tool for global epidemiology. Trends Microbiol. 11, 479–487. (doi:10.1016/j.tim.2003.08.006)
133. Felsenstein J. 2004 Inferring phylogenies. Sunderland, MA: Sinauer Associates Sunderland.
134. Popescu A-A, Huber KT, Paradis E. 2012 ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics 28, 1536–1537. (doi:10.1093/bioinformatics/bts184)
135. Felsenstein J. 1981 Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17, 368–376. (doi:10.1007/BF01734359)
136. Schliep KP. 2011 phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593. (doi:10.1093/bioinformatics/btq706)
137. Ronquist F, Huelsenbeck JP. 2003 MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19, 1572–1574. (doi:10.1093/bioinformatics/btg180)
138. Grubaugh ND, Faria NR, Andersen KG, Pybus OG. 2018 Genomic insights into zika virus emergence and spread. Cell 172, 1160–1162. (doi:10.1016/j.cell.2018.02.027)
139. Smith GJD et al. 2009 Origins and evolutionary genomics of the 2009 swine-origin H1N1 influenza A epidemic. Nature 459, 1122–1125. (doi:10.1038/nature08182)
140. Siddle KJ et al. 2018 Genomic analysis of lassa virus during an increase in cases in Nigeria in 2018. N. Engl. J. Med. 379, 1745–1753. (doi:10.1056/NEJMoa1804498)
141. Grenfell BT, Pybus OG, Gog JR, Wood JLN, Daly JM, Mumford JA, Holmes EC. 2004 Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332. (doi:10.1126/science.1090727)
142. Cottam EM, Thébaud G, Wadsworth J, Gloster J, Mansley L, Paton DJ, King DP, Haydon DT. 2008 Integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus. Proc. R. Soc. B 275, 887–895. (doi:10.1098/rspb.2007.1442)
143. Ypma RJF, Bataille AMA, Stegeman A, Koch G, Wallinga J, van Ballegooijen WM. 2012 Unravelling transmission trees of infectious diseases by combining genetic and epidemiological data. Proc. R. Soc. B 279, 444–450. (doi:10.1098/rspb.2011.0913)
144. Ypma RJF, van Ballegooijen WM, Wallinga J. 2013 Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics 195, 1055–1062. (doi:10.1534/genetics.113.154856)
145. Jombart T, Cori A, Didelot X, Cauchemez S, Fraser C, Ferguson N. 2014 Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput. Biol. 10, e1003457. (doi:10.1371/journal.pcbi.1003457)
146. Klinkenberg D, Backer JA, Didelot X, Colijn C, Wallinga J. 2017 Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput. Biol. 13, e1005495. (doi:10.1371/journal.pcbi.1005495)
147. Didelot X, Gardy J, Colijn C. 2014 Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol. Biol. Evol. 31, 1869–1879. (doi:10.1093/molbev/msu121)
148. De Maio N, Wu C-H, Wilson DJ. 2016 SCOTTI: efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput. Biol. 12, e1005130. (doi:10.1371/journal.pcbi.1005130)
149. Springborn M, Chowell G, MacLachlan M, Fenichel EP. 2015 Accounting for behavioral responses during a flu epidemic using home television viewing. BMC Infect. Dis. 15, 21. (doi:10.1186/s12879-014-0691-0)
150. R Core Team. 2018 R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
151. RECON-R Epidemics Consortium. 2018 R epidemics consortium. See https://www.repidemicsconsortium.org/ (accessed on 26 September 2018).
152. 2018 RECON learn. See https://www.reconlearn.org (accessed on 26 September 2018).

Philosophical Transactions of the Royal Society B: Biological Sciences

Outbreak analytics: a developing data science for informing the response to emerging pathogens



Publisher
Pubmed Central
Copyright
© 2019 The Authors.
ISSN
0962-8436
eISSN
1471-2970
DOI
10.1098/rstb.2018.0276

Abstract

Outbreak analytics: a developing data science for informing the response to emerging pathogens

Review

Jonathan A. Polonsky (1,3,†), Amrish Baidjoe (4,†), Zhian N. Kamvar (4), Anne Cori (4), Kara Durski (2), W. John Edmunds (5,6), Rosalind M. Eggo (5,6), Sebastian Funk (5,6), Laurent Kaiser (3), Patrick Keating (5,8), Olivier le Polain de Waroux (5,8,9), Michael Marks (7), Paula Moraga (10), Oliver Morgan (1), Pierre Nouvellet (4,11), Ruwan Ratnayake (5,6), Chrissy H. Roberts (7), Jimmy Whitworth (5,8) and Thibaut Jombart (4,5,8)

1 Department of Health Emergency Information and Risk Assessment, and 2 Department of Infectious Hazard Management, World Health Organization, Avenue Appia 20, 1211 Geneva, Switzerland
3 Faculty of Medicine, University of Geneva, 1 rue Michel-Servet, 1211 Geneva, Switzerland
4 Department of Infectious Disease Epidemiology, School of Public Health, MRC Centre for Global Infectious Disease Analysis, Imperial College London, Medical School Building, St Mary’s Campus, Norfolk Place, London W2 1PG, UK
5 Department of Infectious Disease Epidemiology, 6 Centre for Mathematical Modelling of Infectious Diseases, and 7 Clinical Research Department, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, Keppel St, London WC1E 7HT, UK
8 UK Public Health Rapid Support Team, London School of Hygiene and Tropical Medicine, Keppel St, London WC1E 7HT, UK
9 Public Health England, Wellington House, 133–155 Waterloo Road, London SE1 8UG, UK
10 Centre for Health Informatics, Computing and Statistics (CHICAS), Lancaster Medical School, Lancaster University, Lancaster LA1 4YW, UK
11 School of Life Sciences, University of Sussex, Sussex House, Brighton BN1 9RH, UK

JAP, 0000-0002-8634-4255; AB, 0000-0001-5295-5085; ZNK, 0000-0003-1458-7108; AC, 0000-0002-8443-9162; SF, 0000-0002-2842-3406; OM, 0000-0002-9543-3778; PN, 0000-0002-6094-5722; TJ, 0000-0003-2226-8692

Cite this article: Polonsky JA et al. 2019 Outbreak analytics: a developing data science for informing the response to emerging pathogens. Phil. Trans. R. Soc. B 374: 20180276. http://dx.doi.org/10.1098/rstb.2018.0276

Accepted: 4 December 2018

One contribution of 16 to a theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: epidemic forecasting and control’.

Subject Areas: health and disease and epidemiology

Keywords: epidemics, infectious, methods, tools, pipeline, software

Author for correspondence: Thibaut Jombart, e-mail: thibautjombart@gmail.com

Despite continued efforts to improve health systems worldwide, emerging pathogen epidemics remain a major public health concern. Effective response to such outbreaks relies on timely intervention, ideally informed by all available sources of data. The collection, visualization and analysis of outbreak data are becoming increasingly complex, owing to the diversity in types of data, questions and available methods to address them. Recent advances have led to the rise of outbreak analytics, an emerging data science focused on the technological and methodological aspects of the outbreak data pipeline, from collection to analysis, modelling and reporting to inform outbreak response. In this article, we assess the current state of the field. After laying out the context of outbreak response, we critically review the most common analytics components, their inter-dependencies, data requirements and the type of information they can provide to inform operations in real time. We discuss some challenges and opportunities and conclude on the potential role of outbreak analytics for improving our understanding of, and response to outbreaks of emerging pathogens.

This article is part of the theme issue ‘Modelling infectious disease outbreaks in humans, animals and plants: epidemic forecasting and control’.
This theme issue is linked with the earlier issue ‘Modelling infectious disease outbreaks in humans, animals and plants: approaches and important themes’.

† These authors contributed equally to the study.

© 2019 The Authors. Published by the Royal Society under the terms of the Creative Commons Attribution License http://creativecommons.org/licenses/by/4.0/, which permits unrestricted use, provided the original author and source are credited.

1. Introduction

Emerging infectious diseases are a constant threat to public health worldwide. In the past decade, several major outbreaks, such as the 2009 influenza pandemic [1], the Middle-East Respiratory Syndrome coronavirus (MERS-CoV) [2–4], the emergence of Zika [5,6] and the West African Ebola virus disease (EVD) outbreak [7,8], have been potent reminders of the need for robust surveillance systems and timely responses to nascent epidemics [9]. The West African EVD outbreak, by far the largest such epidemic in recorded history, in particular, had a strong impact on global health security and public health policy and practice [7,8,10]. It highlighted the difficulties of maintaining situational awareness in the absence of standards for surveillance, data collection and analysis, as well as the challenges of mounting and sustaining a large-scale international response [7,8,11,12]. Despite the lessons learnt [9,13,14], the recent (2018) EVD outbreaks in Democratic Republic of the Congo [15,16] are a stark reminder that a large number of these challenges remain.

An important feature of the modern response to epidemics is the increasing focus on exploiting all available data to inform the response in real time and allow evidence-based decision making [3,4,7,8,13,17]. Using data for improving situational awareness is complex, involving a range of inter-connected tasks and skills from point-of-care data collection to the generation of informative situational reports (sitreps). The science underpinning these data pipelines involves a wide range of approaches, including database design and mobile technology [18,19], frequentist statistics and maximum-likelihood estimation [7], interactive data visualization [20,21], geostatistics [22–24], graph theory [20,25,26], Bayesian statistics [8,27,28], mathematical modelling [29–31], genetic analyses [32–36] and evidence synthesis approaches [37]. This accretion of heterogeneous disciplines, which may be best summarized as ‘outbreak analytics’, forms an emerging domain of data science: an ‘interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms’ [38], dedicated to informing outbreak response. Outbreak analytics sits at the crossroads of public health planning, field epidemiology, methodological development and information technologies, opening up exciting opportunities for specialists in these fields to work together to meet the needs for an epidemic response.

In this article, we outline this developing research field and review the current state of outbreak analytics. In particular, we focus on how different analysis components interact within functional workflows, and how each component can be used to inform different stages of an outbreak response. We discuss key challenges and opportunities associated with the deployment of efficient, reliable and informative data analysis pipelines and their potential impact.

2. The outbreak response context

(a) The different phases of an outbreak response

The focus of the public health response shifts during the course of an epidemic or outbreak, and so do the analytics. We identify four main stages (figure 1). The detection stage starts with the first case and ends with the first intervention activities (e.g. patient isolation, contact tracing, vaccination) and involves surveillance systems and mostly qualitative risk assessments. Next, the early response is the initial part of the intervention during which the first simple analytics can take place, essentially centred around estimating transmissibility. This blends into the intervention stage, where more complex analytics may be involved to inform planning (e.g. vaccination strategies), which ends once the last reported case has recovered or died. The post-intervention stage is for lessons to be learned, for improving preparedness for the next epidemic and for training and capacity building [39].

Figure 1. Successive phases of an outbreak response. The histogram along the top represents reported (yellow) and unreported (grey) incidence.

(b) Questions during and after the intervention

During the early response, efforts are dedicated to estimating the likely impact of the outbreak and anticipating the nature, scale and timing of resources needed [7,13,15]. Theoretically, different factors including not only the total number of cases and fatalities but also the morbidity and overall impact on quality of life, as well as societal and economic impact, should ideally be taken into account when attempting to predict disease burden [40–43]. Generally, as the demographic and morbidity data needed by composite measures of health-adjusted life years [40] are lacking in outbreak response contexts, efforts tend to focus on other proxies of impact: assessing transmissibility, predicting future case incidence and associated mortality and investigating risk factors [1,3,7,15].

Analytical needs diversify as the intervention progresses. While investigations of transmissibility, mortality and risk factors remain key throughout [8], new questions may arise to inform the implementation of control and mitigation measures. These may focus on predicting the impact of potential measures including testing (e.g. ‘Could a rapid test help reduce incidence?’ [29]), vaccine development (e.g. ‘Could a candidate vaccine be evaluated in this outbreak?’ [44,45]), vaccination campaigns (e.g. ‘Which is the optimal vaccination strategy?’ [46,47]) or travel restrictions (e.g. ‘Should international travel be restricted?’ [48]), or on estimating the impact of current measures such as improvements in access to care (e.g. ‘Has the delay between symptom onset and hospitalization been reduced?’ [14,15]). As case incidence reduces, statistical modelling can also be useful for assessing or predicting the end of an outbreak [49–51]. At the field operational level, outbreak response analytics may be best focused on informing and monitoring core surveillance activities and performance indicators, such as contact tracing [11], through the use of tools for contact data visualization [52], mapping [53,54] and on analysis pipelines integrating mobile data collection tools [18,19,55,56] with automated reporting systems [57–59]. Finally, the post-intervention phase lends itself to retrospective studies, which can assess further the impact of interventions [60], tease apart finer processes driving the epidemic dynamics such as contact patterns [12,61], study risk factors [54,62], identify avenues for fortifying surveillance [13,36,63] and evaluate, improve and develop modelling techniques [28,64,65].

(c) What are outbreak data?

The term ‘outbreak data’ encompasses different types of information, of which we first distinguish ‘case data’ from ‘background data’. Case data include the description of reported cases gathered in linelists, i.e. flat files where each row is a case and each column a recorded variable (e.g. dates of onset and admission, gender, age, location), thereby fulfilling the definition of ‘tidy data’ in the data science community [66]. Case data also include exposure and contact tracing data, either stored within a linelist or in separate files, pathogen whole genome sequencing (WGS) and data pertaining to outbreak investigations (e.g. case–control and cohort study data). Background data document the underlying characteristics of the affected populations. This includes demographic information (e.g. maps of population densities, age stratification, mixing patterns), movement data (e.g. borders, traveller flows, migration), health infrastructure (e.g. healthcare facilities, drug stockpiles) and epidemiological data themselves (e.g. levels of pre-existing immunity). A final type of data we consider here is ‘intervention data’, which refers to information on decisions made and efforts deployed as part of the intervention, such as vaccination coverage, the extent of active case finding or potential changes in the epidemiological case definition. An in-depth discussion of data needs in outbreaks can be found in Cori et al. [13].

3. Outbreak analytics

(a) An overview of the outbreak analytics toolbox

We use the term ‘outbreak analytics’ to refer to the variety of tools and methods used to collect, curate, visualize, analyse, model and report on outbreak data. These tools and their inter-dependencies are summarized in an exemplary workflow represented in figure 2, derived from analysis pipelines used during recent epidemics of pandemic influenza [1], MERS-CoV [4] and EVD [7,8,17]. Note that workflows may vary substantially in other epidemic contexts. For instance, analyses of food-borne outbreaks may focus on traceback data [67–69], while vector-borne disease analysis may focus heavily on modelling the vector’s ecological niche [70,71].

(b) Tools for the collection of epidemiological data

Tools for data capture have become a focus of much discussion in recent years as those involved in outbreak response seek to make use of important technological advances including mobile data collection, cloud computing and built-in automated data analyses and reporting. In resource-limited settings, in particular, epidemiological data are still often collected with pen and paper, the advantages of which are familiarity, simplicity, low cost and reliability where access to Internet and power sources may be limited. However, there are some downsides to using paper as a data management tool, becoming increasingly important with larger outbreaks, as any system for the printing and distribution, collection and storage and digitization of forms becomes overwhelmed. Additionally, two-stage processes involving transcription of data from forms typically introduce additional data entry errors [72–75] and substantial delays from data capture to analysis [72].

Electronic data collection (EDC) is becoming increasingly popular [18,19,55,56]. These tools make use of widely available, low-cost hardware (e.g. smartphones and tablets) [76] that can, when appropriately configured, consume little power and collect data offline, making them suitable for use in resource-poor settings. Some of those may be part of existing surveillance systems or be deployed instead for specific enhanced surveillance and response activities during an outbreak. EDC platforms can also enhance data quality through the use of restriction rules and logical checks, and enforce reporting (even when there are zero cases) and entry of essential variables [72,76]. EDC can decrease the delay between data collection, centralization and analysis, which is critical for data-driven responses. Time can be saved through ‘form logic’ (e.g. automatically skipping sections of a survey not relevant to a participant), while real-time, automated centralization, data analysis and reporting can be directly built into the platform. In addition, mobile-based EDC enables the collection of other types of data including GPS coordinates, photographs, barcodes (useful to link case data and clinical specimens) and even aiding diagnostics by directly interfacing with point-of-care diagnostic devices [77–79].

Maintaining confidentiality and privacy is a legitimate concern whenever data concerning human subjects are collected. While EDC systems provide opportunities for unauthorized interception and access to such information, many systems support end-to-end encryption during data transfer [80], although few provide additional security through encryption at the level of data entry.

(c) Descriptive analyses

The first, and arguably one of the most important steps in data analysis is exploration, where visualization plays a central role, completed with informative summary statistics [81,82]. The first type of graphics needed for rapid assessment of ongoing dynamics is the epidemic curve (epicurve), which shows case incidence time series as a histogram of new onset dates for a given time interval [83–85]. Cumulative case counts, sometimes used in the absence of a raw linelist, are best avoided in epicurves, as they tend to obscure ongoing dynamics and create statistical dependencies in data points that will result in biases and lead to under-estimating uncertainty in downstream modelling [86].

Maps have been at the core of infectious disease epidemiology from a very early stage [87]. Nowadays, they are typically used to visualize the distribution of disease [88], for representing the ‘ecological niche’ of infectious diseases at large scales [23,24,89] and for assessing the spatial dynamics of an outbreak and strategizing interventions [7,8]. Providers of free and crowd-sourced [90] geographical data like the Humanitarian OpenStreetMap Team (Humanitarian OpenStreetMap Team Home; see https://www.hotosm.org/ (accessed 26 September 2018)), the Missing Maps project (MissingMaps; see https://www.missingmaps.org/ (accessed 26 September 2018)), healthsites.io (see https://healthsites.io/ (accessed 26 September 2018)) and the Radiant Earth Foundation (Radiant Earth Foundation – Earth imagery for impact; see https://www.radiant.earth (accessed 18 November 2018)) provide layers of spatial data that include information on the location of households and health facilities, among other determinants. Several tools including SaTScan and ClusterSeer are routinely applied to surveillance system data for automated outbreak detection and the evaluation of clustering of disease by time and space [91]. Other examples of freely available mapping tools that can help track the spread of infectious diseases include the Spatial Epidemiology of Viral Haemorrhagic Fevers (VHF) disease visualization (see http://www.healthdata.org/datavisualization/spatial-epidemiology-viralhemorrhagic-fevers; accessed 19 September 2018), which maps risks of emergence and spread of VHF diseases, Nextstrain [92] and Microreact [93], which focus on mapping pathogen evolution and epidemic spread, and HealthMap [94], which provides resources for the rapid detection of outbreaks. Geographical locations of reported cases can also be useful for informing more complex modelling approaches [95].

Figure 2. Example of outbreak analytics workflow. This schematic represents eight general analyses that can be performed from outbreak data. Outputs containing actionable information for the operations are represented as hexagons. Data needed for each analysis are represented as a different colour in the center, using plain and light shading for mandatory and optional data, respectively. (Online version in colour.)

In epidemics driven by person-to-person transmission, a last essential source of data is contact data [20], which includes data on case exposure [12] as well as contact tracing, where appropriate [11,63]. Exposure data document transmission pairs, which can yield precious insights into ‘paired delays’ (figure 2) including the serial interval (time between onsets of a case and their infector) or the generation time (time between the dates of infections of a case and their infector) [7,8], which are in turn useful for estimating transmissibility [27,28,96,97]. Exposure data can also be used to investigate the occurrence and determinants of super-spreading events [12] and help identify introduction events in the case of zoonotic diseases [98]. Contact tracing, through the early detection of new cases and their subsequent isolation and treatment, plays a central role in reducing onward transmission and therefore containing outbreaks [11,63,99], while additionally providing potential information on risk factors [7,11].

Summary statistics are a useful complement to data visualization in the exploratory phase of data analysis. Some metrics, such as transmissibility, require the use of statistical or mathematical models in order to be estimated (see §3d below) and are therefore not readily available as descriptive tools. Other useful statistics can be readily computed from linelists, including different demographic indicators of the reported cases (e.g. gender, age, occupation [7,100,101]), case fatality ratios (the proportion of cases who died of the infection) or case delays such as the times to hospitalization, recovery or death, reported as a whole [1,7,8] or stratified by groups [100,101]. The incubation period (time from infection to symptom onset) is another important delay for informing the intervention (e.g. to define the duration of contact tracing or declare the end of an outbreak), but can be harder to derive as it requires data on case exposure as well. Note that in the case of delays, these are best analysed by characterising the full distribution (e.g. by fitting to an appropriate probability distribution such as discretized Gamma [7]) rather than reported as a single central value [7,8,102,103].

(d) Quantifying transmissibility

The ‘transmissibility’ of a disease is here used to refer to the rate at which new cases arise in the population, resulting either in epidemic growth or decline [1,3,27,28]. Rather than an intrinsic property of a specific disease, transmissibility thus defined quantifies the propagation of a pathogen in a given epidemic setting and is impacted by multiple factors including population demographics, mixing and levels of pre-existing immunity. Importantly, estimates of transmissibility reported in the literature will typically be biased towards higher values, as subcritical outbreaks are by definition less likely to be detected. Several metrics of transmissibility can be used depending on the type of data available and can be estimated using different approaches.

A first measure of transmissibility is the growth rate (r), which is estimated from a simple model where case incidence is either exponentially growing (r > 0) or declining (r < 0). Typically, r is estimated directly from epicurves (figure 2) using a log-linear model, where r is defined as the slope of a linear regression on log-transformed incidence [104,105]. Besides its simplicity and its computational efficiency, this approach has the benefits of being embedded in the linear modelling framework, thereby allowing one to measure the uncertainty associated with a given estimate of r, to test for differences in growth rates, e.g. between different locations, and to derive short-term incidence predictions. Moreover, the growth rate can also be used to estimate the doubling and halving times of the epidemic, i.e. the time during which incidence doubles (respectively is halved), as alternative metrics of transmissibility [103]. Unfortunately, the log-linear model can only fit exponentially growing or decaying outbreaks, which may not always be appropriate in the presence of complex spatial or age structure, or owing to changes in reporting, transmissibility or proportion of susceptible individuals over time. Besides, it cannot readily accommodate time periods with no cases, so that its applicability may in practice be restricted.

While r quantifies the speed at which a disease spreads, it does not contain information on the level of the intervention that is necessary to control a disease [106]. This is better characterized by the reproduction number (here generically noted ‘R’), which measures the average number of secondary cases caused by each primary case. Researchers typically distinguish the basic reproduction number (R0 [104]), which applies in a large, fully susceptible population, without any control measures, from the effective reproduction number (Reff), which is the number of secondary cases after accounting for behavioural changes, interventions and declines in susceptibility [96]. The current reproduction number determines the dynamics of the epidemic in the near future, with values greater than 1 predicting an increase in cases, and values less than 1 predicting control [104]. The value of R can also be used to calculate the fraction of the population that needs to be immunized (typically through vaccination) in order to contain an outbreak [104].

Different methodological approaches have been developed to estimate the reproduction number. R can be approximated using estimates of the growth rate r combined with knowledge of the generation time distribution [97]. R can also be derived from compartmental models [104,107]. The formula will depend on the type of model used, but such estimation will usually require that different rates (e.g. rates of infection, recovery, death) are either known or estimated by fitting the model to data [104,107]. Real-world complexities can be incorporated into this approach; however, fitting such models can be challenging and may require computationally intensive algorithms such as data augmentation, approximate Bayesian computation, or particle filters [108]. Compartmental models also require assumptions about the total population size and the proportion of the population at risk, which may be difficult to estimate in an outbreak. As an alternative, branching process models can be used to estimate R directly from incidence data [27,28,96,109]. This requires a pre-specified distribution of the generation time, or of the serial interval, although recent developments suggest that in some cases, the generation time distribution itself can also be simultaneously estimated [4]. Branching process models are usually much simpler to fit to data than their compartmental counterparts, which facilitates their use in real time [27].

Beyond the mere estimation of transmissibility, it is often essential to forecast future incidence for advocacy and planning purposes, e.g. to compare different interventions and epidemic scenarios [7,8,15,30]. A variety of mathematical and statistical models, including those reviewed here for estimating transmissibility, can also be used for short-term incidence forecasting [65]. Despite the growing body of research focusing on predicting incidence during epidemics [65,110], there are currently no gold standards and the relative performances of forecasting methods largely remain to be assessed. Methods that have been developed and applied in other fields to rigorously assess not just the accuracy of forecasts but also how well models quantify the inherent uncertainty in making predictions, are only rarely applied in infectious disease epidemiology [111,112]. Whether it is to estimate R or predict future incidence, the most appropriate method ultimately depends on the particular epidemiological setting, existing knowledge of the transmission dynamics and data availability. Branching process models, for example, can be used for a quick estimate of the current value of R from the recent trend in case numbers and, by extrapolating this forward, of expected case numbers in

years [127,128], genetic analysis will likely carve out its own space in the outbreak analytics toolkit.

Different methods can be used to extract information from pathogen WGS. In bacterial genomics, molecular epidemiology methods have been used extensively for defining strains of related isolates [32,129], which can be used to infer various features of the pathogens sampled such as their origins, antimicrobial resistance profiles, virulence or antigenic characteristics [130–132]. These methods usually exploit only a fraction of the information contained within pathogens’ genomes, as they rely on genetic variat