What difference does quantity make? On the epistemology of Big Data in biology

Abstract
Is Big Data science a whole new way of doing research? And what difference does data quantity make to knowledge production strategies and their outputs? I argue that the novelty of Big Data science does not lie in the sheer quantity of data involved, but rather in (1) the prominence and status acquired by data as commodity and recognised output, both within and outside of the scientific community and (2) the methods, infrastructures, technologies, skills and knowledge developed to handle data. These developments generate the impression that data-intensive research is a new mode of doing science, with its own epistemology and norms. To assess this claim, one needs to consider the ways in which data are actually disseminated and used to generate knowledge. Accordingly, this article reviews the development of sophisticated ways to disseminate, integrate and re-use data acquired on model organisms over the last three decades of work in experimental biology. I focus on online databases as prominent infrastructures set up to organise and interpret such data and examine the wealth and diversity of expertise, resources and conceptual scaffolding that such databases draw upon. This illuminates some of the conditions under which Big Data needs to be curated to support processes of discovery across biological subfields, which in turn highlights the difficulties caused by the lack of adequate curation for the vast majority of data in the life sciences. In closing, I reflect on the difference that data quantity is making to contemporary biology, the methodological and epistemic challenges of identifying and analysing data given these developments, and the opportunities and worries associated with Big Data discourse and methods.
Keywords Big Data epistemology, data-intensive science, biology, databases, data infrastructures, data curation, model organisms Introduction (partly because, as philosophers of science have long Big Data has become a central aspect of contemporary shown, there is no such thing as direct inference from science and policy, due to a variety of reasons that data, and data interpretation typically involves the use include both techno-scientific factors and the political of modelling techniques and various other kinds of con- and economic roles played by this terminology. The ceptual and material scaffolding). On the other hand, idea that Big Data is ushering in a whole new way of many sciences have a long history of dealing with large thinking, particularly within the sciences, is rampant – quantities of data, whose size and scale vastly outstrip as exemplified by the emergence of dedicated funding, available strategies and technologies for data collection, policies and publication venues (such as this journal). dissemination and analysis (Gitelman, 2013). This is This is at once fascinating and perplexing to scholars interested in the history, philosophy and social studies University of Exeter, UK of science. On the one hand, there seems to be some- thing interesting and novel happening as a consequence Corresponding author: of Big Data techniques and communication strategies, S Leonelli, University of Exeter, Byrne House, St Germans Road, Exeter which is, however, hard to capture with traditional EX4 4PJ, UK. 
notions, such as ‘induction’ and ‘data-driven’ science Email: S.Leonelli@exeter.ac.uk Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http:// www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (http://www.uk.sagepub.com/aboutus/open- access.htm). XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] 2 Big Data & Society particularly evident in the life sciences, where data- dedicated investments than data generated in the rest gathering practices in subfields, such as natural history of the life sciences and biomedicine. Considering the and taxonomy, have been at the heart of inquiry since challenges encountered in disseminating this type of the early modern era, and have generated problems data thus also highlights the potential problems ever since (e.g. Johnson, 2012; Muller-Wille and involved in assembling data that have not received Charmantier, 2012). comparable levels of care (i.e. the vast majority of bio- So what is actually new here? How does Big Data logical data). science differ from other forms of inquiry, what can and In my conclusions, I use these findings to inform a cannot be learnt from Big Data, and what difference critique of the supposed revolutionary power of Big does quantity make? In this article, I discuss some of Data science. 
In its stead, I propose a less sensational, the central characteristics typically associated with Big but arguably more realistic, reflection on the difference Data, as conveniently summarised within the recent that data quantity is making to contemporary bio- book Big Data by Mayer-Scho¨ nberger and Cukier logical research, which stresses both continuities with (2013), and I scrutinise their plausibility in the case of and dissimilarities from previous attempts to handle biological research. I then argue that the novelty of Big large datasets. I also suggest that the natural sciences Data science does not lie in the sheer quantity of data may well be the area that is least affected by Big Data, involved, though this certainly makes a difference to whose emergence is much more likely to affect the pol- research methods and results. Rather, the novelty of itical and economic realms – though not necessarily for Big Data science lies in (1) the prominence and status the better. acquired by data as scientific commodity and recog- nised output both within and beyond the sciences and The novelty of Big Data (2) the methods, infrastructures, technologies and skills developed to handle (format, disseminate, retrieve, I will start by considering three ideas that, according to model and interpret) data. These developments gener- Mayer-Schonberger and Cukier (2013) among others, ate the impression that data-intensive research is a constitute core innovations brought in by the advent of whole new mode of doing science, with its own epis- Big Data in all realms of human activity, including sci- temology and norms. I here defend the idea that in ence. The first idea is what I shall label comprehensive- order to understand and critically evaluate this ness. 
This is the claim that the accumulation of large claim, one needs to analyse the ways in which data datasets enables scientists to ground their analysis on are actually disseminated and used to generate know- several different aspects of the same phenomenon, ledge, which I refer to as ‘data journeys’; and I consider documented by different people at different times. the extent to which the current handling of Big Data According to Mayer-Scho¨ nberger and Cukier, data fosters and validates its use as evidence towards new can become so big as to encompass all the available discoveries. data on a phenomenon of interest. As a consequence, Accordingly, the bulk of this article reviews the Big Data can provide a comprehensive perspective on development of sophisticated ways to disseminate, inte- the characteristics of that phenomenon, without need- grate and re-use data acquired on model organisms, ing to focus on specific details. such as the small plant Arabidopsis thaliana, the nema- The second idea is that of messiness. Big Data, it is tode Caenorhabditis elegans and the fruit-fly Drosophila argued, pushes researchers to embrace the complex and melanogaster (including data on their ecology, metab- multifaceted nature of the real world, rather than pur- olism, morphology and relations to other species) over suing exactitude and accuracy in measurement obtained the last three decades of work in experimental biology. under controlled conditions. Indeed, it is impossible to I focus on online databases as a key example of infra- assemble Big Data in ways that are guaranteed to be structures set up to organise and interpret such data; accurate and homogeneous. Rather, we should resign and on the wealth and diversity of expertise, resources ourselves to the fact that ‘Big Data is messy, varies in and conceptual scaffolding that such databases draw quality, and is distributed across countless servers upon in order to function well. 
This analysis of data around the world’ (Mayer-Schonberger and Cukier, journeys through model organism databases illumin- 2013: 13) and welcome the advantages of this lack of ates some of the conditions under which the evidential exactitude: ‘With Big Data, we’ll often be satisfied with value of data posted online can be assessed and inter- a sense of general direction rather than knowing a phe- preted by researchers wishing to use those data to foster nomenon down to the inch, the penny, the atom’ discovery. At the same time, model organism biology (Mayer-Scho¨ nberger and Cukier, 2013). has been one of the best funded scientific areas over the The idea of messiness relates closely to the third last three decades, and the curation of data produced key innovation brought about by Big Data, which therein has benefited from much more attention and Mayer-Scho¨ nberger and Cukier call the ‘triumph of XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] Leonelli 3 correlations’. Correlations, defined as the statistical these corollaries into question, which in turn comprom- relationship between two data values, are notoriously ises the plausibility of the three claims that Mayer- useful as heuristic devices within the sciences. Spotting Schonberger and Cukier make about the power of the fact that when one of the data values changes the Big Data – at least when they are applied to the other is likely to change too is the starting point for realm of scientific inquiry. Let me immediately state many discoveries. 
However, scientists have typically that I do not intend this analysis to deny the wide- mistrusted correlations as a source of reliable know- spread attraction that these three ideas are generating ledge in and of themselves, chiefly because they may in many spheres of contemporary society (most obvi- be spurious – either because they result from serendip- ously, big government) and which is undoubtedly mir- ity rather than specific mechanisms or because they are rored in the ways in which biological research has been due to external factors. Big Data can override those re-organised since at least the early 2000s (which is worries. Mayer-Scho¨ nberger and Cukier (2013: 52) when technologies for the high-throughput production give the example of Amazon.com, whose astonishing of genomic data, such as sequencing machines, started expansion over the last few years is at least partly due to become widely used). Rather, I wish to shed some to their clever use of statistical correlations among the clarity on the gulf that separates the hyperbolic claims myriad of data provided by their consumer base in made about the novelty of Big Data science from the order to spot users’ preferences and successfully suggest challenges, problems and achievements characterising new items for consumption. In cases such as this, cor- data-handling practices in the everyday working life relations do indeed provide powerful knowledge that of biologists – and particularly the ways in which new was not available before. Hence, Big Data encourages computational and communication technologies such a growing respect for correlation, which comes to be as online databases are being developed so as to trans- appreciated as not only a more informative and plau- form these ideas into reality. sible form of knowledge than the more definite but also a more elusive, causal explanation. 
In the words of Big Data journeys in biology Mayer-Schonberger and Cukier (2013: 14): ‘the correla- tions may not tell us precisely why something is hap- For scientists to be able to analyse Big Data, those data pening, but they alert us that it is happening. And in have to be collected and assembled in ways that make it many situations this is good enough’. suitable to consider them as a single body of informa- These three ideas have two important corollaries, tion (O’Malley and Soyer, 2012). This is a particularly which shall constitute the main target of my analysis difficult task in the case of biological data, given the in this article. The first corollary is that Big Data makes highly fragmented and pluralist history of the field. reliance on small sampling, and even debates over sam- For a start, there are myriads of epistemic communities pling, unnecessary. This again seems to make sense within the life sciences, each of which uses a different prima facie: if we have all the data about a given phe- combination of methods, locations, materials, back- nomenon, what is the point of pondering which types of ground knowledge and interest to produce data. data might best document it? Rather, one can now skip Furthermore, there are vast differences in the types of that step and focus instead on assembling and analysing data that can be produced and the phenomena that can as much data as possible about the phenomenon of be targeted. And last but not least, the organisms and interest, so as to generate reliable knowledge about it: ecosystems on which data are being produced are both ‘Big Data gives us an especially clear view of the granu- highly variable and highly unstable, given their con- lar; subcategories and submarkets that samples can’t stant exposure to both developmental and evolutionary assess’ (Mayer-Scho¨ nberger and Cukier, 2013: 13). change. 
Given this situation, a crucial question within The second corollary is that Big Data is viewed, Big Data science concerns how one can bring such dif- through its mere existence, as countering the risk of ferent data types, coming from a variety of sources, bias in data collection and interpretation. This is under the same umbrella. because having access to large datasets makes it more To address this question, my research over the last likely that bias and error will be automatically elimi- eight years has focused on documenting and analysing nated from the system, for instance via what sociolo- the ways in which biological data – and particularly gists and philosophers call ‘triangulation’: the tendency ‘omics’ data, the quintessential form of ‘Big Data’ in of reliable data to cluster together, so that the more the life sciences – travel across research contexts, and data one has, the easier it becomes to cross-check the significant conceptual and material scaffolding used them with each other and eliminate the data that look by researchers to achieve this. For the purposes of this like outliers (Denzin, 2006; Wylie, 2002). article, I shall now focus on one case of Big Data hand- Over the next few sections, I show how an empirical ling in biology, which is arguably among the most study of how Big Data biology operates puts both of sophisticated and successful attempts made to integrate XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] 4 Big Data & Society vast quantities of data of different types within this Despite constant advances, it is still impossible to auto- field for the purposes of advancing future knowledge mate the de-contextualisation of most types of bio- production. This is the development of model organ- logical data. ism databases between 2000 and 2010. 
These data- Formatting data to ensure that they can all be ana- bases were built with the immediate goal of storing lysed as a unique body of evidence is thus exceedingly and disseminating genomic data in a formalised labour-intensive, and requires the development of manner, and the long-term vision of (1) incorporating databases with long-term funding and enough person- and integrating any data available on the biology of nel to make sure that data submission and formatting the organism in question within a single resource, is carried out adequately. Setting up such resources is including data on physiology, metabolism and even an expensive business. Indeed, debate keeps raging morphology; (2) allowing and promoting cooperation among funding agencies about who is responsible for with other community databases so that the available maintaining these infrastructures. Many model organ- datasets would eventually be comparable across spe- ism databases have struggled to attract enough fund- cies; and (3) gathering information about laboratories ing to support their de-contextualisation activities. working on each organism and the associated experi- Hence, they have resorted to include only data that mental protocols, materials and instruments, thus pro- had been already published in a scientific journal – viding a platform for community building. Particularly thus vastly restricting the amount of data hosted by useful and rich examples include FlyBase, dedicated to the database – or that were donated by data producers D. melanogaster; WormBase, focused on C. elegans; in a format compatible to the ones supported by the and The Arabidopsis Information Resource, gathering database (Bastow and Leonelli, 2010). Despite the data on A. thaliana. At the turn of the 21st century, increasing pressure to disseminate data in the public these were arguably among most sophisticated com- domain, as recently recommended by the Royal munity databases within biology. 
They have played Society (2012) and several funding bodies in the UK a particularly significant role in the development (Levin et al., in preparation), the latter category com- of online data infrastructures in this area and continue prises a very small number of researchers. Again, this to serve as reference points for the construction is largely due to the labour-intensive nature of de-con- of other databases to this day (Leonelli and textualisation processes. Researchers who wish to Ankeny, 2012). They therefore represent a good submit their data to a database need to make sure instance of infrastructure explicitly set up to support that the format that they use, and the metadata that and promote Big Data research in experimental they provide, fit existing standards – which in turn biology. means acquiring updated knowledge on what the In order to analyse how these databases enable data standards are and how they can be implemented, if journeys, I will distinguish between three stages of data at all; and taking time out of experiments and grant- travel, and briefly describe the extent to which database writing. There are presently very few incentives for curators are involved in their realisation. researchers to sacrifice research time in this way, as data donation is not acknowledged as a contribu- tion to scientific research (Ankeny and Leonelli, Stage 1: De-contextualisation in press). One of the main tasks of database curators is to de- contextualise the data that are included in their Stage 2: Re-contextualisation resources, so that they can travel outside of their ori- ginal production context and become available for inte- Once data have been de-contextualised and added to a gration with other datasets (thus forming a Big Data database, the next stage of their journey is to be re- collection). 
The process of de-contextualisation contextualised – in other words, to be adopted by a involves making sure that data are formatted in ways new research context, in which they can be integrated that make them compatible with datasets coming from with other data and possibly contribute to spotting new other sources, so that they are easy to analyse by correlations. Within biology, re-contextualisation can researchers who see them for the first time. Given the only happen if database users have access not only to above-mentioned fragmentation and diversity of data the data themselves but also to the information about production processes to be found within biology, there their provenance – typically including the specific strain tends to be no agreement on formatting standards for of organisms on which they were collected, the instru- even the most common of data types (such as metabo- ments and procedures used for data collection, and the lomics data, for instance; Leonelli et al., 2013). As a composition of the research team who originated them result, database curators often need to assess how to in the first place. This sort of information, typically deal with specific datasets on a one-to-one basis. referred to as ‘metadata’ (Edwards et al., 2011; XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] Leonelli 5 Leonelli, 2010), is indispensable to researchers wishing communicate information about research methods to evaluate the reliability and quality of data. Even and protocols. Indeed, despite the attempted implemen- more importantly, it makes the interpretation of the tation of standard descriptions such as the Minimal scientific significance of the data possible, thus enabling Information about Biological and Biomedical researchers to extract meaning from their scrutiny of Investigation, standards in this area are very under- databases. 
developed and rarely used by biologists (Leonelli, Given the challenges already linked to the de-con- 2012a). This makes the job of curators even more dif- textualisation of data, it will come as no surprise that ficult, as they are then left with the task of selecting re-contextualising them is proving even harder in bio- which metadata to insert in their database, and which logical practice. The selection and annotation of meta- format to use in order to provide such information. data is more labour-intensive than the formatting of Additionally, curators are often asked to provide a pre- data themselves, and involves the establishment of sev- liminary assessment of the quality of data, which can eral types of standards, each of which is managed by its act as a guideline for researchers interested in large own network of funding and institutions. For a start, it datasets. Curators achieve this through so-called ‘evi- presupposes reliable reference to material specimens of dence codes’ and ‘confidence rankings’ which, however, the model organisms in question. In other words, it is tend to be based on controversial assumptions (for important to standardise the materials on which data instance, the idea that data obtained through physical are produced as much as possible, so that researchers interaction with organisms are more trustworthy than working on those data in different locations can order simulation results) which may not fit all scenarios in those materials and reasonably assume that they are which data may be adopted. indeed the same materials as those from which data were originally extracted. Within model organism biol- Stage 3: Re-use ogy, the standardisation, coordination and dissemin- ation of specimens is in the hands of appositely built The final stage of data journeys that I wish to examine stock centres, which collect as many strains of organ- is that of re-use. 
One of the central themes in Big isms as possible, pair them up with datasets stored in Data research is the opportunity to re-use the same databases, and make them available for order to datasets to uncover a large number of different correl- researchers interested in the data. In the best cases, ations. After having been de-contextualised and re- this happens through the mediation of databases them- contextualised, data are therefore supposed to fulfil selves; for instance, The Arabidopsis Research their epistemic role by leading to a variety of new Database has long incorporated the option to order discoveries. From my observations above, it will materials associated with data stored therein at the already be clear that very few of the data produced same time as one is viewing the data (Rosenthal and within experimental biology make it to this stage of Ashburner, 2002). However, such a well-organised their journeys, due to the lack of standardisation in coordination between databases and stock centres is their format and production techniques, as well as the rare, particularly in cases where the specimens to be absence of stable reference materials to which data can collected and ordered are not easily transportable be meaningfully associated for re-contextualisation. items, such as seeds and worms, but organisms that Data that cannot be de-contextualised and re- are difficult and expensive to keep and disseminate, contextualised are not generally included into model such as viruses and mice. Most organisms used for organism databases, and thus do not become part of a experimental research do not even have a centralised body of Big Data from which biologically significant stock centre collecting exemplars for further dissemin- inferences can be made. Remarkably, the data that are ation. 
As a result, the data generated from these organ- most successfully assembled into big collections are isms are hard to incorporate into databases, as genomic data, such as genome sequences and micro- providing them with adequate metadata proves impos- arrays, which are produced through highly standar- sible (Leonelli, 2012a). dised technologies and are therefore easier to format Another serious challenge to the development of for travel. This is bad news for biological research metadata consists of capturing experimental protocols focused on understanding higher-level processes, such and procedures, which in biology are notoriously idio- as organismal development, behaviour and susceptibil- syncratic and difficult to capture through any kind of ity to environmental factors: data that document these textual description (let alone standard categories). The aspects are typically the least standardised in both difficulties are exemplified by the recent emergence of a their format and the materials and instruments Journal of Visualized Experiments, whose editors claim through which they are produced, which makes their that actually showing a video of how a specific experi- integration into large collections into a serious ment is performed is the only way to credibly challenge. XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] 6 Big Data & Society This signals a problem with the idea that Big Data I have shown how curators have a strong influence involves unproblematic access to all data about a given on all three stages of data journeys via model organism phenomenon – or even to at least some data about databases. 
They are tasked with selecting, formatting several aspects of a phenomenon, such as multiple and classifying data so as to mediate among the mul- data sources concerning different levels of organisation tiple standards and needs of the disparate epistemic of an organism. When considering the stage of data communities involved in biological research. They re-use, however, an even more significant challenge also play a key role in devising and adding metadata, emerges: that of data classification. Whenever data including information about experimental protocols and metadata are added to a database, curators need and relevant materials, without which it would be to tag them with keywords that will make them retriev- impossible for database users to gauge the reliability able to biologists interested in related phenomena. This and significance of the data therein. All these activities is an extremely hard task, given that curators want to require large amounts of funding for manual curation, leave the interpretation of the potential evidential value which is mostly unavailable even in areas as successful of data as open as possible to database users. Ideally, as model organism biology. They also require the sup- curators should label data according to the interests port and co-operation of the broader biological com- and terminology used by their prospective users, so munity, which is however also rare due to the pressures that a biologist is able to search for any data connected and credit systems to which experimental biologists are to her phenomenon of interest (e.g. ‘metabolism’) and subjected. Activities such as data donation and partici- find what the evidence that she is looking for is. What pation in data curation are not currently rewarded makes such a labelling process into a complex and con- within the academic system. 
Therefore, many scientists tentious endeavour is the recognition that this classifi- who run large laboratories and are responsible for their cation partly determines the ways in which data may be scientific success perceive these activities as an inexcus- used in the future – which, paradoxically, is exactly able waste of time, despite being aware of their scien- what databases are not supposed to do. In other pub- tific importance in fostering Big Data science. lications, I have described at length the functioning of We thus are confronted with a situation in which the most popular system currently used to classify data (1) there is still a large gap between the opportunities in model organism databases, the so-called ‘bio-ontol- offered by cutting-edge technologies for data dissemin- ogies’ (Leonelli, 2012b). Bio-ontologies are standard ation and the realities of biological data production and vocabularies intended to be intelligible and usable re-use; (2) adequate funding to support and develop across all the model organism communities, sub-disci- online databases is lacking, which greatly limits cur- plines and cultural locations to which data should ators’ ability to make data travel; and (3) data donation travel in order to be re-used. Given the above-men- and incorporation into databases is very limited, which tioned fragmentation of biology into myriads of epi- means that only a very small part of the data produced stemic communities with their own terminologies, within biology actually get to be assembled into Big interests and beliefs, this is a tall order. Consequently, Data collections. Hence, Big Data collections in biol- despite the widespread recognition that model organ- ogy could be viewed as very small indeed, compared to ism databases are among the best sources of Big Data the quantity and variety of data actually produced within biology, many biologists are suspicious of them, within this area of research. 
Even more problematic- principally as a result of their mistrust of the categories ally, such data collections tend to be extremely partial under which data are classified and distributed. This in the data that they include and make visible. Despite puts into question not only the idea that databases curators’ best efforts, model organism databases mostly can successfully collect Big Data on all aspects of display the outputs of rich, English-speaking labs given organisms but also the idea that they succeed in within visible and highly reputed research traditions, making such data retrievable to researchers in ways which deal with ‘tractable’ data formats. The incorpo- that foster their re-use towards making new discoveries. ration of data produced by poor or unfashionable labs, whether in developed or developing countries, is very low – also because scientists working in those condi- What does it take to assemble Big Data? tions have an even lesser chance than scientists working Implications for Big Data claims in prestigious locations to be able to contribute to the The above analysis, however brief, clearly points to the development of databases in the first place (the digital huge amount of manual labour involved in developing divide is alive and well in Big Data science, though databases for the purpose of assembling Big Data and taking on a new form). making it possible to integrate and analyse them; and to A possible moral to be drawn from this situation is the many unresolved challenges and failures plaguing that what counts as data in the first place should be that process. defined by the nature of their journeys. 
According to this view, data are whatever can be fitted into highly visible databases; and results that are hard to disseminate in this way do not count as data at all, since they are not widely accessible. I regard this view as empirically unwarranted, as it is clear from my research that there are many more results produced within the life sciences which biologists are happy to call and use as data; and that what biologists consider to be data does depend on their availability for scrutiny (it has to be possible to circulate them to at least some peers who can assess their usefulness as evidence), but not necessarily on the extent to which they are publicly available – in other words, data disseminated through paper or by email can have as much weight as data disseminated through online databases. Despite these obvious problems, however, the increasing prominence of databases as supposedly comprehensive sources of information may well lead some scientists to use them as benchmarks for what counts as data in a specific area of investigation. This tendency is reinforced by wider political and economic forces, such as governments, corporations and funding bodies, for whom the prospect of assembling centralised repositories for all available evidence on any given topic constitutes a powerful draw (Leonelli, 2013).

How do these findings compare to the claims made by Mayer-Schönberger and Cukier? For a start, I think that they cause problems to both of the corollaries to their views that I listed above. Consider first the question of sampling. Rather than disappearing as a scientific concern, looking at the ways in which data travel in biology highlights the ever-growing significance of sampling methods. Big Data that is made available through databases for future analysis turns out to represent highly selected phenomena, materials and contributions, to the exclusion of the majority of biological work. What is worse, this selection is not the result of scientific choices, which can therefore be taken into account when analysing the data. Rather, it is the serendipitous result of social, political, economic and technical factors, which determines which data get to travel in ways that are non-transparent and hard to reconstruct by biologists at the receiving end. A full account of the factors involved here far transcends the scope of this article. Still, even my brief analysis of data journeys illustrates how they depend on issues as diverse as national data donation policies (including privacy laws, in the case of biomedical data); the good-will and resources of specific data producers, as well as the ethos and visibility of the scientific traditions and environments in which they work (for instance, biologists working for private industries may not be allowed to publicly disclose their data); and the availability of well-curated databases, which in turn depends on the visibility and value placed upon them (and the data types therein) by government or relevant public/private funders. Assuming that Big Data does away with the need to consider sampling is highly problematic in such a situation. Unless the scientific system finds a way to improve the inclusivity of biological databases, they will continue to incorporate partial datasets that nevertheless play a significant role in shaping future research, thus encouraging an inherently conservative and irrational system.

This partiality also speaks to the issue of bias in research, which Mayer-Schönberger and Cukier insist can potentially be superseded in the case of Big Data science. The ways in which Big Data is assembled for further analysis clearly introduce numerous biases related to methods for data collection, storage, dissemination and visualisation. This feature is recognised by Mayer-Schönberger and Cukier, who indeed point to the fact that the scale of such data collection takes focus away from the singularity of data points: the ways in which datasets are arranged, selected, visualised and analysed become crucial to which trends and patterns emerge. However, they assume that the diversity and variability of data thus collected will be enough to counter the bias incorporated in each of these sources. In other words, Big Data is self-correcting by virtue of its very unevenness, which makes it probable that incorrect or inaccurate data are rooted out of the system because of their incongruence with other data sources. I think that my arguments about the inherent imbalances in the types and sources of data assembled within big biology cast some doubt as to whether such data collections, no matter how large, are diverse enough to counter bias in their sources. If all data sources share more or less the same biases (for instance, they all rely on microarrays produced with the same machines), there is also the chance that bias will be amplified, rather than reduced, through such Big Data.

These considerations do not make Mayer-Schönberger and Cukier's claims about the power of Big Data completely implausible, but they certainly dent the idea that Big Data is revolutionising biological research. The availability of large datasets does of course make a difference, as advertised for instance in the Fourth Paradigm volume issued by Microsoft to promote the power of data-intensive strategies (Hey et al., 2009). And yet, as I stressed above, having a lot of data is not the same as having all of them; and cultivating such an illusion of completeness is a very risky and potentially misleading strategy within biology – as most researchers whom I have interviewed over the last few years pointed out to me.

The idea that the advent of Big Data lessens the value of accurate measurements also does not seem to fit these findings. Most sciences work at a level of sophistication in which one small error can have very serious consequences (the blatant example being engineering). The constant worry about the accuracy and reliability of data is reflected in the care employed by database curators in enabling database users to assess such properties; and in the importance given by users themselves to evaluating the quality of data found on the internet. Indeed, databases are often valued because they provide means to triangulate findings coming from different sources, so as to improve the accuracy of measurement and determine which data are most reliable. Although they may often fail to do so, as I just discussed, the very fact that this is a valued feature of databases makes the claim that 'messiness' triumphs over accuracy look rather shaky.

Finally, considering data journeys prompts second thoughts about the supposed primacy of correlations over causal explanations. Big Data certainly does enable scientists to spot patterns and trends in new ways, which in turn constitutes an enormous boost to research. At the same time, biologists are rarely happy with such correlations, and instead use them as heuristics that shape the direction of research without necessarily constituting a discovery in itself. Being able to predict how an organism or ecosystem may behave is of huge importance, particularly within fields such as biomedicine or environmental science; and yet, within experimental biology, the ability to explain why certain behaviour obtains is still very highly valued – arguably over and above the ability to relate two traits to each other.

Conclusion: An alternative approach to Big Data science

In closing my discussion, I want to consider not only the specificity of this case study with respect to other parts of Big Data science but also the general lessons that may be drawn from it. Biology, and particularly the study of model organisms, represents a field where data have been produced long before the advent of computing and many data types are still generated in ways that are not digital, but rather rely on physical and localised interactions between one or more investigators and a given organic sample. Accordingly, biological data on model organisms are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with the wish to faithfully capture and represent complex, diverse and evolving organismal structures and behaviours. Readers with experience in other forms of Big Data may well be dealing with cases where both data and their prospective users are much more homogeneous, which means that their travel is less contested and tends to be curated and institutionalised in completely different ways. I view the fact that my study bears no obvious similarities to other areas of Big Data use as a strength of my approach, which indeed constitutes an invitation to disaggregate the notion of Big Data science as a homogeneous whole and instead pay attention to its specific manifestations across different contexts. At the same time, I maintain that a close examination of specialised areas can still yield general lessons, at the very least by drawing attention to aspects that need to be critically scrutinised in all instances of Big Data handling. These include, for instance, the extent to which data are – and need to be – curated before being assembled into common repositories; the decisions and investments involved in selecting data for travel, and their implications for which data get to be circulated in the first place; and the representativeness of data assembled under the heading of 'Big Data' with respect to other (and/or pre-existing) data collection activities within the same field.

At the most general level, my analysis can be used to argue that characterisations of Big Data science as comprehensive and intrinsically unbiased can be misleading rather than helpful in shaping scientific as well as public perceptions of the features, opportunities and dangers associated with data-intensive research. If one admits the plausibility of this position, then how can one better understand current developments? I here want to defend the idea that Big Data science has specific epistemological and methodological characteristics, and yet that it does not constitute a new epistemology for biology. Its strength lies in the combination of concerns that have long featured in biological research with opportunities opened up by novel communication technologies, as well as the political and economic climate in which scientific research is currently embedded. Big Data brings new salience to aspects of scientific practice which have always been vital to successful empirical research, and yet have often been overlooked by policy-makers, funders, publishers, philosophers of science and even scientists themselves, who in the past have tended to evaluate what counts as 'good science' in terms of its products (e.g. new claims about phenomena or technologies for intervention in the world) rather than in terms of the processes through which such results are eventually achieved. These aspects include the processes involved in valuing data as a key scientific resource; situating data in a context within which they can be interpreted reliably; and structuring scientific institutions and credit mechanisms so that data dissemination is supported and regulated in ways that are conducive to the advancement of both science and society.

More specifically, I want to argue that the novelty of Big Data science can be located in two key shifts characterising scientific practices over the last two decades. First is the new prominence attributed to data as commodities with high scientific, economic, political and social value (Leonelli, 2013). This has resulted in the acknowledgement of data as key scientific components, outputs in their own right that need to be widely disseminated (for instance, through so-called 'data journals' or repositories such as Figshare or more specialised databases) – which in turn is engendering significant shifts in the ways in which research is organised and assessed both within and beyond scientific institutions. Second is the emergence of a new set of methods, infrastructures and skills to handle (format, disseminate, retrieve, model and interpret) data. Hilgartner (1995) spoke about the introduction of computing and internet technologies in biology as a change of communication regime. Indeed, my analysis has emphasised how the introduction of tools such as databases, and the related opportunity to make data instantly available over the internet, is challenging the ways in which data are produced and disseminated, as well as the types of expertise relevant to analysing such data (which now need to include computing and curatorial skills, in addition to more traditional statistical and modelling abilities).

When seen through this lens, data quantity can indeed be said to make a difference to biology, but in ways that are not as revolutionary as many Big Data advocates suggest. There is strong continuity with practices of large data collection and assemblage conducted since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of inquiry in this area of science – particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of 'omics' approaches. Nevertheless, the novel recognition of the relevance of data as a research output, and the use of technologies that greatly facilitate their dissemination and re-use, provide an opportunity for all areas in biology to reinvent the exchange of scientific results and create new forms of inference and collaboration.

I end this article by suggesting a provocative explanation for what I argued is a non-revolutionary role of Big Data in biology. It seems to me that my scepticism arises because of my choice of domain, which is much narrower than Mayer-Schönberger and Cukier's commentary on the impacts of Big Data on society as a whole. Indeed, biological research may be the domain of human activity that is least affected by the emergence of Big Data and related technologies today. This is precisely because, like many other natural sciences, such as astronomy, climatology and geology, biology has a long history of engaging with large datasets; and because deepening our current understanding of the world continues to be one of the key goals of inquiry in all areas of scientific investigation. While often striving to take advantage of any available tool for the investigation of the world and produce findings of use to society, biologists are not typically content with establishing correlations. The quest for causal explanations, often involving detailed descriptions of the mechanisms and laws at play in any given situation, is not likely to lose its appeal any time soon. Whether or not it is plausible in its implementation, the Big Data epistemology outlined by Mayer-Schönberger and Cukier is thus unlikely to prove attractive to biologists, for whom correlations are typically but a starting point to a scientific investigation; and the same argument may well apply to other areas of the natural sciences. The real revolution seems more likely to centre on other areas of social life, particularly economics and politics, where the widespread use of patterns extracted from large datasets as evidence for decision-making is a relatively recent phenomenon. It is no coincidence that most of the examples given by Mayer-Schönberger and Cukier come from the industrial world, and particularly globalised sales strategies, as in the case of Amazon.com. Big Data provides new opportunities for managing goods and resources, which may be exploited to reflect and engage individual preferences and desires. By the same token, Big Data also provides as yet unexplored opportunities for manipulating and controlling individuals and communities on a large scale – a process that Raley (2013) characterised as 'dataveillance'. As demonstrated by the history of quantification techniques as surveillance and monitoring tools (Porter, 1995), data have long functioned as a way to quantify one's actions and monitor others. 'Bigness' in data production, availability and use thus needs to be contextualised and questioned as a political-economic phenomenon as much as a technical one (Davies et al., 2013).

Acknowledgements

I am grateful to the 'Sciences of the Archive' Project at the Max Planck Institute for the History of Science in Berlin, whose generous hospitality and lively intellectual atmosphere in 2014 enabled me to complete this manuscript; and to Brian Rappert for helpful comments on the manuscript. This research was funded by the UK Economic and Social Research Council (ESRC) through the ESRC Centre for Genomics and Society and grant number ES/F028180/1; the Leverhulme Trust through grant award RPG-2013-153; and the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement number 335925.

Notes

1. For a review of this literature, which includes seminal contributions such as Hacking (1992) and Rheinberger (2011), see Bogen (2010).
2. This idea, though articulated in a variety of different ways, also broadly underscores the work of Sharon Traweek (1998), Geoffrey C. Bowker (2001), Christine Borgman (2007), Karen Baker and François Millerand (2010) and Paul Edwards (2011).
3. Incidentally, the idea of comprehensiveness may be interpreted as clashing with the idea of messiness when formulated in this way. If we can have all the data on a specific phenomenon, then surely we can focus on understanding it to a high level of precision, if we so wish. I shall return to this point below.
4. Investigations of how other types of databases function in the biological and biomedical sciences, which also point to the extensive labour required to get these infrastructures to work as scientific tools, have been carried out by Hilgartner (1995), Hine (2006), Bauer (2008), Strasser (2008), Stevens (2013) and Mackenzie and McNally (2013).
5. While a full investigation has yet to appear in print, STS scholars have explored several of the non-scientific aspects affecting data circulation (e.g. Bowker, 2006; Harvey and McMeekin, 2007; Hilgartner, 2013; Martin, 2001).
6. The value of causal explanations in the life sciences is a key concern for many philosophers, particularly those interested in mechanistic explanations as a form of biological understanding (e.g. Bechtel, 2006; Craver and Darden, 2013).
7. The validity of this claim needs of course to be established through further empirical and comparative research. Also, I should note one undisputed way in which Big Data rhetoric is affecting biological research: the allocation of funding to increasingly large data consortia, to the detriment of more specialised and less data-centric areas of investigation.

References

Ankeny R and Leonelli S (in press) Valuing data in postgenomic biology: How data donation and curation practices challenge the scientific publication system. In: Stevens H and Richardson S (eds) PostGenomics. Durham: Duke University Press.
Baker KS and Millerand F (2010) Infrastructuring ecology: Challenges in achieving data sharing. In: Parker JN, Vermeulen N and Penders B (eds) Collaboration in the New Life Sciences. Farnham, UK: Ashgate, pp. 111–138.
Bastow R and Leonelli S (2010) Sustainable digital infrastructure. EMBO Reports 11(10): 730–735.
Bauer S (2008) Mining data, gathering variables, and recombining information: The flexible architecture of epidemiological studies. Studies in History and Philosophy of Biological and Biomedical Sciences 39: 415–426.
Bechtel W (2006) Discovering Cell Mechanisms: The Creation of Modern Cell Biology. Cambridge, UK: Cambridge University Press.
Bogen J (2013) Theory and observation in science. In: Zalta EN (ed.) The Stanford Encyclopedia of Philosophy (Spring 2013 Edition). Available at: http://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/ (accessed 20 February 2014).
Borgman CL (2007) Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, MA: MIT Press.
Bowker GC (2001) Biodiversity datadiversity. Social Studies of Science 30(5): 643–684.
Bowker GC (2006) Memory Practices in the Sciences. Cambridge, MA: MIT Press.
Craver CF and Darden L (2013) In Search of Biological Mechanisms: Discoveries across the Life Sciences. Chicago, IL: University of Chicago Press.
Davies G, Frow E and Leonelli S (2013) Bigger, faster, better? Rhetorics and practices of large-scale research in contemporary bioscience. BioSocieties 8(4): 386–396.
Denzin N (2006) Sociological Methods: A Sourcebook. Chicago, IL: Aldine Transaction.
Dupré J (2012) Processes of Life. Oxford, UK: Oxford University Press.
Edwards PN (2010) A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Cambridge, MA: MIT Press.
Edwards PN, Mayernik MS, Batcheller AL, et al. (2011) Science friction: Data, metadata, and collaboration. Social Studies of Science 41(5): 667–690.
Gitelman L (ed.) (2013) 'Raw Data' is an Oxymoron. Cambridge, MA: MIT Press.
Hacking I (1992) The self-vindication of the laboratory sciences. In: Pickering A (ed.) Science as Practice and Culture. Chicago, IL: University of Chicago Press, pp. 29–64.
Harvey M and McMeekin A (2007) Public or Private Economics of Knowledge? Turbulence in the Biological Sciences. Cheltenham, UK: Edward Elgar Publishing.
Hey T, Tansley S and Tolle K (eds) (2009) The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
Hilgartner S (1995) Biomolecular databases: New communication regimes for biology? Science Communication 17: 240–263.
Hilgartner S (2013) Constituting large-scale biology: Building a regime of governance in the early years of the Human Genome Project. BioSocieties 8: 397–416.
Hine C (2006) Databases as scientific instruments and their role in the ordering of scientific work. Social Studies of Science 36(2): 269–298.
Johnson K (2012) Ordering Life: Karl Jordan and the Naturalist Tradition. Baltimore, MD: Johns Hopkins University Press.
Kelty CM (2012) This is not an article: Model organism newsletters and the question of 'open science'. BioSocieties 7(2): 140–168.
Leonelli S (2010) Packaging small facts for re-use: Databases in model organism biology. In: Howlett P and Morgan MS (eds) How Well Do Facts Travel?: The Dissemination of Reliable Knowledge. Cambridge, UK: Cambridge University Press, pp. 325–348.
Leonelli S (2012a) When humans are the exception: Cross-species databases at the interface of clinical and biological research. Social Studies of Science 42(2): 214–236.
Leonelli S (2012b) Classificatory theory in data-intensive science: The case of open biomedical ontologies. International Studies in the Philosophy of Science 26(1): 47–65.
Leonelli S (2013) Why the current insistence on open access to scientific data? Big Data, knowledge production and the political economy of contemporary biology. Bulletin of Science, Technology and Society 33(1/2): 6–11.
Leonelli S and Ankeny RA (2012) Re-thinking organisms: The impact of databases on model organism biology. Studies in History and Philosophy of Biological and Biomedical Sciences 43(1): 29–36.
Leonelli S, Smirnoff N, Moore J, Cook C and Bastow R (2013) Making open data work in plant science. Journal of Experimental Botany 64(14): 4109–4117.
Levin N, Weckowska D, Castle D, et al. (in preparation) How Do Scientists Understand Openness? Assessing the Impact of UK Open Science Policies on Biological Research.
Mackenzie A and McNally R (2013) Living multiples: How large-scale scientific data-mining pursues identity and differences. Theory, Culture and Society 30: 72–91.
Martin P (2001) Genetic governance: The risks, oversight and regulation of genetic databases in the UK. New Genetics and Society 20(2): 157–183.
Mayer-Schönberger V and Cukier K (2013) Big Data: A Revolution That Will Transform How We Live, Work and Think. London: John Murray Publisher.
Müller-Wille S and Charmantier I (2012) Natural history and information overload: The case of Linnaeus. Studies in History and Philosophy of Biological and Biomedical Sciences 43: 4–15.
O'Malley M and Soyer OS (2012) The roles of integration in molecular systems biology. Studies in History and Philosophy of Biological and Biomedical Sciences 43(1): 58–68.
Porter TM (1995) Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton, NJ: Princeton University Press.
Raley R (2013) Dataveillance and countervailance. In: Gitelman L (ed.) 'Raw Data' is an Oxymoron. Cambridge, MA: MIT Press, pp. 121–146.
Rheinberger H-J (2011) Infra-experimentality: From traces to data, from data to patterning facts. History of Science 49(3): 337–348.
Rosenthal N and Ashburner M (2002) Taking stock of our models: The function and future of stock centres. Nature Reviews Genetics 3: 711–717.
Royal Society (2012) Science as an Open Enterprise. Available at: http://royalsociety.org/policy/projects/science-public-enterprise/report/ (accessed 14 January 2014).
Stein LD (2008) Towards a cyberinfrastructure for the biological sciences: Progress, visions and challenges. Nature Reviews Genetics 9(9): 678–688.
Stevens H (2013) Life Out of Sequence: Bioinformatics and the Introduction of Computers into Biology. Chicago, IL: University of Chicago Press.
Strasser BJ (2008) GenBank – Natural history in the 21st century? Science 322(5901): 537–538.
Traweek S (1998) Iconic devices: Towards an ethnography of physical images. In: Downey G and Dumit J (eds) Cyborgs and Citadels. Santa Fe, NM: The SAR Press.
Wylie A (2002) Thinking from Things: Essays in the Philosophy of Archaeology. Berkeley, CA: University of California Press.

What difference does quantity make? On the epistemology of Big Data in biology

Big Data & Society , Volume 1 (1): 1 – Apr 1, 2014

Publisher: SAGE
Copyright: © 2022 by SAGE Publications Ltd, unless otherwise noted. Manuscript content on this site is licensed under Creative Commons licenses.
ISSN: 2053-9517
eISSN: 2053-9517
DOI: 10.1177/2053951714534395

Abstract

Is Big Data science a whole new way of doing research? And what difference does data quantity make to knowledge production strategies and their outputs? I argue that the novelty of Big Data science does not lie in the sheer quantity of data involved, but rather in (1) the prominence and status acquired by data as commodity and recognised output, both within and outside of the scientific community and (2) the methods, infrastructures, technologies, skills and knowledge developed to handle data. These developments generate the impression that data-intensive research is a new mode of doing science, with its own epistemology and norms. To assess this claim, one needs to consider the ways in which data are actually disseminated and used to generate knowledge. Accordingly, this article reviews the development of sophisticated ways to disseminate, integrate and re-use data acquired on model organisms over the last three decades of work in experimental biology. I focus on online databases as prominent infrastructures set up to organise and interpret such data and examine the wealth and diversity of expertise, resources and conceptual scaffolding that such databases draw upon. This illuminates some of the conditions under which Big Data needs to be curated to support processes of discovery across biological subfields, which in turn highlights the difficulties caused by the lack of adequate curation for the vast majority of data in the life sciences. In closing, I reflect on the difference that data quantity is making to contemporary biology, the methodological and epistemic challenges of identifying and analysing data given these developments, and the opportunities and worries associated with Big Data discourse and methods.
Keywords
Big Data epistemology, data-intensive science, biology, databases, data infrastructures, data curation, model organisms

University of Exeter, UK

Corresponding author:
S Leonelli, University of Exeter, Byrne House, St Germans Road, Exeter EX4 4PJ, UK.
Email: S.Leonelli@exeter.ac.uk

Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (http://www.uk.sagepub.com/aboutus/open-access.htm).

Introduction

Big Data has become a central aspect of contemporary science and policy, due to a variety of reasons that include both techno-scientific factors and the political and economic roles played by this terminology. The idea that Big Data is ushering in a whole new way of thinking, particularly within the sciences, is rampant – as exemplified by the emergence of dedicated funding, policies and publication venues (such as this journal). This is at once fascinating and perplexing to scholars interested in the history, philosophy and social studies of science. On the one hand, there seems to be something interesting and novel happening as a consequence of Big Data techniques and communication strategies, which is, however, hard to capture with traditional notions, such as 'induction' and 'data-driven' science (partly because, as philosophers of science have long shown, there is no such thing as direct inference from data, and data interpretation typically involves the use of modelling techniques and various other kinds of conceptual and material scaffolding). On the other hand, many sciences have a long history of dealing with large quantities of data, whose size and scale vastly outstrip available strategies and technologies for data collection, dissemination and analysis (Gitelman, 2013). This is particularly evident in the life sciences, where data-gathering practices in subfields, such as natural history and taxonomy, have been at the heart of inquiry since the early modern era, and have generated problems ever since (e.g. Johnson, 2012; Müller-Wille and Charmantier, 2012).

So what is actually new here? How does Big Data science differ from other forms of inquiry, what can and cannot be learnt from Big Data, and what difference does quantity make? In this article, I discuss some of the central characteristics typically associated with Big Data, as conveniently summarised within the recent book Big Data by Mayer-Schönberger and Cukier (2013), and I scrutinise their plausibility in the case of biological research. I then argue that the novelty of Big Data science does not lie in the sheer quantity of data involved, though this certainly makes a difference to research methods and results. Rather, the novelty of Big Data science lies in (1) the prominence and status acquired by data as scientific commodity and recognised output both within and beyond the sciences and (2) the methods, infrastructures, technologies and skills developed to handle (format, disseminate, retrieve, model and interpret) data. These developments generate the impression that data-intensive research is a whole new mode of doing science, with its own epistemology and norms. I here defend the idea that in order to understand and critically evaluate this claim, one needs to analyse the ways in which data are actually disseminated and used to generate knowledge, which I refer to as 'data journeys'; and I consider the extent to which the current handling of Big Data fosters and validates its use as evidence towards new discoveries.

Accordingly, the bulk of this article reviews the development of sophisticated ways to disseminate, integrate and re-use data acquired on model organisms, such as the small plant Arabidopsis thaliana, the nematode Caenorhabditis elegans and the fruit-fly Drosophila melanogaster (including data on their ecology, metabolism, morphology and relations to other species) over the last three decades of work in experimental biology. I focus on online databases as a key example of infrastructures set up to organise and interpret such data; and on the wealth and diversity of expertise, resources and conceptual scaffolding that such databases draw upon in order to function well.

… dedicated investments than data generated in the rest of the life sciences and biomedicine. Considering the challenges encountered in disseminating this type of data thus also highlights the potential problems involved in assembling data that have not received comparable levels of care (i.e. the vast majority of biological data).

In my conclusions, I use these findings to inform a critique of the supposed revolutionary power of Big Data science. In its stead, I propose a less sensational, but arguably more realistic, reflection on the difference that data quantity is making to contemporary biological research, which stresses both continuities with and dissimilarities from previous attempts to handle large datasets. I also suggest that the natural sciences may well be the area that is least affected by Big Data, whose emergence is much more likely to affect the political and economic realms – though not necessarily for the better.

The novelty of Big Data

I will start by considering three ideas that, according to Mayer-Schönberger and Cukier (2013) among others, constitute core innovations brought in by the advent of Big Data in all realms of human activity, including science. The first idea is what I shall label comprehensiveness. This is the claim that the accumulation of large datasets enables scientists to ground their analysis on several different aspects of the same phenomenon, documented by different people at different times. According to Mayer-Schönberger and Cukier, data can become so big as to encompass all the available data on a phenomenon of interest. As a consequence, Big Data can provide a comprehensive perspective on the characteristics of that phenomenon, without needing to focus on specific details.

The second idea is that of messiness. Big Data, it is argued, pushes researchers to embrace the complex and multifaceted nature of the real world, rather than pursuing exactitude and accuracy in measurement obtained under controlled conditions. Indeed, it is impossible to assemble Big Data in ways that are guaranteed to be accurate and homogeneous. Rather, we should resign ourselves to the fact that 'Big Data is messy, varies in quality, and is distributed across countless servers
This analysis of data around the world’ (Mayer-Schonberger and Cukier, journeys through model organism databases illumin- 2013: 13) and welcome the advantages of this lack of ates some of the conditions under which the evidential exactitude: ‘With Big Data, we’ll often be satisfied with value of data posted online can be assessed and inter- a sense of general direction rather than knowing a phe- preted by researchers wishing to use those data to foster nomenon down to the inch, the penny, the atom’ discovery. At the same time, model organism biology (Mayer-Scho¨ nberger and Cukier, 2013). has been one of the best funded scientific areas over the The idea of messiness relates closely to the third last three decades, and the curation of data produced key innovation brought about by Big Data, which therein has benefited from much more attention and Mayer-Scho¨ nberger and Cukier call the ‘triumph of XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] Leonelli 3 correlations’. Correlations, defined as the statistical these corollaries into question, which in turn comprom- relationship between two data values, are notoriously ises the plausibility of the three claims that Mayer- useful as heuristic devices within the sciences. Spotting Schonberger and Cukier make about the power of the fact that when one of the data values changes the Big Data – at least when they are applied to the other is likely to change too is the starting point for realm of scientific inquiry. Let me immediately state many discoveries. 
However, scientists have typically that I do not intend this analysis to deny the wide- mistrusted correlations as a source of reliable know- spread attraction that these three ideas are generating ledge in and of themselves, chiefly because they may in many spheres of contemporary society (most obvi- be spurious – either because they result from serendip- ously, big government) and which is undoubtedly mir- ity rather than specific mechanisms or because they are rored in the ways in which biological research has been due to external factors. Big Data can override those re-organised since at least the early 2000s (which is worries. Mayer-Scho¨ nberger and Cukier (2013: 52) when technologies for the high-throughput production give the example of Amazon.com, whose astonishing of genomic data, such as sequencing machines, started expansion over the last few years is at least partly due to become widely used). Rather, I wish to shed some to their clever use of statistical correlations among the clarity on the gulf that separates the hyperbolic claims myriad of data provided by their consumer base in made about the novelty of Big Data science from the order to spot users’ preferences and successfully suggest challenges, problems and achievements characterising new items for consumption. In cases such as this, cor- data-handling practices in the everyday working life relations do indeed provide powerful knowledge that of biologists – and particularly the ways in which new was not available before. Hence, Big Data encourages computational and communication technologies such a growing respect for correlation, which comes to be as online databases are being developed so as to trans- appreciated as not only a more informative and plau- form these ideas into reality. sible form of knowledge than the more definite but also a more elusive, causal explanation. 
In the words of Big Data journeys in biology Mayer-Schonberger and Cukier (2013: 14): ‘the correla- tions may not tell us precisely why something is hap- For scientists to be able to analyse Big Data, those data pening, but they alert us that it is happening. And in have to be collected and assembled in ways that make it many situations this is good enough’. suitable to consider them as a single body of informa- These three ideas have two important corollaries, tion (O’Malley and Soyer, 2012). This is a particularly which shall constitute the main target of my analysis difficult task in the case of biological data, given the in this article. The first corollary is that Big Data makes highly fragmented and pluralist history of the field. reliance on small sampling, and even debates over sam- For a start, there are myriads of epistemic communities pling, unnecessary. This again seems to make sense within the life sciences, each of which uses a different prima facie: if we have all the data about a given phe- combination of methods, locations, materials, back- nomenon, what is the point of pondering which types of ground knowledge and interest to produce data. data might best document it? Rather, one can now skip Furthermore, there are vast differences in the types of that step and focus instead on assembling and analysing data that can be produced and the phenomena that can as much data as possible about the phenomenon of be targeted. And last but not least, the organisms and interest, so as to generate reliable knowledge about it: ecosystems on which data are being produced are both ‘Big Data gives us an especially clear view of the granu- highly variable and highly unstable, given their con- lar; subcategories and submarkets that samples can’t stant exposure to both developmental and evolutionary assess’ (Mayer-Scho¨ nberger and Cukier, 2013: 13). change. 
Given this situation, a crucial question within The second corollary is that Big Data is viewed, Big Data science concerns how one can bring such dif- through its mere existence, as countering the risk of ferent data types, coming from a variety of sources, bias in data collection and interpretation. This is under the same umbrella. because having access to large datasets makes it more To address this question, my research over the last likely that bias and error will be automatically elimi- eight years has focused on documenting and analysing nated from the system, for instance via what sociolo- the ways in which biological data – and particularly gists and philosophers call ‘triangulation’: the tendency ‘omics’ data, the quintessential form of ‘Big Data’ in of reliable data to cluster together, so that the more the life sciences – travel across research contexts, and data one has, the easier it becomes to cross-check the significant conceptual and material scaffolding used them with each other and eliminate the data that look by researchers to achieve this. For the purposes of this like outliers (Denzin, 2006; Wylie, 2002). article, I shall now focus on one case of Big Data hand- Over the next few sections, I show how an empirical ling in biology, which is arguably among the most study of how Big Data biology operates puts both of sophisticated and successful attempts made to integrate XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] 4 Big Data & Society vast quantities of data of different types within this Despite constant advances, it is still impossible to auto- field for the purposes of advancing future knowledge mate the de-contextualisation of most types of bio- production. This is the development of model organ- logical data. ism databases between 2000 and 2010. 
These data- Formatting data to ensure that they can all be ana- bases were built with the immediate goal of storing lysed as a unique body of evidence is thus exceedingly and disseminating genomic data in a formalised labour-intensive, and requires the development of manner, and the long-term vision of (1) incorporating databases with long-term funding and enough person- and integrating any data available on the biology of nel to make sure that data submission and formatting the organism in question within a single resource, is carried out adequately. Setting up such resources is including data on physiology, metabolism and even an expensive business. Indeed, debate keeps raging morphology; (2) allowing and promoting cooperation among funding agencies about who is responsible for with other community databases so that the available maintaining these infrastructures. Many model organ- datasets would eventually be comparable across spe- ism databases have struggled to attract enough fund- cies; and (3) gathering information about laboratories ing to support their de-contextualisation activities. working on each organism and the associated experi- Hence, they have resorted to include only data that mental protocols, materials and instruments, thus pro- had been already published in a scientific journal – viding a platform for community building. Particularly thus vastly restricting the amount of data hosted by useful and rich examples include FlyBase, dedicated to the database – or that were donated by data producers D. melanogaster; WormBase, focused on C. elegans; in a format compatible to the ones supported by the and The Arabidopsis Information Resource, gathering database (Bastow and Leonelli, 2010). Despite the data on A. thaliana. At the turn of the 21st century, increasing pressure to disseminate data in the public these were arguably among most sophisticated com- domain, as recently recommended by the Royal munity databases within biology. 
They have played Society (2012) and several funding bodies in the UK a particularly significant role in the development (Levin et al., in preparation), the latter category com- of online data infrastructures in this area and continue prises a very small number of researchers. Again, this to serve as reference points for the construction is largely due to the labour-intensive nature of de-con- of other databases to this day (Leonelli and textualisation processes. Researchers who wish to Ankeny, 2012). They therefore represent a good submit their data to a database need to make sure instance of infrastructure explicitly set up to support that the format that they use, and the metadata that and promote Big Data research in experimental they provide, fit existing standards – which in turn biology. means acquiring updated knowledge on what the In order to analyse how these databases enable data standards are and how they can be implemented, if journeys, I will distinguish between three stages of data at all; and taking time out of experiments and grant- travel, and briefly describe the extent to which database writing. There are presently very few incentives for curators are involved in their realisation. researchers to sacrifice research time in this way, as data donation is not acknowledged as a contribu- tion to scientific research (Ankeny and Leonelli, Stage 1: De-contextualisation in press). One of the main tasks of database curators is to de- contextualise the data that are included in their Stage 2: Re-contextualisation resources, so that they can travel outside of their ori- ginal production context and become available for inte- Once data have been de-contextualised and added to a gration with other datasets (thus forming a Big Data database, the next stage of their journey is to be re- collection). 
The process of de-contextualisation contextualised – in other words, to be adopted by a involves making sure that data are formatted in ways new research context, in which they can be integrated that make them compatible with datasets coming from with other data and possibly contribute to spotting new other sources, so that they are easy to analyse by correlations. Within biology, re-contextualisation can researchers who see them for the first time. Given the only happen if database users have access not only to above-mentioned fragmentation and diversity of data the data themselves but also to the information about production processes to be found within biology, there their provenance – typically including the specific strain tends to be no agreement on formatting standards for of organisms on which they were collected, the instru- even the most common of data types (such as metabo- ments and procedures used for data collection, and the lomics data, for instance; Leonelli et al., 2013). As a composition of the research team who originated them result, database curators often need to assess how to in the first place. This sort of information, typically deal with specific datasets on a one-to-one basis. referred to as ‘metadata’ (Edwards et al., 2011; XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] Leonelli 5 Leonelli, 2010), is indispensable to researchers wishing communicate information about research methods to evaluate the reliability and quality of data. Even and protocols. Indeed, despite the attempted implemen- more importantly, it makes the interpretation of the tation of standard descriptions such as the Minimal scientific significance of the data possible, thus enabling Information about Biological and Biomedical researchers to extract meaning from their scrutiny of Investigation, standards in this area are very under- databases. 
developed and rarely used by biologists (Leonelli, Given the challenges already linked to the de-con- 2012a). This makes the job of curators even more dif- textualisation of data, it will come as no surprise that ficult, as they are then left with the task of selecting re-contextualising them is proving even harder in bio- which metadata to insert in their database, and which logical practice. The selection and annotation of meta- format to use in order to provide such information. data is more labour-intensive than the formatting of Additionally, curators are often asked to provide a pre- data themselves, and involves the establishment of sev- liminary assessment of the quality of data, which can eral types of standards, each of which is managed by its act as a guideline for researchers interested in large own network of funding and institutions. For a start, it datasets. Curators achieve this through so-called ‘evi- presupposes reliable reference to material specimens of dence codes’ and ‘confidence rankings’ which, however, the model organisms in question. In other words, it is tend to be based on controversial assumptions (for important to standardise the materials on which data instance, the idea that data obtained through physical are produced as much as possible, so that researchers interaction with organisms are more trustworthy than working on those data in different locations can order simulation results) which may not fit all scenarios in those materials and reasonably assume that they are which data may be adopted. indeed the same materials as those from which data were originally extracted. Within model organism biol- Stage 3: Re-use ogy, the standardisation, coordination and dissemin- ation of specimens is in the hands of appositely built The final stage of data journeys that I wish to examine stock centres, which collect as many strains of organ- is that of re-use. 
One of the central themes in Big isms as possible, pair them up with datasets stored in Data research is the opportunity to re-use the same databases, and make them available for order to datasets to uncover a large number of different correl- researchers interested in the data. In the best cases, ations. After having been de-contextualised and re- this happens through the mediation of databases them- contextualised, data are therefore supposed to fulfil selves; for instance, The Arabidopsis Research their epistemic role by leading to a variety of new Database has long incorporated the option to order discoveries. From my observations above, it will materials associated with data stored therein at the already be clear that very few of the data produced same time as one is viewing the data (Rosenthal and within experimental biology make it to this stage of Ashburner, 2002). However, such a well-organised their journeys, due to the lack of standardisation in coordination between databases and stock centres is their format and production techniques, as well as the rare, particularly in cases where the specimens to be absence of stable reference materials to which data can collected and ordered are not easily transportable be meaningfully associated for re-contextualisation. items, such as seeds and worms, but organisms that Data that cannot be de-contextualised and re- are difficult and expensive to keep and disseminate, contextualised are not generally included into model such as viruses and mice. Most organisms used for organism databases, and thus do not become part of a experimental research do not even have a centralised body of Big Data from which biologically significant stock centre collecting exemplars for further dissemin- inferences can be made. Remarkably, the data that are ation. 
As a result, the data generated from these organ- most successfully assembled into big collections are isms are hard to incorporate into databases, as genomic data, such as genome sequences and micro- providing them with adequate metadata proves impos- arrays, which are produced through highly standar- sible (Leonelli, 2012a). dised technologies and are therefore easier to format Another serious challenge to the development of for travel. This is bad news for biological research metadata consists of capturing experimental protocols focused on understanding higher-level processes, such and procedures, which in biology are notoriously idio- as organismal development, behaviour and susceptibil- syncratic and difficult to capture through any kind of ity to environmental factors: data that document these textual description (let alone standard categories). The aspects are typically the least standardised in both difficulties are exemplified by the recent emergence of a their format and the materials and instruments Journal of Visualized Experiments, whose editors claim through which they are produced, which makes their that actually showing a video of how a specific experi- integration into large collections into a serious ment is performed is the only way to credibly challenge. XML Template (2014) [8.7.2014–3:53pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] 6 Big Data & Society This signals a problem with the idea that Big Data I have shown how curators have a strong influence involves unproblematic access to all data about a given on all three stages of data journeys via model organism phenomenon – or even to at least some data about databases. 
They are tasked with selecting, formatting several aspects of a phenomenon, such as multiple and classifying data so as to mediate among the mul- data sources concerning different levels of organisation tiple standards and needs of the disparate epistemic of an organism. When considering the stage of data communities involved in biological research. They re-use, however, an even more significant challenge also play a key role in devising and adding metadata, emerges: that of data classification. Whenever data including information about experimental protocols and metadata are added to a database, curators need and relevant materials, without which it would be to tag them with keywords that will make them retriev- impossible for database users to gauge the reliability able to biologists interested in related phenomena. This and significance of the data therein. All these activities is an extremely hard task, given that curators want to require large amounts of funding for manual curation, leave the interpretation of the potential evidential value which is mostly unavailable even in areas as successful of data as open as possible to database users. Ideally, as model organism biology. They also require the sup- curators should label data according to the interests port and co-operation of the broader biological com- and terminology used by their prospective users, so munity, which is however also rare due to the pressures that a biologist is able to search for any data connected and credit systems to which experimental biologists are to her phenomenon of interest (e.g. ‘metabolism’) and subjected. Activities such as data donation and partici- find what the evidence that she is looking for is. What pation in data curation are not currently rewarded makes such a labelling process into a complex and con- within the academic system. 
Therefore, many scientists tentious endeavour is the recognition that this classifi- who run large laboratories and are responsible for their cation partly determines the ways in which data may be scientific success perceive these activities as an inexcus- used in the future – which, paradoxically, is exactly able waste of time, despite being aware of their scien- what databases are not supposed to do. In other pub- tific importance in fostering Big Data science. lications, I have described at length the functioning of We thus are confronted with a situation in which the most popular system currently used to classify data (1) there is still a large gap between the opportunities in model organism databases, the so-called ‘bio-ontol- offered by cutting-edge technologies for data dissemin- ogies’ (Leonelli, 2012b). Bio-ontologies are standard ation and the realities of biological data production and vocabularies intended to be intelligible and usable re-use; (2) adequate funding to support and develop across all the model organism communities, sub-disci- online databases is lacking, which greatly limits cur- plines and cultural locations to which data should ators’ ability to make data travel; and (3) data donation travel in order to be re-used. Given the above-men- and incorporation into databases is very limited, which tioned fragmentation of biology into myriads of epi- means that only a very small part of the data produced stemic communities with their own terminologies, within biology actually get to be assembled into Big interests and beliefs, this is a tall order. Consequently, Data collections. Hence, Big Data collections in biol- despite the widespread recognition that model organ- ogy could be viewed as very small indeed, compared to ism databases are among the best sources of Big Data the quantity and variety of data actually produced within biology, many biologists are suspicious of them, within this area of research. 
Even more problematic- principally as a result of their mistrust of the categories ally, such data collections tend to be extremely partial under which data are classified and distributed. This in the data that they include and make visible. Despite puts into question not only the idea that databases curators’ best efforts, model organism databases mostly can successfully collect Big Data on all aspects of display the outputs of rich, English-speaking labs given organisms but also the idea that they succeed in within visible and highly reputed research traditions, making such data retrievable to researchers in ways which deal with ‘tractable’ data formats. The incorpo- that foster their re-use towards making new discoveries. ration of data produced by poor or unfashionable labs, whether in developed or developing countries, is very low – also because scientists working in those condi- What does it take to assemble Big Data? tions have an even lesser chance than scientists working Implications for Big Data claims in prestigious locations to be able to contribute to the The above analysis, however brief, clearly points to the development of databases in the first place (the digital huge amount of manual labour involved in developing divide is alive and well in Big Data science, though databases for the purpose of assembling Big Data and taking on a new form). making it possible to integrate and analyse them; and to A possible moral to be drawn from this situation is the many unresolved challenges and failures plaguing that what counts as data in the first place should be that process. defined by the nature of their journeys. 
According to XML Template (2014) [8.7.2014–3:54pm] [1–11] //blrnas3/cenpro/ApplicationFiles/Journals/SAGE/3B2/BDSJ/Vol00000/140004/APPFile/SG-BDSJ140004.3d (BDS) [PREPRINTER stage] Leonelli 7 this view, data are whatever can be fitted into highly types therein) by government or relevant public/private visible databases; and results that are hard to dissem- funders. Assuming that Big Data does away with the inate in this way do not count as data at all, since they need to consider sampling is highly problematic in such are not widely accessible. I regard this view as empiric- a situation. Unless the scientific system finds a way to ally unwarranted, as it is clear from my research that improve the inclusivity of biological databases, they there are many more results produced within the life will continue to incorporate partial datasets that never- sciences which biologists are happy to call and use as theless play a significant role in shaping future research, data; and that what biologists consider to be data does thus encouraging an inherently conservative and irra- depend on their availability for scrutiny (it has to be tional system. possible to circulate them to at least some peers who This partiality also speaks to the issue of bias in can assess their usefulness as evidence), but not neces- research, which Mayer-Scho¨ nberger and Cukier insist sarily on the extent to which they are publicly available can potentially be superseded in the case of Big Data – in other words, data disseminated through paper or science. The ways in which Big Data is assembled for by email can have as much weight as data disseminated further analysis clearly introduce numerous biases through online databases. Despite these obvious prob- related to methods for data collection, storage, dissem- lems, however, the increasing prominence of databases ination and visualisation. 
as supposedly comprehensive sources of information may well lead some scientists to use them as benchmarks for what counts as data in a specific area of investigation. This tendency is reinforced by wider political and economic forces, such as governments, corporations and funding bodies, for whom the prospect of assembling centralised repositories for all available evidence on any given topic constitutes a powerful draw (Leonelli, 2013).

How do these findings compare to the claims made by Mayer-Schönberger and Cukier? For a start, I think that they cause problems for both of the corollaries to their views that I listed above. Consider first the question of sampling. Rather than disappearing as a scientific concern, looking at the ways in which data travel in biology highlights the ever-growing significance of sampling methods. Big Data that is made available through databases for future analysis turns out to represent highly selected phenomena, materials and contributions, to the exclusion of the majority of biological work. What is worse, this selection is not the result of scientific choices, which could therefore be taken into account when analysing the data. Rather, it is the serendipitous result of social, political, economic and technical factors, which determine which data get to travel in ways that are non-transparent and hard to reconstruct for biologists at the receiving end. A full account of the factors involved here far transcends the scope of this article. Still, even my brief analysis of data journeys illustrates how they depend on issues as diverse as national data donation policies (including privacy laws, in the case of biomedical data); the goodwill and resources of specific data producers, as well as the ethos and visibility of the scientific traditions and environments in which they work (for instance, biologists working for private industries may not be allowed to publicly disclose their data); and the availability of well-curated databases, which in turn depends on the visibility and value placed upon them (and the data they contain).

This feature is recognised by Mayer-Schönberger and Cukier, who indeed point to the fact that the scale of such data collection takes focus away from the singularity of data points: the ways in which datasets are arranged, selected, visualised and analysed become crucial to which trends and patterns emerge. However, they assume that the diversity and variability of the data thus collected will be enough to counter the bias incorporated in each of these sources. In other words, Big Data is self-correcting by virtue of its very unevenness, which makes it probable that incorrect or inaccurate data are rooted out of the system because of their incongruence with other data sources. I think that my arguments about the inherent imbalances in the types and sources of data assembled within big biology cast some doubt on whether such data collections, no matter how large, are diverse enough to counter bias in their sources. If all data sources share more or less the same biases (for instance, they all rely on microarrays produced with the same machines), there is also the chance that bias will be amplified, rather than reduced, through such Big Data.

These considerations do not make Mayer-Schönberger and Cukier's claims about the power of Big Data completely implausible, but they certainly dent the idea that Big Data is revolutionising biological research. The availability of large datasets does of course make a difference, as advertised for instance in the Fourth Paradigm volume issued by Microsoft to promote the power of data-intensive strategies (Hey et al., 2009). And yet, as I stressed above, having a lot of data is not the same as having all of them; and cultivating such an illusion of completeness is a very risky and potentially misleading strategy within biology – as most researchers whom I have interviewed over the last few years pointed out to me.

The idea that the advent of Big Data lessens the value of accurate measurement also does not seem to fit these findings. Most sciences work at a level of sophistication at which one small error can have very serious consequences (the blatant example being engineering). The constant worry about the accuracy and reliability of data is reflected in the care employed by database curators in enabling database users to assess such properties, and in the importance given by users themselves to evaluating the quality of data found on the internet. Indeed, databases are often valued because they provide means to triangulate findings coming from different sources, so as to improve the accuracy of measurement and determine which data are most reliable. Although they may often fail to do so, as I just discussed, the very fact that this is a valued feature of databases makes the claim that 'messiness' triumphs over accuracy look rather shaky.

Finally, considering data journeys prompts second thoughts about the supposed primacy of correlations over causal explanations. Big Data certainly does enable scientists to spot patterns and trends in new ways, which in turn constitutes an enormous boost to research. At the same time, biologists are rarely happy with such correlations, and instead use them as heuristics that shape the direction of research without necessarily constituting discoveries in themselves. Being able to predict how an organism or ecosystem may behave is of huge importance, particularly within fields such as biomedicine or environmental science; and yet, within experimental biology, the ability to explain why certain behaviour obtains is still very highly valued – arguably over and above the ability to relate two traits to each other.

Conclusion: An alternative approach to Big Data science

In closing my discussion, I want to consider not only its specificity with respect to other parts of Big Data science but also the general lessons that may be drawn from such a case study. Biology, and particularly the study of model organisms, represents a field where data have been produced since long before the advent of computing, and where many data types are still generated in ways that are not digital, but rather rely on physical and localised interactions between one or more investigators and a given organic sample. Accordingly, biological data on model organisms are heterogeneous both in their content and in their format; are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and present curators with specific challenges to do with the wish to faithfully capture and represent complex, diverse and evolving organismal structures and behaviours. Readers with experience in other forms of Big Data may well be dealing with cases where both data and their prospective users are much more homogeneous, which means that their travel is less contested and tends to be curated and institutionalised in completely different ways. I view the fact that my study bears no obvious similarities to other areas of Big Data use as a strength of my approach, which indeed constitutes an invitation to disaggregate the notion of Big Data science as a homogeneous whole and instead pay attention to its specific manifestations across different contexts. At the same time, I maintain that a close examination of specialised areas can still yield general lessons, at the very least by drawing attention to aspects that need to be critically scrutinised in all instances of Big Data handling. These include, for instance, the extent to which data are – and need to be – curated before being assembled into common repositories; the decisions and investments involved in selecting data for travel, and their implications for which data get to be circulated in the first place; and the representativeness of data assembled under the heading of 'Big Data' with respect to other (and/or pre-existing) data collection activities within the same field.

At the most general level, my analysis can be used to argue that characterisations of Big Data science as comprehensive and intrinsically unbiased can be misleading rather than helpful in shaping scientific as well as public perceptions of the features, opportunities and dangers associated with data-intensive research. If one admits the plausibility of this position, then how can one better understand current developments? I here want to defend the idea that Big Data science has specific epistemological and methodological characteristics, and yet that it does not constitute a new epistemology for biology. Its strength lies in the combination of concerns that have long featured in biological research with opportunities opened up by novel communication technologies, as well as the political and economic climate in which scientific research is currently embedded. Big Data brings new salience to aspects of scientific practice which have always been vital to successful empirical research, and yet have often been overlooked by policy-makers, funders, publishers, philosophers of science and even scientists themselves, who in the past have tended to evaluate what counts as 'good science' in terms of its products (e.g. new claims about phenomena or technologies for intervention in the world) rather than in terms of the processes through which such results are eventually achieved. These aspects include the processes involved in valuing data as a key scientific resource; situating data in a context within which they can be interpreted reliably; and structuring scientific institutions and credit mechanisms so that data dissemination is supported and regulated in ways that are conducive to the advancement of both science and society.

More specifically, I want to argue that the novelty of Big Data science can be located in two key shifts characterising scientific practices over the last two decades. First is the new prominence attributed to data as commodities with high scientific, economic, political and social value (Leonelli, 2013). This has resulted in the acknowledgement of data as key scientific components, outputs in their own right that need to be widely disseminated (for instance, through so-called 'data journals' or repositories such as Figshare or more specialised databases) – which in turn is engendering significant shifts in the ways in which research is organised and assessed both within and beyond scientific institutions. Second is the emergence of a new set of methods, infrastructures and skills to handle (format, disseminate, retrieve, model and interpret) data. Hilgartner (1995) spoke about the introduction of computing and internet technologies in biology as a change of communication regime. Indeed, my analysis has emphasised how the introduction of tools such as databases, and the related opportunity to make data instantly available over the internet, is challenging the ways in which data are produced and disseminated, as well as the types of expertise relevant to analysing such data (which now need to include computing and curatorial skills, in addition to more traditional statistical and modelling abilities).

When seen through this lens, data quantity can indeed be said to make a difference to biology, but in ways that are not as revolutionary as many Big Data advocates would have it. There is strong continuity with practices of large-scale data collection and assemblage conducted since the early modern period; and the core methods and epistemic problems of biological research, including exploratory experimentation, sampling and the search for causal mechanisms, remain crucial parts of inquiry in this area of science – particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of 'omics' approaches. Nevertheless, the novel recognition of the relevance of data as a research output, and the use of technologies that greatly facilitate their dissemination and re-use, provide an opportunity for all areas of biology to reinvent the exchange of scientific results and create new forms of inference and collaboration.

I end this article by suggesting a provocative explanation for what I have argued is a non-revolutionary role of Big Data in biology. It seems to me that my scepticism arises because of my choice of domain, which is much narrower than Mayer-Schönberger and Cukier's commentary on the impacts of Big Data on society as a whole. Indeed, biological research may be the domain of human activity that is least affected by the emergence of Big Data and related technologies today. This is precisely because, like many other natural sciences, such as astronomy, climatology and geology, biology has a long history of engaging with large datasets; and because deepening our current understanding of the world continues to be one of the key goals of inquiry in all areas of scientific investigation. While often striving to take advantage of any available tool for the investigation of the world and to produce findings of use to society, biologists are not typically content with establishing correlations. The quest for causal explanations, often involving detailed descriptions of the mechanisms and laws at play in any given situation, is not likely to lose its appeal any time soon. Whether or not it is plausible in its implementation, the Big Data epistemology outlined by Mayer-Schönberger and Cukier is thus unlikely to prove attractive to biologists, for whom correlations are typically but a starting point for a scientific investigation; and the same argument may well apply to other areas of the natural sciences. The real revolution seems more likely to centre on other areas of social life, particularly economics and politics, where the widespread use of patterns extracted from large datasets as evidence for decision-making is a relatively recent phenomenon. It is no coincidence that most of the examples given by Mayer-Schönberger and Cukier come from the industrial world, and particularly from globalised sales strategies, as in the case of Amazon.com. Big Data provides new opportunities for managing goods and resources, which may be exploited to reflect and engage individual preferences and desires. By the same token, Big Data also provides as yet unexplored opportunities for manipulating and controlling individuals and communities on a large scale – a process that Raley (2013) characterised as 'dataveillance'. As demonstrated by the history of quantification techniques as surveillance and monitoring tools (Porter, 1995), data have long functioned as a way to quantify one's actions and monitor others. 'Bigness' in data production, availability and use thus needs to be contextualised and questioned as a political-economic phenomenon as much as a technical one (Davies et al., 2013).

Acknowledgements

I am grateful to the 'Sciences of the Archive' Project at the Max Planck Institute for the History of Science in Berlin, whose generous hospitality and lively intellectual atmosphere in 2014 enabled me to complete this manuscript; and to Brian Rappert for helpful comments on the manuscript. This research was funded by the UK Economic and Social Research Council (ESRC) through the ESRC Centre for Genomics and Society and grant number ES/F028180/1; the Leverhulme Trust through grant award RPG-2013-153; and the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013)/ERC grant agreement number 335925.

Notes

1. For a review of this literature, which includes seminal contributions such as Hacking (1992) and Rheinberger (2011), see Bogen (2013).
2. This idea, though articulated in a variety of different ways, broadly underscores also the work of Sharon Traweek (1998), Geoffrey C. Bowker (2001), Christine Borgman (2007), Karen Baker and François Millerand (2010) and Paul Edwards (2011).
3. Incidentally, the idea of comprehensiveness may be interpreted as clashing with the idea of messiness when formulated in this way. If we can have all the data on a specific phenomenon, then surely we can focus on understanding it to a high level of precision, if we so wish. I shall return to this point below.
4. Investigations of how other types of databases function in the biological and biomedical sciences, which also point to the extensive labour required to get these infrastructures to work as scientific tools, have been carried out by Hilgartner (1995), Hine (2006), Bauer (2008), Strasser (2008), Stevens (2013) and Mackenzie and McNally (2013).
5. While a full investigation has yet to appear in print, STS scholars have explored several of the non-scientific aspects affecting data circulation (e.g. Bowker, 2006; Harvey and McMeekin, 2007; Hilgartner, 2013; Martin, 2001).
6. The value of causal explanations in the life sciences is a key concern for many philosophers, particularly those interested in mechanistic explanations as a form of biological understanding (e.g. Bechtel, 2006; Craver and Darden, 2013).
7. The validity of this claim needs of course to be established through further empirical and comparative research. Also, I should note one undisputed way in which Big Data rhetoric is affecting biological research: the allocation of funding to increasingly large data consortia, to the detriment of more specialised and less data-centric areas of investigation.

References

Ankeny R and Leonelli S (in press) Valuing data in post-genomic biology: How data donation and curation practices challenge the scientific publication system. In: Stevens H and Richardson S (eds) PostGenomics. Durham: Duke University Press.
Baker KS and Millerand F (2010) Infrastructuring ecology: Challenges in achieving data sharing. In: Parker JN, Vermeulen N and Penders B (eds) Collaboration in the New Life Sciences. Farnham, UK: Ashgate, pp. 111–138.
Bastow R and Leonelli S (2010) Sustainable digital infrastructure. EMBO Reports 11(10): 730–735.
Bauer S (2008) Mining data, gathering variables, and recombining information: The flexible architecture of epidemiological studies. Studies in History and Philosophy of Biological and Biomedical Sciences 39: 415–426.
Bechtel W (2006) Discovering Cell Mechanisms: The Creation of Modern Cell Biology. Cambridge, UK: Cambridge University Press.
Bogen J (2013) Theory and observation in science. In: Zalta EN (ed.) The Stanford Encyclopedia of Philosophy (Spring 2013 Edition). Available at: http://plato.stanford.edu/archives/spr2013/entries/science-theory-observation/ (accessed 20 February 2014).
Borgman CL (2007) Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, MA: MIT Press.
Bowker GC (2001) Biodiversity datadiversity. Social Studies of Science 30(5): 643–684.
Bowker GC (2006) Memory Practices in the Sciences. Cambridge, MA: MIT Press.
Craver CF and Darden L (2013) In Search of Mechanisms: Discoveries across the Life Sciences. Chicago, IL: University of Chicago Press.
Davies G, Frow E and Leonelli S (2013) Bigger, faster, better? Rhetorics and practices of large-scale research in contemporary bioscience. BioSocieties 8(4): 386–396.
Denzin N (2006) Sociological Methods: A Sourcebook. Chicago, IL: Aldine Transaction.
Dupré J (2012) Processes of Life. Oxford, UK: Oxford University Press.
Edwards PN (2010) A Vast Machine: Computer Models, Climate Data, and the Politics of Global Warming. Cambridge, MA: MIT Press.
Edwards PN, Mayernik MS, Batcheller AL, et al. (2011) Science friction: Data, metadata, and collaboration. Social Studies of Science 41(5): 667–690.
Gitelman L (ed.) (2013) 'Raw Data' is an Oxymoron. Cambridge, MA: MIT Press.
Hacking I (1992) The self-vindication of the laboratory sciences. In: Pickering A (ed.) Science as Practice and Culture. Chicago, IL: University of Chicago Press, pp. 29–64.
Harvey M and McMeekin A (2007) Public or Private Economics of Knowledge? Turbulence in the Biological Sciences. Cheltenham, UK: Edward Elgar Publishing.
Hey T, Tansley S and Tolle K (eds) (2009) The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
Hilgartner S (1995) Biomolecular databases: New communication regimes for biology? Science Communication 17: 240–263.
Hilgartner S (2013) Constituting large-scale biology: Building a regime of governance in the early years of the Human Genome Project. BioSocieties 8: 397–416.
Hine C (2006) Databases as scientific instruments and their role in the ordering of scientific work. Social Studies of Science 36(2): 269–298.
Johnson K (2012) Ordering Life: Karl Jordan and the Naturalist Tradition. Baltimore, MD: Johns Hopkins University Press.
Kelty CM (2012) This is not an article: Model organism newsletters and the question of 'open science'. BioSocieties 7(2): 140–168.
Leonelli S and Ankeny RA (2012) Re-thinking organisms: The impact of databases on model organism biology. Studies in History and Philosophy of Biological and Biomedical Sciences 43(1): 29–36.
Leonelli S (2010) Packaging small facts for re-use: Databases in model organism biology. In: Howlett P and Morgan MS (eds) How Well Do Facts Travel? The Dissemination of Reliable Knowledge. Cambridge, UK: Cambridge University Press, pp. 325–348.
Leonelli S (2012a) When humans are the exception: Cross-species databases at the interface of clinical and biological research. Social Studies of Science 42(2): 214–236.
Leonelli S (2012b) Classificatory theory in data-intensive science: The case of open biomedical ontologies. International Studies in the Philosophy of Science 26(1): 47–65.
Leonelli S (2013) Why the current insistence on open access to scientific data? Big Data, knowledge production and the political economy of contemporary biology. Bulletin of Science, Technology and Society 33(1/2): 6–11.
Leonelli S, Smirnoff N, Moore J, et al. (2013) Making open data work in plant science. Journal of Experimental Botany 64(14): 4109–4117.
Levin N, Weckowska D, Castle D, et al. (in preparation) How Do Scientists Understand Openness? Assessing the Impact of UK Open Science Policies on Biological Research.
Mackenzie A and McNally R (2013) Living multiples: How large-scale scientific data-mining pursues identity and differences. Theory, Culture and Society 30: 72–91.
Martin P (2001) Genetic governance: The risks, oversight and regulation of genetic databases in the UK. New Genetics and Society 20(2): 157–183.
Mayer-Schönberger V and Cukier K (2013) Big Data: A Revolution That Will Transform How We Live, Work and Think. London: John Murray.
Müller-Wille S and Charmantier I (2012) Natural history and information overload: The case of Linnaeus. Studies in History and Philosophy of Biological and Biomedical Sciences 43: 4–15.
O'Malley M and Soyer OS (2012) The roles of integration in molecular systems biology. Studies in History and Philosophy of Biological and Biomedical Sciences 43(1): 58–68.
Porter TM (1995) Trust in Numbers: The Pursuit of Objectivity in Science and Public Life. Princeton, NJ: Princeton University Press.
Raley R (2013) Dataveillance and countervailance. In: Gitelman L (ed.) 'Raw Data' is an Oxymoron. Cambridge, MA: MIT Press, pp. 121–146.
Rheinberger H-J (2011) Infra-experimentality: From traces to data, from data to patterning facts. History of Science 49(3): 337–348.
Rosenthal N and Ashburner M (2002) Taking stock of our models: The function and future of stock centres. Nature Reviews Genetics 3: 711–717.
Royal Society (2012) Science as an Open Enterprise. Available at: http://royalsociety.org/policy/projects/science-public-enterprise/report/ (accessed 14 January 2014).
Stein LD (2008) Towards a cyberinfrastructure for the biological sciences: Progress, visions and challenges. Nature Reviews Genetics 9(9): 678–688.
Stevens H (2013) Life Out of Sequence: A Data-Driven History of Bioinformatics. Chicago, IL: University of Chicago Press.
Strasser BJ (2008) GenBank – Natural history in the 21st century? Science 322(5901): 537–538.
Traweek S (1998) Iconic devices: Towards an ethnography of physical images. In: Downey G and Dumit J (eds) Cyborgs and Citadels. Santa Fe, NM: The SAR Press.
Wylie A (2002) Thinking from Things: Essays in the Philosophy of Archaeology. Berkeley, CA: University of California Press.

Journal: Big Data & Society (SAGE)

Published: Apr 1, 2014

Keywords: Big Data epistemology; data-intensive science; biology; databases; data infrastructures; data curation; model organisms
