Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets:

What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets: Big Data has been variously defined in the literature. In the main, definitions suggest that Big Data possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes Big Data, Big Data?’, applying Kitchin’s taxonomy of seven Big Data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute Big Data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess either volume and/or variety. Instead, there are multiple forms of Big Data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that Big Data as an analytical category needs to be unpacked, with the genus of Big Data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes Big Data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world. Keywords Big Data, ontology, taxonomy, types, characteristics Introduction . relationality (containing common fields that enable The etymology of ‘Big Data’ has been traced to the the conjoining of different datasets) (Boyd and mid-1990s, first used by John Mashey, retired former Crawford, 2012); Chief Scientist at Silicon Graphics, to refer to handling . extensionality (can add/change new fields easily) and and analysis of massive datasets (Diebold, 2012). In scaleability (can expand in size rapidly) (Marz and 2001, Doug Laney detailed that Big Data were charac- Warren, 2012); terised by three traits: . veracity (the data can be messy, noisy and contain uncertainty and error) (Marr, 2014); . volume (consisting of enormous quantities of data); . value (many insights can be extracted and the data . velocity (created in real-time) and; repurposed) (Marr, 2014); . variety (being structured, semi-structured and . variability (data whose meaning can be constantly unstructured). shifting in relation to the context in which they are generated) (McNulty, 2014). Since then, others have attributed other qualities to Big Data, including: NIRSA, Maynooth University, County Kildare, Ireland UCD School of Computer Science, University College Dublin, Dublin, . exhaustivity (an entire system is captured, n¼ all, Ireland rather than being sampled) (Mayer-Schonberger and Cukier, 2013); Corresponding author: . fine-grained (in resolution) and uniquely indexical Rob Kitchin, NIRSA, Maynooth University, County Kildare, Ireland. (in identification) (Dodge and Kitchin, 2005); Email: rob.kitchin@nuim.ie Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http:// www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access- at-sage). 2 Big Data & Society Uprichard (2013) notes several other v-words that small data generated through state-administered sur- have also been used to describe Big Data, including: veys and administrative data. Kitchin (2015) extended ‘versatility, volatility, virtuosity, vitality, visionary, their original table, adding three further fields to their vigour, viability, vibrancy... virility... valueless, vam- 14 points of comparison (see Table 2). Table 2 makes it pire-like, venomous, vulgar, violating and very violent.’ clear that Big Data have a very different set of charac- More recently, Lupton (2015) has suggested dropping teristics to more traditional forms of small data across a v-words to adopt p-words to describe Big Data, detail- range of attributes which extend beyond the data’s ing 13: portentous, perverse, personal, productive, par- essential qualities (including methods, sampling, data tial, practices, predictive, political, provocative, quality, repurposing, management). privacy, polyvalent, polymorphous and playful. While In contrast, rather than focusing on the ontological useful entry points into thinking critically about Big characteristics of what constitutes the nature of Big Data, these additional v-words and new p-words are Data, some define Big Data with respect to the compu- often descriptive of a broad set of issues associated tational difficulties of processing and analyzing it, or in with Big Data, rather than characterising the onto- storing it on a single machine (Strom, 2012). For exam- logical traits of the data themselves. ple, Batty (2015) contends that Big Data challenges Based on a review of definitions of Big Data, Kitchin conventional statistical and visualization techniques, (2013, 2014) contends that Big Data are qualitatively and push the limits of computational power to analyze different to traditional, small data along seven axes (see them. He thus contends that we have always had Big Table 1). He details that, until recently, science has Data, with the massive datasets presently being pro- progressed using small data that have been produced duced merely the latest form of Big Data, which require in tightly controlled ways using sampling techniques new technique to process, store and make sense of that limit their scope, temporality and size, and are them. Murthy et al. (2014) categorises Big Data using quite inflexible in their administration and generation. a six-fold taxonomy that likewise focuses on its hand- While some of these small datasets are very large in ling and processing rather than key traits: (1) data ((a) size, they do not possess the other characteristics of temporal latency for analysis: real-time, near real-time, Big Data. For example, national censuses are typically batch; and (b) structure: structured, semi-structured, generated once every 10 years, asking just c.30 struc- unstructured); (2) compute infrastructure (batch or streaming); (3) storage infrastructure (SQL, NoSQL, tured questions, and once they are in the process of being administered it is impossible to tweak or add/ NewSQL); (4) analysis (supervised, semi-supervised, remove questions. In contrast, Big Data are generated unsupervised or re-enforcement machine learning; continuously and are more flexible and scalable in their data mining; statistical techniques); (5) visualisation production. For example, in 2014, Facebook was pro- (maps, abstract, interactive, real-time); and (6) privacy cessing 10 billion messages, 4.5 billion ‘Like’ actions, and security (data privacy, management, security). and 350 million photo uploads per day (Marr, 2014), Regardless of how Big Data have been defined it is and they were constantly refining and tweaking their clear that, despite widespread use, the term is still rather underlying algorithms and terms and conditions, chan- loose in its ontological framing and definition, and it is ging what and how data were generated (Bucher, 2012). being used as a catch-all label for a wide selection of Similarly, Florescu et al. (2014), in a study examin- data. The result is that these data are characterised ing the potential for Big Data to be used to generate as holding similar traits to each other and the term new official statistics, details how Big Data differs from ‘Big Data’ is treated like an amorphous entity that lacks conceptual clarity. However, for those who work with and analyze datasets that have been labelled Table 1. Comparing small and Big Data. as Big Data it is apparent that, although they undoubt- Small data Big Data edly share many traits, they also vary in their charac- teristics and nature. Not all of the data types that have Volume Limited to large Very large been declared as constituting Big Data have volume, Velocity Slow, freeze-framed/ Fast, continuous velocity or variety, let alone the other characteristics bundled noted above. Nor do they all overly challenge conven- Variety Limited to wide Wide tional statistical techniques or computational power in Exhaustivity Samples Entire populations making sense of them. In other words, there are mul- Resolution and Course and weak Tight and strong tiple forms of Big Data. However, while there has been indexicality to tight and strong some rudimentary work to identify the ‘genus’ of Big Relationality Weak to strong Strong Data, as detailed above, there has been no attempt to Extensionality and Low to middling High separate out its various ‘species’ and their defining scalability attributes. Kitchin and McArdle 3 Table 2. Characteristics of survey, administrative and Big Data. Survey data Administrative data Big Data Specification Statistical products specified Statistical products specified Statistical products specified ex-ante ex-post ex-post Purpose Designed for statistical Designed to deliver/monitor a Organic (not designed) or purposes service or program designed for other purposes Byproducts Lower potential for by-products Higher potential for by-products Higher potential for by-products Methods Classical statistical methods Classical statistical methods Classical statistical methods not available available, usually depending on always available the specific data Structure Structured A certain level of data structure, A certain level of data structure, depending on the objective of depending on the source of data collection information Comparability Weaker comparability Weaker comparability between Potentially greater comparability between countries countries between countries Representativeness Representativeness and Representativeness and coverage Representativeness and coverage coverage known by design often known difficult to assess Bias Not biased Possibly biased Unknown and possibly biased Error Typical types of errors Typical types of errors Both sampling and non-sampling (sampling and (non-sampling errors, errors (e.g., missing data, non-sampling errors) e.g., missing data, reporting reporting errors and outliers) errors and outliers) although possibly less fre- quently occurring, and new types of errors Persistence Persistent Possibly less persistent Less persistent Volume Manageable volume Manageable volume Huge volume Timeliness Slower Potentially faster Potentially much faster Cost Expensive Inexpensive Potentially inexpensive Burden High burden No incremental burden No incremental burden Geography National, defined National or extent of program National, international, poten- and service tially spatially uneven Demographics All or targeted Service users or program Consumers who use a service, recipients pass a sensor, contribute to a project, etc. Intellectual Property State State State/Private sector/ User-created rights. Source: Florescu et al. (2014: 2–3) and Kitchin (2015) In this paper, we examine the ontology of Big Data datasets are used for illustrative purposes and were and its definitional boundaries, exploring the question selected due to our familiarity with them. We start by ‘what makes Big Data, Big Data?’ We employ Kitchin’s examining each of the parameters detailed by Kitchin (2013) taxonomy of the characteristics of Big Data with respect to the 26 different data types, in effect (Table 1) to examine the nature of 26 specific types of working down the columns in Table 3. We then exam- data, drawn from seven domains (mobile communica- ine the rows to consider how these parameters are com- tion; websites; social media/crowdsourcing; sensors; bined with respect to the data types to produce multiple cameras/lasers; transaction process generated data; forms of Big Data. and administrative), that have been labelled in the lit- Our aim in performing this analysis is not to deter- erature as Big Data (see Table 3). These 26 types of mine a tightly constrained definition of Big Data – to data are by no means exhaustive of all types of Big definitively set out precisely the nature of Big Data and Data, for example there are a multitude of Big Data their essential qualities – but rather to explore the par- generated within scientific experiments, science comput- ameters, limits, and ‘species’ of Big Data. The analysis ing, and industrial manufacturing. Rather, these 26 is thus an exercise in boundary work designed to test 4 Big Data & Society Table 3. Ontological traits of Big Data. Velocity Volume frequency Volume Volume (TBs, of handling, (number per PBs, Velocity frequency recording, Data type of records) record etc.) of generation publishing Variety Exhaustivity Resolution Indexical Relational Extensionality Scalable Mobile Mobile phone High Low High Real-time At time of Structured n¼ all Fine-grained Yes Yes No Yes communication data constant generation (bkgrd comms), real-time sporadic (at use) App data High Low High Real-time constant At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes (bkgrd comms), generation unstructured real-time sporadic (at use) Websites Web searches High Low High Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes generation unstructured Scraped websites High Medium High Real-time sporadic At time of Semi-structured n¼ all Fine-grained Yes Yes Yes Yes generation Clickstream High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes Yes Yes generation Social media/ Social media High Medium High Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes Crowdsourcing (full pipe) generation unstructured (e.g. Twitter) Social media Low Medium Medium Real-time sporadic At time of Structured & Sampled Fine-grained Yes Yes Yes Yes (spritzer) generation unstructured (e.g. twitter) Picture sharing/ High High High Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes social media generation unstructured (flickr, Panoramio, Instagram) Collaborative Low Low Low Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes mapping generation semi- platforms (open to structured (OpenStreetMap, editing) Wikimapia) Citizen science High Low Medium Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No Yes (wunderground) or real-time generation sporadic Sensors Traffic loops Medium Low Low Real-time constant At time of Structured n¼ all Aggregated Yes Yes No No generation (continued) Kitchin and McArdle 5 Table 3. Continued Velocity Volume frequency Volume Volume (TBs, of handling, (number per PBs, Velocity frequency recording, Data type of records) record etc.) of generation publishing Variety Exhaustivity Resolution Indexical Relational Extensionality Scalable Automatic Number Medium Low Medium Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No Yes Plate Readers generation (ANPR) Real-time passenger Medium Low Low Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No No info (RTPI) generation Smart meters High Low Medium Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No No generation Pollution and Medium Low Low Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No No sound generation sensors Satellite images Medium High High Real-time constant At time of Unstructured n¼ all, Fine-grained Yes Yes No No generation delayed repeat of coverage Cameras/Lasers Digital CCTV High High High Real-time constant At time of Unstructured n¼ all Fine-grained Yes Yes No No generation Lidar mapping High High High Real-time constant Delayed and Structured n¼ all, but Fine-grained Yes Yes No No (by HERE) (when in use) consolidated no or (daily) infrequent repeat coverage Transactions Supermarket High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes No Yes of process scanner and generation generated data sales data Immigration Low High High Real-time sporadic At time of Structured n¼ all, Fine-grained Yes Yes No Yes (inc. photo, generation infrequent fingerprint scan) repeat coverage Flight movements High Low High Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No Yes generation Credit card data High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes No Yes generation Stock market trades High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes No Yes generation Administrative House price register Low Low Low Real-time sporadic Delayed and Structured n¼ all Fine-grained Yes Yes No Yes consolidated (monthly) (continued) 6 Big Data & Society the edges of what might be considered Big Data and to internally tease apart what is presently an amorphous concept to reveal its inner diversity – its multiple forms. In other words, we consider in much more detail than previous studies the ontology of Big Data. This is an important exercise, we believe, as it enables the produc- tion of much more conceptual clarity about what con- stitutes Big Data, especially given the ongoing confusion over its traits and its amorphous description. In turn, acknowledging and detailing the various types of Big Data facilitates a much more nuanced under- standing of its forms, its value, and how they might be analyzed and for what purposes. The parameters of Big Data In Table 3 we have mapped 26 sources of data, defined as Big Data within the literature, against the traits iden- tified by Kitchin (2014) in Table 1. Through the process of evaluating each dataset against each characteristic it quickly became apparent that the categories of volume and velocity needed to be further teased apart. Similarly, while resolution and indexicality, and extensionality and scalability, are combined into two characteristics in Table 1, we consider them separately in Table 3 given that they are not synonymous traits. In the context of Big Data, volume generally refers to the storage space required to record and store data. Big Data, it is commonly stated, typically require tera- 40 50 bytes (2 bytes) or petabytes (2 bytes) of storage space (The Economist, 2010), far more than an average desktop computer can provide, with the data often stored in the cloud across several servers and locations. However, when we examine our 26 datasets it is clear that some of them, for example pollution and sound sensors, require very little storage space, maybe produ- cing a gigabyte (2 bytes) of data per annum (easily storable on a datastick). Although each sensor might be producing a steady stream of readings, say once per minute, each record is very small, consisting of just a few kilobytes (2 bytes). Even summed over the course of a year, the sensor dataset would be relatively small in stored volume, in fact much smaller than many ‘small datasets’ such as a Census. As detailed in Table 3, we have thus teased apart volume into three dimensions: (1) the number of records (which is reflective of velocity and the number of generating devices), (2) the storage required per record, and (3) the total storage required (effectively the sum of the first two). Using this threefold classification of volume it is clear that the 26 Big Data sets have differing volume characteristics. Automated forms of Big Data gener- ation, where records are created on a continual basis every few seconds or minutes, often across multiple sites or individuals, produce very large numbers of Table 3. Continued Velocity Volume frequency Volume Volume (TBs, of handling, (number per PBs, Velocity frequency recording, Data type of records) record etc.) of generation publishing Variety Exhaustivity Resolution Indexical Relational Extensionality Scalable Planning permissions Low Low Low Real-time sporadic Delayed and Structured n¼ all, but Fine-grained Yes Yes No Yes consolidated no or (weekly) infrequent repeat coverage Employment register Low Low Low Real-time sporadic Delayed and Structured n¼ all Aggregated Yes Yes No Yes (at release) consolidated (monthly) bkgrd comms: constant background passive monitoring. Kitchin and McArdle 7 records. Human-mediated forms, such as creating A single LIDAR scan generally produces a million administrative records (immigration, unemployment plus points of data (Cahalane et al., 2012). At the end registration), might have a steady stream of new rec- of every day the local storage device is removed from ords, usually generated from a constrained number of the vehicle performing the scan and its data transferred sites (a small number of entry points to a country, to a data centre. Similarly, unemployment data are rec- unemployment offices), that produce much lower vol- orded at the time a person updates their status on the umes than automated systems. Likewise, while each system, but the overall unemployment rate is published sensor record is generally very small in file size, imagery monthly and in an aggregated form. In some cases, data (such as streaming video, photographs and satel- even once the data are generated they are open to fur- lite images) are typically quite large in file size, meaning ther editing, as with crowdsourced data within that relatively low numbers of records soon scale into Wikipedia or OpenStreetMap, with the edits also rec- huge storage requirements. In many cases, although the orded in real-time and becoming part of the dataset. volume per record is low, the sheer number of devices Perhaps not unsurprisingly, there is a fair range of generating data produce very large storage volumes. variety in the data form across our 26 datasets, includ- For example, the million customers flowing through ing structured, semi-structured and unstructured data thousands of Walmart stores every hour generate 2.5 types. Of all the characteristics attributed to Big Data petabytes of transaction data (Open Data Center this seems to us to be the weakest attribute. Indeed, Alliance, 2012). small data are also highly heterogeneous in nature, Velocity is considered a key attribute of Big Data. especially datasets common to humanities and social Rather than data being occasionally sampled (either on sciences where the handling and analyzing of qualita- a one-off basis or with a large temporal gap between tive data (text, images, etc.) is normal. Our suspicion is samples), Big Data are produced much more continu- that this characteristic was attributed to Big Data ally. When we examined our datasets, however, it because those scientists who first coined the term were became apparent that there are two kinds of velocity used to handling structured data exclusively but were with respect to Big Data: (1) frequency of generation; starting to encounter semi-structured and unstructured (2) frequency of handling, recording, and publishing; data as new data generation and collection systems and that the 26 datasets varied with respect to these were deployed. As noted, small datasets consist of samples of repre- two traits. In terms of frequency of generation, data can be generated in real-time constantly, for example sentative data harvested from the total sum of poten- recording a reading every 30 seconds or verifying loca- tially available data. Sampling is typically used because tion every 4 minutes (as many mobile phone apps do), it is unfeasible in terms of time and resources to harvest or in real-time sporadically, for example at the point of a full dataset. In contrast, Big Data seeks to capture the use, such as clickstream data being generated in real- entire population (n¼ all) within a system, rather than time but only while a user is clicking through websites, a sample. In other words, Twitter captures all tweets or an immigration system recording only when some- made by all tweeters, plus their associated 32 fields of one is scanning their documents. metadata, not a sample of tweets or tweeters. Similarly, In some cases, as the data are recorded, the system is a set of pollution sensors is seeking to create a continu- updated in real-time and the new data are also pub- ous, longitudinal record of readings, captured every few lished in real-time (with only a fraction of delay seconds, from a fixed network of sensors. Likewise, a between the two). For example, as a tweet is tweeted credit card company or the stock market seeks to it is recorded in Twitter’s data architecture and micro- record every single transaction and alter credit balances seconds later it is published into user timelines. Here, accordingly. even though the data generation is sporadic at the point All our 26 datasets hold the characteristic of n¼ all, of generation (each user might only produce a couple of except for the spritzer of Twitter; this is the sample of tweets a day), it is far from the case at the point of tweets harvested from the full fire hose that Twitter recording by the company (the millions of Twitter shares with some researchers. It is important to note, users collectively generate thousands of tweets per however, that the temporality of n¼ all can vary. For second, meaning that the company databases and ser- example, an immigration system at an airport aims to vers are constantly handling a data deluge). In other capture details about all passengers passing through it, cases, the data are recorded in real-time, but their trans- but a passenger might only pass through that system mission to central servers and/or their processing or infrequently. In the case of a satellite, it might capture publication is delayed. For example, the HERE imagery of the whole planet, but it only flies over the LIDAR scanning project involves 200 cars driving same portion of the Earth every set number of days. around cities taking a LIDAR scan every second to Likewise, the HERE LIDAR project aims to scan every produce high definition mapping data (Nokia, 2015). road in every country, but each street is only surveyed 8 Big Data & Society once and is unlikely to be rescanned for several years. of the 26 datasets are generated and stored within rap- In other words, Big Data systems seek to capture idly scaleable systems, but not others. n¼ all, but capturing n¼ all varies with respect to what is being measured and their spatial coverage and The forms and boundaries of Big Data temporal register. As with exhaustivity, all 26 datasets hold the traits of What is clear from examining each Big Data parameter fine-grained resolution (with the exception of employ- with respect to the 26 datasets is that there is no one ment data, which is fine-grained in the database but is characteristic profile that all Big Data fit. Big Data does published in aggregated form), indexicality and rela- not possess all of the seven traits detailed by Kitchin tionality. In each case, the data are accompanied by (2013, 2014). Indeed, not all data termed Big Data in metadata that uniquely identifies the device, site and the literature possess the 3Vs of volume, velocity and time/date of generation, along with other characteris- variety. If one looks across the rows in Table 3 then the tics such as device settings. These metadata inherently diversity of Big Data becomes clear, with datasets pos- produces relationality, enabling data from the same sessing differing profiles, especially with regard to and related devices but generated at different times/ volume, velocity, variety, extensionality and scalability. locales to be linked, but also entirely different datasets Big Data are clearly then not an amorphous category that share some common fields to be tied together and and there are certainly different ‘species’ of Big Data. relationships between datasets to be identified. Examining these profiles starts to suggest the bound- However, the data themselves might not provide un- ary markers of what constitutes Big Data. Indeed, it ambiguous relationality or be easily machine-readable. may be the case that some of our 26 datasets might For example, a tweet is composed of text and/or an not be considered Big Data by some. Or it might be image which requires either data analytics or human that some consider certain datasets to constitute Big interpretation to identify the content and meaning of Data that we would not, for example, national censuses the tweet. Similarly, a CCTV feed will be indexical to a (which have volume, exhaustivity, resolution, indexical- camera and be time, date, and place stamped, but the ity and relationality, but no velocity (generated once content of the feed will either require image recognition every 10 years and taking 1–2 years to process), no to identify content (e.g., using facial recognition soft- extensionality or scaleability, and are published in ware) or operator recognition to make the image con- aggregated form). It seems to us, based on the datasets tent indexical. that we have examined, that the key boundary charac- Extensionality and scaleability refer to the flexibility teristics of Big Data, which together differentiate it of data generation. A system that is highly adaptable in from small data, are velocity (both frequency of gener- terms of what data are generated is said to possess ation, and frequency of handling, recording, and pub- strong extensionality (Marz and Warren, 2012). For lishing) and exhaustivity. Small data are slow and example, web-based and mobile apps are constantly sampled. Big Data are quick and n¼ all. Small data tweaking their designs and underlying algorithms, per- can hold all of the other characteristics (volume, reso- forming on-the-fly adaptive testing and rollout, as well lution, indexicality, relationality, extensionality and as altering their terms and conditions and the metadata flexibility) and still be considered small in nature. It is they capture. The result is the data they generate are the qualities of velocity and exhaustivity which set Big changeable, with new fields being added and removed Data apart and are responsible for so much recent as required. However, this is not a trait common to all attention and investment in Big Data ventures. While big datasets. For example, many systems, such as smart some datasets have possessed these two qualities for meters, credit card readers and sensor-networks, are some time, such as stock market and weather data, it seeking rigid continuity in what data are generated to is only in the past 15 years that these characteristics produce robust, comparable longitudinal datasets. have become much more common and routine. Scaleability refers to the extent to which a system can These two traits, we believe, act as key Big Data cope with varying data flow. Social media platforms boundary markers. In our own analysis of Table 3 it such as Twitter need to be able to cope with ebbs and was the administrative datasets of the house price regis- surges in data generation, scaling from managing a few ter, planning permissions and unemployment, as well as thousand tweets at certain times of the day to tens of the satellite and LIDAR imagery that provoked the thousands during popular live events. Such rapid scal- most discussion (we quite quickly rejected Census ing is not required in systems that have a constant flow data, which we had initially included, due to its very of data, such as a sensor network that produces data at long temporal gap in data generation). In the case of set intervals (the timing can be altered, but the flow the administrative data, they are produced in real-time remains constant rather than surging). As such, some as entries are made into the system (as house sales are Kitchin and McArdle 9 completed, planning permissions sought, and unem- across a whole system can produce a deluge of data, ployed people sign-on). However, the publishing of especially if each record is large in size. In some cases, the data is either weekly or monthly, and in the case however, the flow can be generated in real-time (e.g., of unemployment released in an aggregated form. Do every 30 seconds), but because the system is small (e.g., data that are generated in real-time, but released 30 sound sensors across a city) and each record is small monthly and in an aggregated form constitute Big in size, the storage volume is relatively small. The data Data? Certainly they are at the point of collection, generated by each sensor are also highly structured. but what about at the point of publishing where they Despite the lack of volume and variety, such sensor lack velocity? For some, such administrative data are data are widely considered Big Data. Likewise, variety Big Data (Economic and Social Research Council is not a distinguishing characteristic because small data (ESRC), 2013), for others they are more marginal, possesses just as much variety as Big Data. and the key element in doubt is temporality. One month’s delay is still much quicker than most adminis- Conclusion trative data that are published quarterly or annually, and the dataset still holds most of the other character- To date, there has been very little work that has sought istics of Big Data such as exhaustivity (the data refers to examine in detail the ontology of Big Data, other to all houses sold, all planning permissions sought, and than to suggest that they are data that possess certain all unemployed people), but it is nonetheless far slower broad characteristics (volume, velocity, variety, exhaus- than data published in real-time. tivity, etc.). Indeed, most studies that discuss Big Data Our discussion of satellite imagery and LIDAR treat the term as a catch-all, amorphous phrase that focused in particular on coverage and repetition of assumes that all Big Data share a set of general traits. gaze. In other forms of Big Data, what is being mea- Through an analysis that applied Kitchin’s (2013, 2014) sured remains quite constant, with the gaze and the typology of Big Data traits to 26 datasets our study object under surveillance relatively fixed. In social reveals that Big Data do not all share the same charac- media it is the contributions of every user, for credit teristics and that there are multiple forms of Big Data. cards it is the transactions of every card holder, for Indeed, our analysis demonstrates that only a handful supermarkets it is the purchases of every shopper. of the 26 datasets we examined held all seven traits However, the gaze of the satellite imagery moves, identified by Kitchin. That said, it is the case that for only returning to capture the same terrain after a set Big Data to be classified as Big Data they do need to number of days. Nonetheless the surface of the entire possess the majority of the traits set out in Table 1, of planet is being repeatedly generated and data are pro- which velocity and exhaustivity are the most important. cessed constantly. In the case of LIDAR, that repeti- Volume and variety, we contend, are not necessary con- tion is missing. The aim is to scan every road on the ditions of Big Data and without velocity and exhaus- planet, but to do so only once. The data are generated tivity are not qualifying criteria. In other words, the in real-time, and are voluminous, indexical, relational, 3Vs meme is actually false and misleading and along and they produce exhaustive spatial coverage (the aim with the term itself is partially to blame for the confu- is to create a 3D model of the whole road network and sion over the definitional boundaries of Big Data. the architecture bordering this network) though no lon- The observation that there are multiple forms of Big gitudinal data of the same places. In both cases, most Data is perhaps no surprise given the wide variety of would agree that satellite imagery and LIDAR scans small data, the varying nature of the systems that gen- constitute Big Data, but they are exhaustive in a par- erate Big Data, the differing purposes for which the ticular way which distinguishes them from other types data are generated, and the differing forms of the of Big Data. The same would also be the case with data generated. Nonetheless it is an observation that respect to large scientific experiments such as data gen- needs highlighting given that it has so far been ignored erated by the Large Hadron Collider. or taken for granted in the literature. Our analysis has Interestingly, given the meme of the 3Vs of Big Data, revealed that Big Data as an analytical category needs having examined 26 types of Big Data, our conclusion to be unpacked, with the ‘genus’ of Big Data further is that two of those Vs – volume and variety – are not delineated and its various ‘species’ identified. This is key defining characteristics of Big Data. It is certainly important work if we are to better understand what it the case that Big Data often consists of very large num- is that we are talking about when we discuss and ana- bers of records and the storage volume required to store lyze Big Data, and if we want to produce more nuanced them is significant, however, this is not a necessary con- insights about and from the data. It is only through dition of Big Data. Rather volume is a by-product of such ontological work, focused on shifting from velocity and exhaustivity: the real-time flow of data broad generalities to specific qualities, that we will 10 Big Data & Society Kitchin R (2014) The Data Revolution: Big Data, Open Data, gain conceptual clarity about what constitutes Big Data Data Infrastructures and Their Consequences. London: and formulate how best to make sense of it and how it Sage. might be used to make sense of the world. Kitchin R (2015) The opportunities, challenges and risks of big data for official statistics. Statistical Journal of the Declaration of conflicting interests International Association of Official Statistics 31(3): The author(s) declared no potential conflicts of interest with 471–481. respect to the research, authorship, and/or publication of this Laney D (2001) 3D data management: Controlling data article. volume, velocity and variety. In: Meta Group. Available at: http://blogs.gartner.com/doug-laney/files/2012/01/ Funding ad949-3D-Data-Management-Controlling-Data-Volume- The author(s) disclosed receipt of the following financial sup- Velocity-and-Variety.pdf (accessed 16 January 2013). port for the research, authorship, and/or publication of this Lupton D (2015) The thirteen Ps of big data. The Sociological article: The research for this paper was funded by a European Life, 13 May. Available at: https://simplysociology. Research Council Advanced Investigator Award, ‘The wordpress.com/2015/05/11/the-thirteen-ps-of-big-data/ Programmable City’ (ERC-2012-AdG-323636). (accessed 17 September 2015). Marr B (2014) Big data: The 5 vs everyone must know. References March 6. Available at: https://www.linkedin.com/pulse/ 20140306073407-64875646-big-data-the-5-vs-everyone-mu Batty M (2015) Data about cities: Redefining big, recasting st-know (accessed 4 September 2015). small. Paper prepared for the Data and the City work- Marz N and Warren J (2012) Big Data: Principles and Best shop, Maynooth University, 31 August–1 September Practices of Scalable Realtime Data Systems. MEAP edi- 2015. Available at: http://www.spatialcomplexity.info/ tion. Westhampton, NJ: Manning. files/2015/08/Data-Cities-Maynooth-Paper-BATTY.pdf Mayer-Schonberger V and Cukier K (2013) Big Data: A (accessed 4 September 2015). Revolution that will Change How We Live, Work and Boyd D and Crawford K (2012) Critical questions for big Think. London: John Murray. data. Information, Communication and Society 15(5): McNulty E (2014) Understanding Big Data: The seven V’s. 22 662–679. May. Available at: http://dataconomy.com/seven-vs-big Bucher T (2012) ‘Want to be on the top?’ Algorithmic power -data/ (accessed 4 September 2015). and the threat of invisibility on Facebook. New Media and Murthy P, Bharadwaj A, Subrahmanyam PA, et al. (2014) Society 14(7): 1164–1180. Big Data Taxonomy. Big Data Working Group, Cloud Cahalane C, McCarthy T and McElhinney CP (2012) Security Alliance. Available at: https://downloads.cloud MIMIC: Mobile mapping point density calculator. In: securityalliance.org/initiatives/bdwg/Big_Data_Taxonom Proceedings of the 3rd international conference on comput- ing for geospatial research and applications, 1–3 July 2012, y.pdf (accessed 7 September 2015). Washington, DC, USA: ACM. Nokia (2015) HERE makes HD map data in US, France, Diebold F (2012) A personal perspective on the origin(s) and Germany and Japan available for automated vehicle development of ‘big data’: The phenomenon, the term, and tests. Available at: http://company.nokia.com/en/news/ the discipline. Available at: http://www.ssc.upenn.edu/ press-releases/2015/07/20/here-makes-hd-map-data-in-us-f fdiebold/papers/paper112/Diebold_Big_Data.pdf rance-germany-and-japan-available-for-automated-vehic (accessed 5 February 2013). le-tests (accessed 16 September 2015). Dodge M and Kitchin R (2005) Codes of life: Identification Open Data Center Alliance (2012) Big Data Consumer Guide. codes and the machine-readable world. Environment and Open Data Center Alliance. Available at: http://www. Planning D: Society and Space 23(6): 851–881. opendatacenteralliance.org/docs/Big_Data_Consumer_ Economic and Social Research Council (ESRC) (2013) The Guide_Rev1.0.pdf (accessed 11 February 2013). Big Data family is born – David Willetts MP announces Strom D (2012) Big data makes things better. Slashdot, the ESRC Big Data Network. In: ESRC website, 10 3 August. Available at: http://slashdot.org/topic/bi/big- October. Available at: http://www.esrc.ac.uk/news-and- data-makes-things-better/ (accessed 24 October 2013). events/press-releases/28673/the-big-data-family-is-born-d The Economist (2010) All too much: Monstrous amounts of avid-willetts-mp-announces-the-esrc-big-data-network. data, 25 February. Available at: http://www.economist.- aspx (accessed 7 September 2015). com/node/15557421 (accessed 12 November 2012). Florescu D, Karlberg M, Reis F, et al. (2014) Will ‘big data’ Uprichard E (2013) Big data, little questions. Discover transform official statistics? Available at: http://www. Society, 1 October. Available at: http://discoversociety. q2014.at/fileadmin/user_upload/ESTAT-Q2014-BigData org/2013/10/01/focus-big-data-little-questions/ (accessed OS-v1a.pdf (accessed 1 April 2015). 17 September 2015). Kitchin R (2013) Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography 3(3): 262–267. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Big Data & Society SAGE

What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets:

Big Data & Society , Volume 3 (1): 1 – Feb 17, 2016

Loading next page...
 
/lp/sage/what-makes-big-data-big-data-exploring-the-ontological-characteristics-JDr8kgy2r6

References (20)

Publisher
SAGE
Copyright
Copyright © 2022 by SAGE Publications Ltd, unless otherwise noted. Manuscript content on this site is licensed under Creative Commons Licenses.
ISSN
2053-9517
eISSN
2053-9517
DOI
10.1177/2053951716631130
Publisher site
See Article on Publisher Site

Abstract

Big Data has been variously defined in the literature. In the main, definitions suggest that Big Data possess a suite of key traits: volume, velocity and variety (the 3Vs), but also exhaustivity, resolution, indexicality, relationality, extensionality and scalability. However, these definitions lack ontological clarity, with the term acting as an amorphous, catch-all label for a wide selection of data. In this paper, we consider the question ‘what makes Big Data, Big Data?’, applying Kitchin’s taxonomy of seven Big Data traits to 26 datasets drawn from seven domains, each of which is considered in the literature to constitute Big Data. The results demonstrate that only a handful of datasets possess all seven traits, and some do not possess either volume and/or variety. Instead, there are multiple forms of Big Data. Our analysis reveals that the key definitional boundary markers are the traits of velocity and exhaustivity. We contend that Big Data as an analytical category needs to be unpacked, with the genus of Big Data further delineated and its various species identified. It is only through such ontological work that we will gain conceptual clarity about what constitutes Big Data, formulate how best to make sense of it, and identify how it might be best used to make sense of the world. Keywords Big Data, ontology, taxonomy, types, characteristics Introduction . relationality (containing common fields that enable The etymology of ‘Big Data’ has been traced to the the conjoining of different datasets) (Boyd and mid-1990s, first used by John Mashey, retired former Crawford, 2012); Chief Scientist at Silicon Graphics, to refer to handling . extensionality (can add/change new fields easily) and and analysis of massive datasets (Diebold, 2012). In scaleability (can expand in size rapidly) (Marz and 2001, Doug Laney detailed that Big Data were charac- Warren, 2012); terised by three traits: . veracity (the data can be messy, noisy and contain uncertainty and error) (Marr, 2014); . volume (consisting of enormous quantities of data); . value (many insights can be extracted and the data . velocity (created in real-time) and; repurposed) (Marr, 2014); . variety (being structured, semi-structured and . variability (data whose meaning can be constantly unstructured). shifting in relation to the context in which they are generated) (McNulty, 2014). Since then, others have attributed other qualities to Big Data, including: NIRSA, Maynooth University, County Kildare, Ireland UCD School of Computer Science, University College Dublin, Dublin, . exhaustivity (an entire system is captured, n¼ all, Ireland rather than being sampled) (Mayer-Schonberger and Cukier, 2013); Corresponding author: . fine-grained (in resolution) and uniquely indexical Rob Kitchin, NIRSA, Maynooth University, County Kildare, Ireland. (in identification) (Dodge and Kitchin, 2005); Email: rob.kitchin@nuim.ie Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http:// www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access- at-sage). 2 Big Data & Society Uprichard (2013) notes several other v-words that small data generated through state-administered sur- have also been used to describe Big Data, including: veys and administrative data. Kitchin (2015) extended ‘versatility, volatility, virtuosity, vitality, visionary, their original table, adding three further fields to their vigour, viability, vibrancy... virility... valueless, vam- 14 points of comparison (see Table 2). Table 2 makes it pire-like, venomous, vulgar, violating and very violent.’ clear that Big Data have a very different set of charac- More recently, Lupton (2015) has suggested dropping teristics to more traditional forms of small data across a v-words to adopt p-words to describe Big Data, detail- range of attributes which extend beyond the data’s ing 13: portentous, perverse, personal, productive, par- essential qualities (including methods, sampling, data tial, practices, predictive, political, provocative, quality, repurposing, management). privacy, polyvalent, polymorphous and playful. While In contrast, rather than focusing on the ontological useful entry points into thinking critically about Big characteristics of what constitutes the nature of Big Data, these additional v-words and new p-words are Data, some define Big Data with respect to the compu- often descriptive of a broad set of issues associated tational difficulties of processing and analyzing it, or in with Big Data, rather than characterising the onto- storing it on a single machine (Strom, 2012). For exam- logical traits of the data themselves. ple, Batty (2015) contends that Big Data challenges Based on a review of definitions of Big Data, Kitchin conventional statistical and visualization techniques, (2013, 2014) contends that Big Data are qualitatively and push the limits of computational power to analyze different to traditional, small data along seven axes (see them. He thus contends that we have always had Big Table 1). He details that, until recently, science has Data, with the massive datasets presently being pro- progressed using small data that have been produced duced merely the latest form of Big Data, which require in tightly controlled ways using sampling techniques new technique to process, store and make sense of that limit their scope, temporality and size, and are them. Murthy et al. (2014) categorises Big Data using quite inflexible in their administration and generation. a six-fold taxonomy that likewise focuses on its hand- While some of these small datasets are very large in ling and processing rather than key traits: (1) data ((a) size, they do not possess the other characteristics of temporal latency for analysis: real-time, near real-time, Big Data. For example, national censuses are typically batch; and (b) structure: structured, semi-structured, generated once every 10 years, asking just c.30 struc- unstructured); (2) compute infrastructure (batch or streaming); (3) storage infrastructure (SQL, NoSQL, tured questions, and once they are in the process of being administered it is impossible to tweak or add/ NewSQL); (4) analysis (supervised, semi-supervised, remove questions. In contrast, Big Data are generated unsupervised or re-enforcement machine learning; continuously and are more flexible and scalable in their data mining; statistical techniques); (5) visualisation production. For example, in 2014, Facebook was pro- (maps, abstract, interactive, real-time); and (6) privacy cessing 10 billion messages, 4.5 billion ‘Like’ actions, and security (data privacy, management, security). and 350 million photo uploads per day (Marr, 2014), Regardless of how Big Data have been defined it is and they were constantly refining and tweaking their clear that, despite widespread use, the term is still rather underlying algorithms and terms and conditions, chan- loose in its ontological framing and definition, and it is ging what and how data were generated (Bucher, 2012). being used as a catch-all label for a wide selection of Similarly, Florescu et al. (2014), in a study examin- data. The result is that these data are characterised ing the potential for Big Data to be used to generate as holding similar traits to each other and the term new official statistics, details how Big Data differs from ‘Big Data’ is treated like an amorphous entity that lacks conceptual clarity. However, for those who work with and analyze datasets that have been labelled Table 1. Comparing small and Big Data. as Big Data it is apparent that, although they undoubt- Small data Big Data edly share many traits, they also vary in their charac- teristics and nature. Not all of the data types that have Volume Limited to large Very large been declared as constituting Big Data have volume, Velocity Slow, freeze-framed/ Fast, continuous velocity or variety, let alone the other characteristics bundled noted above. Nor do they all overly challenge conven- Variety Limited to wide Wide tional statistical techniques or computational power in Exhaustivity Samples Entire populations making sense of them. In other words, there are mul- Resolution and Course and weak Tight and strong tiple forms of Big Data. However, while there has been indexicality to tight and strong some rudimentary work to identify the ‘genus’ of Big Relationality Weak to strong Strong Data, as detailed above, there has been no attempt to Extensionality and Low to middling High separate out its various ‘species’ and their defining scalability attributes. Kitchin and McArdle 3 Table 2. Characteristics of survey, administrative and Big Data. Survey data Administrative data Big Data Specification Statistical products specified Statistical products specified Statistical products specified ex-ante ex-post ex-post Purpose Designed for statistical Designed to deliver/monitor a Organic (not designed) or purposes service or program designed for other purposes Byproducts Lower potential for by-products Higher potential for by-products Higher potential for by-products Methods Classical statistical methods Classical statistical methods Classical statistical methods not available available, usually depending on always available the specific data Structure Structured A certain level of data structure, A certain level of data structure, depending on the objective of depending on the source of data collection information Comparability Weaker comparability Weaker comparability between Potentially greater comparability between countries countries between countries Representativeness Representativeness and Representativeness and coverage Representativeness and coverage coverage known by design often known difficult to assess Bias Not biased Possibly biased Unknown and possibly biased Error Typical types of errors Typical types of errors Both sampling and non-sampling (sampling and (non-sampling errors, errors (e.g., missing data, non-sampling errors) e.g., missing data, reporting reporting errors and outliers) errors and outliers) although possibly less fre- quently occurring, and new types of errors Persistence Persistent Possibly less persistent Less persistent Volume Manageable volume Manageable volume Huge volume Timeliness Slower Potentially faster Potentially much faster Cost Expensive Inexpensive Potentially inexpensive Burden High burden No incremental burden No incremental burden Geography National, defined National or extent of program National, international, poten- and service tially spatially uneven Demographics All or targeted Service users or program Consumers who use a service, recipients pass a sensor, contribute to a project, etc. Intellectual Property State State State/Private sector/ User-created rights. Source: Florescu et al. (2014: 2–3) and Kitchin (2015) In this paper, we examine the ontology of Big Data datasets are used for illustrative purposes and were and its definitional boundaries, exploring the question selected due to our familiarity with them. We start by ‘what makes Big Data, Big Data?’ We employ Kitchin’s examining each of the parameters detailed by Kitchin (2013) taxonomy of the characteristics of Big Data with respect to the 26 different data types, in effect (Table 1) to examine the nature of 26 specific types of working down the columns in Table 3. We then exam- data, drawn from seven domains (mobile communica- ine the rows to consider how these parameters are com- tion; websites; social media/crowdsourcing; sensors; bined with respect to the data types to produce multiple cameras/lasers; transaction process generated data; forms of Big Data. and administrative), that have been labelled in the lit- Our aim in performing this analysis is not to deter- erature as Big Data (see Table 3). These 26 types of mine a tightly constrained definition of Big Data – to data are by no means exhaustive of all types of Big definitively set out precisely the nature of Big Data and Data, for example there are a multitude of Big Data their essential qualities – but rather to explore the par- generated within scientific experiments, science comput- ameters, limits, and ‘species’ of Big Data. The analysis ing, and industrial manufacturing. Rather, these 26 is thus an exercise in boundary work designed to test 4 Big Data & Society Table 3. Ontological traits of Big Data. Velocity Volume frequency Volume Volume (TBs, of handling, (number per PBs, Velocity frequency recording, Data type of records) record etc.) of generation publishing Variety Exhaustivity Resolution Indexical Relational Extensionality Scalable Mobile Mobile phone High Low High Real-time At time of Structured n¼ all Fine-grained Yes Yes No Yes communication data constant generation (bkgrd comms), real-time sporadic (at use) App data High Low High Real-time constant At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes (bkgrd comms), generation unstructured real-time sporadic (at use) Websites Web searches High Low High Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes generation unstructured Scraped websites High Medium High Real-time sporadic At time of Semi-structured n¼ all Fine-grained Yes Yes Yes Yes generation Clickstream High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes Yes Yes generation Social media/ Social media High Medium High Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes Crowdsourcing (full pipe) generation unstructured (e.g. Twitter) Social media Low Medium Medium Real-time sporadic At time of Structured & Sampled Fine-grained Yes Yes Yes Yes (spritzer) generation unstructured (e.g. twitter) Picture sharing/ High High High Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes social media generation unstructured (flickr, Panoramio, Instagram) Collaborative Low Low Low Real-time sporadic At time of Structured & n¼ all Fine-grained Yes Yes Yes Yes mapping generation semi- platforms (open to structured (OpenStreetMap, editing) Wikimapia) Citizen science High Low Medium Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No Yes (wunderground) or real-time generation sporadic Sensors Traffic loops Medium Low Low Real-time constant At time of Structured n¼ all Aggregated Yes Yes No No generation (continued) Kitchin and McArdle 5 Table 3. Continued Velocity Volume frequency Volume Volume (TBs, of handling, (number per PBs, Velocity frequency recording, Data type of records) record etc.) of generation publishing Variety Exhaustivity Resolution Indexical Relational Extensionality Scalable Automatic Number Medium Low Medium Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No Yes Plate Readers generation (ANPR) Real-time passenger Medium Low Low Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No No info (RTPI) generation Smart meters High Low Medium Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No No generation Pollution and Medium Low Low Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No No sound generation sensors Satellite images Medium High High Real-time constant At time of Unstructured n¼ all, Fine-grained Yes Yes No No generation delayed repeat of coverage Cameras/Lasers Digital CCTV High High High Real-time constant At time of Unstructured n¼ all Fine-grained Yes Yes No No generation Lidar mapping High High High Real-time constant Delayed and Structured n¼ all, but Fine-grained Yes Yes No No (by HERE) (when in use) consolidated no or (daily) infrequent repeat coverage Transactions Supermarket High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes No Yes of process scanner and generation generated data sales data Immigration Low High High Real-time sporadic At time of Structured n¼ all, Fine-grained Yes Yes No Yes (inc. photo, generation infrequent fingerprint scan) repeat coverage Flight movements High Low High Real-time constant At time of Structured n¼ all Fine-grained Yes Yes No Yes generation Credit card data High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes No Yes generation Stock market trades High Low High Real-time sporadic At time of Structured n¼ all Fine-grained Yes Yes No Yes generation Administrative House price register Low Low Low Real-time sporadic Delayed and Structured n¼ all Fine-grained Yes Yes No Yes consolidated (monthly) (continued) 6 Big Data & Society the edges of what might be considered Big Data and to internally tease apart what is presently an amorphous concept to reveal its inner diversity – its multiple forms. In other words, we consider in much more detail than previous studies the ontology of Big Data. This is an important exercise, we believe, as it enables the produc- tion of much more conceptual clarity about what con- stitutes Big Data, especially given the ongoing confusion over its traits and its amorphous description. In turn, acknowledging and detailing the various types of Big Data facilitates a much more nuanced under- standing of its forms, its value, and how they might be analyzed and for what purposes. The parameters of Big Data In Table 3 we have mapped 26 sources of data, defined as Big Data within the literature, against the traits iden- tified by Kitchin (2014) in Table 1. Through the process of evaluating each dataset against each characteristic it quickly became apparent that the categories of volume and velocity needed to be further teased apart. Similarly, while resolution and indexicality, and extensionality and scalability, are combined into two characteristics in Table 1, we consider them separately in Table 3 given that they are not synonymous traits. In the context of Big Data, volume generally refers to the storage space required to record and store data. Big Data, it is commonly stated, typically require tera- 40 50 bytes (2 bytes) or petabytes (2 bytes) of storage space (The Economist, 2010), far more than an average desktop computer can provide, with the data often stored in the cloud across several servers and locations. However, when we examine our 26 datasets it is clear that some of them, for example pollution and sound sensors, require very little storage space, maybe produ- cing a gigabyte (2 bytes) of data per annum (easily storable on a datastick). Although each sensor might be producing a steady stream of readings, say once per minute, each record is very small, consisting of just a few kilobytes (2 bytes). Even summed over the course of a year, the sensor dataset would be relatively small in stored volume, in fact much smaller than many ‘small datasets’ such as a Census. As detailed in Table 3, we have thus teased apart volume into three dimensions: (1) the number of records (which is reflective of velocity and the number of generating devices), (2) the storage required per record, and (3) the total storage required (effectively the sum of the first two). Using this threefold classification of volume it is clear that the 26 Big Data sets have differing volume characteristics. Automated forms of Big Data gener- ation, where records are created on a continual basis every few seconds or minutes, often across multiple sites or individuals, produce very large numbers of Table 3. Continued Velocity Volume frequency Volume Volume (TBs, of handling, (number per PBs, Velocity frequency recording, Data type of records) record etc.) of generation publishing Variety Exhaustivity Resolution Indexical Relational Extensionality Scalable Planning permissions Low Low Low Real-time sporadic Delayed and Structured n¼ all, but Fine-grained Yes Yes No Yes consolidated no or (weekly) infrequent repeat coverage Employment register Low Low Low Real-time sporadic Delayed and Structured n¼ all Aggregated Yes Yes No Yes (at release) consolidated (monthly) bkgrd comms: constant background passive monitoring. Kitchin and McArdle 7 records. Human-mediated forms, such as creating A single LIDAR scan generally produces a million administrative records (immigration, unemployment plus points of data (Cahalane et al., 2012). At the end registration), might have a steady stream of new rec- of every day the local storage device is removed from ords, usually generated from a constrained number of the vehicle performing the scan and its data transferred sites (a small number of entry points to a country, to a data centre. Similarly, unemployment data are rec- unemployment offices), that produce much lower vol- orded at the time a person updates their status on the umes than automated systems. Likewise, while each system, but the overall unemployment rate is published sensor record is generally very small in file size, imagery monthly and in an aggregated form. In some cases, data (such as streaming video, photographs and satel- even once the data are generated they are open to fur- lite images) are typically quite large in file size, meaning ther editing, as with crowdsourced data within that relatively low numbers of records soon scale into Wikipedia or OpenStreetMap, with the edits also rec- huge storage requirements. In many cases, although the orded in real-time and becoming part of the dataset. volume per record is low, the sheer number of devices Perhaps not unsurprisingly, there is a fair range of generating data produce very large storage volumes. variety in the data form across our 26 datasets, includ- For example, the million customers flowing through ing structured, semi-structured and unstructured data thousands of Walmart stores every hour generate 2.5 types. Of all the characteristics attributed to Big Data petabytes of transaction data (Open Data Center this seems to us to be the weakest attribute. Indeed, Alliance, 2012). small data are also highly heterogeneous in nature, Velocity is considered a key attribute of Big Data. especially datasets common to humanities and social Rather than data being occasionally sampled (either on sciences where the handling and analyzing of qualita- a one-off basis or with a large temporal gap between tive data (text, images, etc.) is normal. Our suspicion is samples), Big Data are produced much more continu- that this characteristic was attributed to Big Data ally. When we examined our datasets, however, it because those scientists who first coined the term were became apparent that there are two kinds of velocity used to handling structured data exclusively but were with respect to Big Data: (1) frequency of generation; starting to encounter semi-structured and unstructured (2) frequency of handling, recording, and publishing; data as new data generation and collection systems and that the 26 datasets varied with respect to these were deployed. As noted, small datasets consist of samples of repre- two traits. In terms of frequency of generation, data can be generated in real-time constantly, for example sentative data harvested from the total sum of poten- recording a reading every 30 seconds or verifying loca- tially available data. Sampling is typically used because tion every 4 minutes (as many mobile phone apps do), it is unfeasible in terms of time and resources to harvest or in real-time sporadically, for example at the point of a full dataset. In contrast, Big Data seeks to capture the use, such as clickstream data being generated in real- entire population (n¼ all) within a system, rather than time but only while a user is clicking through websites, a sample. In other words, Twitter captures all tweets or an immigration system recording only when some- made by all tweeters, plus their associated 32 fields of one is scanning their documents. metadata, not a sample of tweets or tweeters. Similarly, In some cases, as the data are recorded, the system is a set of pollution sensors is seeking to create a continu- updated in real-time and the new data are also pub- ous, longitudinal record of readings, captured every few lished in real-time (with only a fraction of delay seconds, from a fixed network of sensors. Likewise, a between the two). For example, as a tweet is tweeted credit card company or the stock market seeks to it is recorded in Twitter’s data architecture and micro- record every single transaction and alter credit balances seconds later it is published into user timelines. Here, accordingly. even though the data generation is sporadic at the point All our 26 datasets hold the characteristic of n¼ all, of generation (each user might only produce a couple of except for the spritzer of Twitter; this is the sample of tweets a day), it is far from the case at the point of tweets harvested from the full fire hose that Twitter recording by the company (the millions of Twitter shares with some researchers. It is important to note, users collectively generate thousands of tweets per however, that the temporality of n¼ all can vary. For second, meaning that the company databases and ser- example, an immigration system at an airport aims to vers are constantly handling a data deluge). In other capture details about all passengers passing through it, cases, the data are recorded in real-time, but their trans- but a passenger might only pass through that system mission to central servers and/or their processing or infrequently. In the case of a satellite, it might capture publication is delayed. For example, the HERE imagery of the whole planet, but it only flies over the LIDAR scanning project involves 200 cars driving same portion of the Earth every set number of days. around cities taking a LIDAR scan every second to Likewise, the HERE LIDAR project aims to scan every produce high definition mapping data (Nokia, 2015). road in every country, but each street is only surveyed 8 Big Data & Society once and is unlikely to be rescanned for several years. of the 26 datasets are generated and stored within rap- In other words, Big Data systems seek to capture idly scaleable systems, but not others. n¼ all, but capturing n¼ all varies with respect to what is being measured and their spatial coverage and The forms and boundaries of Big Data temporal register. As with exhaustivity, all 26 datasets hold the traits of What is clear from examining each Big Data parameter fine-grained resolution (with the exception of employ- with respect to the 26 datasets is that there is no one ment data, which is fine-grained in the database but is characteristic profile that all Big Data fit. Big Data does published in aggregated form), indexicality and rela- not possess all of the seven traits detailed by Kitchin tionality. In each case, the data are accompanied by (2013, 2014). Indeed, not all data termed Big Data in metadata that uniquely identifies the device, site and the literature possess the 3Vs of volume, velocity and time/date of generation, along with other characteris- variety. If one looks across the rows in Table 3 then the tics such as device settings. These metadata inherently diversity of Big Data becomes clear, with datasets pos- produces relationality, enabling data from the same sessing differing profiles, especially with regard to and related devices but generated at different times/ volume, velocity, variety, extensionality and scalability. locales to be linked, but also entirely different datasets Big Data are clearly then not an amorphous category that share some common fields to be tied together and and there are certainly different ‘species’ of Big Data. relationships between datasets to be identified. Examining these profiles starts to suggest the bound- However, the data themselves might not provide un- ary markers of what constitutes Big Data. Indeed, it ambiguous relationality or be easily machine-readable. may be the case that some of our 26 datasets might For example, a tweet is composed of text and/or an not be considered Big Data by some. Or it might be image which requires either data analytics or human that some consider certain datasets to constitute Big interpretation to identify the content and meaning of Data that we would not, for example, national censuses the tweet. Similarly, a CCTV feed will be indexical to a (which have volume, exhaustivity, resolution, indexical- camera and be time, date, and place stamped, but the ity and relationality, but no velocity (generated once content of the feed will either require image recognition every 10 years and taking 1–2 years to process), no to identify content (e.g., using facial recognition soft- extensionality or scaleability, and are published in ware) or operator recognition to make the image con- aggregated form). It seems to us, based on the datasets tent indexical. that we have examined, that the key boundary charac- Extensionality and scaleability refer to the flexibility teristics of Big Data, which together differentiate it of data generation. A system that is highly adaptable in from small data, are velocity (both frequency of gener- terms of what data are generated is said to possess ation, and frequency of handling, recording, and pub- strong extensionality (Marz and Warren, 2012). For lishing) and exhaustivity. Small data are slow and example, web-based and mobile apps are constantly sampled. Big Data are quick and n¼ all. Small data tweaking their designs and underlying algorithms, per- can hold all of the other characteristics (volume, reso- forming on-the-fly adaptive testing and rollout, as well lution, indexicality, relationality, extensionality and as altering their terms and conditions and the metadata flexibility) and still be considered small in nature. It is they capture. The result is the data they generate are the qualities of velocity and exhaustivity which set Big changeable, with new fields being added and removed Data apart and are responsible for so much recent as required. However, this is not a trait common to all attention and investment in Big Data ventures. While big datasets. For example, many systems, such as smart some datasets have possessed these two qualities for meters, credit card readers and sensor-networks, are some time, such as stock market and weather data, it seeking rigid continuity in what data are generated to is only in the past 15 years that these characteristics produce robust, comparable longitudinal datasets. have become much more common and routine. Scaleability refers to the extent to which a system can These two traits, we believe, act as key Big Data cope with varying data flow. Social media platforms boundary markers. In our own analysis of Table 3 it such as Twitter need to be able to cope with ebbs and was the administrative datasets of the house price regis- surges in data generation, scaling from managing a few ter, planning permissions and unemployment, as well as thousand tweets at certain times of the day to tens of the satellite and LIDAR imagery that provoked the thousands during popular live events. Such rapid scal- most discussion (we quite quickly rejected Census ing is not required in systems that have a constant flow data, which we had initially included, due to its very of data, such as a sensor network that produces data at long temporal gap in data generation). In the case of set intervals (the timing can be altered, but the flow the administrative data, they are produced in real-time remains constant rather than surging). As such, some as entries are made into the system (as house sales are Kitchin and McArdle 9 completed, planning permissions sought, and unem- across a whole system can produce a deluge of data, ployed people sign-on). However, the publishing of especially if each record is large in size. In some cases, the data is either weekly or monthly, and in the case however, the flow can be generated in real-time (e.g., of unemployment released in an aggregated form. Do every 30 seconds), but because the system is small (e.g., data that are generated in real-time, but released 30 sound sensors across a city) and each record is small monthly and in an aggregated form constitute Big in size, the storage volume is relatively small. The data Data? Certainly they are at the point of collection, generated by each sensor are also highly structured. but what about at the point of publishing where they Despite the lack of volume and variety, such sensor lack velocity? For some, such administrative data are data are widely considered Big Data. Likewise, variety Big Data (Economic and Social Research Council is not a distinguishing characteristic because small data (ESRC), 2013), for others they are more marginal, possesses just as much variety as Big Data. and the key element in doubt is temporality. One month’s delay is still much quicker than most adminis- Conclusion trative data that are published quarterly or annually, and the dataset still holds most of the other character- To date, there has been very little work that has sought istics of Big Data such as exhaustivity (the data refers to examine in detail the ontology of Big Data, other to all houses sold, all planning permissions sought, and than to suggest that they are data that possess certain all unemployed people), but it is nonetheless far slower broad characteristics (volume, velocity, variety, exhaus- than data published in real-time. tivity, etc.). Indeed, most studies that discuss Big Data Our discussion of satellite imagery and LIDAR treat the term as a catch-all, amorphous phrase that focused in particular on coverage and repetition of assumes that all Big Data share a set of general traits. gaze. In other forms of Big Data, what is being mea- Through an analysis that applied Kitchin’s (2013, 2014) sured remains quite constant, with the gaze and the typology of Big Data traits to 26 datasets our study object under surveillance relatively fixed. In social reveals that Big Data do not all share the same charac- media it is the contributions of every user, for credit teristics and that there are multiple forms of Big Data. cards it is the transactions of every card holder, for Indeed, our analysis demonstrates that only a handful supermarkets it is the purchases of every shopper. of the 26 datasets we examined held all seven traits However, the gaze of the satellite imagery moves, identified by Kitchin. That said, it is the case that for only returning to capture the same terrain after a set Big Data to be classified as Big Data they do need to number of days. Nonetheless the surface of the entire possess the majority of the traits set out in Table 1, of planet is being repeatedly generated and data are pro- which velocity and exhaustivity are the most important. cessed constantly. In the case of LIDAR, that repeti- Volume and variety, we contend, are not necessary con- tion is missing. The aim is to scan every road on the ditions of Big Data and without velocity and exhaus- planet, but to do so only once. The data are generated tivity are not qualifying criteria. In other words, the in real-time, and are voluminous, indexical, relational, 3Vs meme is actually false and misleading and along and they produce exhaustive spatial coverage (the aim with the term itself is partially to blame for the confu- is to create a 3D model of the whole road network and sion over the definitional boundaries of Big Data. the architecture bordering this network) though no lon- The observation that there are multiple forms of Big gitudinal data of the same places. In both cases, most Data is perhaps no surprise given the wide variety of would agree that satellite imagery and LIDAR scans small data, the varying nature of the systems that gen- constitute Big Data, but they are exhaustive in a par- erate Big Data, the differing purposes for which the ticular way which distinguishes them from other types data are generated, and the differing forms of the of Big Data. The same would also be the case with data generated. Nonetheless it is an observation that respect to large scientific experiments such as data gen- needs highlighting given that it has so far been ignored erated by the Large Hadron Collider. or taken for granted in the literature. Our analysis has Interestingly, given the meme of the 3Vs of Big Data, revealed that Big Data as an analytical category needs having examined 26 types of Big Data, our conclusion to be unpacked, with the ‘genus’ of Big Data further is that two of those Vs – volume and variety – are not delineated and its various ‘species’ identified. This is key defining characteristics of Big Data. It is certainly important work if we are to better understand what it the case that Big Data often consists of very large num- is that we are talking about when we discuss and ana- bers of records and the storage volume required to store lyze Big Data, and if we want to produce more nuanced them is significant, however, this is not a necessary con- insights about and from the data. It is only through dition of Big Data. Rather volume is a by-product of such ontological work, focused on shifting from velocity and exhaustivity: the real-time flow of data broad generalities to specific qualities, that we will 10 Big Data & Society Kitchin R (2014) The Data Revolution: Big Data, Open Data, gain conceptual clarity about what constitutes Big Data Data Infrastructures and Their Consequences. London: and formulate how best to make sense of it and how it Sage. might be used to make sense of the world. Kitchin R (2015) The opportunities, challenges and risks of big data for official statistics. Statistical Journal of the Declaration of conflicting interests International Association of Official Statistics 31(3): The author(s) declared no potential conflicts of interest with 471–481. respect to the research, authorship, and/or publication of this Laney D (2001) 3D data management: Controlling data article. volume, velocity and variety. In: Meta Group. Available at: http://blogs.gartner.com/doug-laney/files/2012/01/ Funding ad949-3D-Data-Management-Controlling-Data-Volume- The author(s) disclosed receipt of the following financial sup- Velocity-and-Variety.pdf (accessed 16 January 2013). port for the research, authorship, and/or publication of this Lupton D (2015) The thirteen Ps of big data. The Sociological article: The research for this paper was funded by a European Life, 13 May. Available at: https://simplysociology. Research Council Advanced Investigator Award, ‘The wordpress.com/2015/05/11/the-thirteen-ps-of-big-data/ Programmable City’ (ERC-2012-AdG-323636). (accessed 17 September 2015). Marr B (2014) Big data: The 5 vs everyone must know. References March 6. Available at: https://www.linkedin.com/pulse/ 20140306073407-64875646-big-data-the-5-vs-everyone-mu Batty M (2015) Data about cities: Redefining big, recasting st-know (accessed 4 September 2015). small. Paper prepared for the Data and the City work- Marz N and Warren J (2012) Big Data: Principles and Best shop, Maynooth University, 31 August–1 September Practices of Scalable Realtime Data Systems. MEAP edi- 2015. Available at: http://www.spatialcomplexity.info/ tion. Westhampton, NJ: Manning. files/2015/08/Data-Cities-Maynooth-Paper-BATTY.pdf Mayer-Schonberger V and Cukier K (2013) Big Data: A (accessed 4 September 2015). Revolution that will Change How We Live, Work and Boyd D and Crawford K (2012) Critical questions for big Think. London: John Murray. data. Information, Communication and Society 15(5): McNulty E (2014) Understanding Big Data: The seven V’s. 22 662–679. May. Available at: http://dataconomy.com/seven-vs-big Bucher T (2012) ‘Want to be on the top?’ Algorithmic power -data/ (accessed 4 September 2015). and the threat of invisibility on Facebook. New Media and Murthy P, Bharadwaj A, Subrahmanyam PA, et al. (2014) Society 14(7): 1164–1180. Big Data Taxonomy. Big Data Working Group, Cloud Cahalane C, McCarthy T and McElhinney CP (2012) Security Alliance. Available at: https://downloads.cloud MIMIC: Mobile mapping point density calculator. In: securityalliance.org/initiatives/bdwg/Big_Data_Taxonom Proceedings of the 3rd international conference on comput- ing for geospatial research and applications, 1–3 July 2012, y.pdf (accessed 7 September 2015). Washington, DC, USA: ACM. Nokia (2015) HERE makes HD map data in US, France, Diebold F (2012) A personal perspective on the origin(s) and Germany and Japan available for automated vehicle development of ‘big data’: The phenomenon, the term, and tests. Available at: http://company.nokia.com/en/news/ the discipline. Available at: http://www.ssc.upenn.edu/ press-releases/2015/07/20/here-makes-hd-map-data-in-us-f fdiebold/papers/paper112/Diebold_Big_Data.pdf rance-germany-and-japan-available-for-automated-vehic (accessed 5 February 2013). le-tests (accessed 16 September 2015). Dodge M and Kitchin R (2005) Codes of life: Identification Open Data Center Alliance (2012) Big Data Consumer Guide. codes and the machine-readable world. Environment and Open Data Center Alliance. Available at: http://www. Planning D: Society and Space 23(6): 851–881. opendatacenteralliance.org/docs/Big_Data_Consumer_ Economic and Social Research Council (ESRC) (2013) The Guide_Rev1.0.pdf (accessed 11 February 2013). Big Data family is born – David Willetts MP announces Strom D (2012) Big data makes things better. Slashdot, the ESRC Big Data Network. In: ESRC website, 10 3 August. Available at: http://slashdot.org/topic/bi/big- October. Available at: http://www.esrc.ac.uk/news-and- data-makes-things-better/ (accessed 24 October 2013). events/press-releases/28673/the-big-data-family-is-born-d The Economist (2010) All too much: Monstrous amounts of avid-willetts-mp-announces-the-esrc-big-data-network. data, 25 February. Available at: http://www.economist.- aspx (accessed 7 September 2015). com/node/15557421 (accessed 12 November 2012). Florescu D, Karlberg M, Reis F, et al. (2014) Will ‘big data’ Uprichard E (2013) Big data, little questions. Discover transform official statistics? Available at: http://www. Society, 1 October. Available at: http://discoversociety. q2014.at/fileadmin/user_upload/ESTAT-Q2014-BigData org/2013/10/01/focus-big-data-little-questions/ (accessed OS-v1a.pdf (accessed 1 April 2015). 17 September 2015). Kitchin R (2013) Big data and human geography: Opportunities, challenges and risks. Dialogues in Human Geography 3(3): 262–267.

Journal

Big Data & SocietySAGE

Published: Feb 17, 2016

Keywords: Big Data; ontology; taxonomy; types; characteristics

There are no references for this article.