This paper discusses how an interactive artwork, the Crowd-Sourced Intelligence Agency (CSIA), can contribute to discussions of Big Data intelligence analytics. The CSIA is a publicly accessible Open Source Intelligence (OSINT) system that was constructed using information gathered from technical manuals, research reports, academic papers, leaked documents, and Freedom of Information Act files. Using a visceral heuristic, the CSIA demonstrates how the statistical correlations made by automated classification systems are different from human judgment and can produce false positives, as well as how the display of information through an interface can affect the judgment of an intelligence agent. The public has the right to ask questions about how a computer program determines if they are a threat to national security and to question the practicality of using statistical pattern recognition algorithms in place of human judgment. Currently, the public’s lack of access to both Big Data and the actual datasets intelligence agencies use to train their classification algorithms keeps the possibility of performing effective sous-dataveillance out of reach. Without this data, the results returned by the CSIA will not be identical to those of intelligence agencies. Because we have replicated how OSINT is processed, however, our results will resemble the type of results and mistakes made by OSINT systems. The CSIA takes some initial steps toward contributing to an informed public debate about large-scale monitoring of open source, social media data and provides a prototype for counterveillance and sousveillance tools for citizens.
Keywords: Dataveillance, information access, transparency, social media, counterveillance, sousveillance

Department of Art, Art History and Design, Michigan State University, USA
Department of Media Study, State University of New York at Buffalo, USA

Corresponding author: Jennifer Gradecki, Department of Art, Art History and Design, Michigan State University, 600 Auditorium Road, 113 Kresge Art Center, East Lansing, MI 48824, USA. Email: firstname.lastname@example.org

Creative Commons CC-BY: This article is distributed under the terms of the Creative Commons Attribution 3.0 License (http://www.creativecommons.org/licenses/by/3.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

Introduction

With the release of the Snowden leaks, debates about dataveillance practices used by intelligence agencies have finally entered public discourse. Unfortunately, people without familiarity with techniques for data collection or analysis often do not understand how large troves of unstructured data (and metadata) become intelligence and often assume that if they have nothing to hide, these systems should not concern them. Consequently, dataveillance of social media or other publicly available information has not faced the same public scrutiny over privacy as bulk collection of emails or cell phone data. To address this deficit, we have created an interactive artwork, the Crowd-Sourced Intelligence Agency (CSIA), that replicates the data processing of an open source intelligence (OSINT) surveillance system monitoring the popular microblogging platform, Twitter. By allowing users to experience how these systems frame social media posts and (mis)interpret natural language, especially slang, jokes, and sarcasm, we hope to provide a visceral heuristic of the process to help participants of our app ask questions and make informed decisions about the large-scale monitoring of open source, social media data. This type of awareness can facilitate new tactics for sousveillance and counterveillance.

The state of the public debate

Part of what impedes a public understanding of large-scale OSINT surveillance is that it is made possible by the rise of Big Data and the analytic tools to process it. Big Data is "fundamentally networked" (boyd and Crawford, 2011) and has been facilitated by the "widespread availability of electronic storage media, specifically mainframe computers, servers and server farms, and storage area networks" (Gitelman and Jackson, 2013: 6–7). However, Big Data is inherently inaccessible to the public, both in terms of access to the database and the ability to process it. The public does not have access to the same amount of data that intelligence agencies do, but even when the public does gain access to massive troves of data, they generally do not have the computational capacity to quickly process and analyze all of the data or the time to develop the technical competencies needed to understand the intentionally coded, specialized documents. Twitter, for example, only offers a limited amount of its datastream to the public and academic researchers through its Application Program Interface (API). However, the full "firehose" is available to companies (including government contractors) who have the ability to pay for and process it. Lev Manovich created a hierarchy of "data-classes" for a "Big Data society" that places those with the expertise to analyze it at the top. The middle class is comprised of people and organizations who have the ability to collect Big Data, while the bottom contains those who only make data, consciously or not (Manovich, 2011).

The danger of using Big Data to identify threats to national security is that it tends to provoke apophenia, or the perception of meaningful patterns in random data (boyd and Crawford, 2011: 2). The tools for handling Big Data, such as machine-learning classification and techniques for making different data types compatible with one another, have restrictions and difficulties that data scientists are often in disagreement over how to address. Statistician and computer scientist Jesper Andersen points out that simply the process of cleaning the data (determining which characteristics of the data are important) "removes the objectivity from the data itself" (Bollier, 2010: 13). The manner in which data is presented to an agent and the context in which it is (re)framed can also influence how the data is perceived, thus affecting the agent's judgment. In the CSIA, we are interested in reproducing problems faced when processing and displaying data for intelligence analysis.

Intelligence agencies and Big Data

The dataveillance practices currently employed by intelligence agencies are spawned from a 'collect-it-all' mentality that assumes that if enough data can be collected, future actions can be predicted with a high level of accuracy. Consequently, the amount of data now being collected and processed necessitates the use of tools developed for handling Big Data. This toolkit not only includes instruments for data capture, but also software that scans massive troves of unstructured data, returning elements determined to be suspicious through an algorithmic process. In the CSIA, we are using some of the same classification techniques used to parse and analyze Big Data for predictive policing purposes. We focus on open-source intelligence (OSINT) data because it is the easiest to obtain, and perhaps the least controversial because it is already publicly available.

For the last 20 years, intelligence agencies have been developing and refining large-scale, automated data gathering and processing software (Arnold, 2015: 36), in order to address the growing problem of "data deluge" or "ever-growing data sets [sic]" (IBM Software Whitepaper, 2012: 2). Today, intelligence agencies routinely process massive amounts of structured and unstructured data, derived from both private and public sources, including: financial, medical, professional and academic records, transactional data, search queries, emails, texts, telephony metadata, geographic information system (GIS) data, public records, social media posts (Facebook, Instagram, Twitter), websites and blogs, news articles, video, audio, images, and the list goes on. The Big Data that intelligence analytics systems have been developed to deal with consists mainly of public data: in 2004, it was estimated that over 80% of the intelligence database came from open sources (Mercado, 2004: 49). Because the number of people using social media has grown substantially since 2004, it is likely that this percentage is even higher today. Agencies feel the need to automate the processing of this disparate data in order to gain situational awareness and predict outcomes. We are interested in how this process of automation impacts the conclusions that intelligence agents come to.

The Crowd-Sourced Intelligence Agency (CSIA)

CSIA is an online application and interactive artwork that replicates and displays some of the known techniques used by intelligence agencies to collect and process open source information. The app uses technical manuals, research reports, academic papers, leaked documents, and Freedom of Information Act files to construct an OSINT system that is accessible to the public. OSINT is intelligence collected from publicly available sources, such as the media (including social media), academic records, and public data, and has been described as "the basic building block for secret intelligence" (Mercado, 2004: 49).

The purpose of the CSIA is to openly show how publicly available information is processed and analyzed, with a focus on social media posts. We pieced together an incomplete mosaic of information that became the basis for constructing a technological artifact that replicates many of the features commonly used to process publicly available data, including: naïve Bayes supervised machine-learning classification for predictive analytics, keyword search results for words known to be used by intelligence agencies, and an interface that allows users to evaluate social media posts based on their threat to national security. Once we were able to build and interact with this surveillance system, assumptions and problems inherent in the system started to become visible. For example, we realized that if someone has a similar speech pattern or Twitter user description as a known target, they could potentially end up on a watch list.

The CSIA app consists of several components: (1) The Social Media Monitor, a surveillance interface where users evaluate Tweets based on their threat to national security. (2) Two naïve Bayes supervised machine-learning classifiers that automatically label tweets as suspicious or not suspicious. The Agent Bayes classifier is trained on a corpus of manually labeled tweets created by researching and simulating the process and judgments of intelligence agents. The Crowd-Sourced Classifier is trained on a corpus labeled by visitors to Science Gallery Dublin's SECRET exhibition. Users can review the algorithms' suggestions for accuracy and idiosyncrasies. (3) The Social Media Post Inspector, where users can submit text to see if a post is likely to be considered threatening by intelligence agencies and choose whether or not to share it on social media. (4) The Watchlist, where users can target themselves and others as subjects of social media monitoring, and which provides automated evaluations from our machine-learning classifiers to show how social media posts may be treated by OSINT surveillance systems (Figure 1). (5) A Resource Library that links to documents that informed the creation of the app.

Figure 1. Crowd-Sourced Intelligence Agency watchlist interface.

The goal of the CSIA is to expose potential problems, assumptions, or oversights inherent in current dataveillance processes in order to help people understand the effectiveness of OSINT processing and its impact on our privacy. We aim to facilitate a critical and practice-based understanding of a socio-technical system that typically evades public scrutiny. Ultimately, the CSIA provides firsthand experience with social media monitoring, allowing users to choose how they want to navigate social media surveillance.

Creating an informed public debate and model for resistance

The release of the Snowden documents revealed the extent to which governments and private contractors are monitoring the communications of their citizens, including social media posts and exchanges. This type of dataveillance would fall under what Bakir (2015) has termed the "veillant panoptic assemblage", which includes, among other things, governmental re-appropriation of citizens' social media communications for disciplinary purposes. Technology and tactics for counterbalancing the power differential amplified by older forms of optical surveillance have already been developed and are currently in use by the public. Among these is counterbalancing surveillance by the state (oversight) with citizen-based sousveillance (undersight) to achieve a 'democratic homeostasis', or equiveillance, where the veillant forces of the state and citizens are balanced. Since sousveillance is at a power disadvantage, a socio-technical assemblage of new media and social networks may need to be leveraged to compensate for the power differential. A common example of equiveillance is when citizens use cell phone cameras to film abuses of power by police or the power elite and post the videos online.

Bakir (2015: 21) poses the question of whether it is possible to achieve an "equiveillant panoptic assemblage" where the intelligence-power elite could face public scrutiny for their dataveillance practices in a similar way that citizen-produced videos can hold police accountable for their actions. We agree with her conclusion (2015: 22) that the current civic infrastructure for "genuine public debate" over mass surveillance is currently too weak to facilitate "change from below", and her assessment that when it comes to surveillance, counterveillance and univeillance, "making people understand and care about such issues is challenging given their abstract, complex nature" (18). We would add to this list of difficulties the inability of citizens to either collect or process Big Data. Despite these obstacles, we intend the CSIA to be a step toward facilitating a genuine public debate about the dataveillance of social media and a prototype for counterveillance and sousveillance tools for citizens. By demonstrating that what a dataveillance program 'sees' when it 'reads' social media posts is nothing like what a human being sees, we hope to create a debate over current dataveillance technologies as well as the efficacy and ethics of mass automated dataveillance more broadly.

The CSIA highlights the importance of the training corpus in machine-learning by allowing participants to create a corpus used to train the Crowd-Sourced Classifier and by providing another classifier, Agent Bayes, for comparison. The algorithm in both classifiers is identical; the only difference between the two classifiers is the data, which was selected and labeled by users of the CSIA application. The ratio of tweets found to be suspicious versus not suspicious is surprisingly similar between the two corpuses. In the Crowd-Sourced Classifier, museum visitors in Dublin labeled 22.11% of the tweets they reviewed as suspicious (Figure 2). In the Agent Bayes corpus, 21.00% of the tweets were identified as threatening by an individual who simulated the judgment criteria used by intelligence agents based on leaked documents and ethnographic accounts. However, the predictions made by the two classifiers varied greatly. When Agent Bayes and the Crowd-Sourced Classifier were tested against each other using a dataset containing 9,430 Twitter posts, they disagreed 35% of the time.

Figure 2. Visitors to Science Gallery Dublin reviewing Twitter posts.

Automated classification does not make erroneous data more accurate; it only automates the same errors across a larger dataset. This raises questions about the accuracy of the data intelligence agencies use to train their predictive policing systems, and whether the public should have access to that data for transparency and oversight purposes. There are already documented instances of intelligence agencies misinterpreting social media data as threatening. In 2012, two British students were detained by the US Department of Homeland Security and denied entrance to the US for posts they made on Twitter. In one post identified as a threat, Leigh Van Bryan tweeted a joke from the cartoon Family Guy about "diggin' Marilyn Monroe up", which prompted authorities to search the couple's luggage for shovels (Compton, 2012). Less humorously, in the trial of Dzhokhar Tsarnaev, the man convicted of planting a bomb made from a pressure cooker at the Boston Marathon, the evidence initially presented from his Twitter account was exceptionally flawed. Song lyrics and jokes from the television show Key and Peele were presented as evidence of wrongdoing, and the background image of Tsarnaev's home mosque in Grozny had been labeled as "Mecca" by the FBI. Upon cross-examination, the agent admitted that they did not bother to look at a picture of Mecca for a comparison (Woolf, 2015). Because there was ample physical evidence linking Tsarnaev to the bombing, the FBI may have simply assumed that his Twitter posts were incriminating.

If all of the social media posts made by known terrorists are labeled as threatening and used in a training corpus for a machine-learning classifier, we can expect to find Twitter users who have similar taste in television and music being algorithmically identified as threats to national security. People who believe they will not be targeted by these systems because they are not doing anything wrong need to understand that automated classification systems only find statistical correlations between data: if you happen to make posts using language similar to a known target, you may be flagged as a potential threat by the system.

The CSIA also provides a model for possible counterveillance and sousveillance tools. The Social Media Post Inspector feature, which allows users to type a tweet and process the text with both keyword and algorithmic analysis to see if it might be flagged as suspicious by an OSINT dataveillance system, enables counterveillance by showing social media users how their posts might be interpreted. The user then has the option to tweet directly from the Post Inspector's interface, giving them the option to rephrase the post to avoid algorithmic scrutiny or even overload a post with language that creates false positives. An informed user may even decide to refrain from tweeting altogether.

The CSIA Watchlist can be used for sousveillance: users may choose to include law enforcement, intelligence agencies, government contractors or other members of the intelligence-power elite, to keep track of their social media posts using dataveillance techniques and participate in a crowd-sourced and distributed watching of the watchers.

Conclusion

The CSIA fosters an informed public debate by making abstract ideas about surveillance into concrete, interactive replications of intelligence techniques and technologies to allow participants to see some aspects of how dataveillance works in practice. The CSIA provides a visceral heuristic: as CSIA agents (users of the app) monitor their own posts and the posts of their friends, they can see how the automated processing changes, reinterprets, reframes, and recontextualizes their posts without needing a background in data science. The inaccessibility of Big Data keeps the possibility of performing effective sousveillance on OSINT technologies out of reach, prohibiting the prospect of achieving equiveillance under the current situation. However, technologies in this area are developing rapidly enough that it is conceivable that consumer-grade equipment will be able to perform these types of analytics in the near future. The CSIA is taking some of the first steps towards creating tools for sous-dataveillance and counter-dataveillance.

Ideally, the effectiveness of specific algorithms for language processing, translation, and classification could become topics of public debate and scrutiny. The public has the right to ask questions about how a computer program determines if they are a threat to national security and to question the practicality of using statistical pattern recognition algorithms in place of human judgment. Ethical and legal questions will also need to be addressed, such as who is held accountable when someone is wrongfully detained or arrested due to a statistical similarity to a known threat. What is badly needed, both for the public debate and to create effective counterveillance and sousveillance tools, is the actual data intelligence agencies use to train their dataveillance algorithms. Without this data, the results returned by the CSIA will only resemble the results and mistakes made by OSINT systems currently in use. These limitations may be overcome in the near future through leaked information, FOIA requests, or public pressure. Despite these limitations, by reproducing the type of problems inherent in the processing and displaying of Big Data for intelligence analysis, the CSIA fosters a critical awareness of the assumptions in dataveillance technology and begins to enable the development of counterveillance tactics.

Acknowledgments

The CSIA was made possible by the generous support of the Science Gallery Dublin.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. Beginning in 1996, Autonomy was the first company to automate the processing of Big Data for intelligence purposes.
2. IBM's i2 Analyst's Notebook is a Big Data analytics system widely used in both intelligence and law enforcement.
3. Crowd-Sourced Intelligence Agency website: http://www.crowdsourcedintel.org/
4. From the 2011 Analyst's Desktop Binder technical manual for the Department of Homeland Security's social media monitoring program, released through a FOIA lawsuit by the Electronic Privacy Information Center: https://epic.org/foia/epic-v-dhs-media-monitoring/Analyst-Desktop-Binder-REDACTED.pdf
5. Website for the SECRET exhibition: http://dublin.sciencegallery.com/secret/
6. Informational video about the CSIA: https://vimeo.com/
7. Univeillance is veillance with one party's consent (i.e. recording something you are a part of). Counterveillance pertains to the measures taken to block both surveillance and sousveillance. Counterveillant technologies include software for anonymization and encryption, and tactics include going 'off the grid'. See: Mann (2013: 7).
8. Jennifer Gradecki read ethnographic reports, including Curing Analytic Pathologies: Pathways to Improved Intelligence Analysis (Cooper, 2005) and Information Sharing and Collaboration in the United States Intelligence Community: An Ethnographic Study of the National Counterterrorism Center (Nolan, 2013), as well as leaked documents, such as entries from the NSA's intranet column "The SIGINT Philosopher" and GCHQ files pertaining to the Lovely Horse OSINT software, and replicated the practices of an intelligence agent as closely as possible.

References

Arnold SE (2015) CyberOSINT: Next-Generation Information Access, Law Enforcement, Security and Intelligence Edition. Louisville, KY: Arnold Information Technology.
Bakir V (2015) "Veillant panoptic assemblage": Mutual watching and resistance to mass surveillance after Snowden. Media and Communication 3(3): 12–25. Available at: http://www.cogitatiopress.com/ojs/index.php/mediaandcommunication/article/view/277 (accessed 3 February 2017).
Bollier D (2010) The Promise and Peril of Big Data. Report for the Communications and Society Program. Washington, DC: The Aspen Institute.
boyd d and Crawford K (2011) A decade in internet time. In: Symposium on the dynamics of the internet and society, Oxford Internet Institute, UK, 21 September. Available at: http://dx.doi.org/10.2139/ssrn.1926431 (accessed 3 February 2017).
Compton A (2012) British tourists detained, deported for tweeting 'destroy America'. Huffington Post, 30 January. Available at: http://www.huffingtonpost.com/2012/01/30/british-tourists-deported-for-tweeting_n_1242073.html (accessed 23 October 2016).
Cooper JR (2005) Curing Analytic Pathologies: Pathways to Improved Intelligence Analysis. Washington, DC: Center for the Study of Intelligence.
Gitelman L and Jackson V (2013) Introduction. In: Gitelman L (ed.) "Raw Data" Is an Oxymoron. Cambridge, MA: MIT Press, pp. 1–14.
IBM Software Whitepaper (2012) IBM i2 Analyst's Notebook social network analysis. October. Available at: https://cryptome.org/2013/12/ibm-i2-sna.pdf (accessed 23 October 2016).
Mann S (2013) Veillance and reciprocal transparency: Surveillance versus sousveillance, AR glass, lifelogging, and wearable computing. Available at: http://wearcam.org/veillance/veillance.pdf (accessed 3 February 2017).
Manovich L (2011) Trending: The promises and the challenges of big social data. Available at: http://manovich.net/index.php/projects/trending-the-promises-and-the-challenges-of-big-social-data (accessed 26 October 2016).
Mercado SC (2004) Sailing the sea of OSINT in the information age. Studies in Intelligence 48(3): 45–55.
Nolan BR (2013) Information Sharing and Collaboration in the United States Intelligence Community: An Ethnographic Study of the National Counterterrorism Center. Doctoral dissertation, Princeton University.
Woolf N (2015) Boston Marathon bomb trial: FBI agent mistakes Grozny for Mecca in Twitter photo. The Guardian, 10 March. Available at: https://www.theguardian.com/us-news/2015/mar/10/fbi-testimony-boston-marathon-bomb-trial-dzhokhar-tsarnaev (accessed 23 October 2016).

This commentary is part of a special theme on Veillance and Transparency. A full list of articles in the special theme is available at: http://bds.sagepub.com/content/veillance-and-transparency
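The keyword-search component described above can be sketched as simple list matching. The handful of terms below is a hypothetical stand-in for the much longer watchword lists referenced in the paper's sources (such as the DHS Analyst's Desktop Binder released through FOIA); the function names are ours, not the CSIA's.

```python
import re

# Hypothetical stand-in for an agency keyword list; real FOIA-released
# lists run to hundreds of terms.
WATCHWORDS = {"bomb", "attack", "plume", "exercise", "cloud"}

def keyword_hits(post: str) -> set:
    """Return the watchwords appearing in a post, matched case-insensitively
    on whole tokens."""
    tokens = re.findall(r"[a-z']+", post.lower())
    return WATCHWORDS.intersection(tokens)

def flagged(post: str) -> bool:
    """A post is flagged if it contains any watchword, regardless of context --
    which is exactly how jokes and idioms become false positives."""
    return bool(keyword_hits(post))
```

On this sketch, a post like "That concert was the bomb!" is flagged even though no human reader would consider it threatening, illustrating the context-blindness the paper describes.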
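The two-classifier comparison can be sketched with scikit-learn: the pipeline (a multinomial naive Bayes model over bag-of-words counts) is identical for both classifiers, and only the training labels differ. The toy posts and labels below are invented for illustration and are not the CSIA's actual corpora; the paper does not specify its implementation, so this is a minimal sketch of the general technique.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train(posts, labels):
    """Identical algorithm for both classifiers: token counts feeding
    a multinomial naive Bayes model."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(posts, labels)
    return model

# Invented toy corpus: the same posts, labeled by two different groups.
posts = [
    "gonna bomb this exam tomorrow",
    "traffic on the bridge is terrible",
    "they should blow up that building downtown",
    "lovely day at the marathon",
]
agent_labels = ["suspicious", "clear", "suspicious", "clear"]
crowd_labels = ["clear", "clear", "suspicious", "clear"]

agent_bayes = train(posts, agent_labels)
crowd = train(posts, crowd_labels)

# The identical algorithm can disagree on a new post because the
# training labels differ.
new_post = ["i will bomb my chemistry exam"]
print(agent_bayes.predict(new_post)[0], crowd.predict(new_post)[0])
```

Because the only moving part is the labeled data, any bias or idiosyncrasy in who did the labeling is carried directly into the predictions, which is the point the comparison between Agent Bayes and the Crowd-Sourced Classifier makes.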
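The reported disagreement figure is an elementwise comparison of two classifiers' labels over the same test set. A minimal sketch, with invented predictions standing in for the real 9,430-post dataset:

```python
def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two classifiers assign different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must be the same length")
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return differing / len(preds_a)

# Invented example: two classifiers labeling the same six posts.
agent = ["suspicious", "clear", "clear", "suspicious", "clear", "clear"]
crowd = ["clear", "clear", "clear", "suspicious", "suspicious", "clear"]
print(disagreement_rate(agent, crowd))  # 2 of the 6 labels differ
```

A high disagreement rate between two instances of the same algorithm is itself evidence that the training corpus, not the mathematics, is doing the classificatory work.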
Big Data & Society – SAGE
Published: Feb 1, 2017