Playing with machines: Using machine learning to understand automated copyright enforcement at scale

This article presents the results of methodological experimentation that utilises machine learning to investigate automated copyright enforcement on YouTube. Using a dataset of 76.7 million YouTube videos, we explore how digital and computational methods can be leveraged to better understand content moderation and copyright enforcement at a large scale. We used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. We use this to explore, in a granular way, how copyright is enforced on YouTube, using both statistical methods and qualitative analysis of our categorised dataset. We provide a large-scale systematic analysis of removal rates from Content ID’s automated detection system and the largely automated, text search based, Digital Millennium Copyright Act notice and takedown system. These are complex systems that are often difficult to analyse, and YouTube only makes available data at high levels of abstraction. Our analysis provides a comparison of different types of automation in content moderation, and we show how these different systems play out across different categories of content. We hope that this work provides a methodological base for continued experimentation with the use of digital and computational methods to enable large-scale analysis of the operation of automated systems.

Keywords
Machine learning, copyright enforcement, YouTube, content moderation, automated decision-making, Content ID

This article is part of the special theme on The Turn to AI. To see a full list of all articles in this special theme, please click here: https://journals.sagepub.com/page/bds/collections/theturntoai

Gray and Suzor, Big Data & Society

Creative Industries Faculty, Queensland University of Technology, Brisbane, Australia
Faculty of Law, Queensland University of Technology, Brisbane, Australia

Corresponding author: Nicolas P Suzor, Faculty of Law, Queensland University of Technology, GPO Box 2434, Brisbane, Queensland 4001, Australia. Email: n.suzor@qut.edu.au

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

How can we understand how massive content moderation systems work? The major social media platforms use a combination of human and automated processes to efficiently evaluate content that their users post against the rules of the platform and applicable laws. These sociotechnical systems are notoriously difficult to understand – we can see their results in individual cases, but their inner workings and systemic impact are often obscured (Gillespie, 2018). Most major platforms provide regular transparency reports, but these mainly provide high-level aggregations that are insufficient to really probe the contours and social effects of moderation systems (Suzor et al., 2019). This is a challenge that is widely acknowledged and quickly becoming more pressing; the lack of good information about how our shared digital environments are governed – what information is available, removed, and made more or less visible – has led to serious concerns about the potential for bias and the flow of harmful content (Suzor, 2019). This is likely to become more important and more difficult as nations continue to ask platforms to do more to regulate social media content – and to regulate more quickly by applying machine learning to filter material automatically or prioritise content for review.

In this article, we investigate copyright enforcement on YouTube as an important case study of a sophisticated and complex set of processes that are heavily automated and remain highly controversial. YouTube is a major target for the Digital Millennium Copyright Act (DMCA) notice and takedown copyright enforcement system, and it has also built Content ID, one of the most extensive automated systems for detecting copyright infringement. There are major concerns that YouTube’s enforcement system frequently incorrectly removes videos, at substantial cost to freedom of expression (Tushnet, 2014: 1461). Nevertheless, Content ID now serves as a model for the further deployment of ‘upload filters’ (Reda, 2019), and under the European Union’s recently approved copyright directive, more platforms will have a strong incentive to deploy automated systems that can monitor potential copyright infringement by their users (European Union Parliament, 2018). There is also real potential for these tools to be applied to censor or moderate other types of content, including hate speech and abuse (European Commission, 2018).

In this study, we use digital methods to try to make the content moderation system on YouTube – a system that relies on both automated and discretionary decision-making and that is applied to varying types of video content – more legible for research. Starting with a random sample of the text metadata of 76.7 million YouTube videos that includes information about whether and why each video was removed or blocked, we developed a machine learning classifier to categorise these videos into four categories. The categories represent ongoing controversies over online copyright enforcement: full movies, gameplay, sports content, and tutorials on copy control circumvention (hacks, cracks, and exploits). The core methodological problem that we confronted was how to reliably identify, from a very large dataset of videos, relatively small subsets of removed videos that fell within our content categories. By solving this problem, we were able to examine how different types of automated and discretionary enforcement operate differentially within different categories of video content.

Essentially, in this study, we explore how digital and computational methods can be usefully combined with statistical and rich qualitative methods to study the digital traces of large-scale content moderation systems. We hope that this work will help to inform continual experimentation in the development of a new set of methodological approaches for interrogating pressing public policy research questions in the context of large-scale automated decision-making in digital media environments.

Background

YouTube’s baseline legal obligations for enforcing copyright are set out by the notice-and-takedown system established under the United States DMCA legislation and propagated around the world. Notice-and-takedown has become an extremely important industrial mechanism for enforcing copyright; copyright owners employ rights management companies who use automated search tools to send hundreds of complaints of notices every year (Urban et al., 2017). Google, like other major targets of notice-and-takedown, has had to develop streamlined automated processes to deal with the massive volume of complaints it receives, but there is little public detail about how these processes work.

In practice, YouTube goes beyond its obligations under the DMCA when enforcing copyright, and has developed a series of additional tools and privately negotiated systems and policies (Bridy, forthcoming). The most visible of these tools is Content ID, YouTube’s automated rights management system that allows rightsholders to block, monetise, mute or track videos that contain their works. Rightsholders provide YouTube with a reference file of their work and the Content ID algorithm scans videos that are uploaded to YouTube to see if a match can be found in the database of reference files. Google reports that on YouTube, 98% of copyright matters are decided by Content ID (Google, 2016).

Automated, privatised copyright enforcement is an example of a controversial institutional shift from public to private modes of regulation. Private regulatory modes are controversial because they tend to lack important democratic features and due process safeguards (Black, 2001: 143; Zimmerman, 2014: 273). Private, automated regulatory systems in particular can lead to institutional convergence (Perel and Elkin-Koren, 2016: 481); that is, a tendency for lawmaking, enforcement and adjudication to occur in centralised private modes rather than through separate legislative and judicial institutions. The increased use of automation in private regulation raises additional concerns about transparency, accountability, and protection for fundamental rights (Citron, 2007; Elkin-Koren, 2014).

There are multiple sources of opacity – institutional, legal and technological – that make it difficult to evaluate automated private regulatory systems (Diakopoulos, 2015; Zarsky, 2016). Trade secret laws often prevent public access to these systems (Perel and Elkin-Koren, 2017). Their workings typically encode rules and priorities privately negotiated between stakeholders; Content ID, for example, was developed through partnerships between YouTube and large entertainment companies (Tushnet, 2014; Yafit, 2013: 248). Automated decision-making processes frequently occur in a ‘black box’ that is difficult or impossible to interrogate (Pasquale, 2015). The data and rules that human and algorithmic moderators are trained on are secret, and the outputs of these systems are often obscured by platforms who seek to avoid public scrutiny (Gillespie, 2018). Increasing accountability requires real transparency, but improving transparency will require substantial improvements in methodologies and collaborations (Suzor et al., 2019).

A second major controversy that marks automated decision-making is the capacity for error and bias. When undertaken on a very large scale, even very low rates of error in automated decision-making can create very large problems (Urban et al., 2017: 35). Automated decision-making systems are typically blunt instruments, operating with narrow objectives that can introduce systematic bias, and incapable of accounting for the full context that might impact human decision-making (Binns et al., 2017). For example, if Content ID’s primary objective is to enforce and commodify the use of content on YouTube, without the ability to distinguish between non-infringing and infringing uses, it risks serving the interests of large rightsholders at the expense of end users or smaller creators (Bridy, 2016; Kohl, 2013: 220).

In copyright governance, as more advanced automated enforcement systems have been deployed to protect the proprietary interests of media and entertainment companies, controversies have arisen over the potential for these systems to prioritise efficiency over accuracy (Gray, 2020). Content ID on YouTube, in particular, has been criticised for removing from YouTube content not subject to copyright protection such as reviews, parodies and educational videos that fall under a fair dealing or fair use exception (McSherry, 2014). Similarly, Content ID has been criticised for capturing content that rightsholders would have, if left to their discretion, not removed, such as gameplay videos (Boroughf, 2015). So long as rightsholders continue to pressure platforms to implement more streamlined and efficient systems for removing content from the internet, the potential for over-enforcement of copyright online will remain a contentious issue.

The study of copyright takedowns has been a vibrant area for scholarship. Existing work has utilised a broad range of methodologies, including interviewing stakeholders (Urban et al., 2017); experimental uploading and interacting with platforms (Nas, 2004; Perel and Elkin-Koren, 2017); the analysis of information made available at the discretion of platforms or during legal disputes (Bar-Ziv and Elkin-Koren, 2018; Seng, 2014, 2015; Tushnet, 2014; Urban et al., 2017; Urban and Quilter, 2006); tracking discrete samples of publicly available material (Erickson and Kretschmer, 2018; Jacques et al., 2018); and laboratory experimentation (Fiala and Husovec, 2018). To this body of work, our study contributes a machine learning methodology that helps to identify particular categories of content for further qualitative and statistical analysis at a very large scale. Our hope is that this new approach will improve understanding of how content moderation systems that combine both automated and discretionary decision-making operate in practice, across different types of video content.

Interrogating content moderation at scale

This article has two main aims. First, we seek to compare trends in copyright takedowns, Content ID blocks, and terms of service (ToS) removals across several key controversial issues. We set out to categorise videos into topics that aligned with concerns in the literature about potential misuse of copyright enforcement mechanisms. By focusing on particular controversies, we hoped to be able to investigate the types of actors responsible for moderation within each category, as well as to compare rates of different types of moderation across categories and against the baseline average. Second, in order to undertake this analysis, we had to develop a methodology that could help us isolate videos in controversial categories from the much larger bulk of YouTube content.

One of the major challenges of understanding content moderation practices is the vast scale of content that is posted and moderated on major social media platforms. Overall rates of removal under the DMCA and Content ID on YouTube are relatively low – only approximately 1% of all videos uploaded are removed due to an apparent copyright violation (Suzor, n.d.). In order to examine trends within copyright removals, we required a very large initial sample and a suitable mechanism to isolate sufficiently large samples of videos that are likely to be relevant to the inquiry at hand. Given the extremely large volume and proportion of irrelevant videos, this is a task that is prohibitively difficult to undertake manually. A large-scale dataset is required to analyse and compare trends in types of content removal across different categories of content and it requires some form of computational analysis to assemble.

Methodology

Data collection

Figure 1. Overall removal rates.

For this study, we used existing infrastructure (Suzor, n.d.) to obtain a random sample of metadata about YouTube videos and their availability. The infrastructure uses YouTube’s search API (‘list’ endpoint) to generate a random sample of YouTube videos as they are published. Each of these videos was tested approximately two weeks after it was first collected, and we
logged whether it was still available, and if not, what reason YouTube provided for its removal. The two-week time period was selected to provide a sufficiently long time for a copyright takedown or terms of service removal – given that these systems tend to be most controversial when content is new and therefore most visible and most commercially valuable. When YouTube removes a video, it provides a detailed error message that explains why the video is not available and, in the case of copyright takedowns, who was responsible for requesting that the video be blocked. When a video is blocked by Content ID, YouTube will still host a link to the video and will provide an error message explaining that the video was blocked due to a copyright claim. It is possible that our infrastructure under-counts Content ID matches where a video is blocked before it is published, but it appears for many videos that Content ID blocking occurs some minutes or hours after initial publication. YouTube does not provide detail about what proportion of Content ID blocks take effect before publication.

We categorised the distinct explanations for removal that YouTube provides into six groups: the video was available, the video was not available because of a Content ID block (either globally or in the jurisdiction in which our server is based), the video was not available because of a DMCA notice or equivalent (either globally or locally), the video was removed by YouTube for violating its Terms of Service or Community Guidelines, the user’s account was terminated by YouTube, or the user removed the video themselves. We discarded videos that were unavailable for technical or other (unknown) reasons. Our final dataset consists of title, description, availability status, and error messages for 76.7 million YouTube videos collected between October 2016 and February 2019 (See Figure 1).

Topic selection

We first sought to divide the sample into categories of similar videos to better understand the types of videos that were removed for different reasons in this large dataset. We used topic modelling (Steyvers and Griffiths, 2007) on the title and description text of approximately 4000 videos (with English metadata) that had been removed, to explore the main types of content that are moderated. We used Latent Dirichlet Allocation (Blei et al., 2003; Pedregosa et al., 2011) to generate a series of topic models at varying degrees of granularity, ranging from 5 to 80 clusters. By varying the number of clusters and examining the most relevant words for each cluster, as well as a sample of documents that were most likely to fall into each cluster, we were able to identify some relatively strong groupings of topics for further analysis.

We used the topic modelling as a starting point to inductively identify some coherent clusters of similar videos that appeared to be frequently blocked. We cross-referenced these topics against controversial issues that have been identified in the literature, in order to develop a set of categories for further analysis. Ultimately, we developed a classification scheme for five categories, informed by existing controversies over copyright enforcement and notice and takedown:

• ‘Full movies’: videos that appear to conform to a traditional classification of movie piracy: full length copies of feature films (Pariser, 2016; Patry, 2009). In this category, we excluded movie trailers, movie soundtracks and videos containing only movie scenes.
• ‘Gameplay’: live streaming or recorded video game play. This category included reviews of games and game walkthroughs, tips, and guides that are vulnerable to takedown because they contain a lot of existing copyright content, including artwork, music, and dialogue (Boroughf, 2015; Burgess and Green, 2018). We excluded advertisements and other promotional content.
• ‘Sports’: videos of sporting event broadcasts either as live streams, recorded streams, or snippets. This category focused on recordings of televised sports content that is protected as copyright subject matter under most copyright regimes (Garrett, 2016; Jones, 2017). Premium sports content is highly lucrative, and is accordingly a hotly contested site for copyright enforcement. We excluded non-professional games recorded by users. We further subdivided this category to train the classifier to distinguish live broadcast streams from non-live sports content.
• ‘Hacks’: videos that provide tutorials on circumvention of Digital Rights Management (DRM) software and hardware, including game exploits, key and serial generators, and software cracks that can be used to infringe intellectual property rights (Gillespie, 2007). Sometimes called ‘paracopyright’, there are long-standing concerns that anti-circumvention law can be misused to reduce competition (Burk, 2003). Notably, the DMCA does not provide a procedure for notice and takedown of instructional material that teaches people to circumvent DRM – so it would be a sign of potential misuse if this category of videos had relatively high copyright takedown rates.

Classification

Our initial topic modelling was useful for exploration, but once we had identified our relevant categories, we elected to use a supervised classification technique to develop more robust samples for analysis. Recent advances in natural language processing have greatly improved the state of the art in text classification, improving the utility of deep neural networks in classification tasks. New techniques make use of ‘transfer learning’ – general models that are trained on very large existing datasets, and then fine-tuned for specific applications on much smaller labelled datasets (Mou et al., 2016). These general models are trained on large corpora to understand how different features of language relate to each other – learning, for example, how words are used in different senses by learning the contexts in which they appear in different sentences (Adhikari et al., 2019). We made use of the newly released Bidirectional Encoder Representations from Transformers (BERT), which has improved state of the art performance on many common natural language processing tasks (Devlin et al., 2018). BERT provides sentence-level representations trained on massive corpora of Wikipedia articles and digitised books. These pre-trained models allowed us to use supervised approaches to train a classifier to identify complex patterns within our specific dataset with relatively small training sets.

We used BERT to train a machine learning classifier to identify videos in each of our categories across our larger dataset. We use a transformer attention-based deep learning model – described as a ‘simple network architecture’ (Vaswani et al., 2017) – to perform classification on YouTube video titles and descriptions. We used the largest case-insensitive English-language pre-trained BERT model, with 24 encoder layers, 1024 hidden units, 16 heads, and 340 million parameters, and fine-tuned the model on Google’s Tensor Processing Unit cloud-based computing architecture.

We started by labelling a training set of example videos (title and description text) identified as likely to fall within our chosen categories by our topic model. We then went through several rounds of semi-supervised learning, where we ran our trained model to classify small batches of unlabelled records and manually corrected its results. We iteratively evaluated the model’s performance quantitatively by measuring its accuracy (f1 score – an average combined measure of false positives and false negatives in each category, weighted by the total number of results in each category) and qualitatively by closely examining the sample of predictions. We ultimately manually labelled approximately 10,000 videos to develop an adequate training set, although the majority of these (5801) were examples of videos that were not relevant to our categories. We were able to achieve satisfactory results using only between 420 and 1696 examples in each category.

To evaluate our model, we manually labelled each video in a sample of approximately 100 videos from each category (with the two sports categories combined) across the entire predicted dataset. The final results show an f1 weighted accuracy score of 90.6%. These results are quite good for our purposes – there are relatively few false positives within each category, and few instances of confusion between the four categories under scrutiny. We will discuss particular trends and limitations below in our qualitative analysis of each category but, notably, we found that a relatively low amount of manual labelling was required to produce an accurate machine learning classifier using the ‘transfer learning’ technique.

In the final stage, we deployed the trained model to categorise our entire random sample of 76.7 million YouTube videos. Our model found 12,943,693 unique videos fell into one of our four categories. We used multinomial logistic regression to examine the relationship between takedowns, our predicted categories, and additional variables in the metadata, including links to external sites. This statistical method enabled us to estimate the relative influence of different factors on the likelihood of videos in each category being removed through different mechanisms, and provides a basis from which to speculate about the factors that might influence decisions to block content across different categories. Most importantly, however, the classification technique allowed us to identify video metadata for manual qualitative investigation.

We undertook qualitative analysis of the metadata of a sample of 12,000 videos across our categories, including both removed and available videos, to explore the types of videos classified, and supplemented this with targeted samples on discrete questions. This analysis was first conducted by reading through the titles and descriptions of each of the videos in each category, paying attention to the choices made by uploaders to describe their videos and make them findable by others, as well as the types of content that the classifier assigned to each category. We then watched a manually selected subset of videos in each category until we were confident that we could understand the types of content that the classifier was identifying (or mis-identifying) in each category. Our analysis below focuses on how our findings relate to some of the key controversies around these categories of user-generated content and online copyright enforcement broadly.

Results and discussion

One of the most important findings of this study is that we can see, at a large scale, the rates at which Content ID is used to remove content from YouTube. Previous large-scale studies have provided important insights into rates of DMCA removals (e.g. Seng, 2014; Urban et al., 2017), but information about Content ID removals has remained imprecise, provided at a high level of abstraction by YouTube. In this article, we provide the first systematic analysis of Content ID removal rates, including comparisons with other removal types and across different categories of content. We note that where rightsholders have opted for monetisation through Content ID, they will be included in the ‘available’ category. YouTube does not make public the information necessary to determine which rightsholders have opted to monetise which videos.

Across our entire dataset, videos were most frequently removed from YouTube by users themselves, followed by removals due to an account termination and then Content ID blocks (See Table 1). DMCA takedowns were the least common removal type – Content ID removals occurred at seven times the rate of DMCA removals, and videos were on average five times more likely to be removed for terms of service violations than due to a DMCA notice. Our findings confirm the general trend identified in a recent study of Content ID and DMCA removals: in a sample of 1839 parody videos it was found that Content ID was used to block videos five times more frequently than DMCA notices (Jacques et al., 2018).

We built a simple Logistic Regression model to estimate the links between specific removal outcomes and the categories assigned by our classifier (See Table 2). We used 150,000 randomly selected records from each category to build the model, which estimates log odds of each type of removal for each category. We separated out music claimants (music rightsholders such as record labels or publishers) from the rest of the Content ID claims because music claimants’ behaviours tend to differ from other types of rightsholders quite significantly – they frequently make use of Content ID’s monetisation option rather than removing videos. The model also includes an interaction term for whether the video description has a link to an external website. We developed several models and selected this one based on its interpretability and its performance under a pseudo-R squared metric. The full regression results are in Appendix 1.

Overall, our analysis shows that in YouTube’s heavily automated content moderation system there is substantial discretionary decision-making, as well as a potential lack of contextual sensitivity. We found very high rates of removals for videos associated with film piracy and all types of sports content. We found that game publishers are largely not enforcing their rights against gameplay streams and that when gameplay videos are removed it is usually due to a claim by a music rightsholder. We also found high rates of removals in the hacks category but mostly for Terms of Service violations, which indicates that YouTube rather than rightsholders is more commonly taking action to remove content and terminate accounts that provide DRM anti-circumvention information.

Full movies

Film piracy has been one of the key issues of the ‘copyright wars’ (Patry, 2009). From early in its history, YouTube has been a central battleground in these wars, amidst major concern by both screen and music industries that YouTube’s core business model was built on copyright infringement (Burgess and Green, 2018). Content ID has cleaned up a lot of the direct copyright infringement on YouTube over the years, but there are ongoing concerns that YouTube continues to host copies of infringing content and that it provides a vector for infringement by directing viewers to streaming sites and filelockers where they can access copies of feature films. In particular, rightsholders continue to complain that movies that are removed often reappear quickly after removal (Van der Sar, 2018) and that filtering technologies like Content ID are vulnerable to gaming or circumvention by sophisticated users (Pariser, 2016: 19).

In our analysis, we sought to identify how well both Content ID and the DMCA process were working to keep what might appear to be clearly infringing content off YouTube. Overall, only 36% of videos in this category were available two weeks after they were first posted. When we break this down by removal type, we see that videos in this category are nearly 30 times more likely to be removed through a DMCA notice than the baseline for unclassified videos, and 11.5 times more likely to be blocked through Content ID. This category also had the highest rate of removals for terms of service violations (24.7 times more likely than baseline) and account terminations (14.5 times more likely).

The high rate of account terminations suggests that the accounts used to post full movies or links to full movies are likely to be ‘repeat infringers’, in the copyright terminology, or are either frequently or flagrantly in breach of YouTube’s rules. The top claimants in this category were a mix of film studios, who used Content ID directly, and third-party rights management companies, who are generally responsible for sending DMCA notices on behalf of producers.

From our qualitative analysis of the video metadata, it was apparent that the primary types of content removed in this category were videos that purported to be full copies of feature films hosted on YouTube or videos that promoted links to third-party websites apparently hosting streams of full copies of feature films. Many of the links in this category were to URL shorteners, filelockers, and a long tail of domains that often appeared to be offering illicit downloads and streams, advertising farms, or malware sites, amongst a lot more that were impossible to efficiently classify.

To add these links to our regression, we created a binary category for links. We excluded the most common domains in our dataset, those that were not generally associated with copyright infringement (we did so by excluding domains that received more than 1000 links in our entire classified dataset). We found that videos with links to sites that are not amongst the most popular sites were much less likely to be removed by Content ID – between 66% and 89% less likely than videos without a link in this category. There was no statistically significant difference for DMCA notices. From the video descriptions we examined, this appears to be evidence of uploaders seeking to use YouTube to gain the attention of users searching for illicit content without uploading any infringing material in the video itself – and therefore avoiding detection by Content ID. The high rates of removals of all videos in this category under the DMCA, Terms of Service, and account terminations by YouTube, however, regardless of whether they have a link or not, suggest that this may not be a major problem for rightsholders.

Gameplay

A persistent concern about both notice and takedown and Content ID relates to how it might regulate transformative user-generated content, and videos made from computer games have long been a prime site of controversy (Burgess and Green, 2018). As game streaming took off, YouTube has become an important (if secondary) platform for live and recorded gameplay footage (Taylor, 2018). For many years, commentators have raised concerns about potential over-enforcement of copyright, because copyright law gives many different copyright owners the right to object to gameplay videos and it can be hard to evaluate a fair use claim (or equivalent; Taylor, 2015).

The most common type of content identified in this category was streams of game play, often with commentary by the player, either as a live broadcast or recording. There is some suggestion that early fears about the lawfulness of gameplay videos may have settled down as copyright owners have come to accept and even embrace recorded footage and streaming videos (Matsui, 2016). In our data, we clearly see that game publishers are not enforcing their rights against game streamers at any real scale: gameplay videos are 83% less likely to be removed by a Content ID claim than an unclassified video, and 93% less likely to be removed by a DMCA notice. In general, it appears that game streaming is an advanced case of ‘tolerated use’, where technically infringing content has become normalised and acceptable (Tehranian, 2011; Wu, 2007).

The problems with tolerated use arise when the discretion to tolerate or remove content is exercised in a way that could stifle legitimate expression or reflects systemic biases. When videos in this category were removed on copyright grounds, it was usually at the request of large music companies – not game publishers. Worryingly, music claimants were 41% more likely to block gameplay videos than the uncategorised average. This may be a variant of the ‘tragedy of the anticommons’ (Heller, 1998): even if a large proportion of music is available to reuse on YouTube through Content ID’s licensing scheme, gameplay streams may last for hours and include many different background songs, the owners of any of which can elect to block the entire stream. Since it can be difficult for streamers to know in advance which songs are made available to license, and particularly since there is little threat of gameplay streams substituting in markets for recorded music, this is one area where Content ID appears to remain a justified cause of frustration for ordinary users.

Of the small proportion of videos that were removed for terms of service violations, or where the uploader’s account was terminated by YouTube, the majority appeared to be misclassifications or overlap with the ‘hacks’ category: footage of cheating and exploits in video games.
Our coding schema meant that guides on game exploits should be classified as ‘hacks’, not ‘game play’, but this is not always easy, and there is a degree of unavoidable overlap where users upload footage of cheating behaviour in multiplayer games. Our manual validation, in Figure 2, confirms that this is a particular area of confusion for our classifier: approximately 4% of videos in each of these two categories were mistakenly predicted in the other.

Sports

Live sports continues to be one of the major areas of controversy over copyright infringement on the internet. Because live sporting events are immensely popular around the world, and access is often limited to premium cable channels, pay-per-view, and streaming offerings (Garrett, 2016; Hull, 2010), we might expect a great deal of unmet demand from consumers who are dissatisfied with or cannot afford premium offerings, which in turn may lead to increased copyright infringement (Birmingham and David, 2011; Dootson and Suzor, 2015). Over the past decade, the infringement

Figure 2. Manual validation of approximately 100 randomly selected videos in each class (combining sports classes).

Table 1. Removal rates, all categories, all removal types.

| Category | Available | Account terminated | Content ID block | DMCA takedown | Removed by user | Terms of Service (ToS) takedown |
| --- | --- | --- | --- | --- | --- | --- |
| Full movie | 36.39% | 42.49% | 2.33% | 2.02% | 8.45% | 8.32% |
| Gameplay | 79.71% | 0.38% | 0.57% | 0.01% | 19.28% | 0.05% |
| Hacks | 54.26% | 32.83% | 0.10% | 0.13% | 9.08% | 3.60% |
| Sports | 61.05% | 14.97% | 1.59% | 1.05% | 19.59% | 1.75% |
| YouTube average | 78.36% | 3.98% | 0.77% | 0.11% | 14.70% | 0.57% |

DMCA: Digital Millennium Copyright Act. Videos that are not available for technical or other reasons are included in the total but not in the table.

Table 2. Log odds of removal for each category (baseline is the rate at which unclassified videos are available).
Columns: Content ID | Content ID (music claimant) | DMCA takedown | Account terminated | Removed by user | ToS takedown

Intercept 5.08 5.41 6.45 3.38 1.72 5.15
Game play 1.79 0.34 2.64 2.34 0.30 2.70
Hacks 2.15 1.27 2.72 0.41 2.27
Full movies 2.44 0.36 3.40 2.67 0.42 3.21
Sports highlights 1.35 0.51 2.36 0.27 0.12 1.21
Live sports 1.39 2.90 2.53 1.42 2.26
Has link (non-major site) 0.81 1.01 2.19 0.13 1.52
Game play * link 0.60 1.53
Hacks * link 1.62 1.20 0.98
Full Movies * link 1.08 2.24 0.18 1.06
Sports highlights * link 1.47 1.61 0.56
Live sports * link 1.17 0.26 1.42 0.93

DMCA: Digital Millennium Copyright Act. Results with low statistical significance have been omitted.

of live telecasts of sports broadcasts has become an increasingly pressing concern for sports organisations and, seeking to protect an important revenue stream, they have argued for stronger laws (Garrett, 2016: 2) and new practices to prevent internet users from sharing streams of live broadcasts of their sporting events (Mellis, 2007). At the same time, however, the strict copyright enforcement practices pursued by sports organisations have caused the removal or monetisation of works such as reviews, gifs, memes and other potentially fair uses of sports broadcast content (Jones, 2017; Wang, 2015).

In this category, the types of videos identified by our classifier primarily included live streams of sporting events, as well as videos containing parts of full recordings of sporting matches, for a wide variety of sports, from professional leagues of football, basketball, tennis, hockey, motor sports, wrestling and more. Videos in these categories were at high risk of removal by almost all avenues. Content ID removal rates were high for both live sports and highlights (4 times and 3.9 times more likely than baseline respectively), as well as for DMCA takedowns (18 times and 10.6 times baseline respectively). Users who posted videos that appeared to be live streams were at risk of having their account terminated 12.5 times more often than the baseline, and these videos were 9.6 times more likely to be removed for violating YouTube’s terms of service. Clearly both copyright owners and YouTube are heavily active in policing sports content on YouTube.

From our qualitative analysis of video metadata in the non-live sports subcategory, there was little discernible difference between types of takedowns. The high rate of removals in this category is somewhat concerning, since many clips of sports content may be lawful under fair use or other copyright exceptions, but this cannot be determined from the metadata. This area appears to be an important candidate for follow-up studies that are able to undertake fair use analyses (see e.g. Erickson and Kretschmer, 2018; Jacques et al., 2018) to determine whether sufficient care is being taken by YouTube, its partners, and copyright owners and their agents when evaluating whether to block sports videos.

Hacks

The final category we investigated primarily consists of video tutorials about circumventing DRM. This includes guides about jailbreaking smartphones, generating serial numbers for software, and downloading cracked software with the copy-protection removed. Because the tools are closely related, the classifier also identified videos about game exploits (modified versions of games that allow players to cheat, bypassing both copy protection and anti-cheat software), and serial generators for prepaid gift cards on app stores and ecommerce platforms. The classifier sometimes struggled to distinguish game cheats that required circumvention from ordinary cheats and exploits, although it appeared likely from the descriptions that many of the supposed exploits we identified were bait designed to lure traffic towards malware, advertising farms, or paid services.

This category had very high removal rates overall – only 54% of videos were still available two weeks after they were posted. Tensions over DRM generally and circumvention tools in particular have been a constant feature of copyright debates over the last three decades, and rightsholders have worked hard to ensure that other companies secure digital distribution channels (Gillespie, 2007). For this category, we sought to know whether copyright owners have been using the DMCA notice and takedown process to remove information about the circumvention of DRM – which would be clearly beyond the scope of the takedown regime. There was no evidence to support this – copyright owners were not sending notices at a statistically significantly higher rate in this category compared to unclassified videos, and the odds of removal for Content ID were 72–88% lower.

Interestingly, however, videos in this category were nearly 10 times more likely to be removed for a breach of YouTube’s Terms of Service, and 15 times more likely to be removed because YouTube had terminated the uploader’s account. YouTube’s Community Guidelines prohibit instructional videos that ‘[show] users how to bypass secure computer systems’, and clarify that this includes ‘Showing users how to circumvent payment processes to download software or applications for free’ (YouTube, n.d.). This appears to be a rule that YouTube is enforcing quite extensively – perhaps not surprising given Google’s interests in securing the Android ecosystem and its need to maintain working relationships with many software developers and copyright owners.

Limitations

The most obvious limitation of this study is that we are attempting to classify the content of videos based on the text and description fields. First, we only trained the classifier on English-language metadata, and the language model is optimised for English – so most of the results are also English-language. Second, these fields are entirely generated by the user, and do not always accurately reflect the video content. Unfortunately, our infrastructure does not collect full copies of videos, and we do not have the computational power to accurately classify or recognise themes from video content at a large scale. Nevertheless, there are important benefits in classifying video content on textual metadata, since these text fields are used to aid users who search on YouTube to discover relevant content. The more problematic limitation is that we are unable to include other metadata fields in our analysis – the YouTube Search API endpoint only returns a selected ‘snippet’ of information. It would have been useful to have additional metadata to build our logistic regression models (video duration, for example). It would have also improved the accuracy of our classifier to have full length data for the video description field. This is an unfortunate limitation of the API and the quota imposed by YouTube that was unavoidable in this study.

An important limitation of our analysis strategy is that we do not account for changes in enforcement rates over time. Given the small proportions of videos in some of the categories we examine compared to the total number of videos in our overall random sample, we have elected not to further divide our categorised sets into daily or monthly aggregations. Because content moderation systems of platforms are continuously tweaked, and their features, cultures, and the behaviour of their users change over time (Burgess, 2015), and individual rightsholders may change their takedown strategies, we suggest that future longitudinal work could be extremely useful.

Another key limitation is that we have chosen to optimise our training data for precision over recall. Precision is a measure of the proportion of correct classifications within each category (number of true positives/total of true and false positives), and recall is a measure of the proportion of correctly identified classifications across the entire dataset (number of true positives/total of true positives and false negatives). In classification tasks, there is a trade-off between these measures: increasing the overall number of correctly classified results in any category generally means including more incorrect results as well (Buckland and Gey, 1994). The specific goals of the analysis should guide training and evaluation of a classification model (Sokolova and Lapalme, 2009). We focused on improving the quality of predictions within each category (minimising false positives), at the expense of not including some potentially relevant videos (false negatives). We were therefore conservative in allocating examples to categories in our training sets, and in the semi-supervised stages, focused on reducing the numbers of misclassified examples, especially because it was difficult to find additional positive examples within the very large set of unclassified and (for our purposes) largely irrelevant random sample. So, for example, from our manual validation, 97 out of 100 videos classified as ‘full movies’ appeared from the metadata to purport to be full versions of cinematic films or to link to sites where people could watch full versions of those films. The precision of this category is very high, and we can be comfortable drawing conclusions about how these types of videos are regulated. This high rate, however, suggests that there may be a problem of overfitting: that our classification may be so tightly constrained to identify the particular patterns of our training videos that we miss other ways that people might share full versions of films on YouTube. This is a risk we guarded against by qualitative exploration of search results on film titles on YouTube – we did not find any major omissions that do not fit the patterns we identified in developing our training sets – but we are not able to exclude the possibility that our model is too narrow in identifying too few relevant videos. Ultimately, this choice means that we cannot make generalisations about the overall prevalence of videos matching our categories on YouTube, but we can be more confident that the videos in each category are correctly classified.

Conclusion

This experimental methodology has left us optimistic about the potential to use machine learning classifiers to better understand systems of algorithmic governance. The methodology improves our understanding of how an evolving system of both automated and discretionary content moderation operates in practice. Using a simple neural network infrastructure, a pre-trained BERT language model, and substantial cloud processing power, we were able to achieve satisfactory performance on a multiclass classifier over short texts (300 characters) with a relatively small number of training examples (between 420 and 1696 per class). The implications for computational social studies are exciting, and we have made our code available to help others extend BERT classification in other contexts.

This methodological experiment proved useful for helping to identify and interrogate patterns in a large dataset of content moderation outcomes. Content moderation is a notoriously opaque area, where the training materials and performance of human moderation teams are kept confidential, as are the details of the classification systems that prioritise content for review and, in some circumstances, remove content directly. As nations around the world continue to pressure platforms to take a more active role in moderating harmful content, it will become increasingly important to develop mechanisms to hold these moderation systems to account. YouTube’s complex and highly automated copyright moderation systems are an excellent case study to develop and hone new methods.

A key benefit of our methodology is that it allows for the identification of trends in content moderation that would not be evident in small sample sizes or through experimental uploading. Our study has shown the potential to undertake large scale quantitative analysis on these systems at a level of detail that has so far not been possible. Most importantly, we hope that this methodological approach proves useful in the future for researchers who may undertake longitudinal analyses or undertake further detailed qualitative study of particular controversies.

As for YouTube’s copyright enforcement system itself, we have only been able to scratch the surface with this analysis, and we must leave some further detailed investigation for future work. It is clear, however, that both the Content ID and DMCA takedown systems are used with a greater degree of discretion than was previously apparent: at an aggregate level, rightsholders make different decisions in relation to different types of videos. It also seems that in aggregate, the Content ID and DMCA systems are working relatively well to remove apparently infringing content from YouTube. Our study does, however, raise some concerns about potential misidentification and over blocking, particularly in the sports highlights category, as well as the large amount of discretion that music rightsholders are able to exercise to choose to block all types of content – including material such as gameplay that is unlikely to compete in the market for recorded music.

Identifying the factors that affect the decisions of rightsholders to remove content is an important area for further study, since these decisions have ramifications for freedom of expression and access to information. We suggest that future studies might fruitfully develop finer-grained classification categories and seek to collect more extensive metadata in order to facilitate more extensive qualitative analyses of the types of content that are most likely to be blocked. The high discretion and potential lack of contextual sensitivity evident in these systems is something that policymakers too should clearly evaluate and address before encouraging platforms to rely to a much greater extent on automated content moderation tools, either for copyright or for issues like hate speech and abuse.

Acknowledgements

We thank Rosalie Gillett for outstanding research assistance.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Suzor is the recipient of an Australian Research Council DECRA Fellowship (project number DE160101542). This research is also supported by an ARC Discovery Projects grant (DP170100122). This research was supported with Cloud TPUs from Google’s TensorFlow Research Cloud (TFRC).

ORCID iD

Nicolas P Suzor https://orcid.org/0000-0003-3029-0646

Notes

1. See U.S.C. 17 s 512(c).
2. See https://developers.google.com/youtube/v3/docs/search/list
3. https://github.com/qut-dmrc/short_text_analysis

References

Adhikari A, Ram A, Tang R, et al. (2019) DocBERT: BERT for Document Classification. Available at: https://arxiv.org/abs/1904.08398v1 (accessed 20 April 2019).
Bar-Ziv S and Elkin-Koren N (2018) Behind the scenes of online copyright enforcement: Empirical evidence on notice & takedown. Connecticut Law Review 50: 339–385.
Binns R, Veale M, Van Kleek M, et al. (2017) Like trainer, like bot? Inheritance of bias in algorithmic content moderation. In: International conference on social informatics, 2017, pp. 405–415. Berlin: Springer.
Birmingham J and David M (2011) Live-streaming: Will football fans continue to be more law abiding than music fans? Sport in Society 14(1): 69–80. DOI: 10.1080/17430437.2011.530011
Black J (2001) Decentring regulation: Understanding the role of regulation and self-regulation in a “post-regulatory” world. Current Legal Problems 54(1): 103.
Blei DM, Ng AY and Jordan MI (2003) Latent dirichlet allocation. Journal of Machine Learning Research 3(Jan): 993–1022.
Boroughf B (2015) The next great YouTube: Improving content ID to foster creativity, cooperation and fair compensation. Albany Law Journal of Science & Technology 25(1): 95.
Bridy A (2016) Copyright’s digital deputies: DMCA-plus enforcement by internet intermediaries. In: Rothchild J (ed.) Research Handbook on Electronic Commerce Law. Edward Elgar, pp. 185–208.
Bridy A (forthcoming) Addressing infringement: Developments in the US and the DNS. In: Frosio G (ed.) The Oxford Handbook of Online Intermediary Liability. London: Oxford University Press. Available at: https://ssrn.com/abstract=3264879 (accessed 15 April 2020)
Buckland M and Gey F (1994) The relationship between recall and precision. Journal of the American Society for Information Science 45(1): 12–19. DOI: 10.1002/(SICI)1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L
Burgess J (2015) From ‘broadcast yourself’ to ‘follow your interests’: Making over social media. International Journal of Cultural Studies 18(3): 281–285.
Burgess J and Green J (2018) Youtube: Online Video and Participatory Culture. 2nd ed. Digital Media and Society. Cambridge: Polity Press.
Burk DL (2003) Anti-circumvention misuse. IEEE Technology and Society Magazine 22(2): 40–47.
Citron DK (2007) Technological due process. Washington University Law Review 85: 1249.
Devlin J, Chang M-W, Lee K, et al. (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. Available at: https://arxiv.org/abs/1810.04805v1 (accessed 2 March 2019).
Diakopoulos N (2015) Algorithmic accountability: Journalistic investigation of computational power structures. Digital Journalism 3(3): 398–415.
Dootson P and Suzor N (2015) Game of clones and the Australia tax: Divergent views about copyright business models and the willingness of Australian consumers to infringe. The University of New South Wales Law Journal 38: 206–329.
Elkin-Koren N (2014) After twenty years: Revisiting copyright liability of online intermediaries. In: Frankel S and Gervais DJ (eds) The Evolution and Equilibrium of Copyright in the Digital Age. Cambridge: Cambridge University Press, pp. 29–51.
Erickson K and Kretschmer M (2018) “This video is unavailable”: Analyzing copyright takedown of user-generated content on YouTube. Journal of Intellectual Property, Information Technology and Electronic Commerce 9(1): 75–89. http://www.jipitec.eu/issues/jipitec-9-1-2018/4680
European Commission (2018) Proposal for a Regulation of the European Parliament and of the Council on preventing the dissemination of terrorist content online. 2018/0331 (COD), 12 September. Available at: https://ec.europa.eu/commission/sites/beta-political/files/soteu2018-preventing-terrorist-content-online-regulation-640_en.pdf (accessed 15 April 2020)
European Union Parliament (2018) Amendments adopted by the European Parliament on 12 September 2018 on the proposal for a directive of the European Parliament and of the Council on copyright in the Digital Single Market COM(2016)059-C8-0383/2016-2016/0280(COD).
Fiala L and Husovec M (2018) Using experimental evidence to design optimal notice and takedown process. TILEC Discussion Paper DP2018-028.
Garrett R (2016) Comments of The Professional Sports Organizations in the Matter of Section 512 Study before the Copyright Office Library of Congress. 2015–2017, Washington, DC.
Gillespie T (2007) Wired Shut: Copyright and the Shape of Digital Culture. Cambridge, MA: MIT Press.
Gillespie T (2018) Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions that Shape Social Media. 1st ed. New Haven, CT: Yale University Press.
Google (2016) How Google Fights Piracy. Available at: https://drive.google.com/file/d/0BwxyRPFduTN2TmpGajJ6TnRLaDA/view (accessed 15 April 2020).
Gray JE (2020) Google Rules: The History and Future of Copyright under the Influence of Google. New York, NY: Oxford University Press.
Heller MA (1998) The tragedy of the anticommons: Property in the transition from Marx to markets. Harvard Law Review 111(3): 621–688. DOI: 10.2307/1342203
Hull MR (2010) Sports leagues’ new social media policies: Enforcement under copyright law and state law. The Columbia Journal of Law & the Arts 34: 457.
Jacques S, Garstka K, Hviid M, et al. (2018) An empirical study of the use of automated anti-piracy systems and their consequences for cultural diversity. Script-Ed 15(2): 277–312. DOI: 10.2966/scrip.150218.277
Jones AJ (2017) By throwing fans and writers in Twitter jail, sports leagues are abusing the law. The Guardian, 13 April. Available at: https://www.theguardian.com/sport/2017/apr/13/twitter-sports-leagues-rules-journalist-account-suspensions (accessed 19 March 2019).
Kohl U (2013) Google: The rise and rise of online intermediaries in the governance of the internet and beyond (part 2). International Journal of Law and Information Technology 21: 187.
McSherry C (2014) Lawrence Lessig settles fair use lawsuit over phoenix music snippets. In: eff.org. Available at: https://www.eff.org/press/releases/lawrence-lessig-settles-fair-use-lawsuit-over-phoenix-music-snippets (accessed 15 April 2020)
Matsui S (2016) Does it have to be a copyright infringement: Live game streaming and copyright. Texas Intellectual Property Law Journal 24: 215.
Mellis MJ (2007) Internet piracy of live sports telecasts. Marquette Sports Law Review 18: 259.
Mou L, Meng Z, Yan R, et al. (2016) How transferable are neural networks in NLP applications? arXiv:1603.06111 [cs]. Available at: http://arxiv.org/abs/1603.06111 (accessed 10 April 2019).
Nas S (2004) The Multatuli project: ISP notice & take down. In: 4th international system administration and network engineering conference, Amsterdam, Netherlands, September 2004.
Pariser J (2016) Comments of the Motion Picture Association of America before the Library of Congress United States Copyright Office in the Matter of Requests for Comments on United States Copyright Office Section 512 Study. Docket No. 2015-7, 1 April.
Pasquale F (2015) The Black Box Society. Cambridge, MA: Harvard University Press.
Patry W (2009) Moral Panics and the Copyright Wars. Toronto: Oxford University Press.
Pedregosa F, Varoquaux G, Gramfort A, et al. (2011) Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12: 2825–2830.
Perel M and Elkin-Koren N (2016) Accountability in algorithmic copyright enforcement. Stanford Technology Law Review 19: 473–532.
Perel M and Elkin-Koren N (2017) Black box tinkering: Beyond disclosure in algorithmic enforcement. Florida Law Review 69(1): 181.
Reda J (2019) Upload filters. Available at: https://juliareda.eu/eu-copyright-reform/censorship-machines/ (accessed 11 April 2019).
Seng D (2014) The state of the discordant union: An empirical analysis of DMCA takedown notices. Virginia Journal of Law and Technology 18: 369.
Seng D (2015) ‘Who Watches the Watchmen?’ An Empirical Analysis of Errors in DMCA Takedown Notices. ID 2563202, SSRN Scholarly Paper, 23 January. Rochester, NY: Social Science Research Network. Available at: http://papers.ssrn.com/abstract=2563202 (accessed 25 May 2016).
Sokolova M and Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4): 427–437. DOI: 10.1016/j.ipm.2009.03.002
Steyvers M and Griffiths T (2007) Probabilistic topic models. Handbook of Latent Semantic Analysis 427(7): 424–440.
Suzor NP (n.d.) Understanding content moderation systems: New methods to understand internet governance at scale, over time, and across platforms. In: Whalen R (ed.) Computational Legal Studies: The Promise and Challenge of Data-Driven Legal Research. Cheltenham: Edward Elgar Publishing, pp. 1–19. Available at: https://eprints.qut.edu.au/129464/ (accessed 15 April 2020).
Suzor NP (2019) Lawless: The Secret Rules That Govern Our Digital Lives. Cambridge: Cambridge University Press.
Suzor NP, West SM, Quodling A, et al. (2019) What do we mean when we talk about transparency? Toward meaningful transparency in commercial content moderation. International Journal of Communication 13: 18.
Taylor I (2015) Video games, fair use and the internet: The plight of the let’s play. University of Illinois Journal of Law, Technology & Policy 1: [i]272.
Taylor TL (2018) Watch Me Play: Twitch and the Rise of Game Live Streaming. Princeton, NJ: Princeton University Press.
Tehranian J (2011) Infringement Nation: Copyright 2.0 and You. New York, NY: Oxford University Press. Available at: https://trove.nla.gov.au/work/38120261 (accessed 18 April 2019).
Tushnet R (2014) All of this has happened before and all of this will happen again: Innovation in copyright licensing. Berkeley Technology Law Journal 29(3): 1447.
Urban JM, Karaganis J and Schofield BL (2017) Notice and Takedown in Everyday Practice. 22 March. Available at: https://papers.ssrn.com/abstract=2755628 (accessed 18 September 2017).
Urban JM and Quilter L (2006) Efficient process or ‘Chilling effects’? Takedown notices under section 512 of the digital millennium copyright act. Santa Clara Computer and High Technology Law Journal 22: 621.
Van der Sar E (2018) Movie Company Demands €200,000 From YouTube Over Pirated Film. Available at: https://torrentfreak.com/movie-company-demands-e200000-from-youtube-over-pirated-film-181129/ (accessed 19 March 2019).
Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al. (eds) Advances in Neural Information Processing Systems 30. New York, NY: Curran Associates, Inc., pp. 5998–6008. Available at: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf (accessed 10 April 2019).
Wang S (2015) Fair use or copyright infringement? Deadspin and SB Nation get tossed off Twitter for NFL GIFs. In: Nieman Lab. Available at: http://www.niemanlab.org/2015/10/fair-use-or-copyright-infringement-deadspin-and-sb-nation-get-tossed-off-twitter-for-nfl-gifs/ (accessed 19 March 2019).
Wu T (2007) Tolerated use. Columbia Journal of Law & Arts 31: 617.
Yafit L-A (2013) Copyright lawmaking and public choice: From legislative battles to private ordering. Harvard Journal of Law & Technology 27: 203.
YouTube (n.d.) Policies on harmful or dangerous content. Available at: https://support.google.com/youtube/answer/2801964?hl=en-GB (accessed 19 April 2019).
Zarsky T (2016) The trouble with algorithmic decisions: An analytic road map to examine efficiency and fairness in automated and opaque decision making. Ziewitz M (ed.). Science, Technology, & Human Values 41(1): 118–132. DOI: 10.1177/0162243915605575
Zimmerman DL (2014) Copyright and social media: A tale of legislative abdication. Pace Law Review 35(1): 260.

Appendix 1

Logistic regression using maximum likelihood, predicting outcomes two weeks after a video is published on YouTube. N = 750,000 (150,000 in each class, combining live sport and sport highlights).
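As a reading aid for the coefficient tables in this appendix, the following sketch shows how a logistic regression coefficient (a log odds value) can be turned into the "times more likely" and "% less likely" figures quoted in the text. It is an illustration only, not the authors' code: the function names are hypothetical, and the signs attached to the coefficients are assumptions inferred from the direction stated in the prose (for example, gameplay videos being less likely to be blocked than unclassified videos implies a negative coefficient).

```python
import math


def odds_ratio(log_odds: float) -> float:
    """Convert a logistic regression coefficient (log odds) into an odds ratio."""
    return math.exp(log_odds)


def percent_change_in_odds(log_odds: float) -> float:
    """Percentage change in the odds relative to the baseline category."""
    return (math.exp(log_odds) - 1.0) * 100.0


# Coefficient magnitudes come from the appendix tables.
# The signs are ASSUMED from the direction described in the text.
EXAMPLES = [
    ("Full movies, Content ID block", 2.4416),
    ("Full movies, DMCA takedown", 3.4000),
    ("Game play, Content ID block", -1.7866),
    ("Game play, DMCA takedown", -2.6427),
    ("Game play, Content ID block by music claimant", 0.3422),
]

if __name__ == "__main__":
    for label, coefficient in EXAMPLES:
        print(
            f"{label}: odds ratio {odds_ratio(coefficient):.2f}, "
            f"change in odds {percent_change_in_odds(coefficient):+.1f}%"
        )
```

Under these sign assumptions, the conversions line up with the figures reported in the body of the article: roughly 11.5 times (Content ID) and nearly 30 times (DMCA) for full movies, about 83% and 93% lower for gameplay, and about 41% higher for music claimants blocking gameplay. Strictly, these are ratios of odds rather than of probabilities; the two are close here because the removal outcomes are rare.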
Content ID block Coefficient Std err Zp> |z| [0.025 0.975] Intercept 5.0784 0.038 135.166 0.000 5.152 5.005 Game play 1.7866 0.098 18.204 0.000 1.979 1.594 Hacks 2.1544 0.151 14.253 0.000 2.451 1.858 Full movies 2.4416 0.042 58.218 0.000 2.359 2.524 Sports highlights 1.3533 0.044 30.591 0.000 1.267 1.440 Live sports 1.3947 0.087 15.993 0.000 1.224 1.566 Has link (non-major site) 0.8071 0.245 3.297 0.001 1.287 0.327 Game play with link 0.2663 0.781 0.341 0.733 1.265 1.798 Hacks with link 0.3992 0.451 0.885 0.376 0.484 1.283 Full Movies with link 1.0823 0.261 4.139 0.000 1.595 0.570 (continued) 14 Big Data & Society Continued Content ID block Coefficient Std err Zp> |z| [0.025 0.975] Sports highlights with link 0.3204 0.301 1.063 0.288 0.911 0.270 Live sports with link 1.1672 0.397 2.939 0.003 1.946 0.389 Content ID block by music claimant Coefficient Std err zp> |z| [0.025 0.975] Intercept 5.4055 0.044 122.270 0.000 5.492 5.319 Game play 0.3422 0.058 5.941 0.000 0.229 0.455 Hacks 1.2743 0.120 10.659 0.000 1.509 1.040 Full movies 0.3561 0.075 4.762 0.000 0.210 0.503 Sports highlights 0.5054 0.082 6.172 0.000 0.666 0.345 Live sports 0.3106 0.164 1.895 0.058 0.011 0.632 Has link (non-major site) 1.0114 0.318 3.176 0.001 1.635 0.387 Game play with link 0.5806 0.428 1.357 0.175 0.258 1.420 Hacks with link 0.3458 0.543 0.637 0.524 1.410 0.719 Full Movies with link 2.2410 0.674 3.324 0.001 3.562 0.920 Sports highlights with link 0.7243 0.774 0.935 0.350 2.242 0.793 Live sports with link 1.1186 0.751 1.489 0.136 2.591 0.353 DMCA takedown Coefficient Std err zp> |z| [0.025 0.975] Intercept 6.4466 0.074 86.771 0.000 6.592 6.301 Game play 2.6427 0.285 9.259 0.000 3.202 2.083 Hacks 0.0452 0.122 0.370 0.711 0.194 0.284 Full movies 3.4000 0.078 43.781 0.000 3.248 3.552 Sports highlights 2.3582 0.079 29.715 0.000 2.203 2.514 Live sports 2.8983 0.105 27.699 0.000 2.693 3.103 Has link (non-major site) 0.0810 0.316 0.256 0.798 0.539 0.701 Game play with link 0.5974 1.287 0.464 0.643 
1.926 3.120 Hacks with link 0.5792 0.357 1.621 0.105 0.121 1.279 Full Movies with link 0.4701 0.319 1.474 0.140 0.155 1.095 Sports highlights with link 1.4678 0.397 3.702 0.000 2.245 0.691 Live sports with link 0.0851 0.342 0.249 0.803 0.755 0.585 Not available: account terminated Coefficient Std err zp> |z| [0.025 0.975] Intercept 3.3786 0.016 207.512 0.000 3.411 3.347 Game play 2.3369 0.054 43.581 0.000 2.442 2.232 Hacks 2.7222 0.018 154.512 0.000 2.688 2.757 Full movies 2.6726 0.018 145.957 0.000 2.637 2.708 Sports highlights 0.2667 0.024 11.209 0.000 0.220 0.313 Live sports 2.5291 0.028 91.026 0.000 2.475 2.584 Has link (non-major site) 2.1893 0.031 70.659 0.000 2.129 2.250 Game play with link 0.5974 0.100 5.956 0.000 0.401 0.794 Hacks with link 1.6208 0.034 48.262 0.000 1.687 1.555 Full Movies with link 0.1806 0.034 5.349 0.000 0.247 0.114 Sports highlights with link 1.6132 0.067 24.124 0.000 1.744 1.482 Live sports with link 0.2633 0.043 6.122 0.000 0.179 0.348 Removed by user Coefficient Std err zp> |z| [0.025 0.975] Intercept 1.7246 0.008 226.892 0.000 1.739 1.710 Game play 0.2996 0.010 29.665 0.000 0.280 0.319 Hacks 0.4118 0.014 28.811 0.000 0.440 0.384 Full movies 0.4205 0.013 32.599 0.000 0.395 0.446 Sports highlights 0.1197 0.012 10.321 0.000 0.097 0.142 Live sports 1.4245 0.020 69.963 0.000 1.385 1.464 Has link (non-major site) 0.1254 0.035 3.546 0.000 0.195 0.056 Game play with link 0.1104 0.055 2.011 0.044 0.003 0.218 (continued) Gray and Suzor 15 Continued Content ID block Coefficient Std err Zp> |z| [0.025 0.975] Hacks with link 1.2022 0.040 29.819 0.000 1.123 1.281 Full Movies with link 1.0583 0.050 21.210 0.000 1.156 0.961 Sports highlights with link 0.5631 0.062 9.019 0.000 0.686 0.441 Live sports with link 1.4167 0.045 31.327 0.000 1.328 1.505 ToS takedown Coefficient Std err zp> |z| [0.025 0.975] Intercept 5.1467 0.039 132.410 0.000 5.223 5.071 Game play 2.6983 0.153 17.640 0.000 2.998 2.398 Hacks 2.2720 0.042 53.543 0.000 2.189 2.355 Full 
movies 3.2051 0.041 77.833 0.000 3.124 3.286 Sports highlights 1.2102 0.094 12.810 0.000 1.395 1.025 Live sports 2.2584 0.066 34.084 0.000 2.128 2.388 Has link (non-major site) 1.5245 0.088 17.327 0.000 1.352 1.697 Game play with link 1.5332 0.264 5.817 0.000 1.017 2.050 Hacks with link 0.9784 0.093 10.495 0.000 1.161 0.796 Full Movies with link 0.1911 0.090 2.113 0.035 0.368 0.014 Sports highlights with link 0.0363 0.215 0.169 0.866 0.457 0.384 Live sports with link 0.9252 0.107 8.652 0.000 0.716 1.135 http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Big Data & Society SAGE
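As a reading aid for the Appendix 1 tables, the short Python sketch below exponentiates the four 'Full movies' coefficients printed above to turn log odds into odds ratios. It assumes that the 'times more likely' multiples quoted in the article's discussion of full movies are exponentiated coefficients; the article does not say so explicitly, so treat the match as a plausibility check only. The names `full_movies_log_odds`, `odds_ratio` and `odds_ratios` are illustrative and not from the article.

```python
import math

# 'Full movies' coefficients (log odds) as printed in the Appendix 1 tables.
full_movies_log_odds = {
    "Content ID block": 2.4416,
    "DMCA takedown": 3.4000,
    "Account terminated": 2.6726,
    "ToS takedown": 3.2051,
}


def odds_ratio(log_odds: float) -> float:
    """Exponentiate a logistic regression coefficient to obtain an odds ratio."""
    return math.exp(log_odds)


# Assumption: the baseline is the reference class of unclassified videos,
# so each ratio compares 'Full movies' against that baseline.
odds_ratios = {
    outcome: odds_ratio(beta) for outcome, beta in full_movies_log_odds.items()
}

for outcome, ratio in odds_ratios.items():
    print(f"{outcome}: odds ratio of about {ratio:.1f}")
```

Rounded to one decimal place, the four ratios come out at about 11.5, 30.0, 14.5 and 24.7. These line up with the multiples the article reports for Content ID blocks, DMCA notices, account terminations and terms of service removals in the full movies category.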

Playing with machines: Using machine learning to understand automated copyright enforcement at scale

Big Data & Society, Volume 7 (1): 1 – Apr 28, 2020

Publisher
SAGE
Copyright
Copyright © 2022 by SAGE Publications Ltd, unless otherwise noted. Manuscript content on this site is licensed under Creative Commons Licenses.
ISSN
2053-9517
eISSN
2053-9517
DOI
10.1177/2053951720919963

Abstract

This article presents the results of methodological experimentation that utilises machine learning to investigate automated copyright enforcement on YouTube. Using a dataset of 76.7 million YouTube videos, we explore how digital and computational methods can be leveraged to better understand content moderation and copyright enforcement at a large scale. We used the BERT language model to train a machine learning classifier to identify videos in categories that reflect ongoing controversies in copyright takedowns. We use this to explore, in a granular way, how copyright is enforced on YouTube, using both statistical methods and qualitative analysis of our categorised dataset. We provide a large-scale systematic analysis of removal rates from Content ID’s automated detection system and the largely automated, text search based, Digital Millennium Copyright Act notice and takedown system. These are complex systems that are often difficult to analyse, and YouTube only makes available data at high levels of abstraction. Our analysis provides a comparison of different types of automation in content moderation, and we show how these different systems play out across different categories of content. We hope that this work provides a methodological base for continued experimentation with the use of digital and computational methods to enable large-scale analysis of the operation of automated systems.

Keywords
Machine learning, copyright enforcement, YouTube, content moderation, automated decision-making, Content ID

This article is a part of special theme on The Turn to AI. To see a full list of all articles in this special theme, please click here: https://journals.sagepub.com/page/bds/collections/theturntoai

Creative Industries Faculty, Queensland University of Technology, Brisbane, Australia
Faculty of Law, Queensland University of Technology, Brisbane, Australia

Corresponding author: Nicolas P Suzor, Faculty of Law, Queensland University of Technology, GPO Box 2434, Brisbane, Queensland 4001, Australia. Email: n.suzor@qut.edu.au

Creative Commons CC BY: This article is distributed under the terms of the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/) which permits any use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access pages (https://us.sagepub.com/en-us/nam/open-access-at-sage).

How can we understand how massive content moderation systems work? The major social media platforms use a combination of human and automated processes to efficiently evaluate content that their users post against the rules of the platform and applicable laws. These sociotechnical systems are notoriously difficult to understand – we can see their results in individual cases, but their inner workings and systemic impact are often obscured (Gillespie, 2018). Most major platforms provide regular transparency reports, but these mainly provide high-level aggregations that are insufficient to really probe the contours and social effects of moderation systems (Suzor et al., 2019). This is a challenge that is widely acknowledged and quickly becoming more pressing; the lack of good information about how our shared digital environments are governed – what information is available, removed, and made more or less visible – has led to serious concerns about the potential for bias and the flow of harmful content (Suzor, 2019). This is likely to become more important and more difficult as nations continue to ask platforms to do more to regulate social media content – and to regulate more quickly by applying machine learning to filter material automatically or prioritise content for review.

In this article, we investigate copyright enforcement on YouTube as an important case study of a sophisticated and complex set of processes that are heavily automated and remain highly controversial. YouTube is a major target for the Digital Millennium Copyright Act (DMCA) notice and takedown copyright enforcement system, and it has also built Content ID, one of the most extensive automated systems for detecting copyright infringement. There are major concerns that YouTube’s enforcement system frequently incorrectly removes videos, at substantial cost to freedom of expression (Tushnet, 2014: 1461). Nevertheless, Content ID now serves as a model for the further deployment of ‘upload filters’ (Reda, 2019), and under the European Union’s recently approved copyright directive, more platforms will have a strong incentive to deploy automated systems that can monitor potential copyright infringement by their users (European Union Parliament, 2018). There is also real potential for these tools to be applied to censor or moderate other types of content, including hate speech and abuse (European Commission, 2018).

In this study, we use digital methods to try to make the content moderation system on YouTube – a system that relies on both automated and discretionary decision-making and that is applied to varying types of video content – more legible for research. Starting with a random sample of the text metadata of 76.7 million YouTube videos that includes information about whether and why each video was removed or blocked, we developed a machine learning classifier to categorise these videos into four categories. The categories represent ongoing controversies over online copyright enforcement: full movies, gameplay, sports content, and tutorials on copy control circumvention (hacks, cracks, and exploits). The core methodological problem that we confronted was how to reliably identify, from a very large dataset of videos, relatively small subsets of removed videos that fell within our content categories. By solving this problem, we were able to examine how different types of automated and discretionary enforcement operate differentially within different categories of video content.

Essentially, in this study, we explore how digital and computational methods can be usefully combined with statistical and rich qualitative methods to study the digital traces of large-scale content moderation systems. We hope that this work will help to inform continual experimentation in the development of a new set of methodological approaches for interrogating pressing public policy research questions in the context of large-scale automated decision-making in digital media environments.

Background

YouTube’s baseline legal obligations for enforcing copyright are set out by the notice-and-takedown system established under the United States DMCA legislation and propagated around the world. Notice-and-takedown has become an extremely important industrial mechanism for enforcing copyright; copyright owners employ rights management companies who use automated search tools to send hundreds of complaints of notices every year (Urban et al., 2017). Google, like other major targets of notice-and-takedown, has had to develop streamlined automated processes to deal with the massive volume of complaints it receives, but there is little public detail about how these processes work.

In practice, YouTube goes beyond its obligations under the DMCA when enforcing copyright, and has developed a series of additional tools and privately negotiated systems and policies (Bridy, forthcoming). The most visible of these tools is Content ID, YouTube’s automated rights management system that allows rightsholders to block, monetise, mute or track videos that contain their works. Rightsholders provide YouTube with a reference file of their work and the Content ID algorithm scans videos that are uploaded to YouTube to see if a match can be found in the database of reference files. Google reports that on YouTube, 98% of copyright matters are decided by Content ID (Google, 2016).

Automated, privatised copyright enforcement is an example of a controversial institutional shift from public to private modes of regulation. Private regulatory modes are controversial because they tend to lack important democratic features and due process safeguards (Black, 2001: 143; Zimmerman, 2014: 273). Private, automated regulatory systems in particular can lead to institutional convergence (Perel and Elkin-Koren, 2016: 481); that is, a tendency for lawmaking, enforcement and adjudication to occur in centralised private modes rather than through separate legislative and judicial institutions. The increased use of automation in private regulation raises additional concerns about transparency, accountability, and protection for fundamental rights (Citron, 2007; Elkin-Koren, 2014).

There are multiple sources of opacity – institutional, legal and technological – that make it difficult to evaluate automated private regulatory systems (Diakopoulos, 2015; Zarsky, 2016). Trade secret laws often prevent public access to these systems (Perel and Elkin-Koren, 2017). Their workings typically encode rules and priorities privately negotiated between stakeholders; Content ID, for example, was developed through partnerships between YouTube and large entertainment companies (Tushnet, 2014; Yafit, 2013: 248). Automated decision-making processes frequently occur in a ‘black box’ that is difficult or impossible to interrogate (Pasquale, 2015). The data and rules that human and algorithmic moderators are trained on are secret, and the outputs of these systems are often obscured by platforms who seek to avoid public scrutiny (Gillespie, 2018). Increasing accountability requires real transparency, but improving transparency will require substantial improvements in methodologies and collaborations (Suzor et al., 2019).

A second major controversy that marks automated decision-making is the capacity for error and bias. When undertaken on a very large scale, even very low rates of error in automated decision-making can create very large problems (Urban et al., 2017: 35). Automated decision-making systems are typically blunt instruments, operating with narrow objectives that can introduce systematic bias, and incapable of accounting for the full context that might impact human decision-making (Binns et al., 2017). For example, if Content ID’s primary objective is to enforce and commodify the use of content on YouTube, without the ability to distinguish between non-infringing and infringing uses, it risks serving the interests of large rightsholders at the expense of end users or smaller creators (Bridy, 2016; Kohl, 2013: 220).

In copyright governance, as more advanced automated enforcement systems have been deployed to protect the proprietary interests of media and entertainment companies, controversies have arisen over the potential for these systems to prioritise efficiency over accuracy (Gray, 2020). Content ID on YouTube, in particular, has been criticised for removing from YouTube content not subject to copyright protection such as reviews, parodies and educational videos that fall under a fair dealing or fair use exception (McSherry, 2014). Similarly, Content ID has been criticised for capturing content that rightsholders would have, if left to their discretion, not removed, such as gameplay videos (Boroughf, 2015). So long as rightsholders continue to pressure platforms to implement more streamlined and efficient systems for removing content from the internet, the potential for over-enforcement of copyright online will remain a contentious issue.

The study of copyright takedowns has been a vibrant area for scholarship. Existing work has utilised a broad range of methodologies, including interviewing stakeholders (Urban et al., 2017); experimental uploading and interacting with platforms (Nas, 2004; Perel and Elkin-Koren, 2017); the analysis of information made available at the discretion of platforms or during legal disputes (Bar-Ziv and Elkin-Koren, 2018; Seng, 2014, 2015; Tushnet, 2014; Urban et al., 2017; Urban and Quilter, 2006); tracking discrete samples of publicly available material (Erickson and Kretschmer, 2018; Jacques et al., 2018); and laboratory experimentation (Fiala and Husovec, 2018). To this body of work, our study contributes a machine learning methodology that helps to identify particular categories of content for further qualitative and statistical analysis at a very large scale. Our hope is that this new approach will improve understanding of how content moderation systems that combine both automated and discretionary decision-making operate in practice, across different types of video content.

Interrogating content moderation at scale

This article has two main aims. First, we seek to compare trends in copyright takedowns, Content ID blocks, and terms of service (ToS) removals across several key controversial issues. We set out to categorise videos into topics that aligned with concerns in the literature about potential misuse of copyright enforcement mechanisms. By focusing on particular controversies, we hoped to be able to investigate the types of actors responsible for moderation within each category, as well as to compare rates of different types of moderation across categories and against the baseline average. Second, in order to undertake this analysis, we had to develop a methodology that could help us isolate videos in controversial categories from the much larger bulk of YouTube content.

One of the major challenges of understanding content moderation practices is the vast scale of content that is posted and moderated on major social media platforms. Overall rates of removal under the DMCA and Content ID on YouTube are relatively low – only approximately 1% of all videos uploaded are removed due to an apparent copyright violation (Suzor, n.d.). In order to examine trends within copyright removals, we required a very large initial sample and a suitable mechanism to isolate sufficiently large samples of videos that are likely to be relevant to the inquiry at hand. Given the extremely large volume and proportion of irrelevant videos, this is a task that is prohibitively difficult to undertake manually. A large-scale dataset is required to analyse and compare trends in types of content removal across different categories of content and it requires some form of computational analysis to assemble.

Methodology

Data collection

For this study, we used existing infrastructure (Suzor, n.d.) to obtain a random sample of metadata about YouTube videos and their availability. The infrastructure uses YouTube’s search API (‘list’ endpoint) to generate a random sample of YouTube videos as they are published. Each of these videos was tested approximately two weeks after it was first collected, and we logged whether it was still available, and if not, what reason YouTube provided for its removal. The two-week period was selected to provide a sufficiently long time for a copyright takedown or terms of service removal – given that these systems tend to be most controversial when content is new and therefore most visible and most commercially valuable. When YouTube removes a video, it provides a detailed error message that explains why the video is not available and, in the case of copyright takedowns, who was responsible for requesting that the video be blocked. When a video is blocked by Content ID, YouTube will still host a link to the video and will provide an error message explaining that the video was blocked due to a copyright claim. It is possible that our infrastructure under-counts Content ID matches where a video is blocked before it is published, but it appears for many videos that Content ID blocking occurs some minutes or hours after initial publication. YouTube does not provide detail about what proportion of Content ID blocks take effect before publication.

We categorised the distinct explanations for removal that YouTube provides into six groups: the video was available, the video was not available because of a Content ID block (either globally or in the jurisdiction in which our server is based), the video was not available because of a DMCA notice or equivalent (either globally or locally), the video was removed by YouTube for violating its Terms of Service or Community Guidelines, the user’s account was terminated by YouTube, or the user removed the video themselves. We discarded videos that were unavailable for technical or other (unknown) reasons. Our final dataset consists of title, description, availability status, and error messages for 76.7 million YouTube videos collected between October 2016 and February 2019 (See Figure 1).

Figure 1. Overall removal rates.

Topic selection

We first sought to divide the sample into categories of similar videos to better understand the types of videos that were removed for different reasons in this large dataset. We used topic modelling (Steyvers and Griffiths, 2007) on the title and description text of approximately 4000 videos (with English metadata) that had been removed, to explore the main types of content that are moderated. We used Latent Dirichlet Allocation (Blei et al., 2003; Pedregosa et al., 2011) to generate a series of topic models at varying degrees of granularity, ranging from 5 to 80 clusters. By varying the number of clusters and examining the most relevant words for each cluster, as well as a sample of documents that were most likely to fall into each cluster, we were able to identify some relatively strong groupings of topics for further analysis.

We used the topic modelling as a starting point to inductively identify some coherent clusters of similar videos that appeared to be frequently blocked. We cross-referenced these topics against controversial issues that have been identified in the literature, in order to develop a set of categories for further analysis. Ultimately, we developed a classification scheme for five categories, informed by existing controversies over copyright enforcement and notice and takedown:

• ‘Full movies’: videos that appear to conform to a traditional classification of movie piracy: full length copies of feature films (Pariser, 2016; Patry, 2009). In this category, we excluded movie trailers, movie soundtracks and videos containing only movie scenes.
• ‘Gameplay’: live streaming or recorded video game play. This category included reviews of games and game walkthroughs, tips, and guides that are vulnerable to takedown because they contain a lot of existing copyright content, including artwork, music, and dialogue (Boroughf, 2015; Burgess and Green, 2018). We excluded advertisements and other promotional content.
• ‘Sports’: videos of sporting event broadcasts either as live streams, recorded streams, or snippets. This category focused on recordings of televised sports content that is protected as copyright subject matter under most copyright regimes (Garrett, 2016; Jones, 2017). Premium sports content is highly lucrative, and is accordingly a hotly contested site for copyright enforcement. We excluded non-professional games recorded by users. We further subdivided this category to train the classifier to distinguish live broadcast streams from non-live sports content.
• ‘Hacks’: videos that provide tutorials on circumvention of Digital Rights Management (DRM) software and hardware, including game exploits, key and serial generators, and software cracks that can be used to infringe intellectual property rights (Gillespie, 2007). Sometimes called ‘paracopyright’, there are long-standing concerns that anti-circumvention law can be misused to reduce competition (Burk, 2003). Notably, the DMCA does not provide a procedure for notice and takedown of instructional material that teaches people to circumvent DRM – so it would be a sign of potential misuse if this category of videos had relatively high copyright takedown rates.

Classification

Our initial topic modelling was useful for exploration, but once we had identified our relevant categories, we elected to use a supervised classification technique to develop more robust samples for analysis. Recent advances in natural language processing have greatly improved the state of the art in text classification, improving the utility of deep neural networks in classification tasks. New techniques make use of ‘transfer learning’ – general models that are trained on very large existing datasets, and then fine-tuned for specific applications on much smaller labelled datasets (Mou et al., 2016). These general models are trained on large corpora to understand how different features of language relate to each other – learning, for example, how words are used in different senses by learning the contexts in which they appear in different sentences (Adhikari et al., 2019). We made use of the newly released Bidirectional Encoder Representations from Transformers (BERT), which has improved state of the art performance on many common natural language processing tasks (Devlin et al., 2018). BERT provides sentence-level representations trained on massive corpora of Wikipedia articles and digitised books. These pre-trained models allowed us to use supervised approaches to train a classifier to identify complex patterns within our specific dataset with relatively small training sets.

We used BERT to train a machine learning classifier to identify videos in each of our categories across our larger dataset. We use a transformer attention-based deep learning model – described as a ‘simple network architecture’ (Vaswani et al., 2017) – to perform classification on YouTube video titles and descriptions. We used the largest case-insensitive English-language pre-trained BERT model, with 24 encoder layers, 1024 hidden units, 16 heads, and 340 million parameters, and fine-tuned the model on Google’s Tensor Processing Unit cloud-based computing architecture.

We started by labelling a training set of example videos (title and description text) identified as likely to fall within our chosen categories by our topic model. We then went through several rounds of semi-supervised learning, where we ran our trained model to classify small batches of unlabelled records and manually corrected its results. We iteratively evaluated the model’s performance quantitatively by measuring its accuracy (f1 score – an average combined measure of false positives and false negatives in each category, weighted by the total number of results in each category) and qualitatively by closely examining the sample of predictions. We ultimately manually labelled approximately 10,000 videos to develop an adequate training set, although the majority of these (5801) were examples of videos that were not relevant to our categories. We were able to achieve satisfactory results using only between 420 and 1696 examples in each category.

To evaluate our model, we manually labelled each video in a sample of approximately 100 videos from each category (with the two sports categories combined) across the entire predicted dataset. The final results show an f1 weighted accuracy score of 90.6%. These results are quite good for our purposes – there are relatively few false positives within each category, and few instances of confusion between the four categories under scrutiny. We will discuss particular trends and limitations below in our qualitative analysis of each category but, notably, we found that a relatively low amount of manual labelling was required to produce an accurate machine learning classifier using the ‘transfer learning’ technique.

In the final stage, we deployed the trained model to categorise our entire random sample of 76.7 million YouTube videos. Our model found 12,943,693 unique videos fell into one of our four categories. We used multinomial logistic regression to examine the relationship between takedowns, our predicted categories, and additional variables in the metadata, including links to external sites. This statistical method enabled us to estimate the relative influence of different factors on the likelihood of videos in each category being removed through different mechanisms, and provides a basis from which to speculate about the factors that might influence decisions to block content across different categories. Most importantly, however, the classification technique allowed us to identify video metadata for manual qualitative investigation.

We undertook qualitative analysis of the metadata of a sample of 12,000 videos across our categories, including both removed and available videos, to explore the types of videos classified, and supplemented this with targeted samples on discrete questions. This analysis was first conducted by reading through the titles and descriptions of each of the videos in each category, paying attention to the choices made by uploaders to describe their videos and make them findable by others, as well as the types of content that the classifier assigned to each category. We then watched a manually selected subset of videos in each category until we were confident that we could understand the types of content that the classifier was identifying (or mis-identifying) in each category. Our analysis below focuses on how our findings relate to some of the key controversies around these categories of user-generated content and online copyright enforcement broadly.

Results and discussion

One of the most important findings of this study is that we can see, at a large scale, the rates at which Content ID is used to remove content from YouTube. Previous large-scale studies have provided important insights into rates of DMCA removals (e.g. Seng, 2014; Urban et al., 2017), but information about Content ID removals has remained imprecise, provided at a high level of abstraction by YouTube. In this article, we provide the first systematic analysis of Content ID removal rates, including comparisons with other removal types and across different categories of content. We note that where rightsholders have opted for monetisation through Content ID, they will be included in the ‘available’ category. YouTube does not make public the information necessary to determine which rightsholders have opted to monetise which videos.

Across our entire dataset, videos were most frequently removed from YouTube by users themselves, followed by removals due to an account termination and then Content ID blocks (See Table 1). DMCA takedowns were the least common removal type – Content ID removals occurred at seven times the rate of DMCA removals, and videos were on average five times more likely to be removed for terms of service violations than due to a DMCA notice. Our findings confirm the general trend identified in a recent study of Content ID and DMCA removals: in a sample of 1839 parody videos it was found that Content ID was used to block videos five times more frequently than DMCA notices (Jacques et al., 2018).

We built a simple Logistic Regression model to estimate the links between specific removal outcomes and the categories assigned by our classifier (See Table 2). We used 150,000 randomly selected records from each category to build the model, which estimates log odds of each type of removal for each category. We separated out music claimants (music rightsholders such as record labels or publishers) from the rest of the Content ID claims because music claimants’ behaviours tend to differ from other types of rightsholders quite significantly – they frequently make use of Content ID’s monetisation option rather than removing videos. The model also includes an interaction term for whether the video description has a link to an external website. We developed several models and selected this one based on its interpretability and its performance under a pseudo-R squared metric. The full regression results are in Appendix 1.

Overall, our analysis shows that in YouTube’s heavily automated content moderation system there is substantial discretionary decision-making, as well as a potential lack of contextual sensitivity. We found very high rates of removals for videos associated with film piracy and all types of sports content. We found that game publishers are largely not enforcing their rights against gameplay streams and that when gameplay videos are removed it is usually due to a claim by a music rightsholder. We also found high rates of removals in the hacks category but mostly for Terms of Service violations, which indicates that YouTube rather than rightsholders is more commonly taking action to remove content and terminate accounts that provide DRM anti-circumvention information.

Full movies

Film piracy has been one of the key issues of the ‘copyright wars’ (Patry, 2009). From early in its history, YouTube has been a central battleground in these wars, amidst major concern by both screen and music industries that YouTube’s core business model was built on copyright infringement (Burgess and Green, 2018). Content ID has cleaned up a lot of the direct copyright infringement on YouTube over the years, but there are ongoing concerns that YouTube continues to host copies of infringing content and that it provides a vector for infringement by directing viewers to streaming sites and filelockers where they can access copies of feature films. In particular, rightsholders continue to complain that movies that are removed often reappear quickly after removal (Van der Sar, 2018) and that filtering technologies like Content ID are vulnerable to gaming or circumvention by sophisticated users (Pariser, 2016: 19).

In our analysis, we sought to identify how well both Content ID and the DMCA process were working to keep what might appear to be clearly infringing content off YouTube. Overall, only 36% of videos in this category were available two weeks after they were first posted. When we break this down by removal type, we see that videos in this category are nearly 30 times more likely to be removed through a DMCA notice than the baseline for unclassified videos, and 11.5 times more likely to be blocked through Content ID. This category also had the highest rate of removals for terms of service violations (24.7 times more likely than baseline) and account terminations (14.5 times more likely).

The high rate of account terminations suggests that the accounts used to post full movies or links to full movies are likely to be ‘repeat infringers’, in the copyright terminology, or are either frequently or flagrantly in breach of YouTube’s rules. The top claimants in this category were a mix of film studios, who used Content ID directly, and third-party rights management companies, who are generally responsible for sending DMCA notices on behalf of producers.

transformative user-generated content, and videos made from computer games have long been a prime site of controversy (Burgess and Green, 2018). As game streaming took off, YouTube has become an important (if secondary) platform for live and recorded gameplay footage (Taylor, 2018). For many years, commentators have raised concerns about potential over-enforcement of copyright, because copyright law gives many different copyright owners the right to object to gameplay videos and it can be hard to evaluate a fair use claim (or equivalent; Taylor, 2015).

The most common type of content identified in this category was streams of game play, often with commentary by the player, either as a live broadcast or recording. There is some suggestion that early fears about the lawfulness of gameplay videos may have settled down as copyright owners have come to
accept and even embrace recorded footage and stream- From our qualitative analysis of the video metadata, ing videos (Matsui, 2016). In our data, we clearly see it was apparent that the primary types of content that game publishers are not enforcing their rights removed in this category were videos that purported against game streamers at any real scale: gameplay to be full copies of feature films hosted on YouTube videos are 83% less likely to be removed by a or videos that promoted links to third-party websites Content ID claim than an unclassified video, and apparently hosting streams of full copies of feature 93% less likely to be removed by a DMCA notice. In films. Many of the links in this category were to general, it appears that game streaming is an advanced URL shorteners, filelockers, and a long tail of domains case of ‘tolerated use’, where technically infringing con- that often appeared to be offering illicit downloads tent has become normalised and acceptable and streams, advertising farms, or malware sites, (Tehranian, 2011; Wu, 2007). amongst a lot more that were impossible to The problems with tolerated use arise when the efficiently classify. discretion to tolerate or remove content is exercised in To add these links to our regression, we created a a way that could stifle legitimate expression or reflects binary category for links. We excluded the most systemic biases. When videos in this category were common domains in our dataset, those that were not removed on copyright grounds, it was usually at the generally associated with copyright infringement (we request of large music companies – not game publishers. did so by excluding domains that received>1000 Worryingly, music claimants were 41% more likely to links in our entire classified dataset). We found that block gameplay videos than the uncategorised average. 
videos with links to sites that are not amongst the This may be a variant of the ‘tragedy of the most popular sites were much less likely to be removed anticommons’ (Heller, 1998): even if a large proportion by Content ID – between 66% and 89% less likely than of music is available to reuse on YouTube through videos without a link in this category. There was no Content ID’s licensing scheme, gameplay streams may statistically significant difference for DMCA notices. last for hours and include many different background From the video descriptions we examined, this appears songs, the owners of any of which can elect to block the to be evidence of uploaders seeking to use YouTube to entire stream. Since it can be difficult for streamers to gain the attention of users searching for illicit content know in advance which songs are made available to without uploading any infringing material in the video license, and particularly since there is little threat of itself – and therefore avoiding detection by Content ID. gameplay streams substituting in markets for The high rates of removals of all videos in this category recorded music, this is one area where Content ID under the DMCA, Terms of Service, and account ter- appears to remain a justified cause of frustration for minations by YouTube, however, regardless of whether ordinary users. they have a link or not, suggests that this may not be a Of the small proportion of videos that were major problem for rightsholders. removed for terms of service violations, or where the uploader’s account was terminated by YouTube, the Gameplay majority appeared to be misclassifications or overlap A persistent concern about both notice and takedown with the ‘hacks’ category: footage of cheating and and Content ID relates to how it might regulate exploits in video games. 
Our coding schema 8 Big Data & Society meant that guides on game exploits should be classified as ‘hacks’, not ‘game play’, but this is not always easy, and there is a degree of unavoidable overlap where users upload footage of cheating behaviour in multi- player games. Our manual validation, in Figure 2, confirms that this is a particular area of confusion for our classifier: approximately 4% of videos in each of these two categories were mistakenly predicted in the other. Sports Live sports continues to be one of the major areas of controversy over copyright infringement on the inter- net. Because live sporting events are immensely popular around the world, and access is often limited to premi- um cable channels, pay-per-view, and streaming offer- Figure 2. Manual validation of approximately 100 randomly ings (Garrett, 2016; Hull, 2010), we might expect a selected videos in each class (combining sports classes). great deal of unmet demand from consumers who are dissatisfied with or cannot afford premium offerings, which in turn may lead to increased copyright infringe- ment (Birmingham and David, 2011; Dootson and Suzor, 2015). Over the past decade, the infringement Table 1. Removal rates, all categories, all removal types. Terms of Account Content ID DMCA Removed Service(ToS) Category Available terminated block takedown by user takedown Full movie 36.39% 42.49% 2.33% 2.02% 8.45% 8.32% Gameplay 79.71% 0.38% 0.57% 0.01% 19.28% 0.05% Hacks 54.26% 32.83% 0.10% 0.13% 9.08% 3.60% Sports 61.05% 14.97% 1.59% 1.05% 19.59% 1.75% YouTube average 78.36% 3.98% 0.77% 0.11% 14.70% 0.57% DMCA: Digital Millennium Copyright Act. Videos that are not available for technical or other reasons are included in the total but not in the table. Table 2. Log odds of removal for each category (baseline is the rate at which unclassified videos are available). 
Content ID DMCA Account Removed ToS Content ID (music claimant) takedown terminated by user takedown Intercept 5.08 5.41 6.45 3.38 1.72 5.15 Game play 1.79 0.34 2.64 2.34 0.30 2.70 Hacks 2.15 1.27 2.72 0.41 2.27 Full movies 2.44 0.36 3.40 2.67 0.42 3.21 Sports highlights 1.35 0.51 2.36 0.27 0.12 1.21 Live sports 1.39 2.90 2.53 1.42 2.26 Has link (non-major site) 0.81 1.01 2.19 0.13 1.52 Game play * link 0.60 1.53 Hacks * link 1.62 1.20 0.98 Full Movies * link 1.08 2.24 0.18 1.06 Sports highlights * link 1.47 1.61 0.56 Live sports * link 1.17 0.26 1.42 0.93 DMCA: Digital Millennium Copyright Act. Results with low statistical significance have been omitted. Gray and Suzor 9 of live telecasts of sports broadcasts has become an prepaid gift cards on app stores and ecommerce plat- increasingly pressing concern for sports organisations forms. The classifier sometimes struggled to distinguish and, seeking to protect an important revenue stream, game cheats that required circumvention from ordinary they have argued for stronger laws (Garrett, 2016: 2) cheats and exploits, although it appeared likely from the and new practices to prevent internet users from shar- descriptions that many of the supposed exploits we iden- ing streams of live broadcasts of their sporting events tified were bait designed to lure traffic towards malware, (Mellis, 2007). At the same time, however, the strict advertising farms, or paid services. copyright enforcement practices pursued by sports This category had very high removal rates overall – only 54% of videos were still available two weeks after organisations have caused the removal or monetisation of works such as reviews, gifs, memes and other poten- they were posted. Tensions over DRM generally and tially fair uses of sports broadcast content (Jones, 2017; circumvention tools in particular have been a constant Wang, 2015). 
feature of copyright debates over the last three decades, In this category, the types of videos identified by our and rightsholders have worked hard to ensure that classifier primarily included live streams of sporting other companies secure digital distribution channels events, as well as videos containing parts of full record- (Gillespie, 2007). For this category, we sought to ings of sporting matches, for a wide variety of sports, know whether copyright owners have been using the from professional leagues of football, basketball, tennis, DMCA notice and takedown process to remove infor- hockey, motor sports, wrestling and more. Videos in mation about the circumvention of DRM – which these categories were at high risk of removal by almost would be clearly beyond the scope of the takedown all avenues. Content ID removal rates were high for both regime. There was no evidence to support this – copy- live sports and highlights (4 times and 3.9 times more right owners were not sending notices at a statistically likely than baseline respectively), as well as for DMCA significant higher rate in this category compared to takedowns (18 times and 10.6 times baseline respective- unclassified videos, and the odds of removal for ly). Users who posted videos that appeared to be live Content ID were 72–88% lower. streams were at risk of having their account terminated Interestingly, however, videos in this category were 12.5 times more often than the baseline, and these videos nearly 10 times more likely to be removed for a breach of YouTube’s Terms of Service, and 15 times more were 9.6 times more likely to be removed for violating YouTube’s terms of service. Clearly both copyright likely to be removed because YouTube had terminated owners and YouTube are heavily active in policing the uploader’s account. YouTube’s Community sports content on YouTube. 
Guidelines prohibit instructional videos that ‘[show] From our qualitative analysis of video metadata in users how to bypass secure computer systems’, and it the non-live sports subcategory, there was little discern- clarifies that this includes ‘Showing users how to cir- ible difference between types of takedowns. The high rate cumvent payment processes to download software or of removals in this category is somewhat concerning, applications for free’ (YouTube, n.d.). This appears to since many clips of sports content may be lawful under be a rule that YouTube is enforcing quite extensively – fair use or other copyright exceptions, but this cannot be perhaps not surprising given Google’s interests in determined from the metadata. This area appears to be securing the Android ecosystem and its need to main- an important candidate for follow-up studies that are tain working relationships with many software devel- able to undertake fair use analyses (see e.g. Erickson opers and copyright owners. and Kretschmer, 2018; Jacques et al., 2018) to determine whether sufficient care is being taken by YouTube, its Limitations partners, and copyright owners and their agents when The most obvious limitation of this study is that we are evaluating whether to block sports videos. attempting to classify the content of videos based on the text and description fields. First, we only trained the clas- Hacks sifier on English-language metadata, and the language The final category we investigated primarily consists of model is optimised for English – so most of the results video tutorials about circumventing DRM. This includes are also English-language. Second, these fields are entirely guides about jailbreaking smartphones, generating serial generated by the user, and do not always accurately reflect numbers for software, and downloading cracked soft- the video content. Unfortunately, our infrastructure does ware with the copy-protection removed. 
Because the not collect full copies of videos, and we do not have the tools are closely related, the classifier also identified computational power to accurately classify or recognise videos about game exploits (modified versions of games themes from video content at a large scale. Nevertheless, that allow players to cheat, bypassing both copy protec- there are important benefits in classifying video content tion and anti-cheat software), and serial generators for on textual metadata, since these text fields are used to aid 10 Big Data & Society users who search on YouTube to discover relevant con- these types of videos are regulated. This high rate, how- tent. The more problematic limitation is that we are ever, suggests that there may be a problem of over- unable to include other metadata fields in our analysis – fitting: that our classification may be so tightly con- the YouTube Search API endpoint only returns a selected strained to identify the particular patterns of our train- ‘snippet’ of information. It would have been useful to ing videos that we miss other ways that people might have additional metadata to build our logistic regression share full versions of films on YouTube. This is a risk models (video duration, for example). It would have also we guarded against by qualitative exploration of search improved the accuracy of our classifier to have full length results on film titles on YouTube – we did not find any data for the video description field. This is an unfortunate major omissions that do not fit the patterns we identi- limitation of the API and the quota imposed by YouTube fied in developing our training sets – but we are not that was unavoidable in this study. able to exclude the possibility that our model is too An important limitation of our analysis strategy is narrow in identifying too few relevant videos. that we do not account for changes in enforcement Ultimately, this choice means that we cannot make rates over time. 
Given the small proportions of generalisations about the overall prevalence of videos videos in some of the categories we examine compared matching our categories on YouTube, but we can be to the total number of videos in our overall random more confident that the videos in each category are sample, we have elected not to further divide our cat- correctly classified. egorised sets into daily or monthly aggregations. Because content moderation systems of platforms are Conclusion continuously tweaked, and their features, cultures, and This experimental methodology has left us optimistic the behaviour of their users change over time (Burgess, about the potential to use machine learning classifiers 2015), and individual rightsholders may change their to better understand systems of algorithmic gover- takedown strategies, we suggest that future longitudi- nance. The methodology improves our understanding nal work could be extremely useful. of how an evolving system of both automated and dis- Another key limitation is that we have chosen to cretionary content moderation operates in practice. optimise our training data for precision over recall. Using a simple neural network infrastructure, a pre- Precision is a measure of the proportion of correct trained BERT language model, and substantial cloud classifications within each category (number of true processing power, we were able to achieve satisfactory positives/total of true and false positives), and recall performance on a multiclass classifier over short texts is a measure of the proportion of correctly identified (300 characters) with a relatively small number of classifications across the entire dataset (number of true training examples (between 420 and 1696 per class). positives/total of true positives and false negatives). 
In The implications for computational social studies are classification tasks, there is a trade-off between these exciting, and we have made our code available to help measures: increasing the overall number of correctly others extend BERT classification in other contexts. classified results in any category generally means This methodological experiment proved useful for including more incorrect results as well (Buckland helping to identify and interrogate patterns in a large and Gey, 1994). The specific goals of the analysis dataset of content moderation outcomes. Content should guide training and evaluation of a classification moderation is a notoriously opaque area, where the model (Sokolova and Lapalme, 2009). We focused on training materials and performance of human modera- improving the quality of predictions within each cate- tion teams are kept confidential, as are the details of the gory (minimising false positives), at the expense of not classification systems that prioritise content for review including some potentially relevant videos (false nega- and, in some circumstances, remove content directly. tives). We were therefore conservative in allocating As nations around the world continue to pressure plat- examples to categories in our training sets, and in the forms to take a more active role in moderating harmful semi-supervised stages, focused on reducing the num- content, it will become increasingly important to devel- bers of misclassified examples, especially because it was difficult to find additional positive examples within the op mechanisms to hold these moderation systems to very large set of unclassified and (for our purposes) account. YouTube’s complex and highly automated largely irrelevant random sample. So, for example, copyright moderation systems are an excellent case from our manual validation, 97 out of 100 videos clas- study to develop and hone new methods. 
sified as ‘full movies’ appeared from the metadata to A key benefit of our methodology is that it allows purport to be full versions of cinematic films or to link for the identification of trends in content moderation to sites where people could watch full versions of those that would not be evident in small sample sizes or films. The precision of this category is very high, and through experimental uploading. Our study has we can be comfortable drawing conclusions about how shown the potential to undertake large scale Gray and Suzor 11 Discovery Projects grant (DP170100122). This research was quantitative analysis on these systems at a level of supported with Cloud TPUs from Google’s TensorFlow detail that has so far not been possible. Most impor- tantly, we hope that this methodological approach Research Cloud (TFRC). proves useful in the future for researchers who may undertake longitudinal analyses or undertake further ORCID iD detailed qualitative study of particular controversies. Nicolas P Suzor https://orcid.org/0000-0003-3029-0646 As for YouTube’s copyright enforcement system itself, we have only been able to scratch the surface Notes with this analysis, and we must leave some further detailed investigation for future work. It is clear, how- 1. See U.S.C. 17 s 512(c). ever, that both the Content ID and DMCA takedown 2. See https://developers.google.com/youtube/v3/docs/ system are used with a greater degree of discretion than search/list was previously apparent: at an aggregate level, right- 3. https://github.com/qut-dmrc/short_text_analysis sholders make different decisions in relation to differ- ent types of videos. It also seems that in aggregate, the References Content ID and DMCA systems are working relatively Adhikari A, Ram A, Tang R, et al. (2019) DocBERT: BERT well to remove apparently infringing content from for Document Classification. Available at: https://arxiv. YouTube. 
Our study does, however, raise some con- org/abs/1904.08398v1 (accessed 20 April 2019). cerns about potential misidentification and over block- Bar-Ziv S and Elkin-Koren N (2018) Behind the scenes of ing, particularly in the sports highlights category, as online copyright enforcement: Empirical evidence on well as the large amount of discretion that music right- notice & takedown. Connecticut Law Review 50: 339–385. sholders are able to exercise to choose to block all types Binns R, Veale M, Van Kleek M, et al. (2017) Like trainer, of content – including material such as gameplay that is like bot? Inheritance of bias in algorithmic content mod- eration. In: International conference on social informatics, unlikely to compete in the market for recorded music. 2017, pp. 405–415. Berlin: Springer. Identifying the factors that affect the decisions of Birmingham J and David M (2011) Live-streaming: Will foot- rightsholders to remove content is an important area ball fans continue to be more law abiding than music fans? for further study, since these decisions have ramifica- Sport in Society 14(1): 69–80. DOI: 10.1080/ tions for freedom of expression and access to informa- 17430437.2011.530011 tion. We suggest that future studies might fruitfully Black J (2001) Decentring regulation: Understanding the role develop finer-grained classification categories and of regulation and self-regulation in a “post-regulatory” seek to collect more extensive metadata in order to world. Current Legal Problems 54(1): 103. facilitate more extensive qualitative analyses of the Blei DM, Ng AY and Jordan MI (2003) Latent dirichlet allo- types of content that are most likely to be blocked. cation. Journal of Machine Learning Research 3(Jan): The high discretion and potential lack of contextual 993–1022. 
sensitivity evident in these systems is something that Boroughf B (2015) The next great YouTube: Improving content policymakers too should clearly evaluate and address ID to foster creativity, cooperation and fair compensation. before encouraging platforms to rely to a much greater Albany Law Journal of Science & Technology 25(1): 95. Bridy A (2016) Copyright’s digital deputies: DMCA-plus extent on automated content moderation tools, either enforcement by internet intermediaries. In: Rothchild J for copyright or for issues like hate speech and abuse. (ed.) Research Handbook on Electronic Commerce Law. Edward Elgar, pp. 185–208. Acknowledgements Bridy A (forthcoming) Addressing infringement: Developments We thank Rosalie Gillett for outstanding research assistance. in the US and the DNS. In: Frosio G (ed.) The Oxford Handbook of Online Intermediary Liability. London: Declaration of conflicting interests Oxford University Press. Available at: https://ssrn.com/ The author(s) declared no potential conflicts of interest with abstract=3264879 (accessed 15 April 2020) Buckland M and Gey F (1994) The relationship between respect to the research, authorship, and/or publication of this recall and precision. Journal of the American Society for article. Information Science 45(1): 12–19. DOI: 10.1002/(SICI) 1097-4571(199401)45:1<12::AID-ASI2>3.0.CO;2-L Funding Burgess J (2015) From ‘broadcast yourself’ to ‘follow your The author(s) disclosed receipt of the following financial sup- interests’: Making over social media. International Journal port for the research, authorship, and/or publication of this of Cultural Studies 18(3): 281–285. article: Suzor is the recipient of an Australian Research Burgess J and Green J (2018) Youtube: Online Video and Council DECRA Fellowship (project number Participatory Culture. 2nd ed. Digital Media and DE160101542). This research is also supported by an ARC Society. Cambridge: Polity Press. 
Burk DL (2003) Anti-circumvention misuse. IEEE Technology and Society Magazine 22(2): 40–47.
Citron DK (2007) Technological due process. Washington University Law Review 85: 1249.
Devlin J, Chang M-W, Lee K, et al. (2018) BERT: Pre-training of deep bidirectional transformers for language understanding. Available at: https://arxiv.org/abs/1810.04805v1 (accessed 2 March 2019).
Diakopoulos N (2015) Algorithmic accountability: Journalistic investigation of computational power structures. Digital Journalism 3(3): 398–415.
Dootson P and Suzor N (2015) Game of clones and the Australia tax: Divergent views about copyright business models and the willingness of Australian consumers to infringe. The University of New South Wales Law Journal 38: 206–329.
Elkin-Koren N (2014) After twenty years: Revisiting copyright liability of online intermediaries. In: Frankel S and Gervais DJ (eds) The Evolution and Equilibrium of Copyright in the Digital Age. Cambridge: Cambridge University Press, pp. 29–51.
Erickson K and Kretschmer M (2018) “This video is unavailable”: Analyzing copyright takedown of user-generated content on YouTube. Journal of Intellectual Property, Information Technology and Electronic Commerce 9(1): 75–89. http://www.jipitec.eu/issues/jipitec-9-1-2018/4680
European Commission (2018) Proposal for a Regulation of the European Parliament and of the Council on preventing the dissemination of terrorist content online. 2018/0331 (COD), 12 September. Available at: https://ec.europa.eu/commission/sites/beta-political/files/soteu2018-preventing-terrorist-content-online-regulation-640_en.pdf (accessed 15 April 2020)
European Union Parliament (2018) Amendments adopted by the European Parliament on 12 September 2018 ( ) on the proposal for a directive of the European Parliament and of the Council on copyright in the Digital Single Market COM(2016)059-C8-0383/2016-2016/0280(COD).
Fiala L and Husovec M (2018) Using experimental evidence to design optimal notice and takedown process. TILEC Discussion Paper DP2018-028.
Garrett R (2016) Comments of The Professional Sports Organizations in the Matter of Section 512 Study before the Copyright Office Library of Congress. 2015–2017, Washington, DC.
Gillespie T (2007) Wired Shut Copyright and the Shape of Digital Culture. Cambridge, MA: MIT Press.
Gillespie T (2018) Custodians of the Internet: Platforms, Content Moderation, and the Hidden Decisions that Shape Social Media. 1st ed. New Haven, CT: Yale University Press.
Google (2016) How Google Fights Piracy. Available at: https://drive.google.com/file/d/0BwxyRPFduTN2TmpGajJ6TnRLaDA/view (accessed 15 April 2020).
Gray JE (2020) Google Rules: The History and Future of Copyright under the Influence of Google. New York, NY: Oxford University Press.
Heller MA (1998) The tragedy of the anticommons: Property in the transition from Marx to markets. Harvard Law Review 111(3): 621–688. DOI: 10.2307/1342203
Hull MR (2010) Sports leagues’ new social media policies: Enforcement under copyright law and state law. The Columbia Journal of Law & the Arts 34: 457.
Jacques S, Garstka K, Hviid M, et al. (2018) An empirical study of the use of automated anti-piracy systems and their consequences for cultural diversity. Script-Ed 15(2): 277–312. DOI: 10.2966/scrip.150218.277
Jones AJ (2017) By throwing fans and writers in Twitter jail, sports leagues are abusing the law. The Guardian, 13 April. Available at: https://www.theguardian.com/sport/2017/apr/13/twitter-sports-leagues-rules-journalist-account-suspensions (accessed 19 March 2019).
Kohl U (2013) Google: The rise and rise of online intermediaries in the governance of the internet and beyond (part 2). International Journal of Law and Information Technology 21: 187.
McSherry C (2014) Lawrence Lessig settles fair use lawsuit over phoenix music snippets. In: eff.org. Available at: https://www.eff.org/press/releases/lawrence-lessig-settles-fair-use-lawsuit-over-phoenix-music-snippets (accessed 15 April 2020)
Matsui S (2016) Does it have to be a copyright infringement: Live game streaming and copyright. Texas Intellectual Property Law Journal 24: 215.
Mellis MJ (2007) Internet piracy of live sports telecasts. Marquette Sports Law Review 18: 259.
Mou L, Meng Z, Yan R, et al. (2016) How transferable are neural networks in NLP applications? arXiv:1603.06111 [cs]. Available at: http://arxiv.org/abs/1603.06111 (accessed 10 April 2019).
Nas S (2004) The Multatuli project: ISP notice & take down. In: 4th international system administration and network engineering conference, Amsterdam, Netherlands, September 2004.
Pariser J (2016) Comments of the Motion Picture Association of America before the Library of Congress United States Copyright Office in the Matter of Requests for Comments on United States Copyright Office Section 512 Study. Docket No. 2015-7, 1 April.
Pasquale F (2015) The Black Box Society. Cambridge, MA: Harvard University Press.
Patry W (2009) Moral Panics and the Copyright Wars. Toronto: Oxford University Press.
Pedregosa F, Varoquaux G, Gramfort A, et al. (2011) Scikit-learn: Machine learning in python. Journal of Machine Learning Research 12: 2825–2830.
Perel M and Elkin-Koren N (2016) Accountability in algorithmic copyright enforcement. Stanford Technology Law Review 19: 473–532.
Perel M and Elkin-Koren N (2017) Black box tinkering: Beyond disclosure in algorithmic enforcement. Florida Law Review 69(1): 181.
Reda J (2019) Upload filters. Available at: https://juliareda.eu/eu-copyright-reform/censorship-machines/ (accessed 11 April 2019).
Seng D (2014) The state of the discordant union: An empirical analysis of DMCA takedown notices. Virginia Journal of Law and Technology 18: 369.
Seng D (2015) ‘Who Watches the Watchmen?’ An Empirical Analysis of Errors in DMCA Takedown Notices. ID 2563202, SSRN Scholarly Paper, 23 January. Rochester, NY: Social Science Research Network. Available at: http://papers.ssrn.com/abstract=2563202 (accessed 25 May 2016).
Sokolova M and Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Information Processing & Management 45(4): 427–437. DOI: 10.1016/j.ipm.2009.03.002
Steyvers M and Griffiths T (2007) Probabilistic topic models. Handbook of Latent Semantic Analysis 427(7): 424–440.
Suzor NP (n.d.) Understanding content moderation systems: New methods to understand internet governance at scale, over time, and across platforms. In: Whalen R (ed.) Computational Legal Studies: The Promise and Challenge of Data-Driven Legal Research. Cheltenham: Edward Elgar Publishing, pp. 1–19. Available at: https://eprints.qut.edu.au/129464/ (accessed 15 April 2020).
Suzor NP (2019) Lawless: The Secret Rules That Govern Our Digital Lives. Cambridge: Cambridge University Press.
Suzor NP, West SM, Quodling A, et al. (2019) What do we mean when we talk about transparency? Toward meaningful transparency in commercial content moderation. International Journal of Communication 13: 18.
Taylor I (2015) Video games, fair use and the internet: The plight of the let’s play. University of Illinois Journal of Law, Technology & Policy 1: [i]272.
Taylor TL (2018) Watch Me Play: Twitch and the Rise of Game Live Streaming. Princeton, NJ: Princeton University Press.
Tehranian J (2011) Infringement Nation: Copyright 2.0 and You. New York, NY: Oxford University Press. Available at: https://trove.nla.gov.au/work/38120261 (accessed 18 April 2019).
Tushnet R (2014) All of this has happened before and all of this will happen again: Innovation in copyright licensing. Berkeley Technology Law Journal 29(3): 1447.
Urban JM, Karaganis J and Schofield BL (2017) Notice and Takedown in Everyday Practice. 22 March. Available at: https://papers.ssrn.com/abstract=2755628 (accessed 18 September 2017).
Urban JM and Quilter L (2006) Efficient process or ‘Chilling effects’? Takedown notices under section 512 of the digital millennium copyright act. Santa Clara Computer and High Technology Law Journal 22: 621.
Van der Sar E (2018) Movie Company Demands €200,000 From YouTube Over Pirated Film. Available at: https://torrentfreak.com/movie-company-demands-e200000-from-youtube-over-pirated-film-181129/ (accessed 19 March 2019).
Vaswani A, Shazeer N, Parmar N, et al. (2017) Attention is all you need. In: Guyon I, Luxburg UV, Bengio S, et al. (eds) Advances in Neural Information Processing Systems 30. New York, NY: Curran Associates, Inc., pp. 5998–6008. Available at: http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf (accessed 10 April 2019).
Wang S (2015) Fair use or copyright infringement? Deadspin and SB Nation get tossed off Twitter for NFL GIFs. In: Nieman Lab. Available at: http://www.niemanlab.org/2015/10/fair-use-or-copyright-infringement-deadspin-and-sb-nation-get-tossed-off-twitter-for-nfl-gifs/ (accessed 19 March 2019).
Wu T (2007) Tolerated use. Columbia Journal of Law & Arts 31: 617.
Yafit L-A (2013) Copyright lawmaking and public choice: From legislative battles to private ordering. Harvard Journal of Law & Technology 27: 203.
YouTube (n.d.) Policies on harmful or dangerous content. Available at: https://support.google.com/youtube/answer/2801964?hl=en-GB (accessed 19 April 2019).
Zarsky T (2016) The trouble with algorithmic decisions: An analytic road map to examine efficiency and fairness in automated and opaque decision making. Ziewitz M (ed.). Science, Technology, & Human Values 41(1): 118–132. DOI: 10.1177/0162243915605575
Zimmerman DL (2014) Copyright and social media: A tale of legislative abdication. Pace Law Review 35(1): 260.

Appendix 1

Logistic regression using maximum likelihood, predicting outcomes two weeks after a video is published on YouTube. N = 750,000 (150,000 in each class, combining live sport and sport highlights).
Content ID block                    Coefficient  Std err         z  p>|z|  [0.025  0.975]
Intercept                               -5.0784    0.038  -135.166  0.000  -5.152  -5.005
Game play                               -1.7866    0.098   -18.204  0.000  -1.979  -1.594
Hacks                                   -2.1544    0.151   -14.253  0.000  -2.451  -1.858
Full movies                              2.4416    0.042    58.218  0.000   2.359   2.524
Sports highlights                        1.3533    0.044    30.591  0.000   1.267   1.440
Live sports                              1.3947    0.087    15.993  0.000   1.224   1.566
Has link (non-major site)               -0.8071    0.245    -3.297  0.001  -1.287  -0.327
Game play with link                      0.2663    0.781     0.341  0.733  -1.265   1.798
Hacks with link                          0.3992    0.451     0.885  0.376  -0.484   1.283
Full movies with link                   -1.0823    0.261    -4.139  0.000  -1.595  -0.570
Sports highlights with link             -0.3204    0.301    -1.063  0.288  -0.911   0.270
Live sports with link                   -1.1672    0.397    -2.939  0.003  -1.946  -0.389

Content ID block by music claimant  Coefficient  Std err         z  p>|z|  [0.025  0.975]
Intercept                               -5.4055    0.044  -122.270  0.000  -5.492  -5.319
Game play                                0.3422    0.058     5.941  0.000   0.229   0.455
Hacks                                   -1.2743    0.120   -10.659  0.000  -1.509  -1.040
Full movies                              0.3561    0.075     4.762  0.000   0.210   0.503
Sports highlights                       -0.5054    0.082    -6.172  0.000  -0.666  -0.345
Live sports                              0.3106    0.164     1.895  0.058  -0.011   0.632
Has link (non-major site)               -1.0114    0.318    -3.176  0.001  -1.635  -0.387
Game play with link                      0.5806    0.428     1.357  0.175  -0.258   1.420
Hacks with link                         -0.3458    0.543    -0.637  0.524  -1.410   0.719
Full movies with link                   -2.2410    0.674    -3.324  0.001  -3.562  -0.920
Sports highlights with link             -0.7243    0.774    -0.935  0.350  -2.242   0.793
Live sports with link                   -1.1186    0.751    -1.489  0.136  -2.591   0.353

DMCA takedown                       Coefficient  Std err         z  p>|z|  [0.025  0.975]
Intercept                               -6.4466    0.074   -86.771  0.000  -6.592  -6.301
Game play                               -2.6427    0.285    -9.259  0.000  -3.202  -2.083
Hacks                                    0.0452    0.122     0.370  0.711  -0.194   0.284
Full movies                              3.4000    0.078    43.781  0.000   3.248   3.552
Sports highlights                        2.3582    0.079    29.715  0.000   2.203   2.514
Live sports                              2.8983    0.105    27.699  0.000   2.693   3.103
Has link (non-major site)                0.0810    0.316     0.256  0.798  -0.539   0.701
Game play with link                      0.5974    1.287     0.464  0.643  -1.926   3.120
Hacks with link                          0.5792    0.357     1.621  0.105  -0.121   1.279
Full movies with link                    0.4701    0.319     1.474  0.140  -0.155   1.095
Sports highlights with link             -1.4678    0.397    -3.702  0.000  -2.245  -0.691
Live sports with link                   -0.0851    0.342    -0.249  0.803  -0.755   0.585

Not available: account terminated   Coefficient  Std err         z  p>|z|  [0.025  0.975]
Intercept                               -3.3786    0.016  -207.512  0.000  -3.411  -3.347
Game play                               -2.3369    0.054   -43.581  0.000  -2.442  -2.232
Hacks                                    2.7222    0.018   154.512  0.000   2.688   2.757
Full movies                              2.6726    0.018   145.957  0.000   2.637   2.708
Sports highlights                        0.2667    0.024    11.209  0.000   0.220   0.313
Live sports                              2.5291    0.028    91.026  0.000   2.475   2.584
Has link (non-major site)                2.1893    0.031    70.659  0.000   2.129   2.250
Game play with link                      0.5974    0.100     5.956  0.000   0.401   0.794
Hacks with link                         -1.6208    0.034   -48.262  0.000  -1.687  -1.555
Full movies with link                   -0.1806    0.034    -5.349  0.000  -0.247  -0.114
Sports highlights with link             -1.6132    0.067   -24.124  0.000  -1.744  -1.482
Live sports with link                    0.2633    0.043     6.122  0.000   0.179   0.348

Removed by user                     Coefficient  Std err         z  p>|z|  [0.025  0.975]
Intercept                               -1.7246    0.008  -226.892  0.000  -1.739  -1.710
Game play                                0.2996    0.010    29.665  0.000   0.280   0.319
Hacks                                   -0.4118    0.014   -28.811  0.000  -0.440  -0.384
Full movies                              0.4205    0.013    32.599  0.000   0.395   0.446
Sports highlights                        0.1197    0.012    10.321  0.000   0.097   0.142
Live sports                              1.4245    0.020    69.963  0.000   1.385   1.464
Has link (non-major site)               -0.1254    0.035    -3.546  0.000  -0.195  -0.056
Game play with link                      0.1104    0.055     2.011  0.044   0.003   0.218
Hacks with link                          1.2022    0.040    29.819  0.000   1.123   1.281
Full movies with link                   -1.0583    0.050   -21.210  0.000  -1.156  -0.961
Sports highlights with link             -0.5631    0.062    -9.019  0.000  -0.686  -0.441
Live sports with link                    1.4167    0.045    31.327  0.000   1.328   1.505

ToS takedown                        Coefficient  Std err         z  p>|z|  [0.025  0.975]
Intercept                               -5.1467    0.039  -132.410  0.000  -5.223  -5.071
Game play                               -2.6983    0.153   -17.640  0.000  -2.998  -2.398
Hacks                                    2.2720    0.042    53.543  0.000   2.189   2.355
Full movies                              3.2051    0.041    77.833  0.000   3.124   3.286
Sports highlights                       -1.2102    0.094   -12.810  0.000  -1.395  -1.025
Live sports                              2.2584    0.066    34.084  0.000   2.128   2.388
Has link (non-major site)                1.5245    0.088    17.327  0.000   1.352   1.697
Game play with link                      1.5332    0.264     5.817  0.000   1.017   2.050
Hacks with link                         -0.9784    0.093   -10.495  0.000  -1.161  -0.796
Full movies with link                   -0.1911    0.090    -2.113  0.035  -0.368  -0.014
Sports highlights with link             -0.0363    0.215    -0.169  0.866  -0.457   0.384
Live sports with link                    0.9252    0.107     8.652  0.000   0.716   1.135
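The coefficients in these tables are on the log-odds scale, so main effects and interaction terms are added together before being converted to a probability. The Python sketch below illustrates that arithmetic for the Content ID block model. It is not the authors' code: the helper names `sigmoid` and `wald_interval` are illustrative, the 1.96 critical value assumes a normal approximation, and the coefficient signs are an assumption inferred from the reported confidence intervals.

```python
import math


def sigmoid(log_odds: float) -> float:
    """Inverse logit: map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def wald_interval(coef: float, std_err: float, z_crit: float = 1.96):
    """Approximate 95% interval, coef +/- z_crit * std_err (normal approximation)."""
    return coef - z_crit * std_err, coef + z_crit * std_err


# Values from the "Content ID block" table. The signs are an ASSUMPTION,
# inferred from the reported confidence intervals.
intercept = -5.0784              # baseline category, no link
full_movies = 2.4416             # main effect: full movies
has_link = -0.8071               # main effect: link to a non-major site
full_movies_with_link = -1.0823  # interaction: full movies x link

# Log-odds are additive; the inverse logit turns each sum into a probability.
p_baseline = sigmoid(intercept)
p_movie = sigmoid(intercept + full_movies)
p_movie_link = sigmoid(intercept + full_movies + has_link + full_movies_with_link)

# exp(coefficient) is the odds ratio relative to the baseline category.
odds_ratio_movie = math.exp(full_movies)

# Interval for the full-movies coefficient, using its reported standard error.
low, high = wald_interval(full_movies, 0.042)

print(f"P(block | baseline)          = {p_baseline:.4f}")
print(f"P(block | full movie)        = {p_movie:.4f}")
print(f"P(block | full movie + link) = {p_movie_link:.4f}")
print(f"Odds ratio, full movies      = {odds_ratio_movie:.2f}")
print(f"95% interval, full movies    = [{low:.3f}, {high:.3f}]")
```

Under these assumed signs, the predicted probability of a Content ID block is roughly 0.6% for the baseline category, 6.7% for full movies without a link, and 1.1% for full movies with a link to a non-major site. The interval computed for the full-movies coefficient matches the tabulated bounds to three decimal places.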

Journal

Big Data & Society, SAGE

Published: Apr 28, 2020

Keywords: Machine learning; copyright enforcement; YouTube; content moderation; automated decision-making; Content ID

References