Producing “one vast index”: Google Book Search as an algorithmic system

Melissa K Chalmers, University of Michigan School of Information, USA
Paul N Edwards, Stanford University, USA

Big Data & Society 4(2), 2017. DOI: 10.1177/2053951717716950

Corresponding author: Melissa K Chalmers, University of Michigan School of Information, 105 S. State St., Ann Arbor, MI 48109, USA. Email: mechalms@umich.edu

Abstract

In 2004, Google embarked on a massive book digitization project. Forty library partners and billions of scanned pages later, Google Book Search has provided searchable text access to millions of books. While many details of Google’s conversion processes remain proprietary secrets, here we piece together their general outlines by closely examining Google Book Search products, Google patents, and the entanglement of libraries and computer scientists in the longer history of digitization work. We argue that far from simply “scanning” books, Google’s efforts may be characterized as algorithmic digitization, strongly shaped by an equation of digital access with full-text searchability. We explore the consequences of Google’s algorithmic digitization system for what end users ultimately do and do not see, placing these effects in the context of the multiple technical, material, and legal challenges surrounding Google Book Search. By approaching digitization primarily as a text extraction and indexing challenge—an effort to convert print books into electronically searchable data—GBS enacts one possible future for books, in which they are defined largely by their textual content.

Keywords: Algorithmic system, digitization, algorithmic culture, Google, web search, scanning

Reading a public domain book on the Google Books website is a mundane encounter with text on a screen. In the midst of this experience, the appearance of a hand presents an unsettling disruption (Figure 1). Positioned within the front matter of the Code of Procedure of the State of New York (1862), bright pink rubbers cover three fingers. The hand bears a thick silver ring and matching pink nail polish. The thumb has been partially erased, appearing as a brown, pixelated stripe. The words “Digitized by Google” have been digitally tattooed on the hand’s skin.

Figure 1. Hands scanned by Google (New York, 1862).

Momentarily pulling back the curtain on Google’s digitization processes, the hand’s presence draws attention both to the book’s print origins and to the human and machine labor required to transport (and transform) it from library shelf to laptop screen. This hand belongs to a contract worker hired by Google to turn the pages of more than 20 million books digitally imaged through the Google Book Search Project since 2004. These fingers, skin, nails, and rings appear as visible traces of ongoing processes designed to obviate—and subsequently to erase—human intervention. The dream of automation persists, even as the materials resist.

The hand’s ghostly presence also highlights the opacity surrounding Google’s undertaking, a disjuncture between the company’s techno-utopian public rhetoric and the paucity of public access it provided to the technical specifics of digital conversion. Envisioning a far-reaching public impact, Google CEO Eric Schmidt (2005) described the project’s goals:

Imagine the cultural impact of putting tens of millions of previously inaccessible volumes into one vast index, every word of which is searchable by anyone, rich and poor, urban and rural, First World and Third, en toute langue – and all, of course, entirely for free.
Yet the actual digitization proceeded under a cloud of secrecy, leaving analysts such as ourselves to glean traces of the project’s values and processes from public statements, contracts, project webpages, blog posts, presentations, and patent applications—and sometimes from the margins of the page images themselves.

Existing research has investigated many aspects of Google Book Search (hereafter GBS), including its goals, its outputs, and its intellectual property frameworks (Samuelson, 2009). Scholars have considered GBS in the context of the corporate monopolization of cultural heritage (Vaidhyanathan, 2012), the history and future of the book as a physical medium (Darnton, 2009), and the place of digitized books in knowledge infrastructures such as libraries (Jones, 2014; Murrell, 2010). Leetaru (2008) provides a rare analysis of GBS analog–digital conversion processes, while Google employees Langley and Bloomberg (2007) and Vincent (2007) have presented elements of Google’s technical workflows to specialized technical research communities.

Here we take a new tack, arguing that Google’s approach to digitization was shaped by a confluence of technical and cultural factors that must be understood together. These include Google’s corporate commitment to the scalable logic of web search, partner selection parameters, the lingering influence of print intellectual property regimes, and the requirements of Google’s highly standardized “mass digitization” processes (Coyle, 2006). This article proposes an alternative descriptor, algorithmic digitization, intended to highlight how the algorithms Google uses to scale and automate digitization intertwine with the production logic that governs GBS planning and execution.

Understanding GBS as an algorithmic system foregrounds Google’s commitment to scale, standardized processes, automation, and iterative improvement (Gillespie, 2016). These features must also be understood as negotiated translations of varied project, partner, and corporate goals into executable workflows. We first examine how algorithms shape and structure the work of digitization in GBS and consider the effects of algorithmic processing on digitized books accessible to users. We then explore the implications of Google’s embrace of an algorithmic solution to the multiple technical, material, and legal challenges posed by GBS. Beyond simply scaling up existing book digitization, Google’s algorithmic digitization effort has had the effect of reimagining what the intended outcome of such a project should be—with important implications for mediating digital access to print books.

Books as data: Digital hammer seeks digital nails

Google’s corporate mission, “to organize the world’s information and make it universally accessible and useful,” has remained effectively unchanged since its first appearance on the company’s website in late 1999 (Google, Inc., 1999). At the time, it referred chiefly to web search, Google’s core business. In December 2004, Google announced an extension to that mission: a massive book digitization project in partnership with five elite research libraries. Since then Google has worked with over 40 library partners to scan over 20 million books, producing billions of pages of searchable text. In 2012, without any formal announcement, Google quietly began to scale back the project, falling short of its aspirations to scan “everything” (Howard, 2012). While it seems unlikely that Google will stop digitizing books completely or jettison its digitized corpus anytime soon, the project’s future is currently unknown.
To Google, converting print books into electronically searchable data was GBS’s entire raison d’être. Therefore, Google constructed digitization as a step parallel to the web crawling that enabled web search. In contracts with library partners, Google defined digitization as “to convert content from a tangible, analog form into a digital representation of that content” (University of Michigan and Google, Inc., 2005). In practice, this conversion produced a digital surrogate in which multiple representations of a print book exist simultaneously. Each digitized book is comprised of a series of page images, a file containing the book’s text, and associated metadata. Layered to produce multiple types of human and machine access—page images, full-text search, and pointers to physical copies held by libraries—each of these elements was produced by separate, yet related, processes.

Integrating human values—and labor—into algorithmic systems

As with many Google endeavors, the company reengineered familiar processes at new levels of technological sophistication. From that perspective, Google’s primary innovation on libraries’ hand-crafted “boutique” digitization models (which pair careful content selection with preservation-quality scanning) was to approach book digitization as it would any other large-scale data management project: as a challenge of scale, rather than kind. Susan Wojcicki, a product manager for the project, contextualized Google’s approach bluntly: “At Google we’re good at doing things at scale” (Roush, 2005). In other words, Google turned book digitization into an algorithmic process. Scaled-up scanning required a work process centered in and around algorithms.

Algorithms are complex sequences of instructions expressed in computer code, flowcharts, decision trees, or other structured representations. From Facebook to Google and Amazon, algorithms increasingly shape how we seek information, what information we find, and how we use it. Because algorithms are typically designed to operate with little oversight or intervention, the substantial human labor involved in their creation and deployment remains obscured. Algorithmic invisibility easily slides into a presumed neutrality, and algorithms remain outside users’ direct control as they undergo iterative improvement and refinement. Finally, the vast complexity of many algorithms—especially interacting systems of algorithms—can render their behavior impossible for even their designers to predict or understand.

Embedded in systems, algorithms have the power to reconfigure work, life, and even physical spaces (Gillespie, 2016; Golumbia, 2009; Striphas, 2015). Seaver (2013) calls for reframing the questions we ask about algorithmic systems, moving away from conceiving of algorithms as technical objects with cultural consequences and toward the question of “how algorithmic systems define and produce distinctions and relations between technology and culture” in specific settings. Studying algorithmic systems empirically may thus bring together several elements: the technical details of algorithm function; the imbrication of humans (designers, production assistants, users) and human values in algorithmic systems; and the multiple contexts in which algorithms are developed and deployed.

Like many contemporary digital systems, GBS integrated humans as light industrial labor, necessary if inefficient elements of an incompletely automated process. Human labor in GBS was almost entirely physical, heavily routinized, and kept largely out of sight; human expertise resides outside rather than inside Google’s system. Partner library employees pulled books from shelves onto carts destined for a Google-managed off-site scanning facility (Palmer, 2005). There, contract workers turned pages positioned under cameras, feeding high-speed image processing workflows around the clock (University of Michigan and Google, Inc., 2005). Directly supervised by the machines they were hired to operate, scanning workers were required to sign nondisclosure agreements but were afforded none of the perks of being a Google employee beyond the walls of a private scanning facility (Norman Wilson, 2009). For the time being, at least, human labor in book digitization remains necessary largely because of the material fragility, inconsistency, and variety of print books.
Preparing to digitize: Partnerships, goal alignment, selection

Mass digitization initiatives are often characterized as operating without a selection principle: “everything” must be digitized (Coyle, 2006). In practice, however, partnerships, scaling requirements, intellectual property regimes designed for print, and the particulars of books’ material characteristics all challenged Google’s universal scanning aspirations.

At the turn of the 21st century, Lynch (2002) observed that cultural heritage institutions mostly understood the hows of digitization, even at moderately large scale. The main challenge, he argued, was to optimize processes. Lesk (2003) described the challenges of scale and efficiency more succinctly: “we need the Henry Ford of digitization,” i.e. an institution willing to invest vast resources in “digitization on an industrial scale” (Milne, 2008). Google stepped forward to assume this role.

Google courted partners to provide content by incurring nearly all costs of scanning, while carefully avoiding the repository-oriented responsibilities of a library. Each partner library brought its own goals and motivations into the project. The New York Public Library (2004) observed that “without Google’s assistance, the cost of digitizing our books — in both time and dollars — would be prohibitive.” Other partners spoke of leveraging Google’s technical expertise and innovation to inform future institutional digitization efforts (Carr, 2005; Palmer, 2005). Libraries employed different selection criteria, from committing to digitize all holdings (e.g., University of Michigan) to selecting only public domain holdings (e.g., Oxford, NYPL) or special collections (later partners). Most digitization contracts remained private, adding to the secrecy surrounding Google’s efforts.

Full-text search quickly emerged as a kind of lowest-common-denominator primary functionality for the project. Using the Internet Archive’s Wayback Machine, we can see how Google incrementally modified language relating to the project’s goals and mechanisms throughout its first year (Google, Inc., 2004b). The answer to the question “What is the Library Project” evolved from an effort to transport media online (December 2004) to a pledge to make “offline information searchable” (May 2005) to a more ambiguous plan to “include [libraries’] collections... and, like a card catalog, show users information about the book plus a few snippets – a few sentences of their search term in context” (November 2005, emphasis added).
The purpose behind these changes became clear in Fall 2005, as the Authors Guild and the Association of American Publishers filed lawsuits alleging copyright infringement (Band, 2009). Google argued that by creating a “comprehensive, searchable, virtual card catalog of all books in all languages,” it provided pointers to book content rather than access to copyright-protected books. The company maintained that scanning-enabled indexing constituted “fair use” under the U.S. Copyright Act (Schmidt, 2005; US Copyright Office, 2016). In November 2005, the project’s name changed from Google Print to Google Book Search, reorienting users’ frame of reference from the world of paper to the world of the electronic web (Grant, 2005). The change attempted to correct any misperceptions that Google intended to enable access to user-printed copies of books and to deemphasize the idea that the project was in the business of copying or of content ownership.

Since December 2004, GBS has provided full access to public domain books. Google consistently downplayed this capability, maintaining that like a bookstore “with a Google twist,” readers would use it mainly to discover books rather than to actually read them (Google, Inc., 2004a). Yet partners scanning public domain books often referenced online reading as a benefit. This ambiguity perhaps contributed to copyright-related concerns—and misunderstandings—during GBS’s early days (Carr, 2005; New York Public Library, 2004).

A means to an end: Image capture

Once it took custody of partner library books, Google deployed its own selection criteria. In a (rare) concession to the library partners tasked with storing and preserving paper materials, Google used a nondestructive scanning technique. In patents filed in 2003 and 2004, Google provided descriptions of several high-resolution image capture systems designed around the logistical challenges posed by bound documents. The thicker the binding, for example, the less likely a book is to lie flat. In flatbed or overhead scanners, page curvature creates skewed or distorted scanned images. Book cradles or glass platens can flatten page surfaces, but these labor-intensive tools slow down scanning and can damage book spines. Google addressed this page curvature problem computationally, through a combination of 3D imaging and downstream image processing algorithms. That decision shaped and complicated Google’s workflow.

In the patent schematic shown in Figure 2, two cameras (305, 310) are positioned to capture two-dimensional images of opposing pages of a bound book (301). Simultaneously, an infrared (IR) projector (325) superimposes a pattern on the book’s surface, enabling an IR stereoscopic camera (315) to generate a three-dimensional map of each page (Lefevere and Saric, 2009). Using a dewarping algorithm, Google can subsequently detect page curvature in these 3D page maps and correct it by straightening and stretching text (Lefevere and Saric, 2008).

Figure 2. System for optically scanning documents (Lefevere and Saric, 2009).
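Google’s patents describe the dewarping step only at a high level, and the production algorithm has never been published. The sketch below is therefore only a minimal, one-dimensional illustration of the general idea under a strong simplifying assumption: curvature varies only across the page width, summarized as a height profile standing in for the 3D page map. Where the page slopes away from the camera its content is compressed in the image, so the correction resamples columns at equal steps of arc length along the surface. The function name and demo data are our own, not Google’s.

```python
import numpy as np

def dewarp_columns(page: np.ndarray, height_profile: np.ndarray) -> np.ndarray:
    """Flatten a curved page image using a 1D height profile.

    page: grayscale image, shape (rows, cols).
    height_profile: page surface height above the platen for each image
        column, shape (cols,); a stand-in for the structured-light depth map.
    """
    cols = page.shape[1]
    x = np.arange(cols, dtype=float)
    # Arc length along the curved surface: ds = sqrt(dx^2 + dh^2).
    dh = np.gradient(height_profile.astype(float))
    ds = np.sqrt(1.0 + dh ** 2)
    arc = np.concatenate(([0.0], np.cumsum(ds[1:])))
    # Output columns are evenly spaced in arc length (the flattened page);
    # find the source column in the camera image for each of them.
    target_arc = np.linspace(0.0, arc[-1], cols)
    src_cols = np.interp(target_arc, arc, x)
    # Linear interpolation between neighboring source columns, row by row.
    left = np.clip(np.floor(src_cols).astype(int), 0, cols - 1)
    right = np.clip(left + 1, 0, cols - 1)
    frac = src_cols - left
    flattened = (1.0 - frac) * page[:, left] + frac * page[:, right]
    return flattened.astype(page.dtype)

# Toy demo: a synthetic page that is flat on the left and curves up near the spine.
page = np.tile(np.arange(800) % 256, (600, 1)).astype(np.uint8)
profile = np.concatenate([np.zeros(600), np.linspace(0.0, 40.0, 200)])
flat = dewarp_columns(page, profile)
```

A production pipeline would work from the full 3D page map and correct in two dimensions, straightening lines as well as stretching text, which helps explain why this computational choice complicated the downstream workflow.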
Scanning produces bitmapped images that represent the pages of a print book as a grid of pixels for online viewing. Unlike text, this imaged content cannot be searched and remains “opaque to the algorithmic eyes of the machine” (Kirschenbaum, 2003). As a next step after scanning, Google might have adopted existing library-based preservation best practices for imaged content. Or it could have created new standards around 3D book imaging (Langley and Bloomberg, 2007; Leetaru, 2008). Instead, Google chose to transform the raw 3D page maps described above—rich in information, but unwieldy for end users due to file size and format—into “clean and small images for efficient web serving” (Vincent, 2007).

Producing a machine-readable index: Image processing

For GBS, then, imaging ultimately represented a key yet preliminary step toward text-searchable books on the web. The project’s image processing workflows thus acquired a dual imperative: they had to produce both (a) two-dimensional page images for web delivery, and (b) machine-readable—and therefore searchable—text. “[O]ur general approach here has been to just get the books scanned, because until they are digitized and OCR is done, you aren’t even in the game,” Google Books engineering director James Crawford observed in 2010 (Madrigal, 2010). The “game” here, of course, is search. In a web search engine, crawled page content and metadata are parsed and stored in an index, a list of words accompanied by their locations. Indexing quickly became the key mechanism (and metaphor) through which Google sought to unlock the content of books for web search.
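The index structure described above is simple to illustrate. The following toy sketch (ours, not Google’s production system) maps each word to the pages and positions where it occurs, so that answering a query means looking words up in the index rather than rescanning any page text:

```python
from collections import defaultdict

def build_inverted_index(pages: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Build a toy inverted index: each word maps to (page_id, position) pairs."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for page_id, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((page_id, position))
    return dict(index)

# Hypothetical page identifiers and text, for illustration only.
pages = {
    "vol1_p13": "one vast index every word of which is searchable",
    "vol2_p07": "the index points back to page images",
}
index = build_inverted_index(pages)
print(index["index"])   # [('vol1_p13', 2), ('vol2_p07', 1)]
```

Real search indexes add tokenization rules, ranking, and compression, but the core data structure is this word-to-locations mapping.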
To produce its full-text index, Google converted page images to text using optical character recognition (OCR). OCR software uses pattern recognition to identify alphanumeric characters on scanned page images and encode them as machine-readable characters. Originally used to automate processing of highly standardized business documents such as bank checks, over the past 60 years OCR has become integral to organizing and accessing digital information previously stored in analog form (Holihan, 2006; Schantz, 1982). Through OCR, imaged documents gain new functionality, as text may be searched, aggregated, mined for patterns, or converted to audio formats for visually impaired users.

Tanner et al. (2009) argue that by providing search functionality for large digitized corpora at low cost, automated OCR systems have been a key driver of large-scale text digitization. GBS leveraged decades of computing research related to OCR. Through the 1990s, boutique library digitization efforts had addressed the question of quality mainly by establishing image-centric digitization standards (e.g., scanner specifications and calibration, test targets, resolution) (Baird, 2003). Rooted in libraries’ traditions of ensuring long-term visual access to materials through reformatting (e.g., copying, microfilming), these practices relied on labor-intensive visual inspection for quality control. By contrast, pattern recognition research developed systems for algorithmically assessing quality, measured by accurate recognition of printed characters and document structure (Le Bourgeois et al., 2004; Lin, 2006). Google adopted this framing of digitization as a text extraction challenge, optimizing its processes to produce the clean, high-contrast page images necessary for accurate OCR. The GBS processing pipeline relied heavily on OCR to automate not only image processing and quality control but also volume-level metadata extraction. Google’s Vincent (2007) described the digitized corpus as algorithmic “document understanding and analysis on a massive scale.”

Books bite back: Bookness as bug, not feature

In their commitment to scale and standardized procedure, algorithmic systems often prioritize system requirements over the needs of individual inputs (e.g., books) or users. Google’s search engine, for example, has come under criticism for failing to prioritize authoritative or accurate search results. In December 2016, the Guardian reported that a Google query on “Did the Holocaust happen?” returned a Holocaust denial website as the first result. A Google spokesperson maintained that

[w]hile it might seem tempting to fix the results of an individual query by hand, that approach does not scale to the many different variants of that query and the queries that we have not yet seen. So we prefer to take a scalable algorithmic approach to fix problems, rather than removing these one by one. (Cadwalladr, 2016, emphasis added)

Google’s acknowledgment here of the trade-offs it faces between scale and granularity highlights questions of algorithmic accountability (Pasquale, 2015).

Digitization is always accompanied by both information loss and information gain (Terras, 2008). In GBS, lost information includes the physical size, weight, or structure of a volume; the texture and color of its pages; and the sensory experience of navigating its contents. Nontextual book features such as illustrations, as well as marginalia and other evidence of print books’ physical histories of use, are often distorted or auto-cropped out of Google’s screen-based representations. As for information gain, image capture and processing embed traces of the digitization process into digitized objects.

The quality of Google’s digitization output has been systematically evaluated through empirical research and widely critiqued in informal venues such as blogs. While useful in characterizing quality concerns in the digitized corpus, this work generally does not consider how and why digitization processes shape outputs. The following examples illustrate commonly identified problems, but they also extend existing analyses by emphasizing the role of algorithms in concretizing relationships among system inputs, conversion processes, and outputs. These types of problems remain endemic in the GBS corpus not because they are unsolvable, but rather because they have been accepted as trade-offs. Their solutions do not fit easily into Google’s priorities and workflows, even as their persistence challenges efforts to automate quality assurance processes.
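Google’s OCR stack is proprietary, but the basic text extraction step described in the previous section can be approximated with open-source tools. The sketch below is an illustration under our own assumptions, not a description of the GBS pipeline: it converts a page image to the high-contrast bitonal form OCR engines prefer, using a fixed threshold where a production system would choose one automatically, and it assumes the Pillow and pytesseract packages plus a local Tesseract install.

```python
from PIL import Image, ImageOps
import pytesseract  # wrapper around the open-source Tesseract OCR engine

def ocr_page(path: str, threshold: int = 160) -> str:
    """Binarize a page image to high-contrast black/white, then run OCR."""
    gray = ImageOps.grayscale(Image.open(path))
    # A fixed global threshold stands in for the automatically determined
    # threshold used in production binarization.
    bitonal = gray.point(lambda px: 255 if px > threshold else 0)
    return pytesseract.image_to_string(bitonal)

# Hypothetical usage (the file name is illustrative, not a real GBS asset):
# text = ocr_page("mother_goose_p13.png")
```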
Visual content Google’s system also exposes tensions between the standardization required to scale digitization processes Output-based evaluations of large-scale book digitiza- and the flexibility needed to accommodate the diverse tion have found that except when catastrophic (rare), output of print publication history. It is perhaps no most text-oriented page scanning or image processing surprise that books, unlike business documents created errors result in thin, thick, blurry, or skewed text that to meet OCR requirements, persistently resisted the may frustrate or annoy readers but does not render structure imposed on them by Google’s homogenizing them entirely unreadable (Conway, 2013; James, processes. 2010). Objects such as fingers or clamps also appear Bound books evolved over centuries from earlier commonly in scans but often do not obstruct text writing formats such as scrolls and codices. But in significantly. Google’s conversion system, the hard-won features of In Figure 3, the very tiny book Mother Goose’s bound books—the very things that made them conve- Melody has been housed in a binder to prevent it nient, efficient, and durable media for so long—were from being lost on a library shelf. While the library- treated as bugs rather than features. Google routinely created cover fits Google’s selection criteria and has excluded materials from scanning due to size or condi- provided a frame size for image capture, several mate- tion. These included very large or small books as well as rial elements usually cropped out of Google-digitized books with tight bindings, tipped-in photographs and page images have crept into the frame due to the size illustrations, foldout maps, or uncataloged material mismatch between the cover and the actual book. These (Coyle, 2006). Very old, brittle, or otherwise fragile include a call slip and university label, metal book- books were also excluded (Ceynowa, 2009; Milne, securing clamps, and the page-turner’s hands. When 2008). Many of the rejected books remain undigitized, extra-textual features are detected and removed algor- while others have joined lengthy queues within ithmically—without the help of a human eye—they libraries’ ongoing internal digitization programs. often leave new artifacts behind. We see some of As a sampling process in which some, but not all, these less familiar traces here: the stretched appearance features of an analog signal are chosen for digital of book pages caused by the dewarping algorithm, and capture and representation, digitization is always the finger incompletely removed by another algorithm. accompanied by both information loss and information Further, the system has evidently misrecognized some Chalmers and Edwards 7 Figure 3. Imaging a tiny book (Thomas and Shakespeare, 1945). aging yellow tape as a color illustration, causing most In Figure 4, the grid misalignment has created a of the page images throughout the right side of the psychedelic blue and orange sky, which appears to fas- book to be rendered in color. While this book is an cinate the astronomer Hipparchus (Giberne, 1908). example of a relatively rare ‘‘bad book’’ (Conway, These moire´ patterns appear throughout the image, 2013), it aggregates many of the visual quality issues along with color aliasing, from wavy striations in the that pervade Google’s digitized corpus. sky and floor to geometric patterns on building col- Other material characteristics challenge image pro- umns. Color aliasing occurs when the spatial frequency cessing. 
These include ornate, unusual, or old fonts; of the original image is sampled at a rate inadequate to non-Roman characters/scripts; and rice paper, glossy capture all its details. Like moire´ , it is a common phe- paper, glassine, and tissue paper (Conway, 2013; nomenon among Google-digitized books that contain Weiss and James, 2015). Nontextual content such as engravings or etchings. illustrations (e.g., woodcuts, engravings, etchings, While the problem of digitization and moire´ has photographic reproductions, and halftones) also often been discussed since at least the 1970s, and corrective fare poorly. Halftone reproductions, for example, have measures have been identified (Huang, 1974), no fully been widely used since the 1880s to cheaply reproduce automated solution appears to have emerged. In 1996, graphic content for print. Placing a screen over an the Library of Congress acknowledged that moire´ image and dividing it into squares, variably sized and mitigation strategies remained unsuitable for produc- regularly spaced ink dots are used to create the image; tion-scale environments (Fleischhauer, 1996). This type of error is predictable, yet intractable, in the human eye fills in the gaps created by sampling and perceives the image as a continuous tone. large-scale book digitization. It is ironic that halftone Computerized scanning similarly creates a digital screening—a technique that facilitated the mass image by sampling the dots at regular intervals, but reproduction of photographs for print books and from a different angle; as the two grids meet, this mis- newspapers—became a significant challenge to mass alignment leaves visual artifacts on the digitized image. print digitization. 8 Big Data & Society Figure 4. Moire´ and color aliasing (Giberne, 1908). Google’s automated image processing also often adjoining pages to bleed through during scanning. misrecognized features of print books. Initially cap- This, combined with the nuanced shading of Chinese tured in full color, raw bitmapped images were then characters, caused the system to miscategorize the page processed down to bitonal images for textual content (Zhang and Kangxi Emperor of China, 1882). or 8-bit grayscale for illustrated content (University of On the other hand, the same Chinese text often fared Michigan Library, 2005). Figure 5 shows a page of text poorly when rendered as a bitonal image within the rendered as a grayscale illustration. The thinness of the GBS digitization model. Binarization converts a raw original rice-paper volume allowed content from color digital image into a bitonal image by using an Chalmers and Edwards 9 Figure 5. Grayscale rendering, Chinese text on rice paper (Zhang and Kangxi Emperor of China, 1882). automatically determined threshold to differentiate from rendered characters. Figure 6, from the same foreground and background. This technique reduces book as the preceding example, illustrates the conse- the amount of data contained in full-color scans, quences of automated binarization for calligraphy pen thereby speeding up OCR processing and downstream detail. image distribution (Holley, 2009; Vincent, 2007). This problem is avoided by interleaving blank pages However, Google’s threshold settings often have the to block adjoining page noise, but to do so routinely effect of darkening, lightening, or erasing nuance would slow the scanning process considerably. Further, 10 Big Data & Society Figure 6. 
Bitonal rendering of Chinese text on rice paper (Zhang and Kangxi Emperor of China, 1882). without specialized language skills, the original book in Google’s standard protocol—scanning books front to hand, or time for careful examination, it can be very back and left to right—often caused books with vertical difficult to recognize the nature or extent of informa- or right-to-left writing formats to be delivered back- tion loss in a digitized page image. In a related example, ward or upside down (Weiss and James, 2015). Chalmers and Edwards 11 in a distributed system’’ to solve problems that Textual content cannot (yet) be undertaken by computers alone. Optimizing workflows for OCR does not in itself assure high quality character recognition. Consistent with Metadata Google’s brute-force approach, corpus indexing (and keyword search) was built upon software-generated Scholarly users of Google Books quickly identified pro- uncorrected OCR. Research evaluating OCR in large- blems with its metadata, e.g. item descriptors such as scale text digitization reveals widespread accuracy and author, publication date, and subject classification con- reliability problems; as with imaging, OCR accuracy is tained in traditional library catalogs (Duguid, 2007; challenged by print material features such as age and Nunberg, 2009; Townsend, 2007). Using his knowledge condition, printing flaws, rare fonts, textual annota- of canonical texts as a point of departure, Nunberg tions, and nontext symbols (Holley, 2009; Tanner (2009) conducted searches in the GBS corpus that et al., 2009). OCR also suffers in the presence of ima- revealed extensive errors in volume-level metadata. ging quality issues such as page skew, low resolution, These included a disproportionate number of books bleed through, and insufficient contrast. listing 1899 as their publication date; anachronistic Recall the page images of Mother Goose’s Melody in dates for terms such as ‘‘internet’’; mixups of author, Figure 3. Surrounded by visual artifacts of the digitiza- editor, and/or translator; subject misclassification (e.g., tion process, the text—a maxim about the value (and using publishing industry classifications designed to challenge) of independence—appears generally read- allocate books to shelf space in stores, rather than able. However, the OCR provided for the page Library of Congress subject headings); and mis-linking (Figure 7) reveals numerous problems, from missing (e.g., mismatch between volume information and page words to problems caused by the long s’s in the original images). James and Weiss’s (2012) quantitative assess- text. ment supports Nunberg’s anecdotal findings. In Human OCR correction, traditionally completed by response, Google acknowledged that it had constructed professionals double-keying texts, is considered the book metadata records by parsing more than 100 accuracy gold standard but is cost-prohibitive at scale sources of data (Orwant, 2009). These included library (Tanner et al., 2009). In 2009, Google acquired catalogs, publishing industry data, third-party meta- reCAPTCHA, owner of the web security technology data providers, and likely data extracted from OCR. CAPTCHA (Completely Automated Public Turing If each source contained errors, Google’s Jon Orwant test to tell Computers and Humans Apart) (Von Ahn acknowledged, the GBS corpus aggregated millions of and Cathcart, 2009). This technology, in widespread metadata errors across trillions of individual data fields. 
use since 2006, asks users to examine digitized images (That the most explicit official statement of Google’s of words OCR cannot interpret. Harnessing the free approach to metadata takes the form of a 3000þ labor of web users a few seconds at a time, but aggre- word blog post comment is at once extraordinary and gating to millions of hours, reCAPTCHA has improved unsurprising.) the usability of the GBS corpus (for certain languages) Google’s metadata mess was quickly—and while also being fed back into the training sets of publicly—cast as a confrontation between old and machine-learning algorithms. GBS thus fills gaps in new information systems for accessing books, evidence its automated quality control system with ‘‘human of Google’s techno-utopian investment in machine computation,’’ defined by CAPTCHA creator Von intelligence and the power of full-text search to triumph Ahn (2005) as treating ‘‘human brains as processors over the centralized library cataloging systems Figure 7. OCR produced from page images of Mother Goose’s Melody p. 13 (Figure 4) (Thomas and Shakespeare, 1945). 12 Big Data & Society constructed painstakingly by librarians (Nunberg, drives this crawling, a scale of change only manageable 2009). At a minimum, the pervasiveness of metadata through constant wholesale capture. By contrast, the errors drew attention to the irony of Google’s public pace of change for print media on library shelves is construction of GBS as an ‘‘enhanced card catalog.’’ In normally much slower. Pages may turn brittle. Users practice, the need to circumvent license restrictions on may mark up books, or more rarely, steal them. bibliographic data significantly shaped Google’s While Google tried to deploy a ‘‘scan once’’ strategy approach to metadata. Coyle (2009) and Jones (2014) for initial imaging, when it comes to image processing it assert that although Google obtained catalog records has treated its book corpus with a disregard for stability from library partners, libraries’ contracts with borne out of its experience with web pages. Embracing OCLC—a company that produces the union catalog the iterative logic of algorithmic systems, Google WorldCat—probably prohibited Google from display- routinely updates and replaces scanned content after ing that metadata directly. (For efficiency and consis- running it through improved error detection and tency, libraries often download catalog records from image quality algorithms (University of Michigan and WorldCat rather than create their own, but OCLC Google, Inc., 2005). Even if changes to the corpus tend restricts their use.) to be small and incremental—algorithms erase a finger Google’s metadata problems exposed imperfections in the margins of a scan, restore a missing page, or in existing book cataloging systems, from the challenges deliver a once-buried quote in search results—the of algorithmically interpreting MARC records to the constant and accumulating changes generate a sense temporal and geographic limitations of ISBNs to of instability. Google has not consistently provided errors in human-catalogued bibliographic data. users with documentation related to this updating The incompatibility of legacy catalog systems further (Conway, 2015); the automated work of maintenance challenged Google’s attempts to aggregate metadata and repair remains invisible. It is a tangled, even from multiple sources. 
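Google has not published how it reconciled its 100-plus metadata sources. As a purely hypothetical illustration of why aggregating many imperfect sources can propagate rather than cancel errors, consider a naive field-level majority vote across source records: if most sources carry the same placeholder publication date, the merged record inherits it.

```python
from collections import Counter

def reconcile(records: list[dict[str, str]]) -> dict[str, str]:
    """Naive reconciliation: keep the most common value reported for each field."""
    fields = {field for record in records for field in record}
    merged = {}
    for field in fields:
        values = [record[field] for record in records if field in record]
        merged[field] = Counter(values).most_common(1)[0][0]
    return merged

# Three invented source records for the same volume disagree on the date.
sources = [
    {"title": "The Story of the Sun, Moon, and Stars", "date": "1899"},
    {"title": "The Story of the Sun, Moon, and Stars", "date": "1899"},
    {"title": "The Story of the Sun, Moon, and Stars", "date": "1908"},
]
print(reconcile(sources)["date"])  # prints '1899': the majority wins even when it is wrong
```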
Over time, incremental modifi- paradoxical relationship, as the fundamental revisabil- cations to Google’s machine processing substantially ity of algorithms supersedes the print book’s material improved, identifying and ameliorating systemic meta- stability and persistence. But while algorithmic logic data problems. Nonetheless, GBS metadata continues suggests that the latest version of a page will always to be far from accurate. be the most accurate, critical traditions rooted in print culture may lead us to ask how GBS defines accu- racy and what other characteristics may be altered by Integrating books into the web real-time updating. Unlike print books, the web is not tied to a single This section has demonstrated that because GBS physical device for content delivery. In 2009, Google page images and machine-searchable text are in effect introduced ‘‘mobile editions’’ of the corpus. The devel- coproduced, an action at one stage of the process can opment team explained: set in motion a cascade of consequences that shape both visual and machine readability in the corpus. At Imperfect OCR is only the first challenge in the ulti- scale, optimizing workflows around textual properties mate goal of moving from collections of page images to of books ran the risk not only of distorting some books’ extracted-text-based books... The technical challenges visual properties but also of defining normative book are daunting, but we’ll continue to make enhancements characteristics. In Google’s one-size-fits-most scanning to our OCR and book structure extraction technolo- system, decisions about image processing may have a gies. With this launch, we believe that we’ve taken an disproportionate effect on certain aspects of the digi- important step toward more universal access to books. tized corpus; the Chinese-language volume described (Ratnakar et al., 2009) above was one of a set of 50, all digitized by Google at a single location and all subject to the same proces- By defining books as structured information carriers sing problems. from which content may be extracted and delivered Objects that are excluded from scanning, or dis- seamlessly via widely varying devices, Google’s focus torted and transformed beyond the point at which on mobile technology further distanced digitized they may be used as surrogates for their print originals, books from their print origins. become ‘‘noncharismatic objects’’ (Bowker, 2000): by failing to be ‘‘collected’’ through digitization, they are Approaching books as one among many objects to integrate into web search, Google also projected web- rendered invisible to future digitally based scholarship based expectations of change onto print books. Search or use. Further, Google’s opportunistic rather than sys- engines crawl the web constantly, capturing changes, tematic approach to digitization may amplify existing additions, and deletions to a massive set of networked selection biases in physical print collections, overrepre- pages. A well-justified expectation of constant flux sent certain types of publications (Pechenick et al., Chalmers and Edwards 13 2015), or perpetuate Anglo-American cultural domi- We must, then, attend carefully to how Google’s nance in digital cultural heritage (Jeanneney, 2008). algorithmic system supports some users’ requirements while simultaneously rendering others difficult or impossible to meet. 
For example, ‘‘visible page Mediating access: Indexing the world, texture’’—from marginalia to other signs of aging or one piece of text at a time use inscribed on the printed page—may be useful infor- By constructing books as data, GBS inserts them into a mation or a mark of authenticity for some users, yet it networked world where algorithms increasingly med- is defined as noise for automated image processing. iate human access to information. In the project’s A situated understanding of these details exposes wake, the dream of digitizing ‘‘everything’’ has taken limitations to GBS’s suitability as a flexible, general- hold, recalibrating the sense of what is possible and use collection that can meet the needs of a range of what is expected for both individual web users and stakeholders, such as readers (Duguid, 2007; cultural heritage institutions. Nunberg, 2009), researchers conducting quantitative This article is the first piece of a larger, ongoing analyses of cultural trends (Michel et al., 2011), or cul- study of several large-scale cultural heritage digitization tural heritage institutions. projects, including the Internet Archive and genealogy Further, the opacity of Google’s processes has organization FamilySearch. This project seeks to join contributed to widespread critique of libraries and an existing critique oriented toward material culture other memory institutions ‘‘outsourcing the risk and and labor process with an emerging critique of algorith- responsibility’’ for digitization to a private company mic culture. ‘‘Algorithmic digitization’’ thus serves us (Vaidhyanathan, 2012). Google’s ‘‘black box outsour- as a sensitizing concept emphasizing relationships cing model’’ (Leetaru, 2008) frames agreements with between inputs, materials, labor, processes, outputs, content providers as partnerships rather than custo- use, and users. We use it here to consider opportunities mer–client relationships. These partners give up some and limitations in Google’s approach to providing uni- control over project parameters, tacitly agree to parti- versal access to information. cipate in the digitizer’s larger projects or agendas, and Understanding GBS as an algorithmic system renders remain dependent on the digitizer’s continued interest visible multiple tensions in the project: between Google’s and investment in digitization. As smaller institutions and collections gain access to digitization through this universalizing public rhetoric about the project and the technical processes that must translate these ambiguous privatized model, the risks grow. Google’s digitization visions into workflows; between the competing goals of model conceals the resource-intensive nature of stakeholders such as Google, publishers, authors, and digitization, from the invisible labor of professional libraries; between aspirations of scale and the specialized librarians, contract workers, and end users filling in needs of individual end users or books; between the the gaps created by incomplete automation to unan- materiality of the print book and that of the computer; swered questions of long-term maintenance or preser- and between the invisible, iterative authority of algo- vation of digital assets. It may thus discourage cultural rithms and that of human visual experience or expertise. heritage institutions from budgeting sufficiently for As we have seen, notable limitations stem from their own digitization infrastructures. This will doubt- Google’s choices in resolving these tensions. 
less leave some institutions unprepared to maintain Imperfection is unavoidable in large-scale book digiti- their traditional stewardship roles with respect to digi- zation. Yet the vocabulary of error is often too static to tal content. be useful, since error is always relative to a particular Just as users (individuals or institutions) benefit or user and/or purpose. Gooding (2013) argues that large- suffer from Google’s reliance on algorithmic processing scale cultural heritage digitization sacrifices quality to differently, so too are print books unevenly affected. serve scale. We have shown that while intuitively Google’s highly proceduralized scanning workflows appealing, this argument is too simplistic. It tends to (perhaps inadvertently) imposed a normative idea of align ‘‘quality’’ with the needs and values of traditional the form and content of the English language book readers, thus privileging visual access. In doing so it on the digitization process. With its construction of ignores the extent to which quantity and quality are digitization as a text extraction and indexing challenge, mutually constitutive in building a digitization econ- Google further distanced itself from library-based omy of scale and misses the careful calibration of understanding of the value of scanned page images as trade-offs between multiple forms of access to books surrogates for print originals. Instead, the above ana- afforded by digitization. It misunderstands the mea- lysis has revealed several ways in which Google aligned sures by which the project itself has defined and eval- GBS with other iterative, algorithmic systems—from uated quality. Finally, it overstates Google’s concern Google Streetview to 23 & Me—created to bring phy- with end users more generally. sical objects, information systems, and even human 14 Big Data & Society 5. While patents provide only generic system descriptions, bodies within the visual and computational logics of the they provide sufficient detail for high-level reverse engi- web. neering of Google’s processes. Journalists’ accounts and Today, books maintain an uneasy parallel existence, output-oriented research provide anecdotal verification caught between the world of the web and the world of (Clements, 2009; Shankland, 2009). Gutenberg. GBS highlights the uneven rates of change and competing logics of these two worlds, the techno- logical and legal frameworks that may produce, orga- References nize, and mediate access to print and digital Baird HS (2003) Digital libraries and document image information differently but that digitization forces analysis. In: Seventh international conference on document together. Google shaped the processes and outputs of analysis and recognition, Los Alamitos, CA, 4–6 August GBS to respect the constraints of copyright law, for 2013, pp.2–14. IEEE. example. Yet it simultaneously sought to circumvent Band J (2009) The long and winding road to the Google print-based permissions management by emphasizing Books settlement. The John Marshall Review of functionality that resonated with its web- and scale- Intellectual Property Law 9(2): 227–329. centric mission but had no direct parallel with print. Bowker GC (2000) Biodiversity datadiversity. Social Studies GBS has provided searchable text access to mil- of Science 30(5): 643–683. Cadwalladr C (2016) How to bump Holocaust deniers off lions of books. The weight of this remarkable Google’s top spot? Pay Google. The Guardian,17 achievement must not be denied or underestimated. December. 
Available at: https://www.theguardian.com/ Yet by equating digital access with full-text search, technology/2016/dec/17/holocaust-deniers-google-search- the GBS corpus has created a future for books in top-spot (accessed 1 February 2017). which they are defined principally by their textual Carr R (2005) Oxford-Google Mass-Digitisation Programme. content. Google’s workflows have elided other (his- Washington, DC. Available at: http://www.bodley.ox.ac. torical, artifactual, material) properties of books that, uk/librarian/rpc/CNIGoogle/CNIGoogle.htm (accessed 1 when absent, threaten to disrupt or reframe the rela- February 2017). tionship between a digitized surrogate and its print Ceynowa K (2009) Mass digitization for research and study. original. As print libraries fade into the deep back- IFLA Journal 35(1): 17–24. ground of our brave new digital world, much has Clements M (2009) The secret of Google’s book scanning been lost that cannot be regained. machine revealed. National Public Radio website. Available at: http://www.npr.org/sections/library/2009/ 04/the_granting_of_patent_7508978.html (accessed 7 Declaration of conflicting interests February 2017). Conway P (2013) Preserving imperfection: Assessing the inci- The author(s) declared no potential conflicts of interest with dence of digital imaging error in HathiTrust. Preservation, respect to the research, authorship, and/or publication of this Digital Technology and Culture 42(1): 17–30. article. Conway P (2015) Digital transformations and the archival nature of surrogates. Archival Science 15(1): 51–69. Funding Coyle K (2006) Mass digitization of books. The Journal of Academic Librarianship 32(6): 641–645. The author(s) received no financial support for the research, Coyle K (2009) Google Books metadata and library func- authorship, and/or publication of this article. tions. Coyle’s InFormation. Available at: http://kcoyle. blogspot.com/2009/09/google-books-metadata-and- Notes library.html (accessed 19 April 2017). 1. Google Book Search: http://books.google.com. Darnton R (2009) The Case for Books: Past, Present, and 2. The original five libraries were Harvard, Stanford, the Future. New York: Public Affairs. University of Michigan, New York Public Library, and Duguid P (2007) Inheritance and loss? A brief survey of the Bodleian Library at Oxford University. Google Books. First Monday 12(8). 3. While not the first, GBS was the biggest and most contro- Fleischhauer C (1996) Digital Formats for Content versial of several large cultural heritage digitization pro- Reproductions. Library of Congress. Available at: http:// jects undertaken by entities such as Yahoo, Microsoft, memory.loc.gov/ammem/formatold.html (accessed 16 Google, and the Internet Archive in the early 2000s (St. June 2017). Clair, 2008). Giberne A (1908) The Story of the Sun, Moon, and Stars. 4. The Association of American Publishers lawsuit was Chicago, IL: Thompson & Thomas. Available at: https:// settled privately in 2011, while in 2015 the Second books.google.com/books?id¼KY8AAAAAMAAJ Circuit Court of Appeals upheld a 2013 lower court judg- (accessed 1 February 2017). ment rejecting the Authors Guild’s copyright infringement Gillespie T (2016) Algorithms. In: Peters B (ed.) Digital claims and affirming Google’s scanning as transformative Keywords. Princeton, NJ: Princeton University Press, and therefore ‘‘fair use.’’ pp. 18–30. Chalmers and Edwards 15 Golumbia D (2009) The Cultural Logic of Computation. 

Big Data & Society, Volume 4(2), 3 July 2017
Publisher: SAGE
ISSN / eISSN: 2053-9517
DOI: 10.1177/2053951717716950

Scholars have considered then Google has worked with over 40 library partners GBS in the context of the corporate monopolization to scan over 20 million books, producing billions of of cultural heritage (Vaidhyanathan, 2012), the history pages of searchable text. In 2012, without any formal and future of the book as a physical medium (Darnton, announcement, Google quietly began to scale back the 2009), and the place of digitized books in knowledge project, falling short of its aspirations to scan ‘‘every- infrastructures such as libraries (Jones, 2014; Murrell, thing’’ (Howard, 2012). While it seems unlikely that 2010). Leetaru (2008) provides a rare analysis of GBS Google will stop digitizing books completely or jettison analog–digital conversion processes, while Google its digitized corpus anytime soon, the project’s future is employees Langley and Bloomberg (2007) and currently unknown. Chalmers and Edwards 3 To Google, converting print books into electroni- Seaver (2013) calls for reframing the questions we ask cally searchable data was GBS’s entire raison d’eˆ tre. about algorithmic systems, moving away from Therefore, Google constructed digitization as a step conceiving of algorithms as technical objects with cul- parallel to the web crawling that enabled web search. tural consequences and toward the question of ‘‘how In contracts with library partners, Google defined digi- algorithmic systems define and produce distinctions tization as ‘‘to convert content from a tangible, analog and relations between technology and culture’’ in spe- form into a digital representation of that content’’ cific settings. Studying algorithmic systems empirically (University of Michigan and Google, Inc., 2005). In may thus bring together several elements: the technical practice, this conversion produced a digital surrogate details of algorithm function; the imbrication of in which multiple representations of a print book exist humans (designers, production assistants, users) and simultaneously. Each digitized book is comprised of a human values in algorithmic systems; and the multiple series of page images, a file containing the book’s text, contexts in which algorithms are developed and and associated metadata. Layered to produce multiple deployed. types of human and machine access—page images, Like many contemporary digital systems, GBS full-text search, and pointers to physical copies held integrated humans as light industrial labor, necessary by libraries—each of these elements was produced by if inefficient elements of an incompletely automated separate, yet related, processes. process. Human labor in GBS was almost entirely phy- sical, heavily routinized, and kept largely out of sight; human expertise resides outside rather than inside Integrating human values—and labor—into Google’s system. Partner library employees pulled algorithmic systems books from shelves onto carts destined for a Google- As with many Google endeavors, the company managed off-site scanning facility (Palmer, 2005). reengineered familiar processes at new levels of techno- There, contract workers turned pages positioned logical sophistication. From that perspective, Google’s under cameras, feeding high-speed image processing primary innovation on libraries’ hand-crafted ‘‘bou- workflows around the clock (University of Michigan tique’’ digitization models (which pair careful content and Google, Inc., 2005). 
Directly supervised by the selection with preservation-quality scanning) was to machines they were hired to operate, scanning workers approach book digitization as it would any other were required to sign nondisclosure agreements but large-scale data management project: as a challenge of afforded none of the perks of being a Google employee scale, rather than kind. Susan Wojcicki, a product man- beyond the walls of a private scanning facility (Norman ager for the project, contextualized Google’s approach Wilson, 2009). For the time being, at least, human bluntly: ‘‘At Google we’re good at doing things at labor in book digitization remains necessary largely scale’’ (Roush, 2005). In other words, Google because of the material fragility, inconsistency, and turned book digitization into an algorithmic process. variety of print books. Scaled-up scanning required a work process centered in and around algorithms. Preparing to digitize: Partnerships, goal alignment, Algorithms are complex sequences of instructions selection expressed in computer code, flowcharts, decision trees, or other structured representations. From Facebook to Mass digitization initiatives are often characterized as Google and Amazon, algorithms increasingly shape operating without a selection principle: ‘‘everything’’ how we seek information, what information we find, must be digitized (Coyle, 2006). In practice, however, and how we use it. Because algorithms are typically partnerships, scaling requirements, intellectual property designed to operate with little oversight or intervention, regimes designed for print, and the particulars of the substantial human labor involved in their creation books’ material characteristics all challenged Google’s and deployment remain obscured. Algorithmic invisi- universal scanning aspirations. bility easily slides into a presumed neutrality, and they At the turn of the 21st century, Lynch (2002) remain outside users’ direct control as they undergo observed that cultural heritage institutions mostly iterative improvement and refinement. Finally, the understood the hows of digitization, even at moderately vast complexity of many algorithms—especially inter- large scale. The main challenge, he argued, was to opti- acting systems of algorithms—can render their beha- mize processes. Lesk (2003) described the challenges of vior impossible for even their designers to predict or scale and efficiency more succinctly: ‘‘we need the understand. Henry Ford of digitization,’’ i.e. an institution willing Embedded in systems, algorithms have the power to invest vast resources in ‘‘digitization on an industrial to reconfigure work, life, and even physical spaces scale’’ (Milne, 2008). Google stepped forward to (Gillespie, 2016; Golumbia, 2009; Striphas, 2015). assume this role. 4 Big Data & Society Google courted partners to provide content by mainly to discover books rather than to actually read incurring nearly all costs of scanning, while carefully them (Google, Inc., 2004a). Yet partners scanning avoiding the repository-oriented responsibilities of a public domain books often referenced online reading library. Each partner library brought its own goals as a benefit. This ambiguity perhaps contributed to and motivations into the project. The New York copyright-related concerns—and misunderstandings— Public Library (2004) observed that ‘‘without during GBS’s early days (Carr, 2005; New York Google’s assistance, the cost of digitizing our books Public Library, 2004). 
— in both time and dollars — would be prohibitive.’’ Other partners spoke of leveraging Google’s technical A means to an end: Image capture expertise and innovation to inform future institutional digitization efforts (Carr, 2005; Palmer, 2005). Libraries Once it took custody of partner library books, employed different selection criteria, from committing Google deployed its own selection criteria. In a to digitize all holdings (e.g., University of Michigan) to (rare) concession to the library partners tasked with selecting only public domain holdings (e.g., Oxford, storing and preserving paper materials, Google used a NYPL) or special collections (later partners). Most nondestructive scanning technique. In patents filed in digitization contracts remained private, adding to the 2003 and 2004, Google provided descriptions of sev- secrecy surrounding Google’s efforts. eral high-resolution image capture systems designed Full-text search quickly emerged as a kind of lowest- around the logistical challenges posed by bound common-denominator primary functionality for the documents. The thicker the binding, for example, project. Using the Internet Archive’s Wayback the less likely a book is to lie flat. In flatbed or Machine, we can see how Google incrementally mod- overhead scanners, page curvature creates skewed or ified language relating to the project’s goals and distorted scanned images. Book cradles or glass pla- mechanisms throughout its first year (Google, Inc., tens can flatten page surfaces, but these labor-inten- 2004b). The answer to the question ‘‘What is the sive tools slow down scanning and can damage book Library Project’’ evolved from an effort to transport spines. Google addressed this page curvature problem media online (December 2004) to a pledge to make computationally, through a combination of 3D ima- ‘‘offline information searchable’’ (May 2005) to a ging and downstream image processing algorithms. more ambiguous plan to ‘‘include [libraries’] col- That decision shaped and complicated Google’s lections... and, like a card catalog, show users informa- workflow. tion about the book plus a few snippets – a few In the patent schematic shown in Figure 2, two cam- sentences of their search term in context’’ (November eras (305, 310) are positioned to capture two-dimen- 2005, emphasis added). sional images of opposing pages of a bound book The purpose behind these changes became clear in (301). Simultaneously, an infrared (IR) projector Fall 2005, as the Authors Guild and the Association of (325) superimposes a pattern on the book’s surface, American Publishers filed lawsuits alleging copyright enabling an IR stereoscopic camera (315) to generate infringement (Band, 2009). Google argued that by a three-dimensional map of each page (Lefevere and creating a ‘‘comprehensive, searchable, virtual card cat- Saric, 2009). Using a dewarping algorithm, Google alog of all books in all languages,’’ it provided pointers can subsequently detect page curvature in these 3D to book content rather than access to copyright- page maps and correct by straightening and stretching protected books. The company maintained that scan- text (Lefevere and Saric, 2008). ning-enabled indexing constituted ‘‘fair use’’ under the Scanning produces bitmapped images that repre- U.S. Copyright Act (Schmidt, 2005; US Copyright sent the pages of a print book as a grid of pixels Office, 2016). In November 2005, the project’s name for online viewing. 
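The cited patents describe this capture-and-dewarp approach only at the level of system diagrams. Purely as an illustrative sketch, and not Google's implementation, the Python fragment below shows how a per-pixel depth map of a curved page might be used to stretch each image row back toward a flat geometry; the function names and the simple row-wise model are our own assumptions.

```python
import numpy as np

def dewarp_row(intensities: np.ndarray, depth: np.ndarray) -> np.ndarray:
    """Resample one image row so text compressed by page curvature is stretched flat.

    intensities: pixel values along the row (shape [W]).
    depth: camera-to-page distance for the same pixels (shape [W]), standing in
           for the 3D page map produced by the stereo infrared capture.
    """
    # Arc length along the curved page surface: ds = sqrt(dx^2 + dz^2), with dx = 1 pixel.
    dz = np.diff(depth.astype(float))
    arc = np.concatenate([[0.0], np.cumsum(np.sqrt(1.0 + dz ** 2))])
    # Resample at evenly spaced arc-length positions so each output pixel
    # covers the same physical distance on the flattened page.
    flat_positions = np.linspace(0.0, arc[-1], num=intensities.size)
    return np.interp(flat_positions, arc, intensities)

def dewarp_page(image: np.ndarray, depth_map: np.ndarray) -> np.ndarray:
    """Apply the row-wise flattening to a whole grayscale page image."""
    return np.vstack([dewarp_row(row, d) for row, d in zip(image, depth_map)])
```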
Unlike text, this imaged content changed from Google Print to Google Book Search, cannot be searched and remains ‘‘opaque to the algo- reorienting users’ frame of reference from the world rithmic eyes of the machine’’ (Kirschenbaum, 2003). of paper to the world of the electronic web (Grant, As a next step after scanning, Google might have 2005). The change attempted to correct any mispercep- adopted existing library-based preservation best prac- tions that Google intended to enable access to tices for imaged content. Or it could have created user-printed copies of books and to deemphasize the new standards around 3D book imaging (Langley and Bloomberg, 2007; Leetaru, 2008). Instead, idea that the project was in the business of copying or of content ownership. Google chose to transform the raw 3D page maps Since December 2004, GBS has provided full access described above—rich in information, but unwieldy for public domain books. Google consistently down- for end users due to file size and format—into played this capability, maintaining that like a book- ‘‘clean and small images for efficient web serving’’ store ‘‘with a Google twist,’’ readers would use it (Vincent, 2007). Chalmers and Edwards 5 Figure 2. System for optically scanning documents (Lefevere and Saric, 2009). in analog form (Holihan, 2006; Schantz, 1982). Through OCR, imaged documents gain new function- Producing a machine-readable index: Image ality, as text may be searched, aggregated, mined for processing patterns, or converted to audio formats for visually For GBS, then, imaging ultimately represented a key impaired users. yet preliminary step toward text-searchable books on Tanner et al. (2009) argue that by providing search the web. The project’s image processing workflows thus functionality for large digitized corpora at low cost, acquired a dual imperative. It had to produce both (a) automated OCR systems have been a key driver of two-dimensional page images for web delivery, and (b) large-scale text digitization. GBS leveraged decades of machine-readable—and therefore searchable—text. computing research related to OCR. Through the ‘‘[O]ur general approach here has been to just get the 1990s, boutique library digitization efforts had books scanned, because until they are digitized and addressed the question of quality mainly by establish- OCR is done, you aren’t even in the game,’’ Google ing image-centric digitization standards (e.g., scanner Books engineering director James Crawford observed specifications and calibration, test targets, resolution) in 2010 (Madrigal, 2010). The ‘‘game’’ here, of course, (Baird, 2003). Rooted in libraries’ traditions of is search. In a web search engine, crawled page content ensuring long-term visual access to materials through and metadata are parsed and stored in an index, a list reformatting (e.g., copying, microfilming), these prac- of words accompanied by their locations. Indexing tices relied on labor-intensive visual inspection for qual- quickly became the key mechanism (and metaphor) ity control. By contrast, pattern recognition research through which Google sought to unlock the content developed systems for algorithmically assessing quality, of books for web search. measured by accurate recognition of printed characters To produce its full-text index, Google converted and document structure (Le Bourgeois et al., 2004; Lin, page images to text using optical character recognition 2006). (OCR). 
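An index in this sense, a list of words accompanied by their locations, can be illustrated with a toy inverted index. The sketch below is a generic textbook construction in Python, not GBS code; the volume identifier used in the example is simply the Google Books id of the 1862 Code of Procedure cited above.

```python
from collections import defaultdict

def build_index(pages: dict) -> dict:
    """Build a toy inverted index: word -> list of (book_id, page, word_position).

    `pages` maps (book_id, page_number) to the OCR text of that page.
    """
    index = defaultdict(list)
    for (book_id, page_no), text in pages.items():
        for position, token in enumerate(text.lower().split()):
            index[token].append((book_id, page_no, position))
    return index

# Every word of every scanned page becomes a pointer back into the corpus.
index = build_index({("aD0KAAAAIAAJ", 1): "Code of Procedure of the State of New York"})
print(index["procedure"])   # [('aD0KAAAAIAAJ', 1, 2)]
```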
OCR software uses pattern recognition to iden- Google adopted this framing of digitization as a text extraction challenge, optimizing its processes to tify alphanumeric characters on scanned page images and encode them as machine-readable characters. produce the clean, high-contrast page images necessary Originally used to automate processing of highly stan- for accurate OCR. The GBS processing pipeline relied dardized business documents such as bank checks, over heavily on OCR to automate not only image processing the past 60 years OCR has become integral to organiz- and quality control but also volume-level metadata ing and accessing digital information previously stored extraction. Google’s Vincent (2007) described the 6 Big Data & Society digitized corpus as algorithmic ‘‘document understand- gain (Terras, 2008). In GBS, lost information includes ing and analysis on a massive scale.’’ the physical size, weight, or structure of a volume; the texture and color of its pages; and the sensory experi- ence of navigating its contents. Nontextual book fea- Books bite back: Bookness as bug, not tures such as illustrations, as well as marginalia and feature other evidence of print books’ physical histories of In their commitment to scale and standardized use, are often distorted or auto-cropped out of procedure, algorithmic systems often prioritize system Google’s screen-based representations. As for informa- requirements over the needs of individual inputs (e.g., tion gain, image capture, and processing embed traces books) or users. Google’s search engine, for example, of the digitization process into digitized objects. has come under criticism for failing to prioritize The quality of Google’s digitization output has been authoritative or accurate search results. In December systematically evaluated through empirical research 2016, the Guardian reported that a Google query on and widely critiqued in informal venues such as blogs. ‘‘Did the Holocaust happen?’’ returned a Holocaust While useful in characterizing quality concerns in the denial website as the first result. A Google spokesper- digitized corpus, this work generally does not consider son maintained that how and why digitization processes shape outputs. The following examples illustrate commonly identified [w]hile it might seem tempting to fix the results of an problems, but they also extend existing analyses by individual query by hand, that approach does not scale emphasizing the role of algorithms in concretizing rela- to the many different variants of that query and the tionships among system inputs, conversion processes, queries that we have not yet seen. So we prefer to and outputs. These types of problems remain endemic take a scalable algorithmic approach to fix problems, in the GBS corpus not because they are unsolvable, but rather than removing these one by one. (Cadwalladr, rather because they have been accepted as trade-offs. 2016, emphasis added) Their solutions do not fit easily into Google’s priorities and workflows, even as their persistence challenges Google’s acknowledgment here of the trade-offs it faces efforts to automate quality assurance processes. between scale and granularity highlights questions of algorithmic accountability (Pasquale, 2015). 
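Neither Google's OCR engine nor its quality-control thresholds are public. As a hedged illustration of OCR-driven, scalable quality assessment of the kind described above, the sketch below uses the open-source Tesseract engine (via the pytesseract wrapper) to score a page by the share of confidently recognized words; the cutoff value is an arbitrary assumption.

```python
from PIL import Image
import pytesseract

def page_quality(image_path: str, min_conf: float = 60.0) -> float:
    """Score a page by the share of words OCR recognized with high confidence.

    Low-scoring pages would be queued for automated reprocessing rather than
    fixed by hand, a stand-in for algorithmic quality control at scale; the
    60.0 cutoff is an arbitrary assumption.
    """
    data = pytesseract.image_to_data(Image.open(image_path).convert("L"),
                                     output_type=pytesseract.Output.DICT)
    confs = [float(c) for c in data["conf"] if float(c) >= 0]  # -1 marks non-word boxes
    return sum(c >= min_conf for c in confs) / len(confs) if confs else 0.0
```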
Visual content Google’s system also exposes tensions between the standardization required to scale digitization processes Output-based evaluations of large-scale book digitiza- and the flexibility needed to accommodate the diverse tion have found that except when catastrophic (rare), output of print publication history. It is perhaps no most text-oriented page scanning or image processing surprise that books, unlike business documents created errors result in thin, thick, blurry, or skewed text that to meet OCR requirements, persistently resisted the may frustrate or annoy readers but does not render structure imposed on them by Google’s homogenizing them entirely unreadable (Conway, 2013; James, processes. 2010). Objects such as fingers or clamps also appear Bound books evolved over centuries from earlier commonly in scans but often do not obstruct text writing formats such as scrolls and codices. But in significantly. Google’s conversion system, the hard-won features of In Figure 3, the very tiny book Mother Goose’s bound books—the very things that made them conve- Melody has been housed in a binder to prevent it nient, efficient, and durable media for so long—were from being lost on a library shelf. While the library- treated as bugs rather than features. Google routinely created cover fits Google’s selection criteria and has excluded materials from scanning due to size or condi- provided a frame size for image capture, several mate- tion. These included very large or small books as well as rial elements usually cropped out of Google-digitized books with tight bindings, tipped-in photographs and page images have crept into the frame due to the size illustrations, foldout maps, or uncataloged material mismatch between the cover and the actual book. These (Coyle, 2006). Very old, brittle, or otherwise fragile include a call slip and university label, metal book- books were also excluded (Ceynowa, 2009; Milne, securing clamps, and the page-turner’s hands. When 2008). Many of the rejected books remain undigitized, extra-textual features are detected and removed algor- while others have joined lengthy queues within ithmically—without the help of a human eye—they libraries’ ongoing internal digitization programs. often leave new artifacts behind. We see some of As a sampling process in which some, but not all, these less familiar traces here: the stretched appearance features of an analog signal are chosen for digital of book pages caused by the dewarping algorithm, and capture and representation, digitization is always the finger incompletely removed by another algorithm. accompanied by both information loss and information Further, the system has evidently misrecognized some Chalmers and Edwards 7 Figure 3. Imaging a tiny book (Thomas and Shakespeare, 1945). aging yellow tape as a color illustration, causing most In Figure 4, the grid misalignment has created a of the page images throughout the right side of the psychedelic blue and orange sky, which appears to fas- book to be rendered in color. While this book is an cinate the astronomer Hipparchus (Giberne, 1908). example of a relatively rare ‘‘bad book’’ (Conway, These moire´ patterns appear throughout the image, 2013), it aggregates many of the visual quality issues along with color aliasing, from wavy striations in the that pervade Google’s digitized corpus. sky and floor to geometric patterns on building col- Other material characteristics challenge image pro- umns. Color aliasing occurs when the spatial frequency cessing. 
These include ornate, unusual, or old fonts; of the original image is sampled at a rate inadequate to non-Roman characters/scripts; and rice paper, glossy capture all its details. Like moire´ , it is a common phe- paper, glassine, and tissue paper (Conway, 2013; nomenon among Google-digitized books that contain Weiss and James, 2015). Nontextual content such as engravings or etchings. illustrations (e.g., woodcuts, engravings, etchings, While the problem of digitization and moire´ has photographic reproductions, and halftones) also often been discussed since at least the 1970s, and corrective fare poorly. Halftone reproductions, for example, have measures have been identified (Huang, 1974), no fully been widely used since the 1880s to cheaply reproduce automated solution appears to have emerged. In 1996, graphic content for print. Placing a screen over an the Library of Congress acknowledged that moire´ image and dividing it into squares, variably sized and mitigation strategies remained unsuitable for produc- regularly spaced ink dots are used to create the image; tion-scale environments (Fleischhauer, 1996). This type of error is predictable, yet intractable, in the human eye fills in the gaps created by sampling and perceives the image as a continuous tone. large-scale book digitization. It is ironic that halftone Computerized scanning similarly creates a digital screening—a technique that facilitated the mass image by sampling the dots at regular intervals, but reproduction of photographs for print books and from a different angle; as the two grids meet, this mis- newspapers—became a significant challenge to mass alignment leaves visual artifacts on the digitized image. print digitization. 8 Big Data & Society Figure 4. Moire´ and color aliasing (Giberne, 1908). Google’s automated image processing also often adjoining pages to bleed through during scanning. misrecognized features of print books. Initially cap- This, combined with the nuanced shading of Chinese tured in full color, raw bitmapped images were then characters, caused the system to miscategorize the page processed down to bitonal images for textual content (Zhang and Kangxi Emperor of China, 1882). or 8-bit grayscale for illustrated content (University of On the other hand, the same Chinese text often fared Michigan Library, 2005). Figure 5 shows a page of text poorly when rendered as a bitonal image within the rendered as a grayscale illustration. The thinness of the GBS digitization model. Binarization converts a raw original rice-paper volume allowed content from color digital image into a bitonal image by using an Chalmers and Edwards 9 Figure 5. Grayscale rendering, Chinese text on rice paper (Zhang and Kangxi Emperor of China, 1882). automatically determined threshold to differentiate from rendered characters. Figure 6, from the same foreground and background. This technique reduces book as the preceding example, illustrates the conse- the amount of data contained in full-color scans, quences of automated binarization for calligraphy pen thereby speeding up OCR processing and downstream detail. image distribution (Holley, 2009; Vincent, 2007). This problem is avoided by interleaving blank pages However, Google’s threshold settings often have the to block adjoining page noise, but to do so routinely effect of darkening, lightening, or erasing nuance would slow the scanning process considerably. Further, 10 Big Data & Society Figure 6. 
Bitonal rendering of Chinese text on rice paper (Zhang and Kangxi Emperor of China, 1882). without specialized language skills, the original book in Google’s standard protocol—scanning books front to hand, or time for careful examination, it can be very back and left to right—often caused books with vertical difficult to recognize the nature or extent of informa- or right-to-left writing formats to be delivered back- tion loss in a digitized page image. In a related example, ward or upside down (Weiss and James, 2015). Chalmers and Edwards 11 in a distributed system’’ to solve problems that Textual content cannot (yet) be undertaken by computers alone. Optimizing workflows for OCR does not in itself assure high quality character recognition. Consistent with Metadata Google’s brute-force approach, corpus indexing (and keyword search) was built upon software-generated Scholarly users of Google Books quickly identified pro- uncorrected OCR. Research evaluating OCR in large- blems with its metadata, e.g. item descriptors such as scale text digitization reveals widespread accuracy and author, publication date, and subject classification con- reliability problems; as with imaging, OCR accuracy is tained in traditional library catalogs (Duguid, 2007; challenged by print material features such as age and Nunberg, 2009; Townsend, 2007). Using his knowledge condition, printing flaws, rare fonts, textual annota- of canonical texts as a point of departure, Nunberg tions, and nontext symbols (Holley, 2009; Tanner (2009) conducted searches in the GBS corpus that et al., 2009). OCR also suffers in the presence of ima- revealed extensive errors in volume-level metadata. ging quality issues such as page skew, low resolution, These included a disproportionate number of books bleed through, and insufficient contrast. listing 1899 as their publication date; anachronistic Recall the page images of Mother Goose’s Melody in dates for terms such as ‘‘internet’’; mixups of author, Figure 3. Surrounded by visual artifacts of the digitiza- editor, and/or translator; subject misclassification (e.g., tion process, the text—a maxim about the value (and using publishing industry classifications designed to challenge) of independence—appears generally read- allocate books to shelf space in stores, rather than able. However, the OCR provided for the page Library of Congress subject headings); and mis-linking (Figure 7) reveals numerous problems, from missing (e.g., mismatch between volume information and page words to problems caused by the long s’s in the original images). James and Weiss’s (2012) quantitative assess- text. ment supports Nunberg’s anecdotal findings. In Human OCR correction, traditionally completed by response, Google acknowledged that it had constructed professionals double-keying texts, is considered the book metadata records by parsing more than 100 accuracy gold standard but is cost-prohibitive at scale sources of data (Orwant, 2009). These included library (Tanner et al., 2009). In 2009, Google acquired catalogs, publishing industry data, third-party meta- reCAPTCHA, owner of the web security technology data providers, and likely data extracted from OCR. CAPTCHA (Completely Automated Public Turing If each source contained errors, Google’s Jon Orwant test to tell Computers and Humans Apart) (Von Ahn acknowledged, the GBS corpus aggregated millions of and Cathcart, 2009). This technology, in widespread metadata errors across trillions of individual data fields. 
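Google has not published how it selects binarization thresholds in cases like those shown in Figures 5 and 6. One standard automatic method is Otsu's global threshold, sketched below only to make the trade-off concrete: whatever shading falls on the wrong side of a single threshold is discarded outright.

```python
import numpy as np

def otsu_threshold(gray: np.ndarray) -> int:
    """Choose one global threshold for an 8-bit grayscale page by maximizing
    between-class variance (Otsu's method)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    cum_count = np.cumsum(hist)
    cum_sum = np.cumsum(hist * np.arange(256))
    total, grand_sum = cum_count[-1], cum_sum[-1]
    best_t, best_var = 0, 0.0
    for t in range(255):
        w0, w1 = cum_count[t], total - cum_count[t]
        if w0 == 0 or w1 == 0:
            continue
        mu0 = cum_sum[t] / w0
        mu1 = (grand_sum - cum_sum[t]) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray: np.ndarray) -> np.ndarray:
    """Reduce a grayscale page to pure black and white; shading detail that
    falls on the wrong side of the threshold is simply lost."""
    return np.where(gray > otsu_threshold(gray), 255, 0).astype(np.uint8)
```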
use since 2006, asks users to examine digitized images (That the most explicit official statement of Google’s of words OCR cannot interpret. Harnessing the free approach to metadata takes the form of a 3000þ labor of web users a few seconds at a time, but aggre- word blog post comment is at once extraordinary and gating to millions of hours, reCAPTCHA has improved unsurprising.) the usability of the GBS corpus (for certain languages) Google’s metadata mess was quickly—and while also being fed back into the training sets of publicly—cast as a confrontation between old and machine-learning algorithms. GBS thus fills gaps in new information systems for accessing books, evidence its automated quality control system with ‘‘human of Google’s techno-utopian investment in machine computation,’’ defined by CAPTCHA creator Von intelligence and the power of full-text search to triumph Ahn (2005) as treating ‘‘human brains as processors over the centralized library cataloging systems Figure 7. OCR produced from page images of Mother Goose’s Melody p. 13 (Figure 4) (Thomas and Shakespeare, 1945). 12 Big Data & Society constructed painstakingly by librarians (Nunberg, drives this crawling, a scale of change only manageable 2009). At a minimum, the pervasiveness of metadata through constant wholesale capture. By contrast, the errors drew attention to the irony of Google’s public pace of change for print media on library shelves is construction of GBS as an ‘‘enhanced card catalog.’’ In normally much slower. Pages may turn brittle. Users practice, the need to circumvent license restrictions on may mark up books, or more rarely, steal them. bibliographic data significantly shaped Google’s While Google tried to deploy a ‘‘scan once’’ strategy approach to metadata. Coyle (2009) and Jones (2014) for initial imaging, when it comes to image processing it assert that although Google obtained catalog records has treated its book corpus with a disregard for stability from library partners, libraries’ contracts with borne out of its experience with web pages. Embracing OCLC—a company that produces the union catalog the iterative logic of algorithmic systems, Google WorldCat—probably prohibited Google from display- routinely updates and replaces scanned content after ing that metadata directly. (For efficiency and consis- running it through improved error detection and tency, libraries often download catalog records from image quality algorithms (University of Michigan and WorldCat rather than create their own, but OCLC Google, Inc., 2005). Even if changes to the corpus tend restricts their use.) to be small and incremental—algorithms erase a finger Google’s metadata problems exposed imperfections in the margins of a scan, restore a missing page, or in existing book cataloging systems, from the challenges deliver a once-buried quote in search results—the of algorithmically interpreting MARC records to the constant and accumulating changes generate a sense temporal and geographic limitations of ISBNs to of instability. Google has not consistently provided errors in human-catalogued bibliographic data. users with documentation related to this updating The incompatibility of legacy catalog systems further (Conway, 2015); the automated work of maintenance challenged Google’s attempts to aggregate metadata and repair remains invisible. It is a tangled, even from multiple sources. 
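Orwant's comment does not say how the 100-plus metadata sources were reconciled. The sketch below assumes a naive majority-vote merge, purely to illustrate how aggregation can propagate rather than cancel systematic errors such as the spurious 1899 publication dates; the field names and the voting rule are our assumptions, not a description of Google's method.

```python
from collections import Counter

def reconcile(records: list) -> dict:
    """Merge volume-level metadata from many sources by simple majority vote.

    With conflicting inputs (library catalogs, publisher feeds, OCR-derived
    dates), a naive vote keeps whichever value is most common, so a mistake
    shared by several sources wins and errors aggregate rather than cancel.
    """
    merged = {}
    fields = {field for record in records for field in record}
    for field in fields:
        votes = Counter(r[field] for r in records if r.get(field) is not None)
        if votes:
            merged[field] = votes.most_common(1)[0][0]
    return merged

# Two of three hypothetical sources misreport the date; the wrong value wins.
sources = [
    {"title": "Mother Goose's Melody", "date": 1899},
    {"title": "Mother Goose's Melody", "date": 1899},
    {"title": "Mother Goose's Melody", "date": 1945},
]
print(reconcile(sources)["date"])   # 1899
```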
Over time, incremental modifi- paradoxical relationship, as the fundamental revisabil- cations to Google’s machine processing substantially ity of algorithms supersedes the print book’s material improved, identifying and ameliorating systemic meta- stability and persistence. But while algorithmic logic data problems. Nonetheless, GBS metadata continues suggests that the latest version of a page will always to be far from accurate. be the most accurate, critical traditions rooted in print culture may lead us to ask how GBS defines accu- racy and what other characteristics may be altered by Integrating books into the web real-time updating. Unlike print books, the web is not tied to a single This section has demonstrated that because GBS physical device for content delivery. In 2009, Google page images and machine-searchable text are in effect introduced ‘‘mobile editions’’ of the corpus. The devel- coproduced, an action at one stage of the process can opment team explained: set in motion a cascade of consequences that shape both visual and machine readability in the corpus. At Imperfect OCR is only the first challenge in the ulti- scale, optimizing workflows around textual properties mate goal of moving from collections of page images to of books ran the risk not only of distorting some books’ extracted-text-based books... The technical challenges visual properties but also of defining normative book are daunting, but we’ll continue to make enhancements characteristics. In Google’s one-size-fits-most scanning to our OCR and book structure extraction technolo- system, decisions about image processing may have a gies. With this launch, we believe that we’ve taken an disproportionate effect on certain aspects of the digi- important step toward more universal access to books. tized corpus; the Chinese-language volume described (Ratnakar et al., 2009) above was one of a set of 50, all digitized by Google at a single location and all subject to the same proces- By defining books as structured information carriers sing problems. from which content may be extracted and delivered Objects that are excluded from scanning, or dis- seamlessly via widely varying devices, Google’s focus torted and transformed beyond the point at which on mobile technology further distanced digitized they may be used as surrogates for their print originals, books from their print origins. become ‘‘noncharismatic objects’’ (Bowker, 2000): by failing to be ‘‘collected’’ through digitization, they are Approaching books as one among many objects to integrate into web search, Google also projected web- rendered invisible to future digitally based scholarship based expectations of change onto print books. Search or use. Further, Google’s opportunistic rather than sys- engines crawl the web constantly, capturing changes, tematic approach to digitization may amplify existing additions, and deletions to a massive set of networked selection biases in physical print collections, overrepre- pages. A well-justified expectation of constant flux sent certain types of publications (Pechenick et al., Chalmers and Edwards 13 2015), or perpetuate Anglo-American cultural domi- We must, then, attend carefully to how Google’s nance in digital cultural heritage (Jeanneney, 2008). algorithmic system supports some users’ requirements while simultaneously rendering others difficult or impossible to meet. 
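The contracts and blog posts describe this updating only in general terms. The fragment below sketches the scan-once, reprocess-forever pattern under stated assumptions: raw captures are retained, derived images and text are silently overwritten whenever the pipeline version increases, and no user-facing changelog is produced. `derive_page` is a hypothetical stand-in for the dewarping, binarization, and OCR chain.

```python
def refresh_corpus(corpus: dict, raw_captures: dict, derive_page, new_version: int) -> None:
    """Silently regenerate derived pages produced by an older pipeline version.

    Raw captures are scanned once and kept; derived images and OCR text are
    overwritten in place whenever the algorithms improve, with no record of
    the change exposed to end users.
    """
    for page_id, raw in raw_captures.items():
        entry = corpus.get(page_id)
        if entry is None or entry["pipeline_version"] < new_version:
            image, text = derive_page(raw)
            corpus[page_id] = {"image": image, "text": text,
                               "pipeline_version": new_version}
```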
For example, ‘‘visible page Mediating access: Indexing the world, texture’’—from marginalia to other signs of aging or one piece of text at a time use inscribed on the printed page—may be useful infor- By constructing books as data, GBS inserts them into a mation or a mark of authenticity for some users, yet it networked world where algorithms increasingly med- is defined as noise for automated image processing. iate human access to information. In the project’s A situated understanding of these details exposes wake, the dream of digitizing ‘‘everything’’ has taken limitations to GBS’s suitability as a flexible, general- hold, recalibrating the sense of what is possible and use collection that can meet the needs of a range of what is expected for both individual web users and stakeholders, such as readers (Duguid, 2007; cultural heritage institutions. Nunberg, 2009), researchers conducting quantitative This article is the first piece of a larger, ongoing analyses of cultural trends (Michel et al., 2011), or cul- study of several large-scale cultural heritage digitization tural heritage institutions. projects, including the Internet Archive and genealogy Further, the opacity of Google’s processes has organization FamilySearch. This project seeks to join contributed to widespread critique of libraries and an existing critique oriented toward material culture other memory institutions ‘‘outsourcing the risk and and labor process with an emerging critique of algorith- responsibility’’ for digitization to a private company mic culture. ‘‘Algorithmic digitization’’ thus serves us (Vaidhyanathan, 2012). Google’s ‘‘black box outsour- as a sensitizing concept emphasizing relationships cing model’’ (Leetaru, 2008) frames agreements with between inputs, materials, labor, processes, outputs, content providers as partnerships rather than custo- use, and users. We use it here to consider opportunities mer–client relationships. These partners give up some and limitations in Google’s approach to providing uni- control over project parameters, tacitly agree to parti- versal access to information. cipate in the digitizer’s larger projects or agendas, and Understanding GBS as an algorithmic system renders remain dependent on the digitizer’s continued interest visible multiple tensions in the project: between Google’s and investment in digitization. As smaller institutions and collections gain access to digitization through this universalizing public rhetoric about the project and the technical processes that must translate these ambiguous privatized model, the risks grow. Google’s digitization visions into workflows; between the competing goals of model conceals the resource-intensive nature of stakeholders such as Google, publishers, authors, and digitization, from the invisible labor of professional libraries; between aspirations of scale and the specialized librarians, contract workers, and end users filling in needs of individual end users or books; between the the gaps created by incomplete automation to unan- materiality of the print book and that of the computer; swered questions of long-term maintenance or preser- and between the invisible, iterative authority of algo- vation of digital assets. It may thus discourage cultural rithms and that of human visual experience or expertise. heritage institutions from budgeting sufficiently for As we have seen, notable limitations stem from their own digitization infrastructures. This will doubt- Google’s choices in resolving these tensions. 
less leave some institutions unprepared to maintain Imperfection is unavoidable in large-scale book digiti- their traditional stewardship roles with respect to digi- zation. Yet the vocabulary of error is often too static to tal content. be useful, since error is always relative to a particular Just as users (individuals or institutions) benefit or user and/or purpose. Gooding (2013) argues that large- suffer from Google’s reliance on algorithmic processing scale cultural heritage digitization sacrifices quality to differently, so too are print books unevenly affected. serve scale. We have shown that while intuitively Google’s highly proceduralized scanning workflows appealing, this argument is too simplistic. It tends to (perhaps inadvertently) imposed a normative idea of align ‘‘quality’’ with the needs and values of traditional the form and content of the English language book readers, thus privileging visual access. In doing so it on the digitization process. With its construction of ignores the extent to which quantity and quality are digitization as a text extraction and indexing challenge, mutually constitutive in building a digitization econ- Google further distanced itself from library-based omy of scale and misses the careful calibration of understanding of the value of scanned page images as trade-offs between multiple forms of access to books surrogates for print originals. Instead, the above ana- afforded by digitization. It misunderstands the mea- lysis has revealed several ways in which Google aligned sures by which the project itself has defined and eval- GBS with other iterative, algorithmic systems—from uated quality. Finally, it overstates Google’s concern Google Streetview to 23 & Me—created to bring phy- with end users more generally. sical objects, information systems, and even human 14 Big Data & Society 5. While patents provide only generic system descriptions, bodies within the visual and computational logics of the they provide sufficient detail for high-level reverse engi- web. neering of Google’s processes. Journalists’ accounts and Today, books maintain an uneasy parallel existence, output-oriented research provide anecdotal verification caught between the world of the web and the world of (Clements, 2009; Shankland, 2009). Gutenberg. GBS highlights the uneven rates of change and competing logics of these two worlds, the techno- logical and legal frameworks that may produce, orga- References nize, and mediate access to print and digital Baird HS (2003) Digital libraries and document image information differently but that digitization forces analysis. In: Seventh international conference on document together. Google shaped the processes and outputs of analysis and recognition, Los Alamitos, CA, 4–6 August GBS to respect the constraints of copyright law, for 2013, pp.2–14. IEEE. example. Yet it simultaneously sought to circumvent Band J (2009) The long and winding road to the Google print-based permissions management by emphasizing Books settlement. The John Marshall Review of functionality that resonated with its web- and scale- Intellectual Property Law 9(2): 227–329. centric mission but had no direct parallel with print. Bowker GC (2000) Biodiversity datadiversity. Social Studies GBS has provided searchable text access to mil- of Science 30(5): 643–683. Cadwalladr C (2016) How to bump Holocaust deniers off lions of books. The weight of this remarkable Google’s top spot? Pay Google. The Guardian,17 achievement must not be denied or underestimated. December. 
Yet by equating digital access with full-text search, the GBS corpus has created a future for books in which they are defined principally by their textual content. Google's workflows have elided other (historical, artifactual, material) properties of books that, when absent, threaten to disrupt or reframe the relationship between a digitized surrogate and its print original. As print libraries fade into the deep background of our brave new digital world, much has been lost that cannot be regained.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

1. Google Book Search: http://books.google.com.
2. The original five libraries were Harvard, Stanford, the University of Michigan, New York Public Library, and the Bodleian Library at Oxford University.
3. While not the first, GBS was the biggest and most controversial of several large cultural heritage digitization projects undertaken by entities such as Yahoo, Microsoft, Google, and the Internet Archive in the early 2000s (St. Clair, 2008).
4. The Association of American Publishers lawsuit was settled privately in 2011, while in 2015 the Second Circuit Court of Appeals upheld a 2013 lower court judgment rejecting the Authors Guild's copyright infringement claims and affirming Google's scanning as transformative and therefore "fair use."

References

Available at: https://www.theguardian.com/technology/2016/dec/17/holocaust-deniers-google-search-top-spot (accessed 1 February 2017).
Carr R (2005) Oxford-Google Mass-Digitisation Programme. Washington, DC. Available at: http://www.bodley.ox.ac.uk/librarian/rpc/CNIGoogle/CNIGoogle.htm (accessed 1 February 2017).
Ceynowa K (2009) Mass digitization for research and study. IFLA Journal 35(1): 17–24.
Clements M (2009) The secret of Google's book scanning machine revealed. National Public Radio website. Available at: http://www.npr.org/sections/library/2009/04/the_granting_of_patent_7508978.html (accessed 7 February 2017).
Conway P (2013) Preserving imperfection: Assessing the incidence of digital imaging error in HathiTrust. Preservation, Digital Technology and Culture 42(1): 17–30.
Conway P (2015) Digital transformations and the archival nature of surrogates. Archival Science 15(1): 51–69.
Coyle K (2006) Mass digitization of books. The Journal of Academic Librarianship 32(6): 641–645.
Coyle K (2009) Google Books metadata and library functions. Coyle's InFormation. Available at: http://kcoyle.blogspot.com/2009/09/google-books-metadata-and-library.html (accessed 19 April 2017).
Darnton R (2009) The Case for Books: Past, Present, and Future. New York: Public Affairs.
Duguid P (2007) Inheritance and loss? A brief survey of Google Books. First Monday 12(8).
Fleischhauer C (1996) Digital Formats for Content Reproductions. Library of Congress. Available at: http://memory.loc.gov/ammem/formatold.html (accessed 16 June 2017).
Giberne A (1908) The Story of the Sun, Moon, and Stars. Chicago, IL: Thompson & Thomas. Available at: https://books.google.com/books?id=KY8AAAAAMAAJ (accessed 1 February 2017).
Gillespie T (2016) Algorithms. In: Peters B (ed.) Digital Keywords. Princeton, NJ: Princeton University Press, pp. 18–30.
Golumbia D (2009) The Cultural Logic of Computation. Cambridge, MA: Harvard University Press.
Gooding P (2013) Mass digitization and the garbage dump: The conflicting needs of quantitative and qualitative methods. Literary and Linguistic Computing 28(3): 425–431.
Google, Inc. (1999) Company info. Available at: https://web.archive.org/web/19991105194818/http://www.google.com/company.html (accessed 1 February 2017).
Google, Inc. (2004a) What is Google Print? About Google Print (Beta). Available at: https://web.archive.org/web/20041214092414/http://print.google.com/ (accessed 10 February 2017).
Google, Inc. (2004b) What is the library project? Google Print Library Project. Available at: https://web.archive.org/web/*/http://print.google.com/googleprint/library.html (accessed 1 December 2017).
Grant J (2005) Judging book search by its cover. Official Google Blog. Available at: https://googleblog.blogspot.com/2005/11/judging-book-search-by-its-cover.html (accessed 1 February 2017).
Holihan C (2006) Google seeks help with recognition. Business Week Online. Available at: http://www.bloomberg.com/bw/stories/2006-09-06/google-seeks-help-with-recognition (accessed 1 February 2017).
Holley R (2009) How good can it get? Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. D-Lib Magazine 15(3/4).
Howard J (2012) Google begins to scale back its scanning of books from university libraries. The Chronicle of Higher Education.
Huang TS (1974) Digital transmission of halftone pictures. Computer Graphics and Image Processing 3(3): 195–202.
James R (2010) An assessment of the legibility of Google Books. Journal of Access Services 7(4): 223–228.
James R and Weiss A (2012) An assessment of Google Books' metadata. Journal of Library Metadata 12(1): 15–22.
Jeanneney J-N (2008) Google and the Myth of Universal Knowledge. Chicago: University of Chicago Press.
Jones EA (2014) Constructing the universal library. PhD Thesis, University of Washington, USA.
Kirschenbaum MG (2003) The word as image in an age of digital reproduction. In: Hocks ME and Kendrick M (eds) Eloquent Images. Cambridge: MIT Press, pp. 137–156.
Langley A and Bloomberg DS (2007) Google Books: Making the public domain universally accessible. In: Proceedings of SPIE-IS&T Electronic Imaging, San Jose, CA, 26–29 January 2007. International Society for Optics and Photonics.
Le Bourgeois F, Trinh E, Allier B, et al. (2004) Document image analysis solutions for digital libraries. In: First international workshop on document image analysis for libraries, Palo Alto, CA, 23–24 January 2004, pp.2–24. IEEE.
Leetaru K (2008) Mass book digitization: The deeper story of Google Books and the Open Content Alliance. First Monday 13(10).
Lefevere F-M and Saric M (2008) De-warping of scanned images. Patent 7463772, USA.
Lefevere F-M and Saric M (2009) Detection of grooves in scanned images. Patent 7508978, USA.
Lesk M (2003) The price of digitization: New cost models for cultural and educational institutions. Available at: http://www.ninch.org/forum/price.lesk.report.html (accessed 1 February 2017).
Lin XF (2006) Quality assurance in high volume document digitization: a survey. In: Second international conference on document image analysis for libraries, Lyon, France, 27–28 April 2006, pp.311–319. IEEE.
Lynch C (2002) Digital collections, digital libraries & the digitization of cultural heritage information. Microform and Imaging Review 31(4): 131–145.
Madrigal AC (2010) Inside the Google Books Algorithm. The Atlantic. Available at: https://www.theatlantic.com/technology/archive/2010/11/inside-the-google-books-algorithm/65422/ (accessed 3 February 2017).
Michel J-B, Shen YK, Aiden AP, et al. (2011) Quantitative analysis of culture using millions of digitized books. Science 331(6014): 176–182.
Milne R (2008) From "boutique" to mass digitization: The Google Library Project at Oxford. In: Earnshaw R and Vince J (eds) Digital Convergence – Libraries of the Future. London: Springer, pp. 3–9.
Murrell M (2010) Digital+library: Mass book digitization as collective inquiry. New York Law School Law Review 55: 221–249.
New York Public Library (2004) NYPL partners with Google to make books available online. Available at: https://web.archive.org/web/20050923130755/http://nypl.org/press/google.cfm (accessed 3 December 2016).
New York, State of (1862) Code of Procedure of the State of New York. New York: George S. Diossy. Available at: https://books.google.com/books?printsec=frontcover&id=aD0KAAAAIAAJ (accessed 1 February 2017).
Norman Wilson A (2009) Workers leaving the Googleplex. Available at: http://www.andrewnormanwilson.com/WorkersGoogleplex.html (accessed 7 August 2016).
Nunberg G (2009) Google Books: A metadata train wreck. Language Log. Available at: http://languagelog.ldc.upenn.edu/nll/?p=1701 (accessed 4 February 2017).
Orwant J (2009) Re: Google Books: A metadata train wreck. Language Log. Available at: http://languagelog.ldc.upenn.edu/nll/?p=1701#comment-41758 (accessed 24 April 2017).
Palmer B (2005) Deals with Google to accelerate library digitization projects for Stanford, others. Stanford Report, 12 January. Available at: http://news.stanford.edu/news/2005/january12/google-0112.html (accessed 3 February 2017).
Pasquale F (2015) The Black Box Society: The Secret Algorithms That Control Money and Information. Cambridge, MA: Harvard University Press.
Pechenick EA, Danforth CM and Dodds PS (2015) Characterizing the Google Books corpus: Strong limits to inferences of socio-cultural and linguistic evolution. PLOS ONE 10(10): e0137041.
Ratnakar V, Poncin G, Bedger B, et al. (2009) 1.5 million books in your pocket. Google Book Search blog. Available at: http://booksearch.blogspot.com/2009/02/15-million-books-in-your-pocket.html (accessed 29 January 2017).
Roush W (2005) The infinite library. MIT Technology Review 108(5): 54–59.
Samuelson P (2009) Google Book Search and the future of books in cyberspace. Minnesota Law Review 94: 1308–1374.
Schantz HF (1982) The History of OCR. Manchester Center, VT: Recognition Technologies Users Association.
Schmidt E (2005) Books of revelation. Wall Street Journal, 18 October.
Seaver N (2013) Knowing algorithms. Cambridge, MA. Available at: http://nickseaver.net/papers/seaverMiT8.pdf (accessed 1 February 2017).
Shankland S (2009) Patent reveals Google's book-scanning advantage. CNET. Available at: https://www.cnet.com/news/patent-reveals-googles-book-scanning-advantage/ (accessed 30 January 2017).
St. Clair G (2008) The Million Book project in relation to Google. Journal of Library Administration 47(1–2): 151–163.
Striphas T (2015) Algorithmic culture. European Journal of Cultural Studies 18(4–5): 395–412.
Tanner S, Muñoz T and Ros PH (2009) Measuring mass text digitization quality and usefulness. D-Lib Magazine 15(7/8): 1082–9873.
Terras MM (2008) Digital Images for the Information Professional. Burlington, VT: Ashgate Publishing.
Thomas I and Shakespeare W (1945) Mother Goose's melody. New York: G. Melcher. Available at: https://books.google.com/books?id=OG7YAAAAMAAJ (accessed 7 February 2017).
Townsend RB (2007) Google Books: What's not to like? American Historical Association blog. Available at: http://blog.historians.org/2007/04/google-books-whats-not-to-like/ (accessed 7 February 2017).
University of Michigan and Google, Inc. (2005) UM-Google Cooperative Agreement. Available at: www.lib.umich.edu/mdp/um-google-cooperative-agreement.pdf (accessed 7 February 2017).
University of Michigan Library (2005) UM Library/Google digitization partnership FAQ. Available at: http://www.lib.umich.edu/files/services/mdp/faq.pdf (accessed 7 February 2017).
US Copyright Office (2016) Fair use. Available at: http://copyright.gov/fair-use/more-info.html (accessed 7 February 2017).
Vaidhyanathan S (2012) The Googlization of Everything. Berkeley, CA: University of California Press.
Vincent L (2007) Google Book Search: Document understanding on a massive scale. In: Ninth international conference on document analysis and recognition, Paraná, Brazil, 23–26 September 2007, pp.819–823. IEEE.
Von Ahn L (2005) Human computation. PhD Thesis, Carnegie Mellon University, Pittsburgh, PA.
von Ahn L and Cathcart W (2009) Teaching computers to read: Google acquires reCAPTCHA. Official Google Blog. Available at: https://googleblog.blogspot.com/2009/09/teaching-computers-to-read-google.html (accessed 30 January 2017).
Weiss A and James R (2015) Comparing the access to and legibility of Japanese language texts in massive digital libraries. In: International conference on culture and computing, Kyoto, Japan, 17–19 October 2015, pp.57–63. IEEE.
Zhang Y and Kangxi Emperor of China (1882) Pei wen yun fu. Available at: http://hdl.handle.net/2027/mdp.39015081214945 (accessed 7 February 2017).
