Are Searches in OCR-generated Archives Trustworthy?

Publisher: De Gruyter
Copyright: © 2023 Jørgen Burchardt, published by De Gruyter
ISSN: 2196-6842
eISSN: 2196-6842
DOI: 10.1515/jbwg-2023-0003

Abstract

Digitised archives are revolutionary research tools that generate, in a few seconds, results that earlier often took years to obtain. But do they return every result for the terms searched for? The accuracy of searches was tested by performing sample searches of leading newspaper databases. The test revealed several weaknesses in the search process, including an average 18 percent error rate for single words in body text and far higher error rates for advertisements. Such high error rates call for a critical look at the 20-year-old sector. Although these errors can be reduced through re-digitisation, improved OCR engines, and new search algorithms, searches will nevertheless return manipulated results. In response, and to identify bias and skewed representation, database owners need to provide thorough metadata to enable source criticism.
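
To give a feel for what the reported 18 percent single-word error rate could mean in practice, here is a minimal sketch. It is not the article's method. It assumes that the error rate can be read as the probability that one occurrence of a word is not retrievable, and that errors on different words are independent; both are simplifying assumptions, and the function names are hypothetical.

```python
def expected_recall(word_error_rate: float, words_in_query: int = 1) -> float:
    """Share of true occurrences a search would find.

    Assumption (not from the article): each query word is misrecognised
    independently with probability `word_error_rate`, and a hit requires
    every query word to be recognised correctly.
    """
    if not 0.0 <= word_error_rate <= 1.0:
        raise ValueError("word_error_rate must be between 0 and 1")
    if words_in_query < 1:
        raise ValueError("words_in_query must be at least 1")
    return (1.0 - word_error_rate) ** words_in_query


def expected_hits(true_occurrences: int, word_error_rate: float,
                  words_in_query: int = 1) -> float:
    """Expected number of occurrences returned out of `true_occurrences`."""
    return true_occurrences * expected_recall(word_error_rate, words_in_query)


if __name__ == "__main__":
    # 0.18 is the average single-word error rate reported in the abstract.
    for n_words in (1, 2, 3):
        share = expected_recall(0.18, n_words)
        print(f"{n_words}-word query: about {share:.1%} of occurrences found")
```

Under these assumptions a single-word search would find roughly 82 percent of the true occurrences, and the share falls further for multi-word phrases. Real OCR errors are unlikely to be independent, so the figures are only indicative.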

Journal

Jahrbuch für Wirtschaftsgeschichte / Economic History Yearbook, De Gruyter

Published: May 1, 2023

Keywords: optical character recognition (OCR); historical archive (Historische Archive); source criticism (Quellenkritik); research methodology (Forschungsmethodik); C 82
