Early primary-source electronic text collections were produced by manual input and proofreading, which yielded accurate but costly and slow-growing content. Keyed content is still found in many databases, but today’s abundant supply of online historical primary-source data is largely the result of computerized optical character recognition (OCR), with some quality control but no painstaking proofreading (hence the name “dirty OCR”). More recently, the growth of archival manuscript material online has introduced a large new body of text that can be made searchable only by keyboarding and is usually presented as page images only. Thus, we are confronted today with a vast but varied body of material, the predominant form probably being “dirty OCR” presented on-screen as searchable text hidden behind a page image. Since the underlying text is not completely accurate, providers offer multiple tools, including fuzzy searching, rich metadata, and detailed user guides, to enhance retrieval.
Approaches to helping users navigate, interpret, and access the frequently enormous results of a search also vary significantly. Across nearly all providers, though, the underlying “dirty OCR” is rarely displayed or downloadable, probably to avoid shocking the uninitiated with its unruly appearance. This lack of clarity and consistency presents a serious challenge to students and researchers alike, whether a naïve user who is unaware of differences among the texts being searched (and hence unable to formulate the best search strategy) or a cutting-edge digital humanities scholar who wants to combine and manipulate resources from different databases in an advanced project.
The speakers will address these issues in greater detail, survey the state of play among today’s major online primary-source providers, and suggest a set of best practices. Attendees will be asked to share their experiences teaching and using historical-text databases and to provide their input on the proposed best-practice model.