Title: Impact of OCR errors on the use of digital libraries - Towards a better access to information
Publication type: Conference paper
Year of publication: 2017
Authors: Guillaume Chiron, Antoine Doucet, Mickael Coustaty, Jean-Philippe Moreux, Muriel Visani
Conference name: JCDL'17, ACM/IEEE-CS Joint Conference on Digital Libraries, June 2017, Toronto, Ontario, Canada
Meeting date: 2017/06
Keywords: digital libraries; document retrieval; indexation bias; OCR errors; search
Abstract:

Digital collections are increasingly used for a variety of purposes. In Europe alone, we can conservatively estimate that tens of thousands of users consult digital libraries daily. These uses are often motivated by qualitative and quantitative research. However, caution is advised, as most digitized documents are indexed through their OCRed version, which is far from perfect, especially for ancient documents.

In this paper, we aim to estimate the impact of OCR errors on the use of a major online platform: the Gallica digital library from the National Library of France. It accounts for more than 100M OCRed documents and receives 80M search queries every year.

In this context, we introduce two main contributions. First, an original corpus of OCRed documents composed of 12M characters, along with the corresponding gold standard, is presented and provided, with an equal share of English- and French-written documents. Next, statistics on OCR errors have been computed thanks to a novel alignment method introduced in this paper. Making use of all the user queries submitted to the Gallica portal over 4 months, we take advantage of our error model to propose an indicator for predicting the relative risk that queried terms mismatch targeted resources due to OCR errors, underlining the critical extent to which OCR quality impacts digital library access.
