TY - CONF
T1 - OCR performance prediction using cross-OCR alignment
T2 - ICDAR
Y1 - 2015
A1 - Ahmed Ben Salah
A1 - Jean-Philippe Moreux
A1 - Nicolas Ragot
A1 - Thierry Paquet
JA - ICDAR
ER -
TY - CONF
T1 - OCR performance prediction using cross-OCR alignment
T2 - 13th International Conference on Document Analysis and Recognition (ICDAR 2015)
Y1 - 2015
A1 - Ahmed Ben Salah
A1 - Jean-Philippe Moreux
A1 - Nicolas Ragot
A1 - Thierry Paquet
KW - Historical documents digitization
KW - Mass digitization projects
KW - OCR quality assessment
KW - Support Vector Regression (SVR)
AB -

Since 2006, the National Library of France (BnF) has run many mass digitization projects on its collections. Digital documents on Gallica (the digital library of the BnF) are indexed through their textual content, which is obtained from service providers that use Optical Character Recognition (OCR) software. Modern OCR technologies perform well on modern documents produced with a uniform layout and known fonts. For old documents, however, OCR results are of lower quality. OCR quality assessment is a real challenge for the BnF. On the one hand, because OCR processing is a sequential pipeline, identifying the sources of OCR errors is intractable. On the other hand, OCR outputs report no quality information beyond word confidence. In this paper, we present a study on OCR performance estimation that aims to control the quality of the word transcriptions produced by OCR. This quality assessment process has to operate without any comparison with ground-truth data. To this end, our methodology relies on cross-alignment of the OCR results with those of a secondary OCR, called the reference OCR. This secondary OCR provides uncertain but useful information that is used as an uncertain ground truth. OCR performance is estimated using support vector regression. This predictor uses global features computed on the cross-alignment results. The reported experiments show that our estimate describes the quality of OCR outputs more faithfully than the average word confidence scores computed by the OCR. The proposed methodology can easily be adapted to various corpora by tuning the system on a training dataset of documents with properties similar to those of the documents to be processed.

JA - 13th International Conference on Document Analysis and Recognition (ICDAR 2015)
CY - Nancy, Palais des congrès
UR - https://hal.archives-ouvertes.fr/hal-01191701
ER -
TY - CONF
T1 - The digital documents quality control workflow at the BnF (operation, issue, improvement)
T2 - Archiving
Y1 - 2013
A1 - Ahmed Ben Salah
A1 - Jean-Philippe Moreux
A1 - Laurent Duplouy
JA - Archiving
ER -
TY - CHAP
T1 - Aide à la gestion des processus de numérisation en vue de l'OCRisation des ouvrages
T2 - CORIA : Actes de la Journée Jeunes Chercheurs 2012
Y1 - 2012
A1 - Ahmed Ben Salah
JA - CORIA : Actes de la Journée Jeunes Chercheurs 2012
CY - Bordeaux
UR - http://www.asso-aria.org/coria/2012/455.pdf
ER -
TY - CHAP
T1 - Prediction of Selection Decision of Document Using Bibliographic Data at the National Library of France (BnF)
T2 - Archiving 2012, Copenhague, 12-15 juin 2012
Y1 - 2012
A1 - Ahmed Ben Salah
A1 - Geneviève Cron
A1 - Nicolas Ragot
A1 - Thierry Paquet
AB -

Document selection is a very important step in mass digitization projects. This is especially true at the BnF, where digitization may or may not include OCRization, depending on the expected OCR results. Consequently, the selection task is very complex and time-consuming, owing to the number of documents to be processed and the diversity of the selection criteria to consider. To improve and simplify this task through automation, we studied the relationship between bibliographic data and document selection decisions. We used two statistical analyses: a factorial correspondence analysis and a multiple correspondence analysis. Our analysis shows that, for example, documents in format "4 or GR FOL" and published "between 1961 and 1990" in Morocco are more likely to be "Selected". In contrast, documents in format "16 or 8" and published "between 1871 and 1800" in English or Spanish are more likely to be "Not Selected".

JA - Archiving 2012, Copenhague, 12-15 juin 2012
PB - Society for Imaging Sciences and Technology
CY - Copenhague
SN - 978-0-89208-300-8
UR - http://www.imaging.org/IST/store/epub.cfm?abstrid=45326
ER -