Since 2006, the National Library of France (BnF) has run many mass digitization projects on its collections. Digital documents on Gallica (the BnF's digital library) are indexed through their textual content, which is produced by service providers using Optical Character Recognition (OCR) software. Modern OCR technologies perform well on modern documents printed with a uniform layout and known fonts. On old documents, however, OCR results are of lower quality. Assessing OCR quality is therefore a real challenge for the BnF. On the one hand, because OCR processing is organized as a sequence of stages, identifying the sources of OCR errors is intractable. On the other hand, OCR outputs report no quality information beyond word confidence. In this paper, we present a study on OCR performance estimation that aims to control the quality of the word transcriptions produced by OCR. This quality assessment process has to operate without any comparison against ground-truth data. To this end, our methodology relies on cross-alignment of the OCR results with those of a secondary OCR, called the reference OCR. This secondary OCR provides uncertain but useful information that serves as an uncertain ground truth. OCR performance is then estimated using support vector regression, with a predictor that takes global features computed on the cross-alignment results as input. The reported experiments show that our estimate describes the quality of OCR outputs more faithfully than the average word confidence scores computed by the OCR itself. The proposed methodology can easily be adapted to various corpora by tuning the system on a training dataset of documents with properties similar to those of the documents to be processed.
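To make the cross-alignment step more concrete, the following is a minimal sketch, not the system described in the paper. It assumes that each OCR output is available as a list of word strings and uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner. The function names and the particular features (`agreement_rate`, `substitution_rate`, and so on) are hypothetical illustrations of "global features computed on the cross-alignment results"; the abstract does not specify the actual feature set.

```python
from difflib import SequenceMatcher


def align_words(primary, reference):
    """Count alignment operations between two word lists.

    Assumption: difflib's matcher stands in for the paper's
    cross-alignment procedure, which the abstract does not detail.
    """
    matcher = SequenceMatcher(a=primary, b=reference, autojunk=False)
    counts = {"equal": 0, "replace": 0, "delete": 0, "insert": 0}
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["equal"] += i2 - i1
        elif tag == "replace":
            # Words present in both outputs but transcribed differently.
            counts["replace"] += max(i2 - i1, j2 - j1)
        elif tag == "delete":
            # Words found only in the primary OCR output.
            counts["delete"] += i2 - i1
        else:
            # tag == "insert": words found only in the reference OCR output.
            counts["insert"] += j2 - j1
    return counts


def global_features(primary, reference, confidences):
    """Build hypothetical document-level features from the alignment."""
    counts = align_words(primary, reference)
    n_words = max(len(primary), 1)  # avoid division by zero on empty pages
    return {
        "agreement_rate": counts["equal"] / n_words,
        "substitution_rate": counts["replace"] / n_words,
        "primary_only_rate": counts["delete"] / n_words,
        "reference_only_rate": counts["insert"] / n_words,
        "mean_confidence": sum(confidences) / max(len(confidences), 1),
    }


if __name__ == "__main__":
    primary = ["the", "quick", "brovvn", "fox"]    # OCR under evaluation
    reference = ["the", "quick", "brown", "fox"]   # reference OCR
    confidences = [0.9, 0.8, 0.4, 0.9]             # per-word confidences
    print(global_features(primary, reference, confidences))
```

Feature vectors of this kind, computed per document, could then be passed to a support vector regressor (for example scikit-learn's `SVR`) trained on documents whose true word accuracy is known. That training step is omitted here because it depends on a labelled dataset.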