Title: ICDAR2017 Competition on Post-OCR Text Correction
Publication type: Conference paper
Publication year: 2018
Authors: Jean-Philippe Moreux, Guillaume Chiron, Antoine Doucet, Mikael Coustady
Conference name: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR)
Meeting date: 2017/11
Organizer: ICDAR
Conference location: Kyoto, Japan
Keywords: OCR; OCR errors
Abstract

This paper describes the ICDAR2017 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over the past 30 years but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two independent tasks: 1) error detection and 2) error correction. An original dataset of 12M OCR-ed symbols along with an aligned ground truth was provided to the participants with 80% of the dataset dedicated to the training and 20% to the evaluation. Different sources were aggregated and namely contain newspapers and monographs covering 2 languages (English and French). 11 teams submitted results, while the difficulty of the task was underlined by the fact that only half of the submitted methods were able to denoise the evaluation dataset on average. In any case, this competition, which counted 35 registrations, illustrates the strong interest of the community in this essential problem, which is key to any digitization process involving textual data.
