This paper describes the ICDAR2017 competition on post-OCR text correction and presents the methods submitted by the participants. OCR has been an active research field for more than 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two independent tasks: 1) error detection and 2) error correction. An original dataset of 12M OCR-ed symbols, along with an aligned ground truth, was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. The dataset aggregates several sources, notably newspapers and monographs, in two languages (English and French). Eleven teams submitted results, and the difficulty of the task was underlined by the fact that only half of the submitted methods were able to denoise the evaluation dataset on average. In any case, this competition, which counted 35 registrations, illustrates the strong interest of the community in this essential problem, which is key to any digitization process involving textual data.
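To make the notion of "denoising" concrete, the following minimal sketch compares the edit distance between the OCR output and the ground truth with the distance between a corrected text and the ground truth. It is an illustration only: the use of Levenshtein distance as the measure, the function names `levenshtein` and `distance_reduction`, and the sample strings are all assumptions and are not taken from the competition's evaluation protocol.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def distance_reduction(ocr: str, corrected: str, truth: str) -> int:
    """Hypothetical helper: positive if the correction moved the text closer
    to the ground truth, negative if it introduced more errors than it fixed."""
    return levenshtein(ocr, truth) - levenshtein(corrected, truth)


# Illustrative strings (assumed, not from the competition dataset).
truth = "the quick brown fox"
ocr = "tbe qu1ck brown f0x"
helpful = "the quick brown f0x"  # fixes two of the three OCR errors
harmful = "tbe qu1ck brawn f0x"  # adds a new error

print(distance_reduction(ocr, helpful, truth))
print(distance_reduction(ocr, harmful, truth))
```

Under this assumed measure, a method "denoises on average" when the reduction, aggregated over the evaluation set, is positive; a method that rewrites correct words can end up with a negative value even if it fixes some genuine errors.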