Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse

Ahmed Ben Salah 1, 2
1 DocApp - LITIS - Equipe Apprentissage
LITIS - Laboratoire d'Informatique, de Traitement de l'Information et des Systèmes
Abstract : This work focuses on assessing the character recognition results produced automatically by optical character recognition (OCR) software in mass digitization projects. The goal is to design a global control system robust enough to handle the BnF document collection, which includes old documents that are difficult for OCR to process. We designed a missed-word detection system that finds words omitted from OCR results, and a word recognition rate estimator that assesses the quality of the OCR's word recognition. We created two kinds of descriptors to characterize OCR outputs: image descriptors, which characterize page segmentation results, and cross-alignment descriptors, which characterize the quality of word recognition results. Furthermore, we adapted the learning process to build adaptive decision and prediction systems. We evaluated our control systems on real images selected at random from the BnF collection. The missed-word detection system detects 84.15% of the words omitted by the OCR, with a precision of 94.73%. The experiments also showed that 80% of the documents with a word recognition rate below 98% are detected with an accuracy of 92%. The system can also automatically detect 45% of the documents with a recognition rate below 70% with greater than 92% accuracy.
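The abstract reports the missed-word detector with two figures: the share of omitted words it finds and its precision. As a minimal sketch, assuming the usual definitions (detection rate read as recall, precision as the fraction of flagged words that were really omitted), the two scores could be computed as below. The function name, the word identifiers, and the toy data are hypothetical and are not taken from the thesis.

```python
def detection_scores(omitted, flagged):
    """Return (recall, precision) for a missed-word detector.

    omitted: set of word ids actually omitted by the OCR (ground truth)
    flagged: set of word ids the detector reports as omitted
    Both arguments are illustrative assumptions, not the thesis's data model.
    """
    true_positives = len(omitted & flagged)
    # Recall: share of truly omitted words that the detector found.
    recall = true_positives / len(omitted) if omitted else 0.0
    # Precision: share of flagged words that were truly omitted.
    precision = true_positives / len(flagged) if flagged else 0.0
    return recall, precision


# Hypothetical toy data: 4 words omitted by the OCR, 5 words flagged.
omitted = {1, 2, 3, 4}
flagged = {1, 2, 3, 9, 10}
recall, precision = detection_scores(omitted, flagged)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Under this reading, the reported 84.15% would correspond to `recall` and the reported 94.73% to `precision`, each measured over the evaluation images.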
Document type :
Theses

Cited literature : 120 references

https://hal-bnf.archives-ouvertes.fr/tel-01164698
Contributor : Ahmed Ben Salah
Submitted on : Monday, June 22, 2015 - 6:42:00 PM
Last modification on : Tuesday, February 5, 2019 - 11:44:33 AM
Long-term archiving on : Tuesday, September 15, 2015 - 5:55:48 PM

Identifiers

  • HAL Id : tel-01164698, version 1

Citation

Ahmed Ben Salah. Maîtrise de la qualité des transcriptions numériques dans les projets de numérisation de masse. Traitement des images [eess.IV]. Université de Rouen, 2014. Français. ⟨tel-01164698⟩
