HMM-based Script Identification for OCR
File size: 169k
Source code price: 10 gold coins
Resource description: While current OCR systems are able to recognize text in an increasing number of scripts and languages, they typically still need to be told in advance what those scripts and languages are. We propose an approach that repurposes the same HMM-based system used for OCR for the task of script/language ID, by replacing character labels with script class labels. We apply it in a multi-pass overall OCR process which achieves “universal” OCR over 54 tested languages in 18 distinct scripts, over a wide variety of typefaces in each. For comparison we also consider a brute-force approach, wherein a single HMM-based OCR system is trained to recognize all considered scripts. Results are presented on a large and diverse evaluation set extracted from book images, both for script identification accuracy and for overall OCR accuracy. On this evaluation data, the script ID system yielded a script ID error rate of 1.73% for 18 distinct scripts. The end-to-end OCR system with the script ID system achieved a character error rate of 4.05%, an increase of 0.77% over the case where the languages are known a priori.
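The central idea in the description, replacing character labels with script class labels, can be illustrated with a small sketch. It is not the authors' implementation: the helper names `script_label`, `relabel_transcript`, and `line_script` are hypothetical, and using the first word of a character's Unicode name as its script class is an assumption made only for illustration. The sketch relabels a transcript and then takes a majority vote per line, which is one plausible way a first pass could choose the script-specific recognizer for a later pass.

```python
import unicodedata
from collections import Counter
from typing import List


def script_label(ch: str) -> str:
    """Return a coarse script class label for one character.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC") stands in for the script class.
    """
    if not ch.isalpha():
        # Digits, punctuation, and spaces carry no script identity here.
        return "COMMON"
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else "UNKNOWN"


def relabel_transcript(text: str) -> List[str]:
    """Replace every character label with its script class label."""
    return [script_label(ch) for ch in text]


def line_script(labels: List[str]) -> str:
    """Pick the most frequent script label in a line, ignoring COMMON."""
    counts = Counter(label for label in labels if label != "COMMON")
    return counts.most_common(1)[0][0] if counts else "COMMON"


if __name__ == "__main__":
    labels = relabel_transcript("Hello мир")
    print(labels)
    print(line_script(labels))
```

In an HMM-based system, label sequences of this kind would serve as training targets in place of the character transcripts; the majority vote is only a stand-in for whatever decision rule the actual system uses.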