
12 (1) 2022

Khmer printed character recognition using attention-based Seq2Seq network


Author - Affiliation:
Rina Buoy - Techo Startup Center, Phnom Penh, Cambodia
Nguonly Taing - Techo Startup Center, Phnom Penh, Cambodia
Sovisal Chenda - Techo Startup Center, Phnom Penh, Cambodia
Sokchea Kor - Royal University of Phnom Penh, Phnom Penh, Cambodia
Corresponding author: Rina Buoy - rinabuoy13@gmail.com
Submitted: 28-03-2022
Accepted: 18-04-2022
Published: 20-04-2022

Abstract
This paper presents an end-to-end deep convolutional recurrent neural network solution for the Khmer optical character recognition (OCR) task. The proposed solution uses a sequence-to-sequence (Seq2Seq) architecture with an attention mechanism. The encoder extracts visual features from an input text-line image via layers of convolutional blocks and a layer of gated recurrent units (GRU). The features are encoded into a single context vector and a sequence of hidden states, which are fed to the decoder for decoding one character at a time until a special end-of-sentence (EOS) token is reached. The attention mechanism allows the decoder network to adaptively select relevant parts of the input image while predicting a target character. The Seq2Seq Khmer OCR network is trained on a large collection of computer-generated text-line images covering multiple common Khmer fonts. Complex data augmentation is applied to both the training and validation datasets. The proposed model outperforms the state-of-the-art Tesseract OCR engine for the Khmer language on a validation set of 6,400 augmented images, achieving a character error rate (CER) of 0.7% versus 35.9%.
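The attention step described above can be illustrated with a minimal NumPy sketch of Bahdanau-style additive attention: given the current decoder state and the encoder's sequence of hidden states, the decoder scores every encoder step, normalizes the scores into weights, and takes a weighted sum as the context vector. All dimensions and weight matrices below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(dec_state, enc_states, W_dec, W_enc, v):
    """Bahdanau-style additive attention (illustrative only).
    dec_state:  (d,)   current decoder hidden state
    enc_states: (T, h) encoder hidden states over T image-width steps
    W_dec: (a, d), W_enc: (a, h), v: (a,)  learned projections (random here)
    Returns the attention weights (T,) and the context vector (h,).
    """
    # Score each encoder step against the decoder state, then normalize.
    scores = np.tanh(enc_states @ W_enc.T + dec_state @ W_dec.T) @ v  # (T,)
    weights = softmax(scores)        # one weight per encoder step, sums to 1
    context = weights @ enc_states   # weighted sum of encoder states
    return weights, context

# Hypothetical shapes: 5 encoder steps; hidden sizes chosen arbitrarily.
rng = np.random.default_rng(0)
T, h, d, a = 5, 8, 8, 16
enc = rng.normal(size=(T, h))
dec = rng.normal(size=(d,))
weights, context = additive_attention(
    dec, enc,
    rng.normal(size=(a, d)), rng.normal(size=(a, h)), rng.normal(size=(a,)))
```

At each decoding step the decoder would recompute `weights` with its new hidden state, which is what lets it attend to a different slice of the text-line image for each predicted character.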

Keywords
Khmer, Optical Character Recognition, Deep Learning, Neural Network


Cite this paper as:

Buoy, R., Taing, N., Chenda, S., & Kor, S. (2022). Khmer printed character recognition using attention-based Seq2Seq network. Ho Chi Minh City Open University Journal of Science – Engineering and Technology, 12(1), 3-16. doi:10.46223/HCMCOUJS.tech.en.12.1.2217.2022


References

Annanurov, B., & Noor, N. M. (2018). Khmer handwritten text recognition with convolution neural networks. ARPN Journal of Engineering and Applied Sciences, 13(22), 8828-8833.


Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. Retrieved October 10, 2021, from https://arxiv.org/pdf/1409.0473.pdf


Buoy, R., Taing, N., & Kor, S. (2020). Khmer word segmentation using BiLSTM networks. Paper presented at the 4th Regional Conference on OCR and NLP for ASEAN Languages (ONA 2020), Phnom Penh, Cambodia.


Buoy, R., Taing, N., & Kor, S. (2021). Joint Khmer word segmentation and part-of-speech tagging using deep learning. Retrieved October 10, 2021, from https://arxiv.org/ftp/arxiv/papers/2103/2103.16801.pdf


Chey, C., Kumhom, P., & Chamnongthai, K. (2005). Khmer printed character recognition by using wavelet descriptors. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 14(3), 337-350.


Ding, C., Utiyama, M., & Sumita, E. (2018). NOVA: A feasible and flexible annotation system for joint tokenization and part-of-speech tagging. ACM Transactions on Asian and Low-Resource Language Information Processing, 18(2). doi:10.1145/3276773


Lenleng, I., & Muaz, A. (2015). Khmer Optical Character Recognition (OCR). PAN Localization Cambodia, 1.


Liebl, B., & Burghardt, M. (2020). On the accuracy of CRNNs for line-based OCR: A multi-parameter evaluation. Retrieved October 10, 2021, from https://arxiv.org/pdf/2008.02777.pdf


Memon, J., Sami, M., Khan, R. A., & Uddin, M. (2020). Handwritten Optical Character Recognition (OCR): A comprehensive Systematic Literature Review (SLR). IEEE Access, 8, 142642-142668.


Meng, H., & Morariu, D. (2014). Khmer character recognition using artificial neural network. Retrieved October 10, 2021, from http://www.apsipa.org/proceedings_2014/Data/paper/1408.pdf


Namysl, M., & Konya, I. (2019). Efficient, lexicon-free OCR using deep learning. Paper presented at the 2019 International Conference on Document Analysis and Recognition (ICDAR), Sydney, Australia.


Safir, F. B., Ohi, A. Q., Mridha, M. F., Monowar, M. M., & Hamid, M. A. (2021). End-to-end optical character recognition for bengali handwritten words. Retrieved October 10, 2021, from https://arxiv.org/pdf/2105.04020.pdf


Sahu, D. K., & Sukhwani, M. (2015). Sequence to sequence learning for optical character recognition. Retrieved October 10, 2021, from https://arxiv.org/pdf/1511.04176.pdf


Sok, M. (2016). Phonological principles and automatic phonemic and phonetic transcription of Khmer words. Chiang Mai, Thailand: Payap University.


Sok, P., & Taing, N. (2014). Support Vector Machine (SVM)-based classifier for Khmer character-set recognition. Retrieved October 10, 2021, from http://www.apsipa.org/proceedings_2014/data/paper/1407.pdf


Sokphyrum, K., Samak, S., & Sola, J. (2019). Khmer OCR fine-tune engine for Unicode and legacy fonts using Tesseract 4.0 with deep neural network. Optical Character Recognition for Complex Scripts and Natural Language Processing for ASEAN Languages, 1.


Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Retrieved October 10, 2021, from https://proceedings.neurips.cc/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf


Valy, D., Verleysen, M., Chhun, S., & Burie, J.-C. (2017). A new Khmer palm leaf manuscript dataset for document analysis and recognition: SleukRith set. Proceedings of the 4th International Workshop on Historical Document Imaging and Processing, 1-6. doi:10.1145/3151509.3151510



Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.