12 (1) 2022

A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory


Author - Affiliation:
Sreyteav Sry - Paragon International University, Phnom Penh, Cambodia
Amrudee Sukpan Nguyen - Computer Science Department, Paragon International University, Phnom Penh, Cambodia
Corresponding author: Sreyteav Sry - ssry@paragoniu.edu.kh
Submitted: 28-03-2022
Accepted: 18-04-2022
Published: 20-04-2022

Abstract
Large contiguous blocks of unsegmented Khmer text cause major problems for natural language processing applications such as machine translation, speech synthesis, and information extraction. Word segmentation and part-of-speech tagging are therefore two important preliminary tasks. Because Khmer does not always use explicit separators between words, the notion of a word is not naturally defined. Tokenization and part-of-speech tagging of such languages are consequently inseparable, because the definitions and principles adopted for one task unavoidably affect the other. This study reviews the approaches used for Khmer word segmentation and part-of-speech tagging and describes an experimental study using a single long short-term memory network. A dataset from the Asian Language Treebank is used to train and test the model. The preliminary experimental model achieved an accuracy of 95%. However, further testing is needed to evaluate the model and compare it with other models in order to select the one with the highest accuracy.

Keywords
Word Segmentation, Part-of-speech tagging, Khmer Natural Language Processing, LSTM

Cite this paper as:

Sry, S., & Nguyen, A. S. (2022). A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory. Ho Chi Minh City Open University Journal of Science – Engineering and Technology, 12(1), 23-34. doi:10.46223/HCMCOUJS.tech.en.12.1.2219.2022


Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.