An Application of Malay Short-Form Word Conversion Using Levenshtein Distance

  • Azilawati Azizan, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, Perak, Malaysia
  • NurAine Saidin, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, Perak, Malaysia
  • Nurkhairizan Khairudin, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Perak Branch Tapah Campus, Perak, Malaysia
  • Rohana Ismail, Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Kampus Besut, 22200 Besut Terengganu, Malaysia

Abstract

Formerly, short-form words were widely used in the field of journalism. Nowadays, however, they are used by many people, especially in online communication. These short-form words cause problems in data mining, especially in online text processing, because they lead to inaccurate text mining results. Despite this, only a few works have investigated Malay short-form word identification and conversion. This work therefore aims to develop an application that can identify Malay short-form words and convert them into their full words. To develop this application, the short-form rules need to be carefully examined. The formal rules from Dewan Bahasa & Pustaka (DBP) are used as the primary reference for generating the short-form word identification algorithm. For the conversion algorithm, Levenshtein Distance (LD) is used to measure similarity, and a rule-based technique is used to complement LD. As a result, 70.27% of the Malay short-form words were correctly converted into their full words. The conversion rate is quite promising, and this work can be further strengthened by incorporating more rules into the algorithm.
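
To illustrate the similarity measure named in the abstract, the following is a minimal sketch of Levenshtein Distance used to pick the closest full word for a short form. It is not the authors' implementation: the function names `levenshtein` and `expand`, and the four-word `LEXICON`, are assumptions made only for illustration, and the DBP-based rules that the paper combines with LD are not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


# Hypothetical lexicon of full Malay words (an assumption for this sketch;
# a real application would use a much larger word list).
LEXICON = ["yang", "tidak", "dengan", "untuk"]


def expand(short_form: str, lexicon=LEXICON) -> str:
    """Return the lexicon word with the smallest edit distance to short_form.
    Ties are resolved by lexicon order, which is an arbitrary choice."""
    return min(lexicon, key=lambda word: levenshtein(short_form, word))


if __name__ == "__main__":
    for short in ["yg", "tdk", "utk"]:
        print(short, "->", expand(short))
```

Distance alone can produce ties or wrong matches when several full words are equally close to a short form, which is one plausible reason for complementing LD with rules as the abstract describes.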

Published
2020-11-21
How to Cite
AZIZAN, Azilawati et al. An Application of Malay Short-Form Word Conversion Using Levenshtein Distance. Mathematical Sciences and Informatics Journal, [S.l.], v. 1, n. 2, p. 34-42, nov. 2020. ISSN 2735-0703. Available at: <https://myjms.mohe.gov.my/index.php/mij/article/view/14183>. Date accessed: 07 dec. 2022. doi: https://doi.org/10.24191/mij.v1i2.14183.
Section
Articles