Comparison Analysis of Naive Bayes and K-Nearest Neighbor Algorithms in Classifying Language Styles in Indonesian Texts

Fika Tsalsabila Tinanda; Herry Sujaini; Helfi Nasution

doi:10.61628/jsce.v6i4.2158

Fika Tsalsabila Tinanda, Universitas Tanjungpura
Herry Sujaini
Helfi Nasution

Abstract

In the digital era, Indonesian-language texts have rapidly proliferated across social media, online news, blogs, and digital documents, often containing various figurative language styles such as personification, metaphor, hyperbole, euphemism, and irony. Manual identification of these language styles is inefficient on a large scale, especially when class distribution is imbalanced. This study aims to compare the performance of the Naïve Bayes and K-Nearest Neighbor (KNN) algorithms in classifying figurative language styles in Indonesian texts, and to evaluate the impact of applying the Synthetic Minority Over-sampling Technique (SMOTE) and hyperparameter tuning on model accuracy. The dataset consists of 5,155 original samples and 6,240 samples after SMOTE application, with an 80:20 train-test split. Evaluation was conducted under four scenarios: without SMOTE and without tuning, with SMOTE without tuning, without SMOTE with tuning, and with both SMOTE and tuning. The results show that Naïve Bayes demonstrated stable performance with an accuracy of up to 93.19%, while KNN achieved its highest accuracy of 93.43% after applying SMOTE and tuning. The implementation of SMOTE and hyperparameter tuning proved effective in improving accuracy, particularly for KNN. This study highlights the significant contribution of data balancing and parameter optimization in enhancing the automatic classification of figurative language styles in Indonesian texts.
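The abstract does not say how SMOTE was implemented, so the following is only a minimal sketch of the interpolation idea behind the technique, written in plain Python. The function name `smote_oversample`, the parameter names, and the toy 2-D vectors (stand-ins for real text features) are all hypothetical and are not taken from the study.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative sketch only: needs at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy 2-D feature vectors for one under-represented class (hypothetical data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_oversample(minority, n_new=6, k=2)
balanced = minority + new_points
print(len(balanced))
```

Each synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbours, which is why the technique adds new samples rather than duplicating existing ones. In an evaluation like the four scenarios described above, oversampling is commonly applied to the training portion only, but whether the study did so is not stated in the abstract.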
