Class-Level Behavior Analysis under Metric Disagreement in Imbalanced Multi-Label Indonesian Emotion Classification

Authors

  • Jahda Rusti Putri, Sriwijaya University, Indonesia
  • Ermatita, Sriwijaya University, Indonesia
  • Abdiansah, Sriwijaya University, Indonesia
DOI:

https://doi.org/10.63158/journalisi.v8i3.1664

Keywords:

evaluation metrics, metric divergence, multi-label classification, class imbalance, emotion classification

Abstract

This study aims to analyze class-level model behavior under metric disagreement in imbalanced multi-label Indonesian emotion classification, using the divergence between Macro F1 and Micro F1 as a diagnostic signal rather than a mere performance indicator. A machine-translated Indonesian version of the GoEmotions dataset, comprising approximately 58,000 samples across 28 fine-grained emotion categories, is used as the experimental setting. The translated dataset was not manually revalidated, and findings are scoped to this translated GoEmotions setting. Two transformer-based models are evaluated: IndoBERT, a monolingual Indonesian model, and DistilBERT, a multilingual model, both fine-tuned with class-specific threshold optimization. The results reveal opposing divergence patterns: IndoBERT achieves higher Micro F1 than Macro F1, indicating performance concentrated on high-frequency classes, while DistilBERT exhibits the reverse pattern, suggesting broader but less precise label activation. Per-class analysis further shows that most minority classes consistently fall into unstable or non-functional performance regimes across both models. This study concludes that aggregate metrics alone are insufficient for evaluating model behavior in imbalanced multi-label settings. A behavior-oriented interpretation framework for Macro–Micro F1 divergence and a regime-based class reliability categorization are proposed to support more structured and informative evaluation practices.
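The Macro–Micro F1 divergence used as a diagnostic signal above can be illustrated with a minimal sketch. The labels and predictions below are invented toy data, not results from the study, and the function names are hypothetical. The sketch also assumes that a class with no true or predicted positives receives an F1 of 0, which is one common convention among several.

```python
from typing import List, Sequence, Tuple


def per_class_counts(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> List[Tuple[int, int, int]]:
    """Count (TP, FP, FN) for each label in binary indicator matrices."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]
    for t_row, p_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(t_row, p_row)):
            if t and p:
                counts[j][0] += 1  # true positive
            elif p and not t:
                counts[j][1] += 1  # false positive
            elif t and not p:
                counts[j][2] += 1  # false negative
    return [tuple(c) for c in counts]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN); assumed to be 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> Tuple[float, float, List[float]]:
    """Return (macro F1, micro F1, per-class F1 list)."""
    counts = per_class_counts(y_true, y_pred)
    per_class = [f1_from_counts(*c) for c in counts]
    # Macro: unweighted mean of per-class F1, so rare classes count equally.
    macro = sum(per_class) / len(per_class)
    # Micro: pool counts over all classes, so frequent classes dominate.
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    micro = f1_from_counts(tp, fp, fn)
    return macro, micro, per_class


# Toy data (hypothetical): label 0 is frequent, labels 1 and 2 are rare.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]]
# A model that only ever activates the frequent label.
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0], [1, 0, 0]]

macro, micro, per_class = macro_micro_f1(y_true, y_pred)
divergence = micro - macro  # positive: performance concentrated on frequent classes
print(f"macro={macro:.3f} micro={micro:.3f} divergence={divergence:+.3f}")
print("per-class F1:", per_class)
```

In this toy case Micro F1 exceeds Macro F1 because the two rare labels are never predicted, which mirrors the high-frequency concentration pattern the abstract describes. The per-class list is what a regime-based reliability categorization would operate on, although the regime boundaries themselves are not stated in the abstract and are not assumed here.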

Published

2026-06-25
