An extensive analysis of several methods for classifying unbalanced datasets

Sharaf Alzoubi, Khaled Aldiabat, Mofleh Al-diabat, Laith Abualigah

Abstract


Handling imbalanced data is a major challenge in large-scale data applications. Imbalanced data classification systems were developed to process such unevenly distributed data as quickly as possible, and numerous neural methods have been proposed to classify imbalanced data accurately. However, the complexity of the data makes classification harder, because it increases resource utilization, computational cost, and algorithmic complexity. This study therefore reports the performance of many classification models on a variety of imbalanced datasets. A performance analysis evaluates each model's classification performance, using precision, specificity, accuracy, and sensitivity as measures of robustness. The advantages and disadvantages of each model are also discussed in detail. Finally, based on the identified drawbacks, future directions for improving the classification of imbalanced data are proposed.
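
For readers unfamiliar with the four measures named above, the following minimal Python sketch shows how they are conventionally computed from binary confusion-matrix counts. It is an illustration only, not the evaluation code used in the study: the helper names (`count_outcomes`, `safe_ratio`, `metrics`) and the toy labels are assumptions introduced here, and undefined ratios are reported as 0.0 purely for convenience.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    tp: int  # minority (positive) samples predicted positive
    fp: int  # majority (negative) samples predicted positive
    tn: int  # majority (negative) samples predicted negative
    fn: int  # minority (positive) samples predicted negative


def count_outcomes(y_true, y_pred, positive=1):
    """Tally the confusion-matrix cells for a binary problem (hypothetical helper)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return ConfusionCounts(tp, fp, tn, fn)


def safe_ratio(numerator, denominator):
    """Return 0.0 when a ratio is undefined (an assumption made for this sketch)."""
    return numerator / denominator if denominator else 0.0


def metrics(c):
    """Standard definitions of the four measures named in the abstract."""
    return {
        "accuracy": safe_ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision": safe_ratio(c.tp, c.tp + c.fp),
        "sensitivity": safe_ratio(c.tp, c.tp + c.fn),  # also called recall
        "specificity": safe_ratio(c.tn, c.tn + c.fp),
    }


# Made-up toy data: 95 majority samples, 5 minority samples, and a
# classifier that always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
result = metrics(count_outcomes(y_true, y_pred))
print(result)
```

On this toy input the always-majority classifier scores high on accuracy and specificity while its sensitivity is zero, which is one reason imbalanced-data studies commonly report several measures rather than accuracy alone.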


Keywords


imbalanced data; data mining; deep learning; classifiers; over and under sampling; optimization algorithms

DOI: https://doi.org/10.32629/jai.v7i3.966


Copyright (c) 2024 Sharaf Alzoubi, Khaled Aldiabat, Mofleh Al-diabat, Laith Abualigah

License URL: https://creativecommons.org/licenses/by-nc/4.0/