A hybrid software defects prediction model for imbalance datasets using machine learning techniques: (S-SVM model)

Mohd. Mustaqeem, Tamanna Siddiqui

Abstract


Software defect prediction (SDP) is an essential task for developing quality software, and various models have been developed for this purpose. However, the imbalanced nature of software defect datasets challenges these models and degrades their performance. To address this challenge, the authors propose a hybrid machine learning model that combines the Synthetic Minority Oversampling Technique (SMOTE) with a Support Vector Machine (SVM), termed the SMOTE-SVM (S-SVM) model. The authors empirically examine SDP using multiple datasets (CM1, PC1, JM1, PC3, KC1, EQ and JDT) from the PROMISE and AEEEM repositories. In the experimental study, the S-SVM model was trained and compared with previously developed models on balanced and imbalanced test datasets using four evaluation metrics: Precision, Recall, F1-score, and Accuracy. For the balanced dataset, the S-SVM model achieved precision values ranging from 70 to 96, recall values ranging from 52 to 94, F1-score values ranging from 67 to 90, and accuracy values ranging from 69 to 98. For the imbalanced dataset, the S-SVM model achieved precision values ranging from 60 to 93, recall values ranging from 64 to 97, F1-score values ranging from 69 to 91, and accuracy values ranging from 67 to 87. The proposed S-SVM model outperforms the other models in classifying and predicting software defects. Therefore, the hybridisation of SMOTE and SVM improves the model’s ability to categorise and predict defects in both balanced and imbalanced datasets when sufficient defective and non-defective data are provided.
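The two ingredients named in the abstract, SMOTE-style oversampling of the minority (defective) class and the four evaluation metrics, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `classification_metrics`, the neighbour count `k=3`, the fixed seed, and the toy data are all assumptions made for illustration, and the SVM training step is omitted.

```python
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples (SMOTE-style sketch).

    Each synthetic point is a random interpolation between a minority
    sample and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)  # fixed seed (assumption) for repeatability
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def classification_metrics(y_true, y_pred, positive=1):
    """Return precision, recall, F1-score and accuracy as fractions."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Toy data (hypothetical): 4 defective modules versus 10 clean ones.
defective = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
clean = [[float(i), float(i % 3)] for i in range(4, 14)]

# Oversample the defective class until both classes have the same size.
synthetic = smote(defective, n_new=len(clean) - len(defective))
balanced_defective = defective + synthetic

# Score a hypothetical set of predictions against the true labels.
scores = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```

In a full pipeline, the balanced training set would be passed to an SVM classifier and the metrics computed on held-out predictions; oversampling is normally applied to the training split only, so that the test data keeps its original class distribution.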

Keywords


Software Defect Prediction (SDP); SVM; SMOTE; Empirical Software Engineering; Software Quality; Balanced & Imbalanced Learning

DOI: https://doi.org/10.32629/jai.v6i1.559

Copyright (c) 2023 Mohd. Mustaqeem, Tamanna Siddiqui

License URL: https://creativecommons.org/licenses/by-nc/4.0