An empirical analysis of feature selection techniques for software defect prediction
Abstract
Detecting software defects before they occur is crucial in software engineering, as it directly affects software quality and reliability. Previous studies on software defect prediction have typically employed software features such as code size, complexity, coupling, cohesion, inheritance, and other software metrics to forecast whether a code file or commit is likely to be defective in the future. However, it is advantageous to restrict the number of features used in a defect prediction model, both to avoid the problems of multicollinearity and the “curse of dimensionality” and to simplify data analysis. With a reduced feature set, a defect prediction model can concentrate on the most significant variables and improve its accuracy. This paper investigates the impact of eight feature selection methods on the accuracy and stability of six supervised learning models. The study is novel in that it exhaustively pairs each of the eight feature selection techniques with each of the six supervised learning models. Two notable findings emerged. First, the association-based and coherence-based techniques achieved the highest defect prediction accuracy: models trained on the features they selected outperformed models trained on the original feature set. Second, three feature selection techniques (correlation feature selection, recursive feature elimination, and ridge feature selection), when combined with the support vector machine and decision tree classifiers, consistently selected low-variance features across multiple supervised defect prediction models. Combined with different classifiers, these techniques achieved exceptional performance on the publicly available NASA datasets CM1 and PC2: accuracy exceeded 85% on CM1 and 95% on PC2, with precision, recall, and F-measure all above 95%. These results represent the best performance achieved in our evaluation.
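The pairing of a feature selection technique with a classifier described above can be sketched as follows. This is an illustrative example only, not the authors' exact pipeline: it uses scikit-learn's recursive feature elimination (RFE) wrapped around a linear SVM, feeding a decision tree classifier, on synthetic data standing in for the NASA CM1/PC2 datasets (which are not bundled here). The sample counts, feature counts, and the choice of 8 retained features are assumptions for the sketch.

```python
# Illustrative sketch of one feature-selection/classifier pairing from the
# study's grid: RFE (with a linear SVM as the ranking estimator) followed by
# a decision tree. Synthetic data stands in for the NASA MDP datasets.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 21 software metrics, 8 of which are informative.
X, y = make_classification(
    n_samples=500, n_features=21, n_informative=8, random_state=0
)

# RFE repeatedly drops the weakest feature (by the linear SVM's coefficients)
# until 8 remain; the decision tree then classifies on the reduced set.
pipe = Pipeline([
    ("select", RFE(SVC(kernel="linear"), n_features_to_select=8)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# 5-fold cross-validated accuracy, mirroring the paper's accuracy metric.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"mean CV accuracy: {scores.mean():.3f}")
```

Swapping the `select` step for a different technique (e.g., a correlation-based filter) or the `clf` step for a different model reproduces one cell of the eight-technique by six-model grid the study evaluates.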
DOI: https://doi.org/10.32629/jai.v7i3.1097
Copyright (c) 2024 Tarunim Sharma, Aman Jatain, Shalini Bhaskar, Kavita Pabreja
License URL: https://creativecommons.org/licenses/by-nc/4.0/