[Keywords]
[Abstract]
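As machine learning spreads into more and more fields, software testing remains a critical part of the software development process. To address the class imbalance and accuracy challenges that arise in software defect prediction, this paper proposes a supervised-learning-based solution. It balances the samples by combining the synthetic minority over-sampling technique (SMOTE) with the edited nearest neighbor (ENN) algorithm, and then tests a range of algorithms: locally weighted learning (LWL), J48, C4.8, random forest, Bayesian network (Bayes net, BN), multilayer feedforward neural network (MFNN), support vector machine (SVM), and naive Bayes (NB-K). These algorithms are applied to three different datasets from the NASA database, and their results are compared and analyzed in detail. The results show that the random forest model combined with SMOTE and ENN handles class imbalance efficiently while avoiding overfitting, providing an effective solution to the class imbalance problem in software defect prediction.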
[Keywords]
[Abstract]
As the amount of software on the market grows rapidly, software quality becomes increasingly important, making software testing an indispensable part of the software development process. With the growing demand for software testing, emerging technologies have been widely applied to it; among them, machine learning, with its predictive models and scalability, has gradually become a mainstream approach to software defect prediction. However, software defect prediction still faces several problems, in particular class imbalance and limited prediction accuracy. This paper proposes a supervised-learning-based software defect prediction method that targets these two core problems. Specifically, the samples in three datasets from the NASA database (KK1, KK3, PK2) are balanced using the SMOTE algorithm for over-sampling and the ENN algorithm for under-sampling. The paper then compares and analyzes the performance of various supervised learning algorithms on these three datasets, including Locally Weighted Learning (LWL), J48, C4.8, Random Forest, Bayesian Belief Network, Multilayer Feedforward Neural Network, Support Vector Machine (SVM), and NB-K. The results indicate that the SMOTE+ENN+Random Forest model can effectively address the class imbalance problem, while the other methods show certain limitations in comparison.
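The sample-balancing step named in the abstract (SMOTE over-sampling followed by ENN cleaning) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: it assumes binary labels, Euclidean distance, and at least two minority samples, and the function names (`k_nearest`, `smote`, `enn`, `smote_enn`) and the toy data are hypothetical. The classifier stage (for example Random Forest) is left out.

```python
import math
import random
from collections import Counter


def k_nearest(point, pool, k):
    """Indices of the k points in `pool` closest to `point` (Euclidean)."""
    order = sorted(range(len(pool)), key=lambda i: math.dist(point, pool[i]))
    return order[:k]


def smote(X, y, minority, k=3, seed=0):
    """Add synthetic minority samples until both classes have equal size.

    Assumes exactly two classes and at least two minority samples.
    """
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(y) - len(minority_pts)) - len(minority_pts)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base_idx = rng.randrange(len(minority_pts))
        base = minority_pts[base_idx]
        others = minority_pts[:base_idx] + minority_pts[base_idx + 1:]
        # Pick one of the k nearest minority neighbours of `base`.
        other = others[rng.choice(k_nearest(base, others, k))]
        gap = rng.random()
        # The synthetic point lies on the segment between base and neighbour.
        X_out.append([b + gap * (o - b) for b, o in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Drop samples whose label disagrees with most of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        rest_X = X[:i] + X[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        votes = Counter(rest_y[j] for j in k_nearest(x, rest_X, k))
        if votes.most_common(1)[0][0] == label:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority, k=3, seed=0):
    """Over-sample with SMOTE, then clean the result with ENN."""
    X_bal, y_bal = smote(X, y, minority, k, seed)
    return enn(X_bal, y_bal, k)


# Toy data (made up): 8 non-defective modules (0), 4 defective modules (1).
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
     [0.5, 0.5], [0.2, 0.8], [0.8, 0.2], [0.4, 0.1],
     [5.0, 5.0], [6.0, 6.0], [5.0, 6.0], [6.0, 5.0]]
y = [0] * 8 + [1] * 4

X_res, y_res = smote_enn(X, y, minority=1)
print(Counter(y_res))
```

Here ENN is applied to both classes after over-sampling, so it removes samples of either label that sit in a neighbourhood dominated by the other class. Whether the paper restricts the cleaning to one class is not stated in the abstract.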
[CLC number]
[Funding]