Class imbalance has drawn much attention from researchers in software defect prediction. In practice, the performance of defect prediction models may be affected by the class imbalance problem. In this paper, we present an approach to evaluating the performance stability of defect prediction models on imbalanced datasets. First, random sampling is applied to convert the original imbalanced dataset into a set of new datasets with different levels of imbalance ratio. Second, typical prediction models are selected to make predictions on these newly constructed datasets, and the coefficient of variation (CV) is used to evaluate the performance stability of the different models. Finally, an empirical study is designed to evaluate the performance stability of six prediction models that are widely used in software defect prediction. The results show that the performance of C4.5 is unstable on imbalanced datasets, while the performances of Naive Bayes and Random Forest are more stable than those of the other models.
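The stability metric at the heart of the study, the coefficient of variation (CV = standard deviation / mean of a model's scores), is straightforward to compute. Below is a minimal Python sketch of the two building blocks the abstract describes: randomly sampling a dataset down to a target imbalance ratio, and scoring stability across the resulting datasets. The helper names are illustrative, not taken from the paper.

```python
import random
import statistics

def sample_to_ratio(defective, clean, ratio, rng):
    """Randomly undersample the clean (majority) class so that
    len(clean) / len(defective) equals the target imbalance ratio."""
    k = min(len(clean), int(len(defective) * ratio))
    return defective, rng.sample(clean, k)

def coefficient_of_variation(scores):
    """CV = standard deviation / mean of a model's performance scores
    across datasets; a lower CV indicates a more stable model."""
    mean = statistics.mean(scores)
    return statistics.stdev(scores) / mean if mean else float("inf")

# Example: a model whose score barely changes across imbalance ratios
# has a low CV and would be judged stable by this criterion.
stable = coefficient_of_variation([0.80, 0.81, 0.79, 0.80])
unstable = coefficient_of_variation([0.80, 0.55, 0.30, 0.70])
```

In this scheme, each prediction model is trained and evaluated once per constructed dataset, and the CV of its scores across those datasets is compared against the other models' CVs.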
Qiao YU
China University of Mining and Technology
Shujuan JIANG
China University of Mining and Technology
Yanmei ZHANG
China University of Mining and Technology
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Qiao YU, Shujuan JIANG, Yanmei ZHANG, "The Performance Stability of Defect Prediction Models with Class Imbalance: An Empirical Study," IEICE Transactions on Information and Systems,
vol. E100-D, no. 2, pp. 265-272, February 2017, doi: 10.1587/transinf.2016EDP7204.
Abstract: Class imbalance has drawn much attention from researchers in software defect prediction. In practice, the performance of defect prediction models may be affected by the class imbalance problem. In this paper, we present an approach to evaluating the performance stability of defect prediction models on imbalanced datasets. First, random sampling is applied to convert the original imbalanced dataset into a set of new datasets with different levels of imbalance ratio. Second, typical prediction models are selected to make predictions on these newly constructed datasets, and the coefficient of variation (CV) is used to evaluate the performance stability of the different models. Finally, an empirical study is designed to evaluate the performance stability of six prediction models that are widely used in software defect prediction. The results show that the performance of C4.5 is unstable on imbalanced datasets, while the performances of Naive Bayes and Random Forest are more stable than those of the other models.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7204/_p
@ARTICLE{e100-d_2_265,
author={Qiao YU and Shujuan JIANG and Yanmei ZHANG},
journal={IEICE Transactions on Information and Systems},
title={The Performance Stability of Defect Prediction Models with Class Imbalance: An Empirical Study},
year={2017},
volume={E100-D},
number={2},
pages={265-272},
abstract={Class imbalance has drawn much attention from researchers in software defect prediction. In practice, the performance of defect prediction models may be affected by the class imbalance problem. In this paper, we present an approach to evaluating the performance stability of defect prediction models on imbalanced datasets. First, random sampling is applied to convert the original imbalanced dataset into a set of new datasets with different levels of imbalance ratio. Second, typical prediction models are selected to make predictions on these newly constructed datasets, and the coefficient of variation (CV) is used to evaluate the performance stability of the different models. Finally, an empirical study is designed to evaluate the performance stability of six prediction models that are widely used in software defect prediction. The results show that the performance of C4.5 is unstable on imbalanced datasets, while the performances of Naive Bayes and Random Forest are more stable than those of the other models.},
keywords={},
doi={10.1587/transinf.2016EDP7204},
ISSN={1745-1361},
month={February},}
TY - JOUR
TI - The Performance Stability of Defect Prediction Models with Class Imbalance: An Empirical Study
T2 - IEICE Transactions on Information and Systems
SP - 265
EP - 272
AU - Qiao YU
AU - Shujuan JIANG
AU - Yanmei ZHANG
PY - 2017
DO - 10.1587/transinf.2016EDP7204
JO - IEICE Transactions on Information and Systems
SN - 1745-1361
VL - E100-D
IS - 2
JA - IEICE Trans. Inf. & Syst.
Y1 - 2017/02
AB - Class imbalance has drawn much attention from researchers in software defect prediction. In practice, the performance of defect prediction models may be affected by the class imbalance problem. In this paper, we present an approach to evaluating the performance stability of defect prediction models on imbalanced datasets. First, random sampling is applied to convert the original imbalanced dataset into a set of new datasets with different levels of imbalance ratio. Second, typical prediction models are selected to make predictions on these newly constructed datasets, and the coefficient of variation (CV) is used to evaluate the performance stability of the different models. Finally, an empirical study is designed to evaluate the performance stability of six prediction models that are widely used in software defect prediction. The results show that the performance of C4.5 is unstable on imbalanced datasets, while the performances of Naive Bayes and Random Forest are more stable than those of the other models.
ER -