Cluster-Based Minority Over-Sampling for Imbalanced Datasets

Kamthorn PUNTUMAPON; Thanawin RAKTHAMAMON; Kitsana WAIYAMAI

doi:10.1587/transinf.2016EDP7130

IEICE TRANSACTIONS on Information

Summary:

Synthetic over-sampling is a well-known method to solve class imbalance by modifying class distribution and generating synthetic samples. A large number of synthetic over-sampling techniques have been proposed; however, most of them suffer from the over-generalization problem whereby synthetic minority class samples are generated into the majority class region. Learning from an over-generalized dataset, a classifier could misclassify a majority class member as belonging to a minority class. In this paper a method called TRIM is proposed to overcome the over-generalization problem. The idea is to identify minority class regions that compromise between generalization and overfitting. TRIM identifies all the minority class regions in the form of clusters. Then, it merges a large number of small minority class clusters into more generalized clusters. To enhance the generalization ability, a cluster connection step is proposed to avoid over-generalization toward the majority class while increasing generalization of the minority class. As a result, the classifier is able to correctly classify more minority class samples while maintaining its precision. Compared with SMOTE and extended versions such as Borderline-SMOTE, experimental results show that TRIM exhibits significant performance improvement in terms of F-measure and AUC. TRIM can be used as a pre-processing step for synthetic over-sampling methods such as SMOTE and its extended versions.
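
The cluster-restricted idea in the summary can be illustrated with a minimal sketch. This is not the TRIM algorithm as published: the summary does not specify the clustering method, the merge criterion, or the cluster connection step, so the sketch swaps in a simple distance-threshold (single-linkage) grouping and SMOTE-style interpolation limited to pairs inside the same cluster. The function names, the `radius` parameter, and the toy data are all assumptions made for illustration.

```python
import random
from math import dist


def cluster_minority(points, radius):
    """Group minority points so that points within `radius` of each other
    (directly or through a chain of neighbours) share a cluster.
    `radius` is a hypothetical parameter, not one taken from the paper."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        members, stack = [seed], [seed]
        while stack:
            i = stack.pop()
            near = [j for j in unvisited if dist(points[i], points[j]) <= radius]
            unvisited.difference_update(near)
            members.extend(near)
            stack.extend(near)
        clusters.append(sorted(members))
    return sorted(clusters)


def oversample_within_clusters(points, clusters, n_new, rng):
    """Create `n_new` synthetic points by interpolating between two minority
    points drawn from the same cluster, so no sample is placed on a segment
    that bridges two separate minority regions."""
    eligible = [c for c in clusters if len(c) >= 2]
    if not eligible:
        return []
    synthetic = []
    for _ in range(n_new):
        i, j = rng.sample(rng.choice(eligible), 2)
        t = rng.random()  # interpolation weight in [0, 1)
        a, b = points[i], points[j]
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Toy minority class with two well-separated regions (made-up data).
minority = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
clusters = cluster_minority(minority, radius=1.5)
synthetic = oversample_within_clusters(minority, clusters, n_new=20, rng=random.Random(0))
```

Plain SMOTE applied to the same toy data could interpolate between a point near the origin and one near (10, 10), placing synthetic samples in the gap where majority samples may lie. Restricting interpolation to pairs within a cluster avoids that case, which is the kind of over-generalization the summary describes.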

Publication: IEICE TRANSACTIONS on Information Vol.E99-D No.12 pp.3101-3109

Publication Date: 2016/12/01

Publicized: 2016/09/06

Online ISSN: 1745-1361

DOI: 10.1587/transinf.2016EDP7130

Type of Manuscript: PAPER

Category: Artificial Intelligence, Data Mining

Authors

Kamthorn PUNTUMAPON
  Kasetsart University
Thanawin RAKTHAMAMON
  Kasetsart University
Kitsana WAIYAMAI
  Kasetsart University

Keywords

imbalanced data, cluster-based minority over-sampling, synthetic minority over-sampling

Cite this

Kamthorn PUNTUMAPON, Thanawin RAKTHAMAMON, Kitsana WAIYAMAI, "Cluster-Based Minority Over-Sampling for Imbalanced Datasets" in IEICE TRANSACTIONS on Information, vol. E99-D, no. 12, pp. 3101-3109, December 2016, doi: 10.1587/transinf.2016EDP7130.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.2016EDP7130/_p

@ARTICLE{e99-d_12_3101,
author={Kamthorn PUNTUMAPON and Thanawin RAKTHAMAMON and Kitsana WAIYAMAI},
journal={IEICE TRANSACTIONS on Information},
title={Cluster-Based Minority Over-Sampling for Imbalanced Datasets},
year={2016},
volume={E99-D},
number={12},
pages={3101-3109},
keywords={imbalanced data, cluster-based minority over-sampling, synthetic minority over-sampling},
doi={10.1587/transinf.2016EDP7130},
ISSN={1745-1361},
month={December},}

TY - JOUR
TI - Cluster-Based Minority Over-Sampling for Imbalanced Datasets
T2 - IEICE TRANSACTIONS on Information
SP - 3101
EP - 3109
AU - Kamthorn PUNTUMAPON
AU - Thanawin RAKTHAMAMON
AU - Kitsana WAIYAMAI
PY - 2016
DO - 10.1587/transinf.2016EDP7130
JO - IEICE TRANSACTIONS on Information
SN - 1745-1361
VL - E99-D
IS - 12
JA - IEICE TRANSACTIONS on Information
Y1 - December 2016
ER -