
Thursday, July 24, 2008

Methods for Handling Imbalanced Data in Machine Learning

In the real world, imbalanced data is encountered all the time. Class imbalance affects learning algorithms and reduces the generalization ability of the learned model.
There are two main classes of methods for dealing with imbalanced samples: the first rebalances the data through sampling; the second modifies the learning algorithm itself.

1. The first class of methods is the more widely used; it is discussed in detail in the paper
Experimental Perspectives on Learning from Imbalanced Data.
There are many sampling methods; the main ones are: random undersampling (RUS), random oversampling (ROS), one-sided selection (OSS), cluster-based oversampling (CBOS), Wilson's editing (WE), SMOTE (SM), and borderline-SMOTE (BSM).
Brief descriptions of each follow:
The two most common preprocessing techniques are random minority oversampling (ROS) and random majority undersampling (RUS). In ROS, instances of the minority class are randomly duplicated. In RUS, instances of the majority class are randomly discarded from the dataset.
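
A minimal NumPy sketch of both resamplers, assuming a binary task where minority_label marks the rare class (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label):
    """ROS: duplicate randomly chosen minority instances until balanced."""
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    extra = rng.choice(min_idx, size=len(maj_idx) - len(min_idx), replace=True)
    keep = np.concatenate([maj_idx, min_idx, extra])
    return X[keep], y[keep]

def random_undersample(X, y, minority_label):
    """RUS: randomly discard majority instances until balanced."""
    min_idx = np.flatnonzero(y == minority_label)
    maj_idx = np.flatnonzero(y != minority_label)
    kept_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    keep = np.concatenate([kept_maj, min_idx])
    return X[keep], y[keep]
```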

In one of the earliest attempts to improve upon the performance of random resampling, Kubat and Matwin (Kubat & Matwin, 1997) proposed a technique called one-sided selection (OSS). One-sided selection attempts to intelligently undersample the majority class by removing majority class examples that are considered either redundant or 'noisy.'
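
As a rough illustration, here is a NumPy sketch of just the noise-removal step of OSS, dropping majority examples that form Tomek links with minority ones; the consistent-subset (CNN) step is omitted and the names are illustrative:

```python
import numpy as np

def tomek_link_majority(X, y, minority_label):
    """Indices of majority examples forming Tomek links with minority ones.

    A pair (a, b) is a Tomek link if each is the other's nearest neighbor
    and they belong to different classes; OSS discards the majority member
    as noisy or borderline.
    """
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # an example is not its own neighbor
    nn = d.argmin(axis=1)                # nearest neighbor of each example
    drop = [i for i in range(len(y))
            if y[i] != minority_label          # i is a majority example
            and y[nn[i]] == minority_label     # its nearest neighbor is minority
            and nn[nn[i]] == i]                # and they are mutual neighbors
    return np.array(drop, dtype=int)
```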

Wilson's editing (Barandela et al., 2004) (WE) uses the kNN technique with k = 3 to classify each example in the training set using all the remaining examples, and removes those majority class examples that are misclassified. Barandela et al. also propose a modified distance calculation, which causes an example to be biased more towards being identified with positive examples than negative ones.
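
A short NumPy sketch of Wilson's editing under these assumptions: k = 3, leave-one-out neighbors, plain Euclidean distance (the modified distance of Barandela et al. is not included):

```python
import numpy as np

def wilsons_editing(X, y, minority_label, k=3):
    """WE: classify every example with k-NN over all other examples,
    then drop the majority examples that were misclassified."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # leave-one-out: never your own neighbor
    nn = np.argsort(d, axis=1)[:, :k]     # k nearest neighbors of each example
    minority_votes = (y[nn] == minority_label).sum(axis=1)
    predicted_minority = minority_votes > k / 2
    drop = (y != minority_label) & predicted_minority
    return X[~drop], y[~drop]
```
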
Chawla et al. (Chawla et al., 2002) proposed an intelligent oversampling method called Synthetic Minority Oversampling Technique (SMOTE). SMOTE (SM) adds new, artificial minority examples by extrapolating between preexisting minority instances rather than simply duplicating original examples. The technique first finds the k nearest neighbors of the minority class for each minority example (the paper recommends k = 5). The artificial examples are then generated in the direction of some or all of the nearest neighbors, depending on the amount of oversampling desired.
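
A compact NumPy sketch of the core SMOTE step, assuming X_min holds only the minority examples and has more than k rows; each synthetic point lies on the segment between a minority example and one of its k nearest minority neighbors:

```python
import numpy as np

def smote(X_min, n_new, k=5, seed=0):
    """SM: synthesize n_new minority points between each chosen minority
    example and a random one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]               # k nearest minority neighbors
    base = rng.integers(0, len(X_min), size=n_new)  # which originals to grow from
    neigh = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                    # random point on the segment
    return X_min[base] + gap * (X_min[neigh] - X_min[base])
```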

Han et al. presented a modification of Chawla et al.'s SMOTE technique which they call borderline-SMOTE (Han et al., 2005) (BSM). BSM selects minority examples which are considered to be on the border of the minority decision region in the feature-space and only performs SMOTE to oversample those instances, rather than oversampling them all or a random subset.
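
A sketch of the selection step that distinguishes BSM from plain SMOTE: flag the 'danger' minority points whose m nearest neighbors over the whole training set are mostly, but not all, majority examples, then feed only those to SMOTE (names illustrative):

```python
import numpy as np

def borderline_minority(X, y, minority_label, m=5):
    """BSM: indices of minority examples lying on the class border."""
    min_idx = np.flatnonzero(y == minority_label)
    d = np.linalg.norm(X[min_idx][:, None, :] - X[None, :, :], axis=-1)
    d[np.arange(len(min_idx)), min_idx] = np.inf   # ignore self-distance
    nn = np.argsort(d, axis=1)[:, :m]              # m nearest over the whole set
    maj = (y[nn] != minority_label).sum(axis=1)
    danger = (maj >= m / 2) & (maj < m)            # mostly, but not all, majority
    return min_idx[danger]
```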

Cluster-based oversampling (Jo & Japkowicz, 2004) (CBOS) attempts to even out the between-class imbalance as well as the within-class imbalance. There may be subsets of the examples of one class that are isolated in the feature-space from other examples of the same class, creating a within-class imbalance. Small subsets of isolated examples are called small disjuncts. Small disjuncts often cause degraded classifier performance, and CBOS aims to eliminate them without removing data.
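
A hedged sketch of the idea using scikit-learn's KMeans; the exact clustering and target sizes in Jo & Japkowicz's paper may differ, and k and the names here are illustrative (each class is assumed to have at least k examples):

```python
import numpy as np
from sklearn.cluster import KMeans

def cbos(X, y, minority_label, k=3, seed=0):
    """CBOS idea: cluster each class, then randomly oversample every
    cluster up to the size of the largest majority cluster, so small
    disjuncts are inflated rather than removed."""
    rng = np.random.default_rng(seed)
    maj = np.flatnonzero(y != minority_label)
    mnr = np.flatnonzero(y == minority_label)
    maj_c = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[maj])
    mnr_c = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X[mnr])
    target = max((maj_c == c).sum() for c in range(k))  # largest majority cluster
    parts = []
    for idx, labels in ((maj, maj_c), (mnr, mnr_c)):
        for c in range(k):
            members = idx[labels == c]
            if len(members):                            # skip empty clusters
                parts.append(rng.choice(members, size=target, replace=True))
    keep = np.concatenate(parts)
    return X[keep], y[keep]
```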

The paper studies and compares the seven methods above; the experiments show:
One of the most important conclusions that can be drawn from these experiments is the inferior performance of the 'intelligent' sampling techniques, SM, BSM, WE, OSS, and CBOS (especially the last two). While these techniques seem to be promising solutions to the problem of class imbalance, simpler techniques such as RUS or ROS often performed much better. CBOS and OSS especially performed very poorly in our experiments, very rarely being the best sampling technique and often being among the worst.

The experimental results show that, for SVMs in particular, RUS and ROS overall outperform the other algorithms.

2. Algorithmic improvements for SVM

Algorithmic improvements take different forms for different learners. For support vector machines, for example, one can adjust the penalty factor C per class (a weighted SVM), or use EUS-SVM, which combines sampling methods with ensemble learning, as sketched below.
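
For the weighted-SVM route, a minimal scikit-learn sketch; this shows only the per-class penalty idea, not the specific EUS-SVM method:

```python
from sklearn.svm import SVC

# class_weight rescales the penalty C per class, so mistakes on the rare
# class cost more; "balanced" sets weights inversely proportional to the
# class frequencies, and an explicit dict such as {1: 10.0, 0: 1.0} also works.
clf = SVC(kernel="rbf", C=1.0, class_weight="balanced")
# clf.fit(X_train, y_train)  # X_train, y_train: your imbalanced training data
```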

Metrics for evaluating learners on imbalanced data include the area under the ROC curve (AUC), the Kolmogorov-Smirnov statistic (K/S), the geometric mean (G-mean), the F-measure, accuracy, and the true positive rate.
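
A sketch of these metrics with scikit-learn, assuming binary labels with the minority (positive) class coded as 1 and y_score holding the classifier's continuous scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             roc_auc_score, roc_curve)

def imbalance_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    tpr = tp / (tp + fn)                      # true positive rate (recall)
    tnr = tn / (tn + fp)                      # true negative rate
    fpr_c, tpr_c, _ = roc_curve(y_true, y_score)
    return {
        "AUC": roc_auc_score(y_true, y_score),
        "K/S": np.max(tpr_c - fpr_c),         # Kolmogorov-Smirnov statistic
        "G-mean": np.sqrt(tpr * tnr),         # geometric mean of TPR and TNR
        "F-measure": f1_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "TPR": tpr,
    }
```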
