SLT & ML: I recently preprocessed a dataset with more than ten thousand instances and still ran into a few small problems with the tools, so...

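I recently preprocessed a dataset with more than ten thousand instances and still ran into a few small problems with the tools, so here is a summary of the solutions.
Three basic problems came up during preprocessing: handling missing values, handling noise, and normalization.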
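Missing-value handling:
I looked at four tools: Weka, MATLAB, Excel, and SPSS. It turned out that MATLAB, Weka, and SPSS each have a direct way to handle missing values.
MATLAB:
MATLAB provides the fixunknowns() function for handling missing values. Example: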
x1 = [1 2 3 4; 4 NaN 6 5; NaN 2 3 NaN]
[y1,ps] = fixunknowns(x1)
The result is shown in the figure:

Further information
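
For readers without the toolbox, here is a minimal sketch of what fixunknowns does to x1, written with base MATLAB only. It assumes the documented behaviour: a row that contains NaN gets its NaN entries replaced by the mean of that row's known values and is followed by a 1/0 row marking which entries were known. The variable names (known, row) are my own, and this is an illustration, not the toolbox implementation.

```matlab
x1 = [1 2 3 4; 4 NaN 6 5; NaN 2 3 NaN];

y1 = [];                                   % output, grown row by row
for r = 1:size(x1,1)
    row   = x1(r,:);
    known = ~isnan(row);                   % logical mask of known entries
    if all(known)
        y1 = [y1; row];                    % complete row: copied unchanged
    else
        row(~known) = mean(row(known));    % fill NaN with the mean of the known values
        y1 = [y1; row; double(known)];     % filled row, then its 1/0 indicator row
    end
end
disp(y1)
```

The sketch turns the 3 rows of x1 into 5 rows: rows 2 and 3 contain NaN, so each of them gains an indicator row.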

Below is a MATLAB function that fills in missing values:
function [y]=ReplaceMissingValue(x,type,value)

%%
% input:
%   x: n*d matrix; n is the number of samples, d is the dimension
%   type:
%     1: replace each NaN with value
%     2: replace each NaN with the mean of its dimension (column of x);
%        in this case value is optional
%     if type is omitted, it defaults to 2
%   value: the value used to replace NaN (required when type is 1)
% output:
%   y: n*d matrix, x with every NaN replaced
%
% For reference, the settings struct ps returned by fixunknowns(x1) in the
% example above is:
%        name: 'fixunknowns'
%       xrows: 3
%       yrows: 5
%     unknown: [2 3]
%       known: 1
%       shift: [0 0 1]
%      xmeans: [3x1 double]

if nargin <2
    type=2;
end
if nargin <3 && type==1
    % Raise an error instead of returning with y unassigned
    error('Please assign a "value" to replace NaN with.');
end

if type==2
    % fixunknowns works row by row, so transpose x to d*n first
    [tempy,tempps]=fixunknowns(x');

    % Indices of the 1/0 indicator rows: fixunknowns inserts one right
    % after every row of x' that contains NaN
    numNaN=size(tempps.unknown,2);
    indicatorRows=tempps.unknown+(1:numNaN);

    % Drop the indicator rows and transpose back to n*d
    tempy(indicatorRows,:)=[];
    y=tempy';

end
if type==1
    % Copy x and overwrite only the NaN entries; this avoids shadowing the
    % built-in nan and stays correct when value is Inf
    y=x;
    y(isnan(x))=value;

end
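
If the toolbox function is not available, the same two replacement types can be sketched with base MATLAB alone. This is only a sketch under my own assumptions: x is n*d as in the function above, type 2 means "mean of the known values in the same column", and the names yConst and yMean are made up for the illustration.

```matlab
x = [1 4 NaN; 2 NaN 2; 3 6 3; 4 5 NaN];    % n*d: 4 samples, 3 dimensions

% type 1: replace every NaN with a fixed value
value  = 0;
yConst = x;
yConst(isnan(x)) = value;

% type 2: replace every NaN with the mean of the known values in its column
yMean = x;
for j = 1:size(x,2)
    missing = isnan(x(:,j));               % rows where column j is missing
    yMean(missing,j) = mean(x(~missing,j));
end

disp(yConst)
disp(yMean)
```

A column that consists only of NaN has no known values to average, so in this sketch it simply stays NaN.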

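Weka: the filter is under filter -> unsupervised -> attribute -> ReplaceMissingValues. It offers only one way of handling missing values, namely modes and means (the mode for nominal attributes, the mean for numeric ones).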
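
To make "modes and means" concrete, here is a small MATLAB sketch of the idea, not Weka's code. It assumes nominal attributes are stored as numeric codes and that the column types are given by a hand-written flag vector (isNominal, a name I made up).

```matlab
% Column 1 is numeric; column 2 is a nominal attribute coded as 1, 2, 3
data      = [5.0 1; NaN 2; 7.0 2; 6.0 NaN];
isNominal = [false true];                  % assumed column types

filled = data;
for j = 1:size(data,2)
    missing = isnan(data(:,j));
    known   = data(~missing,j);            % values that are actually present
    if isNominal(j)
        filled(missing,j) = mode(known);   % most frequent category
    else
        filled(missing,j) = mean(known);   % average of the known values
    end
end
disp(filled)
```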

SPSS is considerably more powerful here: it offers several methods for handling missing values, under Transform -> Replace Missing Values.

MATLAB and Excel can also deal with missing values through programming or other workarounds, but that is comparatively less convenient.

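Data normalization:
There are many methods. One of the most common is the z-score, which rescales the data to mean = 0 and std = 1. Both SPSS and MATLAB provide it. For SPSS, see the figure: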

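The MATLAB function is mapstd(data,means,std); see the MATLAB documentation for details (note: data is d*n, where d is the number of dimensions and n is the number of samples).
As shown in the figure (x1 is the same x1 used for mapminmax below):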
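
The z-score itself is easy to write by hand. The sketch below uses base MATLAB on a small d*n matrix of my own, and assumes the sample standard deviation (normalized by n-1), which is MATLAB's default for std; it is meant to show the computation, not to reproduce mapstd exactly.

```matlab
data = [1 2 3 4; 10 20 30 40];             % d*n: 2 dimensions, 4 samples
n    = size(data,2);

mu    = mean(data,2);                      % per-row mean, d*1
sigma = std(data,0,2);                     % per-row sample std (n-1), d*1

% Subtract the row mean and divide by the row std
z = (data - repmat(mu,1,n)) ./ repmat(sigma,1,n);
disp(z)
```

After this step each row of z has mean 0 and std 1. A constant row has std 0, so it would need special treatment before the division.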

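Another method is to normalize to an interval such as [a,b]. These methods are all simple, and most of them rest on the same principle: y = (ymax-ymin)*(x-xmin)/(xmax-xmin) + ymin. In MATLAB the function is mapminmax(data); see the MATLAB documentation for details (note: data is d*n, where d is the number of dimensions and n is the number of samples).
As shown in the figure: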
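
The formula can be applied directly, row by row, with base MATLAB. In this sketch the data matrix is my own example and the target interval [-1, 1] is chosen because it is the default range of mapminmax; any other [ymin, ymax] works the same way.

```matlab
data = [1 2 3 4; 10 20 30 50];             % d*n: 2 dimensions, 4 samples
ymin = -1;  ymax = 1;                      % target interval [ymin, ymax]

n    = size(data,2);
xmin = repmat(min(data,[],2),1,n);         % per-row minimum, expanded to d*n
xmax = repmat(max(data,[],2),1,n);         % per-row maximum, expanded to d*n

y = (ymax-ymin)*(data-xmin)./(xmax-xmin) + ymin;
disp(y)
```

Each row is mapped so that its smallest value becomes ymin and its largest becomes ymax. A constant row has xmax equal to xmin and would divide by zero, so it has to be handled separately.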

Noise handling:

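A simple method: for continuous variables, normalize the data with the z-score, then treat any sample that contains a cell with a value greater than 3 or less than -3 as noise. For categorical variables, every value has a specific meaning, so any value outside the set of valid values can be treated as noise. Noise points can either be removed or handled as missing values. SPSS and Excel both support this kind of simple processing.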
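
For a single continuous variable, the z-score rule looks roughly like this in base MATLAB. The data vector is invented for the illustration, and the names isNoise, cleaned and asMissing are mine.

```matlab
x = [10*ones(10,1); 11*ones(10,1); 100];   % 21 samples, the last one looks suspicious

z       = (x - mean(x)) / std(x);          % z-score of every sample
isNoise = abs(z) > 3;                      % values outside [-3, 3]

cleaned   = x(~isNoise);                   % option 1: drop the noisy samples
asMissing = x;
asMissing(isNoise) = NaN;                  % option 2: treat them as missing values

disp(find(isNoise)')
```

Only the last sample is flagged here. Note that with the sample standard deviation a z-score can never exceed (n-1)/sqrt(n) in absolute value, so the threshold of 3 cannot be reached with fewer than 11 samples.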

Tuesday, July 14, 2009
