Weka Study Notes 4 (Attribute Selection)

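In this part we look at attribute selection. In data mining we often need to compute the distance between samples, and that distance is computed from the attribute values. Attributes do not all carry the same weight in the sample space, that is, they are not equally associated with the class, so it makes sense to filter out some attributes or to assign each attribute a weight. Attribute selection methods were developed for exactly this purpose.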
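To make the idea of per-attribute weights concrete, here is a minimal sketch of a weighted Euclidean distance. The class `WeightedDistanceSketch` and its `distance` method are hypothetical names made up for this illustration; they are not part of Weka, and the weighting formula is just one common choice.

```java
/** Hypothetical helper (not part of Weka): a distance in which each attribute has a weight. */
final class WeightedDistanceSketch {

    /** Weighted Euclidean distance: sqrt(sum over i of w[i] * (a[i] - b[i])^2). */
    static double distance(double[] a, double[] b, double[] w) {
        if (a.length != b.length || a.length != w.length) {
            throw new IllegalArgumentException("a, b and w must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += w[i] * d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {1.0, 10.0};
        double[] y = {4.0, 50.0};
        // Equal weights: both attributes contribute to the distance.
        System.out.println(distance(x, y, new double[] {1.0, 1.0}));
        // A weight of 0 removes the second attribute entirely.
        System.out.println(distance(x, y, new double[] {1.0, 0.0}));
    }
}
```

Seen this way, selecting attributes is the special case where every weight is either 0 or 1.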

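InfoGain and GainRatio are among the most common attribute selection methods, and also the easiest to understand. They work on the same principle as building a decision tree: the more information a node (attribute) carries about the class, the higher the weight it receives. There are other approaches too, such as selecting attributes by correlation (Correlation-based Feature Subset Selection for Machine Learning). You can find the paper online for the details of how it works.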
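For readers who want to see the arithmetic behind InfoGain and GainRatio, here is a rough sketch that computes both from a table of counts. It assumes a nominal attribute and uses the textbook definitions (class entropy minus conditional entropy, divided by the split information for the ratio). `InfoGainSketch` and its methods are hypothetical names for this illustration, not Weka code, and the counts in `main` are the well-known weather/outlook toy data.

```java
/** Hypothetical helper (not Weka's implementation): InfoGain and GainRatio from counts. */
final class InfoGainSketch {

    /** Shannon entropy, in bits, of a vector of counts. */
    static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        if (total == 0) {
            return 0.0;
        }
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2.0);
            }
        }
        return h;
    }

    /** table[v][c] = number of instances with attribute value v and class c. */
    static double infoGain(int[][] table) {
        int numClasses = table[0].length;
        int[] classTotals = new int[numClasses];
        int n = 0;
        for (int[] row : table) {
            for (int c = 0; c < numClasses; c++) {
                classTotals[c] += row[c];
                n += row[c];
            }
        }
        if (n == 0) {
            return 0.0;
        }
        // Conditional entropy: class entropy inside each attribute value, weighted by its share.
        double conditional = 0.0;
        for (int[] row : table) {
            int rowTotal = 0;
            for (int c : row) {
                rowTotal += c;
            }
            conditional += ((double) rowTotal / n) * entropy(row);
        }
        return entropy(classTotals) - conditional;
    }

    /** InfoGain divided by the entropy of the attribute's own value distribution. */
    static double gainRatio(int[][] table) {
        int[] valueTotals = new int[table.length];
        for (int v = 0; v < table.length; v++) {
            for (int c : table[v]) {
                valueTotals[v] += c;
            }
        }
        double splitInfo = entropy(valueTotals);
        return splitInfo == 0.0 ? 0.0 : infoGain(table) / splitInfo;
    }

    public static void main(String[] args) {
        // Rows: sunny, overcast, rainy. Columns: play = yes, play = no.
        int[][] outlook = {{2, 3}, {4, 0}, {3, 2}};
        System.out.println("InfoGain  = " + infoGain(outlook));  // roughly 0.247
        System.out.println("GainRatio = " + gainRatio(outlook)); // roughly 0.156
    }
}
```

GainRatio penalizes attributes that split the data into many small groups, which plain InfoGain tends to favor.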

Here is a simple attribute selection example:

package com.csdn;

import java.io.File;

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

public class SimpleAttributeSelection {

    public static void main(String[] args) {
        try {
            /*
             * 1. Load the training data.
             * Here we use the segment dataset that ships with Weka.
             */
            File file = new File("C:/Program Files/Weka-3-6/data/segment-challenge.arff");
            ArffLoader loader = new ArffLoader();
            loader.setFile(file);
            Instances trainIns = loader.getDataSet();
            // Always set the class index before using the instances;
            // otherwise an exception is thrown when the Instances object is used.
            trainIns.setClassIndex(trainIns.numAttributes() - 1);

            /*
             * 2. Create the search method and the attribute evaluator.
             */
            Ranker rank = new Ranker();
            InfoGainAttributeEval eval = new InfoGainAttributeEval();

            /*
             * 3. Evaluate every attribute with the evaluator.
             */
            eval.buildEvaluator(trainIns);

            /*
             * 4. Select attributes with the chosen search method.
             * The Ranker used here simply sorts the attributes by InfoGain.
             */
            int[] attrIndex = rank.search(eval, trainIns);

            /*
             * 5. Print the results: the ranked attribute indices,
             * followed by the InfoGain and name of each attribute.
             */
            StringBuilder attrIndexInfo = new StringBuilder();
            StringBuilder attrInfoGainInfo = new StringBuilder();
            attrIndexInfo.append("Selected attributes:");
            attrInfoGainInfo.append("Ranked attributes:\n");
            for (int i = 0; i < attrIndex.length; i++) {
                attrIndexInfo.append(attrIndex[i]);
                attrIndexInfo.append(",");
                attrInfoGainInfo.append(eval.evaluateAttribute(attrIndex[i]));
                attrInfoGainInfo.append("\t");
                attrInfoGainInfo.append(trainIns.attribute(attrIndex[i]).name());
                attrInfoGainInfo.append("\n");
            }
            System.out.println(attrIndexInfo.toString());
            System.out.println(attrInfoGainInfo.toString());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

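In this example I used the InfoGain attribute evaluator for feature selection. InfoGainAttributeEval computes the InfoGain of each attribute. Weka also provides search methods to go with its attribute evaluators; here we use the simplest one, the Ranker class, which simply sorts the attributes. The search method has further options you can set, such as the set of attributes to search and a threshold, so configure them in detail if you need to.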
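To show what a threshold or a cap on the number of attributes does to a ranking, here is a plain-Java sketch that needs no Weka on the classpath. `RankerSketch` and `rank` are hypothetical names for this illustration. Weka's `Ranker` has `setThreshold` and `setNumToSelect` for the real thing, but the conventions below (keep scores strictly above the threshold, a negative count means no cap) are assumptions of this sketch rather than a description of Weka's internals.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper (not Weka's Ranker): rank attribute indices by score, then filter. */
final class RankerSketch {

    /**
     * Returns attribute indices sorted by descending score, keeping only scores
     * strictly above the threshold and at most numToSelect entries
     * (a negative numToSelect means "no cap").
     */
    static int[] rank(double[] scores, double threshold, int numToSelect) {
        Integer[] order = new Integer[scores.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Sort indices so that the highest-scoring attribute comes first.
        Arrays.sort(order, (a, b) -> Double.compare(scores[b], scores[a]));

        List<Integer> kept = new ArrayList<>();
        for (int idx : order) {
            if (scores[idx] <= threshold) {
                break; // sorted descending, so nothing after this qualifies
            }
            if (numToSelect >= 0 && kept.size() >= numToSelect) {
                break;
            }
            kept.add(idx);
        }
        int[] result = new int[kept.size()];
        for (int i = 0; i < result.length; i++) {
            result[i] = kept.get(i);
        }
        return result;
    }

    public static void main(String[] args) {
        double[] infoGain = {0.1, 0.9, 0.5, 0.0};
        // Drop attributes whose score is not above 0.05.
        System.out.println(Arrays.toString(rank(infoGain, 0.05, -1)));
        // Keep only the two best attributes, with no threshold.
        System.out.println(Arrays.toString(rank(infoGain, Double.NEGATIVE_INFINITY, 2)));
    }
}
```

With the example scores, the first call keeps indices 1, 2 and 0, and the second keeps indices 1 and 2.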

Finally, we print the results: the ranked attribute indices and the InfoGain of each attribute.

(anqiang1984)
