Machine Learning in Action study notes, part 03: decision tree principles, source code walkthrough, and tests
Keywords: decision tree, python, source code walkthrough, testing
Author: 米仓山下
Date: 2018-10-24
Machine Learning in Action (@author: Peter Harrington)
Source code download: https://www.manning.com/books/machine-learning-in-action
git@github.com:pbharrin/machinelearninginaction.git
*************************************************************
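1. Decision tree principles and source code walkthrough
File: trees.py, an implementation of the ID3 decision tree algorithm. Main functions in the code:
createDataSet----creates the example dataset, 5 rows × 3 columns: five records, where the first two columns are features and the last column says whether the animal is a fish. labels holds the meaning of the two features: 'no surfacing' (can it survive without coming to the surface) and 'flippers' (does it have flippers).
Example: five marine animals with two features: whether they can survive without surfacing, and whether they have flippers. The decision tree sorts these animals into two classes.
Data requirements: first, the data must be a list of lists, and every inner list must have the same length; second, the last column (the last element of each instance) must be that instance's class label.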
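#Computing Shannon entropy
calcShannonEnt----computes the Shannon entropy of a given dataset. Entropy measures the information (disorder) in a set: H = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i), where p(x_i) is the fraction of records whose class label is x_i.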
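To make the entropy calculation concrete, here is a minimal Python 3 sketch. It is not the book's code: the name shannon_entropy and the use of collections.Counter are my own choices, and it assumes the class label sits in the last column as described in the data requirements.

```python
from collections import Counter
from math import log2


def shannon_entropy(dataset):
    """Entropy (in bits) of the class labels stored in the last column."""
    counts = Counter(row[-1] for row in dataset)  # class label -> occurrences
    total = len(dataset)
    return -sum((c / total) * log2(c / total) for c in counts.values())


# The fish dataset from createDataSet: 2 'yes' and 3 'no'
fish = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(shannon_entropy(fish))  # about 0.971
```

With 2 of 5 records labelled 'yes' and 3 of 5 labelled 'no', the value is -(0.4*log2(0.4) + 0.6*log2(0.6)), which agrees with the 0.9709505944546686 reported in the test section.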
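#Choosing the best feature by information gain
chooseBestFeatureToSplit----chooses the best way to split the dataset and returns the index of the best feature (the one whose split gives the largest information gain). Principle: iterate over all features; for each one, split the dataset by that feature's values (splitDataSet) into several subsets and sum their entropies, each weighted by the subset's share of the records. Information gain is the reduction in entropy (disorder): here, the entropy of the original dataset minus the weighted entropy after the split. Finally, pick the feature with the largest information gain and return its column index.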
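The weighted-entropy step is easy to get wrong, so here is a small Python 3 sketch of the information gain for each feature of the fish dataset. The helper names entropy and info_gain are hypothetical, not from trees.py.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a flat list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def info_gain(dataset, axis):
    """Entropy before the split minus the size-weighted entropy after it."""
    base = entropy([row[-1] for row in dataset])
    after = 0.0
    for value in set(row[axis] for row in dataset):
        subset = [row[-1] for row in dataset if row[axis] == value]
        after += len(subset) / len(dataset) * entropy(subset)
    return base - after


fish = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
gains = [info_gain(fish, axis) for axis in range(2)]
best = max(range(2), key=lambda axis: gains[axis])
print(gains)  # roughly [0.420, 0.171]
print(best)   # 0, i.e. 'no surfacing'
```

Splitting on 'no surfacing' leaves one pure subset (two 'no' records), which is why its gain is higher and why it ends up at the root of the tree shown later.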
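#Splitting the dataset
splitDataSet----splits dataSet on a given feature (axis) and value (value): it returns the records whose feature equals that value, with the feature column removed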
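A short Python 3 sketch of the same split, written as a list comprehension. The name split_dataset is mine; the behavior is meant to mirror splitDataSet as described above.

```python
def split_dataset(dataset, axis, value):
    """Rows whose feature `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in dataset if row[axis] == value]


fish = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
print(split_dataset(fish, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(split_dataset(fish, 0, 0))  # [[1, 'no'], [1, 'no']]
```

Slicing builds new lists, so the original rows are left untouched, which matters because the same dataset is split once per candidate feature.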
================================================================
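#Building the decision tree
createTree----the tree-building code, a recursive function
Principle: start from the original dataset and split it on the best feature. Since a feature can take more than two values, a split can produce more than two branches. After the first split, each subset is passed down to the next node on its branch, where it is split again, so the dataset is processed recursively. Recursion stops when all features have been used up, or when every instance on a branch has the same class (any data reaching a leaf necessarily belongs to that leaf's class). In the first case, if the features are used up but the class labels at the current node are still not unique, the leaf's class is decided by majority vote (majorityCnt).
majorityCnt----implements the majority vote and returns the most frequent class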
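For reference, the majority vote can be sketched in Python 3 with collections.Counter. This is a substitute for the sort-based approach in trees.py, not the book's code, and majority_class is a hypothetical name.

```python
from collections import Counter


def majority_class(class_list):
    """Return the class label that occurs most often in class_list."""
    return Counter(class_list).most_common(1)[0][0]


print(majority_class(['yes', 'no', 'no']))  # no
```

When two classes tie, most_common returns the one encountered first (on Python 3.7+), so the result for tied leaves depends on record order.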
createTree code walkthrough:
def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]  #last column of the data, i.e. the class labels
    if classList.count(classList[0]) == len(classList):  #are all class labels identical (all equal to the first one)?
        return classList[0]  #yes: stop and return that label as a leaf node
    if len(dataSet[0]) == 1:  #only the class column is left (each used feature was removed, none remain)
        return majorityCnt(classList)  #majority vote decides the label; leaf node
    bestFeat = chooseBestFeatureToSplit(dataSet)  #column index of the best feature
    bestFeatLabel = labels[bestFeat]  #its label
    myTree = {bestFeatLabel:{}}  #build the node (a dict keyed by the best feature's label)
    del(labels[bestFeat])  #remove the label that was just used (this mutates the caller's list)
    featValues = [example[bestFeat] for example in dataSet]  #values in the best feature's column
    uniqueVals = set(featValues)  #the distinct values
    for value in uniqueVals:  #one branch per value
        subLabels = labels[:]  #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)  #add a child node by recursing on the subset that has this value
    return myTree
>>> import trees
>>> import treePlotter
>>> data,lable=trees.createDataSet()
>>> data
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> lable
['no surfacing', 'flippers']
>>> mytree=trees.createTree(data,lable)
>>> mytree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
================================================================
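#Classifying test samples with the built tree.
#Principle: compare the test vector's feature values with the values stored in the tree's nodes, recursing until a leaf node is reached; the test vector is then assigned the leaf's class
classify----classifies a test sample using the built tree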
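def classify(inputTree,featLabels,testVec):
    firstStr = list(inputTree.keys())[0]  #key of the root node, i.e. the first feature used (there is only one key); list() keeps this working on Python 3
    secondDict = inputTree[firstStr]  #children of the root node (possibly several)
    featIndex = featLabels.index(firstStr)  #column index of that feature
    key = testVec[featIndex]  #the test vector's value for that feature
    valueOfFeat = secondDict[key]  #the child node whose key equals the test value
    if isinstance(valueOfFeat, dict):  #is the child a dict?
        classLabel = classify(valueOfFeat, featLabels, testVec)  #yes: it has children of its own, keep descending
    else:
        classLabel = valueOfFeat  #no: a leaf node was reached, return its class
    return classLabel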
Test:
>>> import trees
>>> import treePlotter
>>> data,lable=trees.createDataSet()
>>> data
[[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]
>>> lable
['no surfacing', 'flippers']
>>> mytree=trees.createTree(data,lable)
>>> mytree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
>>> data,lable=trees.createDataSet()#reset lable, which createTree modified
>>> trees.classify(mytree,lable,[0,0])#classify a test vector with the built tree
'no'
>>>
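The same lookup can also be written as a loop over the nested dict. The sketch below is a Python 3 variant under my own naming, not the book's classify. It raises KeyError when the test vector holds a feature value that never appeared in training, as the recursive version does.

```python
def classify(tree, feat_labels, test_vec):
    """Follow branches of a nested-dict tree until a leaf (non-dict) is reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))              # the single key is the feature name
        index = feat_labels.index(feature)      # column of that feature in test_vec
        node = node[feature][test_vec[index]]   # KeyError if the value was never seen
    return node


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']
print(classify(tree, labels, [0, 0]))  # no
print(classify(tree, labels, [1, 1]))  # yes
```

Using next(iter(node)) avoids the Python 2 idiom inputTree.keys()[0], which fails on Python 3 because keys() returns a view rather than a list.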
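#Other functions
storeTree----serializes the built tree object with pickle and saves it to a local file
grabTree----loads a tree saved in a local file with pickle and rebuilds the tree object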
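A minimal Python 3 round trip with pickle, assuming the tree is the nested dict produced by createTree. The function names and the file name classifierStorage.txt are placeholders of mine. Note the binary modes 'wb' and 'rb', which pickle requires on Python 3.

```python
import os
import pickle
import tempfile


def store_tree(tree, filename):
    """Serialize the tree to disk (binary mode is required on Python 3)."""
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)


def grab_tree(filename):
    """Load a previously stored tree from disk."""
    with open(filename, 'rb') as fr:
        return pickle.load(fr)


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
path = os.path.join(tempfile.mkdtemp(), 'classifierStorage.txt')  # placeholder file name
store_tree(tree, path)
restored = grab_tree(path)
print(restored == tree)  # True
```

Storing the tree avoids rebuilding it on every classification run. Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.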
trees.py source (see the tests below)
'''
Created on Oct 12, 2010
Decision Tree Source Code for Machine Learning in Action Ch. 3
@author: Peter Harrington
'''
from math import log
import operator

def createDataSet():
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    labels = ['no surfacing','flippers']
    #change to discrete values
    return dataSet, labels

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet: #count the unique class labels and their occurrences
        currentLabel = featVec[-1]
        if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * log(prob,2) #log base 2
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]     #chop out axis used for splitting
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1      #the last column is used for the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):        #iterate over all the features
        featList = [example[i] for example in dataSet] #create a list of all the examples of this feature
        uniqueVals = set(featList)       #get a set of unique values
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy     #calculate the info gain; ie reduction in entropy
        if (infoGain > bestInfoGain):       #compare this to the best gain so far
            bestInfoGain = infoGain         #if better than current best, set to best
            bestFeature = i
    return bestFeature                      #returns an integer

def majorityCnt(classList):
    classCount={}
    for vote in classList:
        if vote not in classCount.keys(): classCount[vote] = 0
        classCount[vote] += 1
    #items() instead of iteritems() so this also runs on Python 3
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]

def createTree(dataSet,labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0] #stop splitting when all of the classes are equal
    if len(dataSet[0]) == 1: #stop splitting when there are no more features in dataSet
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]       #copy all of labels, so trees don't mess up existing labels
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value),subLabels)
    return myTree

def classify(inputTree,featLabels,testVec):
    firstStr = list(inputTree.keys())[0] #list() so this also runs on Python 3
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)
    key = testVec[featIndex]
    valueOfFeat = secondDict[key]
    if isinstance(valueOfFeat, dict):
        classLabel = classify(valueOfFeat, featLabels, testVec)
    else: classLabel = valueOfFeat
    return classLabel

def storeTree(inputTree,filename):
    import pickle
    fw = open(filename,'wb') #binary mode, required by pickle on Python 3
    pickle.dump(inputTree,fw)
    fw.close()

def grabTree(filename):
    import pickle
    fr = open(filename,'rb') #binary mode, required by pickle on Python 3
    tree = pickle.load(fr)
    fr.close()
    return tree
****************************************************************
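2. Visualizing the tree with matplotlib
File: treePlotter.py. Main functions in the code:
Of these, createPlot renders the tree passed to it.
plotMidText----writes the text at the midpoint of a connecting line
plotTree----lays out and draws the tree
createPlot----main entry point; renders the tree passed to it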
#Getting the number of leaf nodes and the depth of the tree
getNumLeafs----number of leaf nodes
getTreeDepth----number of levels in the tree
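Since the treePlotter.py source is not reproduced here, the following is only a sketch of how the leaf count and depth of a nested-dict tree could be computed in Python 3. The names count_leaves and tree_depth are mine, and depth is taken to mean the number of decision nodes on the longest root-to-leaf path, which is an assumption about what getTreeDepth reports.

```python
def count_leaves(tree):
    """Number of leaf (non-dict) values in a nested-dict decision tree."""
    branches = next(iter(tree.values()))  # the dict under the single feature key
    return sum(count_leaves(child) if isinstance(child, dict) else 1
               for child in branches.values())


def tree_depth(tree):
    """Number of decision nodes on the longest root-to-leaf path."""
    branches = next(iter(tree.values()))
    return max(1 + tree_depth(child) if isinstance(child, dict) else 1
               for child in branches.values())


tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
print(count_leaves(tree))  # 3
print(tree_depth(tree))    # 2
```

A plotting routine can use the leaf count to spread nodes evenly along the x axis and the depth to space the levels along the y axis.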
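#Creating test data
retrieveTree----defines two predefined trees for testing
#Drawing tree nodes with text annotations
plotNode----draws a tree node using a text annotation
treePlotter.py source (see the tests below)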
*************************************************************
3. Testing decision tree classification
>>> import trees
>>> import treePlotter
>>> test_tree_data=trees.createDataSet()#test data
>>> test_tree_data
([[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']], ['no surfacing', 'flippers'])
>>> em=trees.calcShannonEnt(test_tree_data[0])#entropy of the test data
>>> em
0.9709505944546686
>>> mytree=trees.createTree(*test_tree_data)#build the decision tree
>>> mytree
{'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
>>> treePlotter.createPlot(mytree)#visualize the tree
(Figure: Figure_1-1.png)
>>> test_tree_data=trees.createDataSet()#note: createTree above modified test_tree_data[1], i.e. lable, so recreate the data
>>> trees.classify(mytree,test_tree_data[1],[0,0])
'no'
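The label list has to be recreated because createTree deletes entries from the list object it is given. The Python 3 sketch below isolates that effect with a hypothetical stand-in function, consume_label, rather than the real createTree, and shows that passing a shallow copy keeps the caller's list intact.

```python
def consume_label(labels, index):
    """Hypothetical stand-in for the del(labels[bestFeat]) step in createTree."""
    chosen = labels[index]
    del labels[index]  # mutates the list object the caller passed in
    return chosen


mutated = ['no surfacing', 'flippers']
consume_label(mutated, 0)
print(mutated)    # ['flippers']

preserved = ['no surfacing', 'flippers']
consume_label(preserved[:], 0)  # pass a shallow copy instead
print(preserved)  # ['no surfacing', 'flippers']
```

By the same reasoning, calling trees.createTree(data, lable[:]) should leave lable usable for classify without calling createDataSet again.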
--------------------------------------------------------------------------
#Testing the contact lens example: the features are ['age','prescript','astigmatic','tearRate'] and the classes are ['no lenses','soft','hard']
>>> import trees
>>> import treePlotter
>>> fr=open('lenses.txt')
>>> lenses=[inst.strip().split('\t') for inst in fr.readlines()]
>>> lenseslable=['age','prescript','astigmatic','tearRate']
>>> lensestree=trees.createTree(lenses,lenseslable)
>>> lensestree
{'tearRate': {'reduced': 'no lenses', 'normal': {'astigmatic': {'yes': {'prescript': {'hyper': {'age': {'pre': 'no lenses', 'presbyopic': 'no lenses', 'young': 'hard'}}, 'myope': 'hard'}}, 'no': {'age': {'pre': 'soft', 'presbyopic': {'prescript': {'hyper': 'soft', 'myope': 'no lenses'}}, 'young': 'soft'}}}}}}
>>> treePlotter.createPlot(lensestree)
(Figure: lenses test)
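Problem: this tree has too many branches fitted to the training data, i.e. it overfits.
Final reminder: when using the functions here, mind the data requirements described earlier.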