Decision Tree Algorithms: ID3

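Decision trees are among the most widely used data mining algorithms. At their core is a greedy algorithm that builds the tree top-down by recursive partitioning. A typical decision tree looks like this:
[Figure: a typical decision tree]
Commonly used decision tree algorithms today include ID3, its successors C4.5 and C5.0, and CART.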

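The core idea of ID3 is to use information gain as the attribute selection criterion at every level of the tree, so that the test performed at each non-leaf node yields the most class information about the records being tested.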

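Characteristics of ID3
Advantages: clear theory, simple method, strong learning ability
Disadvantages:
(1) Information gain is biased toward attributes that take many distinct values
(2) ID3 is a non-incremental algorithm
(3) ID3 builds univariate decision trees (each node tests a single attribute)
(4) Poor robustness to noise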
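
Drawback (1) can be seen in a small experiment. The sketch below uses made-up toy data (not taken from this post) and the information gain defined in the next section. The attribute names `row_id` and `coin` and both helper functions are hypothetical, introduced only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    # c/total * log2(total/c) equals -p * log2(p) with p = c/total
    return sum(c / total * log2(total / c) for c in Counter(labels).values())


def info_gain(values, labels):
    """Information gain obtained by splitting `labels` on the attribute `values`."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 8 samples, 4 "yes" and 4 "no".
labels = ["yes", "no", "yes", "no", "yes", "no", "yes", "no"]
row_id = list(range(8))                          # unique per sample, no real signal
coin = ["a", "a", "a", "a", "b", "b", "b", "b"]  # two values, each half yes / half no

gain_row_id = info_gain(row_id, labels)
gain_coin = info_gain(coin, labels)
print("gain(row_id) =", gain_row_id)
print("gain(coin)   =", gain_coin)
```

Every `row_id` subset holds a single sample and is therefore pure, so the attribute reaches the maximum possible gain even though it cannot generalize to new samples. This is the bias that C4.5's gain ratio was designed to reduce.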

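Entropy and Information Gain
Let $S$ be the training set, containing samples from $n$ classes denoted ${C_i}$ ($i = 1, \ldots, n$). Entropy and information gain are then defined as follows.
Information entropy:
$E(S) = - \sum\limits_{i = 1}^n {{p_i}{{\log }_2}{p_i}}$
where ${p_i}$ is the probability of class ${C_i}$, i.e. the fraction of samples in $S$ that belong to ${C_i}$.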
Conditional entropy of $S$ given attribute $A$:
${E_A}(S) = \sum\limits_{j = 1}^m {\frac{{|{S_j}|}}{{|S|}}E({S_j})}$
where ${S_j}$ is the $j$-th of the $m$ subsets obtained by partitioning $S$ on attribute $A$, and $|S|$ and $|{S_j}|$ are the numbers of samples in $S$ and ${S_j}$.
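Information gain:
$Gain(S,A) = E(S) - {E_A}(S)$
In ID3, the more evenly the samples are spread across the classes, the larger the entropy. The guiding principle is therefore that a smaller conditional entropy ${E_A}(S)$ is better, which is the same as saying that a larger information gain is better.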
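
To make the link between uniformity and entropy concrete, here is a minimal sketch that evaluates the entropy formula on a few two-class distributions. The function name `entropy_from_probs` is an assumption made for this example, and terms with $p = 0$ are skipped following the usual convention $0{\log _2}0 = 0$.

```python
from math import log2


def entropy_from_probs(probs):
    """Shannon entropy in bits of a discrete distribution.

    p * log2(1 / p) is the same as -p * log2(p); zero-probability
    terms are skipped because they contribute nothing.
    """
    return sum(p * log2(1 / p) for p in probs if p > 0)


# From perfectly uniform to perfectly pure.
distributions = [[0.5, 0.5], [0.75, 0.25], [0.9, 0.1], [1.0, 0.0]]
entropies = [entropy_from_probs(d) for d in distributions]

for dist, ent in zip(distributions, entropies):
    print(dist, round(ent, 4))
```

The entropy is 1 bit for the uniform split and falls to 0 for the pure one, decreasing steadily in between.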

Worked Example

outlook	tem	hum	windy	play
overcast	hot	high	not	no
overcast	hot	high	very	no
overcast	hot	high	medium	no
sunny	hot	high	not	yes
sunny	hot	high	medium	yes
rain	mild	high	not	no
rain	mild	high	medium	no
rain	hot	normal	not	yes
rain	cool	normal	medium	no
rain	hot	normal	very	no
sunny	cool	normal	very	yes
sunny	cool	normal	medium	yes
overcast	mild	high	not	no
overcast	mild	high	medium	no
overcast	cool	normal	not	yes
overcast	cool	normal	medium	yes
rain	mild	normal	not	no
rain	mild	normal	medium	no
overcast	mild	normal	medium	yes
overcast	hot	normal	very	yes
sunny	mild	high	very	yes
sunny	mild	high	medium	yes
sunny	hot	normal	not	yes
rain	mild	high	very	no

In the samples above, 12 have the result ${yes}$ and 12 have ${no}$, so by the formula above the entropy of the training set is:
$E(S) = - \frac{12}{{24}}{\log _2}\frac{12}{{24}} - \frac{12}{{24}}{\log _2}\frac{12}{{24}} = 1$
Next, compute the information gain of each attribute: outlook, tem, hum and windy.
outlook splits $S$ into three parts: sunny, rain and overcast. Writing ${S_v}$ for the set of samples whose attribute value is ${v}$, we have $|{S_{sunny}}| = 7$, $|{S_{overcast}}| = 9$ and $|{S_{rain}}| = 8$. ${S_{sunny}}$ contains 7 ${yes}$ samples and 0 ${no}$ samples, ${S_{overcast}}$ contains 4 ${yes}$ and 5 ${no}$, and ${S_{rain}}$ contains 1 ${yes}$ and 7 ${no}$. The conditional entropy ${E_A}(S)$ for $A = outlook$, written $E(S,outlook)$, is therefore (taking $0{\log _2}0 = 0$):

$E(S,outlook) = \frac{7}{{24}}( - \frac{7}{7}{\log _2}\frac{7}{7} - 0) + \frac{9}{{24}}( - \frac{4}{9}{\log _2}\frac{4}{9} - \frac{5}{9}{\log _2}\frac{5}{9}) + \frac{8}{{24}}( - \frac{1}{8}{\log _2}\frac{1}{8} - \frac{7}{8}{\log _2}\frac{7}{8}) = 0.5528$
$Gain(S,outlook) = 1 - 0.5528 = 0.4472$

Similarly:
$E(S,tem) = 0.8893$
$Gain(S,tem) = 0.1107$

$E(S,hum) = 0.9183$
$Gain(S,hum) = 0.0817$

$E(S,windy) = 1$
$Gain(S,windy) = 0$
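
The conditional entropies and gains can be cross-checked with a short script. This is a minimal sketch, independent of the ID3.py listing further down; the names `RAW`, `entropy` and `conditional_entropy` are introduced here for illustration only. The rows are copied from the training table above.

```python
from collections import Counter, defaultdict
from math import log2

# The 24 training samples: outlook tem hum windy play
RAW = """
overcast hot high not no
overcast hot high very no
overcast hot high medium no
sunny hot high not yes
sunny hot high medium yes
rain mild high not no
rain mild high medium no
rain hot normal not yes
rain cool normal medium no
rain hot normal very no
sunny cool normal very yes
sunny cool normal medium yes
overcast mild high not no
overcast mild high medium no
overcast cool normal not yes
overcast cool normal medium yes
rain mild normal not no
rain mild normal medium no
overcast mild normal medium yes
overcast hot normal very yes
sunny mild high very yes
sunny mild high medium yes
sunny hot normal not yes
rain mild high very no
"""

COLUMNS = ["outlook", "tem", "hum", "windy"]
rows = [line.split() for line in RAW.strip().splitlines()]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum(c / total * log2(total / c) for c in Counter(labels).values())


def conditional_entropy(rows, col):
    """Weighted entropy of the class column after splitting on column `col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[col]].append(row[-1])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


base = entropy([row[-1] for row in rows])
gains = {}
for col, name in enumerate(COLUMNS):
    cond = conditional_entropy(rows, col)
    gains[name] = base - cond
    print(f"{name:8s} E = {cond:.4f}  Gain = {gains[name]:.4f}")
```

Running it should show outlook with the largest gain and windy with a gain of 0.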
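As shown above, outlook has the largest information gain, so it is chosen as the test attribute at the root node. windy has an information gain of 0, meaning it provides no classification information at this level. This produces the first level of the tree. The same procedure is then applied recursively to every child node that is not yet pure, which yields the final decision tree.
[Figure: the final decision tree]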

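Python implementation:
There are many ID3 implementations online. I found one, plugged in my own example and debugged it until it produced the result. If this raises any copyright concern, please contact me and I will remove the content.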

#!/usr/bin/python
#coding=utf-8

#File Name: ID3.py
#Author   : john
#Created Time: Fri 31 Aug 2018 10:19:40 AM CST

from math import log

def createDataSet():
    #outlook: 0 rain   1 overcast   2 sunny
    #tem:     0 cool   1 mild       2 hot
    #hum:     0 normal 1 high
    #windy    0 not    1 medium     2 very 
    dataSet = [[1, 2, 1, 0, 'no'],
               [1, 2, 1, 2, 'no'],
               [1, 2, 1, 1, 'no'],
               [2, 2, 1, 0, 'yes'],
               [2, 2, 1, 1, 'yes'],
               [0, 1, 1, 0, 'no'],
               [0, 1, 1, 1, 'no'],
               [0, 2, 0, 0, 'yes'],
               [0, 0, 0, 1, 'no'],
               [0, 2, 0, 2, 'no'],
               [2, 0, 0, 2, 'yes'],
               [2, 0, 0, 1, 'yes'],
               [1, 1, 1, 0, 'no'],
               [1, 1, 1, 1, 'no'],
               [1, 0, 0, 0, 'yes'],
               [1, 0, 0, 1, 'yes'],
               [0, 1, 0, 0, 'no'],
               [0, 1, 0, 1, 'no'],
               [1, 1, 0, 1, 'yes'],
               [1, 2, 0, 2, 'yes'],
               [2, 1, 1, 2, 'yes'],
               [2, 1, 1, 1, 'yes'],
               [2, 2, 0, 0, 'yes'],
               [0, 1, 1, 2, 'no'],]

    return dataSet

# Compute the Shannon entropy of a data set
def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]  # the class label is the last column
        if currentLabel not in labelCounts:  # first time this label is seen
            labelCounts[currentLabel] = 0
        labelCounts[currentLabel] += 1

    # Accumulate -p * log2(p) over all class labels
    Ent = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries
        Ent -= prob * log(prob, 2)
    #print ("entropy: ", Ent)
    return Ent

# Split the data set: keep the rows whose column `axis` equals `value`
def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:      # row matches; copy it without column `axis`
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet  # the matching rows, with column `axis` removed

# Choose the best attribute to split on: the lower the class-label entropy after the split, the better
def chooseBestFeatureToSplit(dataSet):
    baseEntropy = calcShannonEnt(dataSet)   # entropy of the whole data set
    numFeatures = len(dataSet[0])-1         # the last column is the class label
    bestInfoGain = 0.0  # largest information gain found so far
    bestFeature = 0    # index of the best attribute (stays 0 if no split has positive gain)

    for i in range(0, numFeatures):
        featList = [example[i] for example in dataSet]  # column i of every row
        uniqueVals = set(featList)
        newEntropy = 0.0
        for value in uniqueVals:    # weighted entropy of the subsets produced by splitting on attribute i
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob*calcShannonEnt(subDataSet)
        infoGain = baseEntropy-newEntropy   # smaller entropy after the split means larger information gain
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

# If every attribute has been used but a leaf still holds more than one class label, label the leaf by majority vote
def majorityCnt(classList): # argument: the class labels that reach this leaf
    classCount = {}
    for vote in classList:
        if vote not in classCount:
            classCount[vote] = 0
        classCount[vote] += 1   # count every vote, not only the first occurrence
    sortedClassCount = sorted(classCount.items(), key=lambda item: item[1], reverse=True)
    return sortedClassCount[0][0]

# Build the tree
def createTree(dataSet, labels):    # arguments: data set, attribute labels (used only to make the printed tree readable)
    classList = [example[-1] for example in dataSet]    # class labels of all samples
    if classList.count(classList[0]) == len(classList): # all samples share one class: this branch is a finished leaf
        return classList[0]
    if len(dataSet[0]) == 1:    # only the class column is left: every attribute has been used, so take a majority vote
        return majorityCnt(classList)
    bestFeat = chooseBestFeatureToSplit(dataSet)    # attribute with the largest information gain
    bestFeatLabel = labels[bestFeat]    # name of that attribute
    myTree = {bestFeatLabel:{}} # the (sub)tree rooted at this attribute
    del(labels[bestFeat])   # remove it from the label list (this mutates the caller's list)
    # Grow one branch per value of the chosen attribute
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        subDataSet = splitDataSet(dataSet, bestFeat, value)
        myTree[bestFeatLabel][value] = createTree(subDataSet, subLabels)
    print ("myTree:", myTree)
    return myTree

# Test the algorithm: use the decision tree to classify a sample
def classify(inputTree, featLabels, testVec):   # arguments: decision tree, attribute labels, sample to classify
    firstStr = list(inputTree.keys())[0]  # attribute tested at the root of this (sub)tree
    secondDict = inputTree[firstStr]
    featIndex = featLabels.index(firstStr)  # position of that attribute in the sample vector
    classLabel = None  # stays None if the tree has no branch for the sample's value
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                classLabel = secondDict[key]
    return classLabel

if __name__ == '__main__':
    dataSet = createDataSet()
    labels = ['outlook', 'tem', 'hum', 'windy']

    labelsForCreateTree = labels[:]
    Tree = createTree(dataSet, labelsForCreateTree )
    testvec = [2, 2, 1, 0]
    print (classify(Tree, labels, testvec))

[John@127 ID3]$ python3 ID3.py                 
myTree: {'windy': {0: 'yes', 2: 'no'}}
myTree: {'tem': {0: 'no', 1: 'no', 2: {'windy': {0: 'yes', 2: 'no'}}}}
myTree: {'hum': {0: 'yes', 1: 'no'}}
myTree: {'outlook': {0: {'tem': {0: 'no', 1: 'no', 2: {'windy': {0: 'yes', 2: 'no'}}}}, 1: {'hum': {0: 'yes', 1: 'no'}}, 2: 'yes'}}
yes
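
The integer-coded tree printed above is hard to read. As a rough sketch, it can be flattened into readable rules using the value encodings from the comments in createDataSet(). The helper `tree_to_rules` and the `DECODE` table are illustrative names and not part of the original script.

```python
# Value encodings copied from the comments in createDataSet().
DECODE = {
    "outlook": {0: "rain", 1: "overcast", 2: "sunny"},
    "tem": {0: "cool", 1: "mild", 2: "hot"},
    "hum": {0: "normal", 1: "high"},
    "windy": {0: "not", 1: "medium", 2: "very"},
}

# The final tree, copied from the last "myTree:" line of the run above.
tree = {'outlook': {0: {'tem': {0: 'no', 1: 'no',
                                2: {'windy': {0: 'yes', 2: 'no'}}}},
                    1: {'hum': {0: 'yes', 1: 'no'}},
                    2: 'yes'}}


def tree_to_rules(node, path=()):
    """Return one (conditions, label) pair per leaf of the tree."""
    if not isinstance(node, dict):      # a leaf holds the class label
        return [(path, node)]
    attr = next(iter(node))             # the attribute tested at this node
    rules = []
    for value, child in node[attr].items():
        step = (attr, DECODE[attr][value])
        rules.extend(tree_to_rules(child, path + (step,)))
    return rules


rules = tree_to_rules(tree)
for conditions, label in rules:
    print(" and ".join(f"{a}={v}" for a, v in conditions), "->", label)
```

The rules contain no branch for outlook=rain, tem=hot, windy=medium, because no such sample occurs in the training data, so the tree cannot classify an input of that kind.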
