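Claude Shannon is widely regarded as one of the most brilliant minds of the twentieth century. In his 2005 book Fortune's Formula, William Poundstone put it this way: many people at Bell Labs and MIT compared Shannon to Einstein, while others felt the comparison was unfair, that is, unfair to Shannon.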

Claude Shannon

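Shannon defined information and quantified it with "information entropy". How much a piece of information matters is measured by the probability of its occurrence. Suppose an event occurs with probability $p(x)$; its information content is then defined as follows: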

$\text{information content} = -\log_2 p(x)$

Clearly, when the event is certain, $p(x)=1$, its information content is 0. Conversely, as the probability approaches 0,

$\lim_{p(x) \rightarrow 0} -\log_2 p(x) = \infty$
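To make the two limiting cases concrete, here is a minimal sketch that evaluates the definition for a few probabilities. The function name `information_content` is chosen for this illustration only and is not used elsewhere in the article.

```python
import math

def information_content(p):
    """Information content in bits of an event with probability p (0 < p <= 1)."""
    # log2(1/p) equals -log2(p); this form avoids printing -0.0 for p = 1
    return math.log2(1 / p)

# Rarer events carry more information; a certain event carries none
for p in (1.0, 0.5, 0.25, 0.01):
    print(p, information_content(p))
```

Halving the probability adds exactly one bit, which is why a fair coin flip ($p = 0.5$) carries one bit of information.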

Information Entropy

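So far we have only discussed the information content of a single event. How, then, do we measure the disorder of multiple events? Shannon proposed the concept of "information entropy":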

$H = -\sum_{i=1}^n p(x_i) \underline{\log_2 p(x_i)}$

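Clearly, information entropy is the information content of each individual event, weighted by the probability of that event. Below is a dataSet for practicing the entropy calculation; the first two columns are features $(feature1, feature2)$ and the last column is the class label $label$.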

$dataSet = \begin{bmatrix} 1 & 1 & 'yes' \\ 1 & 1 & 'yes' \\ 1 & 0 & 'no' \\ 0 & 1 & 'no' \\ 0 & 1 & 'no' \end{bmatrix}$

Computing the information entropy of this dataset, the output should be $H=0.9709505944546686$:

import math

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    # count how many rows carry each class label (last column)
    labelCounts = {}
    for featVec in dataSet:
        currentLabel = featVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel,0) + 1
    # accumulate -p * log2(p) over all class labels
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * math.log(prob,2)
    return shannonEnt

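Computing the entropy is simple: all that is needed is $p(x_i)$. In the $labelCounts<key,value>$ dictionary, each key is a class ('yes' or 'no') and each value is the number of times that class occurs.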

prob = float(labelCounts[key])/numEntries

Clearly, $p('yes') = \frac{2}{5}, p('no')= \frac{3}{5}$, so the information entropy $H(dataSet)$ follows immediately:

$H(dataSet) = -\sum_{i=1}^2 p(x_i) \log_2 p(x_i) = -(\frac{2}{5} \times \log_2\frac{2}{5} + \frac{3}{5} \times \log_2\frac{3}{5}) \approx 0.97$
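The hand calculation can be checked with a few lines of Python. This is only a sketch: `entropy_of_labels` is a name introduced for this example, and it uses `collections.Counter` rather than the article's dictionary-based counting.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)  # class -> number of occurrences
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Label column of the example dataSet: two 'yes' and three 'no'
labels = ['yes', 'yes', 'no', 'no', 'no']
H = entropy_of_labels(labels)
print(round(H, 2))
```

The entropy is highest when the classes are evenly mixed and drops to zero when every label is the same.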

Information Gain

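Information gain measures how much information a feature brings to the classification system. For a given feature, the amount of information in the system differs depending on whether the feature is used or not, and the difference between the two is the amount of information the feature contributes to the system (that is, its information gain).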

Splitting the Dataset

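To compute information gain, we examine how each feature (for example, $axis=0$ corresponds to $feature1$) affects the information entropy. Each feature can take different values (feature 1 in the figure below has only two: "yes" $1$ and "no" $0$), so the probability of each value is counted separately. Clearly, for feature 1 we have $prob_1 = \frac{3}{5}, prob_0 = \frac{2}{5}$. Next, compute the entropy $Ent_0$ of $\begin{bmatrix} 1 & 'no' \\ 1 & 'no' \end{bmatrix}$, then the entropy $Ent_1$ of $\begin{bmatrix} 1 & 'yes' \\ 1 & 'yes' \\ 0 & 'no' \end{bmatrix}$. The probability-weighted sum $prob_0 \times Ent_0 + prob_1 \times Ent_1$ is the entropy that remains after splitting on feature 1, and the information gain contributed by feature 1 is the entropy of the full dataset minus this remainder:

$Information\ Gain(IG) = H(dataSet) - (prob_0 \times Ent_0 + prob_1 \times Ent_1)$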
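As a worked example, the numbers for feature 1 can be filled in by hand. The derivation below takes information gain to be the entropy of the full dataset minus the probability-weighted entropy of the subsets, which matches `infoGain = baseEntropy - newEntropy` in the code later in this article; the values are rounded to three decimals.

```latex
\begin{aligned}
Ent_0 &= -\left(1 \times \log_2 1\right) = 0 \\
Ent_1 &= -\left(\tfrac{2}{3}\log_2\tfrac{2}{3} + \tfrac{1}{3}\log_2\tfrac{1}{3}\right) \approx 0.918 \\
IG &= H(dataSet) - \left(prob_0 \times Ent_0 + prob_1 \times Ent_1\right) \\
   &\approx 0.971 - \left(\tfrac{2}{5} \times 0 + \tfrac{3}{5} \times 0.918\right) \approx 0.420
\end{aligned}
```

The subset for value 0 is already pure (all 'no'), so it contributes nothing to the remaining entropy.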

Illustration (Supplementary)

Numbers without a picture offer little intuition, so the splitting process is illustrated below:

$dataSet = \begin{bmatrix} 1 & 1 & 'yes' \\ 1 & 1 & 'yes' \\ 1 & 0 & 'no' \\ 0 & 1 & 'no' \\ 0 & 1 & 'no' \end{bmatrix} \begin{cases} \underline{axis=0,value=0} \begin{bmatrix} 1 & 'no' \\ 1 & 'no' \end{bmatrix} \\ \\ \underline{axis=0,value=1} \begin{bmatrix} 1 & 'yes' \\ 1 & 'yes' \\ 0 & 'no' \end{bmatrix} \\ \\ \vdots \end{cases}$

$dataSet$ is split on the feature column given by $axis$ and the value given by $value$ (see the figure above):

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        # keep only the rows whose feature at `axis` equals `value`
        if featVec[axis] == value:
            # copy the row without the column at `axis`
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet
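The split shown in the figure can be reproduced directly. The sketch below uses a compact list-comprehension equivalent, named `split_data_set` for this example only, so that the block runs on its own.

```python
def split_data_set(data_set, axis, value):
    """Rows whose column `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in data_set if row[axis] == value]

data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

print(split_data_set(data_set, 0, 0))  # [[1, 'no'], [1, 'no']]
print(split_data_set(data_set, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
```

Slicing builds new lists, so the original `data_set` is left unchanged by the split.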

Note 1: about extend

>>> a = [1,2,3]
>>> b = [4,5,6]
>>> a.extend(b)
>>> a
[1, 2, 3, 4, 5, 6]

Note 2: about append

>>> a = [1,2,3]
>>> b = [4,5,6]
>>> a.append(b)
>>> a
[1, 2, 3, [4, 5, 6]]

The IG Formula

The explanation above should already be clear, but for completeness, the full formula for information gain is:

$IG = H(C) - H(C|T) = H(C) - [\underline{p(t)H(C|t) + p(t')H(C|t')}] \\ = -\sum_{i=1}^n p(x_i) \log_2 p(x_i) - [\underline{-p(t)\sum_{i=1}^n p(x_i|t) \log_2 p(x_i|t) - p(t')\sum_{i=1}^n p(x_i|t') \log_2 p(x_i|t')}]$

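Here, $T$ is the selected feature; $p(t)$ is the probability that feature $T$ is present and $p(t')$ the probability that it is absent; $H(C)$ is the information entropy of the classes $C$.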

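The formula above is only the simplest form, because it considers just two cases: the feature being present ($t$) or absent ($t'$). In practice a feature may take several values, not merely "yes" or "no". Chinese universities, for example, are grouped into Double First-Class, Project 985, Project 211, first-tier, second-tier, and third-tier undergraduate institutions, vocational colleges…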
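For a feature with more than two values, the two-term bracket generalizes to a sum over every value the feature can take. The notation $V(T)$ for the set of values of $T$ is an assumption introduced here, not something the formula above defines.

```latex
IG(T) = H(C) - \sum_{v \in V(T)} p(T = v)\, H(C \mid T = v),
\qquad
H(C \mid T = v) = -\sum_{i=1}^{n} p(x_i \mid T = v) \log_2 p(x_i \mid T = v)
```

With $V(T) = \{t, t'\}$ this reduces to the binary formula, and it is the quantity the code in the next section accumulates in its loop over `uniqueVals`.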

The Best Split

Compare the information gain produced by each candidate split and choose the one with the largest gain; that is the best split:

def chooseBestFectureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # last column is the label, not a feature
    baseEntropy = calcShannonEnt(dataSet)  # entropy before any split
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):
        featList = [example[i] for example in dataSet]
        uniqueVals = set(featList)  # distinct values of feature i
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted subset entropy
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature
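To see why feature 1 (column 0) wins on the example dataset, the gain of each column can be printed side by side. This is a self-contained sketch; `entropy` and `info_gain` are helper names made up for this example and are condensed versions of the functions above.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy in bits of the label column (last column) of rows."""
    total = len(rows)
    counts = Counter(row[-1] for row in rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, axis):
    """Entropy of rows minus the weighted entropy after splitting on column axis."""
    remainder = 0.0
    for value in set(row[axis] for row in rows):
        subset = [row for row in rows if row[axis] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(rows) - remainder

data_set = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'], [0, 1, 'no'], [0, 1, 'no']]

gains = [info_gain(data_set, i) for i in range(2)]
best = max(range(2), key=lambda i: gains[i])
print(gains, best)
```

Column 0 yields a gain of roughly 0.42 bits against roughly 0.17 bits for column 1, so the first split is made on column 0.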

Illustration (Supplementary)

The diagram below traces how the splitting code above executes:

$\begin{bmatrix} 1 & 1 & 'yes' \\ 1 & 1 & 'yes' \\ 1 & 0 & 'no' \\ 0 & 1 & 'no' \\ 0 & 1 & 'no' \end{bmatrix} \begin{cases} <column=0> \begin{bmatrix} 1 \\ 1 \\ 1 \\ 0 \\ 0 \end{bmatrix} <set> \begin{bmatrix} 0 \\ 1 \end{bmatrix} \begin{cases} <value=0> \begin{bmatrix} 1 & 'no' \\ 1 & 'no' \end{bmatrix} \\ \\ <value=1> \begin{bmatrix} 1 & 'yes' \\ 1 & 'yes' \\ 0 & 'no' \end{bmatrix} \end{cases} \rightarrow Calc\ IG \\ \\ <column=1> \begin{bmatrix} 1 \\ 1 \\ 0 \\ 1 \\ 1 \end{bmatrix} <set> \begin{bmatrix} 0 \\ 1 \end{bmatrix} \dots \end{cases}$

Building the Decision Tree

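Building a decision tree is a process of repeatedly selecting a feature and then checking whether each value of that feature determines the class; if it does not, feature selection continues. As an example, let the first column (feature 1) be $No\ surfacing$ and the second column (feature 2) be $Flippers$: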

$dataSet = \begin{bmatrix} 1 & 1 & 'yes' \\ 1 & 1 & 'yes' \\ 1 & 0 & 'no' \\ 0 & 1 & 'no' \\ 0 & 1 & 'no' \end{bmatrix} <No\ surfacing> \begin{cases} <0> \begin{bmatrix} 1 & 'no' \\ 1 & 'no' \end{bmatrix} \rightarrow No \\ \\ <1> \begin{bmatrix} 1 & 'yes' \\ 1 & 'yes' \\ 0 & 'no' \end{bmatrix} <Flippers> \begin{cases} <0> \begin{bmatrix} 'no' \end{bmatrix} \rightarrow No \\ \\ <1> \begin{bmatrix} 'yes' \\ 'yes' \end{bmatrix} \rightarrow Yes \end{cases} \end{cases}$

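This is a clean example, but there are special cases in which every feature has been used and the classes still cannot be separated. For instance, suppose that in the figure above the branch where $Flippers$ takes the value $<1>$ yielded the classes $\begin{bmatrix} 'yes' \\ 'no' \\ 'yes' \end{bmatrix}$. No features remain to split on, so a vote is needed to decide the class.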

import operator

def majorityCnt(classList):
    # tally the votes for each class
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    # sort by count, highest first, and return the winning class
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
    return sortedClassCount[0][0]
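The same vote can also be written with the standard library's `collections.Counter`. This is an alternative sketch, not the article's implementation; `majority_label` is a name introduced for this example.

```python
from collections import Counter

def majority_label(class_list):
    """Return the most frequent class label in class_list."""
    # most_common(1) returns a one-element list of (label, count) pairs
    return Counter(class_list).most_common(1)[0][0]

print(majority_label(['yes', 'no', 'yes']))  # yes
```

When two classes are tied, which one wins depends on the order in which the labels were first seen, so a real application may want an explicit tie-breaking rule.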

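Following the process shown in the figure above, the code for generating the decision tree is given below. The whole procedure recursively selects features according to their information gain. So what are the termination conditions for the recursion?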

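When a feature takes a given value and all remaining rows belong to the same class (for example, when the $No\ surfacing$ feature above takes the value $<0>$, every class is $No$, so the classes are separated);
When no features remain to split on, in which case the voting method $majorityCnt$ described above is used.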

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    # stop: every row has the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # stop: only the label column is left, so fall back to a majority vote
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFectureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])  # note: this mutates the list passed in by the caller
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]  # copy so that sibling branches do not interfere
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

The dataset used above ultimately produces the following decision tree:

$\{'no\ surfacing': \{0: 'no', 1: \{'flippers': \{0: 'no', 1: 'yes'\}\}\}\}$

Classifying with the Decision Tree

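Once the decision tree has been built, classifying a test input just means following the tree down to a leaf node, whose value is the final class. In code, this amounts to repeatedly looking up a $key$ until a leaf is reached (that is, until the $else$ branch in the code below runs and fixes the class).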

Decision tree classification

# Example tree: {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
def classify(inputTree, featLabels, testVec):
    firstStr = list(inputTree.keys())[0]    # feature tested at this node
    secondDict = inputTree[firstStr]        # branches keyed by feature value
    featIndex = featLabels.index(firstStr)  # position of that feature in testVec
    for key in secondDict.keys():
        if testVec[featIndex] == key:
            if isinstance(secondDict[key], dict):
                # internal node: keep descending
                classLabel = classify(secondDict[key], featLabels, testVec)
            else:
                # leaf node: this is the class
                classLabel = secondDict[key]
    return classLabel
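The recursive lookup can also be written as a loop. The sketch below is an alternative, not the article's code: `classify_iter` and its `default` parameter are assumptions introduced here, the latter to return a fallback when the test vector holds a feature value that the tree has no branch for.

```python
def classify_iter(tree, feat_labels, test_vec, default=None):
    """Walk a nested-dict decision tree and return the class at the leaf reached."""
    node = tree
    while isinstance(node, dict):
        feature = next(iter(node))              # the single feature tested at this node
        branches = node[feature]                # feature value -> subtree or class
        value = test_vec[feat_labels.index(feature)]
        if value not in branches:
            return default                      # feature value never seen in training
        node = branches[value]
    return node

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}
labels = ['no surfacing', 'flippers']

print(classify_iter(tree, labels, [1, 0]))  # no
print(classify_iter(tree, labels, [1, 1]))  # yes
```

Returning a default for unseen values is one possible design choice; raising an error is equally reasonable when such inputs indicate a data problem.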

Here is a main function. Note that $createTree$ modifies $labels$ as it runs, so a copy is passed in.

if __name__ == '__main__':
    dataSet,labels = createDataSet()
    myTree = createTree(dataSet, labels[:])
    cls = classify(myTree, labels, [1,0])
    print(cls)
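The main function calls `createDataSet`, which is not listed in this article. A minimal sketch consistent with the dataset matrix and the tree shown earlier might look like the following; the feature names are taken from the keys of that tree, and the function body is an assumption rather than the original code.

```python
def createDataSet():
    """Return the five-row example dataset and the names of its two feature columns."""
    # Each row is [feature1, feature2, class label], as in the dataSet matrix above
    dataSet = [[1, 1, 'yes'],
               [1, 1, 'yes'],
               [1, 0, 'no'],
               [0, 1, 'no'],
               [0, 1, 'no']]
    # Assumed feature names, matching the keys of the decision tree shown earlier
    labels = ['no surfacing', 'flippers']
    return dataSet, labels
```

Under this assumption the test vector `[1,0]` follows the `'no surfacing' = 1` branch and then the `'flippers' = 0` branch, so the main function above would be expected to print `no`.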

Storing and Loading the Decision Tree

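The $pickle$ module is used to serialize objects (by default it stores them in binary form). To avoid encoding errors, the file is opened explicitly in $wb$ mode for storing and $rb$ mode for reading.
Storing the decision tree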

import pickle

def storeTree(inputTree, filename):
    # the with-statement closes the file even if dump raises
    with open(filename, 'wb') as fw:
        pickle.dump(inputTree, fw)

Loading the decision tree

def grabTree(filename):
    # the with-statement ensures the file is closed after loading
    with open(filename, 'rb') as fr:
        return pickle.load(fr)
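A quick round trip confirms that the tree survives being stored and loaded. This sketch writes to a temporary directory so that it leaves no files behind; `store_tree`, `grab_tree`, and the file name `tree.pkl` are names chosen for this example.

```python
import os
import pickle
import tempfile

def store_tree(tree, filename):
    """Serialize the tree to filename in binary mode."""
    with open(filename, 'wb') as fw:
        pickle.dump(tree, fw)

def grab_tree(filename):
    """Load a tree previously written by store_tree."""
    with open(filename, 'rb') as fr:
        return pickle.load(fr)

tree = {'no surfacing': {0: 'no', 1: {'flippers': {0: 'no', 1: 'yes'}}}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'tree.pkl')
    store_tree(tree, path)
    restored = grab_tree(path)

print(restored == tree)  # True
```

Unpickling can execute arbitrary code, so `pickle.load` should only be used on files from a trusted source.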

Once again, here is a main function:

if __name__ == '__main__':
    dataSet,labels = createDataSet()
    storeTree(createTree(dataSet, labels[:]), 'store_d_tree.txt')
    cls = classify(grabTree('store_d_tree.txt'), labels, [1,0])
    print(cls)

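References:
[1] Li Rui, Li Peng, Qu Yadong, Wang Bin (trans.). 机器学习实战 (Machine Learning in Action) [M]. Beijing: Posts & Telecom Press, 2013.
[2] Baidu Baike, "Information Gain" (信息增益), https://baike.baidu.com/item/%E4%BF%A1%E6%81%AF%E5%A2%9E%E7%9B%8A, June 29, 2018.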

Appendix A (Predicting Contact Lens Types with a Decision Tree)

Dataset: https://pan.baidu.com/s/1qHvawOiAxnBHieZdp9z_BQ (password: 7paq)

#coding=utf-8
import math, operator, pickle

def calcShannonEnt(dataSet):
    numEntries = len(dataSet)
    labelCounts = {}
    for fectVec in dataSet:
        currentLabel = fectVec[-1]
        labelCounts[currentLabel] = labelCounts.get(currentLabel,0) + 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key])/numEntries
        shannonEnt -= prob * math.log(prob,2)
    return shannonEnt

def splitDataSet(dataSet, axis, value):
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            reducedFeatVec = featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet

def chooseBestFectureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1 #remove label column
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0; bestFeature = -1
    for i in range(numFeatures):
        fectList = [example[i] for example in dataSet]
        uniqueVals = set(fectList)
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet,i,value)
            prob = len(subDataSet)/float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)
        infoGain = baseEntropy - newEntropy
        if infoGain > bestInfoGain:
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature

def majorityCnt(classList):
    classCount = {}
    for vote in classList:
        classCount[vote] = classCount.get(vote, 0) + 1
    sortedClassCount = sorted(classCount.items(), key = operator.itemgetter(1), reverse = True)
    return sortedClassCount[0][0]

def createTree(dataSet, labels):
    classList = [example[-1] for example in dataSet]
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
    bestFeat = chooseBestFectureToSplit(dataSet)
    bestFeatLabel = labels[bestFeat]
    myTree = {bestFeatLabel:{}}
    del(labels[bestFeat])
    featValues = [example[bestFeat] for example in dataSet]
    uniqueVals = set(featValues)
    for value in uniqueVals:
        subLabels = labels[:]
        myTree[bestFeatLabel][value] = createTree(splitDataSet(dataSet, bestFeat, value), subLabels)
    return myTree

if __name__ == '__main__':
    # each line of lenses.txt holds tab-separated feature values followed by the class
    with open('lenses.txt') as fr:
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']
    print(createTree(lenses, lensesLabels))
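The parsing step assumes one record per line with tab-separated fields, the class being the last field. The sketch below runs the same list comprehension over two hypothetical sample lines held in memory. The rows are made up for illustration to be consistent with the decision tree for this dataset and are not copied from `lenses.txt`.

```python
import io

# Two hypothetical records in the assumed format:
# age, prescript, astigmatic, tearRate, class
sample = io.StringIO(
    "young\tmyope\tno\treduced\tno lenses\n"
    "young\tmyope\tno\tnormal\tsoft\n"
)

# Same parsing as in the main function: strip the newline, split on tabs
rows = [line.strip().split('\t') for line in sample]
for row in rows:
    print(row)
```

Splitting on tabs rather than on any whitespace matters here because the class `no lenses` itself contains a space.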

For the full lenses dataset, the resulting decision tree should be:

{
    'tearRate': {
        'normal': {
            'astigmatic': {
                'no': {
                    'age': {
                        'young': 'soft',
                        'presbyopic': {
                            'prescript': {
                                'myope': 'no lenses',
                                'hyper': 'soft'
                            }
                        },
                        'pre': 'soft'
                    }
                },
                'yes': {
                    'prescript': {
                        'myope': 'hard',
                        'hyper': {
                            'age': {
                                'young': 'hard',
                                'presbyopic': 'no lenses',
                                'pre': 'no lenses'
                            }
                        }
                    }
                }
            }
        },
        'reduced': 'no lenses'
    }
}
