Ensemble Classification with AdaBoost in Practice (Decision Stump Principles and Code Explained) --- Machine Learning

First, a brief review of the theory (the theory below comes from Machine Learning in Action):

AdaBoost

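The general AdaBoost workflow is as follows:

(1) Collect the data

(2) Prepare the data: this depends on the type of base classifier. Here it is a single-level decision tree, i.e. a decision stump, which can handle any type of data.

(3) Analyze the data

(4) Train the algorithm: train the classifiers on the given data set

(5) Test the algorithm: compute the classification error rate on the given test set

(6) Use the algorithm: apply and extend it to meet practical needs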

Next, the AdaBoost classification algorithm in detail.

1 Training the algorithm: improving classifier performance by focusing on errors

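The base classifiers mentioned above, also called weak classifiers, are classifiers that do not perform very well. In a two-class problem, a weak classifier's error rate is typically just under 50%, which is only slightly better than random guessing. A strong classifier, by contrast, has a much lower error rate. AdaBoost makes its final prediction based on a combination of such weak classifiers.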

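How AdaBoost runs: every training sample is assigned a weight, and these weights form the weight vector D, whose length equals the number of samples in the data set. Initially all weights are equal. A weak classifier is first trained on the training set and its error rate is computed. Then another weak classifier is trained on the same data set, but before this second round the sample weights are adjusted according to the first classifier's results: the weights of correctly classified samples go down and the weights of misclassified samples go up, while the weights still sum to 1.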

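In addition, the final classifier assigns each trained weak classifier a coefficient alpha based on its error rate. A classifier with a lower error rate gets a larger alpha and therefore has more say when predicting. alpha is computed from the error rate:

alpha=0.5*ln((1-ε)/max(ε,1e-16))

where ε = number of misclassified samples / total number of samples (once the weights D are applied, ε is the sum of the weights of the misclassified samples), and max(ε,1e-16) prevents a zero denominator when the error rate is 0.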
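To get a feel for how alpha reacts to the error rate, the formula can be evaluated for a few values. This is only a small sketch; the function name `classifier_weight` is made up for illustration and does not appear in the original code.

```python
import math

def classifier_weight(error, eps=1e-16):
    """alpha = 0.5 * ln((1 - error) / max(error, eps)); eps guards against error == 0."""
    return 0.5 * math.log((1.0 - error) / max(error, eps))

# A lower error rate gives a larger alpha; an error of exactly 0.5 gives alpha = 0.
for err in (0.5, 0.4, 0.2, 0.0):
    print("error = %.1f -> alpha = %.4f" % (err, classifier_weight(err)))
```

With error 0.2 the result is about 0.6931, and with error 0 the guard keeps alpha finite (roughly 18.42) instead of dividing by zero.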

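Once alpha is known, the weight vector can be updated so that misclassified samples receive higher weights and correctly classified samples receive lower weights. D is computed as follows:

If a sample is classified correctly, its weight becomes:
D(m+1,i)=D(m,i)*exp(-alpha)/sum(D)

If a sample is misclassified, its weight becomes:
D(m+1,i)=D(m,i)*exp(alpha)/sum(D)

where m is the iteration number, i.e. the m-th classifier being trained, i is the index of the i-th component of the weight vector (i <= number of samples in the data set), and sum(D) is the sum of all the rescaled weights, which normalizes the new weights so that they again add up to 1.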
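A small worked example may make the update concrete. The sketch below assumes five samples with equal weights and a hypothetical weak classifier that gets only the first sample wrong; `update_weights` is an illustrative name, not part of the original code. Because labels and predictions are both +1 or -1, their product is +1 for a correct prediction and -1 for a wrong one, so both update rules collapse into a single expression.

```python
import numpy as np

def update_weights(D, labels, predictions, alpha):
    """Apply exp(-alpha) to correct samples, exp(+alpha) to wrong ones, then normalize."""
    D_new = D * np.exp(-alpha * labels * predictions)
    return D_new / D_new.sum()

labels = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
predictions = np.array([-1.0, 1.0, -1.0, -1.0, 1.0])  # hypothetical: only sample 0 is wrong
D = np.full(5, 0.2)                                    # equal initial weights

error = D[labels != predictions].sum()                 # weighted error = 0.2
alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-16))
D_next = update_weights(D, labels, predictions, alpha)
print("alpha:", alpha)
print("new weights:", D_next)
```

The misclassified sample ends up with weight 0.5 and the four correct ones with 0.125 each, so the next weak classifier is pushed to get that sample right.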

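After the sample weights have been updated, the next training iteration can start. AdaBoost keeps repeating the train-and-reweight cycle until the number of iterations is reached or the training error rate drops to 0.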

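2 Building a weak classifier from a decision stump

A single-level decision tree is a very simple decision tree, also called a decision stump. Note that it differs from the decision trees covered earlier, mainly in how it decides: earlier the split was chosen by entropy, whereas here the decision is made with a single threshold. A sample whose feature value is above (or below) the threshold is classified as 1 or -1, and that is all there is to it.

How is the threshold with the lowest error rate found? By scanning. Every feature has a minimum and a maximum value in the given data. First decide how many candidate thresholds to try, say 10; the step size is then the difference between the maximum and the minimum divided by 10. The first threshold is set at the minimum (the code below actually starts one step below it), the data is classified and the error is computed, weighted by D. The threshold then moves forward by one step and the error is computed again. Once all thresholds have been tried, the one with the lowest error becomes the threshold of this decision stump.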
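The list of candidate thresholds for one feature can be sketched in a few lines. The helper name `candidate_thresholds` is hypothetical; the range `-1 .. num_steps` mirrors the loop used in the listing later in this post, which starts one step below the minimum.

```python
import numpy as np

def candidate_thresholds(values, num_steps=10):
    """Evenly spaced thresholds from one step below the minimum up to the maximum."""
    lo, hi = values.min(), values.max()
    step = (hi - lo) / num_steps
    return [lo + j * step for j in range(-1, num_steps + 1)]

# First feature of the toy data set used in this post
feature0 = np.array([1.0, 2.0, 1.3, 1.0, 2.0])
thresholds = candidate_thresholds(feature0)
print(len(thresholds), "thresholds:", np.round(thresholds, 2))
```

For this feature the scan covers 12 thresholds, from 0.9 to 2.0 in steps of 0.1.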

That is the whole principle. It can be confusing at first, so it is worth reading a few times. The code follows, with detailed comments:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: 赵守风
# File name: adaboost.py
# Time:2018/10/27
# Email:[email protected]
import numpy as np

# Build a small toy data set for the training below
def loadsimdata():
    datmat = np.matrix([[1, 2.1],
                        [2., 1.1],
                        [1.3, 1],
                        [1, 1],
                        [2., 1.]])
    classlabels = [1.0, 1.0, -1.0, -1.0, 1.0]

    return datmat, classlabels


# Decision stump classification function.
# It compares one feature of every sample against a threshold and assigns the class accordingly.
# It is easier to understand together with buildstump() below.
def stumpclassify(datamatrix, dimen, threshval, threshineq):
    '''
    :param datamatrix: data to classify
    :param dimen: index of the feature (column) to test
    :param threshval: threshold value
    :param threshineq: comparison direction, 'lt' or 'gt'
    :return: column vector of predicted labels (+1 / -1)
    '''
    retarray = np.ones((np.shape(datamatrix)[0], 1))  # default every prediction to +1
    if threshineq == 'lt':
        # 'lt': samples whose feature value is <= the threshold become -1
        retarray[datamatrix[:, dimen] <= threshval] = -1.0
    else:
        # 'gt': samples whose feature value is > the threshold become -1
        retarray[datamatrix[:, dimen] > threshval] = -1.0
    # The caller tries both directions at every threshold and keeps whichever has the lowest error
    return retarray


def buildstump(dataarr, classlabels, D):
    '''
    :param dataarr: input data
    :param classlabels: true class labels of the data
    :param D: weight vector of the samples (m x 1 matrix)
    :return: beststump, minerror, bestclasest, i.e. the stump, its weighted error, its predictions
    '''
    datamatrix = np.asmatrix(dataarr)  # convert the data to a matrix
    labelsmat = np.asmatrix(classlabels).T  # labels as a column matrix (np.mat was removed in NumPy 2.0)
    m, n = np.shape(datamatrix)  # m samples (rows), n features (columns)
    numsteps = 10.0  # number of steps used to scan each feature's value range
    beststump = {}  # dictionary that will hold the best stump found
    bestclasest = np.asmatrix(np.zeros((m, 1)))  # predictions of the best stump
    minerror = np.inf  # start with an infinitely large error

    for i in range(n):  # loop over all features
        rangemin = datamatrix[:, i].min()  # smallest value of this feature
        rangemax = datamatrix[:, i].max()  # largest value of this feature
        stepsize = (rangemax - rangemin)/numsteps  # distance between candidate thresholds
        for j in range(-1, int(numsteps)+1):  # loop over the candidate thresholds, starting one step below the minimum
            for inequal in ['lt', 'gt']:  # try both directions, lt = less than, gt = greater than
                threshval = (rangemin + float(j)*stepsize)  # threshold for this step
                predictedvals = stumpclassify(datamatrix, i, threshval, inequal)  # predict all samples with this stump
                errarr = np.asmatrix(np.ones((m, 1)))  # error indicator, 1 everywhere by default
                errarr[predictedvals == labelsmat] = 0  # 0 where the prediction is correct, 1 stays where it is wrong
                weightederror = (D.T*errarr)[0, 0]  # weighted error: sum of the weights of the misclassified samples
                print("split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f"\
                       % (i, threshval, inequal, weightederror))
                if weightederror < minerror:  # keep this stump if it beats the best one so far
                    minerror = weightederror
                    bestclasest = predictedvals.copy()
                    beststump['dim'] = i
                    beststump['thresh'] = threshval
                    beststump['ineq'] = inequal

    return beststump, minerror, bestclasest

Let's test it:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Author: 赵守风
# File name: test.py
# Time:2018/10/27
# Email:[email protected]
import numpy as np
import adaboost

datmat, classlabels = adaboost.loadsimdata()
print('datmat: ', datmat)
print('classlabels', classlabels)
D = np.asmatrix(np.ones((5, 1))/5)  # equal initial weights (np.asmatrix replaces np.mat, removed in NumPy 2.0)
adaboost.buildstump(datmat, classlabels, D)

Test output (partial):

datmat:  [[1.  2.1]
 [2.  1.1]
 [1.3 1. ]
 [1.  1. ]
 [2.  1. ]]
classlabels [1.0, 1.0, -1.0, -1.0, 1.0]
split: dim 0, thresh 0.90, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 0.90, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.00, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.00, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.10, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.10, thresh ineqal: gt, the weighted error is 0.600
split: dim 0, thresh 1.20, thresh ineqal: lt, the weighted error is 0.400
split: dim 0, thresh 1.20, thresh ineqal: gt, the weighted error is 0.600
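The log above is truncated, so it does not show which stump the scan finally keeps. The sketch below re-implements the same scan with plain NumPy arrays instead of matrices so the result can be checked quickly. The function `best_stump` and its array-based style are assumptions for illustration, not the author's code, though the search order and the strict `<` comparison are meant to follow the listing above.

```python
import numpy as np

def best_stump(X, y, D, num_steps=10):
    """Scan every feature, threshold and direction; keep the lowest weighted error."""
    best, min_err = {}, np.inf
    for dim in range(X.shape[1]):
        lo, hi = X[:, dim].min(), X[:, dim].max()
        step = (hi - lo) / num_steps
        for j in range(-1, num_steps + 1):
            thresh = lo + j * step
            for ineq in ('lt', 'gt'):
                # 'lt' sends values <= thresh to -1, 'gt' sends values > thresh to -1
                mask = X[:, dim] <= thresh if ineq == 'lt' else X[:, dim] > thresh
                pred = np.where(mask, -1.0, 1.0)
                err = D[pred != y].sum()  # sum of the weights of misclassified samples
                if err < min_err:
                    min_err = err
                    best = {'dim': dim, 'thresh': thresh, 'ineq': ineq}
    return best, min_err

X = np.array([[1.0, 2.1], [2.0, 1.1], [1.3, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0, 1.0])
D = np.full(5, 0.2)
stump, err = best_stump(X, y, D)
print(stump, err)
```

With equal weights, the scan should settle on feature 0 with a threshold of about 1.3 in the 'lt' direction, at a weighted error of 0.2. No single stump can do better on this data, because two samples share the same value in each feature but carry different labels.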

Below is the complete AdaBoost code based on decision stumps.

Pseudocode first:

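Complete AdaBoost implementation
Pseudocode
For each iteration:
    Find the best decision stump with buildstump()
    Append the best decision stump to the stump array
    Compute alpha
    Compute the new weight vector D
    Update the aggregate class estimate
    If the error rate equals 0.0, exit the loop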

The code is below. Once the principle is clear, it is quite easy to read.

def adaBoostTrainDS(dataArr,classLabels,numIt=40):
    weakClassArr = [] # stores the trained decision stumps
    m = np.shape(dataArr)[0] # number of samples (rows)
    D = np.asmatrix(np.ones((m,1))/m)   # sample weights start out equal (np.mat was removed in NumPy 2.0)
    aggClassEst = np.asmatrix(np.zeros((m,1)))  # running weighted sum of all stump predictions
    for i in range(numIt):
        bestStump,error,classEst = buildstump(dataArr,classLabels,D) # build the i-th decision stump
        print("D:",D.T)
        alpha = float(0.5*np.log((1.0-error)/max(error,1e-16))) # weight of this stump; max() avoids division by zero
        bestStump['alpha'] = alpha # store alpha with the stump
        weakClassArr.append(bestStump)  # save the i-th stump
        print("classEst: ", classEst.T)
        # exponent is -alpha for correctly classified samples and +alpha for misclassified ones
        expon = np.multiply(-1*alpha*np.asmatrix(classLabels).T,classEst)
        D = np.multiply(D, np.exp(expon))  # rescale the weights, see the update formulas above
        D = D/D.sum()  # normalize so the weights sum to 1
        # compute the training error of the combined classifier; stop early if it is 0
        aggClassEst += alpha*classEst
        print("aggClassEst: ",aggClassEst.T)
        aggErrors = np.multiply(np.sign(aggClassEst) != np.asmatrix(classLabels).T,np.ones((m,1)))
        errorRate = aggErrors.sum()/m
        print("total error: ",errorRate)
        if errorRate == 0.0:  # stop iterating once the training error reaches 0
            break
    return weakClassArr,aggClassEst

The test for this part is not posted here; it is very simple.
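For readers who want to go one step further, here is a rough sketch of how a list of trained stumps could be combined at prediction time: each stump votes +1 or -1, the votes are weighted by alpha, and the sign of the sum is the final class. The names `stump_predict` and `ada_classify` are hypothetical, and the three stumps are hand-written illustrative values, not output from an actual training run. Only the dictionary keys ('dim', 'thresh', 'ineq', 'alpha') follow the code above.

```python
import numpy as np

def stump_predict(X, dim, thresh, ineq):
    """Prediction of one decision stump: +1 by default, -1 on the selected side."""
    pred = np.ones(X.shape[0])
    if ineq == 'lt':
        pred[X[:, dim] <= thresh] = -1.0
    else:
        pred[X[:, dim] > thresh] = -1.0
    return pred

def ada_classify(X, stumps):
    """Sign of the alpha-weighted sum of all stump predictions."""
    X = np.asarray(X, dtype=float)
    agg = np.zeros(X.shape[0])
    for s in stumps:
        agg += s['alpha'] * stump_predict(X, s['dim'], s['thresh'], s['ineq'])
    return np.sign(agg)

# Hand-written stumps for illustration only (assumed values, not trained output)
stumps = [
    {'dim': 0, 'thresh': 1.3, 'ineq': 'lt', 'alpha': 0.69},
    {'dim': 1, 'thresh': 1.0, 'ineq': 'lt', 'alpha': 0.97},
    {'dim': 0, 'thresh': 0.9, 'ineq': 'lt', 'alpha': 0.90},
]
result = ada_classify([[5.0, 5.0], [0.0, 0.0]], stumps)
print(result)
```

With these stumps the point (5, 5) falls on the +1 side of every stump and (0, 0) on the -1 side, so the two points are classified as 1 and -1 respectively.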
