To land a good job or internship down the road, I've decided to start writing blog posts of my own, recording what I learn and laying the groundwork for our weekly seminar.
When it comes to machine learning, I think the most effective approach in data mining today is ensembling, which comes in two flavors: bagging and boosting. Using a single SVM or decision tree on its own is nearly obsolete; the current weapons of choice are methods like GBDT, AdaBoost, LightGBM, XGBoost, and random forests.
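For a quick point of reference before the hand-rolled version below, sklearn ships ready-made implementations of both ideas. A minimal sketch (note that older sklearn releases call the first parameter base_estimator, newer ones renamed it to estimator, so I pass it positionally):

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# bagging: train many trees on bootstrap samples, then vote
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100)
# boosting: train weak learners sequentially, reweighting the mistakes
boost = AdaBoostClassifier(n_estimators=100)

for name, model in [("bagging", bag), ("boosting", boost)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))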
AdaBoost is actually used in face detection, where ensembling Haar classifiers seems to work quite well. Today I wrote an AdaBoost implementation based on Li Hang's Statistical Learning Methods; for the underlying theory, see pages 138-139 of that book.
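For reference, here are the key formulas of Algorithm 8.1 as I transcribe them (notation follows the book; G_m is the m-th base classifier, w_{mi} the weight of sample i in round m):

$$e_m = \sum_{i=1}^{N} w_{mi}\, I\big(G_m(x_i) \neq y_i\big)$$

$$\alpha_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

$$w_{m+1,i} = \frac{w_{mi}}{Z_m} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big), \qquad Z_m = \sum_{i=1}^{N} w_{mi} \exp\big(-\alpha_m\, y_i\, G_m(x_i)\big)$$

$$G(x) = \operatorname{sign}\Big(\sum_{m=1}^{M} \alpha_m G_m(x)\Big)$$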
The code I'm posting here follows Algorithm 8.1 from Li Hang's book step by step, so reading the two side by side works best. It is a simple AdaBoost classifier using an SVM as the base learner (where a decision tree would be the more usual choice). One small trick: the sample_weight parameter of sklearn classifiers handles the reweighting at each iteration for us. If you want to build your own base classifier, you can roll your own by subclassing sklearn.base.BaseEstimator; I plan to read the sklearn API in depth and write a new post on how to do that properly, but a rough sketch is included after the code below.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.datasets import make_hastie_10_2

'''
Thanks to The Elements of Statistical Learning.
This is an AdaBoost over SVM, by JXinyee.
I just want to try.
'''

# compute the error rate
def error_rate(y, pred):
    return sum(y != pred) / len(y)

# compute the base estimator's error rate on its own
def initclf(X_train, y_train, X_test, y_test, clf):
    clf.fit(X_train, y_train)
    y_train_pred = clf.predict(X_train)
    y_test_pred = clf.predict(X_test)
    train_err = error_rate(y_train_pred, y_train)
    test_err = error_rate(y_test_pred, y_test)
    return train_err, test_err

# AdaBoost (Algorithm 8.1)
'''
M is the number of boosting rounds
clf is the base estimator
'''
def adaboost(X_train, y_train, X_test, y_test, M, clf):
    n_train = len(X_train)
    n_test = len(X_test)
    # start with uniform weights; the aggregated predictions start at zero
    w = np.ones(n_train) / n_train
    pred_train, pred_test = list(np.zeros(n_train)), list(np.zeros(n_test))
    for m in range(M):
        # rescale so the weights average to 1, which libsvm handles better
        w1 = w * n_train
        clf.fit(X_train, y_train, sample_weight=w1)
        y_train_m = clf.predict(X_train)
        y_test_m = clf.predict(X_test)
        # miss: 0/1 indicator of misclassification, weighted by w in step 8.1(b)
        miss = [int(x) for x in (y_train_m != y_train)]
        # miss1: the +1/-1 version, i.e. -y_i * G_m(x_i) from eq. 8.4
        miss1 = [x if x == 1 else -1 for x in miss]
        # note np.dot accepts an ndarray and a plain list;
        # this computes the weighted error rate and alpha_m
        error_m = np.dot(w, miss)
        alpha_m = 0.5 * np.log((1 - error_m) / error_m)
        # update weights: up-weight the misclassified points, then normalize by Z_m
        w = np.multiply(w, np.exp([alpha_m * x for x in miss1]))
        w = w / np.sum(w)
        # ensemble: accumulate alpha_m * G_m(x)
        pred_train = [sum(x) for x in zip(pred_train, [alpha_m * p for p in y_train_m])]
        pred_test = [sum(x) for x in zip(pred_test, [alpha_m * p for p in y_test_m])]
    pred_train, pred_test = np.sign(np.array(pred_train)), np.sign(np.array(pred_test))
    return error_rate(pred_train, y_train), error_rate(pred_test, y_test)

# plot the error curves
def plot_error_rate(er_train, er_test):
    df_err = pd.DataFrame([er_train, er_test]).T
    df_err.columns = ["Train", "Test"]
    plot1 = df_err.plot(linewidth=3, figsize=(8, 6),
                        color=["lightblue", "darkblue"], grid=True)
    plot1.set_xlabel('Number of iterations', fontsize=12)
    plot1.set_xticklabels(range(0, 450, 50))
    plot1.set_ylabel('Error rate', fontsize=12)
    plot1.set_title('Error rate vs number of iterations', fontsize=16)
    plt.axhline(y=er_test[0], linewidth=1, color='red', ls='dashed')

# just do it
if __name__ == '__main__':
    x, y = make_hastie_10_2()
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    svm_clf = svm.SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                      decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
                      max_iter=-1, probability=False, random_state=None, shrinking=True,
                      tol=0.001, verbose=False)
    er_train, er_test = initclf(X_train, y_train, X_test, y_test, svm_clf)
    er_train, er_test = [er_train], [er_test]
    for i in range(10, 410, 10):
        er_train_i, er_test_i = adaboost(X_train, y_train, X_test, y_test, i, svm_clf)
        er_train.append(er_train_i)
        er_test.append(er_test_i)
    plot_error_rate(er_train, er_test)
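As for the custom base classifier mentioned above: here is a minimal sketch of a sample_weight-aware decision stump built on sklearn.base.BaseEstimator and ClassifierMixin. The class name and the brute-force threshold search are my own illustration, not from the book; it is only meant to show the interface that adaboost() above expects.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class WeightedStump(BaseEstimator, ClassifierMixin):
    """A one-feature threshold classifier that honors sample_weight,
    so it can plug into the adaboost() function above."""

    def fit(self, X, y, sample_weight=None):
        X, y = np.asarray(X), np.asarray(y)
        if sample_weight is None:
            sample_weight = np.ones(len(y)) / len(y)
        best_err = np.inf
        # brute-force search over features, thresholds and polarities
        for j in range(X.shape[1]):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] > thr, 1, -1)
                    err = np.dot(sample_weight, pred != y)
                    if err < best_err:
                        best_err = err
                        self.feature_, self.thr_, self.sign_ = j, thr, sign
        return self

    def predict(self, X):
        X = np.asarray(X)
        return self.sign_ * np.where(X[:, self.feature_] > self.thr_, 1, -1)

You could then pass WeightedStump() in place of svm_clf in the main block; the O(n^2)-per-feature search is far too slow for real use, but it shows what fit(X, y, sample_weight) and predict need to look like.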
I've attached a screenshot of my Jupyter test; you can see the error rate falling steadily. However, once you reach around 100 iterations, you may get warnings like
warning: class label 0 specified in weight is not found
warning: class label 1 specified in weight is not found
These come from a small quirk of sklearn's SVM classifier: once you rescale the weights at the sample_weight=w1 step, some instances end up with weights so tiny that libsvm effectively cannot see them and complains. Using a DecisionTreeClassifier as the base estimator avoids the problem entirely.
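Swapping in the tree is a one-line change; something along these lines (the depth-1 setting is my choice, since a stump is the classic AdaBoost weak learner):

from sklearn.tree import DecisionTreeClassifier

# decision trees handle near-zero sample weights gracefully,
# so the libsvm class-weight warning never appears
tree_clf = DecisionTreeClassifier(max_depth=1)
er_train, er_test = initclf(X_train, y_train, X_test, y_test, tree_clf)
er_train, er_test = [er_train], [er_test]
for i in range(10, 410, 10):
    er_train_i, er_test_i = adaboost(X_train, y_train, X_test, y_test, i, tree_clf)
    er_train.append(er_train_i)
    er_test.append(er_test_i)
plot_error_rate(er_train, er_test)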