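When solving machine learning problems in practice, we sometimes run into "big data" problems: millions of samples with thousands or tens of thousands of feature dimensions, at which point the stored data reaches the 10 GB scale.
For a text classification problem you also have to extract text features, and loading everything into memory at that point takes far too much RAM. How can this be solved? 1. Reduce the dimensionality of the data? 2. Use streaming or stream-like processing? 3. Move to a bigger, high-memory machine, or use a Spark cluster.
This article introduces an incremental learning algorithm, PassiveAggressiveClassifier.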
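Workflow:
1. Streaming data
First, the algorithm must be fed streaming data or small batches, for example 1,000 samples at a time. You have to write this part yourself: implement a generator that returns one small batch each time it is called.
2. Feature extraction
Second, any feature extraction method supported by sklearn can be used. Some cases need special handling, for example when features must be standardized or when the set of possible feature values is not known in advance.
3. Incremental learning algorithm
Third, sklearn provides many incremental learning algorithms. Not every estimator can learn incrementally, but any estimator that provides a partial_fit method can. In fact, learning from small batches (sometimes called online learning) is the core of this approach, because it guarantees that only a small amount of data is in memory at any given time.
One such algorithm is sklearn.linear_model.PassiveAggressiveClassifier.
For classification problems, the first call to partial_fit must specify all class labels through the classes parameter.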
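The three steps above can be sketched end to end on toy data. This is a minimal illustration, not the code used in this article: the corpus, the labels, the helper name toy_minibatches and the choice of HashingVectorizer (a stateless extractor that needs no pass over the full data) are all assumptions made for the example.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Toy corpus standing in for a large file (illustrative data only).
texts = ["cheap red apple", "fresh green apple", "steel hammer tool",
         "iron nail tool", "sweet yellow banana", "heavy wrench tool"] * 50
labels = ["food", "food", "tool", "tool", "food", "tool"] * 50


def toy_minibatches(texts, labels, batch_size):
    """Yield consecutive (texts, labels) slices of at most batch_size items."""
    for start in range(0, len(texts), batch_size):
        yield texts[start:start + batch_size], labels[start:start + batch_size]


# HashingVectorizer is stateless, so it can transform each batch independently.
vectorizer = HashingVectorizer(n_features=2 ** 12)
clf = PassiveAggressiveClassifier(random_state=0)
all_classes = np.array(["food", "tool"])

for i, (batch_texts, batch_labels) in enumerate(toy_minibatches(texts, labels, 60)):
    X = vectorizer.transform(batch_texts)
    if i == 0:
        # The first call must list every class that will ever appear.
        clf.partial_fit(X, batch_labels, classes=all_classes)
    else:
        clf.partial_fit(X, batch_labels)

predictions = clf.predict(vectorizer.transform(["green apple", "iron hammer"]))
```

Only one batch of 60 documents is vectorized at a time, so peak memory depends on the batch size rather than on the size of the corpus.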
import numpy as np
import pandas as pd
from sklearn.utils import shuffle


def iter_minibatches(filename, minibatch_size=1000):
    '''
    Generator: given a (large) CSV file, yield minibatch_size rows at a time
    (1,000 by default) as NumPy arrays X, y.
    '''
    # Note: read_csv still loads the whole file here; only the batches
    # handed to the model are small.
    reader = pd.read_csv(filename)  # pass encoding='gb18030' if needed
    # Tokenize the product names (sjcl is the author's own helper, not shown)
    reader['HWMC'] = sjcl(list(reader['HWMC'].astype(str)))
    # Replace empty strings with NaN, then drop those rows
    reader['HWMC'] = reader['HWMC'].apply(lambda s: np.nan if str(s) == '' else s)
    reader = reader[reader['HWMC'].notnull()]
    reader = shuffle(reader).reset_index(drop=True)

    x, y = [], []
    for text, label in zip(reader['HWMC'], reader['U_CODE']):
        x.append(text)
        y.append(label)
        if len(x) >= minibatch_size:
            # Convert the batch to NumPy arrays and hand it out
            yield np.array(x), np.array(y)
            x, y = [], []
    if x:
        # Also yield the final, possibly smaller batch
        yield np.array(x), np.array(y)
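The generator above still parses the entire CSV before it yields anything. If the file itself does not fit in memory, one possible alternative is to let pandas read it in chunks. The sketch below is an assumption-laden variant, not the author's code: the function name iter_csv_minibatches and its parameters are hypothetical, and the tokenization step (sjcl) is left out.

```python
import pandas as pd


def iter_csv_minibatches(path, text_col, label_col, minibatch_size=1000):
    """Yield (X, y) NumPy arrays of at most minibatch_size rows each.

    pandas parses only one chunk at a time, so memory use is bounded by
    minibatch_size instead of by the file size.
    """
    for chunk in pd.read_csv(path, chunksize=minibatch_size,
                             usecols=[text_col, label_col]):
        # Drop rows with a missing text or label, then rows with blank text.
        chunk = chunk.dropna(subset=[text_col, label_col])
        text = chunk[text_col].astype(str).str.strip()
        chunk = chunk[text != ""]
        if chunk.empty:
            continue
        yield (chunk[text_col].astype(str).to_numpy(),
               chunk[label_col].to_numpy())
```

The trade-off is that rows arrive in file order: a global shuffle is no longer possible, so the file would need to be shuffled once on disk beforehand if the row order is correlated with the labels.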
Training code. Do not copy it verbatim; adapt it to your own business requirements and do the feature extraction properly.
import datetime

import joblib  # sklearn.externals.joblib no longer exists in recent scikit-learn
import pandas as pd
from sklearn import metrics


def now():
    return datetime.datetime.now().strftime('%Y.%m.%d-%H:%M:%S')


# Signature inferred from the call sp.fitby_linear_model(path, models, size).
# get_classes and get_hv are the author's own helpers (not shown).
def fitby_linear_model(filename, models, size):
    df_sc = pd.DataFrame([[0, 0, 0]], columns=['model', 'time', 'score'])
    num = 1
    for model in models:
        MD = models[model]
        print("collecting classes", now())
        all_classes = get_classes(filename)
        minibatch_train_iterators = iter_minibatches(filename, size)
        # Hold out the first mini-batch as the test set
        x_test, y_test = next(minibatch_train_iterators)
        print("start training", now())
        for i, (X_train, y_train) in enumerate(minibatch_train_iterators):
            print("{} time".format(i))  # current batch number
            # partial_fit needs classes on its first call; passing it every time is allowed
            MD.partial_fit(get_hv(X_train), y_train, classes=all_classes)
        # Check the result on the held-out batch
        result = MD.predict(get_hv(x_test))
        score = metrics.accuracy_score(y_test, result)
        print(model, "score: %.4g" % score)
        df_sc.loc[num] = {'model': model, 'time': now(), 'score': score}
        if df_sc.score[num] > df_sc.score[num - 1]:
            print("training finished, saving model", now())
            joblib.dump(MD, "/root/lizheng/model/model_learn1%s.pkl.gz" % model,
                        compress=('gzip', 3))
        num += 1  # next row for the next model
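The training code relies on two helpers, get_classes and get_hv, whose definitions are not given in this article. The sketch below shows one way they might look. Everything in it is an assumption: the use of HashingVectorizer, its n_features value, the default label column name and the chunk size are illustrative choices, not the author's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import HashingVectorizer

# Assumed settings; a stateless hasher needs no vocabulary built from the full data.
_hasher = HashingVectorizer(n_features=2 ** 18)


def get_hv(texts):
    """Map an iterable of strings to a sparse hashed feature matrix."""
    return _hasher.transform(texts)


def get_classes(filename, label_col="U_CODE", chunksize=100_000):
    """Collect the sorted unique labels by scanning only the label column."""
    seen = set()
    for chunk in pd.read_csv(filename, usecols=[label_col], chunksize=chunksize):
        seen.update(chunk[label_col].dropna().unique())
    return np.array(sorted(seen))
```

Scanning the label column in chunks keeps this step cheap even for a very large file, and the resulting array can be passed directly as the classes argument of partial_fit.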
import sys

from sklearn.linear_model import PassiveAggressiveClassifier

# sys.path.append("D:/PDM/SPBM")
sys.path.append("/root/lizheng")

models_learn = {
    # 'pa1-0.6': PassiveAggressiveClassifier(C=0.6, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.4
    # 'pa1-0.7': PassiveAggressiveClassifier(C=0.7, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.8': PassiveAggressiveClassifier(C=0.8, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-0.9': PassiveAggressiveClassifier(C=0.9, max_iter=10000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),
    # 'pa1-1': PassiveAggressiveClassifier(C=1, max_iter=100000, loss='squared_hinge', average=True, n_jobs=-1, random_state=1),  # 71.6
    'pa4-1': PassiveAggressiveClassifier(C=2, max_iter=10000, loss='hinge', average=True, n_jobs=-1, random_state=1),
}

# sp is the author's own module (found via the path appended above) that defines fitby_linear_model
sp.fitby_linear_model('/root/lizheng/fcqspbm_1214a.csv', models_learn, 1000000)
sklearn.linear_model.PassiveAggressiveClassifier
class sklearn.linear_model.PassiveAggressiveClassifier(C=1.0, fit_intercept=True, max_iter=None, tol=None, shuffle=True, verbose=0, loss='hinge', n_jobs=1, random_state=None, warm_start=False, class_weight=None, average=False, n_iter=None)
Passive Aggressive Classifier
Parameters:
C : float
Maximum step size (regularization). Defaults to 1.0.
fit_intercept : bool, default=True
Whether the intercept should be estimated or not. If False, the data is assumed to be already centered.
max_iter : int, optional
The maximum number of passes over the training data (aka epochs). It only impacts the behavior in the fit method, and not the partial_fit. Defaults to 5. Defaults to 1000 from 0.21, or if tol is not None.
New in version 0.19.
tol : float or None, optional
The stopping criterion. If it is not None, the iterations will stop when (loss > previous_loss - tol). Defaults to None. Defaults to 1e-3 from 0.21.
New in version 0.19.
shuffle : bool, default=True
Whether or not the training data should be shuffled after each epoch.
verbose : integer, optional
The verbosity level
loss : string, optional
The loss function to be used: hinge: equivalent to PA-I in the reference paper. squared_hinge: equivalent to PA-II in the reference paper.
n_jobs : integer, optional
The number of CPUs to use to do the OVA (One Versus All, for multi-class problems) computation. -1 means ‘all CPUs’. Defaults to 1.
random_state : int, RandomState instance or None, optional, default=None
The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.
warm_start : bool, optional
When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
class_weight : dict, {class_label: weight} or “balanced” or None, optional
Preset for the class_weight fit parameter.
Weights associated with classes. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
New in version 0.17: parameter class_weight to automatically weight samples.
average : bool or int, optional
When set to True, computes the averaged SGD weights and stores the result in the coef_ attribute. If set to an int greater than 1, averaging will begin once the total number of samples seen reaches average. So average=10 will begin averaging after seeing 10 samples.
New in version 0.19: parameter average to use weights averaging in SGD
n_iter : int, optional
The number of passes over the training data (aka epochs). Defaults to None. Deprecated, will be removed in 0.21.
Changed in version 0.19: Deprecated
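To make the "balanced" formula from the class_weight entry concrete, the sketch below evaluates n_samples / (n_classes * np.bincount(y)) on a small made-up label vector and compares it with sklearn.utils.class_weight.compute_class_weight. The labels are illustrative only. As far as I know, partial_fit rejects class_weight="balanced" because class frequencies cannot be estimated from a single batch, so precomputing a weight dictionary like this is the usual workaround for incremental training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 0, 1, 1, 2])  # illustrative labels: 4, 2 and 1 samples
classes = np.unique(y)

# The formula from the parameter description, evaluated by hand.
n_samples, n_classes = len(y), len(classes)
manual = n_samples / (n_classes * np.bincount(y))

# The same weights computed by scikit-learn.
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# A dict like this can be passed as class_weight={label: weight}.
weights = dict(zip(classes.tolist(), from_sklearn.tolist()))
```

The rarest class receives the largest weight, which is what "inversely proportional to class frequencies" means here.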
Attributes:
coef_ : array, shape = [1, n_features] if n_classes == 2 else [n_classes, n_features]
Weights assigned to the features.
intercept_ : array, shape = [1] if n_classes == 2 else [n_classes]
Constants in decision function.
n_iter_ : int
The actual number of iterations to reach the stopping criterion. For multiclass fits, it is the maximum over every binary fit.
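For reference, the two loss options listed above correspond to the following update rules in the binary case, written here from the reference paper (Crammer et al., 2006, "Online Passive-Aggressive Algorithms"). The notation (weights w_t, sample x_t, label y_t in {-1, +1}) is chosen for this note, and scikit-learn's implementation details such as the intercept and averaging are not covered.

```latex
% Hinge loss suffered on sample (x_t, y_t) with current weights w_t
\ell_t = \max\bigl(0,\; 1 - y_t \,(w_t \cdot x_t)\bigr)

% Common update: move the weights along y_t x_t by a step size \tau_t
w_{t+1} = w_t + \tau_t \, y_t \, x_t

% PA-I  (loss='hinge'): the step size is capped by C
\tau_t = \min\!\left(C,\; \frac{\ell_t}{\lVert x_t \rVert^2}\right)

% PA-II (loss='squared_hinge'): C enters as a smoothing term instead of a cap
\tau_t = \frac{\ell_t}{\lVert x_t \rVert^2 + \frac{1}{2C}}
```

When the loss is zero the weights stay unchanged (the "passive" part); otherwise the update is as large as needed to correct the mistake, limited by C (the "aggressive" part). This is why C is described as the maximum step size.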