Data preprocessing with sklearn.preprocessing

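scikit-learn's preprocessing module provides a range of classes for data preprocessing: standardization, normalization, imputation of missing values, encoding of categorical features, and custom data transformations. sklearn.preprocessing includes the following tools: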

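Data standardization

Data standardization is an important step, especially in supervised learning. A more complex model has greater learning capacity, but the performance of any current model depends even more on a high-quality dataset, and that is difficult to obtain. Building a high-quality dataset therefore still takes a great deal of labor and resources for data collection and labeling.

Data standardization is a preprocessing operation applied to the features of a dataset. It rescales each feature to zero mean and unit variance, matching the mean and variance of the standard Gaussian distribution $N(0, 1)$ (it does not make the data Gaussian), so that features on a large scale do not dominate the learning and convergence of the algorithm.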

## StandardScaler
# sklearn.preprocessing.StandardScaler(copy=True, with_mean=True, with_std=True)
from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

scaler = preprocessing.StandardScaler()
scaler

StandardScaler(copy=True, with_mean=True, with_std=True)

scaler.fit_transform(X_train)

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

print(scaler.mean_)
print(scaler.var_)

[1.         0.         0.33333333]
[0.66666667 0.66666667 1.55555556]
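
In practice the scaler is usually fitted on the training data only and then reused on new data. The following is a minimal sketch of that pattern; `X_test` is a made-up sample introduced here purely for illustration. It also checks that the result matches the manual z-score computed with the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
# Hypothetical unseen sample, used only for illustration.
X_test = np.array([[-1., 1., 0.]])

scaler = StandardScaler().fit(X_train)

# Manual z-score with the population standard deviation (ddof=0),
# which should agree with StandardScaler on the training data.
manual_train = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
same_as_sklearn = np.allclose(manual_train, scaler.transform(X_train))

# The test sample is scaled with the training statistics, not its own.
X_test_scaled = scaler.transform(X_test)
manual_test = (X_test - scaler.mean_) / scaler.scale_

print(same_as_sklearn)
print(X_test_scaled)
```

Reusing the fitted `mean_` and `scale_` keeps the training and test data on the same scale.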

## scale()
# sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)
# The axis argument selects the axis along which the data is standardized

x_scaled = preprocessing.scale(X_train)
x_scaled

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

x_scaled.mean(axis = 0)

array([0., 0., 0.])

x_scaled.var(axis = 0)

array([1., 1., 1.])
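
As a small illustration of the axis argument, the sketch below standardizes each row (sample) instead of each column (feature) by passing `axis=1`. It reuses the same `X_train` values as above.

```python
import numpy as np
from sklearn.preprocessing import scale

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# axis=1 standardizes every row independently.
x_row_scaled = scale(X_train, axis=1)

row_means = x_row_scaled.mean(axis=1)
row_stds = x_row_scaled.std(axis=1)
print(row_means)
print(row_stds)
```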

## MaxAbsScaler
# sklearn.preprocessing.MaxAbsScaler(copy=True)
# Scales each feature by its maximum absolute value, so the results lie in [-1, 1]
maxabsscaler = preprocessing.MaxAbsScaler()
maxabsscaler

MaxAbsScaler(copy=True)

maxabsscaler.fit_transform(X_train)

array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])

maxabsscaler.scale_

array([2., 1., 2.])

maxabsscaler.max_abs_

array([2., 1., 2.])

## MinMaxScaler
# sklearn.preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
# Scales each feature to the given range, using the feature's minimum and maximum
# X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
# X_scaled = X_std * (max - min) + min
# where min, max = feature_range

minmaxscaler = preprocessing.MinMaxScaler()
minmaxscaler

MinMaxScaler(copy=True, feature_range=(0, 1))

minmaxscaler.fit_transform(X_train)

array([[0.5       , 0.        , 1.        ],
       [1.        , 0.5       , 0.33333333],
       [0.        , 1.        , 0.        ]])
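
To make the role of `feature_range` concrete, here is a short sketch that scales the same data to `[-1, 1]` and compares the result with the mapping `X_std * (max - min) + min` computed by hand. The names `lo` and `hi` are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

lo, hi = -1.0, 1.0  # target range, chosen for illustration
scaler = MinMaxScaler(feature_range=(lo, hi))
X_scaled = scaler.fit_transform(X_train)

# Manual computation: first map each column to [0, 1], then stretch to [lo, hi].
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
X_std = (X_train - col_min) / (col_max - col_min)
manual = X_std * (hi - lo) + lo

print(X_scaled)
```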

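Sparse data can strongly affect how a model learns, and sometimes it still has to be scaled. Centering would destroy the sparsity, so use MaxAbsScaler or maxabs_scale, which only divide by the maximum absolute value and can handle sparse data.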

sklearn.preprocessing.maxabs_scale(X, axis=0, copy=True)
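
The sketch below applies MaxAbsScaler to a SciPy sparse matrix and checks that the output stays sparse with the same number of stored non-zero entries. The matrix values are invented for illustration.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler

# Small sparse matrix with one non-zero entry per column (illustrative values).
X_sparse = sparse.csr_matrix(np.array([[0., 0., 4.],
                                       [2., 0., 0.],
                                       [0., -5., 0.]]))

scaled = MaxAbsScaler().fit_transform(X_sparse)

is_still_sparse = sparse.issparse(scaled)
nnz_before, nnz_after = X_sparse.nnz, scaled.nnz
dense = scaled.toarray()

print(is_still_sparse, nnz_before, nnz_after)
print(dense)
```

Because no centering takes place, zeros stay zeros and only the non-zero entries are divided by the column's maximum absolute value.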

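Outliers can be handled with robust_scale and RobustScaler. RobustScaler is more robust to outliers: it subtracts the median and then divides by the interquartile range (IQR), that is, the difference between the third quartile and the first quartile.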

sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
sklearn.preprocessing.robust_scale(X, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)
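
A minimal sketch of the median/IQR computation on a single feature containing one obvious outlier. The data is made up for illustration, and the manual result is compared with RobustScaler using its default `quantile_range=(25.0, 75.0)`.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with an outlier (100.) among otherwise small values.
X = np.array([[1.], [2.], [3.], [4.], [100.]])

robust = RobustScaler().fit(X)
X_robust = robust.transform(X)

# Manual computation: subtract the median, divide by Q3 - Q1.
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

print(robust.center_, robust.scale_)
print(X_robust.ravel())
```

The median and the quartiles are barely affected by the outlier, so the inliers keep a sensible scale while the outlier remains far away from them.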

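Data normalization

Normalization rescales each sample (row) to unit norm, a transformation on the vector space model that is widely used in classification and clustering. sklearn provides Normalizer and normalize() for this:

sklearn.preprocessing.Normalizer(norm='l2', copy=True)
sklearn.preprocessing.normalize(X, norm='l2', axis=1, copy=True, return_norm=False)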

x_normalized = preprocessing.normalize(X_train, norm='l2')
x_normalized

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])

# Build a normalizer from the data (fit is stateless here; it only validates the input)
normalizer = preprocessing.Normalizer().fit(X_train)

# Apply the transformation
normalizer.transform(X_train)

array([[ 0.40824829, -0.40824829,  0.81649658],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.        ,  0.70710678, -0.70710678]])
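
As a quick sanity check, the sketch below verifies that each row has unit length after normalization, for both the `'l2'` and the `'l1'` norm.

```python
import numpy as np
from sklearn.preprocessing import normalize

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# L2: each row has Euclidean length 1 after normalization.
x_l2 = normalize(X_train, norm='l2')
l2_norms = np.linalg.norm(x_l2, axis=1)

# L1: the absolute values in each row sum to 1.
x_l1 = normalize(X_train, norm='l1')
l1_norms = np.abs(x_l1).sum(axis=1)

print(l2_norms)
print(l1_norms)
```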

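Creating polynomial features

Data with a clearly linear structure can be handled with ordinary linear regression or a linear classifier. When the data is more complex, linear features alone rarely yield a suitable model, and nonlinear features are worth considering. In sklearn, preprocessing.PolynomialFeatures() creates polynomial features.

sklearn.preprocessing.PolynomialFeatures(degree=2, interaction_only=False, include_bias=True, order='C')
- degree: the degree of the polynomial features; defaults to 2.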

poly = preprocessing.PolynomialFeatures(2)
poly

PolynomialFeatures(degree=2, include_bias=True, interaction_only=False,
                   order='C')

poly.fit_transform(X_train)

array([[ 1.,  1., -1.,  2.,  1., -1.,  2.,  1., -2.,  4.],
       [ 1.,  2.,  0.,  0.,  4.,  0.,  0.,  0.,  0.,  0.],
       [ 1.,  0.,  1., -1.,  0.,  0., -0.,  1., -1.,  1.]])
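
For three input features a, b, c, the ten columns above are 1, a, b, c, a², ab, ac, b², bc, c². The sketch below shows the effect of `interaction_only=True`, which drops the pure powers (a², b², c²) and keeps only the products of distinct features.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Keep the bias, the original features, and the pairwise products only.
poly_inter = PolynomialFeatures(degree=2, interaction_only=True)
X_inter = poly_inter.fit_transform(X_train)

# Columns: 1, a, b, c, ab, ac, bc
print(X_inter.shape)
print(X_inter)
```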

Custom transformers

sklearn.preprocessing.FunctionTransformer(func=None, inverse_func=None, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)

func: the callable that performs the transformation.

transformer = preprocessing.FunctionTransformer(np.log1p)
x = np.array([[0, 1], [2, 3]])
transformer.transform(x)

array([[0.        , 0.69314718],
       [1.09861229, 1.38629436]])
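
If the transformation should be reversible, `inverse_func` can be supplied as well. The following is a minimal sketch that pairs `np.log1p` with `np.expm1`; passing `validate=True` explicitly is an illustrative choice here, made so the input is checked as a 2D array regardless of the installed version's default.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# np.expm1 is the exact inverse of np.log1p.
transformer = FunctionTransformer(func=np.log1p,
                                  inverse_func=np.expm1,
                                  validate=True)

x = np.array([[0., 1.], [2., 3.]])
x_log = transformer.fit_transform(x)
x_back = transformer.inverse_transform(x_log)

print(x_log)
print(x_back)
```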

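Feature binarization

Feature binarization converts numerical feature values into boolean values. Use the Binarizer class; the default threshold is 0, so values greater than 0 become 1 and values less than or equal to 0 become 0.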

sklearn.preprocessing.Binarizer(threshold=0.0, copy=True)

binarizer = preprocessing.Binarizer().fit(X_train)
binarizer.transform(X_train)

array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
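
The threshold can be changed. As a short sketch, with `threshold=1.0` only values strictly greater than 1 map to 1; the threshold value is picked purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Values equal to the threshold (here 1.0) are mapped to 0.
binarizer = Binarizer(threshold=1.0)
X_bin = binarizer.fit_transform(X_train)
print(X_bin)
```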

Categorical features

sklearn.preprocessing.LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
sklearn.preprocessing.OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)
sklearn.preprocessing.OneHotEncoder(categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

X = ['male', 'female','female', 'female', 'male']
enc = preprocessing.LabelBinarizer()
enc.fit_transform(X)

array([[1],
       [0],
       [0],
       [0],
       [1]])

X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc = preprocessing.OrdinalEncoder()
enc.fit(X)

OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

enc.transform([['male', 'from US', 'uses Safari']])

array([[1., 1., 1.]])

enc = preprocessing.OneHotEncoder()
X = [['male', 'from US', 'uses Safari'], ['female', 'from Europe', 'uses Firefox']]
enc.fit(X)

enc.transform([['female', 'from US', 'uses Safari'],
               ['male', 'from Europe', 'uses Safari']]).toarray()

array([[1., 0., 0., 1., 0., 1.],
       [0., 1., 1., 0., 0., 1.]])
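
With the default `handle_unknown='error'`, transforming a category that was not seen during fit raises an error. The sketch below uses `handle_unknown='ignore'` instead, in which case the unseen category is encoded as all zeros in its block of columns. The value `'uses Chrome'` is a made-up category added for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['male', 'from US', 'uses Safari'],
     ['female', 'from Europe', 'uses Firefox']]

enc = OneHotEncoder(handle_unknown='ignore').fit(X)

# Learned categories per column, in sorted order.
categories = [list(c) for c in enc.categories_]

# 'uses Chrome' was never seen during fit (hypothetical value).
encoded = enc.transform([['female', 'from US', 'uses Chrome']]).toarray()

print(categories)
print(encoded)
```

The last two columns (the browser block) are both zero because neither known browser matches.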

Forlogen
