Data Preprocessing in Machine Learning

import numpy as np
# 100 random integers in [0, 1000), reshaped into a single feature column
X_train = np.random.randint(0, 1000, 100).reshape(-1, 1)
X_test = np.random.randint(0, 1000, 100).reshape(-1, 1)

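Min-max normalization

When the features of a dataset differ greatly in scale, for example a sample with two features such as age and height, the feature with the larger values dominates. We therefore rescale the data with min-max normalization, which maps every value into the range 0 to 1: X_scaled = (X - X_min) / (X_max - X_min). The drawback is that this method is strongly affected by outliers, because a single extreme value changes X_min or X_max.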

from sklearn.preprocessing import MinMaxScaler
minMaxScaler = MinMaxScaler()
minMaxScaler.fit(X_train)  # learn X_min and X_max from the training set only
X_test_minmax = minMaxScaler.transform(X_test)  # keep the raw X_test for later use
# Integer input is converted to float64; older scikit-learn versions report this with a DataConversionWarning.
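
The min-max formula above can also be checked by hand. The following is a minimal NumPy sketch under stated assumptions: the toy arrays `X_demo_train` and `X_demo_test` and the helper `min_max_scale` are illustrative names that do not appear in the original post. It takes the minimum and maximum from the training data only, and then shows how one outlier squeezes the remaining values towards 0.

```python
import numpy as np

def min_max_scale(X, X_min, X_max):
    """Apply (X - X_min) / (X_max - X_min) column by column."""
    return (X - X_min) / (X_max - X_min)

# Hypothetical toy data: one feature, float dtype.
X_demo_train = np.array([[10.0], [20.0], [30.0], [50.0]])
X_demo_test = np.array([[20.0], [40.0]])

# The statistics come from the training set only.
X_min = X_demo_train.min(axis=0)
X_max = X_demo_train.max(axis=0)

train_scaled = min_max_scale(X_demo_train, X_min, X_max)
test_scaled = min_max_scale(X_demo_test, X_min, X_max)

# Outlier sensitivity: add one extreme value and rescale.
X_with_outlier = np.vstack([X_demo_train, [[1000.0]]])
outlier_scaled = min_max_scale(
    X_with_outlier,
    X_with_outlier.min(axis=0),
    X_with_outlier.max(axis=0),
)

print(train_scaled.ravel())    # training values span 0 to 1
print(test_scaled.ravel())     # test values scaled with the training min/max
print(outlier_scaled.ravel())  # the first four values are now close to 0
```

With the outlier present, the four original values all land below 0.05, which is the sensitivity the text warns about.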

Standardization

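Standardization rescales the data to a distribution with mean 0 and variance 1: X_standardization = (X - X_mean) / S, where S is the standard deviation. Like min-max normalization, it helps balance the different features in a dataset. Because we want to simulate the real environment, the test set must be transformed with the mean and standard deviation computed from the training set, not with statistics computed from the test set itself.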

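from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)  # learn the mean and standard deviation from the training set only
X_test_standard = standardScaler.transform(X_test)  # scale the test set with the training statistics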
C:\Users\asus\Anaconda3\lib\site-packages\sklearn\utils\validation.py:475: DataConversionWarning: Data with input dtype int32 was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
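
The rule of fitting on the training set and reusing those statistics on the test set can be sketched by hand as follows. This is an illustrative sketch, not the original author's code: the names `X_demo_train`, `X_demo_test`, `train_mean`, and `train_std` are assumptions, and the data is cast to float to sidestep the dtype conversion.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo_train = rng.integers(0, 1000, size=100).reshape(-1, 1).astype(float)
X_demo_test = rng.integers(0, 1000, size=100).reshape(-1, 1).astype(float)

# Mean and standard deviation are estimated on the training set only.
train_mean = X_demo_train.mean(axis=0)
train_std = X_demo_train.std(axis=0)  # population std (ddof=0), which StandardScaler also uses

manual_train = (X_demo_train - train_mean) / train_std
manual_test = (X_demo_test - train_mean) / train_std

# scikit-learn's StandardScaler, fitted on the same training data.
scaler = StandardScaler().fit(X_demo_train)
sk_test = scaler.transform(X_demo_test)

print(manual_train.mean(), manual_train.std())  # approximately 0 and 1
print(np.allclose(manual_test, sk_test))        # manual result matches StandardScaler
```

The transformed test set generally does not have a mean of exactly 0 or a standard deviation of exactly 1, because it is scaled with the training statistics. That is expected.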

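Summary

Min-max normalization is suitable when the data has clear boundaries. Both kinds of scaling help balance the different features in a dataset, and that is the whole point of feature scaling.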

Polynomial features

import numpy as np
X = np.random.uniform(-3, 3, size=100).reshape(-1, 1)
X.shape
(100, 1)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)  # degree of the highest-order term
poly.fit(X)
X_poly = poly.transform(X)  # columns: 1, x, x^2, x^3
X_poly.shape
(100, 4)
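
To see why one input column becomes four, the generated columns can be rebuilt by hand. This is a small sketch on a hypothetical three-row array (`X_demo`, `poly_demo`, and `manual` are illustrative names). It assumes the default `include_bias=True`, so the first column is the constant 1.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical toy input: three samples, one feature.
X_demo = np.array([[2.0], [-1.0], [0.5]])

poly_demo = PolynomialFeatures(degree=3)
X_demo_poly = poly_demo.fit_transform(X_demo)

# Build the same columns by hand: x**0, x**1, x**2, x**3.
manual = np.hstack([X_demo ** k for k in range(4)])

print(X_demo_poly.shape)                 # (3, 4)
print(np.allclose(X_demo_poly, manual))  # True
```

For a single feature, degree 3 therefore yields the bias column plus the first, second, and third powers of x.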

Reposted from blog.csdn.net/qq_41861526/article/details/88617856