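I have recently been learning scikit-learn, an open-source machine learning toolkit built on NumPy, SciPy, and Matplotlib.
A typical scikit-learn workflow has three main steps:
- Data preparation and preprocessing
- Model selection and training
- Model validation and parameter tuning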
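The three steps can be sketched end to end as follows. This is only a minimal illustration, assuming the built-in iris data, a `StandardScaler` plus `LogisticRegression` pipeline, and a small hand-picked grid for `C`; none of these choices come from the text above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: data preparation and preprocessing
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 2: model selection and training (scaling is part of the pipeline)
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 3: validation and parameter tuning with 5-fold cross-validation
search = GridSearchCV(
    pipeline,
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},  # assumed grid
    cv=5,
)
search.fit(X_train, y_train)

print("best C:", search.best_params_["logisticregression__C"])
print("held-out accuracy:", search.score(X_test, y_test))
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into preprocessing.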
Logistic regression example
scikit-learn supports data in several formats, including the classic iris dataset, data in LibSVM format, and more.
from sklearn.datasets import load_svmlight_file  # loads LibSVM-format data
X,y = load_svmlight_file('filename')
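Since `'filename'` above is a placeholder, here is a small round trip that can be run as is. It is a sketch under assumptions: the toy matrix and the file name `toy.libsvm` are hypothetical, and the file is written with `dump_svmlight_file` only so that there is something to load.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# Hypothetical toy data: 3 samples, 4 features
X_dense = np.array([[1.0, 0.0, 2.0, 0.5],
                    [0.0, 3.0, 0.0, 1.5],
                    [4.0, 0.0, 0.0, 2.5]])
y_true = np.array([0, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.libsvm")  # hypothetical file name
    # Each line is written as: <label> <index>:<value> ... (zeros are omitted)
    dump_svmlight_file(X_dense, y_true, path, zero_based=True)
    X_loaded, y_loaded = load_svmlight_file(path, n_features=4, zero_based=True)

print(type(X_loaded))       # a SciPy sparse matrix
print(X_loaded.toarray())   # convert to dense only for small data
print(y_loaded)
```

Passing `n_features` and `zero_based` explicitly avoids relying on what the loader infers from the file contents.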
Machine learning model modules
from sklearn.datasets import load_svmlight_file  # loads LibSVM-format data
X,y = load_svmlight_file('filename')
from sklearn.linear_model import LogisticRegression
reg = LogisticRegression(penalty='l2',tol=1e-4,C=1.0)  # the parameter is an uppercase C
reg.fit(X,y)  # train
test_x = X[:10]
test_y = y[:10]
test_score = reg.score(test_x,test_y)  # mean accuracy on these rows
To train a better model, you can use cross-validation, or tune the parameters with a greedy search.
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, iris.data, iris.target, cv=5)  # one accuracy score per fold
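To make clearer what `cross_val_score` returns, the sketch below summarizes the fold scores and then repeats the same evaluation by hand. It assumes that an integer `cv` with a classifier corresponds to an unshuffled `StratifiedKFold`; the variable names `fold_clf` and `manual_scores` are my own.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

iris = datasets.load_iris()
clf = svm.SVC(kernel="linear", C=1)

# One call: returns one accuracy score per fold
scores = cross_val_score(clf, iris.data, iris.target, cv=5)
print("accuracy: %.2f (+/- %.2f)" % (scores.mean(), scores.std()))

# Roughly the same evaluation written out by hand
manual_scores = []
splitter = StratifiedKFold(n_splits=5)
for train_idx, test_idx in splitter.split(iris.data, iris.target):
    fold_clf = svm.SVC(kernel="linear", C=1)
    fold_clf.fit(iris.data[train_idx], iris.target[train_idx])
    manual_scores.append(fold_clf.score(iris.data[test_idx], iris.target[test_idx]))
print(np.round(manual_scores, 2))
```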
Loading data for preprocessing
import numpy as np
import urllib.request
# url with dataset
url='http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
# download the file
raw_data = urllib.request.urlopen(url)
# load the CSV file as a numpy matrix
dataset = np.loadtxt(raw_data,delimiter=',')
# separate the data from the target attributes
X = dataset[:,0:8]  # all rows, the first 8 columns (features)
y = dataset[:,8]    # all rows, column 8 (the label)
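Column slicing is easy to get wrong, so here is a small offline sketch of the same loading pattern. The CSV text is a hypothetical three-row toy with two feature columns and one label column; it is read from memory with `io.StringIO` instead of a URL.

```python
import io

import numpy as np

# Hypothetical toy CSV: two feature columns followed by a label column
csv_text = "6,148,1\n1,85,0\n8,183,1\n"

dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X = dataset[:, 0:2]        # all rows, columns 0 and 1
y = dataset[:, 2]          # all rows, column 2 only
first_rows = dataset[:2]   # the first two ROWS, every column

print(X.shape, y.shape, first_rows.shape)
```

The difference between `dataset[:, k]` (a column) and `dataset[:k]` (the first `k` rows) is the point of the example.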
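Data normalization
The gradient-based methods used by most machine learning algorithms are sensitive to the scale of the data. Before running an algorithm, we should normalize or standardize the data so that all features are on a comparable scale, for example within the 0-1 range. Note that `preprocessing.normalize` rescales each sample to unit norm and `preprocessing.scale` gives each feature zero mean and unit variance, rather than mapping each feature onto [0, 1].
from sklearn import preprocessing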
# normalize the data attributes
normalized_X = preprocessing.normalize(X)
# standardize the data attributes
standardized_X = preprocessing.scale(X)
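The difference between these transformations is easier to see on a tiny array. The sketch below uses hypothetical toy values and also includes `MinMaxScaler`, which is not used in the text above but is the scikit-learn transformer that maps each feature onto the 0-1 range.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples, 2 features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# normalize: rescales each ROW (sample) to unit L2 norm
normalized_X = preprocessing.normalize(X)
print(np.linalg.norm(normalized_X, axis=1))

# scale: standardizes each COLUMN (feature) to zero mean and unit variance
standardized_X = preprocessing.scale(X)
print(standardized_X.mean(axis=0), standardized_X.std(axis=0))

# MinMaxScaler: maps each COLUMN (feature) onto the range [0, 1]
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)
print(minmax_X.min(axis=0), minmax_X.max(axis=0))
```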
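Feature selection
When solving a real problem, the ability to choose suitable features or to construct new ones is especially important. This is called feature selection or feature engineering. Feature selection is a creative process that relies largely on intuition and domain knowledge, but there are also many ready-made algorithms for it. The tree-based algorithm below computes how informative each feature is: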
from sklearn.ensemble import ExtraTreesClassifier
model = ExtraTreesClassifier()
model.fit(X,y)
# display the relative importance of each attribute
print(model.feature_importances_)
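To show how the importances can be read, the following sketch fits the same estimator on synthetic data in which, by construction, only one feature carries signal. The data, the seed, and `n_estimators=100` are assumptions made for the illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.normal(size=(300, 4))
y = (X[:, 2] > 0).astype(int)   # only feature 2 determines the label

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

importances = model.feature_importances_   # non-negative, sums to 1
ranking = np.argsort(importances)[::-1]    # most important feature first
for idx in ranking:
    print("feature %d: %.3f" % (idx, importances[idx]))
```

Features with importance close to zero are candidates for removal, although the threshold is a judgment call.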
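Using scikit-learn algorithms
Logistic regression
Many problems can be reduced to binary classification. An advantage of this algorithm is that it outputs the probability that a sample belongs to each class.
We use the Pima Indians Diabetes dataset, which contains health measurements and diabetes status for 768 patients.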
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data'
col_names = ['pregnant','glucose','bp','skin','insulin','bmi','pedigree','age','label']
pima = pd.read_csv(url,header=None,names=col_names)
print(pima.head())  # show the first five rows
Result (1 means the patient has diabetes, 0 means no diabetes):
| | pregnant | glucose | bp | skin | insulin | bmi | age | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 33 | 1 |
feature_cols = ['pregnant','insulin','bmi','age']
X = pima[feature_cols]
y = pima.label
# split X and y into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,random_state=0,test_size=0.33)  # the default test_size is 0.25
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
# make class predictions for the testing set
y_pred_class = clf.predict(X_test)
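The class probabilities mentioned at the start of this section are available through `predict_proba`. The sketch below uses synthetic data from `make_classification` so that it runs without a download; the data and seeds are assumptions and stand in for the diabetes features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary data, used only so the example is self-contained
X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# One row per sample, one column per class listed in clf.classes_
proba = clf.predict_proba(X_test)
print(clf.classes_)
print(proba[:3])

# predict() returns the class with the highest probability
y_pred_class = clf.predict(X_test)
same = np.array_equal(y_pred_class, clf.classes_[proba.argmax(axis=1)])
print(same)
```

Working with probabilities instead of hard labels makes it possible to pick a decision threshold other than 0.5 when the two kinds of error have different costs.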
metrics
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X,y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected,predicted))
print(metrics.confusion_matrix(expected,predicted))
Result
             precision    recall  f1-score   support
          0       0.71      0.88      0.79       500
          1       0.60      0.34      0.43       268
avg / total       0.67      0.69      0.66       768
[[440  60]
 [178  90]]
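The numbers in the report can be recomputed from the confusion matrix, which helps when reading both outputs together. The sketch below takes the matrix printed above as input and uses plain NumPy; in scikit-learn's layout rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the logistic regression output above
cm = np.array([[440, 60],
               [178, 90]])

correct = np.diag(cm)
precision = correct / cm.sum(axis=0)   # correct / predicted as that class
recall = correct / cm.sum(axis=1)      # correct / actually in that class
f1 = 2 * precision * recall / (precision + recall)
accuracy = correct.sum() / cm.sum()

print(np.round(precision, 2))
print(np.round(recall, 2))
print(np.round(f1, 2))
print(round(float(accuracy), 2))
```

The rounded per-class values agree with the precision, recall, and f1-score columns of the report.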
Naive Bayes
This method estimates the distribution density of the training data, and it works well for multi-class classification.
from sklearn import metrics
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X,y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected,predicted))
print(metrics.confusion_matrix(expected,predicted))
Result
GaussianNB()
             precision    recall  f1-score   support
          0       0.72      0.83      0.77       500
          1       0.56      0.40      0.47       268
avg / total       0.67      0.68      0.67       768
[[415  85]
 [160 108]]
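What "estimating the distribution" means for `GaussianNB` can be inspected on a fitted model. This sketch uses the built-in three-class iris data as an assumption, since it needs no download, and checks that the learned per-class means match the sample means.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)   # three classes, 50 samples each

model = GaussianNB()
model.fit(X, y)

# Learned per-class feature means: one row per class, one column per feature
print(model.theta_)

# They match the sample means computed directly from the data
class0_mean = X[y == 0].mean(axis=0)
print(np.allclose(model.theta_[0], class0_mean))

# Class priors estimated from the class frequencies
print(model.class_prior_)
```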
k-nearest neighbors
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
# fit a k-nearest neighbor model to the data
model = KNeighborsClassifier()
model.fit(X,y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected,predicted))
print(metrics.confusion_matrix(expected,predicted))
Result
             precision    recall  f1-score   support
          0       0.81      0.88      0.84       500
          1       0.73      0.62      0.67       268
avg / total       0.78      0.79      0.78       768
[[439  61]
 [103 165]]
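Decision trees
The Classification and Regression Trees (CART) algorithm is often used for classification or regression problems whose features carry categorical information, and it is suitable for multi-class problems.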
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
# fit a CART model to the data
model = DecisionTreeClassifier()
model.fit(X,y)
print(model)
# make predictions
expected = y
predicted = model.predict(X)
# summarize the fit of the model
print(metrics.classification_report(expected,predicted))
print(metrics.confusion_matrix(expected,predicted))
Result
             precision    recall  f1-score   support
          0       1.00      1.00      1.00       500
          1       1.00      1.00      1.00       268
avg / total       1.00      1.00      1.00       768
[[500   0]
 [  1 267]]
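The near-perfect score above was measured on the same data the tree was trained on, so it says little about new samples. The sketch below compares training and held-out accuracy for an unpruned tree; the synthetic data with 10% label noise (`flip_y=0.1`) and the seeds are assumptions chosen for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with some label noise, for illustration only
X, y = make_classification(n_samples=500, n_features=8, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)   # scored on data the tree has seen
test_acc = model.score(X_test, y_test)      # scored on held-out data
print("train accuracy: %.2f" % train_acc)
print("test accuracy:  %.2f" % test_acc)
```

Limiting `max_depth` or `min_samples_leaf` is a common way to narrow the gap between the two numbers.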
Support vector machines
from sklearn import metrics
from sklearn.svm import SVC
model = SVC()
model.fit(X,y)
print(model)
expected = y
predicted = model.predict(X)
print(metrics.classification_report(expected,predicted))
print(metrics.confusion_matrix(expected,predicted))
Result
             precision    recall  f1-score   support
          0       0.95      1.00      0.97       500
          1       0.99      0.91      0.95       268
avg / total       0.97      0.97      0.97       768
[[498   2]
 [ 24 244]]
How to tune algorithm parameters
A harder task is finding an effective way to choose the right parameters. We need a search method to determine them.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
# prepare a range of alpha values to test
alphas = np.array([1,0.1,0.01,0.001,0.0001,0])
# create and fit a ridge regression model, testing each alpha
model = Ridge()
grid = GridSearchCV(estimator=model,param_grid=dict(alpha=alphas))
grid.fit(X,y)
print(grid)
# summarize the results of the grid search
print(grid.best_score_)
print(grid.best_estimator_.alpha)
GridSearchCV(cv=None, error_score='raise',
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001),
fit_params={}, iid=True, n_jobs=1,
param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
1.00000e-04, 0.00000e+00])},
pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
0.279617559313
1.0
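Beyond the single best score, a fitted search object keeps the score of every candidate in `cv_results_`, which shows how sensitive the model is to the parameter. The sketch below is a minimal illustration on synthetic regression data; the data, the `alphas` values, and `cv=5` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

alphas = np.array([10.0, 1.0, 0.1, 0.01, 0.001])
grid = GridSearchCV(estimator=Ridge(), param_grid={"alpha": alphas}, cv=5)
grid.fit(X, y)

# Mean cross-validated score (R^2 for regressors) of every candidate
for params, score in zip(grid.cv_results_["params"],
                         grid.cv_results_["mean_test_score"]):
    print("alpha=%g  mean R^2=%.4f" % (params["alpha"], score))

print("best:", grid.best_params_, grid.best_score_)
```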
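Sometimes sampling parameter values at random from a given range is an effective approach: evaluate the algorithm with each sampled value and then pick the best one.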
import numpy as np
from scipy.stats import uniform as sp_rand
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# prepare a uniform distribution to sample for the alpha parameter
param_grid = {'alpha':sp_rand()}
# create and fit a ridge regression model, testing random alpha values
model = Ridge()
rsearch = RandomizedSearchCV(estimator=model,param_distributions=param_grid,n_iter=100)
rsearch.fit(X,y)
print(rsearch)
# summarize the results of the random parameter search
print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
RandomizedSearchCV(cv=None, error_score='raise',
estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
normalize=False, random_state=None, solver='auto', tol=0.001),
fit_params={}, iid=True, n_iter=100, n_jobs=1,
param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000000055DCE10>},
pre_dispatch='2*n_jobs', random_state=None, refit=True,
scoring=None, verbose=0)
0.279616856141
0.964295479202
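Two variations are often useful with random search: sampling on a logarithmic scale when the parameter spans several orders of magnitude, and fixing `random_state` so the run is reproducible. The sketch below assumes synthetic data and an arbitrarily chosen range of 1e-4 to 1e2 for `alpha`; `loguniform` comes from `scipy.stats`.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# Sample alpha log-uniformly between 1e-4 and 1e2 (range chosen arbitrarily)
param_distributions = {"alpha": loguniform(1e-4, 1e2)}

rsearch = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    random_state=0,   # makes the sampled candidates reproducible
)
rsearch.fit(X, y)

print(rsearch.best_params_["alpha"])
print(rsearch.best_score_)
```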