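sklearn ships with several built-in datasets; the handwritten digits dataset can be loaded with load_digits. Looking through the loaders in sklearn.datasets, here is what one of them looks like (the listing shown is load_linnerud, a sibling of load_digits in the same module):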
def load_linnerud():
    """Load and return the linnerud dataset (multivariate regression).

    Samples total: 20
    Dimensionality: 3 for both data and targets
    Features: integer
    Targets: integer

    Returns
    -------
    data : Bunch
        Dictionary-like object, the interesting attributes are: 'data' and
        'targets', the two multivariate datasets, with 'data' corresponding to
        the exercise and 'targets' corresponding to the physiological
        measurements, as well as 'feature_names' and 'target_names'.
    """
    base_dir = join(dirname(__file__), 'data/')
    # Read data
    data_exercise = np.loadtxt(base_dir + 'linnerud_exercise.csv', skiprows=1)
    data_physiological = np.loadtxt(base_dir + 'linnerud_physiological.csv',
                                    skiprows=1)
    # Read header
    with open(base_dir + 'linnerud_exercise.csv') as f:
        header_exercise = f.readline().split()
    with open(base_dir + 'linnerud_physiological.csv') as f:
        header_physiological = f.readline().split()
    with open(dirname(__file__) + '/descr/linnerud.rst') as f:
        descr = f.read()
    return Bunch(data=data_exercise, feature_names=header_exercise,
                 target=data_physiological,
                 target_names=header_physiological,
                 DESCR=descr)
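The docstring describes the data in detail, and the code shows that the data itself is read from CSV files. Following the hints in the documentation, we can pick one sample and display it: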
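from sklearn.datasets import load_digits
import pylab as pl

digits = load_digits()
# Data dimensions: 1797 images, each 8x8 pixels
# Display one image (uncomment the lines below to show it)
# pl.gray()
# pl.matshow(digits.images[0])
# pl.show()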
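To make the structure of the returned object more concrete, here is a minimal sketch that inspects it. It assumes a reasonably recent scikit-learn and NumPy; the variable name flattened is only for illustration and is not part of the library.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# The Bunch behaves like a dictionary, so its fields can be listed
print(sorted(digits.keys()))

# images holds the 8x8 pictures, data holds the same pixels flattened to 64 features
print(digits.images.shape)  # (1797, 8, 8)
print(digits.data.shape)    # (1797, 64)

# Flatten the images ourselves and compare with the data attribute
flattened = digits.images.reshape(len(digits.images), -1)
print(np.array_equal(flattened, digits.data))

# The beginning of the dataset description shipped with the loader
print(digits.DESCR[:300])
```

The data attribute is what the classifier consumes later on; images is only convenient for plotting.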
Back to the main topic: let's run a classification experiment with LinearSVC from sklearn's svm package.
from sklearn.datasets import load_digits  # loads the handwritten digits data
from sklearn.model_selection import train_test_split  # train/test splitting (in sklearn.model_selection since scikit-learn 0.18)
from sklearn.preprocessing import StandardScaler  # standardization tool
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report  # prediction analysis tool

digits = load_digits()
# Data dimensions: 1797 samples, each a flattened 8x8 image (64 features)
print(digits.data.shape)
# Split the data
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)
ss = StandardScaler()
# fit is an instance method, so it must be called on an instance.
# Fit the scaler on the training set only, then reuse it on the test set.
X_train = ss.fit_transform(X_train)
X_test = ss.transform(X_test)
lsvc = LinearSVC()
lsvc.fit(X_train, Y_train)
Y_predict = lsvc.predict(X_test)
print(classification_report(Y_test, Y_predict, target_names=digits.target_names.astype(str)))
With the data split 3:1 into training and test sets (test_size=0.25), the classification performance is as follows:
             precision    recall  f1-score   support

          0       0.92      1.00      0.96        35
          1       0.96      0.98      0.97        54
          2       0.98      1.00      0.99        44
          3       0.93      0.93      0.93        46
          4       0.97      1.00      0.99        35
          5       0.94      0.94      0.94        48
          6       0.96      0.98      0.97        51
          7       0.92      1.00      0.96        35
          8       0.98      0.84      0.91        58
          9       0.95      0.91      0.93        44

avg / total       0.95      0.95      0.95       450
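The scale-then-fit steps of the experiment can also be bundled into a single estimator, which makes it harder to fit the scaler on test data by accident. The following is a minimal sketch of that variant, not the setup that produced the table above: make_pipeline is an alternative introduced here, and max_iter=10000 is an assumption chosen only to give the solver room to converge, so the exact scores may differ slightly.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

digits = load_digits()
X_train, X_test, Y_train, Y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=33)

# 1797 samples with test_size=0.25 give 1347 training and 450 test samples
print(X_train.shape[0], X_test.shape[0])

# The pipeline fits the scaler on the training data only and
# reapplies the same scaling when predicting on the test data.
# max_iter=10000 is an assumption of this sketch, not a value from the text.
model = make_pipeline(StandardScaler(), LinearSVC(max_iter=10000))
model.fit(X_train, Y_train)

accuracy = model.score(X_test, Y_test)
print("test accuracy: %.3f" % accuracy)
```

The 450 test samples match the support total in the last row of the report.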
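A few performance metrics are worth remembering:
precision is the ratio of true positives to all samples predicted as positive.
recall is the ratio of true positives to all samples that are actually positive.
A true positive is a sample that is actually positive and is also predicted as positive.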
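The definitions above can be checked by hand on a small example. The sketch below uses made-up binary labels (y_true and y_pred are illustrative data, not results from the digits experiment), counts true positives, false positives and false negatives directly, and compares the result with precision_score and recall_score from sklearn.metrics. In the ten-class report, each digit is treated as the positive class in turn.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Illustrative labels: 1 is the positive class, 0 the negative class
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 1, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # actually positive, predicted positive
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # actually negative, predicted positive
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # actually positive, predicted negative

precision = tp / (tp + fp)  # 3 / 5 = 0.6
recall = tp / (tp + fn)     # 3 / 4 = 0.75

print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```

Here the classifier finds three of the four real positives (recall 0.75), but two of its five positive predictions are wrong (precision 0.6).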