scikit-learn: Main Modules and Basic Usage

I have recently been learning scikit-learn, an open-source machine learning toolkit built on NumPy, SciPy, and Matplotlib.

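A typical scikit-learn workflow has three main steps:

Data preparation and preprocessing
Model selection and training
Model validation and parameter tuning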
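
The three steps above can be sketched end to end as follows. This is only a minimal illustration: the built-in iris data, `StandardScaler`, and `LogisticRegression` are assumptions chosen for the example, not a prescribed setup.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Step 1: data preparation (the scaling itself happens inside the pipeline)
X, y = load_iris(return_X_y=True)

# Step 2: model selection; the pipeline is trained inside cross_val_score,
# and the scaler is re-fit on every training fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 3: model validation with 5-fold cross-validation
scores = cross_val_score(pipeline, X, y, cv=5)
print(scores.mean())
```

Putting the scaler inside the pipeline keeps the validation folds from leaking into the preprocessing step.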

A logistic regression example

scikit-learn supports data in several formats, including the classic iris dataset and LibSVM-format data.

    from sklearn.datasets import load_svmlight_file  # loads LibSVM-format data
  
    X,y = load_svmlight_file('filename')
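
Since `'filename'` above is a placeholder, here is a small round-trip sketch that can be run without any external file. The toy matrix and the temporary file name are made up for the example; `dump_svmlight_file` is used only to create a LibSVM-format file to load.

```python
import os
import tempfile

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

# A tiny made-up dataset: 3 samples, 4 features
X_small = np.array([[1.0, 0.0, 2.5, 0.0],
                    [0.0, 3.0, 0.0, 0.0],
                    [4.0, 0.0, 0.0, 1.5]])
y_small = np.array([1, 0, 1])

# Write it in LibSVM format to a temporary file, then load it back
path = os.path.join(tempfile.mkdtemp(), "toy.libsvm")
dump_svmlight_file(X_small, y_small, path)

X_loaded, y_loaded = load_svmlight_file(path, n_features=4)

print(type(X_loaded))      # a SciPy sparse matrix, not a dense array
print(X_loaded.toarray())  # convert to dense only when the data is small
print(y_loaded)
```

The loader returns the features as a sparse matrix, which most scikit-learn estimators accept directly.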

Modules for machine learning models

    from sklearn.datasets import load_svmlight_file  # loads LibSVM-format data
  
    X,y = load_svmlight_file('filename') 
  
    from sklearn.linear_model import LogisticRegression 
  
    reg = LogisticRegression(penalty='l2', tol=1e-4, C=1.0)  # the parameter is an uppercase C
  
    reg.fit(X, y)  # train
  
    test_x = X[:10] 
  
    test_y = y[:10] 
  
    test_score = reg.score(test_x, test_y)  # score (mean accuracy)

To train a better model, you can use cross-validation, or tune the parameters with a greedy search.

    from sklearn import datasets, svm
    from sklearn.model_selection import cross_val_score

    iris = datasets.load_iris()  # the iris data used below
  
    clf = svm.SVC(kernel='linear', C=1) 
  
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

Loading data for preprocessing

    import numpy as np 
  
    import urllib.request 
  
    # url with dataset 
  
    url='http://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data' 
  
    # download the file 
  
    raw_data = urllib.request.urlopen(url) 
  
    # load the CSV file as a numpy matrix 
  
    dataset = np.loadtxt(raw_data, delimiter=',') 
  
    # separate the data from the target attributes 
  
    X = dataset[:, 0:8]  # columns 0-7: features
  
    y = dataset[:, 8]    # column 8: target
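
The column slicing is easy to get wrong, so here is a small offline sketch of the same split. The two CSV rows are made-up placeholder values (not taken from the real dataset); only the layout of 8 feature columns followed by 1 label column matches.

```python
import io

import numpy as np

# Two made-up rows: 8 feature columns, then the label column
csv_text = "1,2,3,4,5,6,7,8,0\n9,8,7,6,5,4,3,2,1\n"
dataset = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X = dataset[:, 0:8]   # all rows, columns 0..7 -> feature matrix
y = dataset[:, 8]     # all rows, column 8    -> target vector
print(X.shape, y.shape)

# Without the comma, the slice selects ROWS instead of a column
rows = dataset[:8]
print(rows.shape)
```

The first index always addresses rows, so a target column needs the `[:, 8]` form.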

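Data Normalization

The gradient-based methods used in most machine learning algorithms are sensitive to the scale of the data, so the data should be normalized or standardized before running an algorithm. Of the two calls below, preprocessing.normalize rescales each sample to unit norm, and preprocessing.scale centers each feature to zero mean and unit variance; mapping features into the 0-1 range is the job of MinMaxScaler.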

    from sklearn import preprocessing
  
    # normalize the data attributes 
  
    normalized_X = preprocessing.normalize(X) 
  
    # standardize the data attributes 
  
    standardized_X = preprocessing.scale(X)

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
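
The rescaling functions in `sklearn.preprocessing` behave differently, which a small sketch makes visible. The 3x2 matrix is made up for the example, and `MinMaxScaler` is added here only for comparison.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# normalize: rescales each ROW (sample) to unit L2 norm
normalized_X = preprocessing.normalize(X)
print(np.linalg.norm(normalized_X, axis=1))

# scale: centers each COLUMN (feature) to mean 0 and unit variance
standardized_X = preprocessing.scale(X)
print(standardized_X.mean(axis=0), standardized_X.std(axis=0))

# MinMaxScaler: maps each COLUMN (feature) into the range [0, 1]
minmax_X = preprocessing.MinMaxScaler().fit_transform(X)
print(minmax_X.min(axis=0), minmax_X.max(axis=0))
```

Which one fits depends on the algorithm: row normalization suits cosine-style similarity, while per-feature scaling is the usual choice before gradient-based training.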

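Feature Selection

When solving a real problem, the ability to choose or construct suitable features is especially important. This is called feature selection or feature engineering. Feature selection is a creative process that relies largely on intuition and domain knowledge, but there are also many ready-made algorithms for it. The tree-based algorithm below computes the importance of each feature: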

    from sklearn.ensemble import ExtraTreesClassifier 
  
    model = ExtraTreesClassifier() 
  
    model.fit(X,y) 
  
    # display the relative importance of each attribute 
  
    print(model.feature_importances_)
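
One possible way to act on these importances is to keep only the features that score above the mean importance. The sketch below assumes synthetic data from `make_classification` and uses `SelectFromModel`, which is not part of the original example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data (an assumption for the example): 8 features, 3 informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# The importances are non-negative and sum to 1
print(model.feature_importances_)

# Keep only the features whose importance is at least the mean importance
selector = SelectFromModel(model, prefit=True)
X_selected = selector.transform(X)
print(X.shape, "->", X_selected.shape)
```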

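Using scikit-learn algorithms

Logistic regression

Many classification problems can be framed as binary classification. An advantage of this algorithm is that it outputs the probability of a sample belonging to each class.

We use the Pima Indians Diabetes dataset, which contains health measurements and diabetes status for 768 patients.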

    import pandas as pd 
  
    url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/pima-indians-diabetes/pima-indians-diabetes.data' 
  
    col_names = ['pregnant','glucose','bp','skin','insulin','bmi','pedigree','age','label'] 
  
    pima = pd.read_csv(url,header=None,names=col_names) 
  
    print(pima.head())  # show the first five rows

Result (1 means diabetic, 0 means not; the pedigree column is not shown in this listing)

       pregnant  glucose  bp  skin  insulin   bmi  age  label
    0         6      148  72    35        0  33.6   50      1
    1         1       85  66    29        0  26.6   31      0
    2         8      183  64     0        0  23.3   32      1
    3         1       89  66    23       94  28.1   21      0
    4         0      137  40    35      168  43.1   33      1

    feature_cols = ['pregnant','insulin','bmi','age'] 
  
    X = pima[feature_cols] 
  
    y=pima.label 
  
    # split X and y into training and testing sets 
  
    from sklearn.model_selection import train_test_split 
  
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.33)  # the default test_size is 0.25
  
    from sklearn.linear_model import LogisticRegression 
  
    clf = LogisticRegression() 
  
    clf.fit(X_train, y_train) 
  
    # make class predictions for the testing set 
  
    y_pred_class = clf.predict(X_test)
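
The class probabilities mentioned earlier are available through `predict_proba`. Here is a minimal, self-contained sketch; it assumes synthetic data from `make_classification` in place of the Pima data so that it runs without a download.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Pima data (an assumption for the example)
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)

# One row per sample, one column per class (ordered as in clf.classes_)
proba = clf.predict_proba(X_test)
print(clf.classes_)
print(proba[:5])

# predict() returns the class with the highest probability
y_pred_class = clf.predict(X_test)
```

Having the probabilities makes it possible to apply a decision threshold other than 0.5 when one kind of error is more costly than the other.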

metrics

    from sklearn import metrics 
  
    from sklearn.linear_model import LogisticRegression 
  
    model = LogisticRegression() 
  
    model.fit(X,y) 
  
    print(model) 
  
    # make predictions 
  
    expected = y 
  
    predicted = model.predict(X) 
  
    # summarize the fit of the model 
  
    print(metrics.classification_report(expected,predicted)) 
  
    print(metrics.confusion_matrix(expected,predicted))

Result

                 precision    recall  f1-score   support

              0       0.71      0.88      0.79       500
              1       0.60      0.34      0.43       268

    avg / total       0.67      0.69      0.66       768

    [[440  60]
     [178  90]]
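
The report and the confusion matrix carry the same information. As a worked check, the per-class precision and recall can be recomputed from the matrix printed above, assuming scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the logistic regression output:
# rows are true classes, columns are predicted classes
cm = np.array([[440, 60],
               [178, 90]])

# precision = correct predictions of a class / all predictions of that class
precision = np.diag(cm) / cm.sum(axis=0)
# recall = correct predictions of a class / all true members of that class
recall = np.diag(cm) / cm.sum(axis=1)
# accuracy = all correct predictions / all samples
accuracy = np.diag(cm).sum() / cm.sum()

print(precision.round(2))
print(recall.round(2))
print(round(accuracy, 2))
```

The rounded values agree with the report: class 1 has low recall because 178 of the 268 diabetic patients are predicted as class 0.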

Naive Bayes

This method estimates the probability density of the training data, and it works well for multi-class classification.

    from sklearn import metrics 
  
    from sklearn.naive_bayes import GaussianNB 
  
    model = GaussianNB() 
  
    model.fit(X,y) 
  
    print(model) 
  
    # make predictions 
  
    expected = y 
  
    predicted = model.predict(X) 
  
    # summarize the fit of the model 
  
    print(metrics.classification_report(expected,predicted)) 
  
    print(metrics.confusion_matrix(expected,predicted))

Result

    GaussianNB()
                 precision    recall  f1-score   support

              0       0.72      0.83      0.77       500
              1       0.56      0.40      0.47       268

    avg / total       0.67      0.68      0.67       768

    [[415  85]
     [160 108]]
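
To illustrate the multi-class case, here is a small sketch on the three-class iris data (an assumption for the example, since the Pima data has only two classes). It also prints the shape of the per-class feature means that `GaussianNB` estimates from the training data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

model = GaussianNB()
model.fit(X_train, y_train)

# theta_ holds the estimated mean of every feature within every class:
# one row per class, one column per feature
print(model.theta_.shape)

# Mean accuracy on the held-out samples
print(model.score(X_test, y_test))
```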

k-nearest neighbors

    from sklearn import metrics 
  
    from sklearn.neighbors import KNeighborsClassifier 
  
    # fit a k-nearest neighbor model to the data  
  
    model = KNeighborsClassifier() 
  
    model.fit(X,y) 
  
    print(model) 
  
    # make predictions 
  
    expected = y 
  
    predicted = model.predict(X) 
  
    # summarize the fit of the model 
  
    print(metrics.classification_report(expected,predicted)) 
  
    print(metrics.confusion_matrix(expected,predicted))

Result

                 precision    recall  f1-score   support

              0       0.81      0.88      0.84       500
              1       0.73      0.62      0.67       268

    avg / total       0.78      0.79      0.78       768

    [[439  61]
     [103 165]]

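Decision trees

The Classification and Regression Trees (CART) algorithm is often used for classification or regression problems whose features include categorical information, and it is well suited to multi-class problems.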

    from sklearn import metrics 
  
    from sklearn.tree import DecisionTreeClassifier 
  
    # fit a CART model to the data  
  
    model = DecisionTreeClassifier() 
  
    model.fit(X,y) 
  
    print(model) 
  
    # make predictions 
  
    expected = y 
  
    predicted = model.predict(X) 
  
    # summarize the fit of the model 
  
    print(metrics.classification_report(expected,predicted)) 
  
    print(metrics.confusion_matrix(expected,predicted))

Result

                 precision    recall  f1-score   support

              0       1.00      1.00      1.00       500
              1       1.00      1.00      1.00       268

    avg / total       1.00      1.00      1.00       768

    [[500   0]
     [  1 267]]
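
The near-perfect scores come from predicting on the same rows the tree was trained on, which an unpruned tree can largely memorize. A minimal sketch of scoring on held-out data instead, assuming synthetic data from `make_classification` with some label noise:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 10% flipped labels (an assumption for the example)
X, y = make_classification(n_samples=768, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Accuracy on the rows the tree has seen versus rows it has not
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(train_acc, test_acc)
```

The gap between the two numbers is a more useful signal than the training score alone; limiting `max_depth` is one common way to narrow it.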

Support vector machines

    from sklearn import metrics
  
    from sklearn.svm import SVC 
  
    model = SVC() 
  
    model.fit(X,y) 
  
    print(model) 
  
    expected = y 
  
    predicted = model.predict(X) 
  
    print(metrics.classification_report(expected,predicted)) 
  
    print(metrics.confusion_matrix(expected,predicted))

                 precision    recall  f1-score   support

              0       0.95      1.00      0.97       500
              1       0.99      0.91      0.95       268

    avg / total       0.97      0.97      0.97       768

    [[498   2]
     [ 24 244]]

How to tune algorithm parameters

A harder task is building an effective method for choosing the right parameters; in practice we search for them.

    import numpy as np 
  
    from sklearn.linear_model import Ridge 
  
    from sklearn.model_selection import GridSearchCV 
  
    # prepare a range of alpha values to test 
  
    alphas = np.array([1,0.1,0.01,0.001,0.0001,0]) 
  
    # create and fit a ridge regression model, testing each alpha 
  
    model = Ridge() 
  
    grid = GridSearchCV(estimator=model,param_grid=dict(alpha=alphas)) 
  
    grid.fit(X,y) 
  
    print(grid) 
  
    # summarize the results of the grid search 
  
    print(grid.best_score_) 
  
    print(grid.best_estimator_.alpha)

    GridSearchCV(cv=None, error_score='raise',
           estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
           fit_params={}, iid=True, n_jobs=1,
           param_grid={'alpha': array([ 1.00000e+00, 1.00000e-01, 1.00000e-02, 1.00000e-03,
           1.00000e-04, 0.00000e+00])},
           pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)
    0.279617559313
    1.0
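
Ridge is a regression model, so the best score printed above is an R² value by default. For a classification target, the same grid search can wrap a classifier. The sketch below assumes synthetic data and a hypothetical grid over the `C` parameter of `LogisticRegression`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (an assumption for the example)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Candidate values for the inverse regularization strength C
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X, y)

print(grid.best_params_)
print(grid.best_score_)

# cv_results_ keeps the mean cross-validated score of every candidate
for C, score in zip(grid.cv_results_["param_C"],
                    grid.cv_results_["mean_test_score"]):
    print(C, round(score, 3))
```

Looking at every candidate's score, not just the best one, shows whether the optimum is sharp or whether several values perform about the same.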

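Sometimes randomly sampling parameters from a given range is an effective approach: evaluate the algorithm with each sampled value, then pick the best one.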

    import numpy as np 
  
    from scipy.stats import uniform as sp_rand 
  
    from sklearn.linear_model import Ridge 
  
    from sklearn.model_selection import RandomizedSearchCV 
  
    # prepare a uniform distribution to sample for the alpha parameter 
  
    param_grid = {'alpha':sp_rand()} 
  
    # create and fit a ridge regression model, testing random alpha values 
  
    model = Ridge() 
  
    rsearch = RandomizedSearchCV(estimator=model,param_distributions=param_grid,n_iter=100) 
  
    rsearch.fit(X,y) 
  
    print(rsearch) 
  
    # summarize the results of the random parameter search 
  
    print(rsearch.best_score_) 
  
    print(rsearch.best_estimator_.alpha)

    RandomizedSearchCV(cv=None, error_score='raise',
              estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
       normalize=False, random_state=None, solver='auto', tol=0.001),
              fit_params={}, iid=True, n_iter=100, n_jobs=1,
              param_distributions={'alpha': <scipy.stats._distn_infrastructure.rv_frozen object at 0x00000000055DCE10>},
              pre_dispatch='2*n_jobs', random_state=None, refit=True,
              scoring=None, verbose=0)
    0.279616856141
    0.964295479202
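
With no arguments, `scipy.stats.uniform` samples from the interval [0, 1], so the search above only tries alpha values in that range. When the useful values may span several orders of magnitude, a log-uniform distribution is one alternative. The sketch below assumes synthetic regression data and a hypothetical range of 1e-4 to 1e2; `random_state` is fixed so the run is repeatable.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data (an assumption for the example)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0,
                       random_state=0)

# alpha is sampled log-uniformly between 1e-4 and 1e2
param_distributions = {"alpha": loguniform(1e-4, 1e2)}

rsearch = RandomizedSearchCV(Ridge(), param_distributions,
                             n_iter=20, cv=5, random_state=0)
rsearch.fit(X, y)

print(rsearch.best_score_)
print(rsearch.best_estimator_.alpha)
```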
