This exercise focuses on the sklearn library. The environment is Anaconda on Windows 10.
Requirements
- Create a classification dataset (n_samples >= 1000, n_features >= 10)
- Split the dataset using 10-fold cross validation
- Train the algorithms
  - GaussianNB
  - SVC (possible C values [1e-02, 1e-01, 1e00, 1e01, 1e02], RBF kernel)
  - RandomForestClassifier (possible n_estimators values [10, 100, 1000])
- Evaluate the cross-validated performance
  - Accuracy
  - F1-score
  - AUC ROC
- Write a short report summarizing the methodology and the results
The code is as follows:
from sklearn import datasets
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# Generate the dataset
X, y = datasets.make_classification(n_samples=1000, n_features=10)

# Use scikit-learn for 10-fold cross-validation.
# sklearn.cross_validation was removed in scikit-learn 0.20; KFold now lives in
# sklearn.model_selection, takes n_splits, and yields indices through split().
kf = KFold(n_splits=10, shuffle=True)
for train_index, test_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_test, y_test = X[test_index], y[test_index]

    # Gaussian NB
    clf = GaussianNB()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_NB = metrics.accuracy_score(y_test, pred)
    f1_NB = metrics.f1_score(y_test, pred)
    # Note: AUC is computed from the hard 0/1 predictions here and below;
    # passing scores such as clf.predict_proba(X_test)[:, 1] gives the usual ranking-based AUC.
    auc_NB = metrics.roc_auc_score(y_test, pred)
    print('NB:')
    print(acc_NB)
    print(f1_NB)
    print(auc_NB)
    print('--------------------')

    # SVM with an RBF kernel
    clf = SVC(C=1e-01, kernel='rbf', gamma=0.1)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_SVM = metrics.accuracy_score(y_test, pred)
    f1_SVM = metrics.f1_score(y_test, pred)
    auc_SVM = metrics.roc_auc_score(y_test, pred)
    print('SVM:')
    print(acc_SVM)
    print(f1_SVM)
    print(auc_SVM)
    print('--------------------')

    # Random forest
    clf = RandomForestClassifier(n_estimators=10)
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    acc_rf = metrics.accuracy_score(y_test, pred)
    f1_rf = metrics.f1_score(y_test, pred)
    auc_rf = metrics.roc_auc_score(y_test, pred)
    print('random forest:')
    print(acc_rf)
    print(f1_rf)
    print(auc_rf)
    print('--------------------')
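The requirements list several candidate values for C, while the script fixes C = 1e-01. One possible way to compare the candidates is scikit-learn's GridSearchCV, sketched below for the SVC only. The fixed random_state values and the choice of accuracy as the selection metric are assumptions added for this sketch, not part of the original run.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

# Fixed seeds (an assumption, not in the original script) make the run repeatable.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Candidate C values taken from the requirements list.
param_grid = {"C": [1e-02, 1e-01, 1e00, 1e01, 1e02]}

# Same 10-fold shuffled split as the script; gamma=0.1 mirrors its SVC settings.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(SVC(kernel="rbf", gamma=0.1), param_grid,
                      scoring="accuracy", cv=cv)
search.fit(X, y)

# Mean cross-validated accuracy for each candidate value of C.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, round(score, 3))
print("best C:", search.best_params_["C"])
```

The same pattern should apply to RandomForestClassifier with {"n_estimators": [10, 100, 1000]}, although the 1000-tree setting takes noticeably longer to fit. Note that best_score_ is chosen on the same folds it is measured on, so it tends to be slightly optimistic as a final performance estimate.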
Result screenshots (taken from the most recent run of the code)
Summary
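Across several runs, the scores of the three methods differed noticeably, but it was not possible to say with confidence which method works best: the outcome changed with the generated data.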
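Part of that variation likely comes from make_classification and KFold drawing new random data and folds on every run, and from reading the folds one at a time. A minimal sketch of one way to make the comparison steadier is to fix the seeds and report the mean and standard deviation of each metric over the 10 folds with cross_validate. The seeds and the summary dictionary are assumptions introduced here. The built-in "roc_auc" scorer works from decision scores or probabilities rather than hard labels.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Fixed seeds (an assumption) so repeated runs see the same data and folds.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Same three models and hyperparameters as the script.
models = {
    "GaussianNB": GaussianNB(),
    "SVC": SVC(C=1e-01, kernel="rbf", gamma=0.1),
    "RandomForest": RandomForestClassifier(n_estimators=10, random_state=0),
}
scoring = ["accuracy", "f1", "roc_auc"]

summary = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    # cross_validate stores per-fold values of each metric under "test_<metric>".
    summary[name] = {
        m: (scores["test_" + m].mean(), scores["test_" + m].std())
        for m in scoring
    }

for name, stats in summary.items():
    line = ", ".join(f"{m}={mean:.3f}±{std:.3f}"
                     for m, (mean, std) in stats.items())
    print(f"{name}: {line}")
```

If the fold-to-fold standard deviation is comparable to the gap between two models, the data at hand does not clearly separate them, which would be consistent with the observation above.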