The data is the Titanic dataset:
First, the logistic regression algorithm:
import pandas
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

# Data preprocessing
titanic = pandas.read_csv('F:\\test\\titanic_train.csv')
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2

# Logistic regression
predictors = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
alg = LogisticRegression(random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic['Survived'], cv=3)  # 3-fold cross-validation
print(scores.mean())
The resulting accuracy is only 0.785634118967.
Next, random forest:
Brief principle: a random forest is an ensemble of decision trees, and the "random" shows up in two ways: ① each tree is trained on a random sample of the rows (bootstrap sampling); ② each split considers only a random subset of the features.
For classification, the forest's final prediction is the majority vote of its trees; for regression, it is the average of the individual trees' predictions.
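The two sources of randomness and the majority vote described above can be sketched by hand with plain decision trees. This is a simplified illustration on a synthetic dataset, not the Titanic data; note that scikit-learn actually re-draws the feature subset at every split (via `max_features`), which is used here rather than a per-tree subset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1)
X, y = make_classification(n_samples=200, n_features=6, random_state=1)

trees = []
for _ in range(25):
    # ① randomness in the rows: bootstrap sample (draw with replacement)
    rows = rng.randint(0, len(X), len(X))
    # ② randomness in the features: each split considers only sqrt(n_features) candidates
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=rng)
    trees.append(tree.fit(X[rows], y[rows]))

# Majority vote: predict with every tree, then take the most common class per sample
votes = np.array([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
pred = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (pred == y).mean()
print(accuracy)
```

This hand-rolled ensemble is essentially what `RandomForestClassifier` does internally; for regression, the last step would average the trees' numeric predictions instead of voting.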
import pandas
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module
from sklearn.ensemble import RandomForestClassifier

# Data preprocessing
titanic = pandas.read_csv('F:\\test\\titanic_train.csv')
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2

# Random forest
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
# Initialize the algorithm:
# n_estimators is the number of trees to build
# min_samples_split is the minimum number of rows needed to make a split
# min_samples_leaf is the minimum number of samples allowed at a leaf (the bottom points of the tree)
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=4, min_samples_leaf=2)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (one per fold)
print(scores.mean())
The accuracy improves somewhat:
0.822671156004
Feature analysis:
Brief principle: suppose each sample has three features A, B, and C. To estimate the "importance" of feature A, replace A's values with noise while leaving the other features unchanged; if the error rate rises noticeably, A is an important feature. (This is the idea behind permutation importance.)
import pandas
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif

# Data preprocessing
titanic = pandas.read_csv('F:\\test\\titanic_train.csv')
titanic['Age'] = titanic['Age'].fillna(titanic['Age'].median())
titanic['Embarked'] = titanic['Embarked'].fillna('S')
titanic.loc[titanic['Sex'] == 'male', 'Sex'] = 0
titanic.loc[titanic['Sex'] == 'female', 'Sex'] = 1
titanic.loc[titanic['Embarked'] == 'S', 'Embarked'] = 0
titanic.loc[titanic['Embarked'] == 'C', 'Embarked'] = 1
titanic.loc[titanic['Embarked'] == 'Q', 'Embarked'] = 2

# Add 2 engineered features
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))  # .apply generates a new Series

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked", "FamilySize", "NameLength"]

# Perform univariate feature selection
selector = SelectKBest(f_classif, k=5)
selector.fit(titanic[predictors], titanic["Survived"])

# Get the raw p-values for each feature and transform them into scores
scores = -np.log10(selector.pvalues_)

# Plot the scores; "Pclass", "Sex", "Fare", and "NameLength" score highest
plt.bar(range(len(predictors)), scores)
plt.xticks(range(len(predictors)), predictors, rotation='vertical')
plt.show()

# Keep only the four best features
predictors = ["Pclass", "Sex", "Fare", "NameLength"]
alg = RandomForestClassifier(random_state=1, n_estimators=100, min_samples_split=8, min_samples_leaf=4)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=3)
# Take the mean of the scores (one per fold)
print(scores.mean())
Based on the feature analysis, the code above keeps the four most important features:
"Pclass", "Sex", "Fare", "NameLength"
Although only four features are used, the final result changes very little:
0.801346801347
Reference: Andrew Ng's videos