In the previous post we completed the feature engineering for the Quora data (question1 and question2), producing the feature sets we will use below: fs_basic, fs_fuzz, fs_distance, fs_wmd, fs_bow, fs_tfidf_word, fs_tfidf_char, fs_svd_word and fs_svd_char.
After feature engineering, the original text of question1 and question2 has been converted into vectorized features that a computer can actually "read". What we do next is feed different combinations of these feature vectors to our model and see how it performs on each diet. We will use XGBoost as the prediction model; if you want a deeper understanding of XGBoost, please read the official documentation, since our goal here is hands-on practice rather than theory. Enough talk: let's roll up our sleeves and get started!
import numpy as np  # needed for np.hstack when combining feature sets below
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
fs_basic
The first feature set we use is fs_basic. You probably still remember what it contains, but here is a quick refresher (a code sketch follows the list):
- the length of question1
- the length of question2
- the difference in length between question1 and question2
- the character length of question1 after removing spaces
- the character length of question2 after removing spaces
- the number of words in question1
- the number of words in question2
- the number of words question1 and question2 have in common
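As a reminder of how such columns can be computed, here is a minimal pandas sketch. The column names and the build_fs_basic helper are illustrative; the actual implementation lives in the previous post.

import pandas as pd

def build_fs_basic(df):
    # df is assumed to have string columns 'question1' and 'question2'
    q1 = df['question1'].fillna('').astype(str)
    q2 = df['question2'].fillna('').astype(str)
    feats = pd.DataFrame(index=df.index)
    feats['len_q1'] = q1.str.len()                            # length of question1
    feats['len_q2'] = q2.str.len()                            # length of question2
    feats['len_diff'] = feats['len_q1'] - feats['len_q2']     # length difference
    feats['chars_q1'] = q1.str.replace(' ', '', regex=False).str.len()  # spaces removed
    feats['chars_q2'] = q2.str.replace(' ', '', regex=False).str.len()
    feats['words_q1'] = q1.str.split().str.len()              # word counts
    feats['words_q2'] = q2.str.split().str.len()
    feats['common_words'] = [len(set(a.lower().split()) & set(b.lower().split()))
                             for a, b in zip(q1, q2)]         # shared words
    return feats.values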
Now let's feed fs_basic to XGBoost and see how it does:
# Features: fs_basic; labels: the is_duplicate column
X = fs_basic
y = df.loc[:, df.columns == 'is_duplicate']
# Hold out 30% of the question pairs for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Note: eta is just an alias of learning_rate, so we set the rate only once
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
# Confusion matrix, overall accuracy, and per-class precision/recall
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
Trained on fs_basic, the model reaches a prediction accuracy of 72%. Next we train it on fs_basic together with fs_fuzz.
fs_basic + fs_fuzz
Remember the fs_fuzz feature set? fuzzywuzzy is a Python library for fuzzy string matching; its core idea is to score the similarity of two strings based on the edit distance between them.
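As a quick illustration, here is a hedged sketch using a few of fuzzywuzzy's standard scorers. The build_fs_fuzz helper and the exact choice of scores are assumptions; the previous post defines the real fs_fuzz.

import numpy as np
from fuzzywuzzy import fuzz

def build_fs_fuzz(q1_list, q2_list):
    # Four fuzzywuzzy similarity scores (0-100) per question pair
    return np.array([[fuzz.ratio(a, b),
                      fuzz.partial_ratio(a, b),
                      fuzz.token_sort_ratio(a, b),
                      fuzz.token_set_ratio(a, b)]
                     for a, b in zip(q1_list, q2_list)])

Now let's train XGBoost on fs_basic + fs_fuzz: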
X = np.hstack((fs_basic, fs_fuzz))
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
With fs_basic and fs_fuzz combined, accuracy improves by two percentage points to 74%. On top of that, let's add two more feature sets, fs_distance and fs_wmd, and see what happens.
fs_basic + fs_fuzz + fs_distance + fs_wmd
fs_distance captures the distance between question1 and question2 in a multi-dimensional embedding space, and fs_wmd is the Word Mover's Distance, another measure of how similar the two questions are.
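A minimal sketch of how such features can be computed with gensim. The pre-trained word2vec file, the averaging of word vectors, and the choice of cosine distance are all assumptions; the previous post defines the actual features.

import numpy as np
from gensim.models import KeyedVectors

# Assumed: a pre-trained word2vec model on disk; the file name is illustrative
w2v = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)

def sent_vec(sentence):
    # Average the vectors of in-vocabulary words; zero vector if none match
    vecs = [w2v[w] for w in sentence.lower().split() if w in w2v]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

v1, v2 = sent_vec('what is ai'), sent_vec('what is artificial intelligence')
# One possible fs_distance entry: cosine distance between sentence vectors
cosine_dist = 1 - v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-9)
# Word Mover's Distance between the two token lists
wmd = w2v.wmdistance('what is ai'.split(), 'what is artificial intelligence'.split())

We stack all four feature sets side by side and retrain: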
X = np.hstack((fs_basic, fs_fuzz, fs_distance, fs_wmd))
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
The model's accuracy rises again, by roughly another 3.6 percentage points, to 77.6%.
fs_bow
Next we switch to the bag-of-words (BOW) feature set and train the model on it.
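In case the construction has faded, here is a hedged sketch of how a BOW matrix can be built with scikit-learn's CountVectorizer. The vectorizer settings and the side-by-side stacking of the two questions are assumptions; the previous post has the real code.

import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Fit one shared vocabulary over both question columns (max_features is an assumed cap)
bow = CountVectorizer(max_features=3000)
bow.fit(pd.concat([df['question1'], df['question2']]).astype(str))
# One sparse count vector per question, stacked side by side per pair
fs_bow = sparse.hstack([bow.transform(df['question1'].astype(str)),
                        bow.transform(df['question2'].astype(str))])

Now we train the model on fs_bow: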
X = fs_bow
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
The prediction accuracy gains about one more percentage point, reaching 78.8%. Can it climb any higher? Let's try fs_tfidf_word next.
fs_tfidf_word
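fs_tfidf_word is the word-level tf-idf feature matrix: the same construction as fs_bow, but with raw counts replaced by tf-idf weights. A hedged sketch (ngram_range, min_df and the stacking scheme are assumptions; the previous post defines the real matrix):

import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Word-level tf-idf over a shared vocabulary fitted on both question columns
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=2)
tfidf.fit(pd.concat([df['question1'], df['question2']]).astype(str))
fs_tfidf_word = sparse.hstack([tfidf.transform(df['question1'].astype(str)),
                               tfidf.transform(df['question2'].astype(str))])

Training on fs_tfidf_word goes exactly as before: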
X = fs_tfidf_word
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
This time accuracy seems to inch up by 0.1 percentage points, to 78.9%. Next we try fs_tfidf_char.
fs_tfidf_char
fs_tfidf_char is the character-level counterpart of fs_tfidf_word: tf-idf computed over characters rather than words. Let's train the model on it and see how it does.
X = fs_tfidf_char
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
This works remarkably well: accuracy jumps to 81.9%. That leaves the two SVD feature sets, which we will also try in turn.
fs_svd_word
fs_svd_word is fs_tfidf_word after dimensionality reduction: the original tf-idf matrix is reduced to 180 dimensions.
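With scikit-learn this reduction is a two-liner; a minimal sketch (only n_components=180 comes from the text above, the random_state is an assumption):

from sklearn.decomposition import TruncatedSVD

# Project the sparse word-level tf-idf matrix down to 180 dense dimensions
svd = TruncatedSVD(n_components=180, random_state=0)
fs_svd_word = svd.fit_transform(fs_tfidf_word)

fs_svd_char further below is produced the same way from fs_tfidf_char.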
X = fs_svd_word
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
Recall that fs_tfidf_word scored 78.9%; after the SVD reduction, accuracy drops to 76.8%. Evidently, dimensionality reduction discards some of the information in the original features.
fs_svd_char
fs_svd_char is fs_tfidf_char after the same treatment, reduced to 180 dimensions.
X = fs_svd_char
y = df.loc[:, df.columns == 'is_duplicate']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = xgb.XGBClassifier(max_depth=50, n_estimators=80, learning_rate=0.1,
                          colsample_bytree=0.7, gamma=0, reg_alpha=4,
                          objective='binary:logistic', subsample=0.8,
                          silent=1).fit(X_train, y_train.values.ravel())
prediction = model.predict(X_test)
cm = confusion_matrix(y_test, prediction)
print(cm)
print('Accuracy', accuracy_score(y_test, prediction))
print(classification_report(y_test, prediction))
Recall that fs_tfidf_char scored 81.9%; after the SVD reduction, accuracy falls to 77.8%, which is still about one percentage point higher than fs_svd_word.
Summary
Today we trained our XGBoost model on a variety of feature sets and obtained a different prediction accuracy for each. The best-performing feature set was fs_tfidf_char at 81.9%, and the weakest was fs_basic at 72%. Along the way we also tried several combinations of feature sets, some of which performed quite well.
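For quick reference, all of the results from this post in one place:

| Feature set | Test accuracy |
| --- | --- |
| fs_basic | 72% |
| fs_basic + fs_fuzz | 74% |
| fs_basic + fs_fuzz + fs_distance + fs_wmd | 77.6% |
| fs_bow | 78.8% |
| fs_tfidf_word | 78.9% |
| fs_tfidf_char | 81.9% |
| fs_svd_word (fs_tfidf_word + SVD) | 76.8% |
| fs_svd_char (fs_tfidf_char + SVD) | 77.8% |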