Sentence Pair Matching is a very common class of NLP problems. Given two sentences S1 and S2, the task is to decide whether they stand in some particular relation. Formally, the problem can be stated as:

y = F(S1, S2), y ∈ Y

that is, given a sentence pair, we learn a mapping function F that takes the two sentences as input and outputs a label y from the task's label set Y.

A typical example is the paraphrase task: decide whether two sentences are semantically equivalent, so the label set is the binary set {equivalent, not equivalent}. Many other tasks also fall into this category, such as similar-question matching and answer selection in question answering systems.

In the previous post I described an unsupervised sentence matching method based on Doc2vec and Word2vec; here I tackle the same task with traditional machine learning algorithms. In this setting, the mapping function F is fitted by training a classification model; once trained, the model can be applied directly to unseen sentence pairs to predict their labels.
Classification algorithms:
Common classification models include logistic regression (LR), naive Bayes, SVM, GBDT and random forest (RandomForest). This post uses four of them: logistic regression (LR), SVM, GBDT and random forest (RandomForest). Since scikit-learn ships implementations of the common machine learning algorithms (classification, regression, clustering, etc.), all experiments below use scikit-learn, version 0.17.1.
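Because scikit-learn estimators all share the same fit/predict interface, the four classifiers used below can be compared in a single loop instead of four separate scripts. A minimal sketch of that pattern (the data here is synthetic, purely to illustrate; the real features come from doc2vec as described later):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn import metrics

# Synthetic stand-in data: 200 samples, 10 features, a simple linear rule.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)
X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]

models = {
    'LR': LogisticRegression(),
    'SVM': SVC(),
    'GBDT': GradientBoostingClassifier(n_estimators=10),
    'RandomForest': RandomForestClassifier(n_estimators=10),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    prediction = model.predict(X_test)
    results[name] = metrics.accuracy_score(y_test, prediction)
    print(name, results[name])
```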
Feature selection:
Since I have been working with doc2vec and Word2vec recently, and the comparison in the previous post showed that doc2vec sentence vectors outperform sentence vectors obtained by averaging Word2vec word vectors, I use doc2vec here to represent each sentence. The vectors are 100-dimensional, and each sentence's 100-dim doc2vec vector is fed directly into the classifier as its features.
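Concretely, the feature row for a question pair is just the two 100-dim doc2vec vectors concatenated into a 200-dim vector. A minimal numpy sketch of that lookup-and-concatenate step (the embeddings here are random stand-ins for real doc2vec output):

```python
import numpy as np

DIM = 100
# Stand-in for the real doc2vec table: question id -> 100-dim vector.
rng = np.random.RandomState(0)
embeddings_index = {qid: rng.rand(DIM) for qid in range(1, 6)}

def pair_features(qid1_list, qid2_list, embeddings_index):
    """Concatenate the two question vectors into one 2*DIM feature row."""
    v1 = np.stack([embeddings_index[q] for q in qid1_list])
    v2 = np.stack([embeddings_index[q] for q in qid2_list])
    return np.hstack((v1, v2))

X = pair_features([1, 2], [3, 4], embeddings_index)
print(X.shape)  # (2, 200)
```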
Dataset:
The dataset is Quora's Question Pairs semantic-equivalence dataset, the same one used in the previous post, which can be downloaded from this link: 点击打开链接. It contains over 400,000 labeled question pairs; if two questions are semantically equivalent the label is 1, otherwise 0. Counting distinct questions, there are over 530,000 in total. The format is shown in the figure below:

After collecting all questions, I train a doc2vec vector for each question and use it as the classifier's feature input. The corpus is randomly shuffled, 10,000 pairs are split off as a validation set, and the rest is used for training.
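The manual shuffle-and-slice split used in the code below can equivalently be done with scikit-learn's `train_test_split` (in version 0.17 it lives in `sklearn.cross_validation`; in later versions it moved to `sklearn.model_selection`). A small sketch:

```python
import numpy as np
from sklearn.model_selection import train_test_split  # sklearn >= 0.18

X = np.arange(40).reshape(20, 2)
y = np.arange(20)
# Shuffle, then hold out 5 samples as the validation set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=5, random_state=42)
print(X_train.shape, X_test.shape)  # (15, 2) (5, 2)
```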
The training code follows. The data loading and doc2vec lookup code is shared by all four models, so it comes first:
```python
# coding:utf-8
import datetime
import os

import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier


def load_data(datapath):
    """Read the TSV and return question ids, question texts and labels."""
    data_train = pd.read_csv(datapath, sep='\t', encoding='utf-8')
    print(data_train.shape)
    qid1, qid2, question1, question2, labels = [], [], [], [], []
    for idx in range(data_train.id.shape[0]):
        qid1.append(data_train.qid1[idx])
        qid2.append(data_train.qid2[idx])
        question1.append(data_train.question1[idx])
        question2.append(data_train.question2[idx])
        labels.append(data_train.is_duplicate[idx])
    return qid1, qid2, question1, question2, labels


def load_doc2vec(word2vecpath):
    """Load pre-trained doc2vec vectors: one 'id<TAB>values' line per question."""
    embeddings_index = {}
    with open(word2vecpath) as f:
        for line in f:
            values = line.split('\t')
            qid = values[0]
            coefs = np.asarray(values[1].split(), dtype='float32')
            embeddings_index[int(qid) + 1] = coefs
    print('Total %s doc vectors.' % len(embeddings_index))
    return embeddings_index


def sentence_represention(qid, embeddings_index):
    """Look up the 100-dim doc2vec vector for every question id."""
    vectors = np.zeros((len(qid), 100))
    for i in range(len(qid)):
        vectors[i] = embeddings_index.get(qid[i])
    return vectors
```
Replace the dataset path and the doc2vec path in main() with your own, and the code can be used as-is.
1. Logistic regression (LR):
```python
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = 'D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector'
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    # Concatenate the two 100-dim sentence vectors into one 200-dim feature row.
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)

    # Shuffle, then hold out 10000 pairs as the validation set.
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    lr = LogisticRegression()
    print('*********************** training ***********************')
    lr.fit(train_vectors, train_labels)
    print('*********************** predict ************************')
    prediction = lr.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)


if __name__ == '__main__':
    main()
```
2. SVM:
```python
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = 'D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector'
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)

    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    svm = SVC()
    print('*********************** training ***********************')
    svm.fit(train_vectors, train_labels)
    print('*********************** predict ************************')
    prediction = svm.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)


if __name__ == '__main__':
    main()
```
3. GBDT:
```python
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = 'D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector'
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)

    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    # These are the scikit-learn defaults, spelled out explicitly.
    gbdt = GradientBoostingClassifier(loss='deviance', learning_rate=0.1,
                                      n_estimators=100, max_depth=3,
                                      subsample=1.0, min_samples_split=2,
                                      min_samples_leaf=1, random_state=None)
    print('*********************** training ***********************')
    gbdt.fit(train_vectors, train_labels)
    print('*********************** predict ************************')
    prediction = gbdt.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)


if __name__ == '__main__':
    main()
```
4. Random forest (RandomForest):
```python
def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = 'D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector'
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)

    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    randomforest = RandomForestClassifier()
    print('*********************** training ***********************')
    randomforest.fit(train_vectors, train_labels)
    print('*********************** predict ************************')
    prediction = randomforest.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print(accuracy)
    end = datetime.datetime.now()
    print(end - start)


if __name__ == '__main__':
    main()
```
The final results:

LR: 68.56%
SVM: 69.77%
GBDT: 71.4%
RandomForest: 78.36% (best of several runs)

Random forest achieves the best accuracy; in terms of running time, SVM is the slowest.
Future work:
There is still plenty of room for improvement in both feature selection and classifier parameter tuning. I believe that mining more useful features and tuning the model parameters would yield even better results.
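One straightforward way to do that tuning is a grid search over classifier hyperparameters (in scikit-learn 0.17 this is `sklearn.grid_search.GridSearchCV`; in later versions it moved to `sklearn.model_selection`). A sketch on synthetic data, with an illustrative grid rather than tuned values:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, just to show the tuning pattern.
rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = (X[:, 0] > 0.5).astype(int)

param_grid = {'n_estimators': [10, 50], 'max_depth': [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```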
The full code is on my GitHub: 点击打开链接