Reading Wikipedia to Answer Open-Domain Questions

D. Q. Chen, A. Fisch, J. Weston, A. Bordes, Reading Wikipedia to Answer Open-Domain Questions, ACL (2017)

Abstract

Task: open-domain question answering.

Knowledge source: Wikipedia, used as the unique knowledge source.

The answer to any factoid question is a text span in a Wikipedia article.

Machine reading at scale combines two challenges: (1) document retrieval, i.e., finding the relevant articles; (2) machine comprehension of text, i.e., identifying the answer from the content of those articles.

The paper looks for answers in Wikipedia paragraphs using two modules: (1) a search component based on bigram hashing and TF-IDF matching; (2) a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs.

1 Introduction

The paper answers factoid questions in an open-domain setting using Wikipedia as the unique knowledge source.

Knowledge bases (KBs): easier for computers to process, but too sparsely populated for open-domain question answering.

A question answering (QA) system that uses Wikipedia as its knowledge source has to handle two challenges: (1) large-scale open-domain QA; (2) machine comprehension of text. The former retrieves the few relevant articles; the latter identifies the answer from them. The paper calls this setting machine reading at scale (MRS).

2 Related Work

Definition of open-domain QA: originally defined as finding answers in collections of unstructured documents.

Limitations of knowledge bases: incompleteness and fixed schemas.

Machine comprehension of text: answering questions after reading a short text or story.

3 The DrQA System

DrQA has two components: (1) the Document Retriever, which finds relevant articles; (2) the Document Reader, a machine comprehension model that extracts answers from a single document or a small collection of documents.

3.1 Document Retriever

A non-machine-learning document retrieval system: articles and questions are compared using bigram counts with TF-IDF weighting; the bigrams are mapped to $2^{24}$ bins with an unsigned murmur3 hash; for each question, 5 Wikipedia articles are returned.
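The retrieval step above can be sketched roughly as follows. This is a minimal illustration, not the DrQA implementation: `zlib.crc32` stands in for the unsigned murmur3 hash so that the example needs only the standard library, unigrams are counted alongside bigrams, and the smoothed TF-IDF weighting and the function names (`hashed_ngrams`, `build_index`, `retrieve`) are assumptions made for this example.

```python
import math
import zlib
from collections import Counter

NUM_BINS = 2 ** 24  # number of hash bins, as stated in the notes


def hashed_ngrams(text):
    """Count the unigrams and bigrams of `text` after hashing them into bins."""
    tokens = text.lower().split()
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # Assumption: crc32 replaces the unsigned murmur3 hash used by the paper.
    return Counter(zlib.crc32(g.encode("utf-8")) % NUM_BINS for g in grams)


def build_index(docs):
    """Return sparse TF-IDF vectors (dicts keyed by bin) and the IDF table."""
    counts = [hashed_ngrams(d) for d in docs]
    doc_freq = Counter(b for c in counts for b in c)
    n = len(docs)
    # Smoothed IDF (an assumption); always positive.
    idf = {b: math.log((1 + n) / (1 + df)) + 1.0 for b, df in doc_freq.items()}
    vectors = [{b: math.log1p(tf) * idf[b] for b, tf in c.items()} for c in counts]
    return vectors, idf


def retrieve(question, vectors, idf, k=5):
    """Return the indices of the k documents with the highest dot-product score."""
    q_counts = hashed_ngrams(question)
    q_vec = {b: math.log1p(tf) * idf[b] for b, tf in q_counts.items() if b in idf}
    scores = [sum(w * v.get(b, 0.0) for b, w in q_vec.items()) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

Hashing keeps the index size fixed regardless of how many distinct bigrams occur, at the cost of occasional collisions between unrelated bigrams.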

3.2 Document Reader

Given a question of $l$ tokens, $q = \{ q_{1}, \dots, q_{l} \}$ , and a paragraph of $m$ tokens, $p = \{ p_{1}, \dots, p_{m} \}$ , the paper applies an RNN model to each paragraph in turn and then aggregates the predicted answers.

Paragraph encoding

Each token $p_{i}$ in paragraph $p$ is represented as a feature vector $\tilde{\mathbf{p}}_{i} \in \mathbb{R}^{d}$ , and the resulting sequence of feature vectors is fed to a recurrent neural network:

$\{ \mathbf{p}_{1}, \dots, \mathbf{p}_{m} \} = \text{RNN}(\{ \tilde{\mathbf{p}}_{1}, \dots, \tilde{\mathbf{p}}_{m} \})$

where $\mathbf{p}_{i}$ is expected to encode useful context information around token $p_{i}$ . The paper uses a multi-layer bidirectional long short-term memory (LSTM) network and takes $\mathbf{p}_{i}$ to be the concatenation of each layer's hidden units in the end.

The feature vector $\tilde{\mathbf{p}}_{i}$ combines the following features:

Word embeddings: $f_{\text{emb}} (p_{i}) = \text{E} (p_{i})$ . The paper uses 300-dimensional GloVe word embeddings, keeps most of them fixed, and fine-tunes only the 1000 most frequent question words, because the representations of key words such as what, how, which, many could be crucial for QA systems.
Exact match: $f_{\text{exact\_match}} (p_{i}) = \mathbb{I} (p_{i} \in q)$ . Three simple binary features indicate whether $p_{i}$ can be exactly matched to one question word in $q$ , in its original, lowercase, or lemma form.
Token features: $f_{\text{token}} (p_{i}) = \left( \text{POS} (p_{i}), \text{NER} (p_{i}), \text{TF} (p_{i}) \right)$ . A few manual features describe properties of token $p_{i}$ : its part-of-speech (POS) tag, its named entity recognition (NER) tag, and its normalized term frequency (TF).
Aligned question embedding: $f_{\text{align}} (p_{i}) = \sum_{j} a_{i, j} \text{E} (q_{j})$ , where the attention score $a_{i, j}$ captures the similarity between $p_{i}$ and each question word $q_{j}$ :
$a_{i, j} = \frac{ \exp \left( \alpha(\text{E} (p_{i})) \cdot \alpha(\text{E} (q_{j})) \right) }{ \sum_{j^{\prime}} \exp \left( \alpha(\text{E} (p_{i})) \cdot \alpha(\text{E} (q_{j^{\prime}})) \right) }$
where $\alpha(\cdot)$ is a single dense layer with ReLU nonlinearity.
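As an illustration of the aligned question embedding, the sketch below computes the attention weights of one paragraph token over the question words and the resulting weighted sum of question word embeddings. It uses plain Python lists rather than a tensor library; the helper names (`relu_dense`, `softmax`, `aligned_question_embedding`) and the explicit `weight` and `bias` arguments of the dense layer are assumptions made for this example.

```python
import math


def relu_dense(x, weight, bias):
    """alpha(x): a single dense layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weight, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aligned_question_embedding(p_emb, q_embs, weight, bias):
    """Return (f_align, attention) for one paragraph token embedding.

    p_emb  : embedding E(p_i) of the paragraph token
    q_embs : list of embeddings E(q_j) of the question words
    """
    p_proj = relu_dense(p_emb, weight, bias)
    q_projs = [relu_dense(q, weight, bias) for q in q_embs]
    # Dot product between the projected paragraph token and each question word.
    scores = [sum(a * b for a, b in zip(p_proj, qp)) for qp in q_projs]
    attention = softmax(scores)
    dim = len(q_embs[0])
    # Weighted sum of the raw (unprojected) question embeddings E(q_j).
    f_align = [sum(a * q[d] for a, q in zip(attention, q_embs)) for d in range(dim)]
    return f_align, attention
```

Note that the projection $\alpha(\cdot)$ is used only to compute the attention weights; the sum itself is taken over the question word embeddings $\text{E}(q_{j})$ .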

Question encoding

Another recurrent neural network is applied on top of the word embeddings of $q_{j}$ , and the resulting hidden units are combined into one single vector, $\{ \mathbf{q}_{1}, \dots, \mathbf{q}_{l} \} \rightarrow \mathbf{q}$ . The combination is $\mathbf{q} = \sum_{j} b_{j} \mathbf{q}_{j}$ , where $b_{j}$ encodes the importance of each question word:

$b_{j} = \frac{\exp (\mathbf{w} \cdot \mathbf{q}_{j})}{\sum_{j^{\prime}} \exp (\mathbf{w} \cdot \mathbf{q}_{j^{\prime}})}$

and $\mathbf{w}$ is a weight vector to learn.
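The weighted pooling of the question hidden states can be sketched as follows. The sketch assumes the hidden states $\mathbf{q}_{j}$ have already been produced by the question RNN and are passed in as plain lists; the function name `encode_question` is hypothetical.

```python
import math


def encode_question(q_hidden, w):
    """Pool question hidden states into one vector q = sum_j b_j * q_j.

    q_hidden : list of hidden-state vectors q_j (one per question token)
    w        : weight vector, learned in the paper and fixed here
    Returns (q, b), where b holds the softmax importance weights b_j.
    """
    scores = [sum(wk * qk for wk, qk in zip(w, q_j)) for q_j in q_hidden]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    b = [e / total for e in exps]
    dim = len(q_hidden[0])
    q = [sum(b_j * q_j[d] for b_j, q_j in zip(b, q_hidden)) for d in range(dim)]
    return q, b
```

With $\mathbf{w} = \mathbf{0}$ every question word gets the same weight, so the pooling reduces to a plain average of the hidden states.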

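Prediction

Two classifiers are trained independently to predict the start and the end of the answer span. Each uses a bilinear term between the paragraph vector $\mathbf{p}_{i}$ and the question vector $\mathbf{q}$ :

$\begin{aligned} P_{\text{start}} (i) & \propto \exp (\mathbf{p}_{i} \mathbf{W}_{s} \mathbf{q}) \\ P_{\text{end}} (i) & \propto \exp (\mathbf{p}_{i} \mathbf{W}_{e} \mathbf{q}) \end{aligned}$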
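A rough sketch of how these two distributions can be turned into an answer span: compute the bilinear scores, normalize them with a softmax over paragraph positions, and pick the pair $(i, i^{\prime})$ with $i \le i^{\prime} \le i + 15$ that maximizes $P_{\text{start}}(i) \times P_{\text{end}}(i^{\prime})$ , which is the selection rule described in the paper. The function names and the `max_len` parameter are assumptions made for this example, and the matrices are plain nested lists.

```python
import math


def bilinear_scores(p_vecs, W, q):
    """Return the unnormalized score p_i W q for every paragraph position i."""
    Wq = [sum(w * qk for w, qk in zip(row, q)) for row in W]
    return [sum(pk * wk for pk, wk in zip(p, Wq)) for p in p_vecs]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_span(p_vecs, W_start, W_end, q, max_len=15):
    """Return (start, end, score) maximizing P_start(start) * P_end(end).

    Only spans with start <= end <= start + max_len are considered.
    """
    p_start = softmax(bilinear_scores(p_vecs, W_start, q))
    p_end = softmax(bilinear_scores(p_vecs, W_end, q))
    n = len(p_vecs)
    best = (0, 0, -1.0)
    for i in range(n):
        for j in range(i, min(i + max_len, n - 1) + 1):
            score = p_start[i] * p_end[j]
            if score > best[2]:
                best = (i, j, score)
    return best
```

The exhaustive search costs at most `max_len + 1` candidate ends per start position, so it stays linear in the paragraph length.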