Bidirectional LSTM-CRF Models for Sequence Tagging

Z. H. Huang, W. Xu, K. Yu, Bidirectional LSTM-CRF Models for Sequence Tagging, (2015)

Abstract

Sequence tagging models based on long short-term memory (LSTM) networks: LSTM, bidirectional LSTM (BI-LSTM), LSTM with a conditional random field layer (LSTM-CRF), and bidirectional LSTM with a conditional random field layer (BI-LSTM-CRF).

BI-LSTM-CRF model: the BI-LSTM component makes full use of both past and future input features; the CRF layer makes use of sentence-level tag information.

1 Introduction

Sequence tagging includes part-of-speech (POS) tagging, chunking, and named entity recognition (NER).

Most existing sequence tagging models are linear statistical models, such as Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMMs), and Conditional Random Fields (CRF).

The paper presents four sequence tagging models: LSTM, BI-LSTM, LSTM-CRF, and BI-LSTM-CRF.

BI-LSTM uses both past and future input features; the CRF layer uses sentence-level tag information.
BI-LSTM-CRF is robust and depends less on word embeddings.

2 Models

LSTM, BI-LSTM, LSTM-CRF, BI-LSTM-CRF

2.1 LSTM Networks

Recurrent neural networks (RNN): keep a memory based on history information, which lets the model predict the current output conditioned on long-distance features. The network has an input layer $x$, a hidden layer $h$, and an output layer $y$.

The input layer represents the features at time step $t$ and has the same dimensionality as the feature size;
The output layer represents a probability distribution over labels at time step $t$ and has the same dimensionality as the number of labels.

An RNN introduces a connection between the previous hidden state and the current hidden state, and with it the recurrent layer weight parameters. The recurrent layer is designed to store history information.

$\mathbf{h}_{t} = f( \mathbf{U} \mathbf{x}_{t} + \mathbf{W} \mathbf{h}_{t - 1}) \tag {1}$

$\mathbf{y}_{t} = g( \mathbf{V} \mathbf{h}_{t}) \tag {2}$

where $\mathbf{U}$, $\mathbf{W}$, and $\mathbf{V}$ are the connection weights (computed during training), and $f(z)$ and $g(z_{m})$ are the sigmoid and softmax activation functions respectively.

$f(z) = \frac{1}{1 + e^{-z}} \tag {3}$

$g(z_{m}) = \frac{e^{z_{m}}}{\sum_{k} e^{z_{k}}} \tag {4}$
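Equations (1) to (4) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the layer sizes, the random weights, and the function name `rnn_forward` are all assumptions made for the example.

```python
import numpy as np

def sigmoid(z):
    # Equation (3): element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Equation (4); subtracting the max only improves numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

def rnn_forward(xs, U, W, V):
    """Apply equations (1) and (2) to a sequence of feature vectors xs."""
    h = np.zeros(W.shape[0])          # initial hidden state
    ys = []
    for x in xs:
        h = sigmoid(U @ x + W @ h)    # equation (1)
        ys.append(softmax(V @ h))     # equation (2)
    return np.array(ys)

# Hypothetical sizes, chosen only for the example
rng = np.random.default_rng(0)
n_feat, n_hidden, n_labels, T = 4, 3, 5, 6
U = rng.normal(size=(n_hidden, n_feat))
W = rng.normal(size=(n_hidden, n_hidden))
V = rng.normal(size=(n_labels, n_hidden))
xs = rng.normal(size=(T, n_feat))
ys = rnn_forward(xs, U, W, V)         # one label distribution per time step
```

Each row of `ys` is a probability distribution over the labels, so it has as many entries as there are labels and sums to 1.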

LSTM (Long Short-Term Memory) networks: the hidden layer updates are replaced by purpose-built memory cells, which helps capture long-range dependencies in the data.

■The structure shown in Figure 2 of the paper is inaccurate; for example, $\mathbf{h}_{t - 1}$ is not fed back to the gate inputs.■

LSTM memory cell:

$\begin{aligned} \mathbf{i}_{t} = & \sigma ( \mathbf{W}_{xi} \mathbf{x}_{t} + \mathbf{W}_{hi} \mathbf{h}_{t - 1} + \mathbf{W}_{ci} \mathbf{c}_{t - 1} + \mathbf{b}_{i} ) \\ \mathbf{f}_{t} = & \sigma ( \mathbf{W}_{xf} \mathbf{x}_{t} + \mathbf{W}_{hf} \mathbf{h}_{t - 1} + \mathbf{W}_{cf} \mathbf{c}_{t - 1} + \mathbf{b}_{f} ) \\ \mathbf{c}_{t} = & \mathbf{f}_{t} \mathbf{c}_{t - 1} + \mathbf{i}_{t} \tanh ( \mathbf{W}_{xc} \mathbf{x}_{t} + \mathbf{W}_{hc} \mathbf{h}_{t - 1} + \mathbf{b}_{c} ) \\ \mathbf{o}_{t} = & \sigma ( \mathbf{W}_{xo} \mathbf{x}_{t} + \mathbf{W}_{ho} \mathbf{h}_{t - 1} + \mathbf{W}_{co} \mathbf{c}_{t} + \mathbf{b}_{o} ) \\ \mathbf{h}_{t} = & \mathbf{o}_{t} \tanh ( \mathbf{c}_{t} ) \\ \end{aligned}$

where $\sigma$ is the logistic sigmoid function; $\mathbf{i}$, $\mathbf{f}$, $\mathbf{o}$, $\mathbf{c}$, and $\mathbf{h}$ are the input gate, forget gate, output gate, cell, and hidden vectors respectively, all of the same size; the subscripts of $\mathbf{W}$ name the vectors that each matrix connects. The weight matrices from the cell to the gate vectors, such as $\mathbf{W}_{ci}$, are diagonal, so element $m$ of each gate vector depends only on element $m$ of the cell vector.
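One step of the memory cell can be sketched in NumPy as follows. This is a sketch under stated assumptions: the diagonal cell-to-gate matrices are stored as vectors (`w_ci`, `w_cf`, `w_co`) and applied element-wise, the cell update uses its own bias named `b_c` here, and the helper `init_params`, the sizes, and the random initialisation are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of the LSTM memory cell with diagonal cell-to-gate weights."""
    # Input and forget gates look at the previous cell state c_{t-1}
    i = sigmoid(p["W_xi"] @ x + p["W_hi"] @ h_prev + p["w_ci"] * c_prev + p["b_i"])
    f = sigmoid(p["W_xf"] @ x + p["W_hf"] @ h_prev + p["w_cf"] * c_prev + p["b_f"])
    # New cell state: keep part of the old state, add gated new content
    c = f * c_prev + i * np.tanh(p["W_xc"] @ x + p["W_hc"] @ h_prev + p["b_c"])
    # Output gate looks at the updated cell state c_t
    o = sigmoid(p["W_xo"] @ x + p["W_ho"] @ h_prev + p["w_co"] * c + p["b_o"])
    h = o * np.tanh(c)
    return h, c

def init_params(n_in, n_hid, rng):
    """Hypothetical helper: small random weights, zero biases."""
    p = {}
    for g in "ifco":
        p[f"W_x{g}"] = rng.normal(scale=0.1, size=(n_hid, n_in))
        p[f"W_h{g}"] = rng.normal(scale=0.1, size=(n_hid, n_hid))
        p[f"b_{g}"] = np.zeros(n_hid)
    for g in "ifo":
        # A diagonal matrix times a vector is an element-wise product,
        # so only the diagonal is stored
        p[f"w_c{g}"] = rng.normal(scale=0.1, size=n_hid)
    return p

rng = np.random.default_rng(0)
n_in, n_hid, T = 4, 3, 5              # sizes chosen only for the example
params = init_params(n_in, n_hid, rng)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(T, n_in)):
    h, c = lstm_step(x, h, c, params)
```

Storing only the diagonal makes the constraint explicit: element $m$ of a gate can only see element $m$ of the cell vector.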


2.2 Bidirectional LSTM Networks

Bidirectional LSTM network: for a specific time frame, makes use of both past features (via forward states) and future features (via backward states).

Training uses back-propagation through time (BPTT). The forward and backward passes are run over whole sentences, and the hidden states only need to be reset to $0$ at the beginning of each sentence.
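The idea of forward and backward states can be sketched as below. To keep the example short, a plain tanh recurrence stands in for the LSTM cell, and the two directions are combined by concatenation; both choices, along with the function names and sizes, are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def run_direction(xs, U, W):
    """Simple tanh recurrence over xs (a stand-in for an LSTM layer)."""
    h = np.zeros(W.shape[0])          # hidden state reset to 0 at sentence start
    out = []
    for x in xs:
        h = np.tanh(U @ x + W @ h)
        out.append(h)
    return np.array(out)

def bidirectional(xs, fwd, bwd):
    """Return per-time-step states built from past and future features."""
    h_f = run_direction(xs, *fwd)             # reads x_1 ... x_T
    h_b = run_direction(xs[::-1], *bwd)[::-1] # reads x_T ... x_1, then re-aligned
    return np.concatenate([h_f, h_b], axis=1)

rng = np.random.default_rng(0)
n_in, n_hid, T = 4, 3, 5              # sizes chosen only for the example
fwd = (rng.normal(scale=0.5, size=(n_hid, n_in)),
       rng.normal(scale=0.5, size=(n_hid, n_hid)))
bwd = (rng.normal(scale=0.5, size=(n_hid, n_in)),
       rng.normal(scale=0.5, size=(n_hid, n_hid)))
xs = rng.normal(size=(T, n_in))
states = bidirectional(xs, fwd, bwd)  # shape (T, 2 * n_hid)
```

In this sketch the first half of `states[t]` summarises the inputs up to position `t`, and the second half summarises the inputs from position `t` to the end of the sentence.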


2.3 CRF Networks

There are two ways to make use of neighbor tag information when predicting the current tag:

(1) Predict a distribution of tags for each time step, then use beam-like decoding to find optimal tag sequences, as in the maximum entropy classifier and Maximum Entropy Markov Models (MEMMs).

(2) Focus on the sentence level instead of individual positions, as in Conditional Random Fields (CRF) models, where inputs and outputs are directly connected.


2.4 LSTM-CRF Networks

An LSTM-CRF network uses an LSTM layer to process past input features and a CRF layer to exploit sentence-level tag information.

The CRF layer has a state transition matrix as its parameters. With this layer, the model can use past and future tags to predict the current tag.

The network outputs a matrix of scores $f_{\theta} ([x]_{1}^{T})$. The element $[f_{\theta}]_{i, t}$ is the score output by the network with parameters $\theta$ for the sentence $[x]_{1}^{T}$ and the $i$-th tag at the $t$-th word. A transition score $[A]_{i, j}$ models the transition from the $i$-th state to the $j$-th state for a pair of consecutive time steps. The transition matrix is position independent.

Denote the full parameter set by $\tilde{\theta} = \theta \cup \{ [A]_{i, j} \forall i, j \}$. The score of a sentence $[x]_{1}^{T}$ along a path of tags $[i]_{1}^{T}$ is then the sum of transition scores and network scores:

$s([x]_{1}^{T}, [i]_{1}^{T}, \tilde{\theta}) = \sum_{t = 1}^{T} \left( [A]_{[i]_{t - 1}, [i]_{t}} + [f_{\theta}]_{[i]_{t}, t} \right) \tag {5}$

Dynamic programming can be used to compute $[A]_{i, j}$ and the optimal tag sequences for inference efficiently.
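The path score of equation (5) and a dynamic-programming search for the best tag path can be sketched as follows. This is an illustrative Viterbi-style sketch, not the paper's code. Equation (5) refers to $[i]_{0}$ at $t = 1$, which these notes do not define, so the sketch assumes a separate vector of start scores `a_start`; that vector, the function names, and the random scores are all hypothetical.

```python
import numpy as np

def path_score(f, A, a_start, tags):
    """Equation (5): sum of transition and network scores along `tags`.

    f: (K, T) network scores, A: (K, K) transition scores,
    a_start: (K,) assumed scores for starting in each tag.
    """
    score = a_start[tags[0]] + f[tags[0], 0]
    for t in range(1, len(tags)):
        score += A[tags[t - 1], tags[t]] + f[tags[t], t]
    return score

def viterbi(f, A, a_start):
    """Return the highest-scoring tag path and its score."""
    K, T = f.shape
    delta = a_start + f[:, 0]              # best score of a path ending in each tag
    back = np.zeros((T, K), dtype=int)     # back-pointers to the previous tag
    for t in range(1, T):
        cand = delta[:, None] + A          # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + f[:, t]
    best = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):          # follow back-pointers from the last step
        best.append(int(back[t, best[-1]]))
    return best[::-1], float(delta.max())

rng = np.random.default_rng(0)
K, T = 3, 4                                # 3 tags, 4 words; example sizes only
f = rng.normal(size=(K, T))
A = rng.normal(size=(K, K))
a_start = rng.normal(size=K)
tags, score = viterbi(f, A, a_start)
```

Because the transition matrix is position independent, the same `A` is reused at every step, and the search costs on the order of $T K^{2}$ operations rather than enumerating all $K^{T}$ paths.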

■■

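In equation (5):

$t$: time step, $t = 1, 2, \cdots, T$

$[x]_{1}^{T}$: the input sentence sequence of the LSTM network, $[x]_{1}^{T} = ( x_{1}, x_{2}, \cdots, x_{T} )$

$[i]_{1}^{T}$: the output tag sequence of the LSTM-CRF, $[i]_{1}^{T} = ( i_{1}, i_{2}, \cdots, i_{T} )$, where each $i_{t}$ ranges over all possible tags

$f_{\theta} ([x]_{1}^{T})$: the scores output by the LSTM network

$[A]_{i, j}$: the CRF score for a transition from tag $i$ to tag $j$; this score is independent of the time step $t$.

$s([x]_{1}^{T}, [i]_{1}^{T}, \tilde{\theta})$: the total score output by the LSTM-CRF for the given sequence $[x]_{1}^{T}$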

$\begin{aligned} s([x]_{1}^{T}, [i]_{1}^{T}, \tilde{\theta}) = \sum_{t = 1}^{T} s([x]_{t}, [i]_{t}, \tilde{\theta}) = \sum_{t = 1}^{T} \left( [A]_{[i]_{t - 1}, [i]_{t}} + [f_{\theta}]_{[i]_{t}, t} \right) \end{aligned}$

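At time step $t$, the LSTM-CRF score is $s([x]_{t}, [i]_{t}, \tilde{\theta})$ (the CRF transition score depends on the tag at time step $t - 1$; the LSTM score depends on the current input and, through the hidden state, on time steps $1, 2, \cdots, t - 1$):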

$s([x]_{t}, [i]_{t}, \tilde{\theta}) = [A]_{[i]_{t - 1}, [i]_{t}} + [f_{\theta}]_{[i]_{t}, t}$

■

2.5 BI-LSTM-CRF Networks

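BI-LSTM-CRF processes both past and future input features, in addition to the sentence-level tag information used by the CRF layer.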


3 Training Procedure

Model training: a forward and backward training procedure based on stochastic gradient descent (SGD).


4 Experiments

POS tagging: assigns each word a unique tag that indicates its syntactic role.

Chunking: each word is tagged with its phrase type.

NER: each word is tagged with other or one of four entity types: Person, Location, Organization, or Miscellaneous.

Chunking and NER use the BIO2 annotation standard.
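As an illustration of BIO2, the sketch below converts entity spans into per-word tags: the first word of every chunk gets a `B-` prefix, the following words of the same chunk get `I-`, and everything else is `O`. The sentence, the spans, the type abbreviations `PER` and `LOC`, and the helper name `spans_to_bio2` are made up for the example.

```python
def spans_to_bio2(n_tokens, spans):
    """Convert (start, end_exclusive, type) spans into BIO2 tags."""
    tags = ["O"] * n_tokens               # words outside any chunk
    for start, end, typ in spans:
        tags[start] = f"B-{typ}"          # every chunk begins with B-
        for k in range(start + 1, end):
            tags[k] = f"I-{typ}"          # remaining words inside the chunk
    return tags

# Hypothetical example sentence and entity spans
tokens = ["John", "Smith", "visited", "New", "York", "."]
spans = [(0, 2, "PER"), (3, 5, "LOC")]
tags = spans_to_bio2(len(tokens), spans)
# tags == ['B-PER', 'I-PER', 'O', 'B-LOC', 'I-LOC', 'O']
```

With this encoding each word carries exactly one tag, which is the form the sequence tagging models above predict.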