Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context (Paper Summary)

Paper: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Code: Transformer-XL code

1. Paper Overview

Transformer-XL = Transformer Extra Long

2. What is Transformer-XL

XLNet adopts the Segment Recurrence Mechanism and the Relative Positional Encoding from Transformer-XL as optimizations.

The Segment Recurrence Mechanism caches the hidden states produced for the previous segment of text and reuses them when computing the current segment, so the model has access to a much broader context.

Once the previous segment's information is brought in, two tokens may end up with the same positional information; for example, the first word of the previous segment and the first word of the current segment share the same absolute position. Transformer-XL therefore uses Relative Positional Encoding: instead of fixed absolute positions, it encodes the relative distance between words.
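To make the position clash concrete, here is a minimal sketch (illustrative only; the segment length and shapes are invented for the example) showing that absolute positions repeat across segments while query-key relative distances stay unambiguous:

```python
import torch

seg_len = 4  # hypothetical segment length for illustration

# Absolute positions restart at 0 in every segment, so the first token of the
# previous segment and the first token of the current segment both get position 0.
prev_abs = torch.arange(seg_len)   # tensor([0, 1, 2, 3])
curr_abs = torch.arange(seg_len)   # tensor([0, 1, 2, 3])  -> same values, ambiguous

# Relative encoding instead uses the distance between each query position in the
# current segment and every key position (cached previous segment + current segment).
q_pos = torch.arange(seg_len)               # queries live in the current segment: 0..3
k_pos = torch.arange(-seg_len, seg_len)     # keys: previous segment (-4..-1) then current (0..3)
rel_dist = q_pos[:, None] - k_pos[None, :]  # (4, 8) matrix of offsets; no two keys collide
print(rel_dist[0])                          # tensor([ 4,  3,  2,  1,  0, -1, -2, -3])
```

Negative offsets correspond to future positions and would be masked in a causal language model; the point is only that relative distances remain distinct no matter which segment a key came from.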

3. Vanilla Transformer Language Models: Brief Introduction and Drawbacks

3.1 Brief Introduction

3.2 Drawbacks

3.2.1 Training with the Vanilla Model (problems at training time)

1. Tokens at the beginning of each segment do not have sufficient context for proper optimization.

2. Limited by a fixed-length context: no information flows across segment boundaries.

3.2.2 Evaluation with the Vanilla Model

1. The longest possible context is still limited by the segment length.

2. Very expensive due to recomputation: the hidden states of the whole window are recomputed from scratch for every new position (see the sketch below).
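For concreteness, the following is a hedged sketch of that evaluation loop, assuming a hypothetical `model(window)` callable that runs a full forward pass over a fixed-length window and returns one prediction per position:

```python
def vanilla_evaluate(model, tokens, seg_len):
    """Sliding-window evaluation of a vanilla fixed-length Transformer LM.

    tokens  : list of token ids for the evaluation corpus
    seg_len : the fixed context length the model was trained with
    """
    predictions = []
    # Slide the window by ONE token at a time so every prediction sees the
    # longest context the model allows ...
    for t in range(seg_len, len(tokens)):
        window = tokens[t - seg_len:t]   # context is still capped at seg_len
        logits = model(window)           # full forward pass, nothing is cached
        predictions.append(logits[-1])   # ... yet only the last position is kept
    return predictions
```

Every step redoes the work for seg_len - 1 tokens that were already processed in the previous step, which is the recomputation cost noted above, and the effective context never exceeds seg_len.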

3.2.3 Temporal Incoherence

4. Transformer-XL: Contributions and Main Improvements

4.1 Introduction to Transformer-XL

4.1.1 Training with Transformer-XL

4.1.2 Evaluation with Transformer-XL

4.1.3 Solution: Relative Positional Encodings

Benefits:

1. Allows the recurrence mechanism: positions no longer clash between the cached segment and the current one.

2. Better generalization: the memory can be made much longer at evaluation time than during training.

-> WordLM: Train with memory length 150, evaluate with 640

-> CharLM: Train with memory length 680, evaluate with 3800
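As a reference, the relative attention score in the paper decomposes into four terms; the sketch below writes it out for a single query/key pair and a single head. The names W_kR, u, and v follow the paper's notation, but the function itself is only an illustrative simplification (no scaling, masking, or the efficient relative-shift trick used in the real implementation):

```python
import torch

def rel_attn_score(q_i, k_j, r_ij, W_kR, u, v):
    """Simplified Transformer-XL relative attention score (single head).

    q_i  : (d,)   query vector (content term, already projected)
    k_j  : (d,)   key vector   (content term, already projected)
    r_ij : (d,)   sinusoidal encoding of the relative distance i - j
    W_kR : (d, d) projection applied to the relative position encoding
    u, v : (d,)   learned global bias vectors shared across all positions
    """
    pos = W_kR @ r_ij
    return (q_i @ k_j      # (a) content-based addressing
            + q_i @ pos    # (b) content-dependent positional bias
            + u @ k_j      # (c) global content bias
            + v @ pos)     # (d) global positional bias
```

Because positional information enters only through the relative encoding R_{i-j} and the shared biases u and v, the same parameters work for any memory length, which is what makes training with a short memory and evaluating with a much longer one possible (the 150 -> 640 and 680 -> 3800 settings above).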

 

4.2 Segment-level Recurrence

Cache and reuse the hidden states computed for the previous segment.

Analogous to truncated BPTT for RNNs: pass the last hidden state to the next segment as its initial hidden state, except that here a whole sequence of hidden states is cached per layer (see the sketch below).
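A minimal sketch of that caching scheme, assuming a list of layer callables where each layer takes the current hidden states as queries and an extended `context` as keys/values (this interface is hypothetical, not the official code):

```python
import torch

def forward_segment(layers, seg_input, mems):
    """Run one segment through the model with segment-level recurrence.

    seg_input : (seg_len, d) hidden states of the current segment's tokens
    mems      : list with one (mem_len, d) cached tensor per layer,
                taken from the previous segment
    Returns the final hidden states and the updated memories to cache.
    """
    h, new_mems = seg_input, []
    for layer, mem in zip(layers, mems):
        # Cache this layer's input so the NEXT segment can attend to it;
        # detach() stops gradients from flowing back into the previous
        # segment, just like truncating BPTT.
        new_mems.append(h.detach())
        # Queries come from the current segment only, while keys/values span
        # the cached memory plus the current segment.
        h = layer(h, context=torch.cat([mem, h], dim=0))
    return h, new_mems
```

In a real implementation the new memory is typically concatenated with the old one and trimmed to a fixed memory length; the point here is only that cached states are reused as extra context without receiving gradients.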

4.3 Keeping Temporal Information Coherent

5. Summary



Reposted from blog.csdn.net/keeppractice/article/details/119790553