note

文章目录

note
一、扩展LLM的Context长度
- 1. 常见方法
- 2. PCW方法
二、NBCE方法
三、RoPE方法
四、FlashAttention方法
Reference

一、扩展LLM的Context长度

1. 常见方法

扩展LLM的Context长度其实已有不少，但多数是通过结合检索或者摘要的方式来缩短样本的长Context，如Unlimiformer。由于不是直接处理长Context，因此通常无法做精细的阅读理解，而且这些方案往往需要在训练阶段就考虑进去，而不是事后即插即用到已有的LLM模型中。

2. PCW方法

以前能够不微调地扩展Context长度的方案是Parallel Context Window（下面简称PCW），出自论文《Parallel Context Windows for Large Language Models》和《Structured Prompting: Scaling In-Context Learning to 1,000 Examples》，两篇论文是同一时期不同作者的工作，但所提的方法只有细微的差别。

PCW适用于Self Attention模型，主要修改包括Position Encoding和Attention Mask：
在这里插入图片描述

二、NBCE方法

使用朴素贝叶斯方法。
在这里插入图片描述
为了改善Random Sample的效果，将Pooling方式改为直接输出不确定性最低的那个分布：
$\begin{array}{r} {[\log p(T \mid S)]=\log p\left(T \mid S_k\right)} \\ k=\arg \min \left\{H_1, H_2, \cdots, H_n\right\} \\ H_i=-\sum_T p\left(T \mid S_i\right) \log p\left(T \mid S_i\right) \end{array}$

三、RoPE方法

RoPE的目标：构建一个位置相关的投影矩阵, 使得
$\left(\mathbf{R}_m \mathbf{q}\right)^{\top}\left(\mathbf{R}_n \mathbf{k}\right)=\mathbf{q}^{\top} \mathbf{R}_m^{\top} \mathbf{R}_n \mathbf{k}=\mathbf{q}^{\top} \mathbf{R}_{n-m} \mathbf{k}$

对位置编码的转换称为位置插值。在这一步中，我们将位置索引从 $\left[0, L^{\prime}\right)$ 减小到 $\mathrm{~L})$ ，以匹配计算 RoPE 之前的原始索引范围。
因此，作为 RoPE 的输入，任意两个标记之间的最大相对距离从 $L^{\prime}$ 减小到 $L$ 。由于我们在扩展之前和之后对位置索引和相对距离的范围进行了对齐，减轻了上下文窗口扩展对注意力得分计算的影响，这使得模型更容易适应。
为了进一步证明这一点，下面的定理表明插值后的注意力得分具有良好的性质：

在这里插入图片描述

比如在chatGLM中也用到了旋转位置编码（下面的GLMBlock中的SelfAttention模块）：

ChatGLMForConditionalGeneration(
  (transformer): ChatGLMModel(
    (word_embeddings): Embedding(130528, 4096)
    (layers): ModuleList(
      (0-27): 28 x GLMBlock(
        (input_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (attention): SelfAttention(
          (rotary_emb): RotaryEmbedding()
          (query_key_value): QuantizedLinear(in_features=4096, out_features=12288, bias=True)
          (dense): QuantizedLinear(in_features=4096, out_features=4096, bias=True)
        )
        (post_attention_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
        (mlp): GLU(
          (dense_h_to_4h): QuantizedLinear(in_features=4096, out_features=16384, bias=True)
          (dense_4h_to_h): QuantizedLinear(in_features=16384, out_features=4096, bias=True)
        )
      )
    )
    (final_layernorm): LayerNorm((4096,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=4096, out_features=130528, bias=False)
)

具体的旋转编码类代码如下：

class RotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, base=10000, precision=torch.half, learnable=False):
        super().__init__()
        inv_freq = 1. / (base ** (torch.arange(0, dim, 2).float() / dim))
        inv_freq = inv_freq.half()
        self.learnable = learnable
        if learnable:
            self.inv_freq = torch.nn.Parameter(inv_freq)
            self.max_seq_len_cached = None
        else:
            self.register_buffer('inv_freq', inv_freq)
            self.max_seq_len_cached = None
            self.cos_cached = None
            self.sin_cached = None
        self.precision = precision

    def _load_from_state_dict(self, state_dict, prefix, local_metadata, strict, missing_keys, unexpected_keys,
                              error_msgs):
        pass

    def forward(self, x, seq_dim=1, seq_len=None):
        if seq_len is None:
            seq_len = x.shape[seq_dim]
        if self.max_seq_len_cached is None or (seq_len > self.max_seq_len_cached):
            self.max_seq_len_cached = None if self.learnable else seq_len
            t = torch.arange(seq_len, device=x.device, dtype=self.inv_freq.dtype)
            freqs = torch.einsum('i,j->ij', t, self.inv_freq)
            # Different from paper, but it uses a different permutation in order to obtain the same calculation
            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
            if self.precision == torch.bfloat16:
                emb = emb.float()

            # [sx, 1 (b * np), hn]
            cos_cached = emb.cos()[:, None, :]
            sin_cached = emb.sin()[:, None, :]
            if self.precision == torch.bfloat16:
                cos_cached = cos_cached.bfloat16()
                sin_cached = sin_cached.bfloat16()
            if self.learnable:
                return cos_cached, sin_cached
            self.cos_cached, self.sin_cached = cos_cached, sin_cached
        return self.cos_cached[:seq_len, ...], self.sin_cached[:seq_len, ...]

    def _apply(self, fn):
        if self.cos_cached is not None:
            self.cos_cached = fn(self.cos_cached)
        if self.sin_cached is not None:
            self.sin_cached = fn(self.sin_cached)
        return super()._apply(fn)

四、FlashAttention方法

最新发布的ChatGLM2-2B就用了这种方法将上下文长度（Context Length）由 ChatGLM-6B 的 2K 扩展到了 32K，并在对话阶段使用 8K 的上下文长度训练，允许更多轮次的对话。
用了哈希感知（hash-aware）的技术，可以根据它们的相似性将输入序列中的元素分配到不同的桶（bucket）中。这样，模型只需要计算桶元素之间的注意力权重，而不是整个序列。