Locating and Editing Factual Associations in GPT

Kevin Meng*1, David Bau*2, Alex Andonian1, Yonatan Belinkov3
1MIT CSAIL, 2Northeastern University, 3Technion - IIT; *Equal Contribution

Update! See our MEMIT paper on scaling to thousands of facts: Mass Editing Memory in a Transformer

ArXiv Preprint · Source Code (GitHub) · Dataset and Models · Demo Colab: Model Editing · Demo Colab: Causal Tracing · YouTube (Yannic Kilcher)

Where are the Facts Inside a Language Model?

Knowing differs from saying: uttering words by rote is different from knowing a fact, because knowledge of a fact generalizes across contexts. In this project, we show that factual knowledge within GPT also corresponds to a localized computation that can be directly edited. For example, we can make a small change to a small set of the weights of GPT-J to teach it the counterfactual "Eiffel Tower is located in the city of Rome." Rather than merely regurgitating the new sentence, it will generalize that specific counterfactual knowledge and apply it in very different linguistic contexts.

Diagram of a GPT transformer prediction

Why Locate Facts?

We are interested in how and where a model stores its factual associations, for two reasons:

  1. To understand huge opaque neural networks. The internal computations of large language models are obscure. Clarifying the processing of facts is one step toward understanding massive transformer networks.
  2. To fix mistakes. Models are often incorrect, biased, or private, and we would like to develop methods that enable debugging and fixing of specific factual errors.

The facts we study take the form of knowledge tuples t = (s, r, o), where s and o are subject and object entities, respectively, and r is the relation connecting the two. For example, (s = Megan Rapinoe, r = plays sport professionally, o = soccer) indicates that Rapinoe plays soccer for a living. Each variable represents an entity or relation that can be found in a knowledge graph and that can be written as a natural language string.

To query GPT for knowledge of a fact, we express (s, r) as a text prompt (by expanding a template from the CounterFact dataset) and check whether the generated continuation matches o.
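
To make the querying step concrete, here is a minimal sketch of prompting a GPT-style model and checking its continuation. It assumes the Hugging Face transformers library and a GPT-2 XL checkpoint, and the prompt template is an illustrative stand-in for an expanded CounterFact template, not the paper's exact evaluation code.

```python
# Minimal sketch: render (s, r) as a prompt and test whether the model's
# greedy continuation matches the object o. Model choice and template are
# illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

prompt = "Megan Rapinoe plays the sport of"  # (s, r) expanded into text
expected_object = "soccer"                   # o

inputs = tok(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
continuation = tok.decode(output_ids[0, inputs["input_ids"].shape[1]:])

print(repr(continuation))
print("fact recalled:", expected_object in continuation)
```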

What Did We Find?

In GPT-style transformer models, we discovered two things:

1. Factual associations can be localized along three dimensions, to (1) MLP module parameters, (2) at a range of middle layers, and (3) specifically during processing of the last token of the subject.

A causal trace of a factual statement in GPT

The causal trace above reveals a small number of states that contain information that can flip the model from one factual prediction to another. Our studies use such causal traces and find evidence that knowledge retrieval occurs in MLP modules at the early site (at (a) in the figure); then attention mechanisms at the late site (at (b) in the figure) bring the information to the end of the computation, where the specific word can be predicted.

2. Individual factual associations can be changed by making small rank-one changes in a single MLP module. We can distinguish between changes in knowledge and superficial changes in language by measuring generalization to other wordings of the same fact.

An example of editing a fact in GPT using the ROME method.

The example above shows that changing the model's processing of a single statement about the Eiffel Tower, if done by changing selected parameters in the right way, results in a change of knowledge that is expressed in a variety of nontrivial contexts.

At (a) in the figure, a single direct statement of the counterfactual is posed and used to compute a rank-one parameter change in a single MLP module. Despite the simplicity of the change, the results at (b) show that for a more complex prompt about travel from Berlin, the model treats the Eiffel Tower as if it is in Rome; similarly, at (c), when asked about nearby sites, the model suggests places in Rome before explicitly mentioning Rome. Changes in predictions across such different contexts are evidence that the change generalizes: the model has not merely learned to parrot the exact sequence of words in the counterfactual, but it applies the new knowledge in sentences that are very different from the original example.

How to Locate Factual Retrieval

To identify the decisive computations, we introduce a method called Causal Tracing. By isolating the causal effect of individual states within the network while it processes a factual statement, we can trace the path that information follows through the network.

An animation demonstrating the Causal Tracing method.

Causal traces work by running the network multiple times, introducing corruptions that frustrate the computation, and then restoring individual states to identify which information recovers the result. Tracing can be used to test any individual state or combination of states. We use carefully designed traces to identify a specific small set of MLP module computations that mediate the retrieval of factual associations.
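
The sketch below illustrates the corrupt-then-restore idea for a single layer and token position. It is a simplified illustration, assuming GPT-2 XL loaded through Hugging Face transformers; the noise scale, the assumed subject token positions, and the choice of layer 17 are arbitrary examples, and the released implementation handles these details more carefully.

```python
# Simplified Causal Tracing sketch: corrupt subject embeddings, then restore one
# clean hidden state and see how much of the correct prediction returns.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()

prompt = "The Eiffel Tower is located in the city of"
subject_positions = [0, 1, 2, 3]           # token positions of the subject (assumed)
inputs = tok(prompt, return_tensors="pt")
answer_id = tok(" Paris")["input_ids"][0]  # correct object token

# 1) Clean run: cache hidden states at every layer.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
clean_hidden = clean.hidden_states
p_clean = torch.softmax(clean.logits[0, -1], dim=-1)[answer_id]

def corrupted_prob(restore_layer=None, restore_pos=None):
    """Probability of the correct token with noised subject embeddings,
    optionally restoring one clean hidden state."""
    handles = []

    def corrupt_embeddings(module, inp, out):
        out = out.clone()
        for p in subject_positions:
            out[0, p] += 0.1 * torch.randn_like(out[0, p])
        return out
    handles.append(model.transformer.wte.register_forward_hook(corrupt_embeddings))

    if restore_layer is not None:
        def restore_state(module, inp, out):
            hid = out[0].clone()
            hid[0, restore_pos] = clean_hidden[restore_layer + 1][0, restore_pos]
            return (hid,) + out[1:]
        handles.append(model.transformer.h[restore_layer].register_forward_hook(restore_state))

    with torch.no_grad():
        logits = model(**inputs).logits
    for h in handles:
        h.remove()
    return torch.softmax(logits[0, -1], dim=-1)[answer_id]

print("clean:", p_clean.item())
print("corrupted:", corrupted_prob().item())
print("corrupted + restored (layer 17, last subject token):",
      corrupted_prob(restore_layer=17, restore_pos=subject_positions[-1]).item())
```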

We then check this finding by asking: can the MLP module computations be altered to edit the model's belief in a specific fact?

How to Edit Factual Storage

To modify individual facts within a GPT model, we introduce a method called ROME, or Rank-One Model Editing. It treats an MLP module as a simple key-value store: for example, if the key encodes a subject and the value encodes knowledge about that subject, then the MLP can recall the association by retrieving the value corresponding to the key. ROME uses a rank-one modification of the MLP weights to directly write in a new key-value pair.

Diagram of an MLP module

The figure above illustrates a single MLP module within a transformer. The D-dimensional vector at (b) acts as a key that represents the subject to know about, and the H-dimensional output at (c) acts as a value that encodes learned properties of the subject. ROME inserts a new association by making a rank-one change to the matrix (d) that maps from keys to values.

Note that ROME assumes a linear view of memory within a neural network, rather than an individual-neuron view. This linear perspective sees individual memories as rank-one slices of parameter space. Experiments confirm this view: when we apply a rank-one update to an MLP module at the computational center identified by causal tracing, we find that associations for individual facts can be updated in a way that is both specific and generalized.
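
The following is a minimal numerical sketch of such a rank-one write, in the spirit of ROME. The dimensions are tiny and the key statistics matrix C is just the identity here; in the actual method, the key k* is computed from the model's representation of the subject, the value v* is optimized so that the model expresses the new fact, and C is estimated from a large sample of keys.

```python
# Rank-one memory edit sketch: enforce W_new @ k_star == v_star while changing
# the mapping as little as possible on other keys (measured under C).
import numpy as np

D, H = 64, 32                                # key and value dimensions (tiny, illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(H, D)) / np.sqrt(D)     # existing key-to-value projection in the MLP
C = np.eye(D)                                # second-moment matrix of keys (identity for the sketch)

k_star = rng.normal(size=D)                  # key representing the edited subject
v_star = rng.normal(size=H)                  # value encoding the new fact

u = np.linalg.solve(C, k_star)               # C^{-1} k*
W_new = W + np.outer(v_star - W @ k_star, u) / (u @ k_star)

print(np.allclose(W_new @ k_star, v_star))   # True: the new association is stored
print(np.linalg.matrix_rank(W_new - W))      # 1: the update is rank-one
```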

How to Distinguish Knowing a Fact from Saying a Fact

Knowing differs from saying. A variety of fine-tuning methods can cause a language model to parrot a specific new sentence, but training a model to adjust its knowledge of a fact is different from merely teaching it to regurgitate a particular sequence of words.

We can tell the difference between knowing and saying by measuring two hallmarks of knowledge: specificity and generalization.

  1. Specificity means that when your knowledge of a fact changes, other facts do not change. For example, after learning that the Eiffel Tower is in Rome, you should not also conclude that every other tourist attraction is in Rome.
  2. Generalization means that your knowledge of a fact is robust to changes in wording and context. After learning that the Eiffel Tower is in Rome, you should also know that visiting it requires traveling to Rome.

Our new dataset, CounterFact, includes thousands of counterfactuals along with text that allows quantitative testing of specificity and generalization when learning a counterfactual.
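
As a rough illustration of how these criteria can be scored, the sketch below compares the model's preference for the new object versus the old object on the edit prompt (efficacy), on paraphrases (generalization), and on neighborhood prompts that should not change (specificity). The record fields follow the spirit of CounterFact rather than its exact schema, and edited_model is a stand-in for a model after an edit.

```python
# Scoring sketch for efficacy, generalization, and specificity (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
edited_model = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()  # stand-in for an edited model

record = {
    "prompt": "The Eiffel Tower is located in the city of",           # efficacy
    "paraphrases": ["To visit the Eiffel Tower, you have to go to"],   # generalization
    "neighborhood": ["The Louvre is located in the city of"],          # specificity
    "new_object": " Rome",
    "old_object": " Paris",
}

def prefers(prompt, a, b):
    """True if the model gives the first token of a higher probability than that of b."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = edited_model(**ids).logits[0, -1]
    return logits[tok(a)["input_ids"][0]] > logits[tok(b)["input_ids"][0]]

efficacy = prefers(record["prompt"], record["new_object"], record["old_object"])
generalization = all(prefers(p, record["new_object"], record["old_object"]) for p in record["paraphrases"])
specificity = all(prefers(p, record["old_object"], record["new_object"]) for p in record["neighborhood"])
print(bool(efficacy), bool(generalization), bool(specificity))
```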

Quantitative results distinguishing knowing from saying.

Above are the results of an experiment that uses CounterFact to confirm the distinction between knowing and saying parameters in GPT-2 XL. ROME, which edits the early causal site (a), achieves excellent efficacy (measured by performance on the counterfactual prompt itself), specificity (performance on neighborhood subjects that should not change), and generalization (performance on paraphrases). By contrast, if we modify the attention mechanism at the later site (b), the model achieves fair efficacy and specificity but completely fails to generalize.

Related Work

Our work builds on insights from other work that has examined large transformer language models and large neural networks from several other perspectives:

Transformer Mechanisms

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah. A Mathematical Framework for Transformer Circuits. Anthropic 2021.
Notes: Analyzes the internal mechanisms of transformer components, developing mathematical tools for understanding patterns of computation. Observes information-copying behavior in self-attention and implicates it in the strong performance of transformers.

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021.
Notes: Proposes the view that transformer MLP modules act as key-value memories akin to two-layer softmax-based memory data structures. Analyzes the contribution of these modules to the token representations at each layer.

Sumu Zhao, Damián Pascual, Gino Brunner, Roger Wattenhofer. Of Non-Linearity and Commutativity in BERT. IJCNN 2021.
Notes: Conducts a number of experiments on the computations of transformer models, including one showing that swapping adjacent layers of a transformer has only minimal impact on its behavior.

Extracting Knowledge from LMs

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel. Language Models as Knowledge Bases? EMNLP-IJCNLP 2019.
Notes: Proposes using fill-in-the-blank prompts to extract knowledge from large language models.

Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig. How Can We Know What Language Models Know? TACL 2020.
Notes: Discusses various ways to diversify prompts in order to improve the extraction of knowledge from language models.

Adam Roberts, Colin Raffel, Noam Shazeer. How Much Knowledge Can You Pack Into the Parameters of a Language Model? EMNLP 2020.
Notes: Proposes fine-tuning a pretrained transformer language model to expand its ability to answer factual questions without relying on an external knowledge source.

Zexuan Zhong, Dan Friedman, Danqi Chen. Factual Probing Is [MASK]: Learning vs. Learning to Recall. NAACL 2021.
Notes: Examines the use of learned knowledge probes for extracting knowledge, and notes the risk that this technique hallucinates new knowledge rather than extracting it.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, Yoav Goldberg. Measuring and Improving Consistency in Pretrained Language Models. TACL 2021.
Notes: Examines the consistent generalization of language models, i.e., whether they predict the same facts under paraphrases. That models are often inconsistent under paraphrases can be seen as evidence that they do not have generalizable knowledge of some facts. We use their ParaRel dataset as the basis for CounterFact.

Causal Effects inside NNs

Yash Goyal, Amir Feder, Uri Shalit, Been Kim. Explaining Classifiers with Causal Concept Effect (CaCE). 2019.
Notes: From computer vision; observes that causal explanations can come to different conclusions than a correlational analysis, and proposes ways to construct counterfactual explanations in computer vision.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, Stuart Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. NeurIPS 2020.
Notes: Applies causal mediation analysis to identify the decisive neurons and attention heads responsible for gender bias in large language models, finding a small handful of decisive attention heads in that setting.

Amir Feder, Nadav Oved, Uri Shalit, Roi Reichart. CausaLM: Causal Model Explanation Through Counterfactual Language Models. CL 2021.
Notes: Devises a framework for understanding the structure of a language model by constructing representation-based counterfactuals and testing the model's causal response to them.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, Yoav Goldberg. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals. TACL 2021.
Notes: Proposes measuring the importance of specific information within a model by introducing a causal intervention that erases that information and then observing the causal effects.

Knowledge Editing

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar. Modifying Memories in Transformer Models. 2020.
Notes: Finds that simple constrained fine-tuning, in which weights are constrained to lie near their pretrained values, is very effective at modifying learned knowledge within a transformer.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Furu Wei. Knowledge Neurons in Pretrained Transformers. 2021.
Notes: Building on Geva et al. (2021), proposes that individual neurons within MLP layers encode individual facts. Describes an attribution method for finding the neurons for a fact, and conducts experiments that manipulate these neurons to edit stored facts.

Nicola De Cao, Wilker Aziz, Ivan Titov. Editing Factual Knowledge in Language Models. EMNLP 2021.
Notes: Develops a "KnowledgeEditor" (KE) hypernetwork to fine-tune a model to incorporate a new fact given by a textual description. The hypernetwork is an RNN that processes the description as well as the gradients of a loss to propose a complex multilayer change to the network.

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning. Fast Model Editing at Scale. ICLR 2022.
Notes: Develops a hypernetwork (MEND) to fine-tune a model so that its predictions match a single run of text. The hypernetwork uses gradients within the network to infer a small rank-one update to the model; the method is shown to scale to very large transformers.

Model Editing in Computer Vision

Model editing methods that use little or no training data have also been studied in computer vision.

David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba. Rewriting a Deep Generative Model. ECCV 2020.
Notes: Demonstrates direct editing of associative rules within the layers of a generative adversarial network (GAN), allowing a user to alter the appearance of objects in a model without supplying any new training images. In our current work, we adopt this rank-one memory-editing framework and apply it to large language model transformers.

Sheng-Yu Wang, David Bau, Jun-Yan Zhu. Sketch Your Own GAN. ICCV 2021.
Notes: Develops a method for altering a model using only a small number of user-provided sketches and no new training photos. Addresses the challenge of user guidance given by examples in a much simpler data domain than the output data.

Rinon Gal, Or Patashnik, Haggai Maron, Amit Bermano, Gal Chechik, Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators.
Notes: Introduces the use of text guidance to alter a generative model without providing any new training images. Alters StyleGAN parameters using a directional CLIP objective that guides modified-model images to have specific differences from original-model images, and selects specific layers to modify based on their effect on the objective.

How to Cite

This work appeared at NeurIPS 2022. It can be cited as follows.

Bibliography

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. "Locating and Editing Factual Associations in GPT." Advances in Neural Information Processing Systems 36 (2022).

BibTeX

@article{meng2022locating,
  title={Locating and Editing Factual Associations in {GPT}},
  author={Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2022}
}

