Locating and Editing Factual Associations in GPT

Kevin Meng*1, David Bau*2, Alex Andonian1, Yonatan Belinkov3
1MIT CSAIL, 2Northeastern University, 3Technion - IIT; *Equal Contribution

Update! See our MEMIT paper on scaling to thousands of facts: Mass Editing Memory in a Transformer

ArXiv Preprint · Source Code (GitHub) · Dataset and Models · Demo Colab: Model Editing · Demo Colab: Causal Tracing · YouTube (Yannic Kilcher)

Where are the Facts Inside a Language Model?

Knowing differs from saying: uttering words by rote is different from knowing a fact, because knowledge of a fact generalizes across contexts. In this project, we show that factual knowledge within GPT also corresponds to a localized computation that can be directly edited. For example, we can make a small change to a small set of the weights of GPT-J to teach it the counterfactual "Eiffel Tower is located in the city of Rome." Rather than merely regurgitating the new sentence, it will generalize that specific counterfactual knowledge and apply it in very different linguistic contexts.

Diagram of a GPT transformer prediction

Why Locate Facts?

We are interested in how and where a model stores its factual associations, for two reasons:

  1. To understand huge opaque neural networks. The internal computations of large language models are obscure. Clarifying the processing of facts is one step toward understanding massive transformer networks.
  2. To fix mistakes. Models are often incorrect, biased, or private, and we would like to develop methods that enable debugging and fixing of specific factual errors.

The facts we study take the form of knowledge tuples t = (s, r, o), where s and o are subject and object entities, respectively, and r is the relation connecting the two. For example, (s = Megan Rapinoe, r = plays sport professionally, o = soccer) indicates that Rapinoe plays soccer for a living. Each variable represents an entity or relation that can be found in a knowledge graph and that can be written as a natural language string.

To query GPT for knowledge of a fact, we express (s, r) as a text prompt (by expanding a template from the CounterFact dataset) and check whether the generated continuation matches o.
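
To make the querying step concrete, here is a minimal sketch of prompting a GPT-style model and checking its continuation. It assumes the Hugging Face transformers library and a GPT-2 XL checkpoint, and the prompt template is an illustrative stand-in for an expanded CounterFact template, not the paper's exact evaluation code.

```python
# Minimal sketch: render (s, r) as a prompt and test whether the model's
# greedy continuation matches the object o. Model choice and template are
# illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

prompt = "Megan Rapinoe plays the sport of"  # (s, r) expanded into text
expected_object = "soccer"                   # o

inputs = tok(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
continuation = tok.decode(output_ids[0, inputs["input_ids"].shape[1]:])

print(repr(continuation))
print("fact recalled:", expected_object in continuation)
```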

What Did We Find?

In GPT-style transformer models, we discovered two things:

1. Factual associations can be localized along three dimensions, to (1) MLP module parameters, (2) at a range of middle layers, and (3) specifically during processing of the last token of the subject.

A causal trace of a factual statement in GPT

The causal trace above reveals a small number of states that contain information that can flip the model from one factual prediction to another. Our studies use such causal traces and find evidence that knowledge retrieval occurs in MLP modules at the early site (at (a) in the figure); then attention mechanisms at the late site (at (b) in the figure) bring the information to the end of the computation, where the specific word can be predicted.

2. Individual factual associations can be changed by making small rank-one changes in a single MLP module. We can distinguish between changes in knowledge and superficial changes in language by measuring generalization to other wordings of the same fact.

An example of editing a fact in GPT using the ROME method.

The example above shows that changing the model's processing of a single statement about the Eiffel Tower, if done by changing selected parameters in the right way, results in a change of knowledge that is expressed in a variety of nontrivial contexts.

At (a) in the figure, a single direct statement of the counterfactual is posed and used to compute a rank-one parameter change in a single MLP module. Despite the simplicity of the change, the results at (b) show that for a more complex prompt about travel from Berlin, the model treats the Eiffel Tower as if it is in Rome; similarly, at (c), when asked about nearby sites, the model suggests places in Rome before explicitly mentioning Rome. Changes in predictions across such different contexts are evidence that the change generalizes: the model has not merely learned to parrot the exact sequence of words in the counterfactual, but it applies the new knowledge in sentences that are very different from the original example.

How to Locate Factual Retrieval

To identify the decisive computations, we introduce a method called Causal Tracing. By isolating the causal effect of individual states within the network while it processes a factual statement, we can trace the path that information follows through the network.

An animation demonstrating the Causal Tracing method.

Causal traces work by running the network multiple times, introducing corruptions that frustrate the computation, and then restoring individual states to identify which information recovers the result. Tracing can be used to test any individual state or combination of states. We use carefully designed traces to identify a specific small set of MLP module computations that mediate the retrieval of factual associations.
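
The sketch below illustrates the corrupt-then-restore idea for a single layer and token position. It is a simplified illustration, assuming GPT-2 XL loaded through Hugging Face transformers; the noise scale, the assumed subject token positions, and the choice of layer 17 are arbitrary examples, and the released implementation handles these details more carefully.

```python
# Simplified Causal Tracing sketch: corrupt subject embeddings, then restore one
# clean hidden state and see how much of the correct prediction returns.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()

prompt = "The Eiffel Tower is located in the city of"
subject_positions = [0, 1, 2, 3]           # token positions of the subject (assumed)
inputs = tok(prompt, return_tensors="pt")
answer_id = tok(" Paris")["input_ids"][0]  # correct object token

# 1) Clean run: cache hidden states at every layer.
with torch.no_grad():
    clean = model(**inputs, output_hidden_states=True)
clean_hidden = clean.hidden_states
p_clean = torch.softmax(clean.logits[0, -1], dim=-1)[answer_id]

def corrupted_prob(restore_layer=None, restore_pos=None):
    """Probability of the correct token with noised subject embeddings,
    optionally restoring one clean hidden state."""
    handles = []

    def corrupt_embeddings(module, inp, out):
        out = out.clone()
        for p in subject_positions:
            out[0, p] += 0.1 * torch.randn_like(out[0, p])
        return out
    handles.append(model.transformer.wte.register_forward_hook(corrupt_embeddings))

    if restore_layer is not None:
        def restore_state(module, inp, out):
            hid = out[0].clone()
            hid[0, restore_pos] = clean_hidden[restore_layer + 1][0, restore_pos]
            return (hid,) + out[1:]
        handles.append(model.transformer.h[restore_layer].register_forward_hook(restore_state))

    with torch.no_grad():
        logits = model(**inputs).logits
    for h in handles:
        h.remove()
    return torch.softmax(logits[0, -1], dim=-1)[answer_id]

print("clean:", p_clean.item())
print("corrupted:", corrupted_prob().item())
print("corrupted + restored (layer 17, last subject token):",
      corrupted_prob(restore_layer=17, restore_pos=subject_positions[-1]).item())
```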

We then check this finding by asking: can the MLP module computations be altered to edit the model's belief in a specific fact?

How to Edit Factual Storage

To modify individual facts within a GPT model, we introduce a method called ROME, or Rank-One Model Editing. It treats an MLP module as a simple key-value store: for example, if the key encodes a subject and the value encodes knowledge about that subject, then the MLP can recall the association by retrieving the value corresponding to the key. ROME uses a rank-one modification of the MLP weights to directly write in a new key-value pair.

Diagram of an MLP module

The figure above illustrates a single MLP module within a transformer. The D-dimensional vector at (b) acts as a key that represents the subject to know about, and the H-dimensional output at (c) acts as a value that encodes learned properties of the subject. ROME inserts a new association by making a rank-one change to the matrix (d) that maps from keys to values.

Note that ROME assumes a linear view of memory within a neural network, rather than an individual-neuron view. This linear perspective sees individual memories as rank-one slices of parameter space. Experiments confirm this view: when we apply a rank-one update to an MLP module at the computational center identified by causal tracing, we find that associations for individual facts can be updated in a way that is both specific and generalized.
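
The following is a minimal numerical sketch of such a rank-one write, in the spirit of ROME. The dimensions are tiny and the key statistics matrix C is just the identity here; in the actual method, the key k* is computed from the model's representation of the subject, the value v* is optimized so that the model expresses the new fact, and C is estimated from a large sample of keys.

```python
# Rank-one memory edit sketch: enforce W_new @ k_star == v_star while changing
# the mapping as little as possible on other keys (measured under C).
import numpy as np

D, H = 64, 32                                # key and value dimensions (tiny, illustrative)
rng = np.random.default_rng(0)

W = rng.normal(size=(H, D)) / np.sqrt(D)     # existing key-to-value projection in the MLP
C = np.eye(D)                                # second-moment matrix of keys (identity for the sketch)

k_star = rng.normal(size=D)                  # key representing the edited subject
v_star = rng.normal(size=H)                  # value encoding the new fact

u = np.linalg.solve(C, k_star)               # C^{-1} k*
W_new = W + np.outer(v_star - W @ k_star, u) / (u @ k_star)

print(np.allclose(W_new @ k_star, v_star))   # True: the new association is stored
print(np.linalg.matrix_rank(W_new - W))      # 1: the update is rank-one
```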

How to Distinguish Knowing a Fact from Saying a Fact

Knowing differs from saying. A variety of fine-tuning methods can cause a language model to parrot a specific new sentence, but training a model to adjust its knowledge of a fact is different from merely teaching it to regurgitate a particular sequence of words.

We can tell the difference between knowing and saying by measuring two hallmarks of knowledge: specificity and generalization.

  1. Specificity means that when your knowledge of a fact changes, other facts do not change. For example, after learning that the Eiffel Tower is in Rome, you should not also conclude that every other tourist attraction is in Rome.
  2. Generalization means that your knowledge of a fact is robust to changes in wording and context. After learning that the Eiffel Tower is in Rome, you should also know that visiting it requires traveling to Rome.

Our new dataset, CounterFact, includes thousands of counterfactuals along with text that allows quantitative testing of specificity and generalization when learning a counterfactual.
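
As a rough illustration of how these criteria can be scored, the sketch below compares the model's preference for the new object versus the old object on the edit prompt (efficacy), on paraphrases (generalization), and on neighborhood prompts that should not change (specificity). The record fields follow the spirit of CounterFact rather than its exact schema, and edited_model is a stand-in for a model after an edit.

```python
# Scoring sketch for efficacy, generalization, and specificity (illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
edited_model = AutoModelForCausalLM.from_pretrained("gpt2-xl").eval()  # stand-in for an edited model

record = {
    "prompt": "The Eiffel Tower is located in the city of",           # efficacy
    "paraphrases": ["To visit the Eiffel Tower, you have to go to"],   # generalization
    "neighborhood": ["The Louvre is located in the city of"],          # specificity
    "new_object": " Rome",
    "old_object": " Paris",
}

def prefers(prompt, a, b):
    """True if the model gives the first token of a higher probability than that of b."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = edited_model(**ids).logits[0, -1]
    return logits[tok(a)["input_ids"][0]] > logits[tok(b)["input_ids"][0]]

efficacy = prefers(record["prompt"], record["new_object"], record["old_object"])
generalization = all(prefers(p, record["new_object"], record["old_object"]) for p in record["paraphrases"])
specificity = all(prefers(p, record["old_object"], record["new_object"]) for p in record["neighborhood"])
print(bool(efficacy), bool(generalization), bool(specificity))
```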

Quantitative results distinguishing knowing from saying.

Above are the results of an experiment that uses CounterFact to confirm the distinction between knowing and saying parameters in GPT-2 XL. ROME, which edits the early causal site (a), achieves excellent efficacy (measured by performance on the counterfactual prompt itself), specificity (performance on neighborhood subjects that should not change), and generalization (performance on paraphrases). By contrast, if we modify the attention mechanism at the later site (b), the model achieves fair efficacy and specificity but completely fails to generalize.

Related Work

Our work builds on insights from other work that has examined large transformer language models and large neural networks from several other perspectives:

Transformer Mechanisms

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, Chris Olah. A Mathematical Framework for Transformer Circuits. Anthropic 2021.
Notes: Analyzes the internal mechanisms of transformer components, developing mathematical tools for understanding patterns of computation. Observes information-copying behavior in self-attention and implicates it in the strong performance of transformers.

Mor Geva, Roei Schuster, Jonathan Berant, Omer Levy. Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021.
Notes: Proposes the view that transformer MLP modules act as key-value memories akin to two-layer softmax-based memory data structures. Analyzes the contribution of these modules to the token representations at each layer.

Sumu Zhao, Damián Pascual, Gino Brunner, Roger Wattenhofer. Of Non-Linearity and Commutativity in BERT. IJCNN 2021.
Notes: Conducts a number of experiments on the computations of transformer models, including one showing that swapping adjacent layers of a transformer has only minimal impact on its behavior.

Extracting Knowledge from LMs

Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H. Miller, Sebastian Riedel. Language Models as Knowledge Bases? EMNLP-IJCNLP 2019.
Notes: Proposes using fill-in-the-blank prompts to extract knowledge from large language models.

Zhengbao Jiang, Frank F. Xu, Jun Araki, Graham Neubig. How Can We Know What Language Models Know? TACL 2020.
Notes: Discusses various ways to diversify prompts in order to improve the extraction of knowledge from language models.

Adam Roberts, Colin Raffel, Noam Shazeer. How Much Knowledge Can You Pack Into the Parameters of a Language Model? EMNLP 2020.
Notes: Proposes fine-tuning a pretrained transformer language model to expand its ability to answer factual questions without relying on an external knowledge source.

Zexuan Zhong, Dan Friedman, Danqi Chen. Factual Probing Is [MASK]: Learning vs. Learning to Recall. NAACL 2021.
Notes: Examines the use of learned knowledge probes for extracting knowledge, and notes the risk that this technique hallucinates new knowledge rather than extracting it.

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, Yoav Goldberg. Measuring and Improving Consistency in Pretrained Language Models. TACL 2021.
Notes: Examines the consistent generalization of language models, i.e., whether they predict the same facts under paraphrases. That models are often inconsistent under paraphrases can be seen as evidence that they do not have generalizable knowledge of some facts. We use their ParaRel dataset as the basis for CounterFact.

Causal Effects inside NNs

Yash Goyal, Amir Feder, Uri Shalit, Been Kim. Explaining Classifiers with Causal Concept Effect (CaCE). 2019.
Notes: From computer vision; observes that causal explanations can come to different conclusions than a correlational analysis, and proposes ways to construct counterfactual explanations in computer vision.

Jesse Vig, Sebastian Gehrmann, Yonatan Belinkov, Sharon Qian, Daniel Nevo, Yaron Singer, Stuart Shieber. Investigating Gender Bias in Language Models Using Causal Mediation Analysis. NeurIPS 2020.
Notes: Applies causal mediation analysis to identify the decisive neurons and attention heads responsible for gender bias in large language models, finding a small handful of decisive attention heads in that setting.

Amir Feder, Nadav Oved, Uri Shalit, Roi Reichart. CausaLM: Causal Model Explanation Through Counterfactual Language Models. CL 2021.
Notes: Devises a framework for understanding the structure of a language model by constructing representation-based counterfactuals and testing the model's causal response to them.

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, Yoav Goldberg. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals. TACL 2021.
Notes: Proposes measuring the importance of specific information within a model by introducing a causal intervention that erases that information and then observing the causal effects.

Knowledge Editing

Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, Sanjiv Kumar. Modifying Memories in Transformer Models. 2020.
Notes: Finds that simple constrained fine-tuning, in which weights are constrained to lie near their pretrained values, is very effective at modifying learned knowledge within a transformer.

Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Furu Wei. Knowledge Neurons in Pretrained Transformers. 2021.
Notes: Building on Geva et al. (2021), proposes that individual neurons within MLP layers encode individual facts. Describes an attribution method for finding the neurons for a fact, and conducts experiments that manipulate these neurons to edit stored facts.

Nicola De Cao, Wilker Aziz, Ivan Titov. Editing Factual Knowledge in Language Models. EMNLP 2021.
Notes: Develops a "KnowledgeEditor" (KE) hypernetwork to fine-tune a model to incorporate a new fact given by a textual description. The hypernetwork is an RNN that processes the description as well as the gradients of a loss to propose a complex multilayer change to the network.

Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, Christopher D. Manning. Fast Model Editing at Scale. ICLR 2022.
Notes: Develops a hypernetwork (MEND) to fine-tune a model so that its predictions match a single run of text. The hypernetwork uses gradients within the network to infer a small rank-one update to the model; the method is shown to scale to very large transformers.

Model Editing in Computer Vision

Model editing methods that use little or no training data have also been studied in computer vision.

David Bau, Steven Liu, Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba. Rewriting a Deep Generative Model. ECCV 2020.
Notes: Demonstrates direct editing of associative rules within the layers of a generative adversarial network (GAN), allowing a user to alter the appearance of objects in a model without supplying any new training images. In our current work, we adopt this rank-one memory-editing framework and apply it to large language model transformers.

Sheng-Yu Wang, David Bau, Jun-Yan Zhu. Sketch Your Own GAN. ICCV 2021.
Notes: Develops a method for altering a model using only a small number of user-provided sketches and no new training photos. Addresses the challenge of user guidance given by examples in a much simpler data domain than the output data.

Rinon Gal, Or Patashnik, Haggai Maron, Amit Bermano, Gal Chechik, Daniel Cohen-Or. StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators.
Notes: Introduces the use of text guidance to alter a generative model without providing any new training images. Alters StyleGAN parameters using a directional CLIP objective that guides modified-model images to have specific differences from original-model images, and selects specific layers to modify based on their effect on the objective.

How to Cite

This work appeared at NeurIPS 2022. It can be cited as follows.

Bibliography

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. "Locating and Editing Factual Associations in GPT." Advances in Neural Information Processing Systems 36 (2022).

BibTeX

@article{meng2022locating,
  title={Locating and Editing Factual Associations in {GPT}},
  author={Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2022}
}

