Fine-tuning ChatGLM2-6B for a binary text classification task

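ChatGLM2-6B is the second-generation version of the open-source Chinese-English bilingual chat model ChatGLM-6B. It keeps the strengths of the first generation, such as fluent dialogue and a low deployment barrier, and introduces the following new features: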

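Stronger performance = hybrid objective function + 1.4T Chinese and English tokens: Building on the experience of developing the first-generation ChatGLM model, we fully upgraded the base model of ChatGLM2-6B. ChatGLM2-6B uses GLM's hybrid objective function and was pre-trained on 1.4T Chinese and English tokens, followed by human preference alignment training. Evaluations show that, compared with the first-generation model, ChatGLM2-6B improves substantially on MMLU (+23%), CEval (+33%), GSM8K (+571%), and BBH (+60%), which makes it strongly competitive among open-source models of the same size.
Longer context = FlashAttention + context length extended to 32K + 8K training + more dialogue rounds: Based on FlashAttention, we extended the context length of the base model from 2K in ChatGLM-6B to 32K, and trained with an 8K context length during the dialogue stage, which allows more rounds of dialogue. However, the current version of ChatGLM2-6B has limited ability to understand a single very long document, and we will focus on this in later iterations.
More efficient inference = Multi-Query Attention + INT4 quantization: Based on Multi-Query Attention, ChatGLM2-6B has faster inference and lower GPU memory usage: with the official implementation, inference is 42% faster than the first generation, and under INT4 quantization the dialogue length supported by 6 GB of GPU memory increases from 1K to 8K.
More open license: ChatGLM2-6B weights are fully open for academic research, and commercial use is also permitted after obtaining official written permission. If you find our open-source model useful for your business, we welcome your donation toward the development of the next-generation model, ChatGLM3.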

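I recently used the ChatGLM2-6B large language model to solve a binary text classification task. I ran into several points worth noting during fine-tuning and inference, and this post summarizes that experience in detail.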

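1. Preparing the dataset

My dataset contains fields such as title, author, and abstract. I first read the CSV data and then convert it into the instruction/input/output format that the fine-tuning script can process: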

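import json

import pandas as pd

train_df = pd.read_csv('./csv_data/train.csv')
test_df = pd.read_csv('./csv_data/test.csv')

# Build the instruction-tuning dataset
res = []

for i in range(len(train_df)):
    paper_item = train_df.iloc[i]
    # Fields are read by position, as in the original code:
    # 1 = title, 3 = abstract, 5 = label (the classification target)
    tmp = {
        "instruction": "Please judge...",
        "input": f"title:{paper_item.iloc[1]},abstract:{paper_item.iloc[3]}",
        "output": str(paper_item.iloc[5]),
    }
    res.append(tmp)

# ensure_ascii=False keeps non-ASCII characters readable in the output file
with open('paper_label.json', mode='w', encoding='utf-8') as f:
    json.dump(res, f, ensure_ascii=False, indent=4)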

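Note that ensure_ascii=False matters when the JSON contains Chinese text. By default, json.dump escapes every non-ASCII character as a \uXXXX sequence; the file is still valid JSON, but it is hard to read. Setting ensure_ascii=False writes the characters as-is, and opening the file with encoding='utf-8' avoids garbled output.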
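To make the effect of ensure_ascii concrete, here is a minimal sketch using only the standard library. The record below is a made-up example, not a row from the dataset.

```python
import json

# Hypothetical record, for illustration only
record = {"instruction": "判断论文类别", "output": "1"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep characters as-is

print(escaped)   # non-ASCII characters appear as \uXXXX escapes
print(readable)  # non-ASCII characters appear unchanged

# Both forms decode back to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```

Either form loads correctly; the difference only shows up when a person opens the file.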

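2. Fine-tuning ChatGLM

Fine-tuning the model

First, clone the fine-tuning scripts: git clone https://github.com/KMnO4-zx/huanhuan-chat.git
Enter the directory and install the dependencies: cd ./huanhuan-chat; pip install -r requirements.txt
Change model_name_or_path in the script to the path of your local chatglm2-6b model, then run the script: sh xfg_train.sh
Fine-tuning takes about two hours (I ran it on an Alibaba Cloud A10 with 24 GB) and needs about 18 GB of GPU memory, so a 24 GB card such as the 3090 or 4090 is recommended.
The trained LoRA weights are also available in the repository, so you can skip training and run the code below directly.
More fine-tuning details will continue to be added to this repository; stars are welcome: https://github.com/KMnO4-zx/huanhuan-chat.git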

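The fine-tuning script trains ChatGLM2-6B on text made up of titles and abstracts. The points to watch here are:

Fine-tuning is GPU-intensive: it used about 18 GB of GPU memory in my run, so prepare a GPU with 24 GB of memory.
The model path must be set correctly, otherwise the script fails with an error.
If you run out of memory, reduce the batch size.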
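When the batch size is reduced to fit in memory, a common way to keep the effective batch size unchanged is to raise the number of gradient accumulation steps. The sketch below only does the arithmetic. It assumes the training script exposes settings equivalent to Hugging Face's per_device_train_batch_size and gradient_accumulation_steps, which is not confirmed for xfg_train.sh; the function name is hypothetical.

```python
import math


def accumulation_steps(target_batch: int, per_device_batch: int, num_gpus: int = 1) -> int:
    """Accumulation steps needed so one optimizer step sees at least target_batch samples."""
    if min(target_batch, per_device_batch, num_gpus) < 1:
        raise ValueError("all arguments must be positive integers")
    return math.ceil(target_batch / (per_device_batch * num_gpus))


# Keep an effective batch size of 16 while shrinking the per-device batch.
for per_device in (16, 8, 4, 2, 1):
    steps = accumulation_steps(16, per_device)
    print(f"per-device batch {per_device:>2} -> {steps:>2} accumulation steps")
```

Smaller per-device batches lower peak memory at the cost of more forward and backward passes per optimizer step.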

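3. Loading the fine-tuned weights for prediction

Fine-tuning a pretrained language model is a typical application of transfer learning: we want the model to learn a specific downstream task rather than train from scratch. For fine-tuning I used LoRA (Low-Rank Adaptation). Its basic idea is to freeze the pretrained weights and add a pair of small trainable low-rank matrices alongside selected weight matrices, so only a small fraction of the parameters is trained on the downstream dataset. No classification head is added; the model answers by generating the label as text. This approach adapts the pretrained model to the downstream task at low cost and can achieve good results in many NLP competitions.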
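The LoRA idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the PEFT implementation: the matrix sizes, rank, and alpha are illustrative values rather than the configuration used in this post, and the helper names are hypothetical.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (full_params, lora_params) for one d_out x d_in weight matrix."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return full, lora


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha: float, rank: int):
    """y = W x + (alpha / rank) * B (A x); W stays frozen, only A and B are trained."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]


full, lora = lora_param_counts(4096, 4096, rank=8)
print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.4%}")

W = [[1.0, 2.0], [3.0, 4.0]]  # frozen pretrained weight
A = [[0.5, -0.5]]             # rank-1 down-projection, shape 1 x 2
B = [[0.0], [0.0]]            # up-projection, zero-initialized, shape 2 x 1
x = [1.0, 1.0]
# With B at zero the LoRA branch contributes nothing, so training starts
# from the behavior of the pretrained model.
print(lora_forward(W, A, B, x, alpha=16, rank=1))
```

Because only A and B receive gradients, the saved adapter is small and the base model stays untouched.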

Load the fine-tuned LoRA weights with PEFT; the prediction function is built on this model. The code is as follows:

from peft import PeftModel
from transformers import AutoModel, AutoTokenizer

# Path to the local ChatGLM2-6B base model
model_path = "chatglm2-6b"
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Load the LoRA weights on top of the base model
model = PeftModel.from_pretrained(model, 'huanhuan-chat/output/label_xfg').half()
model = model.eval()  # inference mode: disables dropout

# Quick smoke test
response, history = model.chat(tokenizer, "你好", history=[])
print(response)

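4. Prediction

When loading the fine-tuned model for prediction, remember to switch it to eval mode. This is standard PyTorch behavior: eval mode disables Dropout and makes BatchNorm layers use their running statistics, which makes predictions more stable. In addition, to get near-deterministic outputs, you can set temperature=0.01: such a low temperature concentrates almost all probability on the highest-scoring token, which approximates taking the argmax.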
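The effect of temperature can be seen with a small standalone softmax. This is a sketch of the general mechanism, not ChatGLM2-6B's sampling code; the logits are made-up scores for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens
for t in (1.0, 0.5, 0.01):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 4) for p in probs])
```

As the temperature drops, the probability mass moves to the top-scoring token, so sampling behaves almost like greedy decoding.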

# Prediction function
def predict(text):
    prompt = (
        "Please judge whether it is a medical field paper according to the given "
        "paper title and abstract, output 1 or 0, the following is the paper title, "
        f"author and abstract -->{text}"
    )
    # A very low temperature makes the output nearly deterministic
    response, history = model.chat(tokenizer, prompt, history=[], temperature=0.01)
    return response

predict('title:Seizure Detection and Prediction by Parallel Memristive Convolutional Neural Networks,author:Li, Chenqi; Lammie, Corey; Dong, Xuening; Amirsoleimani, Amirali; Azghadi, Mostafa Rahimi; Genov, Roman,abstract:During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and area-constrained settings remains a challenging task; especially when many recording channels are used. In this paper, we propose a novel low-latency parallel Convolutional Neural Network (CNN) architecture that has between 2-2,800x fewer network parameters compared to State-Of-The-Art (SOTA) CNN architectures and achieves 5-fold cross validation accuracy of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn Electroencephalogram (EEG), CHB-MIT and SWEC-ETHZ seizure datasets, respectively. We subsequently implement our network onto analog crossbar arrays comprising Resistive Random-Access Memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining hardware requirements of the CNN component of our system. We parallelize the execution of convolution layer kernels on separate analog crossbars to enable 2 orders of magnitude reduction in latency compared to SOTA hybrid Memristive-CMOS Deep Learning (DL) accelerators. Furthermore, we investigate the effects of non-idealities on our system and investigate Quantization Aware Training (QAT) to mitigate the performance degradation due to low Analog-to-Digital Converter (ADC)/Digital-to-Analog Converter (DAC) resolution. Finally, we propose a stuck weight offsetting methodology to mitigate performance degradation due to stuck RON/ROFF memristor weights, recovering up to 32% accuracy, without requiring retraining. The CNN component of our platform is estimated to consume approximately 2.791 W of power while occupying an area of 31.255 mm(2) in a 22 nm FDSOI CMOS process.')

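# Predict on the test set

from tqdm import tqdm

label = []

for i in tqdm(range(len(test_df))):
    test_item = test_df.iloc[i]
    # Fields are read by position: 1 = title, 2 = author, 3 = abstract
    test_input = f"title:{test_item.iloc[1]},author:{test_item.iloc[2]},abstract:{test_item.iloc[3]}"
    # int() raises ValueError if the model returns anything other than a bare integer
    label.append(int(predict(test_input)))

test_df['label'] = label

submit = test_df[['uuid', 'Keywords', 'label']]

submit.to_csv('submit.csv', index=False)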
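The int(predict(...)) call in the loop above fails if the model ever returns extra text around the label. A more forgiving parser could look like the sketch below. The function name and the fallback default are hypothetical choices, not part of the original pipeline, and a silent fallback can hide real problems, so it is worth counting how often it triggers.

```python
import re


def parse_label(response: str, default: int = 0) -> int:
    """Extract the first standalone 0 or 1 from a model response."""
    match = re.search(r"(?<!\d)[01](?!\d)", response)
    return int(match.group()) if match else default


for raw in ("1", " 0\n", "The answer is 1.", "not sure"):
    print(repr(raw), "->", parse_label(raw))
```

In the prediction loop this would replace int(predict(test_input)) with parse_label(predict(test_input)).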

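5. Lessons learned from NLP competitions when fine-tuning Transformer models:

Make full use of pretrained models: current pretrained language models already extract strong semantic features, and fine-tuning them directly often gives good results.
Try different fine-tuning techniques: for example, LoRA trains only a small set of low-rank parameters, which is far more parameter-efficient than full fine-tuning and can still give good results.
Design the prompt carefully: a prompt (sentence template) tailored to the task helps the model capture what the downstream task requires.
Run multiple validation experiments: try combinations of hyperparameters such as model size, prompt length, and batch size to find the best configuration.
Watch for overfitting: large models overfit easily, so use strategies such as early stopping, or augment the training data.
Score-boosting tricks: combining statistical features, or stacking the outputs of multiple models, can further improve the score.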
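As one concrete form of the early-stopping tip, here is a minimal patience-based sketch. The class name, parameters, and validation losses are all made up for illustration; it is independent of any particular training framework.

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for `patience` checks."""

    def __init__(self, patience: int = 2, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss: float) -> bool:
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement on this check
        return self.bad_checks >= self.patience


val_losses = [0.62, 0.48, 0.41, 0.43, 0.44, 0.40]  # made-up values for illustration
stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        print(f"stopping after epoch {epoch}, best val loss {stopper.best}")
        break
```

In practice the checkpoint with the best validation loss is the one to keep, not the last one trained.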