While learning ChatGLM2 with chatglm2-6b-int4, calling model.half().cuda() produced the following error:
CUDA Error: no kernel image is available for execution on the device
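This CUDA error generally means that a kernel binary was not built for the architecture of the GPU in use. A minimal sketch for checking what the GPU reports, assuming PyTorch is the installed framework; the helper name describe_gpu is hypothetical. Note that the architecture list printed here covers only PyTorch's own kernels, not the custom kernels launched from quantization.py, so the compute capability is the more relevant value.

```python
def describe_gpu() -> str:
    """Return a one-line summary of the first GPU's compute capability."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"
    if not torch.cuda.is_available():
        return "no CUDA device is visible to PyTorch"
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    # Architectures that this PyTorch build itself was compiled for.
    archs = ", ".join(torch.cuda.get_arch_list())
    return f"{name}: compute capability {major}.{minor}; PyTorch built for: {archs}"


print(describe_gpu())
```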
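If you only want to get the model running and do not mind the speed, you can try the simple workaround below:
1. Load the model from a local copy of the model code, so that the code is easy to modify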
from chatglm2_6b_int4.configuration_chatglm import *
from chatglm2_6b_int4.modeling_chatglm import *
from chatglm2_6b_int4.tokenization_chatglm import *
from chatglm2_6b_int4.quantization import *
tokenizer = ChatGLMTokenizer.from_pretrained("chatglm2_6b_int4/")
model = ChatGLMForConditionalGeneration.from_pretrained("chatglm2_6b_int4/").half().cuda()
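2. In chatglm2_6b_int4/quantization.py, edit the extract_weight_to_half function: comment out the CUDA kernel launch and unpack the weights with plain PyTorch operations instead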
# func(
#     gridDim,
#     blockDim,
#     0,
#     stream,
#     [
#         ctypes.c_void_p(weight.data_ptr()),
#         ctypes.c_void_p(scale_list.data_ptr()),
#         ctypes.c_void_p(out.data_ptr()),
#         ctypes.c_int32(n),
#         ctypes.c_int32(m),
#     ],
# )
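# Each int8 element of weight packs two signed 4-bit values.
# High nibble: the arithmetic right shift keeps the sign.
out[:, 0::2] = scale_list.view(-1, 1) * (weight >> 4)
# Low nibble: shift left to drop the high nibble, then shift back to sign-extend.
out[:, 1::2] = scale_list.view(-1, 1) * ((weight << 4) >> 4)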
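To make the two shifts above easier to follow, here is a minimal pure-Python sketch of the same nibble unpacking on single bytes. All function names are hypothetical, and to_int8 only emulates the wrap-around that an int8 tensor element gets for free; this is an illustration of the idea, not the code from quantization.py.

```python
from typing import List, Tuple


def to_int8(x: int) -> int:
    """Wrap an integer into the signed 8-bit range, like an int8 tensor element."""
    x &= 0xFF
    return x - 256 if x >= 128 else x


def pack_int4_pair(high: int, low: int) -> int:
    """Pack two signed 4-bit values (each in -8..7) into one int8 byte."""
    assert -8 <= high <= 7 and -8 <= low <= 7
    return to_int8(((high & 0xF) << 4) | (low & 0xF))


def unpack_int4_pair(byte: int) -> Tuple[int, int]:
    """Recover both signed 4-bit values with the same shifts as the patch."""
    high = byte >> 4               # arithmetic shift keeps the sign
    low = to_int8(byte << 4) >> 4  # drop the high nibble, then sign-extend
    return high, low


def dequantize_row(packed: List[int], scale: float) -> List[float]:
    """Expand each packed byte into two scaled values, high nibble first."""
    out: List[float] = []
    for byte in packed:
        high, low = unpack_int4_pair(byte)
        out.extend([scale * high, scale * low])
    return out


row = [pack_int4_pair(-3, 5), pack_int4_pair(7, -8)]
print(dequantize_row(row, 0.5))  # [-1.5, 2.5, 3.5, -4.0]
```

The high nibble lands in the even output columns and the low nibble in the odd ones, which matches the 0::2 and 1::2 slices in the patch.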
3. In chatglm2_6b_int4/quantization.py, edit the quant_gemv function in the same way
# func(
#     gridDim,
#     blockDim,
#     shm_size,
#     stream,
#     [
#         ctypes.c_void_p(weight.data_ptr()),
#         ctypes.c_void_p(input.data_ptr()),
#         ctypes.c_void_p(scale_list.data_ptr()),
#         ctypes.c_void_p(out.data_ptr()),
#         ctypes.c_int32(m),
#         ctypes.c_int32(k),
#     ],
# )
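# The bit width is inferred from the activation dtype. After .half() the input
# is float16, so the 4-bit branch is the one that runs.
if input.dtype == torch.float:
    source_bit_width = 8
elif input.dtype == torch.float16:
    source_bit_width = 4
else:
    assert False, f"unsupported input type: {input.dtype}"
# Note: the unpacking below only handles the 4-bit layout (two values per byte),
# so the unpacked matrix has twice as many columns as weight.
tmp = torch.empty(weight.size(0), weight.size(1) * (8 // source_bit_width), dtype=input.dtype, device="cuda")
tmp[:, 0::2] = scale_list.view(-1, 1) * (weight >> 4)
tmp[:, 1::2] = scale_list.view(-1, 1) * ((weight << 4) >> 4)
# A plain dense matmul replaces the quantized kernel.
out = torch.matmul(input, tmp.transpose(1, 0))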
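The replacement above amounts to "dequantize, then do an ordinary matrix product". A minimal pure-Python sketch of that computation on small lists follows; the function names and the sample values are made up for illustration. It also shows that, because each weight row has a single scale, the scale can be applied after the dot product and gives the same result. That variant is included only as an illustration of how the full dequantized matrix could be avoided, not as a description of what the original kernel does.

```python
from typing import List


def dequantize(q: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Multiply every row of integer weights by that row's scale."""
    return [[s * v for v in row] for row, s in zip(q, scales)]


def gemv(x: List[float], w: List[List[float]]) -> List[float]:
    """Compute x @ w.T: one dot product per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in w]


def gemv_scale_after(x: List[float], q: List[List[int]], scales: List[float]) -> List[float]:
    """Same product, but the scale is applied once per output element."""
    return [s * sum(xi * v for xi, v in zip(x, row)) for row, s in zip(q, scales)]


# Hypothetical already-unpacked 4-bit weights, one scale per row.
q = [[-3, 5, 7, -8], [1, 0, -2, 4]]
scales = [0.5, 0.25]
x = [1.0, 2.0, 0.5, -1.0]

dense = gemv(x, dequantize(q, scales))
fused = gemv_scale_after(x, q, scales)
print(dense, fused)  # [9.25, -1.0] [9.25, -1.0]
```

The patched quant_gemv takes the first route and rebuilds the full dequantized matrix on every call, which is consistent with it being slow.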
In my tests this runs, at roughly the same speed as running chatglm2-6b on the CPU.