【ChatGLM3】（6）：使用1个2080Ti-11G版本，运行ChatGLM3-Int8模型，可以正常运行，速度6 words/s，不支持vllm启动，2张卡速度24 words/s

1，演示视频

https://www.bilibili.com/video/BV1pj41157Dg/

【ChatGLM3】（6）：使用1个2080Ti-11G版本，运行ChatGLM3-Int8模型，可以正常运行，速度6 words/s，不支持vllm启动

2，关于2080TI，5年前老显卡是支持的

在这里插入图片描述

NVIDIA GeForce RTX 2080Ti参数
显存容量： 11264MB 显存位宽：
352bit 核心频率： 1350/1635MHz 显存频率： 14000MHz
发布日期 2018年04月

环境使用：
CPU ：12 核心
内存：40 GB
GPU ：NVIDIA 2080TI 11G 1个

可以支持，理论上 7.0 算力的都支持。

3，关于ChatGLM3的模型

ChatGLM3-6B 是 ChatGLM 系列最新一代的开源模型，在保留了前两代模型对话流畅、部署门槛低等众多优秀特性的基础上，ChatGLM3-6B 引入了如下特性：

更强大的基础模型： ChatGLM3-6B 的基础模型 ChatGLM3-6B-Base 采用了更多样的训练数据、更充分的训练步数和更合理的训练策略。在语义、数学、推理、代码、知识等不同角度的数据集上测评显示，ChatGLM3-6B-Base 具有在 10B 以下的预训练模型中最强的性能。
更完整的功能支持： ChatGLM3-6B 采用了全新设计的 Prompt 格式，除正常的多轮对话外。同时原生支持工具调用（Function Call）、代码执行（Code Interpreter）和 Agent 任务等复杂场景。
更全面的开源序列：除了对话模型 ChatGLM3-6B 外，还开源了基础模型 ChatGLM-6B-Base、长文本对话模型 ChatGLM3-6B-32K。以上所有权重对学术研究完全开放，在填写问卷进行登记后亦允许免费商业使用。

模型下载地址：
https://www.modelscope.cn/models/ZhipuAI/chatglm3-6b/summary

https://gitee.com/mirrors/ChatGLM-6B

下载后占空间：
68G Yi-34B-Chat-4bits

3，使用autodl创建环境，安装最新的 fastchat

apt update && apt install -y git-lfs net-tools
# 一定要保证有大磁盘空间：
cd /root/autodl-tmp
git clone https://www.modelscope.cn/ZhipuAI/chatglm3-6b.git

# 最后安装 
pip3 install "fschat[model_worker,webui]"

安装完成之后就可以使用fastchat启动了。

4，使用 fastchat 启动 chatglm3-6b 模型

启动脚本：

# run_all_chatglm3.sh

# 清除全部 fastchat 服务
ps -ef | grep fastchat.serve | awk '{print$2}' | xargs kill -9
sleep 3

rm -f *.log

# 首先启动 controller ：
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 > controller.log 2>&1 &

# 启动 openapi的 兼容服务 地址 8000
nohup python3 -m fastchat.serve.openai_api_server --controller-address http://127.0.0.1:21001 \
  --host 0.0.0.0 --port 8000 > api_server.log 2>&1 &

# 启动 web ui
#nohup python -m fastchat.serve.gradio_web_server --controller-address http://127.0.0.1:21001 \
# --host 0.0.0.0 --port 6006 > web_server.log 2>&1 &

## 启动 worker 
nohup python3 -m fastchat.serve.model_worker  --model-names chatglm3-6b \
  --model-path ./chatglm3-6b --controller-address http://127.0.0.1:21001 \
  --worker-address http://127.0.0.1:8080 --host 0.0.0.0 --port 8080 > model_worker.log 2>&1 &

sleep 2

tail -f model_worker.log

解决：内存不够，增加参数 --load-8bit 解决：

2023-12-02 21:40:05 | ERROR | stderr | torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 108.00 MiB. GPU 0 has a total capacty of 10.75 GiB of which 90.50 MiB is free. Process 510054 has 10.66 GiB memory in use. Of the allocated memory 9.94 GiB is allocated by PyTorch, and 786.50 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

启动成功：

2023-12-02 22:02:36 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=8080, worker_address='http://127.0.0.1:8080', controller_address='http://127.0.0.1:21001', model_path='./chatglm3-6b', revision='main', device='cuda', gpus=None, num_gpus=2, max_gpu_memory=None, dtype=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, enable_exllama=False, exllama_max_seq_len=4096, exllama_gpu_split=None, enable_xft=False, xft_max_seq_len=4096, xft_dtype=None, model_names=['chatglm3-6b'], conv_template=None, embed_in_truncate=False, limit_worker_concurrency=5, stream_interval=2, no_register=False, seed=None, debug=False, ssl=False)
2023-12-02 22:02:36 | INFO | model_worker | Loading the model ['chatglm3-6b'] on worker 46604431 ...
Loading checkpoint shards:   0%|                                                                                                    | 0/7 [00:00<?, ?it/s]
Loading checkpoint shards:  14%|█████████████▏                                                                              | 1/7 [00:02<00:12,  2.11s/it]
Loading checkpoint shards:  29%|██████████████████████████▎                                                                 | 2/7 [00:04<00:10,  2.16s/it]
Loading checkpoint shards:  43%|███████████████████████████████████████▍                                                    | 3/7 [00:06<00:08,  2.12s/it]
Loading checkpoint shards:  57%|████████████████████████████████████████████████████▌                                       | 4/7 [00:08<00:06,  2.04s/it]
Loading checkpoint shards:  71%|█████████████████████████████████████████████████████████████████▋                          | 5/7 [00:10<00:04,  2.04s/it]
Loading checkpoint shards:  86%|██████████████████████████████████████████████████████████████████████████████▊             | 6/7 [00:12<00:02,  2.03s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:13<00:00,  1.76s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:13<00:00,  1.94s/it]
2023-12-02 22:02:50 | ERROR | stderr | 
2023-12-02 22:02:50 | INFO | model_worker | Register to controller
2023-12-02 22:02:51 | ERROR | stderr | INFO:     Started server process [2577]
2023-12-02 22:02:51 | ERROR | stderr | INFO:     Waiting for application startup.
2023-12-02 22:02:51 | ERROR | stderr | INFO:     Application startup complete.
2023-12-02 22:02:51 | ERROR | stderr | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)

然后在测试下 token 效果：

python3 -m fastchat.serve.test_throughput --controller-address http://127.0.0.1:21001 --model-name chatglm3-6b --n-thread 1
结果：6 words/s.

也可以使用2个 2080Ti ，可以不用 8bit 了：

nohup python3 -m fastchat.serve.model_worker --num-gpus 2 --model-names chatglm3-6b \
  --model-path ./chatglm3-6b --controller-address http://127.0.0.1:21001 \
  --worker-address http://127.0.0.1:8080 --host 0.0.0.0 --port 8080 > model_worker.log 2>&1 &

测速，反而提速了：

python3 -m fastchat.serve.test_throughput --controller-address http://127.0.0.1:21001 --model-name chatglm3-6b --n-thread 1
Models: ['chatglm3-6b']
worker_addr: http://127.0.0.1:8080
thread 0 goes to http://127.0.0.1:8080
Time (POST): 2.130715847015381 s
Time (Completion): 2.1307473182678223, n threads: 1, throughput: 24.873902008749692 words/s.

测试中文输出正常：

curl http://localhost:6006/v1/chat/completions   -H "Content-Type: application/json"   -d '{
     "model": "chatglm3-6b",
     "messages": [{"role": "user", "content": "北京景点"}],
     "temperature": 0.7
   }'

5，总结

测试效果还可以，发现不支持 vllm 优化，估计需要进行模型转换。
可以使用vllm 启动成功，但是模型不返回内容。
同时增加 --load-8bit 可以在 11 G 显卡上运行，占用显存 7G左右。
不增加参数，需要2张 2080Ti 11G 的显卡。模型大小就是显存占用大小。需要 12G 多显存。
综合看自己测试 2080Ti 还是可以使用的。跑 chatglm3-6b 没有啥问题。