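[deepseek] (2): Running the deepseek-coder-6.7b-instruct model on an RTX 3080 Ti. Either because FastChat does not list this version as supported or because of a problem with the model, the output loops endlessly on EOT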

1. Demo video

https://www.bilibili.com/video/BV1a64y157Hc/

Video title: [deepseek] (2): Running deepseek-coder-6.7b-instruct with FastChat on an RTX 3080 Ti hits an endless-loop EOT bug

2. About the single RTX 3080 Ti, a GPU from 2021

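The RTX 3080 Ti delivers 34 TFLOPS of shader performance, 67 TFLOPS of ray-tracing performance, and 273 TFLOPS of Tensor (sparsity) performance. Its design still resembles the existing RTX 3080 FE (Founders Edition) card, with a dual-slot, dual-sided air cooler, but it is not as bulky as the RTX 3090 (BFG). The auxiliary power connector on the side is still a 12-pin Microfit.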

3. About the deepseek-coder-6.7b-instruct model, uploaded on December 1

The model covers code generation only:
https://zhuanlan.zhihu.com/p/666077213

https://www.modelscope.cn/models/deepseek-ai/deepseek-coder-6.7b-instruct/summary

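About DeepSeek
DeepSeek is committed to exploring the essence of AGI, refusing to settle for the middle of the road, and answering the biggest questions with curiosity and the longest-term view.

DeepSeek Coder is the first-generation large model released by DeepSeek. In the near future, we will bring more and better research results to the community. Let us work together to advance the arrival of AGI in this exciting era!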

https://github.com/lm-sys/FastChat/blob/main/docs/model_support.md


3. Creating the environment on AutoDL and installing the latest FastChat

Choose an image with Python 3.10, otherwise execution fails with errors:
Miniconda conda3
Python 3.10(ubuntu22.04)
Cuda 11.8


apt update && apt install -y git-lfs net-tools
# Make sure there is plenty of disk space before cloning the model:
cd /root/autodl-tmp
git clone https://www.modelscope.cn/deepseek-ai/deepseek-coder-6.7b-instruct.git

# Finally, install FastChat
pip3 install "fschat[model_worker,webui]"

Once the installation finishes, the model can be started with FastChat.

4. Starting the deepseek-coder-6.7b-instruct model with FastChat

Startup script:

# run_all_deepseek.sh

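# Kill all running FastChat services (grep -v grep skips the grep process itself;
# xargs -r does nothing when no process matches)
ps -ef | grep fastchat.serve | grep -v grep | awk '{print $2}' | xargs -r kill -9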
sleep 3

rm -f *.log

# Start the controller first:
nohup python3 -m fastchat.serve.controller --host 0.0.0.0 --port 21001 > controller.log 2>&1 &

# Start the OpenAI-compatible API server on port 8000
nohup python3 -m fastchat.serve.openai_api_server --controller-address http://127.0.0.1:21001 \
  --host 0.0.0.0 --port 8000 > api_server.log 2>&1 &

# Start the web UI
nohup python3 -m fastchat.serve.gradio_web_server --model-list-mode reload \
  --controller-url http://127.0.0.1:21001 \
  --host 0.0.0.0 --port 6006 > web_server.log 2>&1 &

# Start the model worker
nohup python3 -m fastchat.serve.model_worker  --load-8bit --model-names deepseek-coder-6.7b \
  --model-path ./deepseek-coder-6.7b-instruct --controller-address http://127.0.0.1:21001 \
  --worker-address http://127.0.0.1:8080 --host 0.0.0.0 --port 8080 > model_worker.log 2>&1 &

sleep 2

tail -f model_worker.log

Fix: the GPU ran out of memory when loading the model. Adding the --load-8bit parameter resolved the error below:

2023-12-08 23:01:38 | ERROR | stderr |     return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
2023-12-08 23:01:38 | ERROR | stderr | torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB (GPU 0; 11.76 GiB total capacity; 11.48 GiB already allocated; 27.19 MiB free; 11.49 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
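
A back-of-the-envelope estimate suggests why the unquantized load fails on this card. The sketch below assumes roughly 6.7 billion parameters (taken from the model name, not measured) and counts weights only, so the real footprint is higher. estimate_gib is a hypothetical helper, not part of FastChat.

```shell
#!/usr/bin/env bash
# Rough weight-only memory estimate. It ignores activations, the KV cache
# and framework overhead, so treat the result as a lower bound.
# $1: parameter count in billions (assumed from the model name)
# $2: bytes per parameter (2 for fp16, 1 for int8)
estimate_gib() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.2f\n", p * 1e9 * b / (1024 ^ 3) }'
}

echo "fp16 weights: $(estimate_gib 6.7 2) GiB"
echo "int8 weights: $(estimate_gib 6.7 1) GiB"
```

At 2 bytes per parameter the weights alone come to about 12.48 GiB, more than the 11.76 GiB total capacity reported in the error message. At 1 byte per parameter they need about 6.24 GiB, which leaves headroom for everything else.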

Startup succeeded:

2023-12-09 09:21:19 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=8080, worker_address='http://127.0.0.1:8080', controller_address='http://127.0.0.1:21001', model_path='./deepseek-coder-6.7b-instruct', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, dtype=None, load_8bit=True, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, enable_exllama=False, exllama_max_seq_len=4096, exllama_gpu_split=None, enable_xft=False, xft_max_seq_len=4096, xft_dtype=None, model_names=['deepseek-coder-6.7b'], conv_template=None, embed_in_truncate=False, limit_worker_concurrency=5, stream_interval=2, no_register=False, seed=None, debug=False, ssl=False)
2023-12-09 09:21:19 | INFO | model_worker | Loading the model ['deepseek-coder-6.7b'] on worker f6111a86 ...
  0%|                                                                                                        | 0/2 [00:00<?, ?it/s]
 50%|████████████████████████████████████████████████                                                | 1/2 [00:10<00:10, 10.43s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 21.46s/it]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:39<00:00, 19.80s/it]
2023-12-09 09:21:59 | ERROR | stderr | 
2023-12-09 09:21:59 | INFO | model_worker | Register to controller
2023-12-09 09:21:59 | ERROR | stderr | INFO:     Started server process [1908]
2023-12-09 09:21:59 | ERROR | stderr | INFO:     Waiting for application startup.
2023-12-09 09:21:59 | ERROR | stderr | INFO:     Application startup complete.
2023-12-09 09:21:59 | ERROR | stderr | INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
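
The log shows that loading took about 40 seconds (09:21:19 to 09:21:59), so requests sent right after the script's short sleep may arrive before the worker has registered. One option is to poll the API server's /v1/models endpoint until the model name appears. The following is a minimal sketch: has_model, wait_for_model, the API_URL variable, and the 120-second default timeout are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# has_model: read a /v1/models JSON response on stdin and succeed
# if the given model id is listed in it.
has_model() {
  grep -q "\"id\": *\"$1\""
}

# wait_for_model: poll the API server until the model is registered
# or the timeout (in seconds) expires.
wait_for_model() {
  local name="$1"
  local timeout="${2:-120}"
  local api_url="${API_URL:-http://127.0.0.1:8000}"
  local waited=0
  while [ "$waited" -lt "$timeout" ]; do
    if curl -s "$api_url/v1/models" | has_model "$name"; then
      echo "model $name is registered"
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "timed out waiting for $name" >&2
  return 1
}
```

Calling wait_for_model deepseek-coder-6.7b at the end of the startup script, before sending any test request, would replace guessing at a sleep duration.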

The web UI gets stuck in an endless loop.

Testing the API endpoint with streaming output shows the same endless loop:

curl http://localhost:8000/v1/chat/completions   -H "Content-Type: application/json"   -d '{
     "model": "deepseek-coder-6.7b", "stream":true,
     "messages": [{"role": "user", "content": "生成golang的helloworld代码"}], 
     "temperature": 0.7
   }'

During generation, the model falls into an endless loop of EOT tokens:

data: {"id": "chatcmpl-kBMteGhWJjkgUB7i3qWt8v", "model": "deepseek-coder-6.7b", "choices": [{"index": 0, "delta": {"content": "\n<|EOT|>"}, "finish_reason": null}]}

data: {"id": "chatcmpl-kBMteGhWJjkgUB7i3qWt8v", "model": "deepseek-coder-6.7b", "choices": [{"index": 0, "delta": {"content": "\n<|EOT|>"}, "finish_reason": null}]}

data: {"id": "chatcmpl-kBMteGhWJjkgUB7i3qWt8v", "model": "deepseek-coder-6.7b", "choices": [{"index": 0, "delta": {"content": "\n<|EOT|>"}, "finish_reason": null}]}

data: {"id": "chatcmpl-kBMteGhWJjkgUB7i3qWt8v", "model": "deepseek-coder-6.7b", "choices": [{"index": 0, "delta": {"content": "\n<|EOT|>"}, "finish_reason": null}]}

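Until the root cause is found, a client can at least stop reading once the first EOT marker shows up. The sketch below is a client-side workaround only, not a fix for the server-side loop. stop_at_eot is a hypothetical helper that passes streamed lines through and quits at the first line containing <|EOT|>.

```shell
#!/usr/bin/env bash
# stop_at_eot: copy stdin to stdout line by line and stop at the first
# line that contains the literal marker <|EOT|> (that line is not printed).
# index() does a plain substring search, so the | characters need no escaping.
stop_at_eot() {
  awk 'index($0, "<|EOT|>") { exit } { print }'
}

# Intended use (needs the running API server, so it is left as a comment):
#   curl -sN http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "deepseek-coder-6.7b", "stream": true,
#          "messages": [{"role": "user", "content": "hello"}]}' | stop_at_eot
```

Once awk exits, curl fails on its next write and closes the connection. Whether the worker then stops generating was not verified here. The OpenAI-style request schema also has a stop field; passing "<|EOT|>" there may be worth trying, but that was not tested in this setup either.
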
5. Summary

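The model supports code generation in multiple languages, but as tested here deepseek-coder-6.7b-instruct loops endlessly on <|EOT|>.
It is unclear whether the cause is quantization or a FastChat compatibility issue.
The model was started as an int8-quantized version, and it is also not the 33B version that FastChat officially lists as compatible.
A follow-up test will verify the behavior with the original 33B model.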