
Unable to start Yi-34B-Chat-4bits via fastchat #187

Closed
ryangsun opened this issue Nov 25, 2023 · 8 comments
Labels: doc-not-needed, quantization, question

Comments

@ryangsun

root@2d5b8b709f6b:~/llmodel0922start# python3 -m fastchat.serve.multi_model_worker --model-path /data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits --model-names Yi-34B-Chat-4bits --host 0.0.0.0
2023-11-25 09:56:03 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=21002, worker_address='http://localhost:21002', controller_address='http://localhost:21001', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, model_path=['/data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits'], model_names=[['Yi-34B-Chat-4bits']], limit_worker_concurrency=5, stream_interval=2, no_register=False)
2023-11-25 09:56:03 | INFO | model_worker | Loading the model ['Yi-34B-Chat-4bits'] on worker c5b518c4 ...
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

After this, system RAM keeps growing. I waited ten minutes; the GPU shows no load, and RAM climbed past 40 GB. Which setting did I get wrong?
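For reference, a minimal diagnostic sketch (plain transformers, outside FastChat; the device_map choice is an assumption) to check whether torch sees the GPU at all and where the AWQ shards actually end up. If everything reports cpu, the weights are being loaded on the CPU, which would match the RAM growth:

```python
# Sanity check, independent of FastChat: is the GPU visible to torch,
# and do the AWQ weights land on it when loaded directly via transformers?
import torch
from transformers import AutoModelForCausalLM

print("cuda available:", torch.cuda.is_available(), "| devices:", torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "/data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits",  # same path as the worker command above
    device_map="auto",        # let accelerate place the shards on the GPU
    trust_remote_code=True,
)
print({str(p.device) for p in model.parameters()})  # expect {'cuda:0'}, not {'cpu'}
```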

@forpanyang
Contributor

forpanyang commented Nov 27, 2023

Sorry, I cannot reproduce the issue on my side. The code behind fastchat.serve.multi_model_worker is relatively complex, so to simplify things and help narrow this down, could you check whether python3 -m fastchat.serve.cli --model-path /data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits runs normally?

@ZhaoFancy added the question and quantization labels on Nov 27, 2023
@xiaohuboer

xiaohuboer commented Nov 28, 2023

I'm hitting a similar situation: it just hangs at this point, and GPU memory usage does not increase.

Cuda compilation tools, release 11.7, V11.7.64
torch.version: 2.0.1+cu117

@Halflifefa

The model can be started and called with the following command, but it produces no output:
python -m fastchat.serve.vllm_worker --model-path 01-ai/Yi-34B-Chat-4bits --trust-remote-code --tensor-parallel-size 2 --quantization awq --max-model-len 4096

@ryangsun
Author

lm-sys/FastChat#2723
The FastChat side says they are working on it.

@xiaohuboer

I'm hitting a similar situation: it just hangs at this point, and GPU memory usage does not increase.

Cuda compilation tools, release 11.7, V11.7.64 torch.version: 2.0.1+cu117

Running the following command fixed it:
pip install transformers -U
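To confirm the upgrade took effect in the environment FastChat actually runs in, a quick version check (the rough 4.35+ threshold for AWQ loading is an assumption on my part):

```python
# Print the transformers version that fastchat will import.
import transformers
print(transformers.__version__)  # a recent release (roughly 4.35+) is expected for AWQ loading
```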

@codeman0987

I can load the model with python -m fastchat.serve.cli --model-path Yi-34B-Chat-4bits, but errors start as soon as I type a question:

(py311) [root@gpu-server models]# python -m fastchat.serve.cli --model-path Yi-34B-Chat-4bits/
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.73it/s]Human: 你好
Assistant: Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/cli.py", line 303, in
main(args)
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/cli.py", line 226, in main
chat_loop(
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/inference.py", line 532, in chat_loop
outputs = chatio.stream_output(output_stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/cli.py", line 63, in stream_output
for outputs in output_stream:
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/inference.py", line 132, in generate_stream
out = model(input_ids=start_ids, use_cache=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
outputs = self.model(
^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 366, in forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/awq/modules/linear.py", line 105, in forward
out = awq_inference_engine.gemm_forward_cuda(x.reshape(-1, x.shape[-1]), self.qweight, self.scales, self.qzeros, 8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
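This particular CUDA error usually means the AWQ kernel extension was built for GPU architectures other than the one in the machine. A minimal sketch to check the local setup (the note that prebuilt AWQ kernels target roughly Turing/sm_75 and newer is an assumption, not something stated in this thread):

```python
# Check which CUDA toolkit torch was built against and the GPU's compute capability.
import torch

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (compute capability {major}.{minor})")
# If the capability is below what the installed AWQ wheel was compiled for,
# the kernels need to be rebuilt from source for that architecture.
```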

@ryangsun
Author

ryangsun commented Dec 1, 2023

(quoting @codeman0987's comment and traceback above)

It's not just that it won't run; even when it does run, the variables don't line up. I think we'll have to wait for FastChat to update.

@findmyway
Contributor

Try the main branch, or wait for the next FastChat release.

@markli404 added the doc-not-needed label on Mar 4, 2024