
Unable to start Yi-34B-Chat-4bits via fastchat #187

Closed
ryangsun opened this issue Nov 25, 2023 · 8 comments
Labels: doc-not-needed, quantization, question

Comments

@ryangsun

root@2d5b8b709f6b:~/llmodel0922start# python3 -m fastchat.serve.multi_model_worker --model-path /data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits --model-names Yi-34B-Chat-4bits --host 0.0.0.0
2023-11-25 09:56:03 | INFO | model_worker | args: Namespace(host='0.0.0.0', port=21002, worker_address='http://localhost:21002', controller_address='http://localhost:21001', revision='main', device='cuda', gpus=None, num_gpus=1, max_gpu_memory=None, load_8bit=False, cpu_offloading=False, gptq_ckpt=None, gptq_wbits=16, gptq_groupsize=-1, gptq_act_order=False, awq_ckpt=None, awq_wbits=16, awq_groupsize=-1, model_path=['/data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits'], model_names=[['Yi-34B-Chat-4bits']], limit_worker_concurrency=5, stream_interval=2, no_register=False)
2023-11-25 09:56:03 | INFO | model_worker | Loading the model ['Yi-34B-Chat-4bits'] on worker c5b518c4 ...
You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at huggingface/transformers#24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

After this, system RAM keeps growing. I waited ten minutes; the GPU shows no load, and RAM climbed past 40 GB. Which setting did I get wrong?
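For reference, a minimal diagnostic sketch (plain transformers, outside FastChat; the device_map choice is an assumption) to check whether torch sees the GPU at all and where the AWQ shards actually end up. If everything reports cpu, the weights are being loaded on the CPU, which would match the RAM growth:

```python
# Sanity check, independent of FastChat: is the GPU visible to torch,
# and do the AWQ weights land on it when loaded directly via transformers?
import torch
from transformers import AutoModelForCausalLM

print("cuda available:", torch.cuda.is_available(), "| devices:", torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained(
    "/data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits",  # same path as the worker command above
    device_map="auto",        # let accelerate place the shards on the GPU
    trust_remote_code=True,
)
print({str(p.device) for p in model.parameters()})  # expect {'cuda:0'}, not {'cpu'}
```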

@forpanyang
Contributor

forpanyang commented Nov 27, 2023

Sorry, I cannot reproduce the issue on my side. The code behind fastchat.serve.multi_model_worker is relatively complex, so to simplify things and help narrow this down, could you check whether python3 -m fastchat.serve.cli --model-path /data/Yi-34B-Chat-4bits/Yi-34B-Chat-4bits runs normally?

@ZhaoFancy added the question and quantization labels on Nov 27, 2023
@xiaohuboer

xiaohuboer commented Nov 28, 2023

I'm hitting a similar situation: it just hangs at this point, and GPU memory usage does not increase.

Cuda compilation tools, release 11.7, V11.7.64
torch.version: 2.0.1+cu117

@Halflifefa

The model can be started and called with the following command, but it produces no output:
python -m fastchat.serve.vllm_worker --model-path 01-ai/Yi-34B-Chat-4bits --trust-remote-code --tensor-parallel-size 2 --quantization awq --max-model-len 4096

@ryangsun
Author

lm-sys/FastChat#2723
The FastChat side says they are working on it.

@xiaohuboer

I'm hitting a similar situation: it just hangs at this point, and GPU memory usage does not increase.

Cuda compilation tools, release 11.7, V11.7.64 torch.version: 2.0.1+cu117

Running the following command fixed it:
pip install transformers -U
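To confirm the upgrade took effect in the environment FastChat actually runs in, a quick version check (the rough 4.35+ threshold for AWQ loading is an assumption on my part):

```python
# Print the transformers version that fastchat will import.
import transformers
print(transformers.__version__)  # a recent release (roughly 4.35+) is expected for AWQ loading
```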

@codeman0987

I can load the model with python -m fastchat.serve.cli --model-path Yi-34B-Chat-4bits, but errors start as soon as I type a question:

(py311) [root@gpu-server models]# python -m fastchat.serve.cli --model-path Yi-34B-Chat-4bits/
You have loaded an AWQ model on CPU and have a CUDA device available, make sure to set your model on a GPU device in order to run your model.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 3.73it/s]Human: 你好
Assistant: Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/cli.py", line 303, in
main(args)
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/cli.py", line 226, in main
chat_loop(
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/inference.py", line 532, in chat_loop
outputs = chatio.stream_output(output_stream)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/cli.py", line 63, in stream_output
for outputs in output_stream:
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 35, in generator_context
response = gen.send(None)
^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/fastchat/serve/inference.py", line 132, in generate_stream
out = model(input_ids=start_ids, use_cache=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 1034, in forward
outputs = self.model(
^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 922, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 672, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/transformers/models/llama/modeling_llama.py", line 366, in forward
query_states = self.q_proj(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/.conda/envs/py311/lib/python3.11/site-packages/awq/modules/linear.py", line 105, in forward
out = awq_inference_engine.gemm_forward_cuda(x.reshape(-1, x.shape[-1]), self.qweight, self.scales, self.qzeros, 8)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
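This particular CUDA error usually means the AWQ kernel extension was built for GPU architectures other than the one in the machine. A minimal sketch to check the local setup (the note that prebuilt AWQ kernels target roughly Turing/sm_75 and newer is an assumption, not something stated in this thread):

```python
# Check which CUDA toolkit torch was built against and the GPU's compute capability.
import torch

print("torch:", torch.__version__, "| built with CUDA:", torch.version.cuda)
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} (compute capability {major}.{minor})")
# If the capability is below what the installed AWQ wheel was compiled for,
# the kernels need to be rebuilt from source for that architecture.
```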

@ryangsun
Author

ryangsun commented Dec 1, 2023

(quoting @codeman0987's comment and traceback above)

It's not just that it won't run; even when it does run, the variables don't line up. I think we'll have to wait for FastChat to update.

@findmyway
Contributor

Try the main branch, or wait for the next FastChat release.

@markli404 added the doc-not-needed label on Mar 4, 2024