CUDA extension not installed.
CUDA extension not installed.
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.18it/s]
Traceback (most recent call last):
File "D:\project\Qwen\cli_demo.py", line 210, in
main()
File "D:\project\Qwen\cli_demo.py", line 116, in main
model, tokenizer, config = _load_model_tokenizer(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\project\Qwen\cli_demo.py", line 54, in _load_model_tokenizer
model = AutoModelForCausalLM.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\qwen\Lib\site-packages\transformers\models\auto\auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\qwen\Lib\site-packages\transformers\modeling_utils.py", line 3928, in from_pretrained
model = quantizer.post_init_model(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\ProgramData\miniconda3\envs\qwen\Lib\site-packages\optimum\gptq\quantizer.py", line 588, in post_init_model
raise ValueError(
ValueError: Found modules on cpu/disk. Using Exllama or Exllamav2 backend requires all the modules to be on GPU.You can deactivate exllama backend by setting disable_exllama=True in the quantization config object
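
The traceback means the GPTQ checkpoint was partially offloaded to CPU/disk (typically because `device_map="auto"` ran out of VRAM), while the exllama backend requires every quantized module to sit on the GPU. A minimal sketch of the two usual workarounds follows; the checkpoint name `Qwen/Qwen-7B-Chat-Int4` is only an illustrative assumption (substitute whatever cli_demo.py actually loads), and it assumes a transformers version where `GPTQConfig` still accepts `disable_exllama`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Hypothetical checkpoint name for illustration; substitute the GPTQ
# checkpoint that cli_demo.py is actually loading.
checkpoint = "Qwen/Qwen-7B-Chat-Int4"

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)

# Workaround 1: keep every quantized module on a single GPU so the exllama
# kernels can run (needs enough VRAM to hold the whole model).
# model = AutoModelForCausalLM.from_pretrained(
#     checkpoint,
#     device_map="cuda:0",
#     trust_remote_code=True,
# ).eval()

# Workaround 2: keep device_map="auto" (cpu/disk offload allowed) but turn
# the exllama backend off, as the ValueError suggests. Slower, but it loads.
model = AutoModelForCausalLM.from_pretrained(
    checkpoint,
    device_map="auto",
    quantization_config=GPTQConfig(bits=4, disable_exllama=True),
    trust_remote_code=True,
).eval()
```

Note that newer transformers releases deprecate `disable_exllama` in favor of `use_exllama=False` on `GPTQConfig`; the effect is the same.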