[Installation] pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows #9701
Please follow these instructions on how to install for CPU.
Based on your title, you originally have PyTorch for CPU installed and do not want the CUDA version to be installed, so I guess you want the CPU version of vLLM as well. Correct me if I'm wrong.
On Windows, you should be able to use vLLM via WSL if I recall correctly.
Of course not. It says from beginning to end that the CUDA version has been replaced with the CPU version. All I want is the CUDA version.
Oh sorry, I somehow read it the other way round. vLLM only officially supports Linux, so it might not be able to detect your CUDA installation from native Windows. I suggest using vLLM through WSL.
WSL has fatal flaws; if I could, I would no longer use WSL at all. The problem on your side is that torch with CUDA is not detected. That sounds like the pip and cmake commands are not handled well. You could learn from flash-attention: flash-attention can now be installed on Windows by building without an isolated environment, so it can find the installed torch and compile against it. If that is the problem, it should not be hard to solve.
@dtrifiro what's your opinion on supporting Windows? Is it feasible at this stage?
@DarkLight1337 @dtrifiro
From my understanding, PyTorch installation should be able to automatically choose CPU/CUDA based on your machine. What happens if you just install
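For reference, you can check which build of torch is currently active with something like this (a minimal check I'm adding here; nothing vLLM-specific):

import torch

# Shows whether this torch build was compiled with CUDA and can see a GPU.
print(torch.__version__)          # e.g. 2.5.0+cu124 for a CUDA build, 2.5.0+cpu for CPU-only
print(torch.version.cuda)         # None on a CPU-only build
print(torch.cuda.is_available())  # False if CUDA is not usable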
The network environment in mainland China is not great, and I don't want to fiddle with torch versions any more. After a successful pip install vllm, I found that torch had been replaced by the CPU build. After reinstalling the CUDA build of torch, I ran vLLM once and it failed with the error below (I'm not sure whether this error was later superseded):
File d:\my\env\python3.10.10\lib\site-packages\vllm\entrypoints\llm.py:177, in LLM.__init__(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, mm_processor_kwargs, **kwargs)
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\llm_engine.py:570, in LLMEngine.from_engine_args(cls, engine_args, usage_context, stat_loggers)
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:903, in EngineArgs.create_engine_config(self)
File d:\my\env\python3.10.10\lib\site-packages\vllm\engine\arg_utils.py:839, in EngineArgs.create_model_config(self)
File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:200, in ModelConfig.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format, mm_processor_kwargs)
File d:\my\env\python3.10.10\lib\site-packages\vllm\config.py:219, in ModelConfig._init_multimodal_config(self, limit_mm_per_prompt)
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:387, in _ModelRegistry.is_multimodal_model(self, architectures)
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:356, in _ModelRegistry.inspect_model_cls(self, architectures)
File d:\my\env\python3.10.10\lib\site-packages\vllm\model_executor\models\registry.py:317, in _ModelRegistry._raise_for_unsupported(self, architectures)
ValueError: Model architectures ['Qwen2ForCausalLM'] are not supported for now.
Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'ArcticForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'DeepseekV2ForCausalLM', 'ExaoneForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'Gemma2ForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'GraniteForCausalLM', 'GraniteMoeForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'JambaForCausalLM', 'LlamaForCausalLM', 'LLaMAForCausalLM', 'MambaForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'MiniCPM3ForCausalLM', 'NemotronForCausalLM', 'OlmoForCausalLM', 'OlmoeForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PersimmonForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'Phi3SmallForCausalLM', 'PhiMoEForCausalLM', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'SolarForCausalLM', 'XverseForCausalLM', 'BartModel', 'BartForConditionalGeneration', 'Gemma2Model', 'MistralModel', 'Qwen2ForRewardModel', 'Phi3VForCausalLM', 'Blip2ForConditionalGeneration', 'ChameleonForConditionalGeneration', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'FuyuForCausalLM', 'InternVLChatModel', 'LlavaForConditionalGeneration', 'LlavaNextForConditionalGeneration', 'LlavaNextVideoForConditionalGeneration', 'LlavaOnevisionForConditionalGeneration', 'MiniCPMV', 'MolmoForCausalLM', 'NVLM_D', 'PaliGemmaForConditionalGeneration', 'PixtralForConditionalGeneration', 'QWenLMHeadModel', 'Qwen2VLForConditionalGeneration', 'UltravoxModel', 'MllamaForConditionalGeneration', 'EAGLEModel', 'MedusaModel', 'MLPSpeculatorPreTrainedModel']
This looks like an error I've encountered before: https://stackoverflow.com/questions/23212435/permission-denied-to-write-to-my-temporary-file. It can be solved by writing to a temporary directory instead; let me see if I can fix this real quick.
Hi @DarkLight1337, can you take a look at this installation issue: #9180
Fixed. Feel free to reopen if you still encounter issues.
@DarkLight1337
This looks like a problem inside xformers. Maybe you should use other backends by setting the VLLM_ATTENTION_BACKEND environment variable.
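For example, something like the following sketch (it assumes the VLLM_ATTENTION_BACKEND environment variable and these backend names apply to your vLLM version; the model name is just a placeholder):

import os

# Must be set before vLLM initializes its attention backend selector.
os.environ["VLLM_ATTENTION_BACKEND"] = "TORCH_SDPA"  # or "FLASH_ATTN"

from vllm import LLM

llm = LLM(model="Qwen/Qwen2-0.5B-Instruct")  # placeholder model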
@DarkLight1337 INFO 10-30 15:04:09 selector.py:267] Cannot use FlashAttention-2 backend because the vllm.vllm_flash_attn package is not found. Make sure that vllm_flash_attn was built and installed (on by default).
Can you use PyTorch SDPA?
What is PyTorch SDPA?
It is built into PyTorch, so you should be able to use it as long as PyTorch is installed.
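For example, a minimal standalone call to PyTorch's built-in scaled dot-product attention (the shapes and dtype here are arbitrary toy values I'm adding for illustration):

import torch
import torch.nn.functional as F

# Toy tensors: (batch, heads, seq_len, head_dim).
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32
q = torch.randn(2, 4, 8, 64, device=device, dtype=dtype)
k = torch.randn(2, 4, 8, 64, device=device, dtype=dtype)
v = torch.randn(2, 4, 8, 64, device=device, dtype=dtype)

# PyTorch picks an efficient kernel (flash / memory-efficient / math) when available.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 8, 64])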
Now the problem is that FlashAttention-2 cannot be found.
Do you have an SDPA sample?
This is where I'm unable to really help you. I guess vLLM's flash attention package only works on Linux.
Maybe @dtrifiro can provide some insights here?
But flash-attn already supports Windows. So does vLLM's flash-attention need to be rebuilt? Or tell me how to build it from source.
from flash_attn.flash_attn_interface import flash_attn_func
from flash_attn.flash_attn_interface import flash_attn_with_kvcache
import torch

def main():
    batch_size = 2
    seqlen_q = 1
    seqlen_k = 1
    nheads = 4
    n_kv_heads = 2
    d = 3
    device = "cuda"
    causal = True
    window_size = (-1, -1)
    dtype = torch.float16
    paged_kv_cache_size = None
    cache_seqlens = None
    rotary_cos = None
    rotary_sin = None
    cache_batch_idx = None
    block_table = None
    softmax_scale = None
    rotary_interleaved = False
    alibi_slopes = None
    num_splits = 0
    max_seq_len = 3
    if paged_kv_cache_size is None:
        k_cache = torch.zeros(batch_size, max_seq_len, n_kv_heads, d, device=device, dtype=dtype)
        v_cache = torch.zeros(batch_size, max_seq_len, n_kv_heads, d, device=device, dtype=dtype)
        block_table = None
    prev_q_vals = []
    prev_k_vals = []
    prev_v_vals = []
    torch.manual_seed(0)
    for i in range(0, 3):
        print(">>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>")
        q = torch.randn(batch_size, seqlen_q, nheads, d, device=device, dtype=dtype)
        k = torch.randn(batch_size, seqlen_k, n_kv_heads, d, device=device, dtype=dtype)
        v = torch.randn(batch_size, seqlen_k, n_kv_heads, d, device=device, dtype=dtype)
        # kv cache
        cache_seqlens = torch.tensor([i] * batch_size, dtype=torch.int32, device=device)
        output_kvcache = flash_attn_with_kvcache(
            q=q,
            k_cache=k_cache,
            v_cache=v_cache,
            k=k,
            v=v,
            rotary_cos=rotary_cos,
            rotary_sin=rotary_sin,
            cache_seqlens=cache_seqlens,
            cache_batch_idx=cache_batch_idx,
            cache_leftpad=None,
            block_table=block_table,
            softmax_scale=softmax_scale,
            causal=causal,
            window_size=window_size,
            softcap=0.0,
            rotary_interleaved=rotary_interleaved,
            alibi_slopes=alibi_slopes,
            num_splits=num_splits,
            return_softmax_lse=False)
        print(f"$$$ output KV CACHE MHA at {i} \n", output_kvcache)
        # non kv cache MHA
        prev_q_vals.append(q)
        prev_k_vals.append(k)
        prev_v_vals.append(v)
        output_2 = flash_attn_func(
            q=q,
            k=torch.concat(prev_k_vals, axis=1),
            v=torch.concat(prev_v_vals, axis=1),
            dropout_p=0.0,
            softmax_scale=None,
            causal=causal,
            window_size=window_size,
            softcap=0.0,
            alibi_slopes=None,
            deterministic=False,
            return_attn_probs=False)
        print(f"!!! output MHA NON KV CACHE at {i} \n", output_2)
main()

Output (tensor values omitted):
$$$ output KV CACHE MHA at 0
!!! output MHA NON KV CACHE at 0
$$$ output KV CACHE MHA at 1
!!! output MHA NON KV CACHE at 1
$$$ output KV CACHE MHA at 2
!!! output MHA NON KV CACHE at 2
But I can run this successfully.
vLLM uses a fork of the flash-attn package (vllm-project/flash-attention).
ImportError("cannot import name 'flash_attn_varlen_func' from 'vllm.vllm_flash_attn' (unknown location)")
It's listed in this file: https://github.com/vllm-project/flash-attention/blob/5259c586c403a4e4d8bf69973c159b40cc346fb9/vllm_flash_attn/__init__.py
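To check what your installed vLLM actually exposes, a quick probe like this should work (my own snippet; it only assumes vLLM is importable):

# Probe the bundled flash-attention fork inside the installed vLLM.
try:
    from vllm.vllm_flash_attn import flash_attn_varlen_func  # noqa: F401
    print("vllm.vllm_flash_attn is importable")
except ImportError as exc:
    print("vllm.vllm_flash_attn is not usable:", exc)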
@DarkLight1337 You mean I need to replace all the vllm_flash_attn files? Why isn't this updated in vllm-project?
I am not sure what you mean. Those functions are defined inside the vllm_flash_attn package.
Right now, the contents of the vllm_flash_attn files in the vllm-project repo are different from those in the vllm-project/flash-attention repo.
After you clone the vLLM repo, you should build from source using the provided instructions (in your case, it's better to perform a full build to make sure you have the latest version of the compiled binaries). The build should download the files from the vLLM flash-attention fork.
I am already installing from source. But after cloning the vLLM repo, their source code is different.
In the vLLM main repo, the vllm_flash_attn directory is empty in the source tree; it gets populated during the build.
Mine is empty here too. In other words, vllm_flash_attn is not included as a subproject. Do I need to manually clone the vllm_flash_attn project (https://github.com/vllm-project/flash-attention) first?
How are you installing vLLM from source? Can you show the commands you've used?
@dtrifiro
@DarkLight1337
This is outside of my domain, as I'm not involved with the vLLM build process. @dtrifiro may be able to help you more.
The problem has not been resolved and needs to be reopened. Also, can you help me contact @dtrifiro (Daniele)? Only he or his project team can solve it, but when I @ him, he doesn't respond.
…------------------ Original message ------------------
From: "Simon ***@***.***>;
Date: Tuesday, 29 October 2024, 1:08 PM
To: ***@***.***>;
Cc: ***@***.***>; ***@***.***>;
Subject: Re: [vllm-project/vllm] [Installation] pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows (Issue #9701)
Closed #9701 as completed via #9721.
Same problem here.
-- USE_CUDNN is set to 0. Compiling without cuDNN support
pip install vllm (0.6.3) will force a reinstallation of the CPU version torch and replace cuda torch on windows.
What is your original version of pytorch?
Originally posted by @DarkLight1337 in #4194 (comment)
pip show torch
Name: torch
Version: 2.5.0+cu124
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3-Clause
Location: d:\my\env\python3.10.10\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: accelerate, auto_gptq, bitsandbytes, compressed-tensors, encodec, flash_attn, optimum, peft, stable-baselines3, timm, torchaudio, torchvision, trl, vector-quantize-pytorch, vocos