[Bugfix] Allow vllm to still work if triton is not installed. #6786
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
I feel this approach is a bit hacky. In general, we should avoid importing triton kernels when triton is not available. If we really have to mock triton, is it possible to keep the API compatible so that we don't need the version != 0.0.0 guard?
@comaniac That guard is actually there to keep the code path enabled when using the mock (or triton >= 2.1.0); the mock is already compatible. We could just remove the guard entirely, imo, because the triton version should be controlled via requirements.txt. I will try a few things to see what it would take to do this without the mock entirely. My initial thought was that the mock would be the least intrusive way to achieve this.
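For readers following along, here is a rough sketch of the guard being discussed, as I read the thread: the mocked triton reports a placeholder `__version__` of `"0.0.0"`, so the check keeps the newer code path active under the mock as well as under real Triton >= 2.1.0. Names here are illustrative, not the PR's actual code.

```python
# Illustrative sketch only; assumes the mocked triton reports
# __version__ == "0.0.0" as a placeholder.
from packaging import version

import triton  # real triton, or the mock substituted when it is absent

USE_NEW_KERNEL_PATH = (
    triton.__version__ == "0.0.0"  # mocked triton: keep the path enabled
    or version.parse(triton.__version__) >= version.parse("2.1.0")
)
```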
(force-pushed from 90fe0bf to f778898)
@comaniac I did another pass through this. Changes are:
/ready
It looks much cleaner to me. Thanks!
@@ -239,188 +241,6 @@ def apply(self,
                             use_per_token_if_dynamic=False)


class Fp8MoEMethod(FusedMoEMethodBase):
I'd prefer to keep Fp8MoEMethod in fp8.py instead of creating another file. Can we just lazy import fused_moe like UnquantizedFusedMoEMethod (https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/fused_moe/layer.py#L91) does?
Yeah, good point. I thought it was more complicated because we need to inherit from FusedMoEMethodBase, but actually that base class doesn't involve any Triton import. Have pushed the change.
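A minimal sketch of the lazy-import pattern under discussion, modeled on UnquantizedFusedMoEMethod. The stand-in base class and the apply signature are simplified for illustration and are not vLLM's actual definitions:

```python
from abc import ABC, abstractmethod


class FusedMoEMethodBase(ABC):
    # Stand-in for vLLM's base class; the point is that it has no
    # Triton dependency, so subclassing it is safe without triton.
    @abstractmethod
    def apply(self, *args, **kwargs):
        ...


class Fp8MoEMethod(FusedMoEMethodBase):

    def apply(self, *args, **kwargs):
        # Deferred import: triton is only pulled in if this MoE path
        # actually runs, so importing the defining module never needs it.
        from vllm.model_executor.layers.fused_moe import fused_moe
        return fused_moe(*args, **kwargs)
```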
MAX_TRITON_N_COLS = 131072


def get_num_triton_sampler_splits(n_cols: int) -> int:
I think that would make sense.
Just to clarify: is this something you'd like to have addressed in this PR?
No, it's not necessary. We can merge this PR first.
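For context on the diff above: a plausible reading of get_num_triton_sampler_splits is a ceiling division of the column count by Triton's per-kernel column limit. The body below is my assumption, not necessarily vLLM's exact implementation:

```python
import math

MAX_TRITON_N_COLS = 131072  # per the diff above


def get_num_triton_sampler_splits(n_cols: int) -> int:
    """Number of chunks so each has at most MAX_TRITON_N_COLS columns."""
    return math.ceil(n_cols / MAX_TRITON_N_COLS)


# e.g. sampling over a 200,000-column logits tensor would need 2 splits
assert get_num_triton_sampler_splits(200_000) == 2
```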
LGTM. cc @robertgshaw2-neuralmagic
We currently need to add triton as a dependency to all of the non-CUDA backends, because triton is still imported in various places throughout the library regardless of the backend.

This PR adds a function, maybe_import_triton, which checks whether Triton is available in the environment. If it is not, it substitutes a mocked-up version of Triton that allows all of the vLLM code to be imported.

An alternative approach might be to make the import conditional on the device. I have a feeling that would introduce a fair amount of additional complexity (e.g., we would need to only import triton after the engine has been constructed), but I haven't worked through it in full.
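A minimal sketch of what such a helper could look like, assuming the mock only needs to satisfy module-level imports, @triton.jit decorators, and a version check; the actual helper in this PR may differ in structure and coverage:

```python
import types


def maybe_import_triton():
    """Return (triton, triton.language), substituting mocks if absent."""
    try:
        import triton
        import triton.language as tl
        return triton, tl
    except ImportError:
        pass

    # Decorator mock that tolerates @triton.jit used bare or with
    # arguments, so module-level kernel definitions import cleanly.
    def _jit(fn=None, **kwargs):
        if fn is None:
            return lambda f: f
        return fn

    triton_mock = types.SimpleNamespace(
        __version__="0.0.0",  # placeholder version reported by the mock
        jit=_jit,
    )
    tl_mock = types.SimpleNamespace(constexpr=None)
    return triton_mock, tl_mock
```

Usage would then replace direct `import triton` statements with `triton, tl = maybe_import_triton()`, so modules import cleanly on backends without Triton while behaving identically when it is installed.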