[torch.compile] Dynamic fp8 + rms_norm fusion #10906
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these: 🚀
@ProExpertProg It looks like there is a real failure on the TPU test: https://buildkite.com/vllm/fastcheck/builds/9325#01939413-9e42-4fb8-898e-dfd2a3ec7828/6-306
SageMoore left a comment:
Nice work. I mostly just have nits.
(Two resolved review threads on csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu, now outdated.)
Commits (each signed off by luka <luka@neuralmagic.com>):
- …ops to constants
- extracted MultiOutputMatch to its own file; extracted utils to fx_utils; added named tuples for op keys (see the sketch below)
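A hypothetical sketch of the "named tuples for op keys" refactor mentioned in the commit above; the names QuantKey and FUSED_OPS are illustrative, not vLLM's actual identifiers:

```python
from typing import Callable, Dict, NamedTuple

import torch

class QuantKey(NamedTuple):
    dtype: torch.dtype  # quantized dtype, e.g. torch.float8_e4m3fn
    static: bool        # static (per-tensor) vs. dynamic (per-token) scale

# Patterns and fused ops can then be registered and looked up by an
# immutable, self-describing key instead of ad-hoc positional tuples.
FUSED_OPS: Dict[QuantKey, Callable] = {}
```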
tlrmchlsmth left a comment:
I found some potential integer overflows. I don't think those will automatically promote to 64-bit even though the output is int64_t.
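In C++, the product of two 32-bit integers is computed in 32-bit arithmetic regardless of the type it is assigned to, so an int64_t destination does not prevent the wrap. A minimal demonstration of the same semantics using torch int32 tensors:

```python
import torch

a = torch.tensor(1 << 20, dtype=torch.int32)
b = torch.tensor(1 << 20, dtype=torch.int32)

# The multiply happens in 32 bits and wraps *before* widening to int64.
print((a * b).to(torch.int64))   # tensor(0): wrapped in int32 first
print(a.to(torch.int64) * b)     # tensor(1099511627776): widened first
```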
(Resolved review thread on csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu, now outdated.)
Force-pushed from 17ff1b9 to 720d537.
Review comments on tests/compile/test_fusion.py (outdated):
Future PR: use compilation counters for patterns replaced.
(Use a context manager to check that the counter increased by a certain number.)
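A minimal sketch of the suggestion above, assuming a global counter of replaced patterns; the counter name and the expect_fusions helper are hypothetical, not vLLM's actual API:

```python
from contextlib import contextmanager

pattern_counters = {"rms_norm_quant": 0}  # hypothetical global counter

@contextmanager
def expect_fusions(n: int, key: str = "rms_norm_quant"):
    # Snapshot the counter, run the body, then assert the exact delta.
    before = pattern_counters[key]
    yield
    got = pattern_counters[key] - before
    assert got == n, f"expected {n} fused patterns, got {got}"

# Usage in a test:
#   with expect_fusions(2):
#       run_compiled_model(x)
```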
youkaichao left a comment:
LGTM, excited to see the perf numbers!
Please also work with @tlrmchlsmth to address his comments.
Force-pushed from 21abcff to a70d496.
Commit (Signed-off-by: luka <luka@neuralmagic.com>):
- add kFp8Type constant for cuda/hip agnostic torch type checking (see the sketch below)
- check contiguous
- overflow
- reduce number of tests
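A sketch of the idea behind the kFp8Type constant: CUDA and ROCm use different fp8 dtypes (e4m3fn vs. e4m3fnuz), so type checks are written against a single platform-dependent constant instead of being duplicated. Shown here as Python for illustration; the actual constant lives in the C++ kernel code, and these names are assumptions:

```python
import torch

IS_HIP = torch.version.hip is not None
kFp8Type = torch.float8_e4m3fnuz if IS_HIP else torch.float8_e4m3fn

def check_fused_kernel_inputs(t: torch.Tensor) -> None:
    # Mirror of the kernel-side checks the commit describes.
    assert t.dtype == kFp8Type, f"expected {kFp8Type}, got {t.dtype}"
    assert t.is_contiguous(), "fused kernel requires contiguous tensors"
```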
tlrmchlsmth left a comment:
Looks good to me, thank you for the great work @ProExpertProg!
Seeing a 1–2% improvement in TPOT (time per output token) and 2–5% in TTFT (time to first token). [Fused and unfused benchmark results were attached as screenshots.]
Signed-off-by: luka <luka@neuralmagic.com> Co-authored-by: Varun Sundar Rabindranath <varun@neuralmagic.com>
This PR adds support for RMSNorm + (fp8) quant fusion. It also refactors the fusion pass to make it easier to add new patterns, including support for multiple values of epsilon.
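For readers unfamiliar with the pattern being fused, a conceptual sketch of the unfused op sequence follows. This is illustrative eager-mode code, not vLLM's actual kernels or fusion pass: the pass pattern-matches this pair of ops in the compiled graph and replaces them with a single fused kernel.

```python
import torch

def rms_norm_then_quant(x: torch.Tensor, weight: torch.Tensor,
                        eps: float = 1e-6):
    # RMSNorm (one kernel in the unfused graph).
    y = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps) * weight

    # Dynamic per-token fp8 quantization (a second kernel):
    # each token (row) gets its own scale.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = y.abs().amax(dim=-1, keepdim=True).to(torch.float32) / fp8_max
    scale = scale.clamp(min=torch.finfo(torch.float32).tiny)  # avoid div by 0
    q = (y / scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    return q, scale
```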