Fix TinyGemmQBitsTensor move #246

Merged
merged 5 commits into from
Jul 18, 2024

Conversation

dacorvo
Collaborator

@dacorvo dacorvo commented Jul 18, 2024

What does this PR do?

This fixes a few issues on CUDA with arch <= sm80: a TinyGemmQBitsTensor could be moved to the CUDA device without calling QBitsTensor::create.

A small change also removes a dual dispatch for the AWQ gemm. This slightly improves decode latency for LLMs, but has little impact on end-to-end latency, which is still dominated by prefill latency.
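To illustrate the first fix, here is a minimal pure-Python sketch of routing every device move through a factory so that the capability check cannot be bypassed. It is not the code from this PR: `PackedWeights`, `OptimizedPackedWeights`, the `capability` argument and the `(8, 0)` threshold are all hypothetical stand-ins chosen for the example.

```python
from dataclasses import dataclass

# Hypothetical minimum compute capability for the optimized kernel (assumption).
MIN_CAPABILITY = (8, 0)


@dataclass
class PackedWeights:
    """Generic packed representation (stand-in for a base quantized tensor)."""

    device: str

    @staticmethod
    def create(device: str, capability=None) -> "PackedWeights":
        # The factory is the single place where the capability check lives.
        if device == "cuda" and capability is not None and capability >= MIN_CAPABILITY:
            return OptimizedPackedWeights(device)
        return PackedWeights(device)

    def to(self, device: str, capability=None) -> "PackedWeights":
        # Route every move through the factory so the check is never skipped,
        # whichever representation the tensor currently has.
        return PackedWeights.create(device, capability)


@dataclass
class OptimizedPackedWeights(PackedWeights):
    """Representation that is only valid on recent-enough GPUs."""


weights = PackedWeights.create("cpu")
on_old_gpu = weights.to("cuda", capability=(7, 5))
on_new_gpu = weights.to("cuda", capability=(8, 0))
print(type(on_old_gpu).__name__, type(on_new_gpu).__name__)
```

The point of the design is that `to` never builds the optimized class directly, so a tensor moved to an unsupported device falls back to the generic representation.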

dacorvo added 5 commits July 18, 2024 08:31
This allows detaching TinyGemmQBitsTensor, whose inner tensors differ
(scale and shift are combined).

Since QBitsTensor ops are now all compatible with TinyGemmQBitsTensor,
we can remove the specific dispatch.

The QBitsTensor.create factory method checks the CUDA version, but
unit tests that bypass that method must perform the check themselves.
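As a rough sketch of what such a guard in a unit test could look like (not the code from this PR), the helper below reports whether the current CUDA device meets a minimum compute capability and is used to skip a test otherwise. The helper name `cuda_capability_at_least`, the test class and the threshold of 8 are assumptions made for the example. It uses `unittest` so it stays self-contained; the same condition could be passed to `pytest.mark.skipif`.

```python
import unittest


def cuda_capability_at_least(min_major: int) -> bool:
    """Return True only if a CUDA device with the given major capability is usable."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= min_major


class DirectConstructionTest(unittest.TestCase):
    # The threshold of 8 is an assumption for illustration.
    @unittest.skipUnless(
        cuda_capability_at_least(8),
        "requires a CUDA device with compute capability >= 8.0",
    )
    def test_direct_construction(self):
        # A test that builds the optimized tensor directly, without going
        # through the factory, would go here; the guard above replaces the
        # check that the factory would otherwise have performed.
        self.assertTrue(cuda_capability_at_least(8))
```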
@dacorvo dacorvo requested a review from fxmarty July 18, 2024 09:42
@dacorvo dacorvo merged commit 9241b96 into main Jul 18, 2024
12 checks passed
@dacorvo dacorvo deleted the fix_tinygemm_move branch July 18, 2024 11:17