
Conversation

@andrewor14 (Contributor)

**Summary:** #2253 added a step in `quantize_affine_float8` to expand the scales for blockwise quantization. The purpose of this step is to make the scales always broadcastable with the input tensor. However, this is unnecessary for rowwise quantization, whose shapes are already broadcastable, e.g.

```
scale = [32, 1]
input = [32, 16]
```

Today, we `repeat_interleave` the above scales to pad the scale tensor until it reaches `[32, 16]`, which adds non-trivial memory and latency overhead. This commit adds a fast path that skips the expansion step when rowwise quantization is detected.
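
To make the shapes concrete, here is a minimal sketch of the fast path (the helper name and the single-dim blockwise expansion below are illustrative assumptions, not the actual torchao implementation):

```
import torch

def maybe_expand_scale_to_tensor_shape(scale, target_shape):
    # Rowwise: a [32, 1] scale already broadcasts against a [32, 16]
    # input, so return it unchanged and skip the expansion entirely.
    if scale.shape[:-1] == target_shape[:-1] and scale.shape[-1] == 1:
        return scale
    # Blockwise: repeat each scale element across its block along the
    # last dim so the result matches the input shape.
    repeats = target_shape[-1] // scale.shape[-1]
    return scale.repeat_interleave(repeats, dim=-1)

rowwise = maybe_expand_scale_to_tensor_shape(torch.ones(32, 1), (32, 16))
assert rowwise.shape == (32, 1)     # fast path: no extra memory allocated
blockwise = maybe_expand_scale_to_tensor_shape(torch.ones(32, 4), (32, 16))
assert blockwise.shape == (32, 16)  # blockwise scales are still expanded
```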

**Test Plan:**

```
python test/quantization/test_quant_primitives.py -k test_maybe_expand_scale_to_tensor_shape
```

Also compared fine-tuning Qwen3-1.7B with fp8-fp8 QAT using batch size 32 on a single H100 GPU:

- Before: 25.34 GB peak memory, 3047.25 tok/s
- After: 22.53 GB peak memory, 3358.49 tok/s
- This PR uses 11.1% less memory and is 10.2% faster

@meta-cla bot added the CLA Signed label on Sep 6, 2025.
@andrewor14 added the topic: improvement label and removed the CLA Signed label on Sep 6, 2025.
@andrewor14 force-pushed the reduce-fp8-qat-memory branch from 780003d to 935ac1a on September 6, 2025 00:10.
@pytorch-bot bot commented Sep 6, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2950

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 4cf5c90 with merge base 4872c4f:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label on Sep 6, 2025.
@drisspg (Contributor) commented on Sep 6, 2025, on the new rowwise check in the diff:

```
        return scale

    # For rowwise quantization, just return the scale as is
    if scale.shape[:-1] == target_shape[:-1] and scale.shape[-1] == 1:
```

you could probably do something fun, like

```
def is_trivial_expandable(scale, target_shape):
    return all(a == b or a == 1 for a, b in zip(scale.shape, target_shape))
```
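
As a quick sanity check of this suggested helper (the example below is an illustrative sketch, not code from the PR):

```
import torch

def is_trivial_expandable(scale, target_shape):
    # True when every scale dim either matches the target dim or is 1,
    # i.e. the scale already broadcasts without any expansion.
    return all(a == b or a == 1 for a, b in zip(scale.shape, target_shape))

assert is_trivial_expandable(torch.ones(32, 1), (32, 16))      # rowwise
assert not is_trivial_expandable(torch.ones(32, 4), (32, 16))  # blockwise
```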

@andrewor14 force-pushed the reduce-fp8-qat-memory branch from 935ac1a to 4cf5c90 on September 8, 2025 13:14.
@andrewor14 requested a review from drisspg on September 8, 2025 13:14.
@andrewor14 merged commit a54417d into main on Sep 8, 2025.
18 checks passed