Add module-swap UX for INT8 mixed-precision training #1179

gau-nernst · 2024-10-26T01:29:45Z

Background

The current INT8 mixed-precision training recipe is conceptually an op modifier: torch.matmul is replaced with dynamic_int8_mm. It is implemented as a tensor subclass, though it doesn't use any tensor-subclass-specific features such as quantized storage or quantized FSDP all-gather. An alternative module-swap UX would have the following benefits:

The state dict remains plain tensors. This is beneficial for model checkpointing, as well as for some complex use cases such as shard-then-load of pre-trained weights for FSDP fine-tuning (cc Integrate INT8 mixed-precision from torchao 0.7 torchtune#1552)
It can be easier to hack on and compose with other techniques, e.g. use NF4 weights for storage and INT8 matmul for compute -> QLoRA integration
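
To make the module-swap idea concrete, here is a minimal sketch, not the implementation in this PR. It assumes symmetric row-wise dynamic quantization, emulates the INT8 matmul in float32 so it runs on any device, and keeps the backward pass in full precision for brevity. All names (quantize_int8_rowwise, emulated_dynamic_int8_mm, SketchInt8Linear, swap_linear_modules) are hypothetical and are not torchao APIs.

```python
import torch
import torch.nn as nn


def quantize_int8_rowwise(x: torch.Tensor):
    """Symmetric per-row quantization: returns (int8 values, per-row scales)."""
    scale = x.abs().amax(dim=1, keepdim=True).clamp(min=1e-12) / 127.0
    x_i8 = (x / scale).round().clamp(-128, 127).to(torch.int8)
    return x_i8, scale


def emulated_dynamic_int8_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Emulate a @ b with rows of `a` and columns of `b` quantized to INT8."""
    a_i8, a_scale = quantize_int8_rowwise(a)      # (M, K), (M, 1)
    b_i8, b_scale = quantize_int8_rowwise(b.t())  # (N, K), (N, 1)
    # Emulation only: accumulate in float32 so this runs anywhere.
    # A real kernel would multiply INT8 inputs and accumulate in INT32.
    acc = a_i8.float() @ b_i8.float().t()         # (M, N)
    return acc * a_scale * b_scale.t()


class _Int8MMFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return emulated_dynamic_int8_mm(x, weight.t())

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        # Simplifying assumption: the backward matmuls stay in full precision.
        return grad_out @ weight, grad_out.t() @ x


class SketchInt8Linear(nn.Linear):
    """Same parameters as nn.Linear; only the forward matmul changes."""

    def forward(self, x):
        # Assumes x has at least 2 dimensions; leading dims are flattened.
        out = _Int8MMFunction.apply(x.flatten(0, -2), self.weight)
        out = out.reshape(*x.shape[:-1], self.out_features)
        if self.bias is not None:
            out = out + self.bias
        return out


def swap_linear_modules(model: nn.Module) -> nn.Module:
    """Swap the class of every plain nn.Linear in place; weights are untouched."""
    for module in model.modules():
        if type(module) is nn.Linear:
            module.__class__ = SketchInt8Linear
    return model
```

Because only the module class changes, model.state_dict() in this sketch still returns ordinary torch.Tensor values under the same keys as before the swap, which is the checkpointing benefit described above.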

Usage

from torchao import quantize_
from torchao.prototype.quantized_training import int8_mixed_precision_training

model = ...
# nn.Linear -> Int8MixedPrecisionTrainingLinear
quantize_(model, int8_mixed_precision_training(module_swap=True))

Benchmarks

Pre-train Llama2-1B on a 4070Ti SUPER with torch==2.6.0.dev20241029. No regression: module swap has the same perf as tensor subclass.

python benchmarks/quantized_training/pretrain_llama2.py --seed 2024 --bf16_model --compile --quantize int8_mixed_precision_module_swap --model 1B --activation_checkpointing

Pre-train Llama3-8B with torchtitan on 2x A100 with torch==2.6.0.dev20241104+cu124. No regression: module swap has the same perf as tensor subclass.

Fine-tune Llama3-1B QLoRA with torchtune (using pytorch/torchtune@main...gau-nernst:qlora)

pytorch-bot · 2024-10-26T01:29:48Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1179

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit ca8c85a with merge base f99b667:

NEW FAILURE - The following job has failed:

Run Regression Tests / test (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/whl/nightl... / linux-job (gh)
test/prototype/test_parametrization.py::TestFakeSparsity::test_jit_trace

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vkuzo · 2024-10-28T18:38:51Z

It is implemented with tensor subclass, though it doesn't use any tensor subclass-specific features, such as quantized storage and quantized FSDP all-gather. Having an alternative module-swap UX would have the following benefits:

You can still use tensor subclass inside of your module swap to interact with FSDP and TP/SP and get low precision all_gather. See Float8Linear for reference. I think this is a good way to go for training in general.

gau-nernst · 2024-10-29T01:07:38Z

You can still use tensor subclass inside of your module swap

The way I see it, whatever can be done with module swap can also be done with tensor subclass. (Maybe it's better to hold persistent state in modules, e.g. for delayed scaling, but at least for my use cases I don't need persistent state.) So using both module swap and tensor subclass feels redundant to me.

vkuzo

lgtm

* add module swap UX
* update
* fix typing. add small notes
* try NF4 support
* fix
* fix unpacking
* fix
* update nf4 integration
* update backward pass

gau-nernst added 3 commits October 24, 2024 20:49

add module swap UX

f13809d

update

9d15a4c

Merge branch 'main' into int8mp_module

dd7187a

facebook-github-bot added the CLA Signed label Oct 26, 2024

gau-nernst added 2 commits October 26, 2024 11:33

fix typing. add small notes

897a30a

try NF4 support

6ad289d

gau-nernst added 6 commits October 30, 2024 09:48

fix

507bca8

Merge branch 'main' into int8mp_module

7b68152

fix unpacking

a2f41e2

fix

0b9d421

update nf4 integration

b33fd09

update backward pass

ca8c85a

gau-nernst mentioned this pull request Nov 3, 2024

Integrate INT8 mixed-precision from torchao 0.7 pytorch/torchtune#1552

Open

13 tasks

gau-nernst marked this pull request as ready for review November 4, 2024 13:49

gau-nernst requested review from msaroufim and andrewor14 and removed request for msaroufim November 4, 2024 13:50

msaroufim requested a review from vkuzo November 7, 2024 04:17

vkuzo approved these changes Nov 7, 2024

gau-nernst merged commit e41ca4e into pytorch:main Nov 7, 2024
16 of 17 checks passed

gau-nernst deleted the int8mp_module branch November 7, 2024 05:36
