
Conversation

@bdhirsh (Contributor) commented Jul 10, 2025

The compute estimation in autoparallel uses torch.empty to allocate its benchmark tensors, which always yields contiguous memory and is therefore wrong when the real input has a specific (non-contiguous) striding.

This came up when trying to run llama3 with float8 quantization, because:

(1) Float8Linear layers desugar into calls to aten._scaled_mm

(2) aten._scaled_mm requires its second input to be column-major, and its meta-function assert was failing (code: https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L6453)
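
For a sense of what trips there: a tensor freshly allocated with torch.empty is row-major, so re-allocating the already-column-major weight that way fails the layout check. A simplified sketch of the property being asserted (an approximation, not the exact meta-registration code):

    import torch

    def looks_col_major(t: torch.Tensor) -> bool:
        # Simplified stand-in for the column-major check that _scaled_mm's meta
        # function applies to its second argument: dim 0 must be fastest-moving.
        return t.stride(0) == 1 and t.stride(1) >= max(t.size(0), 1)

    b = torch.empty(64, 128)          # row-major, strides (128, 1)
    print(looks_col_major(b))         # False -- the case the assert rejects

    b_col = b.t().contiguous().t()    # same shape, column-major, strides (1, 64)
    print(looks_col_major(b_col))     # True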

Repro: check out this branch (pytorch/torchtitan#1378) and run:

CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" ./run_train.sh --model.name llama3_auto_parallel --parallelism.tensor_parallel_degree 4 --model.converters="float8"
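
The direction of the fix, roughly: allocate the estimation tensors with torch.empty_strided so they keep the original layout instead of silently becoming contiguous. A minimal sketch of the idea (alloc_like is a hypothetical helper, not the actual autoparallel code):

    import torch

    def alloc_like(t: torch.Tensor) -> torch.Tensor:
        # torch.empty(t.shape) always returns a contiguous (row-major) tensor,
        # dropping t's layout; torch.empty_strided preserves it, so a
        # column-major operand stays column-major in the benchmarked op.
        return torch.empty_strided(t.size(), t.stride(), dtype=t.dtype, device=t.device)

    w = torch.empty(64, 128).t().contiguous().t()   # column-major, strides (1, 64)
    print(torch.empty(w.shape).stride())            # (128, 1) -- layout lost
    print(alloc_like(w).stride())                   # (1, 64)  -- layout preserved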

facebook-github-bot added the CLA Signed label on Jul 10, 2025
@wconstab (Contributor) commented:
lgtm. and pushed a lint fix

@wconstab merged commit 1af2a14 into main on Jul 17, 2025
6 checks passed
@wconstab deleted the scaled_mm_fix branch on July 17, 2025 at 22:58
wconstab added a commit that referenced this pull request Jul 18, 2025
Turns out the previous PR (#37) was not correct: it divided the wrong dim's stride.

This PR divides the stride of the dim to the left of the one being sharded, which is what really happens.

Note: the fact that we need this util at all worries me. Why don't we just
use dtensors to propagate?
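
To see why it is the left neighbor's stride that changes: for a contiguous tensor, each dim's stride is the product of the sizes to its right, so shrinking the sharded dim only affects the strides of the dims to its left. A small illustration (not the util itself):

    import torch

    # Shard dim 1 of a contiguous (4, 8, 6) tensor across a hypothetical
    # world_size of 2 and compare global vs. local strides.
    full = torch.empty(4, 8, 6)
    local = torch.empty(4, 8 // 2, 6)

    print(full.stride())    # (48, 6, 1)
    print(local.stride())   # (24, 6, 1)  -- dim 0's stride is halved,
                            #               dim 1's own stride is unchanged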
fmassa pushed a commit that referenced this pull request Jul 18, 2025 (same commit message as above)