
bugs in DeepSeek v3 FSDP + TP #1531

@ruisizhang123


Bug description

As titled, I'm running the following command with FSDP2 + TP:

CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseek_v3 --parallelism.tensor_parallel_degree 2

and I hit the errors below.

I took a look at the recent commits in torchtitan and found that this PR might have unintentionally caused the problem. After reverting my codebase to the commit before it, FSDP + TP ran through.
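
My read of the shapes in the error (a guess on my end, not confirmed): with tensor_parallel_degree 2, the routed expert output comes back sharded on the hidden dim (128 = 256 / 2), while the out_unpermuted buffer in expert_parallel.py is still allocated at the full dim. A minimal standalone sketch that reproduces the same failure (my own code, not torchtitan's; only out, out_unpermuted, and permuted_indices are names from the traceback, the rest are made up):

import torch

# Shapes taken from the error message below; tp_degree matches the run command.
tokens, dim, tp_degree = 49216, 256, 2

out = torch.randn(tokens, dim // tp_degree)  # expert output sharded on dim -> [49216, 128]
out_unpermuted = out.new_empty(tokens, dim)  # unpermute buffer at full dim -> [49216, 256]
permuted_indices = torch.randperm(tokens)

# Raises the same RuntimeError: shape mismatch: value tensor of shape
# [49216, 128] cannot be broadcast to indexing result of shape [49216, 256]
out_unpermuted[permuted_indices, :] = out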

cc @tianyu-l @danielvegamyhre

[rank0]:[titan] 2025-08-05 15:50:36,219 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-08-05 15:50:36,219 - root - INFO - Trainer is initialized with local batch size 8, global batch size 32, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[titan] 2025-08-05 15:50:36,219 - root - INFO - Training starts at step 1
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]: File "", line 198, in _run_module_as_main
[rank0]:[rank0]: File "", line 88, in _run_code
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 636, in
[rank0]:[rank0]: trainer.train()
[rank0]:[rank0]: ~~~~~~~~~~~~~^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
[rank0]:[rank0]: return f(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 565, in train
[rank0]:[rank0]: self.train_step(data_iterator)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 471, in train_step
[rank0]:[rank0]: loss = self.forward_backward_step(input_dict, labels)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 447, in forward_backward_step
[rank0]:[rank0]: pred = model_parts[0](inputs, eos_id=self.tokenizer.eos_id)
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/model.py", line 389, in forward
[rank0]:[rank0]: h = layer(h, self.freqs_cis)
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/model.py", line 295, in forward
[rank0]:[rank0]: x = x + self.moe(self.ffn_norm(x))
[rank0]:[rank0]: ~~~~~~~~^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/moe.py", line 339, in forward
[rank0]:[rank0]: routed_output = self.experts(routed_input, num_tokens_per_expert)
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:[rank0]: return forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/moe.py", line 72, in forward
[rank0]:[rank0]: return GroupedExperts._run_experts_grouped_mm(
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
[rank0]:[rank0]: self.w1, self.w2, self.w3, x, num_tokens_per_expert
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: )
[rank0]:[rank0]: ^
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/experiments/llama4/infra/expert_parallel.py", line 326, in wrapper
[rank0]:[rank0]: out_unpermuted[permuted_indices, :] = out
[rank0]:[rank0]: ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: RuntimeError: shape mismatch: value tensor of shape [49216, 128] cannot be broadcast to indexing result of shape [49216, 256]
[rank0]:[rank0]:[W805 15:50:41.876690234 ProcessGroupNCCL.cpp:1578] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0805 15:50:43.301000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078193 closing signal SIGTERM
W0805 15:50:43.301000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078194 closing signal SIGTERM
W0805 15:50:43.301000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078195 closing signal SIGTERM
W0805 15:50:43.302000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078196 closing signal SIGTERM
W0805 15:50:43.302000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078197 closing signal SIGTERM
W0805 15:50:43.303000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078198 closing signal SIGTERM
W0805 15:50:43.304000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078199 closing signal SIGTERM
E0805 15:50:44.478000 4077987 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 7 (pid: 4078200) of binary: /home/ruisizhang123/.conda/envs/simplefsdp/bin/python
E0805 15:50:44.481000 4077987 torch/distributed/elastic/multiprocessing/errors/error_handler.py:141] no error file defined for parent, to copy child error file (/tmp/torchelastic_77i0btff/none_kwo9vwqp/attempt_0/7/error.json)
Traceback (most recent call last):
File "/home/ruisizhang123/.conda/envs/simplefsdp/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/ruisizhang123/pytorch/torch/distributed/run.py", line 901, in main
run(args)
~~~^^^^^^
File "/home/ruisizhang123/pytorch/torch/distributed/run.py", line 892, in run
elastic_launch(
~~~~~~~~~~~~~~~
config=config,
~~~~~~~~~~~~~~
entrypoint=cmd,
~~~~~~~~~~~~~~~
)(*cmd_args)
~^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ruisizhang123/pytorch/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
...<2 lines>...
)

Root Cause (first observed failure):
[0]:
time : 2025-08-05_15:50:41
host : devvm006.dkl0.facebook.com
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 4078200)
error_file: /tmp/torchelastic_77i0btff/none_kwo9vwqp/attempt_0/7/error.json
traceback : Traceback (most recent call last):
... (identical to the rank 0 traceback above) ...
RuntimeError: shape mismatch: value tensor of shape [49216, 128] cannot be broadcast to indexing result of shape [49216, 256]

Versions

see above
