Bug description
As titled, I'm running the command CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/debug_model.toml" ./run_train.sh --model.name deepseek_v3 --parallelism.tensor_parallel_degree 2 with FSDP2 + TP, and I hit the errors below.
I took a look at the recent commits in torchtitan and found that this PR might have unintentionally caused the problem. After reverting my codebase to the commit before it, the FSDP + TP run went through.
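For reference, here is a minimal standalone sketch of the indexing pattern that fails (plain PyTorch, not torchtitan code; all names and sizes are illustrative assumptions). My read of the traceback is that with tensor_parallel_degree 2 the experts' output carries only half of the hidden dim, while the scatter buffer in expert_parallel.py is still allocated with the full dim, which triggers exactly this shape-mismatch RuntimeError:

import torch

# Illustrative sizes only: 8 tokens, full hidden dim 256, TP degree 2.
num_tokens, full_dim, tp_degree = 8, 256, 2
sharded_dim = full_dim // tp_degree  # 128 per TP rank

out = torch.randn(num_tokens, sharded_dim)          # experts' output on one TP rank (last dim 128)
out_unpermuted = torch.empty(num_tokens, full_dim)  # scatter buffer sized for the unsharded dim (256)
permuted_indices = torch.randperm(num_tokens)

# Raises: RuntimeError: shape mismatch: value tensor of shape [8, 128]
# cannot be broadcast to indexing result of shape [8, 256]
out_unpermuted[permuted_indices, :] = out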
[rank0]:[titan] 2025-08-05 15:50:36,219 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-08-05 15:50:36,219 - root - INFO - Trainer is initialized with local batch size 8, global batch size 32, gradient accumulation steps 1, sequence length 2048, total steps 10 (warmup 2)
[rank0]:[titan] 2025-08-05 15:50:36,219 - root - INFO - Training starts at step 1
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]: File "<frozen runpy>", line 198, in _run_module_as_main
[rank0]:[rank0]: File "<frozen runpy>", line 88, in _run_code
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 636, in <module>
[rank0]:[rank0]: trainer.train()
[rank0]:[rank0]: ~~~~~~~~~~~~~^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
[rank0]:[rank0]: return f(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 565, in train
[rank0]:[rank0]: self.train_step(data_iterator)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 471, in train_step
[rank0]:[rank0]: loss = self.forward_backward_step(input_dict, labels)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 447, in forward_backward_step
[rank0]:[rank0]: pred = model_parts[0](inputs, eos_id=self.tokenizer.eos_id)
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/model.py", line 389, in forward
[rank0]:[rank0]: h = layer(h, self.freqs_cis)
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/model.py", line 295, in forward
[rank0]:[rank0]: x = x + self.moe(self.ffn_norm(x))
[rank0]:[rank0]: ~~~~~~~~^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
[rank0]:[rank0]: return inner()
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
[rank0]:[rank0]: result = forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/moe.py", line 339, in forward
[rank0]:[rank0]: routed_output = self.experts(routed_input, num_tokens_per_expert)
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
[rank0]:[rank0]: return self._call_impl(*args, **kwargs)
[rank0]:[rank0]: ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
[rank0]:[rank0]: return forward_call(*args, **kwargs)
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/moe.py", line 72, in forward
[rank0]:[rank0]: return GroupedExperts._run_experts_grouped_mm(
[rank0]:[rank0]: ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
[rank0]:[rank0]: self.w1, self.w2, self.w3, x, num_tokens_per_expert
[rank0]:[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: )
[rank0]:[rank0]: ^
[rank0]:[rank0]: File "/home/ruisizhang123/torchtitan/torchtitan/experiments/llama4/infra/expert_parallel.py", line 326, in wrapper
[rank0]:[rank0]: out_unpermuted[permuted_indices, :] = out
[rank0]:[rank0]: ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
[rank0]:[rank0]: RuntimeError: shape mismatch: value tensor of shape [49216, 128] cannot be broadcast to indexing result of shape [49216, 256]
[rank0]:[rank0]:[W805 15:50:41.876690234 ProcessGroupNCCL.cpp:1578] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0805 15:50:43.301000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078193 closing signal SIGTERM
W0805 15:50:43.301000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078194 closing signal SIGTERM
W0805 15:50:43.301000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078195 closing signal SIGTERM
W0805 15:50:43.302000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078196 closing signal SIGTERM
W0805 15:50:43.302000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078197 closing signal SIGTERM
W0805 15:50:43.303000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078198 closing signal SIGTERM
W0805 15:50:43.304000 4077987 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 4078199 closing signal SIGTERM
E0805 15:50:44.478000 4077987 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 7 (pid: 4078200) of binary: /home/ruisizhang123/.conda/envs/simplefsdp/bin/python
E0805 15:50:44.481000 4077987 torch/distributed/elastic/multiprocessing/errors/error_handler.py:141] no error file defined for parent, to copy child error file (/tmp/torchelastic_77i0btff/none_kwo9vwqp/attempt_0/7/error.json)
Traceback (most recent call last):
File "/home/ruisizhang123/.conda/envs/simplefsdp/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch', 'console_scripts', 'torchrun')())
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/ruisizhang123/pytorch/torch/distributed/run.py", line 901, in main
run(args)
~~~^^^^^^
File "/home/ruisizhang123/pytorch/torch/distributed/run.py", line 892, in run
elastic_launch(
~~~~~~~~~~~~~~~
config=config,
~~~~~~~~~~~~~~
entrypoint=cmd,
~~~~~~~~~~~~~~~
)(*cmd_args)
~^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/distributed/launcher/api.py", line 143, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/ruisizhang123/pytorch/torch/distributed/launcher/api.py", line 277, in launch_agent
raise ChildFailedError(
...<2 lines>...
)
Root Cause (first observed failure):
[0]:
time : 2025-08-05_15:50:41
host : devvm006.dkl0.facebook.com
rank : 7 (local_rank: 7)
exitcode : 1 (pid: 4078200)
error_file: /tmp/torchelastic_77i0btff/none_kwo9vwqp/attempt_0/7/error.json
traceback : Traceback (most recent call last):
File "/home/ruisizhang123/pytorch/torch/distributed/elastic/multiprocessing/errors/init.py", line 357, in wrapper
return f(*args, **kwargs)
File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 565, in train
self.train_step(data_iterator)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^
File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 471, in train_step
loss = self.forward_backward_step(input_dict, labels)
File "/home/ruisizhang123/torchtitan/torchtitan/train.py", line 447, in forward_backward_step
pred = model_parts[0](inputs, eos_id=self.tokenizer.eos_id)
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
result = forward_call(*args, **kwargs)
File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/model.py", line 389, in forward
h = layer(h, self.freqs_cis)
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
result = forward_call(*args, **kwargs)
File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/model.py", line 295, in forward
x = x + self.moe(self.ffn_norm(x))
~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1827, in inner
result = forward_call(*args, **kwargs)
File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/moe.py", line 339, in forward
routed_output = self.experts(routed_input, num_tokens_per_expert)
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/home/ruisizhang123/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/home/ruisizhang123/torchtitan/torchtitan/models/deepseek_v3/model/moe.py", line 72, in forward
return GroupedExperts._run_experts_grouped_mm(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self.w1, self.w2, self.w3, x, num_tokens_per_expert
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/home/ruisizhang123/torchtitan/torchtitan/experiments/llama4/infra/expert_parallel.py", line 326, in wrapper
out_unpermuted[permuted_indices, :] = out
~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape mismatch: value tensor of shape [49216, 128] cannot be broadcast to indexing result of shape [49216, 256]
Versions
see above