@xmfan xmfan commented Nov 8, 2025

Stacked PRs:

Intended usage:

> torchrun --nproc-per-node=8 examples/example_ds3_pp.py --rng-seed=42; torchrun --nproc-per-node=4 examples/example_ds3_local_map.py --rng-seed=42

> diff out/0/pp_weights.log  out/1/weights.log 
--- out/0/pp_weights.log        2025-11-07 20:31:34.447960867 -0800
+++ out/1/weights.log   2025-11-07 20:32:52.499859593 -0800
@@ -60,12 +60,9 @@
 name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.0.moe.expert_bias' hash=DTensor(0)
 name='layers.0.moe.tokens_per_expert' hash=DTensor(0)
-name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.1.moe.expert_bias' hash=DTensor(0)
 name='layers.1.moe.tokens_per_expert' hash=DTensor(0)
-name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.2.moe.expert_bias' hash=DTensor(0)
 name='layers.2.moe.tokens_per_expert' hash=DTensor(0)
-name='freqs_cis' hash=DTensor(real=54976837666734080, imag=9351734845035773952))
 name='layers.3.moe.expert_bias' hash=DTensor(0)
 name='layers.3.moe.tokens_per_expert' hash=DTensor(0)

The remaining difference comes from the model implementation: each pp stage has its own freqs_cis buffer, while the non-pp version has a single freqs_cis buffer on the root model class.
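Until the two implementations agree on where freqs_cis lives, one way to get an empty diff is to drop the repeated freqs_cis lines from the pp log before comparing. The sketch below is only an illustration and not part of this PR: `dedupe_buffer_lines` is a hypothetical helper, the line format is assumed from the diff above, and the hash values in the sample are placeholders.

```python
def dedupe_buffer_lines(lines, name="freqs_cis"):
    """Drop identical repeats of one buffer's log line (hypothetical helper).

    Assumes each log line starts with name='<buffer name>' followed by a
    space, as in the diff above. A repeat whose content differs from the
    first occurrence is kept so that a real mismatch still shows up.
    """
    prefix = f"name='{name}' "
    first = None
    kept = []
    for line in lines:
        if line.startswith(prefix):
            if first is None:
                first = line  # remember the first occurrence
            elif line == first:
                continue  # identical repeat from a later pp stage
        kept.append(line)
    return kept


# Placeholder lines shaped like the logs in this PR (values are made up).
pp_log = [
    "name='freqs_cis' hash=DTensor(real=1, imag=2)",
    "name='layers.0.moe.expert_bias' hash=DTensor(0)",
    "name='freqs_cis' hash=DTensor(real=1, imag=2)",
    "name='layers.1.moe.expert_bias' hash=DTensor(0)",
]
non_pp_log = [
    "name='freqs_cis' hash=DTensor(real=1, imag=2)",
    "name='layers.0.moe.expert_bias' hash=DTensor(0)",
    "name='layers.1.moe.expert_bias' hash=DTensor(0)",
]

deduped = dedupe_buffer_lines(pp_log)
print(deduped == non_pp_log)
```

Keeping repeats that differ from the first occurrence is deliberate: if a stage ever logged a different freqs_cis hash, that line would survive the filter and still appear in the diff.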

Removes the per_op logging, since its numerics aren't diff-friendly yet.


Log weight hashes for DSv3 w/ pp vs w/o pp

stack-info: PR: #240, branch: xmfan/stack/18