Conversation

xmfan (Member) commented on Nov 7, 2025:

Stacked PRs:


Add option for run-to-run deterministic and add optional numerics logging

Example log:
tlp torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 1234 | pastry; tlp torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 1234 | pastry

https://www.internalfb.com/intern/diffing/?paste_number=2027323783
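
(For reference, a minimal sketch of what a `--rng-seed`-driven determinism toggle typically looks like in PyTorch; the function name and exact settings here are assumptions, not necessarily what this PR implements.)

```python
import os
import torch

def maybe_enable_determinism(rng_seed):
    """Hypothetical helper: make two runs with the same seed comparable op-for-op."""
    if rng_seed is None:
        return
    torch.manual_seed(rng_seed)                 # seeds CPU and all CUDA RNGs
    torch.use_deterministic_algorithms(True)    # error out on nondeterministic kernels
    torch.backends.cudnn.benchmark = False      # avoid autotuner-dependent kernel picks
    # Some cuBLAS paths require this workspace config once deterministic mode is on.
    os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")
```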

xmfan added a commit that referenced this pull request Nov 7, 2025
meta-cla bot added the CLA Signed label Nov 7, 2025
xmfan added a commit that referenced this pull request Nov 7, 2025
xmfan added a commit that referenced this pull request Nov 7, 2025
xmfan changed the title from "Add option for run-to-run deterministic and add optional numerics logging" to "Add option for run-to-run deterministic and add numerics logging" Nov 7, 2025
xmfan changed the title from "Add option for run-to-run deterministic and add numerics logging" to "Add option for run-to-run deterministic and add per op numerics logging" Nov 7, 2025
    numerics_logs += debug_interpreter.get_logs()
else:
    fw_outputs = torch.fx.Interpreter(fw_module).boxed_run(fw_args)

Contributor:

Maybe we should add this to all the graph module calls

Member Author (xmfan):

you mean the backward? I didn't add it since I couldn't test it on the base commit

Contributor:

Yeah, once I land #237 we can add it for full_bw, bw_dI, bw_dW, unshard and reduce_grad.
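
(For context, a minimal sketch of what a per-op numerics-logging interpreter over torch.fx could look like; the actual DebugInterpreter in this PR may differ.)

```python
import torch
import torch.fx

class DebugInterpreter(torch.fx.Interpreter):
    """Hypothetical sketch: run a GraphModule node by node and record a numeric
    fingerprint of every tensor output, for diffing across runs."""

    def __init__(self, gm):
        super().__init__(gm)
        self._logs = []

    def run_node(self, n):
        out = super().run_node(n)
        if isinstance(out, torch.Tensor):
            self._logs.append(
                f"{n.name}: shape={tuple(out.shape)} sum={out.float().sum().item():.6e}"
            )
        return out

    def get_logs(self):
        return self._logs

# Usage sketch: interp = DebugInterpreter(fw_module); interp.boxed_run(fw_args);
# numerics_logs += interp.get_logs()
```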

args = parser.parse_args()

run_test(fake_evaluate=args.fake_evaluate)
if args.rng_seed is not None:

Contributor:

Let's say we have 8 ranks in total, they will all initialize their modules. Since each rank initializes a different part of the model, it is hard to compare it with a single rank implementation for numerics debugging. We should have a solution similar to what @wconstab used in torchtitan. Creating a seed checkpoint and using that for PP runs.

Member Author (xmfan):

By seed checkpoint, do you mean saving and loading random weights generated from a rng seed? I was thinking of just resetting the seed for weights init

Contributor:

So if PP has 8 stages, you would do init_weights for each one of them using the same seed? My concern is how you would compare the pp_runtime with an SPMD-only run for numerics.

Member Author (xmfan):

If we cut our stages at nn.Module boundaries and init weights in the same order, we could reset the seeds at the same cuts during the SPMD init weights.

To support arbitrary stage splits, I would need to know more about how we would implement their init_weights and checkpointing, so I put that aside for now.
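
(A minimal sketch of that idea, reseeding at each stage boundary so weights depend only on the seed and the stage index, not on which rank builds the stage; the per-stage init_weights hook is an assumption.)

```python
import torch

def init_stage_weights(stage_modules, base_seed):
    """Hypothetical sketch: reset the RNG at every stage cut before initializing it.
    An SPMD baseline that resets at the same cuts produces identical weights."""
    for idx, stage in enumerate(stage_modules):
        torch.manual_seed(base_seed + idx)
        stage.init_weights()  # assumed per-module init hook
```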

Contributor:

Could you also add an example that saves the params after init and the grads after accumulating them over the microbatches in SPMD? Analogously, PP would save the params after init and the grads after running the step, and finally a script would compare both?

Member Author (xmfan):

Yup, I'm changing example_ds3_local_map.py to use real tensors so it can serve as the SPMD microbatch + grad accumulation baseline. And I have a script to diff the outputs of DebugInterpreter that I was thinking of landing separately from this PR.
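
(A minimal sketch of the save-and-compare flow being asked for; file paths, tolerances, and function names are placeholders.)

```python
import torch

def save_params_and_grads(model, path):
    # Save params (after init) and grads (after the step) for later comparison.
    torch.save(
        {
            "params": {k: v.detach().cpu() for k, v in model.named_parameters()},
            "grads": {k: v.grad.detach().cpu()
                      for k, v in model.named_parameters() if v.grad is not None},
        },
        path,
    )

def compare_dumps(spmd_path, pp_path, rtol=1e-5, atol=1e-6):
    spmd, pp = torch.load(spmd_path), torch.load(pp_path)
    for group in ("params", "grads"):
        for name, ref in spmd[group].items():
            other = pp[group][name]
            if not torch.allclose(ref, other, rtol=rtol, atol=atol):
                print(f"{group}/{name}: max abs diff {(ref - other).abs().max().item():.3e}")
```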

sanketpurandare (Contributor) left a comment:

Just land after #237 and add the DebugInterpreter to other graph_module calls if required.

xmfan changed the title from "Add option for run-to-run deterministic and add per op numerics logging" to "Add option for run-to-run deterministic and add optional numerics logging" Nov 7, 2025
xmfan merged commit 3088776 into main Nov 7, 2025 (5 of 6 checks passed)