Add option for run-to-run deterministic and add optional numerics logging #235
Conversation
Diff context:

```python
    numerics_logs += debug_interpreter.get_logs()
else:
    fw_outputs = torch.fx.Interpreter(fw_module).boxed_run(fw_args)
```
Reviewer: Maybe we should add this to all the graph module calls.
Author: You mean the backward? I didn't add it since I couldn't test it on the base commit.
Reviewer: Yeah, once I land #237 we can add it for full_bw, bw_dI, bw_dW, unshard, and reduce_grad.
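
For reference, a minimal sketch of what a numerics-logging interpreter built on torch.fx.Interpreter could look like; the actual DebugInterpreter in this PR may record different statistics, and the log format here is an assumption:

```python
import torch
import torch.fx

class DebugInterpreter(torch.fx.Interpreter):
    """Sketch: execute a GraphModule node by node, logging per-node numerics."""

    def __init__(self, gm: torch.fx.GraphModule):
        super().__init__(gm)
        self._logs: list[str] = []

    def run_node(self, n: torch.fx.Node):
        out = super().run_node(n)
        # Record a cheap summary statistic for floating-point tensor outputs;
        # full tensor dumps or hashes would also work, at higher cost.
        if isinstance(out, torch.Tensor) and out.is_floating_point():
            self._logs.append(f"{n.name}: norm={out.norm().item():.6e}")
        return out

    def get_logs(self) -> list[str]:
        return self._logs
```

Usage would mirror the diff above: run the graph module through the interpreter, then append get_logs() to numerics_logs.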
Diff context:

```python
args = parser.parse_args()
run_test(fake_evaluate=args.fake_evaluate)
if args.rng_seed is not None:
```
Reviewer: Let's say we have 8 ranks in total; they will all initialize their modules. Since each rank initializes a different part of the model, it is hard to compare with a single-rank implementation for numerics debugging. We should have a solution similar to what @wconstab used in torchtitan: creating a seed checkpoint and using that for PP runs.
Author: By seed checkpoint, do you mean saving and loading random weights generated from an RNG seed? I was thinking of just resetting the seed for weight init.
Reviewer: So if PP has 8 stages, you would do init_weights for each one of them using the same seed? My concern is how you would compare the pp_runtime with SPMD-only runs for numerics.
Author: If we cut our stages at nn.Module boundaries and init weights in the same order, we could reset the seeds at the same cuts during the SPMD init_weights.
For supporting arbitrary stage splits, I would need to know more about how we would implement their init_weights and checkpointing, so I've put that aside for now.
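
A minimal sketch of that seed-reset scheme, assuming stages expose a per-stage init_weights hook (the stage API here is illustrative, not this repo's actual one):

```python
import torch

def deterministic_init(stages, rng_seed: int) -> None:
    """Reset the RNG at every stage cut so PP ranks and the SPMD baseline
    draw identical random streams for weight init."""
    for stage in stages:
        torch.manual_seed(rng_seed)  # same reset at each cut point
        stage.init_weights()         # hypothetical per-stage init hook
```

The SPMD baseline would call torch.manual_seed(rng_seed) at the same nn.Module boundaries during its own init_weights, so both runs produce bitwise-identical initial weights.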
Reviewer: Could you also add an example that saves the params after init and the grads after accumulating grads by running microbatches in SPMD? Analogously, PP would also save the params after init and the grads after running the step, and finally a script would compare both.
Author: Yup, I'm changing example_ds3_local_map.py to use real tensors so it can serve as the SPMD microbatch + accumulate-grad baseline. And I have a script to diff the outputs of DebugInterpreter that I was thinking of landing separately from this PR.
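
As a hedged sketch of the save/compare harness discussed above (file layout, helper names, and tolerances are assumptions; the actual diff script is landing separately):

```python
import torch

def save_snapshot(model: torch.nn.Module, path: str) -> None:
    # Call once after init (grads empty) and once after the accumulation step.
    torch.save(
        {
            "params": {k: p.detach().cpu() for k, p in model.named_parameters()},
            "grads": {k: p.grad.detach().cpu() for k, p in model.named_parameters()
                      if p.grad is not None},
        },
        path,
    )

def compare_snapshots(spmd_path: str, pp_path: str,
                      rtol: float = 0.0, atol: float = 0.0) -> None:
    a, b = torch.load(spmd_path), torch.load(pp_path)
    for group in ("params", "grads"):
        for k in a[group]:
            if not torch.allclose(a[group][k], b[group][k], rtol=rtol, atol=atol):
                max_diff = (a[group][k] - b[group][k]).abs().max().item()
                print(f"{group}/{k}: max abs diff {max_diff:.3e}")
```

For PP, each rank would save only its own stage's parameters, so the comparison would need to stitch per-rank snapshots together (or run per stage) before diffing against the SPMD run.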
sanketpurandare left a comment:
Just land after #237 and add the DebugInterpreter to other graph_module calls if required.
Stacked PRs:
Add option for run-to-run deterministic and add optional numerics logging
Example log (running the same command twice to check run-to-run determinism):

```
tlp torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 1234 | pastry
tlp torchrun --standalone --nproc-per-node 8 examples/example_ds3_pp.py --rng-seed 1234 | pastry
```

Diff of the two runs: https://www.internalfb.com/intern/diffing/?paste_number=2027323783