
Conversation


@tushar00jain tushar00jain commented Jul 23, 2025

Summary:

  • add a configuration option letting users specify how they want to partition the model into fragments
  • if this option is provided, the model must implement `FaultTolerantTrainingSpec`, which defines the fragmentation function that splits the model according to the configuration
  • determine the model fragments in the training script and pass them to the ft manager
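The fragmentation step above can be sketched as follows. This is an illustrative sketch only, not the actual `FaultTolerantTrainingSpec` API: the function name `make_fragments` and the plain-list representation of layers are assumptions, standing in for however the spec's fragmentation function partitions the model's modules.

```python
# Hypothetical sketch: partition an ordered list of model layers into
# contiguous fragments that can be synced independently. The name
# make_fragments and the list-based interface are illustrative, not the
# real torchtitan API.

def make_fragments(layers: list, num_fragments: int) -> list[list]:
    """Split `layers` into `num_fragments` contiguous, near-equal chunks."""
    if num_fragments < 1 or num_fragments > len(layers):
        raise ValueError("num_fragments must be in [1, len(layers)]")
    base, extra = divmod(len(layers), num_fragments)
    fragments, start = [], 0
    for i in range(num_fragments):
        # Earlier fragments absorb the remainder, one extra layer each.
        size = base + (1 if i < extra else 0)
        fragments.append(layers[start:start + size])
        start += size
    return fragments
```

With 2 fragments, as in the test plan below, an 8-layer model would be split into two 4-layer fragments, each synchronized on its own schedule.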

Test Plan:
Ran Llama 3 8B with 2 fragments and a 1-step delay; each fragment is synced every 20 steps.

image
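For intuition about the test-plan settings (2 fragments, each synced every 20 steps), here is one plausible staggered schedule in which fragment syncs are offset so they never all land on the same step. This is an assumption for illustration, not the actual torchtitan sync logic.

```python
# Illustrative only: a staggered schedule where each of num_fragments
# fragments is synced every sync_every steps, offset evenly so the syncs
# are spread out. Not the real torchtitan implementation.

def fragments_due(step: int, num_fragments: int, sync_every: int) -> list[int]:
    """Return indices of fragments whose sync falls on this step."""
    offset = sync_every // num_fragments
    return [
        i for i in range(num_fragments)
        if step % sync_every == (i * offset) % sync_every
    ]
```

Under this schedule with `num_fragments=2` and `sync_every=20`, fragment 0 syncs at steps 20, 40, ... and fragment 1 at steps 10, 30, ..., so each fragment is still synced every 20 steps.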

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 23, 2025
@tushar00jain tushar00jain force-pushed the pr1446 branch 4 times, most recently from be87993 to 04977f1 Compare July 24, 2025 00:55
@tushar00jain tushar00jain mentioned this pull request Jul 24, 2025
@tushar00jain tushar00jain force-pushed the pr1446 branch 3 times, most recently from 67b20d0 to 2926160 Compare July 26, 2025 19:43
@tushar00jain tushar00jain force-pushed the pr1446 branch 5 times, most recently from 321a888 to d67485a Compare July 28, 2025 18:34
@tushar00jain tushar00jain marked this pull request as draft July 28, 2025 18:41

tushar00jain commented Jul 28, 2025

Discussed offline with @tianyu-l. Planning to simplify some of this and keep the changes to train.py minimal. Also planning to write up an RFC with the context around these changes, including the value proposition and how the changes can be made.

@tushar00jain tushar00jain force-pushed the pr1446 branch 3 times, most recently from b7d7242 to bff2c52 Compare July 31, 2025 03:25
This was referenced Jul 31, 2025
@tushar00jain tushar00jain force-pushed the pr1446 branch 2 times, most recently from ef69776 to a4284e0 Compare July 31, 2025 03:47
@tushar00jain tushar00jain force-pushed the pr1446 branch 8 times, most recently from 9e317da to ac9ec1f Compare July 31, 2025 20:45
@tushar00jain tushar00jain marked this pull request as ready for review July 31, 2025 20:55
@tushar00jain tushar00jain force-pushed the pr1446 branch 2 times, most recently from 4377685 to 0a8e148 Compare August 1, 2025 21:55
@tushar00jain tushar00jain force-pushed the pr1446 branch 3 times, most recently from 56cb433 to ec935f5 Compare August 2, 2025 22:29
@tushar00jain tushar00jain requested a review from tianyu-l August 4, 2025 22:38
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
Summary:
remove some stale code that determines parameters to pass to outer
optimizer

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1501).
* pytorch#1446
* pytorch#1502
* __->__ pytorch#1501
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
Summary:
the leaf folder wasn't being created and no profiles were being
written, so create it if it doesn't exist

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1502).
* pytorch#1446
* __->__ pytorch#1502
* pytorch#1501
@tianyu-l tianyu-l merged commit 3065a2a into pytorch:main Aug 6, 2025
10 of 15 checks passed
@tushar00jain tushar00jain deleted the pr1446 branch August 6, 2025 05:26
@tianyu-l tianyu-l mentioned this pull request Aug 6, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025