
Conversation


@tushar00jain tushar00jain commented Jul 23, 2025

Summary:

  • add a configuration option letting users specify how they want to partition the model into fragments
  • if this option is provided, the model must implement `FaultTolerantTrainingSpec`, which defines the fragmentation function that splits the model according to the configuration
  • determine the model fragments in the training script and pass them to the ft manager
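The fragmentation step above can be sketched as follows. This is an illustrative sketch only, not the actual `FaultTolerantTrainingSpec` API: the function name `make_fragments` and the plain-list representation of layers are assumptions, standing in for however the spec's fragmentation function partitions the model's modules.

```python
# Hypothetical sketch: partition an ordered list of model layers into
# contiguous fragments that can be synced independently. The name
# make_fragments and the list-based interface are illustrative, not the
# real torchtitan API.

def make_fragments(layers: list, num_fragments: int) -> list[list]:
    """Split `layers` into `num_fragments` contiguous, near-equal chunks."""
    if num_fragments < 1 or num_fragments > len(layers):
        raise ValueError("num_fragments must be in [1, len(layers)]")
    base, extra = divmod(len(layers), num_fragments)
    fragments, start = [], 0
    for i in range(num_fragments):
        # Earlier fragments absorb the remainder, one extra layer each.
        size = base + (1 if i < extra else 0)
        fragments.append(layers[start:start + size])
        start += size
    return fragments
```

With 2 fragments, as in the test plan below, an 8-layer model would be split into two 4-layer fragments, each synchronized on its own schedule.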

Test Plan:
Ran Llama 3 8B with 2 fragments and a 1-step delay; each fragment is synced every 20 steps.

image
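For intuition about the test-plan settings (2 fragments, each synced every 20 steps), here is one plausible staggered schedule in which fragment syncs are offset so they never all land on the same step. This is an assumption for illustration, not the actual torchtitan sync logic.

```python
# Illustrative only: a staggered schedule where each of num_fragments
# fragments is synced every sync_every steps, offset evenly so the syncs
# are spread out. Not the real torchtitan implementation.

def fragments_due(step: int, num_fragments: int, sync_every: int) -> list[int]:
    """Return indices of fragments whose sync falls on this step."""
    offset = sync_every // num_fragments
    return [
        i for i in range(num_fragments)
        if step % sync_every == (i * offset) % sync_every
    ]
```

Under this schedule with `num_fragments=2` and `sync_every=20`, fragment 0 syncs at steps 20, 40, ... and fragment 1 at steps 10, 30, ..., so each fragment is still synced every 20 steps.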

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 23, 2025
@tushar00jain tushar00jain force-pushed the pr1446 branch 4 times, most recently from be87993 to 04977f1 Compare July 24, 2025 00:55
@tushar00jain tushar00jain mentioned this pull request Jul 24, 2025
@tushar00jain tushar00jain force-pushed the pr1446 branch 3 times, most recently from 67b20d0 to 2926160 Compare July 26, 2025 19:43
@tushar00jain tushar00jain force-pushed the pr1446 branch 5 times, most recently from 321a888 to d67485a Compare July 28, 2025 18:34
@tushar00jain tushar00jain marked this pull request as draft July 28, 2025 18:41

tushar00jain commented Jul 28, 2025

Discussed offline with @tianyu-l. Planning to simplify some of this and keep the changes to train.py minimal. Also planning to write up an RFC with the context around these changes, including the value proposition and how the changes can be made.

@tushar00jain tushar00jain force-pushed the pr1446 branch 3 times, most recently from b7d7242 to bff2c52 Compare July 31, 2025 03:25
This was referenced Jul 31, 2025
@tushar00jain tushar00jain force-pushed the pr1446 branch 2 times, most recently from ef69776 to a4284e0 Compare July 31, 2025 03:47
@tushar00jain tushar00jain force-pushed the pr1446 branch 8 times, most recently from 9e317da to ac9ec1f Compare July 31, 2025 20:45
@tushar00jain tushar00jain marked this pull request as ready for review July 31, 2025 20:55
@tushar00jain tushar00jain force-pushed the pr1446 branch 2 times, most recently from 4377685 to 0a8e148 Compare August 1, 2025 21:55
@tushar00jain tushar00jain force-pushed the pr1446 branch 3 times, most recently from 56cb433 to ec935f5 Compare August 2, 2025 22:29
@tushar00jain tushar00jain requested a review from tianyu-l August 4, 2025 22:38
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
Summary:
remove some stale code that determines parameters to pass to outer
optimizer

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1501).
* pytorch#1446
* pytorch#1502
* __->__ pytorch#1501
bentherien pushed a commit to bentherien/torchtitan_ that referenced this pull request Aug 5, 2025
Summary:
the leaf folder wasn't being created and no profiles were being
written, so create it if it doesn't exist

---
[//]: # (BEGIN SAPLING FOOTER)
Stack created with [Sapling](https://sapling-scm.com). Best reviewed
with
[ReviewStack](https://reviewstack.dev/pytorch/torchtitan/pull/1502).
* pytorch#1446
* __->__ pytorch#1502
* pytorch#1501
@tianyu-l tianyu-l merged commit 3065a2a into pytorch:main Aug 6, 2025
10 of 15 checks passed
@tushar00jain tushar00jain deleted the pr1446 branch August 6, 2025 05:26
@tianyu-l tianyu-l mentioned this pull request Aug 6, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025
joellidin pushed a commit to one-covenant/torchtitan that referenced this pull request Aug 8, 2025