Conversation

aporialiao (Contributor)

Summary:

# Main Changes

1. Enable unit test with an adaptive optimizer `Adagrad`
   - Previously the optimizer state was tested with `SGD`, whose state is static throughout training, so the test didn't actually check whether optimizer state was stored. Using `Adagrad` instead exposed that the previous implementation did not properly store optimizer state (a small illustration follows this list).
2. Properly store optimizer state in `update_optimizer_state`
   - Append the optimizer tensors as inputs to the all2all call, then parse the output tensors to store the right tensors.
   - Optimizer tensors that did not need to be sent to a new rank are persisted and resaved.
   - After the new lookups are created, use `load_state_dict` to load the saved optimizer state into the current optimizers (see the `load_state_dict` sketch below).
3. Helpers & other small changes
   - Add a helper to compare optimizer tensors in unit tests (see the comparison-helper sketch below).
   - Update `DMP` reshard optimizer saving to use the same FQN.
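A minimal sketch of why the switch to `Adagrad` matters, using plain `torch.optim` rather than TorchRec's fused optimizers (so this is illustrative, not the actual code path): vanilla `SGD` carries no per-parameter state, while `Adagrad` accumulates a per-parameter `sum` tensor that has to survive resharding.

```python
import torch

param = torch.nn.Parameter(torch.randn(4, 8))
sgd = torch.optim.SGD([param], lr=0.1)          # no momentum -> no per-parameter state
adagrad = torch.optim.Adagrad([param], lr=0.1)  # keeps a running sum of squared grads

param.grad = torch.randn_like(param)
sgd.step()
adagrad.step()

print(sgd.state_dict()["state"])      # typically {} -- plain SGD has no state to lose
print(adagrad.state_dict()["state"])  # {0: {'step': ..., 'sum': tensor(...)}}
```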
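And a hedged sketch of the persist-then-reload step: the real change goes through TorchRec's sharded/fused optimizers inside `update_optimizer_state`, so this uses plain `torch.optim.Adagrad` and made-up variable names purely to illustrate that `load_state_dict` carries the saved state into the optimizer created for the new lookups.

```python
import torch

# Optimizer state accumulated before the reshard.
old_param = torch.nn.Parameter(torch.randn(4, 8))
old_opt = torch.optim.Adagrad([old_param], lr=0.1)
old_param.grad = torch.randn_like(old_param)
old_opt.step()                          # 'sum' is now non-trivial
saved_state = old_opt.state_dict()      # persisted locally (or received via all2all)

# After resharding, new lookups mean a freshly constructed optimizer.
new_param = torch.nn.Parameter(old_param.detach().clone())
new_opt = torch.optim.Adagrad([new_param], lr=0.1)
new_opt.load_state_dict(saved_state)    # saved state loaded into the new optimizer

assert torch.equal(new_opt.state_dict()["state"][0]["sum"],
                   saved_state["state"][0]["sum"])
```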
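Finally, a rough sketch of what a "compare optimizer tensors" test helper can look like; the actual helper added in this diff lives in the TorchRec test utilities and its name and signature may differ.

```python
import torch

def assert_opt_state_equal(expected: dict, actual: dict) -> None:
    """Walk two optimizer `state` dicts and assert matching keys and close tensors."""
    assert expected.keys() == actual.keys(), f"{expected.keys()} != {actual.keys()}"
    for key, exp_state in expected.items():
        act_state = actual[key]
        assert exp_state.keys() == act_state.keys(), f"state keys differ for {key}"
        for name, exp_val in exp_state.items():
            act_val = act_state[name]
            if isinstance(exp_val, torch.Tensor):
                torch.testing.assert_close(act_val, exp_val)
            else:
                assert exp_val == act_val, f"{name}: {exp_val} != {act_val}"
```

For instance, `assert_opt_state_equal(saved_state["state"], new_opt.state_dict()["state"])` would pass for the `load_state_dict` sketch above.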

Differential Revision: D75565054

@facebook-github-bot added the CLA Signed label on Jun 6, 2025
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D75565054
