Fix pytorch checkpointing for CL callback #1581

b-chu · 2024-10-10T14:10:12Z

Adds a dataclass to store the state for the CL callback. This PR is similar to the fix here.

Context:
With the PyTorch 2.4 upgrade, DCP flattens all state dict elements that are instances of typing.Mapping or list before saving. During loading, however, we expect Mappings / lists rather than flattened elements.

Runs:
Save: https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/2834569508095648/runs/c5989c4fec75413595c029e70240cae5?o=7395834863327820
Load: https://dbc-559ffd80-2bfc.cloud.databricks.com/ml/experiments/2834569508095648/runs/91dd369dc38b4576b44e6c0059825c2b?o=7395834863327820
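
To illustrate the context above, here is a minimal, stdlib-only sketch of why wrapping the callback state in a dataclass might avoid the problem. It does not call DCP: `flatten` is a toy stand-in for the flattening step, and `CallbackState` with its field names is hypothetical, not the class added in this PR. The assumption is that a dataclass instance is neither a Mapping nor a list, so the flattening leaves it as a single leaf.

```python
from collections.abc import Mapping
from dataclasses import dataclass
from typing import Any


@dataclass
class CallbackState:
    """Hypothetical state container; field names are illustrative only."""
    schedule: list[dict[str, Any]]
    schedule_index: int


def flatten(obj: Any, prefix: str = '') -> dict[str, Any]:
    """Toy stand-in for the flattening step: recurse into Mappings and lists only."""
    if isinstance(obj, Mapping):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Anything else (including a dataclass instance) is kept as one leaf.
        return {prefix: obj}
    flat: dict[str, Any] = {}
    for key, value in items:
        child = f'{prefix}.{key}' if prefix else str(key)
        flat.update(flatten(value, child))
    return flat


# State returned as nested dicts / lists: the structure is lost on save.
nested = {'schedule': [{'duration': '1ep'}], 'schedule_index': 0}
flat_nested = flatten(nested)

# State wrapped in a dataclass: the whole object survives as a single entry.
wrapped = {'state': CallbackState(schedule=[{'duration': '1ep'}], schedule_index=0)}
flat_wrapped = flatten(wrapped)

print(flat_nested)
print(flat_wrapped)
```

In the nested case the loader would see keys such as `schedule.0.duration` instead of the original list of dicts. In the wrapped case it gets back one `CallbackState` object that can be unpacked directly.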

llmfoundry/callbacks/curriculum_learning_callback.py

tests/callbacks/test_curriculum_learning_callback.py

j316chuck

Can we add a test run to the PR description to make sure things work? Other than that, this looks good to me ✅

bigning

Please also add a save/load test.

b-chu · 2024-10-10T20:38:06Z

Save and load run added

This reverts commit 1654827.

b-chu requested a review from a team as a code owner October 10, 2024 14:10

b-chu requested review from j316chuck and bigning October 10, 2024 14:10

b-chu force-pushed the fix-cl branch from 7f8e3e7 to cbbbe8f October 10, 2024 14:37

snarayan21 reviewed Oct 10, 2024

b-chu force-pushed the fix-cl branch from cbbbe8f to 1dae2ae October 10, 2024 15:50

Fix pytorch checkpointing for CL callback

7025246

b-chu force-pushed the fix-cl branch from 1dae2ae to 7025246 October 10, 2024 16:08

snarayan21 reviewed Oct 10, 2024

llmfoundry/callbacks/curriculum_learning_callback.py

tests/callbacks/test_curriculum_learning_callback.py

j316chuck approved these changes Oct 10, 2024

bigning approved these changes Oct 10, 2024

llmfoundry/callbacks/curriculum_learning_callback.py

b-chu merged commit 1654827 into main Oct 10, 2024
9 checks passed

b-chu added a commit that referenced this pull request Oct 10, 2024

Revert "Fix pytorch checkpointing for CL callback (#1581)"

0e983bb

This reverts commit 1654827.

b-chu mentioned this pull request Oct 10, 2024

Fix pytorch checkpointing for CL callback #1583

Merged
