Commit b112c99
Fix loading a universal checkpoint (#5263)
This PR fixes the following two points regarding checkpoint loading.
- Load optimizer states
With [this PR](#5104), we
removed the optimizer's `step()` call on initialization. This made DeepSpeed's
parameter update match PyTorch's normal behavior. However, the optimizer
states no longer have keys when we load a checkpoint.
For legacy/elastic checkpoints, that PR changed the checkpoint loaders to
create keys and buffers on loading. However, the loader for universal
checkpoints still relies on keys in the optimizer states. As a result,
loading a universal checkpoint fails.
This PR fixes the loader to find optimizer state keys from the given
checkpoint.
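The idea of reading the state keys from the checkpoint instead of from the live optimizer can be sketched as follows. This is a minimal illustration, not DeepSpeed's loader: the on-disk layout (one `<key>.pt` file per state tensor inside a per-parameter folder) and the helper name `find_state_keys` are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def find_state_keys(param_dir):
    """Return optimizer state keys found on disk for one parameter.

    Hypothetical layout: each state tensor is stored as `<key>.pt`
    inside the parameter's folder.
    """
    return sorted(p.stem for p in Path(param_dir).glob("*.pt"))


# An optimizer that has not called step() yet holds no state, so it
# cannot tell the loader which keys to restore.
live_optimizer_state = {}

with tempfile.TemporaryDirectory() as tmp:
    # Build a stand-in checkpoint folder for a single parameter.
    param_dir = Path(tmp) / "layer0.weight"
    param_dir.mkdir()
    for key in ("exp_avg", "exp_avg_sq"):
        (param_dir / f"{key}.pt").touch()

    keys_from_optimizer = list(live_optimizer_state.keys())
    keys_from_checkpoint = find_state_keys(param_dir)
```

Here `keys_from_optimizer` is empty while `keys_from_checkpoint` lists the saved states, which is why a loader that depends on the former has nothing to iterate over.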
- Resume step count
2943e6a
The checkpoint loader for a universal checkpoint resumes the optimizer's
step count only when the param group already has `step`. But some
optimizers create the key `step` in a param group at the first call of
`step()` (e.g. Apex [Fused
Adam](https://github.com/NVIDIA/apex/blob/810ffae374a2b9cb4b5c5e28eaeca7d7998fca0c/apex/optimizers/fused_adam.py#L154)).
In this case, the step count is not restored. This PR changes this
behavior to always set the step count in a param group.
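The difference between the two behaviors can be shown with a plain dict standing in for a param group. This is a sketch under assumptions: the function names `restore_step_if_present` and `restore_step_always` are hypothetical and do not appear in DeepSpeed.

```python
def restore_step_if_present(param_group, saved_step):
    """Earlier behavior as described: only overwrite an existing `step` key."""
    if "step" in param_group:
        param_group["step"] = saved_step


def restore_step_always(param_group, saved_step):
    """Behavior described for this PR: set `step` unconditionally."""
    param_group["step"] = saved_step


# Param group of an optimizer that adds `step` only on its first step() call,
# so the key is absent right after construction.
lazy_group = {"lr": 1e-3}

restore_step_if_present(lazy_group, saved_step=100)
step_before_fix = lazy_group.get("step")  # key was never created

restore_step_always(lazy_group, saved_step=100)
step_after_fix = lazy_group.get("step")
```

With the conditional version the saved count is silently dropped for such optimizers; the unconditional version restores it regardless of whether the key existed.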
This PR also stops incrementing the step count when loading. I didn't see
why we need to increment the step count in my small example, but we may
need a discussion to consider various cases.

1 parent 2df8e23 commit b112c99
File tree
6 files changed, +58 -21 lines changed
- deepspeed
- checkpoint
- runtime
- zero
- utils