Closed
Description
I have a custom Megatron model and a correspondingly customized DeepSpeed. I believe I have incorporated your recent update correctly, but when I try to train a ZeRO-3 model I get the error `RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0`.
When I turn off CPU Adam, I get a different error instead: `RuntimeError: start (0) + length (174763) exceeds dimension size (174761)`.
In both cases a tensor size is off by 2 (171 vs. 169, and 174763 vs. 174761), but I have no idea what is causing this. My code is very similar to yours overall, though as I note at deepspeedai/DeepSpeedExamples#92, I cannot get your code to run either (for different reasons).
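For context, both messages look like generic PyTorch shape errors rather than anything DeepSpeed-specific. The sketch below is not the training code; it is a minimal standalone illustration, assuming only that `torch` is installed, that triggers the same two kinds of error using the sizes from the messages above. The function names are hypothetical and chosen for illustration only.

```python
import torch


def elementwise_mismatch(a_len=171, b_len=169):
    """Add two 1-D tensors of different lengths and return the error text.

    Neither length is 1, so broadcasting cannot reconcile the shapes.
    Returns None if no error is raised.
    """
    try:
        torch.zeros(a_len) + torch.zeros(b_len)
    except RuntimeError as err:
        return str(err)
    return None


def narrow_out_of_range(size=174761, length=174763):
    """Take a slice longer than the tensor via narrow() and return the error text.

    Returns None if no error is raised.
    """
    try:
        torch.zeros(size).narrow(0, 0, length)
    except RuntimeError as err:
        return str(err)
    return None


if __name__ == "__main__":
    print(elementwise_mismatch())
    print(narrow_out_of_range())
```

If this reading is right, the first error would come from an elementwise operation between two buffers whose lengths differ by 2, and the second from slicing a flat buffer with a length that is 2 larger than the buffer itself. Where those two extra elements come from in the ZeRO-3 path is still the open question.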