Discrepancy with Optimizer States and Model State Dict when using store_param_remainders==True #1842
cc @timmoon10

Update:
This function is invoked during checkpoint dump, but we are concerned about the implementation details related to every optimizer step as well.
The loss is inevitable with the current approach. The fix would be, instead of doing a simple cast to get the bf16 param, to rely on bit manipulation to separate the higher and lower 16 bits into the bf16 param and the 16-bit remainders, thus avoiding rounding/undo-rounding entirely.
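To illustrate that idea, here is a minimal, standalone PyTorch sketch (not the apex kernel code) of splitting an fp32 tensor into a truncated bf16 half plus a 16-bit remainder; because no rounding is ever applied, the round trip is bit-exact:

```python
import torch

def split_fp32(x: torch.Tensor):
    """Split fp32 into (bf16 holding the top 16 bits, int32 holding the low 16 bits).

    The bf16 half is obtained by truncation (low bits zeroed before the cast),
    so the cast itself is exact and no round-to-nearest step is involved.
    """
    bits = x.view(torch.int32)
    rem = bits & 0xFFFF                      # low 16 bits of the fp32 encoding
    top = (bits - rem).view(torch.float32)   # fp32 with low 16 bits zeroed
    return top.to(torch.bfloat16), rem       # exact cast: nothing left to round

def join_fp32(bf16: torch.Tensor, rem: torch.Tensor):
    """Reassemble the original fp32 bits from the bf16 half and the remainder."""
    top_bits = bf16.to(torch.float32).view(torch.int32)  # widening cast is exact
    return (top_bits | rem).view(torch.float32)

x = torch.randn(1024, dtype=torch.float32)
bf16, rem = split_fp32(x)
assert torch.equal(join_fp32(bf16, rem), x)  # bit-exact round trip
```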
Great debugging. It's tricky that RN (round-to-nearest) rounding is irreversible unless we store an extra bit, which seems excessive given that these errors are just at the level of machine epsilon. @alxzhang-amazon How important is bit-wise equivalence for your use case?
If we do think of a good approach to reproduce RN rounding, then we'll also need to modify the Adam kernel, in addition to the state-dict conversion path.
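As a concrete, standalone PyTorch illustration of why the RN cast cannot be undone from the stored remainder alone (again not apex code): when the low 16 bits of an fp32 value exceed half a bf16 ULP, the cast bumps the upper 16 bits, so gluing the rounded bf16 bits back onto the remainder reconstructs a different fp32, and downcasting that reconstruction no longer matches the original bf16 param.

```python
import torch

# fp32 bit pattern 0x3F80C000 ≈ 1.00586: the low 16 bits (0xC000) exceed half a
# bf16 ULP, so a round-to-nearest cast increments the upper 16 bits.
x = torch.tensor([0x3F80C000], dtype=torch.int32).view(torch.float32)

bf16_rn = x.to(torch.bfloat16)            # PyTorch bf16 cast: round-to-nearest-even
rem     = x.view(torch.int32) & 0xFFFF    # low 16 bits kept as the "remainder"

print(hex(bf16_rn.view(torch.int16).item() & 0xFFFF))    # 0x3f81: upper bits were bumped
print(hex((x.view(torch.int32) >> 16).item() & 0xFFFF))  # 0x3f80: the true upper bits

# Concatenating the rounded bf16 bits with the remainder does not recover x ...
recon = (bf16_rn.to(torch.float32).view(torch.int32) | rem).view(torch.float32)
print(torch.equal(recon, x))                           # False

# ... and downcasting that reconstruction drifts one more ULP away from bf16_rn.
print(torch.equal(recon.to(torch.bfloat16), bf16_rn))  # False
```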
Hi @timmoon10. We are not reliant on using store_param_remainders==True, so we can switch the flag if bit-wise equivalence is required. To clarify my understanding, is it the case that this loss occurs in the Adam kernel as well, and therefore during each training step, or is it localized to just the checkpoint-dump conversion?
To my understanding, the rounding methods differ between the two code paths. Please correct me if I am mistaken or misunderstanding.
Both
Thank you @timmoon10. So, to my current understanding now:
This has no impact during training steps, since between training steps the kernel code does the conversion, and it is consistent in its rounding logic.
To my understanding, the Distributed Fused Adam file does not use that same logic when dumping the optimizer states; there, the fp32 params are reconstructed from the bf16 params and the stored 16-bit remainders.
Therefore, when using store_param_remainders==True, the tensor elements are constructed with a different rounding method than the one that produced the bf16 params, which explains the mismatch.
Does this summary seem correct to you?
Context:
When we dump a V2-format optimizer state dict, it contains the params fields, which we expect to be high-precision versions of the model parameters.
When we train using store_params==True and store_param_remainders==False, we see this expected behavior: when we downcast the optimizer state dict params, they match the model state dict parameters exactly.
However, when we train using store_params==False and store_param_remainders==True, we do not see this expected behavior. Instead, we see differences between the down-casted optimizer state parameters and the actual model parameters.
I am wondering if this is intentional, or if perhaps this function https://github.com/NVIDIA/apex/blob/master/apex/contrib/optimizers/distributed_fused_adam.py#L247
is not lossless.
I have provided a sample script that should showcase this issue.
We see that when using store_params==True and store_param_remainders==False, all tensors match. When using store_params==False and store_param_remainders==True, we see a mismatch.
store_param_remainders_dfa_test.txt
store_params_dfa_test.txt
The script used to highlight this issue:
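The attached script itself is not reproduced here. A hedged sketch of that kind of check (assuming a single CUDA rank, that DistributedFusedAdam accepts the store_params / store_param_remainders constructor flags named in this issue, and with extract_fp32_params left as a hypothetical placeholder to be adapted to whatever layout your apex version's state_dict() returns) might look like:

```python
import os
import torch
import torch.distributed as dist
from apex.contrib.optimizers.distributed_fused_adam import DistributedFusedAdam

def extract_fp32_params(optim_state_dict):
    """Hypothetical placeholder: return the high-precision param tensors from the
    optimizer state dict, in the same order as model.parameters(). The exact
    layout depends on the apex version, so adapt this to what state_dict() returns."""
    raise NotImplementedError

def main():
    # Single-rank process group so DistributedFusedAdam can be used standalone.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=0, world_size=1)
    torch.cuda.set_device(0)

    model = torch.nn.Linear(256, 256, bias=False).cuda().to(torch.bfloat16)
    optim = DistributedFusedAdam(
        model.parameters(),
        lr=1e-3,
        store_params=False,
        store_param_remainders=True,  # flip these two flags to reproduce both runs
        # Depending on the apex version, bf16 params may also require
        # param_sync_dtype=torch.bfloat16 here.
    )

    # One optimizer step so the optimizer state is populated.
    out = model(torch.randn(8, 256, device="cuda", dtype=torch.bfloat16))
    out.float().square().mean().backward()
    optim.step()

    # Downcast the high-precision optimizer params and compare bit-wise
    # against the model's bf16 parameters.
    for p, master in zip(model.parameters(), extract_fp32_params(optim.state_dict())):
        match = torch.equal(master.to(torch.bfloat16).reshape(p.shape), p.detach())
        print("bit-wise match:", match)

if __name__ == "__main__":
    main()
```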