Replacing `apex.normalization.FusedLayerNorm` with `torch.nn.LayerNorm` #9377
I'm good with changing to `torch.nn.LayerNorm`.

Prior to about a year ago `apex.normalization.FusedLayerNorm` was the faster of the two, but this no longer appears to be the case. If you have cards other than the gtx-1070/rtx-3090 I benchmarked with, please run that benchmark and see whether this holds true for them as well: pytorch/pytorch#37713 (comment). The benchmark measures and reports a total run time, so the smaller the numbers, the faster it is. If you do run the benchmarks, please post your results at pytorch/pytorch#37713 so that it can be seen whether it's safe to drop `apex.normalization.FusedLayerNorm`. Thank you.
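A total-run-time comparison of the kind described above can be sketched as follows. This is a hypothetical, minimal stand-in for the linked benchmark script, not a reproduction of it; the shapes, step count, and `bench` helper are illustrative:

```python
import time
import torch

def bench(norm, x, steps=100):
    # total wall-clock time for repeated forward passes; smaller is faster
    t0 = time.perf_counter()
    for _ in range(steps):
        norm(x)
    if x.is_cuda:
        # CUDA kernels launch asynchronously, so wait for them before timing
        torch.cuda.synchronize()
    return time.perf_counter() - t0

x = torch.randn(8, 128, 1024)
native = torch.nn.LayerNorm(1024)
print(f"native total: {bench(native, x):.4f}s")
```

The apex variant would be timed the same way by passing a `FusedLayerNorm(1024)` instance (and a CUDA tensor) to the same `bench` helper.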
It seems that the time has arrived to drop `apex.normalization.FusedLayerNorm` in favor of `torch.nn.LayerNorm`.

But note: this same benchmark, run here facebookresearch/fairseq#2012 (comment) on a V100, reports the opposite - that the native version is slower (pt-1.5). So it might help to run this very quick benchmark on other cards and compare. In particular, if you have access to a V100, please report your findings back in this thread: pytorch/pytorch#37713
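The swap itself is mechanical, since the two classes share the same constructor arguments. A minimal sketch (the `hidden_size` value and variable names are illustrative, not taken from any of the models mentioned):

```python
# before: from apex.normalization import FusedLayerNorm as LayerNorm
# after:  use the native implementation directly
from torch.nn import LayerNorm
import torch

hidden_size = 16
ln = LayerNorm(hidden_size, eps=1e-5)  # same signature FusedLayerNorm accepts
x = torch.randn(2, 4, hidden_size)
y = ln(x)
# each feature vector is normalized to ~zero mean and ~unit variance
print(y.shape)  # torch.Size([2, 4, 16])
```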
The main reason for this change is that `apex.normalization.FusedLayerNorm` is buggy (corrupts memory) when it comes to switching devices, which is done a lot under Model Parallel: NVIDIA/apex#1022

With `apex.normalization.FusedLayerNorm` things fail a lot under MP, and working around it requires sticking `torch.cuda.set_device(id)` in many, many places :( Since this overload is used at the model's init time, it's not possible to avoid it under MP, as the latter gets activated only after the model's init. I will use that workaround if it turns out that apex is still faster on some important-to-consider hardware. And, of course, in that case please report back to the pytorch team so that they can fix it. Otherwise, apex support is pretty much no more, and it's just a matter of time before apex becomes unusable.
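The `torch.cuda.set_device(id)` workaround described above might look roughly like this. This is a hedged sketch - `move_and_run` is a hypothetical helper, not code from any of the linked repos - and it degrades to a plain CPU call when CUDA is unavailable:

```python
import torch

def move_and_run(module, x, device_id=0):
    # hypothetical helper: apex's fused kernel assumes the current CUDA device
    # matches the tensor's device, so set it explicitly before each
    # cross-device call - this is the boilerplate the issue complains about
    if torch.cuda.is_available():
        torch.cuda.set_device(device_id)
        module = module.to(f"cuda:{device_id}")
        x = x.to(f"cuda:{device_id}")
    return module(x)

# falls back to a plain CPU run when CUDA is unavailable
out = move_and_run(torch.nn.LayerNorm(8), torch.randn(3, 8))
print(out.shape)  # torch.Size([3, 8])
```

With the native `torch.nn.LayerNorm`, none of this is needed: the module simply runs on whatever device its parameters and inputs live on.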
The models that need that change are bart/fsmt/prophetnet.
@patrickvonplaten, @LysandreJik