
replacing apex.normalization.FusedLayerNorm with torch.nn.LayerNorm #9377

Closed
stas00 opened this issue Jan 2, 2021 · 2 comments · Fixed by #9386

Comments

@stas00 (Contributor) commented Jan 2, 2021

It seems that the time has arrived to drop apex.normalization.FusedLayerNorm in favor of torch.nn.LayerNorm:

  1. the latter was ported from apex into PyTorch more than a year ago ("Add fused layer norm impl on CUDA in PyTorch", pytorch/pytorch#27634, around pt-1.4)
  2. it's faster than the apex one according to my benchmarks ("Builtin FusedLayerNorm is slower than apex one", pytorch/pytorch#37713 (comment)): 33% faster on rtx-3090, 10% faster on gtx-1070

But note: the same benchmark run on a V100 at facebookresearch/fairseq#2012 (comment) reports the opposite - that the native one is slower (pt-1.5). So it would help to run this very quick benchmark on other cards and compare. In particular, if you have access to a V100, please report back your findings in this thread: pytorch/pytorch#37713

The main reason for this change is that apex.normalization.FusedLayerNorm is buggy (it corrupts memory) when it comes to switching devices, which is done a lot under Model Parallel (NVIDIA/apex#1022).
With apex.normalization.FusedLayerNorm things fail a lot under MP, and the workaround requires sticking torch.cuda.set_device(id) in many, many places :( Since this layer norm is chosen at the model's init time, it's not possible to avoid it under MP, as MP only gets activated after the model's init.

I will use that workaround if you find that apex is still faster on some hardware that is important to support. And, of course, in that case please report back to the pytorch team so that they can fix it. Otherwise, apex is pretty much unmaintained at this point, and it's just a matter of time before it becomes unusable.
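For illustration only, here is a minimal sketch of what that workaround looks like in practice (the device id and layer size are made up, and it assumes apex is installed with its CUDA extensions):

```python
# Hypothetical sketch of the workaround mentioned above: under model parallel,
# pin the current CUDA device before constructing the apex FusedLayerNorm so
# its kernels don't end up bound to the wrong device.
import torch
from apex.normalization import FusedLayerNorm

device_id = 1  # whichever GPU this model-parallel shard lives on (made-up value)
torch.cuda.set_device(device_id)
layer_norm = FusedLayerNorm(1024).to(f"cuda:{device_id}")
```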

The models that need that change are bart/fsmt/prophetnet.
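For context, a rough before/after sketch of the proposed swap (the try/except fallback is an approximation of how such models typically guard the apex import, not a copy of the actual source files):

```python
import torch
from torch import nn

# Before (approximate pattern, shown only for illustration):
# try:
#     from apex.normalization import FusedLayerNorm as LayerNorm
# except ImportError:
#     from torch.nn import LayerNorm

# After: always use the builtin, which has had a fused CUDA kernel since ~pt-1.4
LayerNorm = nn.LayerNorm

layer_norm = LayerNorm(768, eps=1e-5)
out = layer_norm(torch.randn(2, 16, 768))
```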

@patrickvonplaten, @LysandreJik

@patrickvonplaten (Contributor) commented

I'm good with changing to torch.nn.LayerNorm. @stas00 - do you know what the advantage of apex.normalization.FusedLayerNorm is supposed to be? Why did we add it in the first place?

@stas00 (Contributor, Author) commented Jan 2, 2021

Prior to about a year ago, apex.normalization.FusedLayerNorm was faster than torch.nn.LayerNorm, but then the former got ported into native torch.nn.LayerNorm, and now the native one appears to be faster - at least on the 2 cards I have experimented with. I checked with pt-1.4 through pt-1.8.dev.

If you have cards other than the gtx-1070/rtx-3090 I benchmarked with, please run that benchmark and see if the same holds: pytorch/pytorch#37713 (comment)
It only takes a few seconds if you already have apex installed. To install apex:

```
git clone https://github.com/NVIDIA/apex
cd apex
rm -rf build
pip install --global-option="--cpp_ext" --global-option="--cuda_ext" .
```

The benchmark measures and reports the total run time, so the smaller the number, the faster it is.
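If it's easier, here's a minimal timing sketch along the same lines (this is not the exact script from pytorch/pytorch#37713; the tensor shapes and step count are arbitrary, and it assumes apex and a CUDA device are available):

```python
import time
import torch
from torch.nn import LayerNorm
from apex.normalization import FusedLayerNorm

def bench(norm, x, steps=1000):
    # warm up, then time `steps` forward passes on the GPU
    for _ in range(10):
        norm(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        norm(x)
    torch.cuda.synchronize()
    return time.time() - start

hidden = 1024
# arbitrary shapes, not necessarily the ones used in the linked benchmark
x = torch.randn(64, 512, hidden, device="cuda")
print("torch.nn.LayerNorm:  ", bench(LayerNorm(hidden).cuda(), x))
print("apex FusedLayerNorm: ", bench(FusedLayerNorm(hidden).cuda(), x))
```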

If you do run the benchmarks, please post your results at pytorch/pytorch#37713, so that the decision to drop apex.normalization.FusedLayerNorm can be based on hard data rather than anecdotal info.

Thank you.

stas00 changed the title from "removing apex.normalization.FusedLayerNorm in favor of faster torch.nn.LayerNorm" to "replacing apex.normalization.FusedLayerNorm with torch.nn.LayerNorm" on Jan 2, 2021
fhieber added a commit to awslabs/sockeye that referenced this issue Nov 8, 2021