"Full" gelu without approximation #628

Open
se-schmitt opened this issue Feb 5, 2025 · 9 comments · May be fixed by #629

Comments

@se-schmitt

Motivation and description

Currently, only the tanh approximation of the gelu activation function is implemented. However, the full gelu, 0.5 * x * (1 + erf(x / sqrt(2))), is the default in PyTorch and TensorFlow and is used in many pretrained models (like BERT). When using pretrained models from HuggingFace, e.g. through Transformers.jl, this causes slight but significant differences. It would therefore be nice to have the option to use the "full" gelu (see also this PR in Transformers.jl).
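
For illustration, a minimal sketch contrasting the tanh approximation with the exact erf-based form (assuming SpecialFunctions.jl for erf; the names gelu_tanh_approx and gelu_exact are illustrative, not NNlib's actual code):

```julia
using SpecialFunctions: erf  # the exact form needs an erf implementation

# Tanh approximation (the kind of variant the issue says NNlib currently implements):
gelu_tanh_approx(x) = 0.5x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3)))

# "Full" gelu based on the error function:
gelu_exact(x) = 0.5x * (1 + erf(x / sqrt(2)))

x = 1.5f0
gelu_tanh_approx(x) - gelu_exact(x)  # small but nonzero difference
```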

Possible Implementation

I would make a PR to add the full gelu. Options:

  • new function gelu_full -> least impact
  • replace gelu and rename the existing gelu to e.g. gelu_tanh -> impacts existing code
  • add a kwarg like gelu(x; approximate=false) = ... -> not sure whether this is easily possible, as the keyword would need to be propagated through the code (see the sketch below the list)

What would be the best option?
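
A hypothetical sketch of the keyword-argument option (gelu_kw is an illustrative name, not an existing NNlib function); since layers usually take the activation as a plain function, the keyword would most likely have to be fixed via a closure:

```julia
using SpecialFunctions: erf

# Hypothetical keyword-argument variant (illustrative only):
gelu_kw(x; approximate::Bool=true) =
    approximate ? 0.5x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3))) :
                  0.5x * (1 + erf(x / sqrt(2)))

# Layers take activations as plain functions, so the keyword would be carried
# through a closure rather than through the layer itself:
act = x -> gelu_kw(x; approximate=false)
act.(randn(Float32, 4))
```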
@ToucheSir
Member

One idea would be to rename the current gelu to gelu_fast, matching the pattern we've established with other activation functions. However, I don't understand chengchingwen/Transformers.jl#209 (comment) well enough to know if this would help with that.

@se-schmitt se-schmitt linked a pull request Feb 6, 2025 that will close this issue
@se-schmitt
Author

I added a PR and named the functions as suggested. I think the comment means that the configs of Transformers.jl should match the HuggingFace nomenclature (otherwise they wouldn't be loaded automatically). With the PR submitted here, that would be possible, as it enables the correct assignment of the HuggingFace hidden_act options gelu and gelu_python to the new gelu function implemented in the PR. Is this correct, @chengchingwen?

@chengchingwen
Member

Yes, chengchingwen/Transformers.jl#209 (comment) is simply saying that we need separate functions for gelu with and without the tanh approximation, and it would be nice if NNlib.jl had both.

@ToucheSir ToucheSir linked a pull request Feb 6, 2025 that will close this issue
@avik-pal
Member

avik-pal commented Feb 6, 2025

Changing the implementation of a function as common as gelu should be a breaking change.

@se-schmitt
Author

Breaking changes could be avoided by naming the gelu without approximation something like gelu_full or gelu_noapprox, if that is preferred. This would also solve the original problem.

@chengchingwen
Member

This kind of implementation change wasn't considered a breaking change before (#371 was released with v0.7.31). I think it won't be that breaking on the accuracy/value side, though the performance side might be.

@avik-pal
Member

avik-pal commented Feb 8, 2025

this causes slight but significant differences.

This would be true in the reverse direction as well, making any trained models that use the approx gelu impl have differences once this change is made.

I think doing a gelu_full / gelu_approx for this version and switching the impl once we do a major release would be a good fix.

@chengchingwen
Member

making any trained models that use the approx gelu impl have differences once this change is made.

It depends on the sensitivity of the trained model. Our sigmoid-approx gelu (not even a tanh-approx gelu) works quite well with many pretrained transformers, though indeed not all (see the sketch after this comment). I'm not against marking it as a breaking change, just pointing out the previous decision.

I think doing a gelu_full / gelu_approx for this version and switching the impl once we do a major release would be a good fix.

I would suggest using gelu_fast instead of gelu_approx so we can avoid changing two functions during the major release.
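
For reference, the "sigmoid-approx" gelu mentioned above is commonly written as x * σ(1.702x); a minimal sketch of that form (illustrative only, not necessarily the exact code used in Transformers.jl or NNlib.jl):

```julia
# Sigmoid approximation of gelu, x * σ(1.702x) (illustrative sketch):
sigmoid(x) = 1 / (1 + exp(-x))
gelu_sigmoid_approx(x) = x * sigmoid(1.702x)

gelu_sigmoid_approx(1.5f0)
```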

@se-schmitt
Author

I made the changes in the PR accordingly (gelu is now the approximated (old) gelu, and gelu_full has been added). I'd add the changes for the major release in a new PR, if those changes are desired.
