"Full" gelu without approximation #628
Comments
One idea would be to rename the current …
I added a PR and named the functions as suggested. I think the comment means that the configs of Transformers.jl should match the HuggingFace nomenclature (otherwise they wouldn't be loaded automatically). With the submitted PR here, that would be possible, as it enables the correct assignment of the HuggingFace …
Yes, chengchingwen/Transformers.jl#209 (comment) is simply saying we need separate functions for gelu with and without the tanh approximation, and it would be nice if NNlib.jl had both.
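For illustration, a minimal sketch of what two separate definitions could look like. The names gelu_erf and gelu_tanh are placeholders (not necessarily what the PR uses), and erf is assumed to come from SpecialFunctions.jl:

```julia
using SpecialFunctions: erf   # erf is not in Base Julia

# "Full" GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
gelu_erf(x) = x * (1 + erf(x / sqrt(oftype(float(x), 2)))) / 2

# Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
function gelu_tanh(x)
    λ = oftype(float(x), sqrt(2 / π))
    α = oftype(float(x), 0.044715)
    x * (1 + tanh(λ * muladd(α, x^3, x))) / 2
end
```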
Changing the implementation of a function as common as …
Breaking changes could be avoided by naming the gelu without approximation something like …
This kind of implementation change wasn't considered a breaking change before (#371 was released with …
This would be true in the reverse direction as well: any trained models that use the approximate gelu implementation would show differences once this change is made. I think doing a …
It depends on the sensitivity of the trained model. Our sigmoid-approx-gelu (not even a tanh-approx-gelu) works quite well with many pretrained transformers, but indeed not all. I'm not against marking it a breaking change though, just pointing out the previous decision.
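For context, a quick numeric comparison of the variants mentioned in this thread. It assumes "sigmoid-approx" means the x * σ(1.702x) form from the GELU paper, which may not be exactly what is used here:

```julia
using SpecialFunctions: erf

gelu_exact(x)   = x * (1 + erf(x / sqrt(2))) / 2
gelu_tanh(x)    = x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3))) / 2
gelu_sigmoid(x) = x / (1 + exp(-1.702x))   # x * σ(1.702x)

# Print the three variants side by side at a few sample points
for x in (-2.0, -0.5, 0.5, 2.0)
    println((x, gelu_exact(x), gelu_tanh(x), gelu_sigmoid(x)))
end
```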
I would suggest using …
I made the changes in the PR accordingly (…)
Motivation and description
Currently, only the tanh approximation of the gelu activation function is implemented. But the full gelu (0.5x(1 + erf(x/sqrt(2)))) is the default in PyTorch and TensorFlow and is used in many pretrained models (like BERT). When using pretrained models from HuggingFace, e.g. through Transformers.jl, this causes slight but significant differences. Hence, it would be nice to have the option to use the "full" gelu (see also this PR in Transformers.jl).

Possible Implementation
I would make a PR to add the full gelu. Options:

- add gelu_full -> least impact
- add the full gelu as gelu and rename the existing gelu to e.g. gelu_tanh -> impacts existing code
- gelu(x; approximate=false)=... -> not sure if this is easily possible, as the keyword needs to be propagated through the code (see the sketch below)

What would be the best option?
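As a rough sketch of the keyword-argument option only: the name, the default value, and the type handling are illustrative, and the broadcasting/AD plumbing NNlib would actually need is not shown.

```julia
using SpecialFunctions: erf

# One gelu with a keyword switch; the default chosen here is arbitrary.
function gelu(x; approximate::Bool=false)
    if approximate
        x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3))) / 2   # tanh approximation
    else
        x * (1 + erf(x / sqrt(2))) / 2                        # exact / "full" gelu
    end
end

gelu(0.5)                     # exact form
gelu(0.5; approximate=true)   # tanh approximation
```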