"Full" gelu without approximation #628
Comments
One idea would be to rename the current …
I added a PR and named the functions as suggested. I think the comment means that the configs of Transformers.jl should match the HuggingFace nomenclature (otherwise they wouldn't be loaded automatically). With the submitted PR here, that would be possible, as it enables the correct assignment of the HuggingFace …
Yes, chengchingwen/Transformers.jl#209 (comment) is simply saying we need separate functions for gelu with and without the tanh approximation, and it would be nice if NNlib.jl had both.
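For illustration, a minimal sketch of what two separate definitions could look like. The names gelu_erf and gelu_tanh are placeholders (not necessarily what the PR uses), and erf is assumed to come from SpecialFunctions.jl:

```julia
using SpecialFunctions: erf   # erf is not in Base Julia

# "Full" GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
gelu_erf(x) = x * (1 + erf(x / sqrt(oftype(float(x), 2)))) / 2

# Tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
function gelu_tanh(x)
    λ = oftype(float(x), sqrt(2 / π))
    α = oftype(float(x), 0.044715)
    x * (1 + tanh(λ * muladd(α, x^3, x))) / 2
end
```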
Changing the implementation of a function as common as …
Breaking changes could be avoided by naming the gelu without approximation something like …
This kind of implementation change wasn't considered a breaking change before (#371 was released with …
This would be true in the reverse direction as well: any trained models that use the approximate gelu implementation would show differences once this change is made. I think doing a …
It depends on the sensitivity of the trained model. Our sigmoid-approx-gelu (not even a tanh-approx-gelu) works quite well with many pretrained transformers, but indeed not all. I'm not against marking it a breaking change though, just pointing out the previous decision.
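For context, a quick numeric comparison of the variants mentioned in this thread. It assumes "sigmoid-approx" means the x * σ(1.702x) form from the GELU paper, which may not be exactly what is used here:

```julia
using SpecialFunctions: erf

gelu_exact(x)   = x * (1 + erf(x / sqrt(2))) / 2
gelu_tanh(x)    = x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3))) / 2
gelu_sigmoid(x) = x / (1 + exp(-1.702x))   # x * σ(1.702x)

# Print the three variants side by side at a few sample points
for x in (-2.0, -0.5, 0.5, 2.0)
    println((x, gelu_exact(x), gelu_tanh(x), gelu_sigmoid(x)))
end
```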
I would suggest using …
I made the changes in the PR accordingly (…)
Motivation and description
Currently, only the tanh approximation of the gelu activation function is implemented. But the full gelu (0.5x(1 + erf(x/sqrt(2)))) is the default in PyTorch and TensorFlow and is used in many pretrained models (like BERT). When using pretrained models from HuggingFace, e.g. through Transformers.jl, this causes slight but significant differences. Hence, it would be nice to have the option to use the "full" gelu (see also this PR in Transformers.jl).

Possible Implementation
I would make a PR to add the full gelu. Options:

- add gelu_full -> least impact
- add the full gelu as gelu and rename the existing gelu to e.g. gelu_tanh -> impacts existing code
- gelu(x; approximate=false)=... -> not sure if this is easily possible, as the keyword needs to be propagated through the code (see the sketch below)

What would be the best option?
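As a rough sketch of the keyword-argument option only: the name, the default value, and the type handling are illustrative, and the broadcasting/AD plumbing NNlib would actually need is not shown.

```julia
using SpecialFunctions: erf

# One gelu with a keyword switch; the default chosen here is arbitrary.
function gelu(x; approximate::Bool=false)
    if approximate
        x * (1 + tanh(sqrt(2 / π) * (x + 0.044715x^3))) / 2   # tanh approximation
    else
        x * (1 + erf(x / sqrt(2))) / 2                        # exact / "full" gelu
    end
end

gelu(0.5)                     # exact form
gelu(0.5; approximate=true)   # tanh approximation
```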