add mish kernel #569
Conversation
WindQAQ commented on Oct 5, 2019 (edited)
- Addresses "please add more activation functions" #437
- Adds a mish kernel (Mish: A Self Regularized Non-Monotonic Neural Activation Function)
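For context, Mish is defined as mish(x) = x * tanh(softplus(x)). Below is a minimal Python/TensorFlow sketch of the op being added; it is illustrative only — the PR itself implements this as a fused C++/Eigen kernel, and the function name here is just a placeholder.

```python
import tensorflow as tf

def mish(x):
    """Mish activation: x * tanh(softplus(x)).

    Naive reference implementation for illustration; the PR registers a
    fused C++/Eigen kernel rather than composing existing TF ops.
    """
    x = tf.convert_to_tensor(x)
    return x * tf.math.tanh(tf.math.softplus(x))

# Example usage
print(mish(tf.constant([-2.0, 0.0, 2.0])).numpy())
```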
@digantamisra98 Hello Diganta, would you mind taking a look at this? Also, I would like to add your name to the maintainer list if you agree. Thanks!
Hey @WindQAQ. Going through it. As of now, everything looks fine. The arXiv identifier in your first comment is incorrect though, just to point out.
Thanks for the review! I've already updated the maintainer list. Please feel free to ping me if the email address or the user ID is wrong 😃
@WindQAQ The ID and email are correct. I would also encourage you to take a look at this repository (based on CUDA): https://github.com/thomasbrandon/mish-cuda.
@digantamisra98 Eigen's GPUDevice is also based on CUDA (and ROCm). My previous experiment on gelu activations shows that Eigen's parallelism has the same performance as a customized CUDA kernel in TensorFlow, as well as in PyTorch.
Maybe next week. Thanks!
My performance test on gelu: TFA (Eigen's parallelism) vs PyTorch. Will do a thorough test of TFA versus PyTorch on activations after all of these get merged :P
Thanks for that notebook link. Now I have better clarity on the same. Keep me posted on the progress of the TFA/Torch tests. Thanks!
As the author of the PyTorch kernel linked by @digantamisra98, I'll just expand a little.
Storing intermediate values is kind of a trade-off, but as you noted, I will investigate it per combination of device/dtype. Regarding stability, I am aware the softplus operation in the forward pass is very likely to underflow/overflow. Will see how you deal with the backward gradients too! Thank you so much for the suggestion!
Yeah, the intermediate stuff was mainly as I gather that's what @digantamisra98 was referring to with "not sure, it is the right approach (Since it still doesn't work pretty well on double precision)".
Hi @digantamisra98 @thomasbrandon, I have already dealt with the precision problem. The softplus part is mainly copied from core TF, and the backward part is from @thomasbrandon's implementation. Would you mind reviewing the PR again? Thank you so much for your time.
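For readers following the stability discussion, here is a rough Python/TensorFlow sketch of the idea. The threshold-based softplus mirrors in spirit how core TF guards against overflow/underflow, and the gradient uses the common closed form for Mish; the PR's actual C++/Eigen code may differ, so treat both functions as illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def stable_softplus(x):
    """Softplus with overflow/underflow guards, similar in spirit to core TF's op.

    For large x, softplus(x) ~= x (avoids overflowing exp); for very negative x,
    softplus(x) ~= exp(x) (avoids log1p losing precision on tiny values).
    """
    x = tf.convert_to_tensor(x)
    threshold = np.log(np.finfo(x.dtype.as_numpy_dtype).eps) + 2.0
    too_large = x > -threshold
    too_small = x < threshold
    # Zero out the argument of exp in the branch that would overflow,
    # so the discarded value cannot become inf.
    exp_x = tf.exp(tf.where(too_large, tf.zeros_like(x), x))
    return tf.where(too_large, x,
                    tf.where(too_small, exp_x, tf.math.log1p(exp_x)))

def mish_grad(x):
    """d/dx of mish(x) = x * tanh(softplus(x)).

    Closed form: tanh(sp(x)) + x * sigmoid(x) * (1 - tanh(sp(x))^2),
    where sp is softplus. Illustrative only; not the PR's kernel code.
    """
    x = tf.convert_to_tensor(x)
    tsp = tf.math.tanh(stable_softplus(x))
    return tsp + x * tf.math.sigmoid(x) * (1.0 - tf.math.square(tsp))
```

A finite-difference or tf.GradientTape comparison against the naive x * tanh(softplus(x)) forward pass is a cheap way to sanity-check a hand-written gradient like this, including at double precision, where the issue mentioned above was observed.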
@WindQAQ will check tomorrow and let you know. Thanks!
@WindQAQ I did a quick check and was going through the log of the failed Ubuntu GPU build, but couldn't interpret it correctly. Could you clarify it a bit?
It seems to be an upstream issue.
@digantamisra98 would you mind taking a look again? The failure was fixed in upstream TensorFlow.
@seanpmorgan yeah, checked, all good. Thank you for notifying me.
@WindQAQ mind resolving conflicts when you get a chance?
LGTM, thanks for the contribution @WindQAQ, and @digantamisra98 for the review.
To whom it may concern about speed: I benchmarked some activations in tensorflow/tensorflow-addons against PyTorch's implementations. I am not testing XLA and JIT here :-) Because it's hard to build the toolchain on Colab, @thomasbrandon would you mind installing https://colab.research.google.com/drive/1zKuef-upkN_4jFnBRoHLk06xmtIDRemi
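For anyone who wants to reproduce a rough version of such a comparison, here is a minimal timing sketch in eager TensorFlow. The tensor shape, iteration counts, and the assumption that tfa.activations.mish is available (i.e. a build containing this PR) are mine, not the notebook's.

```python
import time
import tensorflow as tf
import tensorflow_addons as tfa  # assumes a build that ships tfa.activations.mish

def bench(fn, x, warmup=10, iters=100):
    """Average wall-clock time of fn(x); .numpy() blocks until the op finishes."""
    for _ in range(warmup):
        fn(x).numpy()          # warm-up and force execution
    start = time.perf_counter()
    for _ in range(iters):
        fn(x).numpy()
    return (time.perf_counter() - start) / iters

x = tf.random.normal([1024, 1024])
print("mish:", bench(tfa.activations.mish, x))
print("relu:", bench(tf.nn.relu, x))
```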
@seanpmorgan is there any scope for moving Mish into TF and Keras, since it is now included in PyTorch 1.9? Reference
@digantamisra98 We don't handle migrations in the TF ecosystem on our side anymore, as you can see in tensorflow/community#241 (comment). Also, as per keras-team/keras#13440, we don't know how the deprecation will be triggered/handled. The process isn't currently documented, and my best bet is to collaboratively expand and improve the
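In the meantime, Mish remains usable from Addons inside a Keras model. A small usage sketch, assuming a tensorflow-addons release that ships tfa.activations.mish (i.e. after this PR):

```python
import tensorflow as tf
import tensorflow_addons as tfa

# Any callable can be passed as a Keras activation, so tfa.activations.mish
# drops in directly without waiting for a core TF/Keras migration.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation=tfa.activations.mish),
    tf.keras.layers.Dense(10),
])
```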