Consider removing steepness parameter of softplus #645

Closed
shiyi9801 opened this issue Apr 18, 2024 · 6 comments · Fixed by #651

Comments

@shiyi9801
Contributor

This was raised by @a-sully in CL review, thanks!

Softplus calculates ln(1 + exp(steepness * x)) / steepness; when steepness is 0 this results in division by zero.

I tried PyTorch's torch.nn.Softplus(beta=0) and the results are all inf. TF and ONNX don't have this attribute, and DirectML doesn't support steepness < 1.0.
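
A minimal sketch reproducing that observation (assuming recent PyTorch behavior, consistent with the report above; PyTorch does not reject beta=0, so the internal ln(1 + exp(0)) / 0 division produces inf):

import torch

# beta is PyTorch's name for the steepness parameter; with beta=0 every
# element computes ln(1 + exp(0)) / 0, which is +inf
s = torch.nn.Softplus(beta=0)
x = torch.tensor([0.5, -1.25, 2.0], dtype=torch.float32)
print(s(x))
# tensor([inf, inf, inf])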

@shiyi9801 shiyi9801 changed the title Specify the behavior of sofplus when the steepness is 0 Specify the behavior of softplus when the steepness is 0 Apr 18, 2024
@huningxin
Contributor

Does a negative steepness value make sense? softplus should produce positive results (as a smooth approximation to relu), but a negative steepness would produce negative values.

@a-sully
Contributor

a-sully commented Apr 18, 2024

I think it's worth taking a step back and asking whether steepness is needed for this operator in the first place...

As mentioned above, TF and ONNX only support a more basic variant of softplus which computes log(1 + e^x) elementwise. This also matches the behavior of CoreML's softplus (though CoreML also supports a "parametric" variant which computes alpha_i * log(1 + e^(beta_i * x_i)). This more generic operator can emulate the variant specified by DML without introducing the undefined behavior of division by 0, and it will also happily accept negative values for alpha and beta).
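
A minimal sketch of that emulation (PyTorch ops used purely for illustration; the helper names softplus_steepness and parametric_softplus are hypothetical). Setting alpha = 1 / steepness and beta = steepness reproduces the DML-style variant:

import torch

def softplus_steepness(x, steepness):
    # DML-style softplus: ln(1 + exp(steepness * x)) / steepness
    return torch.log1p(torch.exp(steepness * x)) / steepness

def parametric_softplus(x, alpha, beta):
    # CoreML-style parametric softplus: alpha * ln(1 + exp(beta * x))
    return alpha * torch.log1p(torch.exp(beta * x))

x = torch.tensor([0.5, -1.25, 2.0, 0.0])
steepness = 4.0
print(softplus_steepness(x, steepness))
print(parametric_softplus(x, 1.0 / steepness, steepness))
# Both calls print the same values.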

How important is steepness? Many ML frameworks don't support it - including ONNX, which (for now, at least) is the primary consumer of WebNN. What would be the impacts of removing it?

@fdwr
Collaborator

fdwr commented Apr 20, 2024

The expected results of division by zero for floating-point values are:

  • 0 / 0 = NaN (indeterminate)
  • positiveValue / 0 = infinity
  • negativeValue / 0 = -infinity

(unlike division by zero for integers, there's nothing ambiguous in the IEEE standard for floating point)
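
A quick illustration of those rules (a minimal sketch; tensors are used because Python scalars raise ZeroDivisionError, while float tensors follow IEEE 754 semantics):

import torch

zero = torch.tensor(0.0)
print(torch.tensor(0.0) / zero)   # tensor(nan), the indeterminate case
print(torch.tensor(2.0) / zero)   # tensor(inf)
print(torch.tensor(-2.0) / zero)  # tensor(-inf)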

Also DirectML doesn't support steepness < 1.0.

It does now 😉. Coincidentally, we relaxed DirectML's SOFTPLUS validation in February to permit steepness < 1, including negative values and even 0, for the sake of PyTorch and potentially WebNN (but that release is not out yet, and the current docs still reflect DML 1.13).

Does a negative steepness value make sense?

🤔 I don't know the use case, but like you say, the graph is smooth, and PyTorch supports it without complaint:

import torch

s = torch.nn.Softplus(beta=-1.0)
x = torch.tensor([0.5930860043, 0.9014285803, -0.6331304312, 0.4639878273], dtype=torch.float32)
y = s(x)

print("value:", y)
print("shape:", y.shape)
print("dtype:", y.dtype)

# value: tensor([-0.4399, -0.3407, -1.0590, -0.4878])
# shape: torch.Size([4])
# dtype: torch.float32

Other libraries would need to support it via decomposition (assuming the parameter was kept), in which case the same question would arise anyway, just in the div operator instead. I feel the cleanest way to answer questions like this for operators is to simply ask: what result would an equivalent decomposition produce?
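
A minimal sketch of that decomposition (plain softplus plus mul and div, written with PyTorch ops for illustration; the function name is hypothetical). With steepness = 0 the div step follows IEEE 754 semantics and yields inf rather than an error:

import torch

def softplus_with_steepness(x, steepness):
    scaled = x * steepness                             # mul
    activated = torch.nn.functional.softplus(scaled)   # plain softplus: ln(1 + e^x)
    return activated / steepness                       # div

x = torch.tensor([0.5, -1.25, 2.0])
print(softplus_with_steepness(x, 4.0))
print(softplus_with_steepness(x, 0.0))
# The second call prints tensor([inf, inf, inf]): softplus(0) = ln(2), and a
# positive value divided by 0 is +inf, matching torch.nn.Softplus(beta=0) above.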

@fdwr
Collaborator

fdwr commented Apr 20, 2024

I think it's worth taking a step back and asking whether steepness is needed for this operator in the first place
What would be the impacts of removing it?

Considerations I see include:

  • semantics - does it rightly belong there for some reason per a paper that originally introduced it?
  • front-end complexity - how much easier does it make WebNN caller's lives?
  • back-end complexity and WPT conformance - how much more complex does it make backends?
  • usage - how many models actually use softplus steepness != 1?
  • performance - how much performance does this fusion gain?
  • precision - fused operators avoid truncating intermediate results when flushing them to memory.

Semantics: I would be interested to know where this parameter came from in the first place, like maybe a paper that introduced it, and why PyTorch has it.

Front-end complexity: Currently the biggest known front-end is the ORT WebNN EP graph builder, which just passes the default steepness (=1) to WebNN. Some small performance could be gained if the builder looked one operator before and after for a mul & div (or mul & recip+mul) to fuse, but the salient question is how often that occurs (see below). If a web version of PyTorch called WebNN, a compatible softplus would make it a little easier, but composing mul & softplus & div isn't hard.

Backend complexity and WPT complexity: If only one front-end caller (PyTorch) supports it and only one backend (DML) supports it, then keeping it is more dubious. Removing steepness simplifies WPT and conformance some.

Usage: Scanning 700 models I have locally, I see very few that even use the softplus activation to begin with. A notable one is Yolo V4, but it just uses steepness = 1; another internal product model uses a steepness of 4 (which, when converted to ONNX, becomes a mul and recip & mul), but it only has 2 softplus nodes in the graph:

[screenshot of the ONNX graph excerpt omitted]

*of course my little hard drive collection doesn't represent the full world of ML 🌍, but a 🍰 of it.

Performance: Since GPUs are primarily memory bound for very simple math operations, having 2 extra intermediate tensors to write out and read back reduces perf by 3x for the pattern mul & softplus & div.

Precision: For float16 tensors, computing float32 intermediate values (for the ln(1 + exp(x)) part) and truncating them into a float16 intermediate tensor is lossier than computing the fused pattern entirely in float32. The difference is small though, probably not more than 2-3 ULP.
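
A minimal sketch of that comparison (assumptions: float16 inputs, PyTorch used to model both paths, and a float64 evaluation as the reference):

import torch

x16 = torch.randn(1 << 16).half()
steepness = 4.0

# Fused: one float32 evaluation, truncated to float16 once at the end
fused = (torch.nn.functional.softplus(x16.float() * steepness) / steepness).half()

# Decomposed: each step truncates its float32 result to a float16 intermediate tensor
scaled = (x16.float() * steepness).half()
activated = torch.nn.functional.softplus(scaled.float()).half()
decomposed = (activated.float() / steepness).half()

# Float64 reference for measuring the rounding error of each path
reference = torch.nn.functional.softplus(x16.double() * steepness) / steepness
print("fused max error:     ", (fused.double() - reference).abs().max().item())
print("decomposed max error:", (decomposed.double() - reference).abs().max().item())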

Weirdly, I feel like we already discussed this before, but I can't find the issue 🤷

Separate issue for steepness parameter removal? Or retitle this post? (because the original question about the expected value is answered)

@huningxin
Contributor

Separate issue for steepness parameter removal? Or retitle this post? (because the original question about the expected value is answered)

@fdwr, thanks for your nice summary. I think we can just retitle this post. And removing steepness makes sense to me.

@shiyi9801 shiyi9801 changed the title Specify the behavior of softplus when the steepness is 0 Consider removing steepness parameter of softplus Apr 22, 2024
@wacky6

wacky6 commented Apr 22, 2024

:) I think removing steepness is fine.

From an API design perspective, adding it later is much easier than deprecation (if we find steepness isn't a good fit down the line).
