Consider removing steepness parameter of softplus #645

Closed
shiyi9801 opened this issue Apr 18, 2024 · 6 comments · Fixed by #651

Comments

@shiyi9801
Contributor

This was raised by @a-sully in CL review, thanks!

Softplus calculates ln(1 + exp(steepness * x)) / steepness; when steepness is 0 this results in division by zero.

I tried PyTorch's torch.nn.Softplus(beta=0) and the results are all inf. TF and ONNX don't have this attribute, and DirectML doesn't support steepness < 1.0.
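
A minimal sketch reproducing that observation (assuming recent PyTorch behavior, consistent with the report above; PyTorch does not reject beta=0, so the internal ln(1 + exp(0)) / 0 division produces inf):

import torch

# beta is PyTorch's name for the steepness parameter; with beta=0 every
# element computes ln(1 + exp(0)) / 0, which is +inf
s = torch.nn.Softplus(beta=0)
x = torch.tensor([0.5, -1.25, 2.0], dtype=torch.float32)
print(s(x))
# tensor([inf, inf, inf])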

@shiyi9801 shiyi9801 changed the title Specify the behavior of sofplus when the steepness is 0 Specify the behavior of softplus when the steepness is 0 Apr 18, 2024
@huningxin
Contributor

Does a negative steepness value make sense? softplus should produce positive results (as a smooth approximation to relu), but a negative steepness would produce negative values.

@a-sully
Contributor

a-sully commented Apr 18, 2024

I think it's worth taking a step back and asking whether steepness is needed for this operator in the first place...

As mentioned above, TF and ONNX only support a more basic variant of softplus which computes log(1 + e^x) elementwise. This also matches the behavior of CoreML's softplus (though CoreML also supports a "parametric" variant which computes alpha_i * log(1 + e^(beta_i * x_i)). This more generic operator can emulate the variant specified by DML without introducing the undefined behavior of division by 0, and it will also happily accept negative values for alpha and beta).
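
A minimal sketch of that emulation (PyTorch ops used purely for illustration; the helper names softplus_steepness and parametric_softplus are hypothetical). Setting alpha = 1 / steepness and beta = steepness reproduces the DML-style variant:

import torch

def softplus_steepness(x, steepness):
    # DML-style softplus: ln(1 + exp(steepness * x)) / steepness
    return torch.log1p(torch.exp(steepness * x)) / steepness

def parametric_softplus(x, alpha, beta):
    # CoreML-style parametric softplus: alpha * ln(1 + exp(beta * x))
    return alpha * torch.log1p(torch.exp(beta * x))

x = torch.tensor([0.5, -1.25, 2.0, 0.0])
steepness = 4.0
print(softplus_steepness(x, steepness))
print(parametric_softplus(x, 1.0 / steepness, steepness))
# Both calls print the same values.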

How important is steepness? Many ML frameworks don't support it - including ONNX, which (for now, at least) is the primary consumer of WebNN. What would be the impacts of removing it?

@fdwr
Collaborator

fdwr commented Apr 20, 2024

The expected results of division by zero for floating-point values are:

  • 0 / 0 = NaN (indeterminate)
  • positiveValue / 0 = infinity
  • negativeValue / 0 = -infinity

(unlike division by zero for integers, there's nothing ambiguous in the IEEE standard for floating point)
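
A quick illustration of those rules (a minimal sketch; tensors are used because Python scalars raise ZeroDivisionError, while float tensors follow IEEE 754 semantics):

import torch

zero = torch.tensor(0.0)
print(torch.tensor(0.0) / zero)   # tensor(nan), the indeterminate case
print(torch.tensor(2.0) / zero)   # tensor(inf)
print(torch.tensor(-2.0) / zero)  # tensor(-inf)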

Also DirectML doesn't support steepness < 1.0.

It does now 😉. Coincidentally, we relaxed DirectML's SOFTPLUS validation in February to permit steepness < 1, including negative values and even 0, for the sake of PyTorch and potentially WebNN (but that release is not out yet, and the current docs still reflect DML 1.13).

Does a negative steepness value make sense?

🤔 I don't know the use case, but like you say, the graph is smooth, and PyTorch supports it without complaint:

import torch

s = torch.nn.Softplus(beta=-1.0)
x = torch.tensor([0.5930860043, 0.9014285803, -0.6331304312, 0.4639878273], dtype=torch.float32)
y = s(x)

print("value:", y)
print("shape:", y.shape)
print("dtype:", y.dtype)

# value: tensor([-0.4399, -0.3407, -1.0590, -0.4878])
# shape: torch.Size([4])
# dtype: torch.float32

Other libraries would need to support it via decomposition (assuming the parameter was kept), in which case the same question would arise anyway, just in the div operator instead. I feel the cleanest way to answer questions like this for operators is to simply ask: what result would an equivalent decomposition produce?
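
A minimal sketch of that decomposition (plain softplus plus mul and div, written with PyTorch ops for illustration; the function name is hypothetical). With steepness = 0 the div step follows IEEE 754 semantics and yields inf rather than an error:

import torch

def softplus_with_steepness(x, steepness):
    scaled = x * steepness                             # mul
    activated = torch.nn.functional.softplus(scaled)   # plain softplus: ln(1 + e^x)
    return activated / steepness                       # div

x = torch.tensor([0.5, -1.25, 2.0])
print(softplus_with_steepness(x, 4.0))
print(softplus_with_steepness(x, 0.0))
# The second call prints tensor([inf, inf, inf]): softplus(0) = ln(2), and a
# positive value divided by 0 is +inf, matching torch.nn.Softplus(beta=0) above.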

@fdwr
Collaborator

fdwr commented Apr 20, 2024

I think it's worth taking a step back and asking whether steepness is needed for this operator in the first place
What would be the impacts of removing it?

Considerations I see include:

  • semantics - does it rightly belong there for some reason per a paper that originally introduced it?
  • front-end complexity - how much easier does it make WebNN caller's lives?
  • back-end complexity and WPT conformance - how much more complex does it make backends?
  • usage - how many models actually use softplus steepness != 1?
  • performance - how much performance does this fusion gain?
  • precision - fused operators avoid truncating intermediate results when flushing them to memory.

Semantics: I would be interested to know where this parameter came from in the first place, like maybe a paper that introduced it, and why PyTorch has it.

Front-end complexity: Currently the biggest known front-end is the ORT WebNN EP graph builder, which just passes the default steepness (=1) to WebNN. Some small performance could be gained if the builder looked one operator before and after for a mul & div (or mul & recip+mul) to fuse, but the salient question is how often that occurs (see below). If a web version of PyTorch called WebNN, a compatible softplus would make it a little easier, but composing mul & softplus & div isn't hard.

Backend complexity and WPT complexity: If only one front-end caller (PyTorch) supports it and only one backend (DML) supports it, then keeping it is more dubious. Removing steepness simplifies WPT and conformance some.

Usage: Scanning 700 models I have locally, I see very few that even use the softplus activation to begin with. A notable one is Yolo V4, but it just uses steepness = 1; another internal product model uses a steepness of 4 (which, when converted to ONNX, becomes a mul and recip & mul), but it only has 2 softplus nodes in the graph:

[screenshot of the ONNX graph excerpt omitted]

*of course my little hard drive collection doesn't represent the full world of ML 🌍, but a 🍰 of it.

Performance: Since GPUs are primarily memory bound for very simple math operations, having 2 extra intermediate tensors to write out and read back reduces perf by 3x for the pattern mul & softplus & div.

Precision: For float16 tensors, computing float32 intermediate values (for the ln(1 + exp(x)) part) and truncating them into a float16 intermediate tensor is lossier than computing the fused pattern entirely in float32. The difference is small though, probably not more than 2-3 ULP.
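
A minimal sketch of that comparison (assumptions: float16 inputs, PyTorch used to model both paths, and a float64 evaluation as the reference):

import torch

x16 = torch.randn(1 << 16).half()
steepness = 4.0

# Fused: one float32 evaluation, truncated to float16 once at the end
fused = (torch.nn.functional.softplus(x16.float() * steepness) / steepness).half()

# Decomposed: each step truncates its float32 result to a float16 intermediate tensor
scaled = (x16.float() * steepness).half()
activated = torch.nn.functional.softplus(scaled.float()).half()
decomposed = (activated.float() / steepness).half()

# Float64 reference for measuring the rounding error of each path
reference = torch.nn.functional.softplus(x16.double() * steepness) / steepness
print("fused max error:     ", (fused.double() - reference).abs().max().item())
print("decomposed max error:", (decomposed.double() - reference).abs().max().item())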

Weirdly, I feel like we already discussed this before, but I can't find the issue 🤷

Separate issue for steepness parameter removal? Or retitle this post? (because the original question about the expected value is answered)

@huningxin
Contributor

Separate issue for steepness parameter removal? Or retitle this post? (because the original question about the expected value is answered)

@fdwr, thanks for your nice summary. I think we can just retitle this post. And removing steepness makes sense to me.

@shiyi9801 shiyi9801 changed the title Specify the behavior of softplus when the steepness is 0 Consider removing steepness parameter of softplus Apr 22, 2024
@wacky6

wacky6 commented Apr 22, 2024

:) I think removing steepness is fine.

From an API design perspective, adding it later is much easier than deprecation (if we find steepness isn't a good fit down the line).
