Should softmax be a fused activation ? #471

Closed

mingmingtasd opened this issue Oct 19, 2023 · 11 comments
@mingmingtasd
Contributor

(Raised by @junwei in Chromium CL review https://chromium-review.googlesource.com/c/chromium/src/+/4920337/comment/6ed907ff_2f86a723/)

WebNN's softmax can be an MLActivation.

However, DirectML does not support fusing softmax, according to the doc: https://learn.microsoft.com/en-us/windows/ai/directml/dml-fused-activations

/cc @fdwr @huningxin

mingmingtasd changed the title from "Should softmax be an fused activation ?" to "Should softmax be a fused activation ?" on Oct 19, 2023
@fdwr
Collaborator

fdwr commented Oct 19, 2023

However, DirectML does not support fusing softmax, according to the doc

It looks like a documentation issue, because the actual code accepts it. Will follow up with team and doc writer...

[update] It was added in DML 1.12. So for older DML targets, we could always issue a separate Softmax call upon seeing that activation.
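
For illustration, a minimal sketch of the fallback decision described above (all names are hypothetical; this is not actual DirectML or Chromium code):

```ts
// Hypothetical backend lowering step: fuse softmax into the convolution only
// when the runtime supports it (softmax fusion was added in DML 1.12 per the
// comment above); otherwise dispatch the convolution and softmax separately.
interface BackendCaps {
  supportsSoftmaxFusion: boolean; // e.g. derived from the DML feature level
}

function lowerConvWithSoftmaxActivation(
  caps: BackendCaps,
  emit: (op: string) => void
): void {
  if (caps.supportsSoftmaxFusion) {
    emit("CONV2D(activation=SOFTMAX)"); // single fused dispatch
  } else {
    emit("CONV2D");  // fallback: two separate dispatches,
    emit("SOFTMAX"); // functionally equivalent to the fused form
  }
}
```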

@inexorabletash
Member

Given the above, implementations can either (1) only support backends that can fuse softmax (e.g. by bundling a backend, refusing to run on older backends, etc.) or (2) implement support in user-land, so I think we can close this out.

Do you agree, @fdwr?

@fdwr
Collaborator

fdwr commented Apr 19, 2024

Given the above, implementations can either (1) only support backends that can fuse softmax (e.g. by bundling a backend, refusing to run on older backends, etc.) or (2) implement support in user-land, so I think we can close this out.

Do you agree, @fdwr?

@inexorabletash Yes, in any case, this is closeable because the backend can choose to execute it separately. I don't think it's good to croak on the model (your option 1), as backends should instead gracefully execute the convolution and the MLConv2dOptions::activation separately. @mingmingtasd: Concur?
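
For concreteness, a minimal sketch of the two graph shapes being discussed, written against the WebNN API as it looked while conv2d still took an activation option (later removed by #664); the builder calls are illustrative rather than normative:

```ts
// Assumes a WebNN MLGraphBuilder `builder` and operands `input` and `filter`
// already exist; typed loosely since this is a sketch, not spec IDL.
declare const builder: any;
declare const input: unknown;
declare const filter: unknown;

// (a) Fused form: softmax attached as the convolution's activation. The
// zero-argument builder.softmax() activation is the hypothetical MLActivation
// form this issue debates; it is only expressible where the backend can fuse.
const fused = builder.conv2d(input, filter, { activation: builder.softmax() });

// (b) Decomposed form: the graceful fallback a backend can always use,
// running the convolution and then softmax as a separate op.
const conv = builder.conv2d(input, filter);
const decomposed = builder.softmax(conv);
```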

@huningxin
Contributor

as backends should instead gracefully execute the convolution and the MLConv2dOptions::activation separately.

Should we add a note to the spec mentioning this fallback path?

@wacky6

wacky6 commented Apr 22, 2024

Purely as a thought experiment:

Softmax requires summing over a dimension (out[i] = exp(input[i]) / sum(exp(input))). In this sense, softmax feels more like a normalization than an activation.

Other activations defined in the spec are element-wise (out[i] = fn(input[i])), so they can be trivially applied when filling the output buffer.

Just curious, how could a fused softmax improve performance (or reduce memory consumption)? How is a fused softmax activation different from syntactic sugar for "Op + Softmax"?

Intuitively, the entire pre-softmax output (i.e. the intermediate result) needs to be computed before softmax can produce the result (because of the dependency on sum(exp(input))).

So it seems the intermediate result is unavoidable (as opposed to being inlined for activations such as ReLU), negating the benefit of a fused activation.
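
To make the contrast concrete, a small sketch (plain TypeScript over number arrays standing in for tensors): an element-wise activation can be applied as each output element is written, while softmax needs the full row before any output element can be finalized.

```ts
// Element-wise activation: out[i] depends only on input[i], so it can be
// folded into the producing op's store of each element.
const relu = (x: number): number => Math.max(0, x);

// Softmax over one dimension: every output element depends on the whole row
// (its max for numerical stability, and the sum of exponentials), so the
// complete pre-softmax row must exist first.
function softmaxRow(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```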

@fdwr
Collaborator

fdwr commented Apr 23, 2024

In this sense, softmax feels more like a normalization than an activation.

@wacky6: Softmax is also normalization. I tried stuffing ML ops into a few tidy categories here of my own arbitrary grouping, but then I realized they overlap (e.g. softmax in the table).

@fdwr
Collaborator

fdwr commented May 2, 2024

Related: #658

how could a fused softmax improve performance (or reduce memory consumption)? How is a fused softmax activation different from syntactic sugar for "Op + Softmax"?

@wacky6: I haven't seen the algorithm firsthand, but I recall there's a clever optimization where GPU/NPU drivers can fold the reduction for softmax's denominator directly into the convolution pass, without needing a separate re-summation.
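
One published technique in this spirit is "online softmax" (Milakov & Gimelshein, 2018), where a running max and a rescaled running sum are maintained as each pre-softmax value is produced, so the denominator is ready as soon as the producing pass finishes. The sketch below is illustrative and not a claim about what any particular driver does:

```ts
// Online softmax: accumulate the normalizer in the same pass that produces the
// values, instead of a separate max pass plus a separate sum pass afterwards.
function onlineSoftmax(values: number[]): number[] {
  let runningMax = -Infinity;
  let runningSum = 0;
  for (const v of values) {
    const newMax = Math.max(runningMax, v);
    // Rescale the accumulated sum whenever the running max increases.
    runningSum = runningSum * Math.exp(runningMax - newMax) + Math.exp(v - newMax);
    runningMax = newMax;
  }
  // Final element-wise normalization, analogous to an ordinary fused activation.
  return values.map((v) => Math.exp(v - runningMax) / runningSum);
}
```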

Should we add a note to the spec mentioning this fallback path?

@huningxin: Alternatively, if we delete the 3 activation fields per issue 658 from @a-sully, then this issue becomes moot, as we'd instead add a note about backend fusions rather than a note about backend unfusions.

@a-sully
Contributor

a-sully commented May 2, 2024

@huningxin: Alternatively, if we delete the 3 activation fields per issue 658 from @a-sully, then this issue becomes moot, as we'd instead add a note about backend fusions rather than a note about backend unfusions.

Agreed. I propose we close this issue once #664 merges.

@huningxin
Contributor

@huningxin: Alternatively, if we delete the 3 activation fields per issue 658 from @a-sully, then this issue becomes moot, as we'd instead add a note about backend fusions rather than a note about backend unfusions.

Agreed. I propose we close this issue once #664 merges.

SGTM!

@a-sully
Contributor

a-sully commented May 6, 2024

#664 merged 🎉

@fdwr
Collaborator

fdwr commented May 6, 2024

Closing, as the activation field is deleted now, and the DML backend will decide whether to fuse based on the DML feature level (a tangentially related enabling CR: https://chromium-review.googlesource.com/c/chromium/src/+/5501114).

fdwr closed this as completed on May 6, 2024