Consider removing lstm and gru operators #689

Open · 2 of 4 tasks
a-sully opened this issue May 16, 2024 · 15 comments

@a-sully (Contributor) commented May 16, 2024

Proposal Status

From #689 (comment)


Related to #453. One recommendation of that issue is:

  • Remove model-specific instructions like LSTM and gru

Current support across backends

| Operator | DirectML | CoreML | TFLite |
| -------- | -------- | ------ | ------ |
| lstm     | ✅       | ✅     | ⚠️†    |
| gru      | ✅       | ✅     | -      |

†TFLite delegates generally do not support the entire TFLite opset. For instance, TFLite GPU delegates only support a very basic variant of LSTM which does not support most of the parameters specified by WebNN.

What does "supporting" LSTM really mean?

Higher-level operators such as lstm and gru tend to have more knobs to turn, and each additional knob increases the likelihood that WebNN will not actually be able to use the backend's operator of the same name. There are many variants of LSTM. Just because a backend has an LSTM operator does not mean that operator can express everything required by the variant of LSTM that WebNN specifies. Meanwhile, frameworks sitting on top of WebNN may only take advantage of WebNN’s lstm operator if it is exactly the variant the calling framework needs.

For example, TFLite's variant of LSTM uses coupled "input" and "forget" gates (CIFG), an option that is not available in WebNN (Chromium's DML implementation does not couple these gates). As a result, user agents implementing WebNN on TFLite cannot use TFLite's LSTM operator, and frameworks calling into WebNN cannot use WebNN's lstm operator if they want the CIFG behavior.
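To make the "knobs" concrete, here is a rough sketch (not spec text) of a WebNN lstm() call with its current option surface, assuming the operand variables were already created with builder.input()/builder.constant():

```js
// Sketch of WebNN's lstm() options at the time of writing; input, weight,
// recurrentWeight, bias, etc. are assumed to be previously built MLOperands.
const outputs = builder.lstm(input, weight, recurrentWeight, steps, hiddenSize, {
  bias,
  recurrentBias,
  peepholeWeight,
  initialHiddenState,
  initialCellState,
  returnSequence: true,
  direction: 'both',   // 'forward' | 'backward' | 'both'
  layout: 'iofg',      // gate ordering within the packed weights
  activations: [builder.sigmoid(), builder.tanh(), builder.tanh()],
});
// Notably, none of these options can express TFLite's CIFG variant.
```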

Let's look at the problem @philloooo mentioned about the activations supported by LSTM across various backends: #573 (comment)

Supported activation functions for LSTM

| Operator | ONNX | DirectML† | CoreML | TFLite | WebNN |
| --- | --- | --- | --- | --- | --- |
| affine | optional | - | - | - | - |
| elu | optional | ✅ | - | - | ✅ |
| celu | - | ✅ | - | - | - |
| gelu | - | ✅ | - | - | ✅ |
| hardSigmoid | optional | ✅ | ⚠️†† | - | ✅ |
| hardSwish | - | ✅ | - | - | ✅ |
| identity | - | ✅ | - | - | As operator only |
| leakyRelu | optional | ✅ | - | - | ✅ |
| linear | - | ✅ | ⚠️†† | - | ✅ |
| parametricSoftplus | - | ✅ | - | - | - |
| relu | ✅ | ✅ | ✅ | ✅ | ✅ |
| reluN1To1 | - | - | - | ✅ | - |
| relu6 | - | - | - | ✅ | - |
| scaledElu | - | ✅ | - | - | - |
| scaledTanh | optional | ✅ | ⚠️†† | - | - |
| shrink | - | ✅ | - | - | - |
| swish | - | ✅ | - | - | - |
| sigmoid | ✅ | ✅ | ✅ | - | ✅ |
| sign | - | - | - | ✅ | - |
| softplus | optional | ✅ | - | - | ✅ |
| softsign | optional | ✅ | - | - | ✅ |
| tanh | ✅ | ✅ | ✅ | ✅ | ✅ |
| thresholdedRelu | optional | ✅ | - | - | - |

†I couldn't find documentation in DirectML saying which activations are supported by which operators. @ folks familiar with DML, please feel free to chime in! (done! #689 (comment))

††Does not support passing alpha or beta values, as far as I can tell


Aside: Now that MLActivations are no longer used for op fusion, we should consider removing MLActivations which do not make sense for use with recurrent operators.


What activations can be specified on each backend?

Let's also remember that WebNN's lstm operator has multiple activations.

| Gates | DirectML† | CoreML | TFLite | WebNN |
| --- | --- | --- | --- | --- |
| input (i) | f() | recurrent_activation | fused_activation_function | activations[0] |
| output (o) | f() | recurrent_activation | fused_activation_function | activations[0] |
| forget (f) | f() | recurrent_activation | fused_activation_function | activations[0] |
| cell (g) | g() | cell_activation | always sigmoid, I think? | activations[1] |
| hidden (h) | h() | activation | always sigmoid, I think? | activations[2] |

†DML also supports passing different activations for LSTM's forward and backward passes
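For illustration, a hypothetical helper (not actual Chromium code) that follows the table above, mapping WebNN's activations list onto CoreML's per-gate parameters:

```js
// Hypothetical mapping of WebNN's activations[0..2] to CoreML's lstm
// activation parameters, following the table above.
function toCoreMlLstmActivations(activations) {
  const [gateActivation, cellActivation, hiddenActivation] = activations;
  return {
    recurrent_activation: gateActivation,  // input/output/forget gates, f()
    cell_activation: cellActivation,       // cell gate, g()
    activation: hiddenActivation,          // hidden/output, h()
  };
}
```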

Summary

Reviewing the Guidelines for new operations to retroactively evaluate whether lstm and gru meet them, this point stands out:

  • Prefer primitives over new high level operations but consider performance consequences

The rationale for having these operators in WebNN is that the user agent's implementation of WebNN can plumb this directly to the backend's operator of the same name; otherwise there's no benefit compared to having the framework sitting on top of WebNN decompose the operator itself.

While there is some overlap - DML and CoreML are the most similar - there are still far more differences than similarities. For a web developer looking to deploy a model on WebNN across platforms, the tables suggest that aside from a few exceptional cases, these operators would need to be decomposed by the user agent.

If a high-level operator:

  • does not provide performance benefits for most of the permutations of its parameters and backends, and
  • cannot reliably be used by the framework sitting on top of WebNN

...should it still be in WebNN? :)

Options:

  1. Remove these operators from WebNN
  2. Keep these operators, understanding that they will often be decomposed
  3. Water down these operators into a least-common-denominator variant
    • Open question: Would real models use these watered-down variants?

Updates

2024-06-07:

2024-07-23:

  • Added Proposal Status section
@fdwr (Collaborator) commented Jun 8, 2024

> I couldn't find documentation in DirectML saying which activations are supported by which operators. @ folks familiar with DML, please feel free to chime in!

🔔 For RNN, GRU, LSTM:

✅:
DML_OPERATOR_ACTIVATION_IDENTITY
DML_OPERATOR_ACTIVATION_LINEAR
DML_OPERATOR_ACTIVATION_ELU
DML_OPERATOR_ACTIVATION_RELU
DML_OPERATOR_ACTIVATION_CELU
DML_OPERATOR_ACTIVATION_GELU
DML_OPERATOR_ACTIVATION_SCALED_ELU
DML_OPERATOR_ACTIVATION_LEAKY_RELU
DML_OPERATOR_ACTIVATION_THRESHOLDED_RELU
DML_OPERATOR_ACTIVATION_PARAMETERIZED_RELU
DML_OPERATOR_ACTIVATION_PARAMETRIC_SOFTPLUS
DML_OPERATOR_ACTIVATION_SIGMOID
DML_OPERATOR_ACTIVATION_HARD_SIGMOID
DML_OPERATOR_ACTIVATION_SOFTPLUS
DML_OPERATOR_ACTIVATION_SOFTSIGN
DML_OPERATOR_ACTIVATION_TANH
DML_OPERATOR_ACTIVATION_SCALED_TANH
DML_OPERATOR_ACTIVATION_SHRINK
DML_OPERATOR_ACTIVATION_SWISH
DML_OPERATOR_ACTIVATION_HARD_SWISH

❌:
DML_OPERATOR_ELEMENT_WISE_CLIP
DML_OPERATOR_ACTIVATION_SOFTMAX
DML_OPERATOR_ACTIVATION_SOFTMAX1
DML_OPERATOR_ACTIVATION_LOG_SOFTMAX
DML_OPERATOR_ACTIVATION_LOG_SOFTMAX1
DML_OPERATOR_ACTIVATION_HARDMAX
DML_OPERATOR_ACTIVATION_HARDMAX1

@a-sully (Contributor, Author) commented Jun 8, 2024

Thanks for the investigation! 🙏 Updated the table

@miaobin commented Jul 2, 2024

We decomposed the GRU operator in the RNNoise sample into other operators supported by WebNN, and ran the sample on several integrated and discrete GPUs (see the decomposition sketch after the table):

| GRU (ms) | Integrated Graphics | Discrete Graphics 1 | Discrete Graphics 2 |
| --- | --- | --- | --- |
| build, MC enabled, GRU single op | 333.90 | 707.10 | 503.60 |
| build, MC enabled, GRU decomposed | 867.50 | 828.80 | 1534.50 |
| build, HLSL (MC disabled), GRU single op | 152.00 | 344.70 | 497.50 |
| build, HLSL (MC disabled), GRU decomposed | 880.50 | 756.80 | 1595.20 |
| compute, MC enabled, GRU single op | 11.77 | 14.81 | 21.57 |
| compute, MC enabled, GRU decomposed | 18.18 | 14.83 | 23.18 |
| compute, HLSL (MC disabled), GRU single op | 14.62 | 13.21 | 18.47 |
| compute, HLSL (MC disabled), GRU decomposed | 18.06 | 15.98 | 23.07 |
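For reference, a rough sketch (not necessarily the exact decomposition used in the sample) of a single GRU cell step expressed with WebNN primitives, assuming the default 'zrn' gate layout and resetAfter = true:

```js
// Rough sketch only: one GRU cell step decomposed into WebNN primitives.
// Shapes follow WebNN's gruCell: x [batch, inputSize], h [batch, hiddenSize],
// weight [3*hiddenSize, inputSize], recurrentWeight [3*hiddenSize, hiddenSize],
// bias / recurrentBias [3*hiddenSize].
function decomposedGruCell(builder, x, h, weight, recurrentWeight, bias, recurrentBias) {
  // Split the stacked per-gate weights/biases: update (z), reset (r), new (n).
  const [wz, wr, wn] = builder.split(weight, 3, { axis: 0 });
  const [rz, rr, rn] = builder.split(recurrentWeight, 3, { axis: 0 });
  const [bz, br, bn] = builder.split(bias, 3, { axis: 0 });
  const [rbz, rbr, rbn] = builder.split(recurrentBias, 3, { axis: 0 });

  // y = input · W^T + b
  const linear = (input, w, b) =>
      builder.add(builder.matmul(input, builder.transpose(w)), b);

  // z = sigmoid(xWz + bz + hRz + rbz),  r = sigmoid(xWr + br + hRr + rbr)
  const z = builder.sigmoid(builder.add(linear(x, wz, bz), linear(h, rz, rbz)));
  const r = builder.sigmoid(builder.add(linear(x, wr, br), linear(h, rr, rbr)));

  // n = tanh(xWn + bn + r ⊙ (hRn + rbn))   — the resetAfter = true form
  const n = builder.tanh(
      builder.add(linear(x, wn, bn), builder.mul(r, linear(h, rn, rbn))));

  // h' = (1 − z) ⊙ n + z ⊙ h, rewritten as n + z ⊙ (h − n) to avoid a constant
  return builder.add(n, builder.mul(z, builder.sub(h, n)));
}
```

A full gru() decomposition would additionally loop over steps and handle direction, layout and the initial hidden state, which is where the extra graph-build cost in the table comes from.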

@a-sully (Contributor, Author) commented Jul 7, 2024

Thanks for the investigation, @miaobin!

Thinking about the options I mentioned above, this RNNoise model is a nice example of a model which could benefit from the third option:

  3. Water down these operators into a least-common-denominator variant

The sample always uses sigmoid for f()/recurrent_activation and relu for g()/cell_activation. These, along with tanh, are the most widely supported activations. If WebNN were to support only these 3 activations for lstm() and gru(), then it seems likely that both DirectML and Core ML could support it directly (though TFLite may still require a fallback). Since none of these activations takes any arguments (and Core ML doesn't support passing arguments to activations anyway), they could be expressed with a string...

...This may sound familiar! :) https://github.com/webmachinelearning/webnn/pull/188/files#diff-5e793325cd2bfc452e268a4aa2f02b4024dd9584bd1db3c2595f61f1ecf7b985L909-L913
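As a rough sketch of what that could look like for the RNNoise case, with string-valued activations restricted to the old MLRecurrentNetworkActivation values ('relu' | 'sigmoid' | 'tanh') instead of MLActivation objects (option names are illustrative, not normative):

```js
// Sketch: string-based activations for gru(), as proposed above.
const outputs = builder.gru(input, weight, recurrentWeight, steps, hiddenSize, {
  returnSequence: true,
  activations: ['sigmoid', 'relu'],  // f() = sigmoid, g() = relu, as in RNNoise
});
```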

Given that MLActivation was proposed to accommodate op fusion, which has since been removed (in #664), I don't see a need to keep it in its current form. That sort of takes option 2 off the board:

2. Keep these operators, understanding that they will often be decomposed

There are still open questions with regard to supporting lstm() or gru() on Core ML.

I propose the following:

  • Remove MLActivation in favor of restoring the former MLRecurrentNetworkActivation enum (described here)
  • Add conformance tests for lstm() and gru()
  • Try implementing these operators on Core ML and TFLite
  • Re-evaluate lstm() and gru() based on the results of the previous step

WDYT?

@a-sully (Contributor, Author) commented Jul 8, 2024

I see there are a few thumbs-up on my previous comment so I just put up #718. PTAL :)

@huningxin (Contributor) commented Jul 11, 2024

> Add conformance tests for lstm() and gru()

/cc @BruceDai

@BruceDai (Contributor) commented

@huningxin Thanks for the reminder!
I'll add those tests ASAP.

@fdwr (Collaborator) commented Jul 19, 2024

> 🙏 Updated the table

@a-sully: It is a beautiful table, btw.

@a-sully (Contributor, Author) commented Jul 23, 2024

FYI I added a Proposal Status header to the issue description to track progress on my proposal from #689 (comment). @BruceDai please post an update on this issue when conformance WPTs are added!

@BruceDai (Contributor) commented Aug 2, 2024

I've initiated two CLs, CL-1 and CL-2, for adding gru/gruCell and lstm/lstmCell conformance tests. Both CLs are still WIP, pending the tolerance criteria that @fdwr is helping to define for these complex operators.

@BruceDai (Contributor) commented Aug 6, 2024

To get gru/gruCell and lstm/lstmCell conformance tests into WPT ASAP, I created another two CLs, CL-1' and CL-2', leveraging test data and tolerance criteria from the existing gru and lstm tests in the ONNX CTS. We can then continue working on defining exact tolerance criteria for these complex operators and adding more tests for coverage.

Updated
CL-1' and CL-2' have both landed, so we now have these conformance tests.

@a-sully (Contributor, Author) commented Aug 15, 2024

Thanks for the update, @BruceDai! I've filed https://crbug.com/360052663 to track the next step: implementation of these operators in Chromium on TFLite and CoreML

@philloooo (Contributor) commented

Here is an update on the CoreML implementation: CoreML's gru only gives correct results with the "cpu" device target when GPU sandboxing is disabled, and gives incorrect results for all device targets when the GPU sandbox is on. I've reported the bug to @mwyrzykowski.

For now, given the bug, we can't use CoreML's gru and have to implement the full decomposition as well.

Once the bug is resolved, here are the API differences that could be emulated:

Constant requirement

Weights and biases need to be constant, so any weight/bias manipulation needs to be done as a preprocessing step before passing them to CoreML.

direction

WebNN supports bidirectional gru; CoreML only supports forward/backward.
WebNN's gru takes weights/biases with the first dimension being the direction. Since CoreML only supports a single direction, this dimension is dropped, so the weights and biases need to be reshaped/split (see the sketch at the end of this comment):
Weights: [numDirections, 3 * hiddenSize, inputSize] -> [3 * hiddenSize, inputSize]

Bias: [numDirections, 3 * hiddenSize] -> [3 * hiddenSize]

Merged bias and reset_after

CoreML takes a single bias value, which is the result of add(bias, recurrentBias).
It also only supports reset_after = true starting with iOS 18 (macOS 15). I think iOS 18 should also support passing bias and recurrentBias separately, but the documentation needs to be fixed.
Right now we use CoreML 7 (iOS 17, macOS 14), so we need to do the decomposition on iOS 17 for reset_after = true.
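A rough sketch of the preprocessing described above, using WebNN-style builder ops purely for illustration (a real implementation would do this while building the CoreML graph, the operands would have to be constants, and the simple bias merge assumes resetAfter = false; hiddenSize/inputSize are assumed to be plain numbers):

```js
// Illustrative preprocessing for a single-direction gru:
// 1. Drop the leading numDirections (= 1) dimension.
// 2. Merge bias and recurrentBias into the single bias CoreML expects.
const weight2d = builder.reshape(weight, [3 * hiddenSize, inputSize]);   // [1, 3h, i] -> [3h, i]
const recurrentWeight2d =
    builder.reshape(recurrentWeight, [3 * hiddenSize, hiddenSize]);      // [1, 3h, h] -> [3h, h]
const mergedBias = builder.add(
    builder.reshape(bias, [3 * hiddenSize]),                             // [1, 3h] -> [3h]
    builder.reshape(recurrentBias, [3 * hiddenSize]));                   // add(bias, recurrentBias)
```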

@philloooo (Contributor) commented

Circling back, given that there is only a single backend supporting this, we are back to the original question: should we just remove lstm and gru?

@huningxin (Contributor) commented

The data in #689 (comment) shows a significant performance improvement from using the native gru op. If we are going to remove lstm and gru, my question is: how can WebNN help an implementation recognize the decomposed form of lstm / gru and fuse it into the native lstm / gru op when available?
