Consider removing lstm and gru operators #689

Open · 2 of 4 tasks
a-sully opened this issue May 16, 2024 · 12 comments

a-sully commented May 16, 2024

Proposal Status

From #689 (comment)


Related to #453. One recommendation of that issue is:

  • Remove model-specific instructions like LSTM and gru

Current support across backends

| Operator | DirectML | CoreML | TFLite |
| --- | --- | --- | --- |
| lstm | ✅ | ✅ | ⚠️† |
| gru | ✅ | ✅ | - |

†TFLite delegates generally do not support the entire TFLite opset. For instance, TFLite GPU delegates only support a very basic variant of LSTM which does not support most of the parameters specified by WebNN.

What does "supporting" LSTM really mean?

Higher-level operators such as lstm and gru tend to have more knobs to turn, and each additional knob increases the likelihood that WebNN will not actually be able to use the backend's operator of the same name. There are many variants of LSTM. Just because a backend has an LSTM operator does not mean that operator can express everything required by the variant of LSTM that WebNN specifies. Meanwhile, frameworks sitting on top of WebNN may only take advantage of WebNN’s lstm operator if it is exactly the variant the calling framework needs.

For example, TFLite's variant of LSTM uses coupled "input" and "forget" gates (CIFG), an option WebNN does not offer - Chromium's DML implementation currently does not couple these gates. As a result, user agents implementing WebNN on TFLite cannot use TFLite's LSTM operator, and frameworks calling into WebNN cannot use WebNN's lstm operator if they want the CIFG behavior.
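To make these knobs concrete, here is a minimal sketch (not from the thread) of what a caller has to pin down just to build a single lstm() op. It follows the WebNN JS API shapes as of mid-2024, including the MLActivation form the spec still had at the time; exact descriptor and option names may differ across spec revisions, and WebNN IDL type declarations are assumed to be available to the compiler:

```ts
// Sketch: every option below is a variant-defining knob. Shapes assume
// steps = 16, batchSize = 1, inputSize = 32, hiddenSize = 64, one direction.
async function buildLstmExample() {
  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);

  const input = builder.input('x', { dataType: 'float32', dimensions: [16, 1, 32] });
  const weight = builder.input('W', { dataType: 'float32', dimensions: [1, 4 * 64, 32] });
  const recurrentWeight = builder.input('R', { dataType: 'float32', dimensions: [1, 4 * 64, 64] });

  const outputs = builder.lstm(input, weight, recurrentWeight, /*steps*/ 16, /*hiddenSize*/ 64, {
    direction: 'forward',   // or 'backward' / 'both'
    layout: 'iofg',         // gate ordering; 'ifgo' is the other choice
    activations: [builder.sigmoid(), builder.tanh(), builder.tanh()],
    returnSequence: true,
    // bias, recurrentBias, peepholeWeight, initialHiddenState, initialCellState, ...
  });
  // Note: none of these knobs can express TFLite's coupled input/forget (CIFG) variant.
  return outputs;
}
```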

Let's look at the problem @philloooo mentioned about the activations supported by LSTM across various backends: #573 (comment)

Supported activation functions for LSTM

| Activation | ONNX | DirectML† | CoreML | TFLite | WebNN |
| --- | --- | --- | --- | --- | --- |
| affine | optional | - | - | - | - |
| elu | optional | ✅ | - | - | ✅ |
| celu | - | ✅ | - | - | - |
| gelu | - | ✅ | - | - | ✅ |
| hardSigmoid | optional | ✅ | ⚠️†† | - | ✅ |
| hardSwish | - | ✅ | - | - | ✅ |
| identity | - | ✅ | - | - | As operator only |
| leakyRelu | optional | ✅ | - | - | ✅ |
| linear | - | ✅ | ⚠️†† | - | ✅ |
| parametricSoftplus | - | ✅ | - | - | - |
| relu | ✅ | ✅ | ✅ | ✅ | ✅ |
| reluN1To1 | - | - | - | ✅ | - |
| relu6 | - | - | - | ✅ | - |
| scaledElu | - | ✅ | - | - | - |
| scaledTanh | optional | ✅ | ⚠️†† | - | - |
| shrink | - | ✅ | - | - | - |
| swish | - | ✅ | - | - | - |
| sigmoid | ✅ | ✅ | ✅ | - | ✅ |
| sign | - | - | - | ✅ | - |
| softplus | optional | ✅ | - | - | ✅ |
| softsign | optional | ✅ | - | - | ✅ |
| tanh | ✅ | ✅ | ✅ | ✅ | ✅ |
| thresholdedRelu | optional | ✅ | - | - | - |

†I couldn't find documentation in DirectML saying which activations are supported by which operators. @ folks familiar with DML, please feel free to chime in! (done! #689 (comment))

††Does not support passing alpha nor beta values, as far as I can tell


Aside: Now that MLActivations are no longer used for op fusion, we should consider removing MLActivations which do not make sense for use with recurrent operators.


What activations can be specified on each backend?

Let's also remember that WebNN's lstm operator has multiple activations.

| Gates | DirectML† | CoreML | TFLite | WebNN |
| --- | --- | --- | --- | --- |
| input (i) | f() | recurrent_activation | fused_activation_function | activations[0] |
| output (o) | f() | recurrent_activation | fused_activation_function | activations[0] |
| forget (f) | f() | recurrent_activation | fused_activation_function | activations[0] |
| cell (g) | g() | cell_activation | always sigmoid, I think? | activations[1] |
| hidden (h) | h() | activation | always sigmoid, I think? | activations[2] |

†DML also supports passing different activations for LSTM's forward and backward passes
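Reading the table back as code: with WebNN's defaults, the three-element activations list lines up with the backend parameters roughly as follows (a sketch, assuming an MLGraphBuilder `builder` is in scope and the spec's sigmoid/tanh/tanh defaults):

```ts
declare const builder: MLGraphBuilder; // WebNN IDL types assumed available

// WebNN lstm() activations, annotated with the gate/parameter each slot feeds
// according to the table above.
const activations = [
  builder.sigmoid(), // activations[0]: input/output/forget gates
                     //   (DML f(), CoreML recurrent_activation, TFLite fused_activation_function)
  builder.tanh(),    // activations[1]: cell gate (DML g(), CoreML cell_activation)
  builder.tanh(),    // activations[2]: hidden state update (DML h(), CoreML activation)
];
```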

Summary

Reviewing the Guidelines for new operations to retroactively evaluate whether lstm and gru meet these guidelines, this stands out:

  • Prefer primitives over new high level operations but consider performance consequences

The rationale for having these operators in WebNN is that the user agent's implementation of WebNN can plumb this directly to the backend's operator of the same name; otherwise there's no benefit compared to having the framework sitting on top of WebNN decompose the operator itself.
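For contrast, here is a rough sketch (not the spec's normative decomposition) of what a framework-side decomposition of a single GRU step could look like using WebNN primitives, assuming one direction, the "zrn" gate layout, resetAfter = false, per-gate weights already sliced apart, and biases folded into the weights for brevity:

```ts
// Sketch: one GRU step (hPrev -> hNew) expressed with WebNN primitives.
// x: [batchSize, inputSize]; hPrev: [batchSize, hiddenSize];
// wZ/wR/wN: [hiddenSize, inputSize]; rZ/rR/rN: [hiddenSize, hiddenSize].
function gruStepDecomposed(
  builder: MLGraphBuilder,
  x: MLOperand, hPrev: MLOperand,
  wZ: MLOperand, wR: MLOperand, wN: MLOperand,
  rZ: MLOperand, rR: MLOperand, rN: MLOperand,
): MLOperand {
  // proj(w, r, h) = x·wᵀ + h·rᵀ
  const proj = (w: MLOperand, r: MLOperand, h: MLOperand) =>
    builder.add(builder.matmul(x, builder.transpose(w)),
                builder.matmul(h, builder.transpose(r)));

  const z = builder.sigmoid(proj(wZ, rZ, hPrev));              // update gate
  const r = builder.sigmoid(proj(wR, rR, hPrev));              // reset gate
  const n = builder.tanh(proj(wN, rN, builder.mul(r, hPrev))); // candidate state (resetAfter = false)

  // hNew = (1 - z) ⊙ n + z ⊙ hPrev
  const one = builder.constant({ dataType: 'float32', dimensions: [1] }, new Float32Array([1]));
  return builder.add(builder.mul(builder.sub(one, z), n), builder.mul(z, hPrev));
}
```

The tables above suggest that on most backends the user agent would end up emitting something like this anyway; the open question is whether doing it inside the browser buys anything over letting the framework do it.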

While there is some overlap - DML and CoreML are the most similar - there are still far more differences than similarities. For a web developer looking to deploy a model on WebNN across platforms, the tables suggest that aside from a few exceptional cases, these operators would need to be decomposed by the user agent.

If a high-level operator:

  • does not provide performance benefits for most of the permutations of its parameters and backends, and
  • cannot reliably be used by the framework sitting on top of WebNN

...should it still be in WebNN? :)

Options:

  1. Remove these operators from WebNN
  2. Keep these operators, understanding that they will often be decomposed
  3. Water down these operators into a least-common-denominator variant
    • Open question: Would real models use these watered-down variants?

Updates

2024-06-07:

2024-07-23:

  • Added Proposal Status section

fdwr commented Jun 8, 2024

> I couldn't find documentation in DirectML saying which activations are supported by which operators. @ folks familiar with DML, please feel free to chime in!

🔔 For RNN, GRU, LSTM:

✅:
DML_OPERATOR_ACTIVATION_IDENTITY
DML_OPERATOR_ACTIVATION_LINEAR
DML_OPERATOR_ACTIVATION_ELU
DML_OPERATOR_ACTIVATION_RELU
DML_OPERATOR_ACTIVATION_CELU
DML_OPERATOR_ACTIVATION_GELU
DML_OPERATOR_ACTIVATION_SCALED_ELU
DML_OPERATOR_ACTIVATION_LEAKY_RELU
DML_OPERATOR_ACTIVATION_THRESHOLDED_RELU
DML_OPERATOR_ACTIVATION_PARAMETERIZED_RELU
DML_OPERATOR_ACTIVATION_PARAMETRIC_SOFTPLUS
DML_OPERATOR_ACTIVATION_SIGMOID
DML_OPERATOR_ACTIVATION_HARD_SIGMOID
DML_OPERATOR_ACTIVATION_SOFTPLUS
DML_OPERATOR_ACTIVATION_SOFTSIGN
DML_OPERATOR_ACTIVATION_TANH
DML_OPERATOR_ACTIVATION_SCALED_TANH
DML_OPERATOR_ACTIVATION_SHRINK
DML_OPERATOR_ACTIVATION_SWISH
DML_OPERATOR_ACTIVATION_HARD_SWISH

❌:
DML_OPERATOR_ELEMENT_WISE_CLIP
DML_OPERATOR_ACTIVATION_SOFTMAX
DML_OPERATOR_ACTIVATION_SOFTMAX1
DML_OPERATOR_ACTIVATION_LOG_SOFTMAX
DML_OPERATOR_ACTIVATION_LOG_SOFTMAX1
DML_OPERATOR_ACTIVATION_HARDMAX
DML_OPERATOR_ACTIVATION_HARDMAX1
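As a rough illustration of how an implementation might act on this list (hypothetical helper, not Chromium code), a user agent could take the "use DML's recurrent operator directly" path only when every requested activation maps onto one of the supported values above, and fall back to a decomposition otherwise:

```ts
// Hypothetical lookup from WebNN activation names to the DML activation operator
// types accepted by RNN/GRU/LSTM, per the ✅ list above (abbreviated).
const DML_RNN_ACTIVATIONS = new Map<string, string>([
  ['identity', 'DML_OPERATOR_ACTIVATION_IDENTITY'],
  ['linear', 'DML_OPERATOR_ACTIVATION_LINEAR'],
  ['elu', 'DML_OPERATOR_ACTIVATION_ELU'],
  ['relu', 'DML_OPERATOR_ACTIVATION_RELU'],
  ['leakyRelu', 'DML_OPERATOR_ACTIVATION_LEAKY_RELU'],
  ['sigmoid', 'DML_OPERATOR_ACTIVATION_SIGMOID'],
  ['hardSigmoid', 'DML_OPERATOR_ACTIVATION_HARD_SIGMOID'],
  ['softplus', 'DML_OPERATOR_ACTIVATION_SOFTPLUS'],
  ['softsign', 'DML_OPERATOR_ACTIVATION_SOFTSIGN'],
  ['tanh', 'DML_OPERATOR_ACTIVATION_TANH'],
]);

// True if DML's LSTM/GRU can honor the requested activations directly.
function canUseDmlRecurrentOp(requested: string[]): boolean {
  return requested.every((name) => DML_RNN_ACTIVATIONS.has(name));
}
```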


a-sully commented Jun 8, 2024

Thanks for the investigation! 🙏 Updated the table


miaobin commented Jul 2, 2024

We decomposed the GRU operator in the RNNoise sample into other operators supported by WebNN, and ran the sample on several integrated and discrete GPUs:

| GRU (ms) | Integrated Graphics | Discrete Graphics 1 | Discrete Graphics 2 |
| --- | --- | --- | --- |
| build, MC enabled, GRU single op | 333.90 | 707.10 | 503.60 |
| build, MC enabled, GRU decomposed | 867.50 | 828.80 | 1534.50 |
| build, HLSL (MC disabled), GRU single op | 152.00 | 344.70 | 497.50 |
| build, HLSL (MC disabled), GRU decomposed | 880.50 | 756.80 | 1595.20 |
| compute, MC enabled, GRU single op | 11.77 | 14.81 | 21.57 |
| compute, MC enabled, GRU decomposed | 18.18 | 14.83 | 23.18 |
| compute, HLSL (MC disabled), GRU single op | 14.62 | 13.21 | 18.47 |
| compute, HLSL (MC disabled), GRU decomposed | 18.06 | 15.98 | 23.07 |


a-sully commented Jul 7, 2024

Thanks for the investigation, @miaobin!

Thinking about the options I mentioned above, this RNNoise model is a nice example of a model which could benefit from the third option:

> 3. Water down these operators into a least-common-denominator variant

The sample always uses sigmoid for f()/recurrent_activation and relu for g()/cell_activation. These activations, along with tanh, are the most widely supported activations. If WebNN were to support only these 3 activations for lstm() and gru(), then it seems likely that both DirectML and Core ML could directly support it (though TFLite may still require a fallback). Since none of these activations take any arguments (and Core ML doesn't support passing arguments to activations anyways) these activations could be expressed with a string...

...This may sound familiar! :) https://github.com/webmachinelearning/webnn/pull/188/files#diff-5e793325cd2bfc452e268a4aa2f02b4024dd9584bd1db3c2595f61f1ecf7b985L909-L913
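A minimal sketch of that shape, assuming just the three string values discussed above (illustrative only, not spec text):

```ts
// Sketch: least-common-denominator recurrent activations as plain strings
// instead of MLActivation objects.
type MLRecurrentNetworkActivation = 'relu' | 'sigmoid' | 'tanh';

interface MLLstmOptionsSketch {
  // ...other lstm() options unchanged...
  activations?: MLRecurrentNetworkActivation[]; // e.g. ['sigmoid', 'tanh', 'tanh']
}
```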

Given that MLActivation was proposed to accommodate op fusion, which has since been removed (in #664), I don't see a need to keep it in its current form. That sort of takes option 2 off the board:

> 2. Keep these operators, understanding that they will often be decomposed

There are still open questions with regards to supporting lstm() or gru() on Core ML. For instance:

I propose the following:

  • Remove MLActivation in favor of restoring the former MLRecurrentNetworkActivation enum (described here)
  • Add conformance tests for lstm() and gru()
  • Try implementing these operators on Core ML and TFLite
  • Re-evaluate lstm() and gru() based on the results of the previous step

WDYT?


a-sully commented Jul 8, 2024

I see there are a few thumbs-up on my previous comment so I just put up #718. PTAL :)


huningxin commented Jul 11, 2024

> Add conformance tests for lstm() and gru()

/cc @BruceDai

@BruceDai

@huningxin Thanks for the reminder!
I'll add those tests ASAP.


fdwr commented Jul 19, 2024

> 🙏 Updated the table

@a-sully : It is a beautiful table btw.


a-sully commented Jul 23, 2024

FYI I added a Proposal Status header to the issue description to track progress on my proposal from #689 (comment). @BruceDai please post an update on this issue when conformance WPTs are added!


BruceDai commented Aug 2, 2024

I've initiated two CLs, CL-1 and CL-2, for adding gru/gruCell and lstm/lstmCell conformance tests. Both CLs are still WIP, pending the tolerance criteria that @fdwr is helping to define for these complex operators.


BruceDai commented Aug 6, 2024

To get gru/gruCell and lstm/lstmCell conformance tests into WPT ASAP, I created another two CLs, CL-1' and CL-2', which reuse test data and tolerance criteria from the existing gru and lstm tests of the ONNX CTS. We can then continue working on exact tolerance criteria for these complex operators and add more tests for coverage.

Updated
CL-1' and CL-2' have both landed, so we now have these conformance tests.


a-sully commented Aug 15, 2024

Thanks for the update, @BruceDai! I've filed https://crbug.com/360052663 to track the next step: implementation of these operators in Chromium on TFLite and CoreML
