
Do we need an MLConstantOperand? #668

Closed
a-sully opened this issue Apr 30, 2024 · 14 comments

Comments

@a-sully
Contributor

a-sully commented Apr 30, 2024

The constant() operator is special. Constants are effectively inputs that are known at the time of graph compilation (i.e. build()).

You may ask, why do we need a separate method when we could just pass this constant data as an input()?

Most discussions I've seen so far about constant() (#614 (comment), this thread about the since-abandoned "fill sequence" overload, etc) have been concerned with optimizing performance. There are a number of compile-time optimizations a backend may perform if it knows that some data is constant.
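
For illustration, the difference from the caller's side is just which builder method the data goes through. A minimal sketch, assuming builder is an MLGraphBuilder and biasData is a Float32Array (descriptor field names as used at the time of writing):

// Bias values known at build() time: the backend sees the bytes during
// compilation and can fold, repack, or pre-upload them.
const biasConst = builder.constant(
    {dataType: 'float32', dimensions: [32]}, biasData);

// Bias supplied at compute time: the values are only known when the graph runs,
// so build-time optimizations (and CoreML's "const tensor" parameters) are out.
const biasInput = builder.input('bias', {dataType: 'float32', dimensions: [32]});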

Our experience with CoreML has given us another reason:

The backend requires that a parameter to an operator be constant

Take conv2d(), for example. It's defined for CoreML here. The bias parameter is defined as:

  • bias: const tensor<[C_out],T>

Meanwhile, WebNN allows this to be any MLOperand, as long as it's of the appropriate shape and dtype:

dictionary MLConv2dOptions {
  // ...
  MLOperand bias;
  // ...
};

This appears to be a straightforward plumbing through of DML's interface, which does not require the BiasTensor to be a constant. Neither does the corresponding operator for TFLite. From what I can tell, this seems to be because these frameworks don't have a way to express that some input tensor must be const. The options are either to pass the parameter as the framework's generic representation of a Tensor - which would in practice always(?) be created from a constant() - or to pass the parameter as a 1D array directly. If the parameters may be large (and perhaps unbounded), the former is the more appropriate choice.
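
In other words, nothing in the WebIDL above stops the bias from being the output of another operation, which a const tensor parameter cannot express. A hypothetical sketch (input and filter assumed to be in scope):

// Bias produced by another operator (here inflated from float16 at runtime)
// rather than by constant(). WebNN, DML, and TFLite accept this today;
// CoreML's "bias: const tensor<[C_out],T>" cannot.
const packedBias = builder.input('packedBias', {dataType: 'float16', dimensions: [32]});
const bias = builder.cast(packedBias, 'float32');
const output = builder.conv2d(input, filter, {bias});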

To get a sense for whether this is a reasonable hypothesis, I've inspected all† uses of the affected operators in the WebNN Samples repo:

| operator.param | Usage in WebNN Samples |
| --- | --- |
| batchNormalization.mean | Constant only |
| batchNormalization.variance | Constant only |
| batchNormalization.scale | Constant only |
| batchNormalization.bias | Constant only |
| conv2d.bias | Constant only |
| convTranspose2d.bias | Not used |
| gru.weight | Constant only |
| gru.recurrentWeight | Constant only |
| gru.bias | Constant only |
| gru.recurrentBias | Constant only |
| instanceNormalization.scale | Constant only†† |
| instanceNormalization.bias | Constant only†† |
| layerNormalization.scale | Not used |
| layerNormalization.bias | Not used |
| lstm.weight | Not used |
| lstm.recurrentWeight | Not used |
| lstm.bias | Not used |
| lstm.recurrentBias | Not used |
| lstm.peepholeWeight | Not used |
| prelu.slope | Not used |

†This list only includes WebNN operators which map trivially to CoreML operators. WebNN operators which need to be implemented in terms of other CoreML operators will be subject to the restrictions of those respective CoreML operators. For example, CoreML doesn't have operators for gruCell or lstmCell, so these operators will need to be implemented in terms of gru and lstm, respectively; those operators will in turn need many of their parameters to be const as well.

††One caller passes the result of a reshape... but that's only because the sample was written before constant() took an MLOperandDescriptor. The reshape just assigns dimensions to a constant(); nowadays we'd pass the constant() directly.

Remarkably, in every single instance where one of these params is used in the WebNN Samples, it was created from a constant(). Cool!

Of course, this is not close to a comprehensive list of all models we hope to run with WebNN. That being said, if there are no significant known use cases for passing any of these parameters as non-constant tensors - if their non-constness is simply a limitation in the framework and there are no useful reasons to pass non-const tensors - I think there's a reasonable argument that WebNN should require these parameters to be constants. @fdwr could you perhaps provide some more color here? :)

It seems that we have the following options to support each of these operators on CoreML:

  1. Require that operator.param must be a constant MLOperand (my tentative preference; a sketch of how this would surface to callers follows this list)
    • Note that it's much easier to relax these restrictions than to try to impose them later
    • This can be done two ways:
      1. Create an MLConstantOperand interface which extends MLOperand, and specify that param takes an MLConstantOperand
      2. Specify that MLOperand has a "kind", as the Chromium implementation already does, and throw a TypeError if it's not of the "constant" kind. This may be confusing to developers.
  2. Make operator.param a sequence<MLNumber>
    • This doesn't make much sense for anything other than 1-D tensors
  3. Decide that this is a quirk of CoreML and fail if operator.param is not a constant only on CoreML
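
Under either variant of option 1, the developer-facing behavior would look roughly like the following sketch (MLConstantOperand and the "constant" kind check are proposals, not shipped API; variable names are illustrative):

const constBias = builder.constant({dataType: 'float32', dimensions: [32]}, biasData);
builder.conv2d(input, filter, {bias: constBias});    // OK: operand created by constant()

const dynamicBias = builder.relu(someOtherOperand);   // output of another operator
builder.conv2d(input, filter, {bias: dynamicBias});   // TypeError under option 1:
                                                      // bias is not a constant operand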

Thoughts?

@huningxin
Contributor

@a-sully

gru.weight Not used
gru.recurrentWeight Not used
gru.bias Not used
gru.recurrentBias Not used

FYI, these operands (constants only) are used by the NSNet2 (noise suppression) example: https://github.com/webmachinelearning/webnn-samples/blob/master/nsnet2/nsnet2.js#L51

@a-sully
Contributor Author

a-sully commented May 1, 2024

My apologies for the error. I've updated the respective rows (they're also all "Constant only")

@a-sully
Contributor Author

a-sully commented May 2, 2024

Turns out XNNPACK also requires some tensors to be constants - for example prelu.slope. The Chromium implementation currently fails if the slope operand was not created as a constant()

@inexorabletash
Member

It would be nice to make progress here, as @a-sully notes "it's much easier to relax these restrictions than to try to impose them later"

I'm a fan of the MLConstantOperand approach - it's very straightforward to document (e.g. MDN), specify (just WebIDL, no special case text needed), and implement (bindings code should make it "just work").

I see a 👍 from @huningxin above; @fdwr have you had a chance to think about this?

a-sully added a commit to a-sully/webnn that referenced this issue Jul 30, 2024
@a-sully
Contributor Author

a-sully commented Jul 30, 2024

Sketched out a PR for MLConstantOperand in #747

@fdwr
Collaborator

fdwr commented Jul 31, 2024

There are a number of compile-time optimizations a backend may perform if it knows that some data is constant.

Indeed, and I like the concept of an operand having a "constness" property, whether it be via a separate class MLConstantOperand, or just an inspectable property on the ordinary MLOperand (e.g. boolean MLOperand::isConstant()) that is set true when returned from constant, and false when generated from another operator. A constness property enables backends to make more informed decisions (not limited to CoreML), but I'm reluctant to require operands like batchNorm to only take constants.

It seems that we have the following options to support each of these operators on CoreML:

Can this be an additional property on the opSupportLimits, besides rank and dataType? My preference is your option 3, except add constant to opSupportLimits, expose an MLOperand::isConstant property, and fail synchronously before build if the backend requires constness. Changing my mind below - I think we should decompose in such cases in the backend.

if their non-constness is simply a limitation in the framework and there are no useful reasons to pass non-const tensors

Or is the reverse true, that mandatory constness is a limitation in a framework, and there are useful reasons to pass dynamic tensors such as for custom weights or as outputs from other operators? DirectML has a means (via DML_GRAPH_NODE_TYPE_CONSTANT) to specify that a tensor is a constant, which can indeed afford optimization opportunities, but it's not required. Accepting dynamic inputs can be useful for cases like convTranspose weights being inflated from a dequantized tensor at runtime (saving significant model space). It sounds like I should scan through all the models on my hard drive to find any cases when dynamic tensors are fed into an operator which CoreML requires to be constant. ⏳

Skimming through the MIL ops, some decisions are unclear to me. Why is MIL conv's weight dynamic while requiring bias to be constant, when there are more interesting preprocessing advantages to knowing that weights are constant? (update - Ningxin points out iOS17 supports dynamic weights, showing that iOS may relax more of these constraints over time) Why is conv's filter weight tensor dynamic, but closely related conv_transpose requires the filter weight to be constant?

Note we're not necessarily blocked for all the ops above when inputs have dynamic values, as multiple operators like normalization.batch_norm can be decomposed into operators that do not require constness, like {add, mul, div, sqrt}. When those input operands are constant, then the built-in CoreML normalization.batch_norm can be used directly, but if non-constant, then we can still achieve dynamic behavior via op emulation. How many other operators can this apply to? conv/convTranspose/gru/lstm seem the toughest ones 🤔.
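
For reference, that decomposition is just the batchNormalization formula spelled out with element-wise ops. Expressed here in WebNN-builder terms for readability (the CoreML backend would emit the equivalent MIL ops); this is a sketch assuming NCHW input, per-channel [C] tensors, and that builder, x, mean, variance, scale, bias, channels, and epsilon are in scope:

// (x - mean) / sqrt(variance + epsilon) * scale + bias
const perChannel = (op) => builder.reshape(op, [1, channels, 1, 1]);  // make broadcastable
const eps = builder.constant({dataType: 'float32', dimensions: [1]},
                             new Float32Array([epsilon]));
let result = builder.div(
    builder.sub(x, perChannel(mean)),
    builder.sqrt(builder.add(perChannel(variance), eps)));
if (scale) result = builder.mul(result, perChannel(scale));
if (bias) result = builder.add(result, perChannel(bias));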

@inexorabletash
Member

Can this be an additional property on the opSupportLimits, besides rank and dataType? My preference is your option 3, except add constant to opSupportLimits, expose an MLOperand::isConstant property, and fail synchronously before build if the backend requires constness.

I was initially skeptical of this approach, as it would seem that developers would be likely to develop and test on one platform (e.g. Chromium/Windows/DML) and only discover incompatibilities when users with different systems ran into failures. But if the vast majority of models that we've seen use constant weights anyway then this will be rare in practice.

How smoothly would probing opSupportLimits here and/or responding to a sync build failure integrate into ORT? Is this an expected limitation of EPs already, or would we just rely on errors to propagate up from build?
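
For concreteness, probing could look something like the sketch below. The per-operand constness field (here called mustBeConstant) is hypothetical, as is the helper flag, and the exact shape of the opSupportLimits entries may differ:

const limits = context.opSupportLimits();
// Hypothetical: a backend that needs a const conv2d bias would advertise it here,
// alongside the existing per-operand info such as supported data types.
if (limits.conv2d?.bias?.mustBeConstant && !biasComesFromConstant) {
  // Fall back before build(): bake the bias into a constant(), or decompose.
}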

@philloooo
Contributor

Accepting dynamic inputs can be useful for cases like convTranspose weights being inflated from a dequantized tensor at runtime (saving significant model space).

This seems useful. Do all of the use cases that need non-constant weights derive those weights from actual constants, or do we have use cases that derive non-constant weights from graph inputs?

If all dynamic weights are derived from constants, we can solve this by running a pre-processing subgraph to generate these constants.

The pre-processing step can be within WebNN logic; in that case, we would still need to express the requirement that these weights be constant or derived from constants. WebNN would detect when a weight is not constant and run a pre-processing subgraph to generate the constants.

The pre-processing step can also be done outside of WebNN; in that case we can expose the constness requirement through opSupportLimits and have ORT (or a JavaScript library) implement the logic of preparing the const weights as a pre-processing step (see the sketch after the lists below). The pros of doing it outside of WebNN are:

  • Explicitly separates the steps of preparing constants and running prediction; the prepared constants can be reused.
  • The logic is simpler to implement in JS / ORT.

Cons:

  • Makes the WebNN API harder to use by pushing some logic into user space.
  • Requires implementing MLBuffer for constants to make it efficient.
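
A sketch of the "outside of WebNN" flavor: derive the constant once with a small pre-processing graph, then feed the resulting bytes into constant() of the real graph. This assumes the implementation accepts a constant-only graph; names and shapes are illustrative:

// Pre-processing graph: inflate packed float16 weights to float32 once.
const pre = new MLGraphBuilder(context);
const packed = pre.constant({dataType: 'float16', dimensions: [32]}, packedFp16);  // Uint16Array bits
const inflated = pre.cast(packed, 'float32');
const preGraph = await pre.build({inflated});
const {outputs} = await context.compute(preGraph, {}, {inflated: new Float32Array(32)});

// Real graph: the derived weights are now a true constant().
const builder = new MLGraphBuilder(context);
const bias = builder.constant({dataType: 'float32', dimensions: [32]}, outputs.inflated);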

@philloooo
Contributor

Breaking this into a couple of sub-problems.

Specify constant operand by MLConstantOperand or isConstant property

On this one, we seem to have consensus that it's useful to express such a concept.

What is a constant operand

I'd like to propose that a constant operand could be either the output of builder.constant, or an MLOperand generated from an operation whose inputs are all constant operands.

This allows us to apply constant folding for use cases like inflating dequantized weights, or doing any kind of reshape/split/transpose of the weights when there are format incompatibilities.
The constant-folding pass would be done (as a separate processing step, e.g. using TFLite) to generate derived constants before passing the graph to the backends that have constness requirements/preferences (TFLite and CoreML).
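
A possible way to state that rule, as a sketch only (real MLOperands don't expose a kind or an inputs list; this is the bookkeeping an implementation would track internally):

// An operand is "constant" if it came from builder.constant(), or if it is the
// output of an operation whose inputs are all themselves constant operands.
function isConstantOperand(operand) {
  if (operand.kind === 'constant') return true;    // from builder.constant()
  if (operand.kind === 'input') return false;      // from builder.input(): dynamic
  return operand.inputs.every(isConstantOperand);  // operator output: fold if possible
}
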
Thoughts?

Define MLConstantOperand or MLOperand::isConstant()

I think MLConstantOperand is more explicit for the spec. If we want to declare that some weights need to be constant, using MLConstantOperand will be apparent in the function signature, whereas using MLOperand::isConstant will bury this in the validation steps. WDYT?

Should we enforce weights to be MLConstantOperand?

If we define MLConstantOperand to be either straight from builder.constant or constant-derived, then I think it should work for almost all cases. As @a-sully mentioned above, "it's much easier to relax these restrictions than to try to impose them later".
We can add such constraints for weights & bias to the function signatures, as that provides much more clarity and consistency than the dynamically probed opSupportLimits. And we can always relax them if, say, we find a use case for lstm needing dynamic weights.
WDYT?

@huningxin
Contributor

@fdwr

Why is conv's filter weight tensor dynamic, but closely related conv_transpose requires the filter weight to be constant?

I noticed that iOS17 conv_transpose allows dynamic filter weight: https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS17.conv.conv_transpose

conv/convTranspose/gru/lstm seem the toughest ones

gru and lstm can also be emulated if non-constant weights are provided.

If we assume conv/convTranspose weights can be dynamic, then a non-constant bias can be emulated by an element-wise add operation too.

There are a number of compile-time optimizations a backend may perform if it knows that some data is constant.

Could we leave constant-tensor optimization to the implementation rather than making it an API requirement?

@philloooo
Contributor

philloooo commented Aug 28, 2024

Adding emulation when inputs are not constants for these kinds of operations:

  1. Is less efficient, and developers wouldn't know the cause.
  2. Adds a lot more implementation complexity.

So if emulation is necessary, it would be better to be done on the client side IMO?
If we don't have use cases that require dynamic weights, I would prefer to expose such a requirement, to drive users to deploy models that are more reliably performant.

@fdwr
Collaborator

fdwr commented Aug 28, 2024

I noticed that iOS17 conv_transpose allows dynamic filter weight: https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html#coremltools.converters.mil.mil.ops.defs.iOS17.conv.conv_transpose

@huningxin : Indeed, good to see that iOS relaxed that constness requirement later. Mingxian Yang mentions here that some stylegan models needed dynamic weights (for conv at least, not convTranspose). I still owe you all an analysis scanning through my local model collection to find any dynamic cases (hopefully before tomorrow's meeting).

@fdwr
Collaborator

fdwr commented Sep 10, 2024

Model Scan Results

So I told Phillis and Reilly in the last meeting that I would scan my model cache for any of the above operators with dynamic inputs (I actually did right afterward, but it took a while longer to collate the results... 😓), and I found enough cases that I request the Chromium CoreML backend support these cases via decomposition: conv2d.bias, convTranspose2d.bias, batchNormalization.*, instanceNormalization.*, layerNormalization.*, prelu.slope.

  • 1439 .onnx files scanned
  • 2066 cases where an input (into one of the above listed operators) came from non-constants, meaning not a tensor initializer nor an ONNX Constant operator (+many more weights/biases that, although unlikely to actually be overridden in practice, can technically be overridden via dynamic graph inputs).

They can be organized into these categories:

  • (1) The input is upcast from a smaller data type on demand (e.g. float16 to float32, or uint8 to float32).
  • (2) The input is bindable directly and overridable (e.g. some models that allow replacing weights and biases)
  • (3) Silly cases that should be optimized out (like identities), but that still present a reality to contend with.
  • (4) Weird cases that might be optimizable better by whatever exported the model, but nonetheless are not, like a bunch of slices/concats to construct the tensors.

(1) Interestingly some weights were upcast to float32 for compute but stored as float16 in the weights (perhaps to save model space and weight upload time):
[screenshot: float16 weights upcast to float32 via a Cast node]

Similarly, weights were quantized to save space, but inflated at runtime:
[screenshot: quantized weights dequantized at runtime]

(2) This conv bias is technically overridable/bindable at runtime. I could see some other models having similar dynamic overrides at runtime for fancier LoRA-like behavior:
[screenshot: conv bias bound as an overridable graph input]

(3) There were several silly cases with pointless Identity nodes (feeding the Conv bias in this case), but I'm not too worried about these, because a WebNN backend could forward the tensor such that if the Identity's input.isConstant was true, its output.isConstant would also be true (a basic degree of constant folding, like Phillis mentions above):
[screenshot: Identity node feeding a Conv bias]

(4) Then there were more complex cases which weren't trivial identity-elimination opportunities; maybe such cases could be resolved into something simpler, even a direct final tensor, but merely seeing that these cases exist, I have no confidence that there aren't other patterns I haven't seen:
[screenshot: slices/concats constructing a weight tensor]

For CoreML Prelu's alpha slope tensor, "The length of alpha must match the second dimension of x (channel dimension)", but there are already plenty of cases where the slope does not fit that constraint (like pictured below) and offers merely broadcast compatibility. So we need to emulate prelu there anyway, and using that existing emulation path for !isConstant too is little more work.
[screenshot: Prelu slope that is only broadcast-compatible with the input]
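
The emulation itself is small: prelu(x, slope) = max(x, 0) + slope * min(x, 0). In WebNN-builder terms, a sketch (the backend would emit the equivalent MIL element-wise ops; builder, x, and slope assumed to be in scope):

const zero = builder.constant({dataType: 'float32', dimensions: [1]},
                              new Float32Array([0]));
const preluOutput = builder.add(
    builder.max(x, zero),
    builder.mul(slope, builder.min(x, zero)));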

Some questions

  • What does it mean exactly to be a constant operand for CoreML? Which device is that on for CoreML, CPU or GPU? Does it not matter because CoreML uses UMA? It might not make a difference for CoreML, but it can for others. DML has the concept of a constant node (DML_CONSTANT_DATA_GRAPH_NODE_DESC), which affords some optimizations for certain operators, and it's not required (the tensor input can always be dynamic), but it must be CPU-side data for the preprocessing.

Replies

Adds a lot more implementation complexity.

Except for GRU and LSTM (which do add complexity), it should add only a little more complexity? E.g. for layernorm and conv, it's an if/else:

if layerNormalizationOptions.scale.isConstant && layerNormalizationOptions.bias.isConstant
    return iOS15.normalization.layer_norm(input, axes, scale, bias, epsilon)
else
    // The "gamma" (scale!) and "beta" (bias!) tensors are optional and default 1 and 0.
    return add(mul(iOS15.normalization.layer_norm(input, axes, null, null, epsilon), scale), bias)
endif
if conv2dOptions.bias.isConstant
    return iOS15.conv.conv(input, filter, ..., bias)
else
    // The dynamic bias is omitted in conv but added explicitly afterward.
    return add(iOS15.conv.conv(input, filter, ..., null), bias)
endif

Is less efficient but for developers they wouldn't know the cause.

Isn't it only less efficient for the uncommon case, while the common case executes at full efficiency? Higher-layer developers won't be able to implement the dynamic case as efficiently as the backend because (a) the backend operates closer to the metal with fewer layers, (b) if a future CoreML relaxes constant constraints, then the backend can adopt that directly, whereas callers would not know and would continue to artificially emulate, and (c) callers would unnecessarily decompose on backends that already support dynamic inputs directly if the input was defined as MLConstantOperand, because they wouldn't know it wasn't actually required to be a constant.

So if emulation is necessary, it would be better to be done on the client side IMO?

Supporting this in WebNN (rather than in clients) benefits all callers, and callers do not need to repeat code that they could potentially get wrong. Only the backend knows whether the underlying API truly requires decomposition, and it is best poised to make that decision.

I think the MLConstantOperand is more explicit for the spec.

Well, it certainly is more in-your-face by being visible in the code definition, but then other tensor requirements like data types and rank are not baked into the type name. Joshua's pending operator table for the spec could include a third row for constness (beyond data type and rank). Though, if we can handle dynamic cases fully in the backend (defaulting to the direct operator when constant and decomposing when dynamic), then maybe we don't even need this type/property externally (only internally for fast-path testing).

When deciding between properties vs. polymorphism, notice that properties honor orthogonality better than polymorphism does. Right now we're just talking about MLConstantOperand, but imagine we have a future MLAddressSwizzledOperand (a theoretical optimized operand in a tiled memory layout). Will we then need every permutation of them (MLOperand, MLConstantOperand, MLAddressSwizzledOperand, MLConstantSwizzledTiledOperand...)? Similarly, if we had (we don't, but if we did) baked data types into operands (MLFloat16Operand, MLFloat32Operand, MLInt8Operand ...) then it would become unwieldy quite quickly (MLFloat16Operand, MLConstantFloat16Operand, MLFloat32Operand, MLConstantFloat32Operand ...). Then if you wanted to extract orthogonal properties (given those 4 distinct classes above: MLOperand, MLConstantOperand, MLAddressSwizzledOperand, MLConstantSwizzledTiledOperand), would you need to test all 4 cases to extract the 2 properties?

Properties enable concrete types in function signatures. It's common to have debugging utility functions (like those I wrote to print all the MLOperand attributes to debug Stable Diffusion and other early operators), and it feels cleaner to pass a concrete MLOperand to a TypeScript function rather than use "is a" dynamic type checks; and since WebIDL can technically be called from C++ too (a strongly typed language), without even JS in the picture, it's sensible to pass a concrete type and avoid RTTI dynamic_casts to check a trait.

@a-sully
Contributor Author

a-sully commented Sep 30, 2024

Closing this issue according to the discussion at TPAC here: https://www.w3.org/2024/09/23-webmachinelearning-minutes.html#08b6

Since all CoreML operators which require const parameters are decomposable into lower-level operators which do not have const requirements, the user agent can decompose when necessary.

@a-sully a-sully closed this as completed Sep 30, 2024