
Allow no-op graphs? #614

Closed
inexorabletash opened this issue Mar 21, 2024 · 24 comments · Fixed by #665

@inexorabletash
Member

In #603, a step was added to build() to match the Chromium implementation, which errors out if an input operand or a constant operand is specified as an output.

@fdwr writes that this

...would nullify the possibility of a nop graph (say one with just a constant and output), which conceptually seems valid to me, as one scenario is where you have a chained sequence of models with one step that can be an action or a nop, and it may be easier to express by always passing a valid (but empty) graph to the component that executes them all.

But also:

I'm not too worried about it either, since a caller can always insert a dummy identity node to satisfy this constraint.

Raising this per WG telecon discussion.
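
For concreteness, here's a minimal sketch of the prohibited pattern and the identity workaround (illustrative only; a scalar constant is assumed):

const builder = new MLGraphBuilder(context);
const c = builder.constant({dataType: 'float32'}, new Float32Array([42]));

// No-op graph: a constant operand passed directly as a graph output.
// After #603, build() rejects this.
// const graph = await builder.build({output: c});

// Workaround: route the value through a dummy identity() op.
const graph = await builder.build({output: builder.identity(c)});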

@fdwr
Collaborator

fdwr commented Mar 21, 2024

Open to hearing more thoughts that could sway this. Either direction, it would be good to document this and, if we prohibit nop graphs, to remark that inserting a dummy identity is an effective workaround.

@huningxin
Contributor

say one with just a constant and output

This constant-only graph seems useful, especially for GPU or NPU devices. Once this graph is compiled, the constant data would be uploaded to dedicated device memory, and its output could be used by other graphs on the same device as an input (via MLBuffer), which would avoid cross-device data copies and memory duplication.

Today, when the ONNX Runtime WebNN EP runs inference on transformer decoders such as Whisper, two WebNN sub-graphs are built because WebNN doesn't support the If operator: one with a past key/value (KV) cache and one without. The inference code runs the sub-graph without the cache for the first iteration and the sub-graph with the past KV cache for the following iterations. The two sub-graphs actually share some common weights. In the existing implementation, the weights are uploaded by the two sub-graphs separately, causing memory duplication. If a constants-only graph were allowed, the common weights could be uploaded once through it and shared with the two decoder sub-graphs (via MLBuffer).
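
A rough sketch of that flow, assuming a constants-only graph were allowed and that a dispatch output MLBuffer can be fed to other graphs as a regular input (names, buffer usage, and dispatch shapes are illustrative):

// Upload the shared weights once, via a constants-only graph.
const weightBuilder = new MLGraphBuilder(context);
const weights = weightBuilder.constant(weightsDesc, weightsArrayBuffer);
const weightGraph = await weightBuilder.build({w: weights}); // rejected today

const sharedWeights = context.createBuffer({
  size: weightsArrayBuffer.byteLength,
  usage: MLBufferUsage.OUTPUT,
});
weightGraph.dispatch({inputs: {}, outputs: {w: sharedWeights}});

// Both decoder graphs then consume `sharedWeights` as a normal input,
// instead of each uploading its own copy of the weights.
noPastGraph.dispatch({inputs: {w: sharedWeights /* , ... */}, outputs: noPastOutputs});
withPastGraph.dispatch({inputs: {w: sharedWeights /* , ... */}, outputs: withPastOutputs});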

/cc @Honry

@a-sully
Contributor

a-sully commented Mar 25, 2024

Thanks for providing this use case, @huningxin! A few thoughts...

Once this graph is compiled, the constant data would be uploaded to dedicated device memory, and its output could be used by other graphs on the same device as an input

Taking a step back... isn't this the point of writeBuffer?

const inputMlBuffer = context.createBuffer({
  size: someArrayBuffer.byteLength,
  usage: MLBufferUsage.INPUT,
});
context.writeBuffer(inputMlBuffer, {srcData: someArrayBuffer});
graph.dispatch({inputs: inputMlBuffer, outputs});

In this case, there's no need to create an MLGraph just for the sake of loading constant values to the device memory... Is there something I'm missing here? :)

If a constants-only graph were allowed, the common weights could be uploaded once through it and shared with the two decoder sub-graphs (via MLBuffer).

Just to clarify, are you suggesting that the weights would be shared with the two decoder sub-graphs as inputs or constants?

Currently MLBuffer is only proposed as being used for inputs/outputs, though of course it would be nice to expand this to constants in the future. That being said, constants are, well, constant, so I don't expect we would allow something like this:

builder.constant(mlBuffer)
//...
graph.dispatch({inputs, outputs: mlBuffer});

This relates back to #542 (FYI @bbernhar), since using an MLBuffer as a constant() would presumably require creating it with "read-only" usage flags, whereas "output" tensors must be writable.

@bbernhar

In this case, there's no need to create an MLGraph just for the sake of loading constant values to the device memory... Is there something I'm missing here? :)

Yeah, it's for optimization. When you build a graph with constants directly, the runtime can reformat them ahead of time (AOT) for better execution performance. But this also creates a copy (in a new format). You don't want web developers reading or writing this data like normal inputs, because it's owned by the runtime and it would be illegal for them to access these "runtime owned" constant buffers.

The usage flags would need to be more specific than read-only (i.e. a graph-based or context/dispatch-based constant).

Hope that helps. =)

@a-sully
Contributor

a-sully commented Mar 25, 2024

Thanks for the additional info, @bbernhar!

I assume by "runtime owned" constant buffers you're referring to DML_TENSOR_FLAG_OWNED_BY_DML in DML?

Looking back at @huningxin's proposal:

Once this graph is compiled, the constant data would be uploaded to dedicated device memory, and its output could be used by other graphs on the same device as an input (via MLBuffer), which would avoid cross-device data copies and memory duplication

I'm still confused how compiling constant subgraphs will reduce data copies and memory duplication. Wouldn't the output of the constant sub-graphs just result in another data copy? According to the docs, the DML_TENSOR_FLAG_OWNED_BY_DML flag can only be used on input tensors. I do not expect we can create MLBuffers backed by this type of buffer (and as @bbernhar mentioned, these tensors should not be exposed to web developers)

Here's my understanding of the constant subgraph flow. Please let me know if anything is incorrect!

const constantBuilder = new MLGraphBuilder(context);

// Copies `someArrayBuffer`'s data into `constantBuilder`
const constantOperand = constantBuilder.constant(someArrayBuffer);

// Copies the data into `constantGraph`. This copy is an implementation detail,
// though it's hard to remove if `build()` can be called multiple times (see #567).
// Then the DML implementation reformats (another copy) this data into a
// "runtime owned" constant buffer
const constantGraph = await constantBuilder.build({'output': constantOperand});

// _copies_ the data to `mlBuffer`
//  - The "runtime owned" constant buffer is tied to `constantGraph`, so
//    presumably other `MLGraph` instances cannot use it?
//  - I assume it's destroyed alongside all other `constantGraph` resources when
//    `constantGraph` is GCed?
constantGraph.dispatch({inputs: /*none!*/{}, outputs: mlBuffer});

// Now, to use this constant data in other graphs...

const subGraphBuilder = new MLGraphBuilder(context);
// Build up some operands with `subGraphBuilder` to yield a `subGraph`
//  ...

// Since we're passing `mlBuffer` as an input rather than a constant, we cannot
// take advantage of the AOT reformatting optimization
subGraph.dispatch({inputs: mlBuffer, outputs: ... });

Looking again at the DML_TENSOR_FLAG_OWNED_BY_DML docs:

The effect of this flag is that DirectML makes a copy of the tensor data during initialization of an operator, storing it in the persistent resource.

This means that sharing this buffer between MLGraphs would require sharing that specific persistent resource, no?


The usage flags would be need to more specific than read-only (ie. graph or context /dispatch based constant).

Agreed, I was just trying to emphasize that "output" implies writable whereas "constant" implies read-only, so it seems unlikely that we'd allow a given MLBuffer to be used for both :)

@bbernhar

I assume by "runtime owned" constant buffers you're referring to DML_TENSOR_FLAG_OWNED_BY_DML in DML?
This means that sharing this buffer between MLGraphs would require sharing that specific persistent resource, no?

Yes, thanks for pointing this out.

Wouldn't the output of the constant sub-graphs just result in another data copy?

@huningxin may correct me here, but as I understand it, if we dispatch an MLGraph of constant() operands, the dispatched output will stay on the device, allowing us to re-use the output MLBuffer, as a normal input, in another MLGraph with the same constant data. This gets around requiring a special MLBuffer for constant().

@huningxin
Contributor

huningxin commented Mar 26, 2024

@a-sully

I'm still confused how compiling constant subgraphs will reduce data copies and memory duplication.

I meant that the shared constant data only needs to be uploaded to the constant-graph once, so there is only one copy of the constant data in device memory, shared by the two decoder graphs (no_past and with_past).

I agree it still needs a data copy when passing the constant data from the constant-graph to the following decoder graphs as an input. And in this scenario, the constant data is used by the decoder graphs as an input without the DML_TENSOR_FLAG_OWNED_BY_DML flag, which may cost some inference performance.

This means that sharing this buffer between MLGraphs would require sharing that specific persistent resource, no?

AFAIK, the persistent resource is initialized by a particular graph initializer. I am not sure a persistent resource initialized by one graph can be bound for other graphs. @fdwr ?

@huningxin
Contributor

huningxin commented Mar 26, 2024

@bbernhar

@huningxin may correct me here, but as I understand it, if we dispatch an MLGraph of constant() operands, the dispatched output will stay on the device, allowing us to re-use the output MLBuffer, as a normal input, in another MLGraph with the same constant data. This gets around requiring a special MLBuffer for constant().

It works. But as I mentioned, because the constant data is now passed as an input resource (without the DML_TENSOR_FLAG_OWNED_BY_DML flag set), this would prevent DML from optimizing that resource for inference performance. That optimization happens at graph initialization time rather than inference time.

@huningxin
Contributor

huningxin commented Mar 26, 2024

If builder.constant() can take an MLBuffer, builder.build() of graphs sharing the same weights may bind that MLBuffer to tensors set with the DML_TENSOR_FLAG_OWNED_BY_DML flag at graph initialization. This would not only eliminate the cross-device data copies (the MLBuffer only needs to be written once) but also ensure the constant data is optimized by each graph initializer.
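
Roughly, with a hypothetical constant(MLBuffer) overload (not in the spec today; names and usage flags are illustrative):

// Write the shared weights into an MLBuffer once.
const weightBuffer = context.createBuffer({
  size: weightsArrayBuffer.byteLength,
  usage: MLBufferUsage.CONSTANT, // hypothetical usage flag
});
context.writeBuffer(weightBuffer, {srcData: weightsArrayBuffer});

// Each builder binds the same MLBuffer as a constant; each graph initializer
// can then make its own DML_TENSOR_FLAG_OWNED_BY_DML-optimized copy without
// another cross-device upload.
const wNoPast = noPastBuilder.constant(weightBuffer);
const wWithPast = withPastBuilder.constant(weightBuffer);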

@a-sully
Contributor

a-sully commented Mar 26, 2024

I'm trying to square this comment

If builder.constant() can take an MLBuffer, builder.build() of graphs sharing the same weights may bind that MLBuffer to tensors set with the DML_TENSOR_FLAG_OWNED_BY_DML flag at graph initialization.

with this comment

AFAIK, the persistent resource is initialized by a particular graph initializer. I am not sure a persistent resource initialized by one graph can be bound for other graphs. @fdwr ?

Just to make sure I'm understanding correctly... Are you suggesting that the optimized constant data can be shared between these graphs? Or that the same constant data in the MLBuffer can be passed to both graphs, but then each graph will need to create its own optimized copy? I think you're saying the latter?

So in summary:

  • passing an MLBuffer as an input() allows the same data to be passed to multiple MLGraph instances (no additional copies). This data cannot be optimized during graph initialization because it's passed as an input rather than a constant.
  • passing an MLBuffer as a constant() ensures there's only one copy of the unoptimized constant data. Each MLGraph this buffer is passed to may create its own optimized copy of the constant data during graph initialization.

Returning to the original question... Where do constant-only graphs fit into this picture?

Once this graph compiled, the constant data would be uploaded to the dedicated device memory and it's output can be used by other graphs on the same device as input

Taking a step back... isn't this the point of writeBuffer?

I think this still holds? :)

@fdwr
Collaborator

fdwr commented Mar 26, 2024

I am not sure a persistent resource initialized by one graph can be bound for other graphs. @fdwr?

@huningxin: 🤔 It's not well defined - my teammate Justin Stoecker says he wouldn't rely on doing that.

@huningxin
Contributor

huningxin commented Mar 27, 2024

@a-sully

Just to make sure I'm understanding correctly... Are you suggesting that the optimized constant data can be shared between these graphs? Or that the same constant data in the MLBuffer can be passed to both graphs, but then each graph will need to create its own optimized copy? I think you're saying the latter?

I meant the latter. As @fdwr mentioned, we should not rely on the former.

Returning to the original question... Where do constant-only graphs fit into this picture?

I'd say passing an MLBuffer as a constant() is better than constant-only graph for my use case ("no_past" and "with_past" decoder graphs share common weight).

@bbernhar

I would be supportive of having a "constant MLBuffer" usage (for the purposes of re-using constant input data). We could spec it such that after builder.constant(mlBuffer), the buffer becomes read-only (e.g. no writeBuffer() or use as a dispatch() output allowed).
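
In other words (sketch only; exact API shape TBD):

context.writeBuffer(buffer, {srcData: weights}); // OK: buffer is still writable
builder.constant(buffer);                        // buffer becomes read-only
// context.writeBuffer(buffer, {srcData: other}); // now disallowed
// graph.dispatch({inputs, outputs: buffer});     // now disallowed as an output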

@a-sully @huningxin Shall we add this to the prototype TODO?

@a-sully
Contributor

a-sully commented Mar 28, 2024

SGTM 👍

@huningxin
Contributor

Shall we add this to the prototype TODO?

SGTM! Thanks @bbernhar!

@anssiko
Member

anssiko commented Apr 25, 2024

Can we close this in favour of "constant MLBuffer"?

Please check that we have migrated the relevant information from here over to the appropriate MLBuffer issue where that is tracked, and cross-link to that spec issue, and to a Chromium prototyping issue too, as appropriate.

Specifically, I haven't heard of any use cases not addressed by the constant MLBuffer proposal. We discussed this in https://www.w3.org/2024/04/18-webmachinelearning-minutes.html#t08

@fdwr
Collaborator

fdwr commented Apr 25, 2024

Can we close this in favour of "constant MLBuffer"?

I'm content either way (like I said, inserting a dummy identity suffices for my original concern). Deferring to Ningxin/Austin/Bryan/Josh.

@a-sully
Contributor

a-sully commented Apr 26, 2024

If there's no clear use case (that's not addressed by MLBuffer) then I propose not allowing no-op graphs. We can always reconsider this stance later

@huningxin
Contributor

@a-sully

If there's no clear use case (that's not addressed by MLBuffer) then I propose not allowing no-op graphs. We can always reconsider this stance later

+1

@inexorabletash

In #603, a step was added to build() to match the Chromium implementation, which errors out if an input operand or a constant operand is specified as an output.

There is another case in which an input or a constant operand can be specified as an output even though the graph has ops:

const a = builder.input('a', {dataType: 'float32'});
const b = builder.relu(a);
const graph = await builder.build({a, b});
// TypeError: Failed to execute 'build' on 'MLGraphBuilder': The operand with name "a" is not an output operand.

@inexorabletash
Member Author

Agreed that the current spec text and Chromium implementation are more restrictive than just prohibiting no-op graphs.

Are you calling out a use case (where we should reconsider the behavior) or just noting the restriction? I'd argue for keeping the restriction until a concrete use case turns up, but it's not a strongly held opinion.

@huningxin
Contributor

@inexorabletash

Are you calling out a use case (where we should reconsider the behavior) or just noting the restriction?

I was just noting that this not only restricts no-op graphs but also prevents setting graph outputs to input/constant operands. I am not aware of any use cases for the latter behavior.

I'd argue for keeping the restriction until a concrete use case turns up

+1

@inexorabletash
Member Author

Sounds like we've got consensus. Should we just close this out? Do we want to add a note to the spec e.g.:

NOTE: Specifying an input operand or constant operand as a graph output results in an error, as this is usually an incorrect usage of the API. Callers can work around this by introducing an identity() operator.

@fdwr
Collaborator

fdwr commented Apr 29, 2024

Sounds like we've got consensus. Should we just close this out? Do we want to add a note to the spec e.g.:

  • Close out. 👍
  • Nice little note, just like you wrote. 👍

@a-sully
Contributor

a-sully commented Apr 29, 2024

Closing out SGTM! Let's also not forget to add appropriate WPT coverage (FYI @BruceDai )

@inexorabletash inexorabletash self-assigned this Apr 29, 2024
inexorabletash added a commit to inexorabletash/webnn that referenced this issue Apr 29, 2024