Context-based graph execution methods for different threading models. #257

Merged
wchao1115 merged 13 commits into main from graph_execution_context on Apr 30, 2022

Conversation

wchao1115
Collaborator

@wchao1115 wchao1115 commented Mar 10, 2022

Alternative design to #255 for various graph execution methods, with both immediate and async execution methods folded into the context interface, making it easier for developers to associate an execution method with the context created with different options. The queued execution method, however, is left to a separate MLCommandEncoder interface, as it requires a multi-step calling pattern that ends with the finish method, consistent with the WebGPU command encoder calling pattern.
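For readers skimming the thread, here is a rough usage sketch of the two calling patterns being proposed. This is a minimal sketch in TypeScript: the WebNN objects are typed loosely, and every method name and signature below is an assumption drawn from this PR's text rather than a final API.

```ts
// Sketch of the two execution patterns described above. All WebNN names and
// signatures here are assumptions taken from this PR's text, not a final API.
type AnyML = any; // WebNN objects typed loosely; the exact IDL is what this PR iterates on

// Immediate (sync) and async execution folded into MLContext:
function runSyncOnWorker(context: AnyML, graph: AnyML, inputs: AnyML, outputs: AnyML): void {
  // Sync variant: per the discussion below, only valid on a worker thread.
  context.compute(graph, inputs, outputs);
}

async function runAsync(context: AnyML, graph: AnyML, inputs: AnyML, outputs: AnyML): Promise<void> {
  // Async variant: usable on the main thread or a worker.
  await context.computeAsync(graph, inputs, outputs);
}

// Queued execution via MLCommandEncoder, mirroring WebGPU's command-encoder pattern:
function encodeForGpuQueue(context: AnyML, graph: AnyML, inputs: AnyML,
                           outputs: AnyML, gpuDevice: AnyML): void {
  const encoder = context.createCommandEncoder(); // assumed factory method on the context
  encoder.initializeGraph(graph);                 // one-time graph/constant initialization (signature under discussion in this thread)
  encoder.dispatch(graph, inputs, outputs);       // record the graph execution
  const commandBuffer = encoder.finish();         // the multi-step pattern ends with finish()
  gpuDevice.queue.submit([commandBuffer]);        // executed on the WebGPU queue
}
```

The split mirrors WebGPU: one-shot work goes through the context directly, while queued work is recorded by an encoder and submitted on the GPU queue by the caller.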



index.bs Outdated

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLCommandEncoder {
undefined initializeGraph(MLGraph graph, MLNamedGPUInputs inputs);
Member

We may want to name the inputs arguments passed to initializeGraph() and dispatch() differently, to make it clearer that their semantics differ. I think this would improve developer ergonomics.

Collaborator Author

I will add more descriptive text describing these two methods in more detail. I don't think just changing the param name will make any material difference.

Collaborator Author

BTW, I encourage other reviewers to take a look at this PR and give their opinion on it relative to #255. I hope we can come to an agreement on this important technical topic soon. It has been a while in the making.

Collaborator Author

Please look at the latest commit. I've removed MLCommandEncoder and let the computeAsync call handle both the CPU and GPU async use cases, per discussion with @RafaelCintron and @bbernhar. For details, please refer to this reply: #255 (comment)

index.bs Outdated
{{MLContext/compute()}} method represents a way the execution of the graph is carried out immediately
on the calling thread, which must also be a worker thread. The execution produces the results of the computation
from all the inputs bound to the graph. This type of execution is limited only to when the computational device
bound to the context is a CPU device.
Contributor

as noted in #255, I'm still unsure why this is limited to CPU execution.

That restriction means the only way to use that API is in a try … catch structure with a fallback to the async version, since the developer has no way to know if the MLContext it obtained is bound to a CPU or not (afaik).

Given that one of the premises of having the sync-version is to make it usable when converting sync-only code-bases, I don't think this fulfills the requirements.
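To make that concrete, this is the kind of defensive pattern a developer would be pushed into if the device type is not observable from the context. A minimal sketch, assuming the compute/computeAsync names from this PR and that the sync call throws on a non-CPU context:

```ts
// Sketch: the defensive fallback pattern a developer would need if compute() can
// throw depending on a device type that isn't observable from the MLContext.
// WebNN names/signatures are assumptions based on the PR discussion.
async function runWithFallback(context: any, graph: any, inputs: any, outputs: any): Promise<void> {
  try {
    // May throw if the context turned out not to be CPU-backed.
    context.compute(graph, inputs, outputs);
  } catch {
    // Fall back to the async path, which works regardless of device type.
    await context.computeAsync(graph, inputs, outputs);
  }
}
```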

Collaborator Author

@dontcallmedom GPU execution is inherently async, but not on the app's timeline; it's async on the GPU timeline. So having a sync call on a GPU device would mean stalling the CPU until the GPU queue has a chance to be executed. That is undesirable, and none of our known use cases would do that.

Contributor

Given that this would happen on a DedicatedWorker, I don't think having the worker blocked from doing other CPU work is particularly an issue (the developer can choose to use that worker only for things that need to wait for the ML processing to happen, or they should use async). It's similar to what happens when using the FileReaderSync methods.

I see that you argue in https://github.com/webmachinelearning/webnn/pull/251/files#r832248425 that device selection should not be a preference or a hint, but be normative; if so, that would address the gist of my concern in terms of API shape, although I'm not sure that interpretation is in line with what we've been discussing so far (or at least with what I've understood of our intents).

Collaborator Author

The TF.js use case on a GPU device is such that the execution call is made on the main thread and not the worker thread; the compute result, however, is deferred until the GPU executes the work and the readback occurs. The reason the sync API is limited to the worker thread serves a different purpose: even for CPU execution, the main thread should never block.

Again, this change assumes that the app knows what type of MLContext it creates. WebNN makes that assumption.

Contributor

the app knows what type of MLContext it creates. WebNN makes that assumption.

To clarify my point - I'm not arguing here (as I did in #255) that the MLContext may be coming from a third-party library. My assertion is that as a developer, even if I set MLDevicePreference to CPU, the spec gives no guarantee that I'll get back a CPU-based MLContext, and doesn't even expose whether the MLContext I got back is CPU-based or not. So as a developer, I don't know whether I can execute the sync version of the API, except by actually executing it and getting an exception (which I don't think is a practical API to use).

Contributor

As I commented in #255, I think it should support GPU as well. It would allow a caller that requires a sync API to also access the GPU, e.g., the TFLite WebNN delegate. When the compute device is a GPU, it would block the calling thread until the result is read back into the output ArrayBuffers. This would not impact the responsiveness of the main (UI) thread, since its usage is restricted to the DedicatedWorker.

Collaborator Author

I'll defer this to @RafaelCintron, who insisted that a GPU sync call is bad for the user experience.

Collaborator Author

let's bring this to a separate issue - I think I would keep "cpu" as a name and instead make it the WebIDL-default value of MLContextOptions.devicePreference; but my sense is that the primary question around such a change is more about the impact of making devicePreference no longer a hint but a setting.

In the latest commit, I've removed the default option and renamed MLDevicePreference to MLDeviceType to make it clearer that it is not merely a hint. @dontcallmedom Please take a look.

Contributor

If we agree the sync call is not bound to the CPU device, I would suggest keeping the default option that allows the user agent to select the best device (CPU, GPU, or AI accelerator) for the application. And for this type of MLContext, always use ArrayBufferView as compute input and output. The sync version of compute is restricted to workers to avoid blocking the UI thread. The async version of compute is allowed in both the main thread and workers. I think this design would support the most common usages.

Collaborator Author

The way you explain it here is not really a "default" but rather an "auto" option, i.e. automatic device selection. This is a hard subject with potential backward and forward compatibility guarantees involved. Also, the definition of "best" varies greatly depending on what the application is trying to do. Generally speaking, this type of policy behavior should not belong in a backend component like the WebNN API. Rather, it should be part of the application logic, or live in the framework on behalf of the application.

"default" and "auto" aren't really the same thing. A default is a well-defined behavior chosen when nothing more explicit is specified, while the behavior of auto is dynamic, implying a notion of "best fit", a more open-ended policy.

I prefer that we discuss automatic device selection as a separate issue.

index.bs Outdated
MLPowerPreference powerPreference = "default";
GPUDevice gpuDevice = null;
WebGLRenderingContext glContext = null;
Contributor

With this change, a developer may set conflicting options, e.g., {deviceType: 'cpu', gpuDevice: device}. The existing design doesn't let developers create such a conflicting case. These options are used exclusively to create ML contexts of three types: default, WebGPU-based, and WebGL-based. I suppose the existing design reflects the usages more clearly.

Collaborator Author

@wchao1115 wchao1115 Mar 28, 2022

When deviceType is set to anything but "gpu", the gpuDevice, even if given, is ignored. I don't think it is uncommon to have cross-member dependencies, considering that every field in MLContextOptions is already optional. By doing it this way, we avoid having multiple overloaded createContext methods on the ML interface.
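A small sketch of the cross-member behavior described here. The member names mirror the quoted MLContextOptions; the "ignored when deviceType is not gpu" behavior is as stated in this comment, not normative text:

```ts
// Sketch: option combinations under the single-dictionary design quoted above.
// Member names mirror the quoted MLContextOptions; the "ignored when not gpu"
// behavior is as described in this comment, not normative text.
declare const ml: any;        // navigator.ml
declare const device: any;    // a WebGPU GPUDevice obtained elsewhere

const cpuCtx = ml.createContext({ deviceType: 'cpu' });
const gpuCtx = ml.createContext({ deviceType: 'gpu', gpuDevice: device });
// Conflicting combination: deviceType wins and gpuDevice is ignored.
const alsoCpu = ml.createContext({ deviceType: 'cpu', gpuDevice: device });
```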

index.bs Outdated
<tr><td>cpu<td>Yes<td>No<td>No<td>No<td>No
<tr><td>gpu (GPUDevice == null)<td>Yes<td>No<td>No<td>No<td>No
<tr><td>gpu (GPUDevice != null)<td>Yes<td>Yes<td>Yes<td>No<td>No
Contributor

If the MLContext is created from a GPUDevice, I would suggest it only accept GPUBuffer and GPUTexture. This design would help simplify the spec and implementation of the WebNN API and leave ArrayBufferView upload/readback to the WebGPU API, where it is already well defined.

@huningxin
Contributor

huningxin commented Mar 28, 2022

To summarize my points, there should be three types of MLContext that support three usages respectively:

| Usage | Creation Method | Device Type | Execution Method | Buffer Type |
|---|---|---|---|---|
| Default | Create from an MLContextOptions | CPU or GPU, device selected by device and power preferences | async computeAsync (both main thread and worker); sync compute (worker only) | ArrayBufferView |
| WebGPU interop | Create from a GPUDevice | GPU, device selected by WebGPU API | submit MLGraph as a GPUCommandBuffer to GPUQueue | GPUBuffer and GPUTexture |
| WebGL interop | Create from a WebGLRenderingContext | GPU, device selected by WebGL API | sync compute (both main thread and worker)? | WebGLBuffer and WebGLTexture |

I have an open question about the execution method for the WebGL interop usage: should it be a sync API available in both the main thread and workers, given that GPU execution is async by nature? And how would that work and be implemented? It seems to me there is a lack of investigation on WebGL interop. So should we remove the WebGL interop support and leave it for a follow-up PR?
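For illustration, a compact sketch of the three creation paths tabulated above; whether createContext is overloaded exactly like this in the final IDL is an assumption:

```ts
// Sketch of the three context-creation paths tabulated above. The overload shapes
// and option member names are assumptions drawn from this thread.
declare const ml: any;          // navigator.ml
declare const gpuDevice: any;   // a WebGPU GPUDevice
declare const glContext: any;   // a WebGLRenderingContext

const defaultCtx = ml.createContext({ deviceType: 'cpu' }); // ArrayBufferView I/O
const webgpuCtx  = ml.createContext(gpuDevice);             // GPUBuffer / GPUTexture I/O
const webglCtx   = ml.createContext(glContext);             // WebGLBuffer / WebGLTexture I/O
```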

@wchao1115
Collaborator Author

wchao1115 commented Mar 28, 2022

@huningxin The default device option is problematic for our use case because the caller doesn't know what type of device the user agent has selected. That type of design, based on behavior hints, is fine as long as the user agent retains full control of all the related behaviors once it decides to select a certain type of device. But in our case, the caller must choose the execution method that is best for their situation. This means the caller must be conscious of their choice of device type and therefore cannot defer that decision to the user agent. @dontcallmedom expressed this concern in his review, and I agreed.

On a separate issue, a design that allows synchronous GPU execution should be avoided because the CPU could be left needlessly stalled for a long period of time. Similarly, allowing synchronous CPU execution on the main thread should also be avoided as it could disrupt the user experience. The synchronous compute method is carefully designed to only allow CPU execution on a worker thread.

@huningxin
Contributor

@wchao1115, my point is that the spec should allow developers to specify the intent "select the best device automatically". For that use case, the "auto" proposal from @yuhonglin is probably a good fit.

On a separate issue, a design that allows synchronous GPU execution should be avoided because the CPU could be left needlessly stalled for a long period of time.

I don't think this is an issue, because the sync compute method is restricted to the worker thread. Stalling a worker thread won't impact the responsiveness of the UI, which runs on the main thread. If developers don't want the CPU to wait for GPU execution, they are free to use the async compute method.

As I understand it, the sync compute method is required to support framework code bases that expect a sync API call in a backend, e.g., the TFLite WebNN delegate. The spec shouldn't restrict this type of framework to CPU execution only.

@wchao1115
Collaborator Author

Can we keep the 'auto' proposal as a separate issue for now? This is a very hard topic that, to my knowledge, none of the leading cross-vendor ML platforms has fully solved. The only known case where this is supported today is CoreML, and that only works on the Apple M1 hardware platform.

As for supporting a sync call on the GPU device on the worker thread, I would like to understand a bit more about the current use case. Since GPU execution is async by nature (but on the GPU timeline), it seems odd that a worker thread would want to block on it.

@huningxin
Contributor

huningxin commented Mar 29, 2022

As for supporting a sync call on the GPU device on the worker thread, I would like to understand a bit more about the current use case.

As I mentioned, the sync compute call is required by a framework backend that infers a sub-graph synchronously. For example, the TFLite WebNN delegate needs to implement the sync SimpleDelegateKernelInterface::Eval method for sub-graph inference. Besides TFLite, as I reported in my presentation at TPAC, the prototypes of the ONNX Runtime WebNN EP and the OpenCV.js WebNN backend use the sync compute method. The spec should allow these frameworks to use the GPU in their WebNN backend implementations.
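For context, a sketch of why a synchronous backend entry point forces a synchronous compute call. The delegate-style class is purely illustrative and the WebNN signatures are assumptions:

```ts
// Sketch: a synchronous framework backend entry point (a delegate-style Eval) running
// in a DedicatedWorker can only call a synchronous WebNN compute -- there is nowhere
// to await. WebNN signatures are assumptions; the class is purely illustrative.
class WebNNSubgraphKernel {
  constructor(private context: any, private graph: any) {}

  // Sync Eval-style entry point, as a TFLite delegate or ONNX Runtime EP would expose.
  eval(inputs: Record<string, Float32Array>, outputs: Record<string, Float32Array>): void {
    // Blocks this worker thread until results are read back into the output buffers.
    this.context.compute(this.graph, inputs, outputs);
  }
}
```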

@wchao1115
Collaborator Author

As I mentioned, the sync compute call is required by a framework backend that infers a sub-graph synchronously. For example, the TFLite WebNN delegate needs to implement the sync SimpleDelegateKernelInterface::Eval

Thanks @huningxin. I think it makes sense. Can we assume that the TFLite delegate use case only happens in a worker thread?

@huningxin
Contributor

@wchao1115

Can we assume that the TFLite delegate use case only happens in a worker thread?

I suppose so. People have experienced unresponsiveness issues when inferring large models with the TF.js Wasm backend (the same applies to the TFLite Wasm backend) on the main thread, and have requested moving all heavy processing to a worker thread. /cc @pyu10055

Once this PR lands, we can update the design of the TFLite WebNN delegate to reflect this restriction. WDYT?

@wchao1115
Collaborator Author

Ok. Sounds good. I'll make that change.

…text only supports CPU inputs and outputs (automatic upload/download). Reintroduce MLCommandEncoder for WebGPU interop.
@wchao1115
Collaborator Author

wchao1115 commented Apr 18, 2022

Commit ef9262b contains the following changes. Hopefully this is the last for this PR.

  1. Allow GPU sync execution (limited to a worker calling thread). I spoke with @RafaelCintron and he has signed off on that.
  2. Both sync and async execution only accept ArrayBufferView -- meaning a default context (one created with MLContextOptions) always uploads/downloads inputs/outputs if needed by the GPU device.
  3. Device type must be explicit, defaulting to the CPU device. We assume that the caller of WebNN always deals with a specific device type, as deferring that choice to the user agent would only make resource management more complicated and less efficient.
  4. Reintroduce MLCommandEncoder. After much back and forth, I think a separate context type that explicitly deals with WebGPU interop is the most practical and clean approach. An implementation that can't support this requirement should fail context creation. Also, after much deliberation, I prefer to stick with the original proposal of MLCommandEncoder producing a WebGPU-compatible GPUCommandBuffer, as opposed to a WebNN-specific command buffer. We need to follow up with the WebGPU CG on how to implement WebNN in a way that produces a compatible GPUCommandBuffer.

This change is unlikely to make everyone happy; that solution probably doesn't exist. But I hope it's a reasonable enough compromise where everyone gets something they want. The net result of completing this change at this stage of the draft spec is certainly far better than nothing.
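Putting points 1-3 together, a rough sketch of the "default context" path inside a DedicatedWorker, assuming the method names summarized above:

```ts
// Sketch of the "default context" path after this commit, inside a DedicatedWorker:
// explicit device type, ArrayBufferView inputs/outputs with implicit upload/download,
// and sync compute allowed even for a GPU device. Names/signatures are assumptions.
declare const graph: any; // an MLGraph built earlier with MLGraphBuilder

const ctx = (navigator as any).ml.createContext({ deviceType: 'gpu' }); // explicit device type
const input = new Float32Array(224 * 224 * 3);
const output = new Float32Array(1000);

// Sync: blocks this worker until GPU execution completes and the result is
// downloaded back into `output`. The async computeAsync() variant is the
// alternative for the main thread.
ctx.compute(graph, { input }, { output });
```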

Contributor

@huningxin huningxin left a comment

Thanks much @wchao1115 for this update.

Allow GPU sync execution (limited to a worker calling thread).

+1

Both sync and async execution only accept ArrayBufferView

+1

Device type must be explicit, defaulting to the CPU device.

+1

Reintroduce MLCommandEncoder.

+1, with an open question on constants initialization. Please take a look.

… clarify graph initialization stage and remove the unnecessary second param.
@bbernhar

@wchao1115

I like the use of MLCommandBuffer, but I suggest we offer it for WebNN interop. WebGPU interop wasn't meant to be a means of gatekeeping GPU access on the web. Changing the WebGPU execution model by allowing ML work to execute under WebGPU, and requiring a new means of WebGPU interop by sharing command buffers with WebNN, are just what-ifs and far from practical. Does WebGPU do this for every API?

Since WebNN has no actual dependency on WebGPU's execution model, justifying this to the WebGPU WG will be very hard and is unlikely to succeed anyway, whereas WebNN interop would allow progress...

@huningxin
Contributor

huningxin commented Apr 27, 2022

@bbernhar

Does WebGPU do this for every API?

I suppose not. I think use cases such as video background blur on the GPU (#249) would drive the requirement for WebGPU interop. Per the investigation of that use case, apps can build this with a WebGPU-only pipeline. They may want to plug WebNN into that pipeline for better ML compute performance. The MLCommandEncoder proposal allows this usage.
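A sketch of the kind of pipeline integration being referred to, where ML work recorded through MLCommandEncoder is interleaved with WebGPU passes on the same queue. The MLCommandEncoder method names follow this PR; the surrounding pipeline and resource names are illustrative assumptions:

```ts
// Sketch: plugging WebNN into a WebGPU post-processing pipeline (e.g. background blur).
// The MLCommandEncoder method names follow this PR; the surrounding pipeline, resource
// names, and the idea of sharing these particular resources are illustrative assumptions.
declare const gpuDevice: any;          // the GPUDevice driving the rendering pipeline
declare const mlContext: any;          // an MLContext created for WebGPU interop
declare const segmentationGraph: any;  // an MLGraph producing a segmentation mask
declare const inputTexture: any;       // GPU resource holding the current video frame
declare const maskBuffer: any;         // GPU resource the blur pass will consume

function encodeFrame(): any[] {
  // Record the ML work as a WebGPU-compatible command buffer...
  const mlEncoder = mlContext.createCommandEncoder();
  mlEncoder.dispatch(segmentationGraph, { frame: inputTexture }, { mask: maskBuffer });
  const mlCommands = mlEncoder.finish();

  // ...and interleave it with ordinary WebGPU passes on the same queue.
  const gpuEncoder = gpuDevice.createCommandEncoder();
  // ... encode the blur/composite passes that read `maskBuffer` ...
  const renderCommands = gpuEncoder.finish();

  return [mlCommands, renderCommands]; // e.g. gpuDevice.queue.submit(encodeFrame())
}
```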

I like the use of MLCommandBuffer but I suggest we offer it for WebNN Interop.

WebNN interop is a good perspective. As we discussed in the last WG call, @wchao1115 has already done a great job in this PR of giving a clear separation between WebNN standalone usage and WebGPU interop. I believe it will allow more efficient iteration on each usage respectively. So I would suggest merging this PR first and filing a separate issue for the WebNN interop proposal.

WDYT?

@bbernhar

@huningxin

Interop can go both ways. Plugging WebGPU command buffers into WebNN achieves the exact same result, except it doesn't break WebGPU and you can progress immediately. If not, I suggest we define it as MLExternalCommandBuffer (read-only) to denote that such usage is one-way.

@huningxin
Contributor

Thanks @bbernhar for your feedback. I am interested in learning more. Could you please file a separate issue so we can focus on it there?

Contributor

@huningxin huningxin left a comment

Looks good, thanks again to @wchao1115 for the great work.

@wchao1115 wchao1115 merged commit 4c5b70b into main Apr 30, 2022
@wchao1115 wchao1115 deleted the graph_execution_context branch April 30, 2022 18:03
@bbernhar

bbernhar commented May 2, 2022

@huningxin Done.

#264
