
WebNN should support NPU and QDQ operations #623

Open
wchao1115 opened this issue Mar 27, 2024 · 16 comments
@wchao1115
Collaborator

Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms are becoming available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements to this specification:

  1. The ability to specify a device type for the NPU. Unlike more general-purpose devices such as the GPU and CPU, an NPU supports a limited, fixed set of operations with no general programmability. To keep model execution stable and predictable, the notion of a fallback device is needed to support NPU acceleration during model inference.
  2. A minimum set of operators required to support quantized models. Because most NPUs use much simpler and less power-hungry low-bit integer arithmetic units, models targeting the NPU almost always need to be quantized first. The bare minimum here is just two operators -- quantizeLinear and dequantizeLinear. These two are enough to handle quantized models by pairing them up at the right places in the model graph, the so-called tensor-oriented QDQ format used in ONNX (see the sketch after this list). Additionally, two more prominent quantized operators, one for convolution and another for matmul, i.e. conv2dInt and matmulInt, would allow more quantized models not already expressed in the QDQ format to function.
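
A minimal sketch of the tensor-oriented QDQ pattern, assuming the proposed quantizeLinear/dequantizeLinear operators and an "npu" device type are adopted as described above; the operator signatures here are illustrative, not final:

```js
// Assumes a hypothetical 'npu' device type and the proposed
// quantizeLinear/dequantizeLinear operators (signatures illustrative).
const context = await navigator.ml.createContext({ deviceType: 'npu' });
const builder = new MLGraphBuilder(context);

// int8 activation and weights, plus shared per-tensor quantization parameters.
const qInput = builder.input('qInput', { dataType: 'int8', dimensions: [1, 4] });
const qWeights = builder.constant(
    { dataType: 'int8', dimensions: [4, 4] }, new Int8Array(16).fill(1));
const scale = builder.constant(
    { dataType: 'float32', dimensions: [1] }, new Float32Array([0.02]));
const zeroPoint = builder.constant(
    { dataType: 'int8', dimensions: [1] }, new Int8Array([0]));

// QDQ pattern: dequantize -> float op -> quantize. A backend with low-bit
// integer units can fuse this triplet into a single int8 matmul kernel.
const a = builder.dequantizeLinear(qInput, scale, zeroPoint);
const b = builder.dequantizeLinear(qWeights, scale, zeroPoint);
const product = builder.matmul(a, b);
const qOutput = builder.quantizeLinear(product, scale, zeroPoint);

const graph = await builder.build({ qOutput });
```
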
@anssiko
Member

anssiko commented Mar 27, 2024

@wchao1115, thanks for this proposal that outlines key elements for NPU support. I'll schedule this important topic for discussion at our upcoming meeting.

As you noted, these topics (NPU device type, support for quantized models) have been explored in the group previously and have been awaiting implementation experience. The timing is now right for the group to reinvigorate this topic, with NPU platforms more widely available and in the hands of consumers. Most importantly, the group can now validate proposed spec designs against implementation experience per our established work mode.

I'm looking forward to this discussion. Meanwhile, questions and comments are welcome in this issue from everyone.

@philloooo
Contributor

philloooo commented Mar 28, 2024

Hi, thanks for bringing this up! I'd like to highlight a couple of things based on my implementation experience so far:

  1. I don't think we can assume that the CPU and GPU device types always support everything and that only the NPU is the outlier. In the Chromium implementation, TFLite supports only a subset of ops on its GPU backend. On CoreML, op coverage likewise follows CPU > GPU > ANE. So if we want to provide fallback, we will need it for both GPU and NPU.
  2. On CoreML, there is no option to NOT provide fallback. The computeUnits options are: {cpu, gpu, ane}, {cpu}, {cpu, gpu}, {cpu, ane}. So there is no way to target just the ANE (aka NPU).
  3. CoreML can also decide to execute an op on the CPU even when that op is supported on the ANE, if it deems that more efficient for the case at hand. So the {cpu, ane} option doesn't actually mean "only fall back to the CPU when something is not supported on the ANE"; it means "I will figure out the most efficient way to execute using the CPU and ANE". So it doesn't seem to match the current proposal well (a possible mapping is sketched below).
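
For illustration, here is a hypothetical sketch of how a browser backend might map a requested WebNN device type onto CoreML's computeUnits values; the mapping and string names are assumptions for this sketch, not an actual Chromium or CoreML API:

```js
// Hypothetical mapping from a WebNN device type to CoreML computeUnits.
// There is no ANE-only value, so an 'npu' request can at best become
// cpuAndNeuralEngine, and CPU fallback is always implied.
function toComputeUnits(deviceType) {
  switch (deviceType) {
    case 'cpu': return 'cpuOnly';
    case 'gpu': return 'cpuAndGPU';          // CPU fallback is unavoidable
    case 'npu': return 'cpuAndNeuralEngine'; // closest available; not ANE-only
    default:    return 'all';                // let CoreML pick among all units
  }
}
```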

@fdwr
Collaborator

fdwr commented Apr 18, 2024

We've discussed 3 possible options (plus a 4th from the discussion below) for extending the MLContextOptions::MLDeviceType; illustrative call shapes follow the list:

  1. deviceType: "npu" (currently prototyped in Chromium)
    ➕ Very simple API
    ➕ Least to test
    ➕ Affords backends the most control for fallback, since only the primary device preference is specified.
➖ App cannot specify the fallback preference, as the system decides any fallback devices instead (though, do web apps really know better than the system?).

  2. deviceType: "npu"
    fallbackDeviceType: "gpu"
    ➕ More flexible, as app can state secondary preference (but not a 3rd preference)
    ➖ A little more complex API, but fallbackDeviceType would be optional anyway.
➖ More to test and verify. You have to consider which combinations are valid. e.g. Is {deviceType: "npu", fallbackDeviceType: "cpu"} valid? That would likely require the graphBuilder to partition the graph per node, because NPUs have smaller core functionality and lack the operator coverage of more general-purpose ML devices. What about redundant statements like {deviceType: "gpu", fallbackDeviceType: "gpu"}? Would it ever make sense to fall back from a more capable device to a narrower one, as in {deviceType: "gpu", fallbackDeviceType: "npu"}? What happens when you specify a combination like {deviceType: "npu", fallbackDeviceType: "gpu"} that isn't supported on a backend such as CoreML computeUnits?...

  3. deviceTypes: ["npu", "gpu"]
    ➕ Most flexible, allowing control and several devices in preferred order: ["npu", "gpu", "cpu"] (this is functionally similar to the bitflags in CoreML computeUnits).
    ➖ More complicated. More to test and verify.
➖ Platforms in practice don't actually support that much flexibility anyway. See the CoreML computeUnits options from Phillis, which allow only limited permutations. On Windows, the potential combinations will differ. If WebNN permits this much control but the browser ignores it, that is misleading.

  4. deviceType: "npu", excludesDeviceTypes: ["gpu"] (from Phillis below)
    Sometimes you care more about excluding a specific device, preferring to extend battery life over using the faster but power-hungry GPU. It's similar to option 3 with a different emphasis.
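
For concreteness, the call shapes of the four options might look as follows; option 1 is the only one currently prototyped, and fallbackDeviceType, deviceTypes, and excludesDeviceTypes are hypothetical member names:

```js
const c1 = await navigator.ml.createContext({ deviceType: 'npu' });                               // option 1
const c2 = await navigator.ml.createContext({ deviceType: 'npu', fallbackDeviceType: 'gpu' });    // option 2
const c3 = await navigator.ml.createContext({ deviceTypes: ['npu', 'gpu', 'cpu'] });              // option 3
const c4 = await navigator.ml.createContext({ deviceType: 'npu', excludesDeviceTypes: ['gpu'] }); // option 4
```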

Other considerations

Error handling: If a device type does not exist at all, like asking for an NPU on a machine without one or a GPU on a headless server, then navigator.ml.createContext could fail, and the client should try again with a different device type (see the retry sketch below). Such early failure is useful, before you've constructed too much of your graph. Note that on Apple platforms via CoreML, that's not really an option, as you can't ask for the GPU or NPU without also getting CPU fallback.
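
A minimal sketch of that client-side retry, assuming createContext rejects when the requested device type is absent:

```js
let context;
try {
  context = await navigator.ml.createContext({ deviceType: 'npu' });
} catch (e) {
  // No NPU available (e.g. older hardware or a headless server):
  // retry with a different device type before building any graph.
  context = await navigator.ml.createContext({ deviceType: 'gpu' });
}
```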

Ultimate fallback: If navigator.ml.createContext succeeds for a given device type, then we should not introduce errors much later during graph node construction or the build call, which would be very hard for the caller to unwind, or at least very inefficient, as the graph would have to be recreated from scratch with a new device type. So there should be a fallback behavior, and if any backend has incomplete coverage, then we want an "ultimate fallback" backend (like a catch-all universal font in font fallback) that handles every WebNN operator -- typically the CPU backend. Currently (2024-04-19) the Chromium WebNN implementation is in the unusual state that the GPU backend has more complete operator coverage than the CPU one, but eventually the CPU backend will catch up.

Quantized operators: These are necessary for the NPU but are also independent of it, as they are useful for the GPU and CPU too.

Feedback

Feedback is welcome below. I have my preferences, but I want to hear from you, and whether any other options/considerations are missing.

@zolkis
Collaborator

zolkis commented Apr 24, 2024

Option 3 seems the best to me; it is also used in e.g. OpenVINO, and it allows a future interpretation for split/combined execution across multiple accelerators.

@philloooo
Contributor

I actually like the simplicity of option 1, as long as we make it clear in the spec that the system may decide to fall back to other devices.

The benefit of option 3 is a use case like "I want to use anything except the GPU", e.g. for load balancing. But I feel we can explore that option once we see more concrete needs from developers. A 4th option that satisfies the same need is: deviceType: "npu", excludesDeviceTypes: ["gpu"].

For now, option 1 seems like a good starting point?

@fdwr
Collaborator

fdwr commented May 8, 2024

At the 2024-05-02 meeting, the W3C group agreed to start with option 1 (reserving the option to expand later if implementation experience shows the need). Next step: spec update.

@mwyrzykowski

mwyrzykowski commented May 30, 2024

There was feedback regarding the motivation for this. Why is MLDeviceType even necessary, and shouldn't it be a browser implementation decision to choose the most appropriate processor, given the existing MLPowerPreference?

Or do we really need this, for instance, for an application with extensive WebNN <-> WebGPU interop? I read through the other related issues but couldn't find a good motivating factor.

@zolkis
Collaborator

zolkis commented May 30, 2024

Why is MLDeviceType even necessary and shouldn't it be a browser implementation decision to choose the most appropriate processor given the existing MLPowerPreference?

I think that could be possible even with the current API shape, if we spec it correctly.
If there are big differences in views over features like this, specs have (at least in the past) used conformance classes, with implementations choosing a conformance level. But in this case we may not even need that -- if we can spec a way for implementations to disregard user preferences/hints while telling whether hints were overridden or a fallback happened.

The details would be important, so please share examples of how you would like these scenarios to play out.

@inexorabletash
Member

There's been some past discussion of the use cases for an explicit device type in a few places. One that comes to mind is:

... where I ask a bunch of dumb questions about the need for MLDeviceType. I was missing the point of #322, which is about requiring the GPU device type to use an explicit WebGPU-vended device so you don't get surprising behavior. But it teases out some use cases anyway.

@mwyrzykowski

After reading over #322, @inexorabletash, I still don't fully understand the argument for the device type. Compared to WebGPU, which can be implemented purely in software without a physical GPU, why does WebNN specify a physical device type instead of leaving this up to the implementation? Rather, it would seem that specifying a wish to interop with a WebGPU device is what's desirable. That WebGPU device may be created as a purely software device, in which case running the WebNN computations on the CPU as well is desirable.

In any case, this seems best left up to the browser implementation. It is the browser implementation that ensures WebNN computations are consistent across whatever physical hardware they run on. In that scenario, MLDeviceType should be removed from the WebNN API.

@huningxin
Contributor

huningxin commented Jun 5, 2024

@mwyrzykowski

why is WebNN specifying a physical device type and not leaving this up to the implementation

AFAIK, there are two use cases that need an explicit device type:

It is the browser implementation which ensures WebNN computations are consistent across any physical hardware device it runs on.

This scenario was discussed before as a "system" or "auto" device type. There were some relevant issues: webmachinelearning/model-loader#30 and #257.

@zolkis
Collaborator

zolkis commented Jun 13, 2024

Summarizing the use cases, we seem to have the following set of constraints:

  • op sets may depend on the device (accelerator) type; fallbacks might be automatic or follow a user-defined/hinted policy.
  • a user might explicitly want to limit execution to certain device(s)/device type(s).
  • when the implementation cannot satisfy the user hint, should it fail, or may it override the hint and decide the device and possible fallbacks itself?
  • an API that attempts to serve all these use cases should not cause confusion or friction on the success and error paths.

Strictly based on the discussion above (and its references):

  • The examples cited by @huningxin could be handled by specifying deviceType: "cpu", meaning "up to CPU".

  • For supporting GPU or NPU, those could both be included under deviceType: "gpu", meaning "up to GPU, including any possible NPUs". If there is any NPU or GPU (or CPU), this would allow the implementation to choose.

  • Do we have a use case to specify "NPU or CPU" only? Then we could introduce deviceType: "npu".

  • For the use case of letting the implementation choose the best device, we could (re-)introduce deviceType: "auto". This might cover fallbacks in any order. But AFAIU, this would be the same as deviceType: "gpu", which could be the default (a sketch of this interpretation follows below).

Did I miss something?
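
A minimal sketch of the "up to" interpretation above, where each requested type names the most capable device class allowed and the implementation picks within it; the mapping is hypothetical:

```js
const allowedDevices = {
  cpu:  ['cpu'],                // CPU only
  npu:  ['npu', 'cpu'],         // "NPU or CPU" only
  gpu:  ['gpu', 'npu', 'cpu'],  // "up to GPU, including any possible NPUs"
  auto: ['gpu', 'npu', 'cpu'],  // same as 'gpu'; implementation chooses
};
```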

@mwyrzykowski

Another common use case would be the desire to run workloads on the GPU and ANE (and possibly CPU) simultaneously.

@zolkis
Collaborator

zolkis commented Jun 13, 2024

Another common use case would be the desire to run workloads on the GPU and ANE (and possibly CPU) simultaneously.

Assuming the interpretation above, that could fall under a deviceType: "gpu" context, with the split decided by the implementation.

If we'd like to make that explicit, together with the fallback option (and eventually an error when we want one), we'd need option 3 from the list above, but it has downsides/complications.

@mwyrzykowski

mwyrzykowski commented Jun 13, 2024

It would seem preferable to keep it implicit, because it would be hard for a website, especially in a privacy-preserving manner, to make the correct decision for a given scenario.

@fdwr
Collaborator

fdwr commented Jun 13, 2024

It would seem preferable to keep it implicit, because it would be hard for a website, especially in a privacy-preserving manner, to make the correct decision for a given scenario.

Note that websites won't be the only clients: installable web apps can also run locally via WebNN, and they know their scenarios better. In any case, the device type is a hint, not a requirement. Apps can also leave MLDeviceType at its default, in conjunction with a power preference, to let the implementation do as it pleases (see the sketch below).
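
A minimal sketch of that last point, leaving the device type at its default and expressing only a power preference, which the implementation may act on as it sees fit:

```js
// MLPowerPreference is a hint; the implementation picks the device(s).
const context = await navigator.ml.createContext({ powerPreference: 'low-power' });
```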
