
WebNN should support NPU and QDQ operations #623

Open
wchao1115 opened this issue Mar 27, 2024 · 16 comments
@wchao1115
Collaborator

Related to issues #128 and #302, we've been talking about supporting the NPU for the last few years. Now that more commercial NPU platforms are becoming available (e.g. with the recent arrival of the Intel Core Ultra NPU), it is time to formally define NPU support in the WebNN spec. There are two key elements to this specification:

  1. The ability to specify a device type for the NPU. Unlike more general-purpose devices such as the GPU and CPU, an NPU supports a limited, fixed set of operations with no general programmability. To keep model execution stable and predictable, the notion of a fallback device is needed to support NPU acceleration during model inference.
  2. A minimum set of operators required to support quantized models. Because most NPUs use much simpler and less power-hungry low-bit integer arithmetic units, models targeting the NPU almost always need to be quantized first. The bare minimum here is just two operators -- quantizeLinear and dequantizeLinear. These two are enough to handle quantized models by pairing them up at the right places in the model graph, the so-called tensor-oriented QDQ format used in ONNX (see the sketch after this list). Additionally, two more prominent quantized operators, one for convolution and another for matmul, i.e. conv2dInt and matmulInt, would allow more quantized models not already expressed in the QDQ format to function.
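
A minimal sketch of the tensor-oriented QDQ pattern, assuming the proposed quantizeLinear/dequantizeLinear operators and an "npu" device type are adopted as described above; the operator signatures here are illustrative, not final:

```js
// Assumes a hypothetical 'npu' device type and the proposed
// quantizeLinear/dequantizeLinear operators (signatures illustrative).
const context = await navigator.ml.createContext({ deviceType: 'npu' });
const builder = new MLGraphBuilder(context);

// int8 activation and weights, plus shared per-tensor quantization parameters.
const qInput = builder.input('qInput', { dataType: 'int8', dimensions: [1, 4] });
const qWeights = builder.constant(
    { dataType: 'int8', dimensions: [4, 4] }, new Int8Array(16).fill(1));
const scale = builder.constant(
    { dataType: 'float32', dimensions: [1] }, new Float32Array([0.02]));
const zeroPoint = builder.constant(
    { dataType: 'int8', dimensions: [1] }, new Int8Array([0]));

// QDQ pattern: dequantize -> float op -> quantize. A backend with low-bit
// integer units can fuse this triplet into a single int8 matmul kernel.
const a = builder.dequantizeLinear(qInput, scale, zeroPoint);
const b = builder.dequantizeLinear(qWeights, scale, zeroPoint);
const product = builder.matmul(a, b);
const qOutput = builder.quantizeLinear(product, scale, zeroPoint);

const graph = await builder.build({ qOutput });
```
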
@anssiko
Member

anssiko commented Mar 27, 2024

@wchao1115, thanks for this proposal that outlines key elements for NPU support. I'll schedule this important topic for discussion at our upcoming meeting.

As you noted, these topics (NPU device type, support for quantized models) have been explored in the group previously and have been awaiting implementation experience. The timing is now right for the group to reinvigorate this topic, with NPU platforms more widely available and in the hands of consumers. Most importantly, the group can now validate proposed spec designs against implementation experience per our established work mode.

I'm looking forward to this discussion. Meanwhile, questions and comments are welcome in this issue from everyone.

@philloooo
Contributor

philloooo commented Mar 28, 2024

Hi, thanks for bringing this up! I'd like to highlight a couple of things based on my implementation experience so far:

  1. I don't think we can assume that the CPU and GPU device types always support everything and that only the NPU is the outlier. In the Chromium implementation, TFLite supports only a subset of ops on its GPU backend. On CoreML, op coverage likewise follows CPU > GPU > ANE. So if we want to provide fallback, we will need it for both GPU and NPU.
  2. On CoreML, there is no option to NOT provide fallback. The computeUnits options are: {cpu, gpu, ane}, {cpu}, {cpu, gpu}, {cpu, ane}. So there is no way to target just the ANE (aka NPU).
  3. CoreML can also decide to execute an op on the CPU even when that op is supported on the ANE, if it deems that more efficient for the case at hand. So the {cpu, ane} option doesn't actually mean "only fall back to the CPU when something is not supported on the ANE"; it means "I will figure out the most efficient way to execute using the CPU and ANE". So it doesn't seem to match the current proposal well (a possible mapping is sketched below).
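
For illustration, here is a hypothetical sketch of how a browser backend might map a requested WebNN device type onto CoreML's computeUnits values; the mapping and string names are assumptions for this sketch, not an actual Chromium or CoreML API:

```js
// Hypothetical mapping from a WebNN device type to CoreML computeUnits.
// There is no ANE-only value, so an 'npu' request can at best become
// cpuAndNeuralEngine, and CPU fallback is always implied.
function toComputeUnits(deviceType) {
  switch (deviceType) {
    case 'cpu': return 'cpuOnly';
    case 'gpu': return 'cpuAndGPU';          // CPU fallback is unavoidable
    case 'npu': return 'cpuAndNeuralEngine'; // closest available; not ANE-only
    default:    return 'all';                // let CoreML pick among all units
  }
}
```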

@fdwr
Collaborator

fdwr commented Apr 18, 2024

We've discussed 3 possible options (plus a 4th from the discussion below) for extending the MLContextOptions::MLDeviceType; illustrative call shapes follow the list:

  1. deviceType: "npu" (currently prototyped in Chromium)
    ➕ Very simple API
    ➕ Least to test
    ➕ Affords backends the most control for fallback, since only the primary device preference is specified.
➖ App cannot specify the fallback preference, as the system decides any fallback devices instead (though, do web apps really know better than the system?).

  2. deviceType: "npu"
    fallbackDeviceType: "gpu"
    ➕ More flexible, as app can state secondary preference (but not a 3rd preference)
    ➖ A little more complex API, but fallbackDeviceType would be optional anyway.
➖ More to test and verify. You have to consider which combinations are valid. e.g. Is {deviceType: "npu", fallbackDeviceType: "cpu"} valid? That would likely require the graphBuilder to partition the graph per node, because NPUs have smaller core functionality and lack the operator coverage of more general-purpose ML devices. What about redundant statements like {deviceType: "gpu", fallbackDeviceType: "gpu"}? Would it ever make sense to fall back from a more capable device to a narrower one, as in {deviceType: "gpu", fallbackDeviceType: "npu"}? What happens when you specify a combination like {deviceType: "npu", fallbackDeviceType: "gpu"} that isn't supported on a backend such as CoreML computeUnits?...

  3. deviceTypes: ["npu", "gpu"]
    ➕ Most flexible, allowing control and several devices in preferred order: ["npu", "gpu", "cpu"] (this is functionally similar to the bitflags in CoreML computeUnits).
    ➖ More complicated. More to test and verify.
➖ Platforms in practice don't actually support that much flexibility anyway. See the CoreML computeUnits options from Phillis, which allow only limited permutations. On Windows, the potential combinations will differ. If WebNN permits this much control but the browser ignores it, that is misleading.

  4. deviceType: "npu", excludesDeviceTypes: ["gpu"] (from Phillis below)
    Sometimes you care more about excluding a specific device, preferring to extend battery life over using the faster but power-hungry GPU. It's similar to option 3 with a different emphasis.
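
For concreteness, the call shapes of the four options might look as follows; option 1 is the only one currently prototyped, and fallbackDeviceType, deviceTypes, and excludesDeviceTypes are hypothetical member names:

```js
const c1 = await navigator.ml.createContext({ deviceType: 'npu' });                               // option 1
const c2 = await navigator.ml.createContext({ deviceType: 'npu', fallbackDeviceType: 'gpu' });    // option 2
const c3 = await navigator.ml.createContext({ deviceTypes: ['npu', 'gpu', 'cpu'] });              // option 3
const c4 = await navigator.ml.createContext({ deviceType: 'npu', excludesDeviceTypes: ['gpu'] }); // option 4
```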

Other considerations

Error handling: If a device type does not exist at all, like asking for an NPU on a machine without one or a GPU on a headless server, then navigator.ml.createContext could fail, and the client should try again with a different device type (see the retry sketch below). Such early failure is useful, before you've constructed too much of your graph. Note that on Apple platforms via CoreML, that's not really an option, as you can't ask for the GPU or NPU without also getting CPU fallback.
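
A minimal sketch of that client-side retry, assuming createContext rejects when the requested device type is absent:

```js
let context;
try {
  context = await navigator.ml.createContext({ deviceType: 'npu' });
} catch (e) {
  // No NPU available (e.g. older hardware or a headless server):
  // retry with a different device type before building any graph.
  context = await navigator.ml.createContext({ deviceType: 'gpu' });
}
```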

Ultimate fallback: If navigator.ml.createContext succeeds for a given device type, then we should not introduce errors much later during graph node construction or the build call, which would be very hard for the caller to unwind, or at least very inefficient, as the graph would have to be recreated from scratch with a new device type. So there should be a fallback behavior, and if any backend has incomplete coverage, then we want an "ultimate fallback" backend (like a catch-all universal font in font fallback) that handles every WebNN operator -- typically the CPU backend. Currently (2024-04-19) the Chromium WebNN implementation is in the unusual state that the GPU backend has more complete operator coverage than the CPU one, but eventually the CPU backend will catch up.

Quantized operators: These are necessary for the NPU but are also independent of it, as they are useful for the GPU and CPU too.

Feedback

Feedback is welcome below. I have my preferences, but I want to hear from you, and whether any other options/considerations are missing.

@zolkis
Collaborator

zolkis commented Apr 24, 2024

Option 3 seems the best to me; it is also used in e.g. OpenVINO, and it allows a future interpretation for split/combined execution across multiple accelerators.

@philloooo
Contributor

I actually like the simplicity of option 1, as long as we make it clear in the spec that the system may decide to fall back to other devices.

The benefit of option 3 is a use case like "I want to use anything except the GPU", e.g. for load balancing. But I feel we can explore that option once we see more concrete needs from developers. A 4th option that satisfies the same need is: deviceType: "npu", excludesDeviceTypes: ["gpu"].

For now, option 1 seems like a good starting point?

@fdwr
Collaborator

fdwr commented May 8, 2024

At the 2024-05-02 meeting, the W3C group agreed to start with option 1 (reserving the option to expand later if implementation experience shows the need). Next step: spec update.

@mwyrzykowski

mwyrzykowski commented May 30, 2024

There was feedback regarding the motivation for this. Why is MLDeviceType even necessary, and shouldn't it be a browser implementation decision to choose the most appropriate processor, given the existing MLPowerPreference?

Or do we really need this, for instance, for an application with extensive WebNN <-> WebGPU interop? I read through the other related issues but couldn't find a good motivating factor.

@zolkis
Collaborator

zolkis commented May 30, 2024

Why is MLDeviceType even necessary and shouldn't it be a browser implementation decision to choose the most appropriate processor given the existing MLPowerPreference?

I think that could be possible even with the current API shape, if we spec it correctly.
If there are big differences in views over features like this, specs have (at least in the past) used conformance classes, with implementations choosing a conformance level. But in this case we may not even need that -- if we can spec a way for implementations to disregard user preferences/hints while telling whether hints were overridden or a fallback happened.

The details would be important, so please share examples of how you would like these scenarios to play out.

@inexorabletash
Member

There's been some past discussion of the use cases for an explicit device type in a few places. One that comes to mind is:

... where I ask a bunch of dumb questions about the need for MLDeviceType. I was missing the point of #322, which is about requiring the GPU device type to use an explicit WebGPU-vended device so you don't get surprising behavior. But it teases out some use cases anyway.

@mwyrzykowski

After reading over #322, @inexorabletash, I still don't fully understand the argument for the device type. Compared to WebGPU, which can be implemented purely in software without a physical GPU, why does WebNN specify a physical device type instead of leaving this up to the implementation? Rather, it would seem that specifying a wish to interop with a WebGPU device is what's desirable. That WebGPU device may be created as a purely software device, in which case running the WebNN computations on the CPU as well is desirable.

In any case, this seems best left up to the browser implementation. It is the browser implementation that ensures WebNN computations are consistent across whatever physical hardware they run on. In that scenario, MLDeviceType should be removed from the WebNN API.

@huningxin
Contributor

huningxin commented Jun 5, 2024

@mwyrzykowski

why is WebNN specifying a physical device type and not leaving this up to the implementation

AFAIK, there are two use cases that need an explicit device type:

It is the browser implementation which ensures WebNN computations are consistent across any physical hardware device it runs on.

This scenario was discussed before as a "system" or "auto" device type. There were some relevant issues: webmachinelearning/model-loader#30 and #257.

@zolkis
Collaborator

zolkis commented Jun 13, 2024

Summarizing the use cases, we seem to have the following set of constraints:

  • op sets may depend on the device (accelerator) type; fallbacks might be automatic or follow a user-defined/hinted policy.
  • a user might explicitly want to limit execution to certain device(s)/device type(s).
  • when the implementation cannot satisfy the user hint, should it fail, or may it override the hint and decide the device and possible fallbacks itself?
  • an API that attempts to serve all these use cases should not cause confusion or friction on the success and error paths.

Strictly based on the discussion above (and its references):

  • The examples cited by @huningxin could be handled by specifying deviceType: "cpu", meaning "up to CPU".

  • For supporting GPU or NPU, those could both be included under deviceType: "gpu", meaning "up to GPU, including any possible NPUs". If there is any NPU or GPU (or CPU), this would allow the implementation to choose.

  • Do we have a use case to specify "NPU or CPU" only? Then we could introduce deviceType: "npu".

  • For the use case of letting the implementation choose the best device, we could (re-)introduce deviceType: "auto". This might cover fallbacks in any order. But AFAIU, this would be the same as deviceType: "gpu", which could be the default (a sketch of this interpretation follows below).

Did I miss something?
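
A minimal sketch of the "up to" interpretation above, where each requested type names the most capable device class allowed and the implementation picks within it; the mapping is hypothetical:

```js
const allowedDevices = {
  cpu:  ['cpu'],                // CPU only
  npu:  ['npu', 'cpu'],         // "NPU or CPU" only
  gpu:  ['gpu', 'npu', 'cpu'],  // "up to GPU, including any possible NPUs"
  auto: ['gpu', 'npu', 'cpu'],  // same as 'gpu'; implementation chooses
};
```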

@mwyrzykowski

Another common use case would be the desire to run workloads on the GPU and ANE (and possibly CPU) simultaneously.

@zolkis
Collaborator

zolkis commented Jun 13, 2024

Another common use case would be the desire to run workloads on the GPU and ANE (and possibly CPU) simultaneously.

Assuming the interpretation above, that could fall under a deviceType: "gpu" context, with the split decided by the implementation.

If we'd like to make that explicit, together with the fallback option (and eventually an error when we want one), we'd need option 3 from the list above, but it has downsides/complications.

@mwyrzykowski

mwyrzykowski commented Jun 13, 2024

It would seem preferable to keep it implicit, because it would be hard for a website, especially in a privacy-preserving manner, to make the correct decision for a given scenario.

@fdwr
Collaborator

fdwr commented Jun 13, 2024

It would seem preferable to keep it implicit, because it would be hard for a website, especially in a privacy-preserving manner, to make the correct decision for a given scenario.

Note that websites won't be the only clients: installable web apps can also run locally via WebNN, and they know their scenarios better. In any case, the device type is a hint, not a requirement. Apps can also leave MLDeviceType at its default, in conjunction with a power preference, to let the implementation do as it pleases (see the sketch below).
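
A minimal sketch of that last point, leaving the device type at its default and expressing only a power preference, which the implementation may act on as it sees fit:

```js
// MLPowerPreference is a hint; the implementation picks the device(s).
const context = await navigator.ml.createContext({ powerPreference: 'low-power' });
```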
