
Google Chrome Feedback on WebNN: aiming for broad device coverage and maintainability #453

Closed
vsekhar opened this issue Aug 22, 2023 · 4 comments

Comments

@vsekhar

vsekhar commented Aug 22, 2023

This issue is a high-level summary of the Chrome team's feedback on WebNN, posting here for further discussion with WG members.

--

Google strongly supports the work of the WebML WG to bring on-device ML capabilities to the open Web, and we recognize the long-term contributions from many participants, in particular Intel, who spearheaded the WebNN effort.

Since browser vendors will need to keep the resulting API up-to-date over many years, Google feels the proposal warrants special scrutiny to ensure the API remains relevant while imposing a manageable long-term support and maintenance cost, even after the initial WG contributors have moved on to other projects.

To that end, several senior technical staff members on the Google Chrome team who are familiar with Web APIs, the Web standards process, and the technical implementation of various advanced browser APIs and capabilities, have carefully reviewed the WebNN proposal. This document summarizes their feedback. While we draw on the expertise of other ML research and infrastructure teams at Google (e.g. those working on TensorFlow, TensorFlow Lite, JAX, OpenXLA), we do not aim to speak for them or their projects.

Our feedback on the WebNN proposal is informed by our observation that, for new or single-vendor OS APIs or hardware accelerators, we must assume that most Web users don't have them. While we too aim to create compelling and performant experiences for users of the latest hardware and OS platforms, we have an obligation to ensure a workable experience for other users as well.

Our goal for an ML API for the Web is not to demonstrate performance with specific accelerators or drivers. Instead, Chrome's goal is to achieve 80% of a device's theoretical hardware-accelerated ML runtime performance across 80% of devices on the Web, and to do so while imposing a manageable long-term support burden on browser vendors. Users of other devices, with hardware accelerators or architectures that differ significantly from the mainstream and are not integrated by browser or OS vendors, should still benefit from workable execution of ML models on the CPU and GPU.

The ML ecosystem is still rapidly evolving, making it difficult for any API to keep up. For example, the long short-term memory (LSTM) approach to ML has already been obsoleted by Transformers, and Softmax has been succeeded by various approximate and access-efficient versions and implementations. Accelerators and hardware architectures continue to evolve as well.

Consider what would be involved in adding a new high level operator like FlashAttention to the current API. Implementers would need to connect it to each equivalent OS API operator (where it exists) or implement it as a GPU shader (when a GPU is available) or emulate it in CPU code. Current plans across the ecosystem for new models, operations, and hardware may already present an intractable roadmap for WebNN implementers who prioritize broad reach.

To address this issue, we favor adopting RISC-style tensor operations in the mathematical domain, drawing on the basics of tensor math that are unlikely to change in the near term, in contrast to less stable, higher-level CISC-style operations such as HardSwish or Softmax, which are often obsoleted by newer operations. The ML community is building consensus around certain low-level operator sets that work across frameworks and tool chains, and we believe this work could benefit WebNN, particularly those operator sets that specifically target long-term stability.
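To illustrate the distinction, a CISC-style operator such as softmax or hardSwish can be expressed as a short composition of RISC-style primitives (elementwise add/sub, exp, clamp, reduce, div). A minimal sketch in plain Python, using lists in place of real tensors (the function names and decomposition shown here are illustrative, not taken from any WebNN spec text):

```python
import math

def softmax(xs):
    """Compose softmax from primitives: reduce-max, sub, exp, reduce-sum, div."""
    m = max(xs)                            # reduce-max, for numerical stability
    exps = [math.exp(x - m) for x in xs]   # elementwise sub + exp
    total = sum(exps)                      # reduce-sum
    return [e / total for e in exps]       # elementwise div

def hard_swish(xs):
    """Compose hardSwish from primitives: add, clamp, mul, div."""
    return [x * min(max(x + 3.0, 0.0), 6.0) / 6.0 for x in xs]

probs = softmax([1.0, 2.0, 3.0])
```

A browser that ships only the primitives can still execute models that use the composite operators, and a new composite (or its eventual replacement) costs the implementer nothing beyond the primitives already shipped.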

We recognize that, in their current form, OS APIs for ML may not yet be conducive to RISC-style tensor operations. However, we hope the WebNN effort will produce an API design that is performant, portable, and stable, and that it will in turn have a positive influence on the evolution and long-term maintainability of OS APIs. We expect to evolve our own OS APIs in this direction too.

Based on the above, we recommend building on the WebNN proposal in the following ways:

  1. Request public positions from major browser implementers on the WebNN spec as currently proposed
  2. Reduce the long-term support burden of WebNN by streamlining the API surface
    • Consider evolving towards operator sets emerging from the ML community, especially those targeting long-term stability
    • Remove model-specific operations like lstm and gru
    • Remove CISC-style operators like hardSwish and softmax
    • Limit tensor layout specifications to functions that read or write buffers
    • Complete the set of basic scalar and tensor math operations
  3. Demonstrate WebNN performance for CPU and GPU execution across multiple OS platforms
    • Suggestion: consider implementing WebNN as a polyfill on top of WebAssembly and WebGPU to reuse the compatibility work already done for these APIs
  4. Demonstrate WebNN performance gains utilizing OS- and hardware-specific optimizations
    • Extend WebNN implementations in a pluggable fashion, where hardware and OS vendors contribute, maintain, and deprecate backends for their platforms
    • Gate these backends on demonstrated performance gains on the targeted platforms, with no regressions or performance cliffs on other platforms or on the fallback WebAssembly/WebGPU implementation
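The pluggable-backend recommendation in item 4 might be sketched as follows. The `Backend` interface, the `supports()` check, and the backend names are hypothetical illustrations of the dispatch idea, not part of any WebNN specification:

```python
class Backend:
    """Hypothetical interface a hardware/OS vendor backend would implement."""
    name = "base"
    def supports(self, op: str) -> bool:
        raise NotImplementedError
    def run(self, op: str, *tensors):
        raise NotImplementedError

class WasmFallbackBackend(Backend):
    """Always-available CPU path, analogous to a WebAssembly polyfill."""
    name = "wasm-fallback"
    def supports(self, op):
        return True
    def run(self, op, *tensors):
        return f"{op} on {self.name}"

class VendorNPUBackend(Backend):
    """Vendor-contributed accelerator backend covering a subset of ops."""
    name = "vendor-npu"
    def __init__(self, ops):
        self._ops = set(ops)
    def supports(self, op):
        return op in self._ops
    def run(self, op, *tensors):
        return f"{op} on {self.name}"

def dispatch(op, backends):
    """Pick the first backend claiming support; the fallback always matches."""
    for b in backends:
        if b.supports(op):
            return b.run(op)
    raise RuntimeError(f"no backend supports {op}")
```

Because the fallback claims every operation, adding or deprecating a vendor backend never removes capability; it only changes where an operation executes.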

With regard to OS- and hardware-specific optimizations, we further propose an engineering approach that clearly demonstrates their value to Web users, giving the ecosystem reason to adopt and maintain them over the long term:

  1. Select 2-5 demonstrative ML models, for example (source):
    • Segment Anything
    • Stable Diffusion
    • Whisper Tiny
  2. Run on a demonstrative set of platforms with accelerator hardware:
    • Apple Neural Engine on an M2 MacBook Pro via CoreML
    • Intel VPU on a Meteor Lake desktop via DirectML
    • Mainstream mobile devices running iOS and Android
    • ... other platforms at the suggestion of the Working Group
  3. Evaluate latency, throughput and power efficiency between:
    • Lowering WebNN for execution on typical CPUs and GPUs on the above platforms
    • Lowering WebNN for execution on hardware accelerators on the above platforms

We look forward to continuing the discussion with WG participants to deploy powerful ML capabilities across the Web's many platforms and to benefit all of the Web's users.

@anssiko
Member

anssiko commented Aug 24, 2023

Thank you @vsekhar and @inexorabletash. We discussed this feedback at WebML WG Teleconference – 24 August 2023.

I encourage the WG to use this issue for general discussion and cross-link to this issue from topic-specific GH issues as appropriate.

@anssiko
Member

anssiko commented May 8, 2024

I observe that a subset of these recommendations has been, or is being, discussed in topic-specific issues; for example, #456 and #573 are currently open. Furthermore, the group has focused its specification and prototyping efforts on the models and hardware targets mentioned in this high-level summary.

I would like to revisit this high-level issue in a future meeting to see what has been done, what remains to be done, and what new information may have come up since, and to understand whether a revision to this high-level summary would be appropriate.

@anssiko
Member

anssiko commented May 24, 2024

To follow up on https://www.w3.org/2024/05/16-webmachinelearning-minutes.html#t09, I'd ask @mwyrzykowski to file an issue for WebKit's standards positions repo. Mike is an active WG participant, familiar with both this API and Apple platforms, and as such well positioned to file the issue in a way that includes the appropriate details important to Apple.

@inexorabletash, I'd like to invite one or more Mozillians to our meeting to discuss this topic and familiarize them with the API. Perhaps someone from WebGPU-land might be interested?

@a-sully
Contributor

a-sully commented Oct 17, 2024

As discussed at TPAC, the concerns raised in this issue now have more specific issues filed, such as #573 and #689, and I expect discussions to continue there. We see no need to keep this meta-issue open and would like to close it.

(someone with repo edit permissions needs to hit the button. Thanks!)
