Question: GEMM on 8-bit buffers? #4

Closed
abrown opened this issue Sep 16, 2020 · 12 comments

@abrown
Collaborator

abrown commented Sep 16, 2020

This issue serves as a follow-up to the following discussion in the W3C Machine Learning Workshop. The question was about fp16 and i8 support in Wasm, specifically for ML models that may need these data types:

From Sangwhan Moon to Everyone:  07:41 AM
https://github.com/WebAssembly/simd/tree/master/proposals/simd is on-going work, I recall seeing i8 types in there.

From Kenneth Heafield (University of Edinburgh) to Everyone:  07:41 AM
https://github.com/webmachinelearning/webnn/issues/84

From Andrew Brown to Everyone:  07:42 AM
While it is true that Wasm doesn't have f16 and i8 types, it is possible to create buffers in memory and pack them (through shifting, etc.) so they would "look like" f16/i8 buffers--is this not enough?

From Kenneth Heafield (University of Edinburgh) to Everyone:  07:44 AM
While we're talking about size, speed matters.  The relevant WebAssembly issue is https://github.com/WebAssembly/simd/issues/328 .
So WasiNN would do GEMM for me on the 8-bit buffers?

From Andrew Brown to Everyone:  07:45 AM
I believe it could; right now we are working on a POC that exposes what OpenVINO can do through the wasi-nn API

From Kenneth Heafield (University of Edinburgh) to Everyone:  07:47 AM
Can it call _mm512_permutexvar_epi16 to implement lookup tables for operators?  And if all I have is an Intel CPU, will WebAssembly allow it to call pmaddubsw or vpmaddubsw despite the Intel-specific saturation behavior that doesn't exist on ARM / GPU?
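
For readers unfamiliar with the saturation behavior mentioned in the last message, here is a minimal scalar sketch in Python of what pmaddubsw computes: unsigned bytes from one operand are multiplied by signed bytes from the other, adjacent products are summed, and each sum is clamped to the signed 16-bit range. The function names are illustrative only and are not part of any library; this models the arithmetic, not the performance.

```python
def saturate_i16(x):
    """Clamp an integer to the signed 16-bit range, as the instruction does."""
    return max(-32768, min(32767, x))


def pmaddubsw(a_u8, b_i8):
    """Scalar model of pmaddubsw on two equal-length byte sequences.

    a_u8 holds unsigned bytes (0..255); b_i8 holds signed bytes (-128..127).
    Each output lane is the saturated sum of two adjacent products.
    """
    assert len(a_u8) == len(b_i8) and len(a_u8) % 2 == 0
    out = []
    for i in range(0, len(a_u8), 2):
        total = a_u8[i] * b_i8[i] + a_u8[i + 1] * b_i8[i + 1]
        out.append(saturate_i16(total))
    return out


# 255*127 + 255*127 = 64770 does not fit in 16 bits, so the lane saturates.
print(pmaddubsw([255, 255, 1, 2], [127, 127, -3, 4]))  # [32767, 5]
```

The first lane shows why the behavior matters for portability: a platform that wraps or widens instead of saturating would produce a different value for the same inputs.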
abrown changed the title from "Question:" to "Question: GEMM on 8-bit buffers?" Sep 16, 2020
@abrown
Collaborator Author

abrown commented Sep 16, 2020

To answer Kenneth's question, let me provide a thought: wasi-nn exists because platform-specific operations (like the ones you mention perhaps) are unlikely to be exposed through the Wasm SIMD specification. That specification (and Wasm in general) has prioritized portability so it is difficult to expose Intel-specific operations (or any other platform, really) directly. Enter WASI and this proposal, wasi-nn: by exposing ML functionality as a system interface, we can then implement the ML functionality using optimized, platform-specific operations, which should give you access to the operations you are looking to use. Some caveats:

  • WASI and wasi-nn are targeted towards standalone Wasm runtimes such as wasmtime, not browsers; there may be work out there to bring WASI to the browser but I am not up-to-date on this
  • we are working on releasing a POC that would implement wasi-nn in wasmtime using OpenVINO; I am not too familiar with what your use case would look like in OpenVINO but, roughly, the path would be: (1) develop a model (TF, ONNX, OpenVINO IR, etc.) that uses the operations you want, (2) write Wasm code that loads the model bits using a wasi-nn call, and (3) run the Wasm in wasmtime, passing wasi-nn buffers that are packed to contain i8 elements
  • because I have not yet released all of the pieces, the workflow above is not yet possible but will be once the POC is released; I am interested in your thoughts on this approach.
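
As an illustration of what "buffers that are packed to contain i8 elements" could mean in practice, here is a minimal Python sketch. It does not use the wasi-nn API; the helper names are hypothetical. It shows two equivalent views of the same data: a flat byte buffer with one signed byte per element, and 32-bit words built by shifting four i8 lanes into place (little-endian lane order is an assumption of this sketch).

```python
import struct


def pack_i8(values):
    """Pack signed 8-bit integers into a flat bytes buffer, one byte each."""
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"{v} does not fit in i8")
    return struct.pack(f"{len(values)}b", *values)


def unpack_i8(buf):
    """Inverse of pack_i8: read every byte back as a signed 8-bit integer."""
    return list(struct.unpack(f"{len(buf)}b", buf))


def pack_i8_into_u32(values):
    """Pack four i8 lanes into each 32-bit word by masking and shifting."""
    words = []
    for i in range(0, len(values), 4):
        word = 0
        for lane, v in enumerate(values[i:i + 4]):
            word |= (v & 0xFF) << (8 * lane)  # lane 0 is the low byte
        words.append(word)
    return words


data = [-1, 0, 127, -128]
buf = pack_i8(data)
print(buf.hex(), unpack_i8(buf), [hex(w) for w in pack_i8_into_u32(data)])
```

Writing the 32-bit word out in little-endian order yields the same bytes as the flat buffer, which is why either representation can back an i8 tensor.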

@abrown
Collaborator Author

abrown commented Sep 16, 2020

cc: @mingqiusun

@kpu

kpu commented Sep 16, 2020

Background: https://browser.mt/ aims to run client-side machine translation with browser integration (https://www.w3.org/2020/06/machine-learning-workshop/talks/privacy_focused_machine_translation_in_firefox.html). We're running natively at reasonable speed (https://neural.mt/speed/) with 8-bit GEMM dispatched by CPUID across different SIMD lengths. But if this is an extension, then we're stuck with (currently) slow web APIs.

8-bit GEMM is our most expensive kernel and we want it in Firefox. Keep in mind that I also care about different matrix sizes than the vision people do (see apache/mxnet#17980).
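
For context, here is a minimal reference sketch in Python of what an 8-bit GEMM computes: floats are quantized to int8 with a scale, the matrices are multiplied with wide integer accumulation, and the result is scaled back. This is not the kernel discussed in this thread, which relies on SIMD instructions; the function names and the symmetric per-matrix scaling scheme are assumptions made for illustration.

```python
def quantize(matrix, scale):
    """Scale each float, round to an integer, and clamp to [-127, 127]."""
    return [[max(-127, min(127, round(x * scale))) for x in row]
            for row in matrix]


def gemm_i8(a, b):
    """C = A x B on int8-valued inputs, accumulating in wide integers."""
    inner, cols = len(b), len(b[0])
    assert all(len(row) == inner for row in a)
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def dequantize(c, scale_a, scale_b):
    """Undo both input scales to recover an approximate float result."""
    return [[x / (scale_a * scale_b) for x in row] for row in c]


a_q = quantize([[0.5, -1.0]], 100)     # 1x2 activations
b_q = quantize([[0.25], [0.75]], 100)  # 2x1 weights
print(dequantize(gemm_i8(a_q, b_q), 100, 100))
```

A production kernel performs the same arithmetic but processes many lanes per instruction, which is where instruction availability and matrix shapes start to matter.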

We can export to 3 ONNX graphs that get glued together. The exported models are somewhat inefficient though, even natively, because shortlisting is crucial to performance. In shortlisting, the system guesses what words will occur then selects them in the output matrix, avoiding a full multiply of the output matrix. Those guesses are based on set operations. So I'm hesitant to go for a full "give us your graph" approach when much of the work to get the speed entailed customizing our C++ toolkit including operators that don't exist in ONNX. But if I can just call GEMM, logsoftmax, elementwise kernels, etc. that's most of what I need.

@mingqiusun
Collaborator

@kpu Is there any machine learning framework that supports shortlisting?

@kpu

kpu commented Sep 16, 2020

Sockeye, OpenNMT, and Marian all do shortlisting. Sockeye does it on the Python side because MXNet doesn't have it. OpenNMT and Marian have integrated C++ stacks with CPU and GPU backends. It's not hard per se: read the input, do some hash table operations to take the union of predicted tokens for each token in the batch, and run a select operation on the output matrix. It's just not something ONNX supports out of the box.
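
The shortlisting steps described above can be sketched roughly as follows in Python. This is not how any of the named toolkits implement it; the `candidates` table (source token to likely output token ids), the function names, and the assumption that vocabulary entries are columns of the output matrix are all hypothetical.

```python
def shortlist(batch, candidates):
    """Union of predicted output token ids over every token in the batch."""
    selected = set()
    for sentence in batch:
        for token in sentence:
            # Tokens missing from the table contribute no candidates.
            selected.update(candidates.get(token, ()))
    return sorted(selected)


def select_columns(output_matrix, columns):
    """Keep only the shortlisted vocabulary columns of the output matrix."""
    return [[row[c] for c in columns] for row in output_matrix]


candidates = {"a": [0, 2], "b": [2, 3], "c": [5]}
batch = [["a", "b"], ["b", "c"]]
ids = shortlist(batch, candidates)
weights = [[10, 11, 12, 13, 14, 15],
           [20, 21, 22, 23, 24, 25]]
print(ids, select_columns(weights, ids))
```

The multiply against the output layer then runs over the selected columns only, so its cost scales with the shortlist size rather than the full vocabulary.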

In any case, my main performance interest is getting 8-bit GEMM running as fast as possible in the browser, as soon as possible, through whichever standard. The other kernels are icing.

@kpu

kpu commented Oct 2, 2020

Let's not worry about the shortlisting; I can just do that in WebAssembly with a hash table and provide it as an extra input.

What I do want is 8-bit GEMM in the browser.

I feel like WebNN is pursuing a full-package approach, which will be nice in the long term ("Expected completion: [CR Q1 2022]") but is much bigger than what I need to get reasonable efficiency.

@abrown
Collaborator Author

abrown commented Oct 7, 2020

Thinking about this more, the Wasm SIMD repo issues (e.g. WebAssembly/simd#127, https://github.com/WebAssembly/simd/issues/328, WebAssembly/simd#224) and your comments there seem the most likely way to get, e.g., PMADDUBSW in the browser. WASI and modules like wasi-nn are not primarily aimed at browser consumption, though at some point someone may make that work.

@geekbeast
Contributor

It's not a good way to get into the browser, since browsers are unlikely to support custom operators for security reasons.

@abrown I feel like this issue might be a good candidate for being closed as resolved, both for fit and for inactivity.

@kpu

kpu commented Sep 28, 2023

> It's not a good way to get into browser, since browsers are unlikely to support custom operators for security reasons.

I find the timing here a bit ironic given that Firefox 118 just launched in-browser machine translation https://www.mozilla.org/en-US/firefox/118.0/releasenotes/ powered by a custom 8-bit GEMM operator because WASM was too slow.

@geekbeast
Contributor

Hi, sorry I wasn't clear. Browsers are unlikely to support custom, user-defined operators loaded from the internet. That means that browsers would not only have to support wasi-nn, but would also have to provide an implementation linking against existing framework backends. I believe this is what @abrown was alluding to in his comment.

I'm glad that Firefox decided to implement their own 8-bit GEMM operator and I hope that it meets your needs.

@abrown
Collaborator Author

abrown commented Oct 10, 2023

Let's close this: @kpu's use cases are more browser-specific and WebNN is the better fit for that.

@abrown abrown closed this as completed Oct 10, 2023