
[MLBuffer] Creating and representing MLBuffer on XPU devices #542

Open
bbernhar opened this issue Jan 30, 2024 · 27 comments

Comments

@bbernhar

bbernhar commented Jan 30, 2024

Purpose/Motivation

Defines a device-based storage object that may be used by WebNN operations. This is a sub-issue of #482.

Proposed API

typedef unsigned long MLFlagsConstant;

[Exposed=(Window, DedicatedWorker)]
interface MLBuffer {
  readonly attribute MLFlagsConstant usage;
  readonly attribute MLOperandDescriptor descriptor;
  [CallWith=Isolate] void destroy();
};
[Exposed=(Window, DedicatedWorker), SecureContext]
namespace MLBufferUsage {
    // TBD
};

[Exposed=(Window, DedicatedWorker), SecureContext]
partial interface MLContext {
    Promise<MLBuffer> createBuffer(MLOperandDescriptor descriptor, MLBufferUsage usages);
};

Example JS

const ml_buffer = await mlContext.createBuffer(descriptor, usages);
ml_buffer.destroy(); // the buffer is invalid once destroyed
  • The buffer's allocation will be zeroed (as it is for WebGPU's createBuffer() method).
  • Layout of MLBuffer is always known (and linear access is assumed).
  • destroy() is called on the context timeline but doesn't actually release memory until the device signals completion.
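
A slightly fuller sketch of the intended flow (hedged: writeBuffer(), readBuffer(), and dispatch() are the companion MLContext methods discussed later in this thread, and graph is assumed to be an already-built MLGraph):

const desc = {dataType: 'float32', dimensions: [2, 2]};
const input = await mlContext.createBuffer(desc, usages);
const output = await mlContext.createBuffer(desc, usages);
mlContext.writeBuffer(input, new Float32Array([1, 2, 3, 4])); // upload
mlContext.dispatch(graph, {'x': input}, {'y': output});       // execute
const result = await mlContext.readBuffer(output);            // download
input.destroy();
output.destroy();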

Edits

  • 5/14/24 - Removed "size" in favor of using MLOperandDescriptor
  • 6/03/24 - Added usage flags and descriptor attributes
  • 7/09/24 - createBuffer() now returns promise.

Alternative API proposals

N/A

Opens

  1. Where will an MLBuffer's memory be allocated on systems where an MLContext may not be as closely tied to a given physical device as an IDMLDevice? See Need to understand how WebNN supports implementation that involves multiple devices and timelines #350 @a-sully
  2. Must an MLBuffer only be used with an MLContext it was created from? @a-sully
  3. Can an MLBuffer's size always be known at the time of buffer allocation? @a-sully
  4. When will MLBuffer be deallocated if destroy() is not called? @a-sully
  5. Does MLBuffer require explicit buffer usages (ex. input, output, or both)? @bbernhar
  6. Does MLBuffer need to support being a staging buffer? @bbernhar
  7. Is a zero sized MLBuffer allowed? @bbernhar
@bbernhar bbernhar changed the title [MLBuffer] Creating and representing MLBuffer on XPU devices (i.e. MLContext.createBuffer) [MLBuffer] Creating and representing MLBuffer on XPU devices Jan 30, 2024
@a-sully
Contributor

a-sully commented Apr 17, 2024

My recent investigation into supporting MLBuffer on CoreML has led me to the following two suggestions for createBuffer():

1. We need a WebGPU usage flag (at minimum)

The only zero-copy way to pass a buffer to both WebGPU (as an IOSurface) and CoreML (as an MLMultiArray) is to first allocate the buffer as an IOSurface containing "float16" data (IOSurface -> CVPixelBuffer -> MLMultiArray)

If the MLBuffer is to be used with WebGPU it must be allocated in this fashion (to be zero-copy, at least), whereas an MLBuffer which is only used within WebNN may be allocated as an MLMultiArray directly (more on that below)

2. MLBufferDescriptor should include an MLOperandDescriptor rather than an MLSize64

CoreML's inputs and outputs are given as MLMultiArrays, which require the data type and dimensions to be known. If we're to allocate a hardware buffer for createBuffer(), this information must be known.

Given that the dimensions + data type of input and output operands to an MLGraph are well-defined anyway, it seems reasonable to enforce that an MLBuffer must have matching constraints to be passed as an input or output to an MLGraph, as #544 describes. Is there a reason why we should keep MLSize64?

@bbernhar
Author

Thanks @a-sully for delving into the CoreML side of things.

Regarding the need for a WebGPU usage flag:

Is it feasible for an MLBuffer to always be created as an MLMultiArray where, upon import to WebGPU, we could assign or request the usages? Assigning GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC appears to be sufficient.

As for the question about keeping MLSize64:

Without MLSize64, any ML framework that doesn't represent its tensor datatype like MLMultiArray would require re-architecting to avoid creating (especially output) tensors from raw allocations (or malloc). Alternatively, the web developer would need to defer calling createBuffer() until dispatch(), impacting the first inference-time. Could MLOperandDescriptor be made optional instead? The size could then be ignored where irrelevant.

@a-sully
Contributor

a-sully commented Apr 18, 2024

Is it feasible for an MLBuffer to always be created as an MLMultiArray where, upon import to WebGPU, we could assign or request the usages?

AFAICT an MLMultiArray cannot be handed off to WebGPU. It's a data type specific to CoreML. Importing to a type WebGPU can understand would always require a copy - even on UMA systems, which would be unfortunate!

Without MLSize64, any ML framework that doesn't represent its tensor datatype like MLMultiArray would require re-architecting to avoid creating (especially output) tensors from raw allocations (or malloc). Alternatively, the web developer would need to defer calling createBuffer() until dispatch(), impacting the first inference-time.

Hmm I'm not sure if I understand your concern.... Implementations are still welcome to allocate an MLBuffer as one contiguous block of memory. It's just that the WebNN "front end" would assert (when you call dispatch()) that the dtype and dimensions of the passed-in MLBuffer match what the graph expects, rather than just that the sizes are the same. Concretely, that maps to these checks in your prototype CL.
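
For illustration, a minimal sketch of that front-end assertion (hypothetical helper; descriptor shape per the proposed MLBuffer.descriptor attribute):

// Validate that a dispatched MLBuffer matches the graph operand's descriptor.
function validateDispatchBuffer(expected, buffer) {
  const actual = buffer.descriptor;
  if (expected.dataType !== actual.dataType) {
    throw new TypeError('MLBuffer dataType does not match the operand');
  }
  if (expected.dimensions.length !== actual.dimensions.length ||
      expected.dimensions.some((d, i) => d !== actual.dimensions[i])) {
    throw new TypeError('MLBuffer shape does not match the operand');
  }
}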

Is the use case you're referring to one where a single MLBuffer is assumed to be able to be contorted into different dtypes and dimensions? For example:

const mlBuffer = new MLBuffer({size:3*4*4});

// `graph1` expects a float32 output with shape [3, 4]
context.dispatch(graph1, inputs, {'out': mlBuffer});

// `graph2` expects a float16 input with shape [4, 3, 2]
context.dispatch(graph2, {'in': mlBuffer}, outputs);

This hides an internal reinterpretation of the data type and dimensions of what's assumed to be an opaque bag of bytes. I think there's a reasonable argument that this implementation detail should not make its way into the WebNN spec, which shouldn't prescribe a particular implementation.

WebNN has reshape and cast operators. In the example above, graph2 may use these operators to convert an input into whatever dtype and dimensions it needs, if it still wants to be able to use mlBuffer. An advantage of this approach is that the otherwise opaque reinterpretation of the buffer can be expressed in terms of other well-defined operators.
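
For instance, a hedged sketch of how graph2 could express that conversion with builder operators (shapes are illustrative; reshape() must preserve the element count):

const builder = new MLGraphBuilder(context);
// Declare the input with the dtype/shape the buffer actually has...
const x = builder.input('in', {dataType: 'float32', dimensions: [3, 4]});
// ...then convert it with well-defined operators instead of reinterpreting bytes.
const casted = builder.cast(x, 'float16');        // float32 -> float16
const reshaped = builder.reshape(casted, [4, 3]); // still 12 elements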

Could you elaborate on the use case(s) you have in mind?

Could MLOperandDescriptor be made optional instead? The size could then be ignored where irrelevant.

What would be the expected behavior on platforms which require a data type and dimensions when the buffer is allocated? An MLOperandDescriptor implies a size - but not the other way around.
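
To illustrate that asymmetry, a hypothetical helper (byte sizes abbreviated):

const BYTES_PER_ELEMENT = {float16: 2, float32: 4, float64: 8, int32: 4};
// An MLOperandDescriptor determines a byte length...
function byteLength({dataType, dimensions}) {
  return dimensions.reduce((acc, d) => acc * d, 1) * BYTES_PER_ELEMENT[dataType];
}
byteLength({dataType: 'float32', dimensions: [3, 4]}); // 48
// ...but 48 bytes alone cannot recover the dtype or dimensions.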

@bbernhar
Author

bbernhar commented Apr 18, 2024

AFAICT an MLMultiArray cannot be handed off to WebGPU.

I was expecting we start the allocation in CoreML via MLBuffer, then import it as an MTLBuffer into WebGPU, using the GPUBuffer usages I mentioned.

Implementations are still welcome to allocate an MLBuffer as one contiguous block of memory

Consider a native C++ framework which implements a Tensor dtype as a bag of bytes. If you want to deploy this ML framework using the WebNN JS API as an execution provider (or EP), it expects buffers to be allocated using a size. If we force createBuffer() to accept only an MLOperandDescriptor, then this EP couldn't simply map Tensor allocation to createBuffer(). They would need to come up with a solution just for MLBuffer, preserving MLOperandDescriptor, or defer createBuffer(), which seems either burdensome or ineffective.

@a-sully
Contributor

a-sully commented Apr 19, 2024

AFAICT an MLMultiArray cannot be handed off to WebGPU.

I was expecting we start the allocation in CoreML via MLBuffer, then import it as an MTLBuffer into WebGPU, using the GPUBuffer usages I mentioned.

Ah, I think my wording of "We need a WebGPU usage flag" above was misleading. I'm not suggesting that we need WebGPU usage flags here, but rather a usage flag saying "I want this MLBuffer to be convertible to a GPUBuffer" (because the implementation may use that information to determine where/how the buffer should be allocated). Does that clear things up?

Could you also clarify what exactly you mean by "start the allocation in CoreML"? I assume you mean "as an MLMultiArray", but that would require the dtype and dimensions to be known, no?

Implementations are still welcome to allocate an MLBuffer as one contiguous block of memory

Consider a native C++ framework which implements a Tensor dtype as a bag of bytes. If you want to deploy this ML framework using the WebNN JS API as an execution provider (or EP), it expects buffers to be allocated using a size. If we force createBuffer() to accept only an MLOperandDescriptor, then this EP couldn't simply map Tensor allocation to createBuffer(). They would need to come up with a solution just for MLBuffer, preserving MLOperandDescriptor, or defer createBuffer(), which seems either burdensome or ineffective.

Thanks for the explanation. Could you provide a concrete example of where this concern is relevant? A quick glance at some common ML frameworks suggests that just size is often not sufficient to allocate a tensor. OnnxRuntime's JavaScript Tensor requires dtype and dimensions, for example. As does TFLite’s equivalent. Are there known examples where a size is available but not the dtype and dimensions? Presumably the MLBuffer is being allocated with use by some given MLGraph in mind, and the data types and dimensions of inputs and outputs must already be known? (input() and build() (for outputs) each require an MLOperandDescriptor)
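
For reference, ORT Web's public Tensor constructor already requires this information up front:

import * as ort from 'onnxruntime-web';
// dtype, data, dims - a size alone is not enough to construct one.
const tensor = new ort.Tensor('float32', new Float32Array(12), [3, 4]);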

Another consideration is that size may not be enough regardless of whether we want to replace size with an MLOperandDescriptor. As mentioned above, I expect we'll need usage flags, too. Does your concern still hold if arguments other than size become required?

@bbernhar
Author

Could you also clarify what exactly you mean by "start the allocation in CoreML"?

Could we pass a union to createBuffer() which specifies either the size or an MLOperandDescriptor, so MLBuffer could always be created as an MLMultiArray? If not, another (possible) alt. solution is to have createBuffer(size) defer creation of the MLMultiArray until dispatch().

Are there known examples where a size is available but not the dtype and dimensions?

Yes, the ORT web tensor dtype can only be implemented behind a "malloc"-like C interface. When WebNN is used as an EP, it exists within the ML runtime itself.

@reillyeon
Contributor

Yes, the ORT web tensor dtype can only be implemented behind a "malloc"-like C interface.

I don't understand this comment because all of the Tensor constructors in that header take shape information. What am I missing?

@bbernhar
Author

I don't understand this comment because all of the Tensor constructors in that header take shape information. What am I missing?

Notice the Tensor constructor uses an IAllocator interface. That's the only way an MLBuffer can be created, because it must own the buffer for the specified shape. Funnily enough, the shape information is right there, but the main point is ORT expects it's possible to whip up a device buffer given only a size.

@a-sully
Contributor

a-sully commented Apr 19, 2024

Taking a step back, the Web Platform Design Principles implore us to "design based on user needs, not the underlying API or hardware":

This means newly proposed APIs should be designed with careful consideration on how they are intended to be used rather than how the underlying hardware, device, or native API available today.

The use cases for an MLBuffer - using some (hardware-optimized) buffer as an input or output to an ML graph - all require that the data type and dimensions of the buffer be known. We should not prescribe implementation details, such as that the buffer must be allocated contiguously, as this other design principle cautions:

Be particularly careful about exposing the exact lifecycle and data structures of the underlying native APIs. When possible, consider flexibility for new hardware.

The point about considering flexibility for new hardware is especially pertinent to WebNN :)

While I understand the desire to design a web platform API which (especially WASM) user-space frameworks can easily plug into, the web platform API should not bend over backwards to accommodate the implementation choices of any given framework. And the web platform API certainly should not bake in assumptions based on the current limitations of said frameworks! In this case, ORT does not support CoreML in cases where an MLMultiArray used as an output is not contiguously allocated. It seems likely that addressing that limitation would require changes to ORT which are ~the same as what would be needed to support MLBuffer if creating an MLBuffer required a dtype and dimensions?

@RafaelCintron
Collaborator

[@a-sully wrote]

The only zero-copy way to pass a buffer to both WebGPU (as an IOSurface) and CoreML (as an MLMultiArray) is to first allocate the buffer as an IOSurface containing "float16" data (IOSurface -> CVPixelBuffer -> MLMultiArray)

For Apple platforms, my understanding is you can go from MLMultiArray -> MTLBuffer by calling getBytesWithHandler + newBufferWithBytesNoCopy. With an MTLBuffer you should be able to create a WebGPU buffer.

Why are IOSurfaces required?

@huningxin
Contributor

another (possible) alt. solution is to have createBuffer(size) defer creation of the MLMultiArray until dispatch().

Seems doable. writeBuffer() may hold the BigBuffer with user data, then at dispatch(), create an MLMultiArray by initWithDataPointer?

@a-sully
Contributor

a-sully commented Apr 20, 2024

For Apple platforms, my understanding is you can go from MLMultiArray -> MTLBuffer by calling getBytesWithHandler + newBufferWithBytesNoCopy. With an MTLBuffer you should be able to create a WebGPU buffer.

Good question! I originally thought so too, but my current understanding is that this is not generically true (i.e. for all data types). If anyone can definitively confirm or dispute this understanding (@mwyrzykowski?) please speak up! Alright here goes...

The docs of newBufferWithBytesNoCopy say that it:

Creates a buffer that wraps an existing contiguous memory allocation

whereas the docs for getBytesWithHandler say of the buffer:

It may not store these scalar values contiguously

so I would assume that this would not be allowed (or at least not be zero-copy) unless the MLMultiArray was specifically allocated contiguously.

How can we ensure an MLMultiArray is allocated contiguously?

Of all the MLMultiArray constructors, the candidates for ensuring a contiguous memory allocation seem to be:

  • initWithDataPointer:shape:dataType:strides:deallocator:error:
  • initWithPixelBuffer:shape:

The first one looks promising! Unfortunately it seems - based on past offline discussions - that CoreML internally makes a copy of the bytes when using this constructor. That strides is a parameter seems to corroborate this.

So this would not be zero-copy:

another (possible) alt. solution is to have createBuffer(size) defer creation of the MLMultiArray until dispatch().

Seems doable. writeBuffer() may hold the BigBuffer with user data, then at dispatch(), create an MLMultiArray by initWithDataPointer?

The latter constructor takes a CVPixelBuffer, but this only works if the CVPixelBuffer is a "float16" IOSurface in disguise:

Use this initializer to create an IOSurface-backed MLMultiArray that reduces the inference latency by avoiding the buffer copy to and from some compute units.

The pixel buffer’s pixel format type must be kCVPixelFormatType_OneComponent16Half. The MLMultiArray data type is MLMultiArrayDataType.float16.

So with regard to this question...

Why are IOSurfaces required?

It seems that the only way to avoid copies of a backing memory which is to be shared as both an MLMultiArray and an MTLBuffer is to start with a float16 IOSurface. Unfortunately this suggests that zero-copy buffer sharing is only possible under certain dtype + "do we need to share with WebGPU" configurations. Of course, if we know the memory will stay within CoreML (i.e. it doesn't need to be shared with WebGPU) then we can allocate an MLMultiArray directly, though this would require dtype and shape to be known before writeBuffer().

Data Type   WebNN Use Only                                WebGPU Interop
float16     ✅ Zero copy (as MLMultiArray or IOSurface)    ✅ Zero copy (as IOSurface)
float32     ✅ Zero copy (as MLMultiArray)                 🌕 Data copies (with initWithDataPointer)
float64     ✅ Zero copy (as MLMultiArray)                 🌕 Data copies (with initWithDataPointer)
int32       ✅ Zero copy (as MLMultiArray)                 🌕 Data copies (with initWithDataPointer)
other       ❓ May be emulated as int32?                   ❓ Not sure

@mwyrzykowski

For Apple platforms, my understanding is you can go from MLMultiArray -> MTLBuffer by calling getBytesWithHandler + newBufferWithBytesNoCopy. With an MTLBuffer you should be able to create a WebGPU buffer.

Good question! I originally thought so too, but my current understanding is that this is not generically true (i.e. for all data types). If anyone can definitively confirm or dispute this understanding (@mwyrzykowski?) please speak up! Alright here goes...

It is zero copy in CoreML, but anything other than fp16 + CVPixelBuffer will result in a copy below CoreML.

@bbernhar
Author

web platform API certainly should not bake in assumptions based on the current limitations of said frameworks!

Not all HW APIs require an MLOperandDescriptor for buffer creation; this isn't specific to ORT (ex. DML). If the ML framework wants to pre-allocate buckets of memory (as it can with GPUBuffer) but WebNN cannot, that's equally an assumption on WebNN's behalf IMO.

Unless MLMultiArray can NOT be implemented through an MLBuffer, it seems unnecessary to require only an MLOperandDescriptor.

@bbernhar
Author

bbernhar commented May 9, 2024

@a-sully Thinking of a way forward to unblock CoreML.

Here are the options I've gathered:

  1. Use MLBuffer(MLOperandDescriptor) and work around the problem in ORT by calling createBuffer() in dispatch().
  2. Re-implement MLBuffer API to be typed like MLMultiArray, WebNN RT provides an IAllocator impl.
  3. Keep MLBuffer and have the CoreML impl. cache MLMultiArray(s) upon dispatch().

I am not a fan of (1) because it bakes assumptions into the WebNN spec (ex. ORT never pre-allocates or uses untyped buffers). Untyped buffers (aka byte buffers with a linear layout), for example, could be partially dispatched via an MLBufferView, re-used between multiple calls to dispatch(), or pre-allocated from a larger MLBuffer using createBuffer(size).

The other option, (2), means WebNN backends (ex. DML resources) must be re-implemented to work like MLMultiArray (which requires strides to read and write), which is a considerable effort/burden. If (3) is possible, it seems like the simplest path forward; did you have a chance to investigate this?

@a-sully
Contributor

a-sully commented May 9, 2024

Thanks for the input @bbernhar. I've been exploring this space more, and I still believe the path forward if we want "a device-based storage object that may be used by WebNN operations" is the following:

4. Use MLBuffer(MLOperandDescriptor, MLBufferUsageFlags) and frameworks which use WebNN should not assume implementation details, such as that tensors will always be contiguously allocated

Responses inline:


web platform API certainly should not bake in assumptions based on the current limitations of said frameworks!

Not all HW APIs require an MLOperandDescriptor for buffer creation; this isn't specific to ORT (ex. DML). If the ML framework wants to pre-allocate buckets of memory (as it can with GPUBuffer) but WebNN cannot, that's equally an assumption on WebNN's behalf IMO.

Hmm I'm not following here. The question is not whether HW APIs need an MLOperandDescriptor, but whether HW APIs can support the contract specified by MLBuffer.

If an ML framework wants to allocate a GPUBuffer, how is that relevant to WebNN? Could you please elaborate on this point?

I am not a fan of (1) because it bakes assumptions into the WebNN spec (ex. ORT never pre-allocates or uses untyped buffers). Untyped buffers (aka byte buffers with a linear layout), for example, could be partially dispatched via an MLBufferView, re-used between multiple calls to dispatch(), or pre-allocated from a larger MLBuffer using createBuffer(size).

Please refer back to #542 (comment). The WebNN spec should not prescribe implementation details, such as that the buffer must be allocated contiguously. This violates the design principles here: https://w3ctag.github.io/design-principles/#usecase-oriented-apis

The other option, (2), means WebNN backends (ex. DML resources) must be re-implemented to work like MLMultiArray (which requires strides to read and write), which is a considerable effort/burden.

I don't understand this suggestion. MLOperandDescriptor does not include strides - just dtype and shape. And this shape does not imply there must be strides; how/where an MLBuffer is allocated is entirely an implementation detail. If an MLBuffer were to be created with an MLOperandDescriptor, presumably the user agent's DML backend could calculate the total byte size and allocate a contiguous array as it currently does. The only thing that would change in the user agent implementation is a check that an MLBuffer's MLOperandDescriptor matches the MLOperandDescriptor expected by the input and output operands (in the Chromium implementation, this would be a platform-agnostic check that happens in the renderer anyways).

If (3) is possible, it seems like the simplest path forward; did you have a chance to investigate this?

This is not possible without several data copies (e.g. where does the data go when writeBuffer() is called?). This also falls apart if MLBuffer is not type-safe and can be assumed to be recasted/reshaped to any dtype and shape: #542 (comment)

@bbernhar
Author

bbernhar commented May 9, 2024

If an ML framework wants to allocate a GPUBuffer, how is that relevant to WebNN? Could you please elaborate on this point?

The developer has to know the layout in order to calculate offsets that split up and re-use a larger buffer piecemeal. Note: a linear layout does not dictate how MLBuffer gets implemented; it could actually be non-contiguous. In WebGPU, GPUBuffer layout is known (and linear), so web developers can implement IAllocator on top of GPUBuffer. If we don't allow createBuffer(size), then that problem gets punted into the WebNN runtime. If the DML backend called CreateCommittedResource() on every call to createBuffer(), our first-inference performance would be awful, which is why compute() implements its own IAllocator already. But since MLBuffers are pre-allocated before build(), we can't just FIFO it and be done with it.
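
For concreteness, a minimal sketch of the sub-allocation pattern this enables in WebGPU today (a toy bump allocator; 256 assumes the default minStorageBufferOffsetAlignment):

class BumpAllocator {
  constructor(device, size) {
    // One large allocation, split up and re-used piecemeal via offsets -
    // workable because GPUBuffer's layout is known and linear.
    this.buffer = device.createBuffer({
      size,
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
    });
    this.size = size;
    this.offset = 0;
  }
  alloc(byteLength, alignment = 256) {
    const offset = Math.ceil(this.offset / alignment) * alignment;
    if (offset + byteLength > this.size) throw new Error('out of space');
    this.offset = offset + byteLength;
    return {buffer: this.buffer, offset, size: byteLength}; // bindable at an offset
  }
}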

This is not possible without several data copies

Bummer. The more I think about it, the more likely MLBuffer needs to behave like MLTensor. DML can emulate MLMultiArray ops but not vice versa.

@a-sully
Contributor

a-sully commented May 9, 2024

The more I think about it, the more likely MLBuffer needs to behave like MLTensor

Ah yes, this is what I've been advocating for but without using that specific vocabulary 😛

@bbernhar
Author

@a-sully

If the layout of MLBuffer will be unknown, we also need to specify a way for the web developer to initialize tensor data, as readBuffer() and writeBuffer() assumed the layout was linear. For zero-copy, it seems MLBuffer must index into an MLMultiArray since createBuffer(MLOperandDescriptor) wouldn't accept an ArrayBufferView.

Could you help me understand the plan there?

@a-sully
Contributor

a-sully commented May 10, 2024

Hmmm, I thought it was a given (based on my earlier comments here) that readBuffer() and writeBuffer() would not be zero-copy. A closer look at the CoreML API has convinced me that guaranteed zero-copy buffer-mapping from JS is not possible (since, again, initWithDataPointer would still result in copies) - and as I stated in that earlier comment, I don't think this is too big of a deal, at least for inputs and outputs (constants may be a different story).

My claim - if we assume that readBuffer() and writeBuffer() will have copies - is that the web platform layer should always be able to provide the caller the illusion of linear memory, even if it's not linear under the hood. The MLMultiArray's subscript(_:) method provides this abstraction, for example. Do you see any issues with this approach?
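
A toy sketch of that illusion (hypothetical helper; assumes row-major coordinates for illustration):

// Translate a linear element index into N-d coordinates; the user agent can
// then fetch the element however the backing store is actually laid out
// (e.g. via MLMultiArray's subscript accessor).
function linearIndexToCoords(index, dimensions) {
  const coords = new Array(dimensions.length);
  for (let i = dimensions.length - 1; i >= 0; i--) {
    coords[i] = index % dimensions[i];
    index = Math.floor(index / dimensions[i]);
  }
  return coords;
}
linearIndexToCoords(5, [3, 4]); // [1, 1]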

@bbernhar
Author

Do you see any issues with this approach?

Nope, the proposed change SGTM then. I wasn't sure where offset translation was occurring (now I understand it's an impl. detail). Thanks for answering.

@bbernhar
Author

bbernhar commented Jun 11, 2024

A couple issues were re-raised today by @huningxin during @a-sully's prototyping of buffer usages.

Summarized as follows:

  1. Should createBuffer() be given a default usage at creation (ex. INPUT|OUTPUT)?
  2. OUTPUT cannot disambiguate between "on-device only" and "efficiently used by readBuffer()".

The use case for (2) is when an MLBuffer output gets imported into WebGPU and readBuffer() is never called (either WebGPU is the final destination or WebNN re-uses the output). An "on-device only" usage is unique because it offers better bandwidth, namely for dGPU.

For 1) I see value in assuming INPUT|OUTPUT upon creation because it allows the web developer to forget about usages or tracking buffers-by-usage, esp. if performance isn't an issue.

For 2) shall we consider prefixing CPU access visibility?

  • CPU_INPUT: CPU write optimal, slow GPU read/write
  • CPU_OUTPUT: CPU read optimal, slow GPU read/write
  • OUTPUT: CPU has no access, fast GPU read/write

Appreciate any thoughts/feedback.

@RafaelCintron @huningxin

@huningxin
Contributor

huningxin commented Jun 13, 2024

For 2) shall we consider prefixing CPU access visibility?

  • CPU_INPUT: CPU write optimal, slow GPU read/write
  • CPU_OUTPUT: CPU read optimal, slow GPU read/write
  • OUTPUT: CPU has no access, fast GPU read/write

+1. Regarding enum value naming, should we consider using something like the D3D12_HEAP_TYPE enumeration?

  • UPLOAD: CPU write optimal, slow GPU read/write
  • READBACK: CPU read optimal, slow GPU read/write
  • DEFAULT: CPU has no access, fast GPU read/write

INPUT|OUTPUT

Do we need to distinguish whether a GPU buffer is used for graph input or output? I mean, how would an implementation handle INPUT and OUTPUT differently?

@bbernhar
Author

@huningxin Thanks for the comments.

+1. Regarding enum value naming, should we consider using something like the D3D12_HEAP_TYPE enumeration?

The underlying memory/heap type used by the WebNN implementation could be determined based on the usage alone. See WebGPU: https://www.w3.org/TR/webgpu/#programming-model-resource-usages

Do we need to distinguish whether a GPU buffer is used for graph input or output? I mean, how would an implementation handle INPUT and OUTPUT differently?

The WebNN runtime would use INPUT or OUTPUT to create buffers in write-combined or write-back memory (aka UPLOAD and READBACK per this table) and could validate that the usage matches: INPUT => dispatch(input, ...).

  • CPU_INPUT: must be dispatched input, writeBuffer() is fast, readBuffer() is slow.
  • CPU_OUTPUT: must be dispatched output, readBuffer() is fast, writeBuffer() is slow.
  • OUTPUT: must be a dispatched output, cannot use writeBuffer() or readBuffer().
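
A sketch of how that validation could look to the web developer (hypothetical usage names per this proposal):

// CPU_OUTPUT: readBuffer() is the fast path; writeBuffer() works but is slow.
const out = await mlContext.createBuffer(descriptor, MLBufferUsage.CPU_OUTPUT);
await mlContext.readBuffer(out);        // OK: read optimal

// OUTPUT: device-only; script-side access would be rejected.
const deviceOnly = await mlContext.createBuffer(descriptor, MLBufferUsage.OUTPUT);
await mlContext.readBuffer(deviceOnly); // throws: CPU has no access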

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 14, 2024
This CL gives MLBufferDescriptor an MLOperandDescriptor as per
webmachinelearning/webnn#542

To represent this descriptor, this CL also creates a new typemapped
OperandDescriptor type which ensures that the buffer descriptor is
valid. OperandDescriptor will be used more pervasively within WebNN
in follow-up CLs

1) Move Operand::DataType to DataType (MERGED)
2) Create a typemapped OperandDescriptor class for MLBuffer <-- this CL
3) Use OperandDescriptor in mojom::Operand
4+) Remove duplicate code (especially with //components)

Bug: 343638938, 325598628
Cq-Include-Trybots: luci.chromium.try:win11-blink-rel
Change-Id: I775340f5c5e0e80942332cbae750d0d305cdd458
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 15, 2024
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 15, 2024
aarongable pushed a commit to chromium/chromium that referenced this issue Jun 15, 2024
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Jun 15, 2024
@webmachinelearning webmachinelearning deleted a comment Jun 17, 2024
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Jun 20, 2024
i3roly pushed a commit to i3roly/firefox-dynasty that referenced this issue Jun 21, 2024
jamienicol pushed a commit to jamienicol/gecko that referenced this issue Jun 24, 2024
guschmue pushed a commit to microsoft/onnxruntime that referenced this issue Jul 8, 2024
### Description
This PR enables the API added in #20816 as well as moving context
creation to JS.

### Motivation and Context
In order to enable I/O Binding with the upcoming
[MLBuffer](webmachinelearning/webnn#542) API
in the WebNN specification, we need to share the same `MLContext` across
multiple sessions. This is because `MLBuffer`s are restricted to the
`MLContext` where they were created. This PR enables developers to use
the same `MLContext` across multiple sessions.
@bbernhar
Author

bbernhar commented Aug 6, 2024

@a-sully @reillyeon @huningxin @RafaelCintron

Thoughts/concerns with introducing the (proposed) buffer creation usages below?

For context, these new usages allow DML to correctly configure (and directly map) memory properties upon createBuffer() [1] and would determine how an MLBuffer may be used after creation. WebNN backend APIs that do not require this merely validate that the usage is allowed.

MLBufferUsage(s):

  • JS_READ: buffer can be used with readBuffer(). Can be combined with JS_WRITE.
  • JS_WRITE: buffer can be used with writeBuffer(). Can be combined with JS_READ.
  • JS_NONE: buffer can only be used for dispatch(). Cannot be combined with JS_WRITE or JS_READ.

JS example

const output = await mlContext.createBuffer({
  usage: MLBufferUsage.JS_READ
});
await mlContext.readBuffer(output); // OK
mlContext.writeBuffer(output, ..); // throws error

[1] https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/context_impl_dml.cc;drc=0c5a4a1c3588e362ca65d556ff3a7fee3b3b31b8;l=246

@a-sully
Contributor

a-sully commented Aug 6, 2024

JS example

const output = await mlContext.createBuffer({
  usage: GPUBufferUsage.JS_WRITE
});
await mlContext.readBuffer(output); // OK
mlContext.writeBuffer(output, ..); // throws error

nit: Did you mean to use MLBufferUsage.JS_READ in this example?


Eventually we'll need a flag to indicate that this buffer may be shared with WebGPU. As I've discussed elsewhere, this dictates how an MLBuffer should be allocated on Mac. That's a separate issue (#688) that I'm not trying to solve here, though it would be nice to have an idea of how the proposed MLBufferUsage flags will interact with that flag (e.g. #688 suggests that importing an MLBuffer into WebGPU will yield a GPUBuffer with GPUBufferUsageFlags.STORAGE and GPUBufferUsageFlags.COPY_SRC flags. Is this true/allowed in all cases?)

Overall this seems reasonable, though I do have a few thoughts:

  • I don't think "JS" should be in the name (e.g. this API may also be used by TypeScript or Wasm)
  • Ideally the usage flags signal what can be done rather than what can't. So rather than JS_NONE it could be DISPATCH...
  • ...or if all MLBuffers have the ability to be used with dispatch(), then this is implied and we don't need this flag at all. Not passing any other usage flags would map to D3D12_HEAP_TYPE_DEFAULT

Thoughts on:

  • READ_FROM: buffer can be used with readBuffer(). Can be combined with WRITE_TO
  • WRITE_TO: buffer can be used with writeBuffer(). Can be combined with READ_FROM
  • (eventually) WEB_GPU_INTEROP: buffer can be used with GPUDevice.importExternalBuffer(). Can be combined with ???

@bbernhar
Author

bbernhar commented Aug 6, 2024

Thanks @a-sully for the feedback.

nit: Did you mean to use MLBufferUsage.JS_READ in this example?

Good catch, fixed.

and we don't need this flag at all

SGTM.

Thoughts on:

  • READ_FROM: buffer can be used with readBuffer(). Can be combined with WRITE_TO
  • WRITE_TO: buffer can be used with writeBuffer(). Can be combined with READ_FROM

SGTM.

(eventually) WEB_GPU_INTEROP: buffer can be used with GPUDevice.importExternalBuffer(). Can be combined with ???

With any other WebNN usages. Calling importExternalBuffer() could simply ignore them, as the MLBuffer is (currently) treated as having the WebGPU-equivalent usage of STORAGE and is neutered.
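
Putting that together, a hedged sketch of the interop flow (importExternalBuffer() is the method proposed in #688; names are not final):

const shared = await mlContext.createBuffer(descriptor, MLBufferUsage.WEB_GPU_INTEROP);
mlContext.dispatch(graph, inputs, {'out': shared});
// Import yields a GPUBuffer with STORAGE | COPY_SRC usage; the MLBuffer is
// neutered (unusable by WebNN) while WebGPU owns it.
const gpuBuffer = await gpuDevice.importExternalBuffer(shared);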

chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Aug 21, 2024
Exposes MLBufferUsageFlags to MLBufferDescriptor and adds new usages to
maximize device memory bandwidth. After this change, createBuffer()
assumes "no usage" by default. To readBuffer() or writeBuffer(), the
corresponding usage flag must be specified by the web developer.
Combining usages is allowed but could be inefficient. Usages are
always validated even if a backend doesn't use it.

webmachinelearning/webnn#542

Bug: 343638938
Change-Id: I4d78e3f8bacd7cbabce3038c234c062c7c07b095
Cq-Include-Trybots: luci.chromium.try:win11-blink-rel
aarongable pushed a commit to chromium/chromium that referenced this issue Aug 21, 2024
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Aug 21, 2024
chromium-wpt-export-bot pushed a commit to web-platform-tests/wpt that referenced this issue Aug 21, 2024
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue Aug 23, 2024
i3roly pushed a commit to i3roly/firefox-dynasty that referenced this issue Aug 24, 2024
ErichDonGubler pushed a commit to erichdongubler-mozilla/firefox that referenced this issue Aug 26, 2024