
API review, questions, brainstorming #298

Closed · zolkis opened this issue Oct 20, 2022 · 9 comments

zolkis (Collaborator) commented Oct 20, 2022

As a newcomer with fresh/naive eyes, I am using this issue to summarize my thoughts from the perspective of defining future algorithms. I tried to draft some algorithms and bumped into the question of how context, operators, graphs and internal slots are to be exposed and referenced in the spec.

This is a "questions" category issue; no fix is required. Please bear with me and explain things, as I have missed the prior discussions. If needed, targeted issues can be spawned.

From the spec:

  • Context objects are created synchronously by a factory method.
    Internal slots are used for storing the device type and power preference as string enums.
  • The CPU context has a synchronous compute() method.
  • The GPU context has sync and async compute(),
    as well as a sync factory method to create a command encoder.
  • MLGraphBuilder is created with a constructor, but is bound to a single context.
  • Graphs, operators and operands are intermediary constructs / empty interfaces.
  • Method overloading is used for returning MLOperator.

From the explainer:

  • A context owns the set of operations supported by a builder, and the way graphs are compiled and executed.

Basically, the main workflows covered are:

  • programmatically build a compute graph
    (though in theory it could also be built from a JSON or JSON-LD description, I understand well why a programmatic approach is more practical here);
  • compile a compute graph to obtain an immutable MLGraph (an opaque type);
  • encode an MLGraph to a GPUCommandBuffer (only on GPU contexts);
  • execute (compute) a graph asynchronously (GPU) or synchronously (CPU, GPU);
  • integrate the graph with other Web APIs that provide input data to the graph.

To me the use cases seem to revolve around graphs, which leads to the premise that graphs could be the primary interface exposed by the spec.
First we construct them, and there I am not sure whether the context is relevant: is construction purely syntactic, or are context-dependent optimizations already supposed to happen, by specification, during the build?

I am checking the implementation to figure things out, but for now I have some questions:

  • Does Context correspond to anything more substantial than string enum options for device (CPU, GPU) and power prefs? To me, context seems to be only a parameter to the workflows above, not a standalone thing.
  • How many contexts are allowed? How many are typically used?
  • How many builders can (are meant to) work in a context?
  • Could a graph (in theory) be built as a structure only, and then compiled/executed for different contexts?
  • From the implementation: an operator object always belongs to one builder and can be connected (well, bound) or not.
    Question: could one bind a user function to it that overrides the default one?

(This post may be edited/reformulated in the future).

huningxin (Contributor) commented

Thanks for your input, @zolkis! I have some initial comments regarding sync/async and the implementation.

From the spec:

  • Context objects are created synchronously by a factory method.
    Internal slots are used for storing device type and power prefs as string enums.

That's true for the current spec. However, the WG agreed to support async context creation. There are two pending PRs (#274 and #285) that add the async context creation method; they are pending TAG review of the naming.

  • CPU context has a synchronous compute() method

The CPU context also has an async computeAsync() method. As the spec says: "Asynchronously carries out the computational workload of a compiled graph MLGraph on a separate timeline, either on a worker thread for the CPU execution, or on a GPU timeline for the submission of GPU workload on the command queue."

From the workflow:

  • execute (compute) a graph asynchronously (GPU) or synchronously (CPU, GPU),

ditto: one can also execute (compute) a graph asynchronously (CPU, GPU).
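
For illustration, a usage sketch of the two execution flavors; this assumes the draft-era shape where compute() and computeAsync() live on the context and take the graph plus pre-allocated buffer views, so treat the exact signatures as an assumption:

// Hypothetical sketch; signatures assumed, not normative.
const inputs  = { x: new Float32Array(4) };
const outputs = { y: new Float32Array(4) };

// Async: resolves once results are written into the output views.
await context.computeAsync(graph, inputs, outputs);

// Sync: blocks the calling thread, so it is only viable in a dedicated worker.
context.compute(graph, inputs, outputs);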

From the implementation:

  • Does Context correspond to anything more substantial than string enum options for device (CPU, GPU) and power prefs? To me context seems to be only a parameter to the workflows above, not a standalone thing.

That just reflects the current implementation. MLContext should be backed by a native ML API, for example to check whether a required device, such as the GPU, is capable of supporting WebNN graph execution.

  • an operator object always belongs to one builder, can be connected (well, bound) or not.
    Question: could one bind a user function to it, that overrides the default one?

No. There is no custom operator support in the current WebNN spec. There was related discussion in #6.

zolkis (Collaborator, Author) commented Oct 25, 2022

Thanks for the replies. Here is a second round of questions/arguments about the following topics. Please bear with me. :)

Context

It seems ML frameworks do without it, but I agree that introducing the notion of a context, comprising the hardware abstraction, the resources needed, etc., might be future-proof thinking.

So we can create contexts in a sync or async way.
Question: for simplicity, could we use a constructor for creating a context synchronously, and a factory method for creating it asynchronously? That would also avoid the naming issues.

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface ML {
    Promise<MLContext> createContext(optional MLContextOptions options = {});
};

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLContext {
    constructor(optional MLContextOptions options = {});
    readonly attribute MLDeviceType deviceType;
    readonly attribute MLPowerPreference powerPreference;
};
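
For illustration, the two creation paths would then look like this (a hypothetical sketch of the IDL above, not the current API):

// Hypothetical sketch of the IDL above, not the current API.
// Synchronous creation via the constructor:
const syncContext = new MLContext({ deviceType: "cpu", powerPreference: "low-power" });

// Asynchronous creation via the factory method:
const asyncContext = await navigator.ml.createContext({ deviceType: "gpu" });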

Also, it might make sense to expose the content of the context instead of keeping it in internal slots, for testing, clarity, etc. There might be other internal slots of the context that we don't want to expose, but these data were provided by the script in the first place.

CPU vs GPU context

The CPU and GPU contexts are rather different. Currently the differentiation happens inside the algorithms, which is both obscure and a complication for algorithms and testing alike.
One possibility is to define separate contexts for CPU and GPU. That is, deviceType would move out of MLContextOptions and we could define separate MLContextCPU and MLContextGPU interfaces (naming is an open question).
While the world would certainly look rounder :), I am not sure of the practical gain, so I'll leave this here for now; let's first see how exactly we need to use the context in our use cases, then return to this.

Builder vs context

I needed multiple re-reads to rethink the developer use cases here.
For now, please bear with me on the hypothesis that we could delay binding an MLContext to a builder.
What if we defined part of the builder without MLContext, i.e. the (internal) structure built here would be usable in multiple contexts?

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLGraphBuilder {
  // Construct the graph builder without context.
  constructor();
  // ... include all the current functions, except the following
};

We use the MLGraphBuilder to store the result of the programmatic definition of a graph of MLOperands and MLOperators.
In theory this is not context dependent, unless context-dependent optimizations are already done in this phase that cannot be deferred to the build() method(s).
Question: would it be enough to provide the context to the build methods?
It would be nice, because it would allow building graphs for multiple contexts from the same internal graph representation.

partial interface MLGraphBuilder {
  // Compile the graph up to the specified output operands asynchronously.
  Promise<MLGraph> build(MLContext context, MLNamedOperands outputs);

  // Compile the graph up to the specified output operands synchronously.
  MLGraph buildSync(MLContext context, MLNamedOperands outputs);
};
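
Under that hypothesis, the same built structure could be compiled for several contexts; a hypothetical sketch (the operand/operator names are assumptions for illustration):

// Hypothetical sketch of the context-free builder proposed above.
const builder = new MLGraphBuilder();
const x = builder.input("x", { type: "float32", dimensions: [2, 2] });
const y = builder.relu(x);

// Compile the same internal representation for two different contexts.
const cpuGraph = await builder.build(cpuContext, { y });
const gpuGraph = await builder.build(gpuContext, { y });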

Graphs, encode, compute

We have arrived at the most important interface, IMHO: MLGraph, which right now is quite opaque, representing a graph built for a given context.
I wonder whether it makes sense to move some methods to this interface. Let's start with the compute() methods and leave the command encoder to the next section.
Note that a given MLGraph is already bound to a given context.

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLGraph {
  readonly attribute MLContext context;

  Promise<undefined> compute(MLNamedArrayBufferViews inputs, MLNamedArrayBufferViews outputs);  
  undefined computeSync(MLNamedArrayBufferViews inputs, MLNamedArrayBufferViews outputs);
};
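
With compute() moved to the graph, execution would read, e.g. (a sketch against the IDL above):

// Sketch against the IDL above; the graph is already bound to its context.
const inputs  = { x: new Float32Array([1, -2, 3, -4]) };
const outputs = { y: new Float32Array(4) };
await graph.compute(inputs, outputs);
// graph.context refers back to the context the graph was built for.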

Command encoder

In addition, we have the command encoder for GPU contexts.
We could approach this in the following ways:

  1. Define a single MLGraph interface and separate MLContextCPU and MLContextGPU interfaces, where the latter contains the createCommandEncoder() method.
  2. Move the createCommandEncoder() method to MLGraph as well, which would only work for a graph built with a GPU context.
  3. Define separate MLGraphCPU, MLGraphGPU, etc. The latter would contain the createCommandEncoder() method.

Since an encoder is first tied to a graph, which is tied to a context, I would start with option 2, write the algorithm, see if it's good enough, then explore the other options.
That is, keep the single MLContext and MLGraph definitions, but minimize the context-dependent differences in the algorithms via the changes suggested above.

I think with these changes the examples would become more intuitive. I need to check if there are implementation blockers (this comment may be updated depending on the findings).

zolkis (Collaborator, Author) commented Oct 31, 2022

Additional arguments came up in a discussion with @huningxin and @anssiko.

Builder vs context

In #149, it was discussed whether builders should be context-agnostic before compiling to e.g. CPU or GPU. A counterargument was that some preloading might already happen in the build phase, and it would be inefficient to undo that later; therefore the builder should be early-bound to a context. It is not clear whether this preloading could be postponed until a later point.

The first argument is a developer use case: being able to (efficiently) reuse a pre-compilation graph, compiling it for different contexts.
The counterargument is valid implementation feedback.

I think both should be enabled by design. So while early-binding the context to the builder, it should be easy to reuse programmatically built structures, typically when migrating from CPU to GPU.

That could be achieved in a number of ways, e.g.

  • (currently) do it in the client script, e.g. define the build sequence in a function that takes a builder as an argument and builds up a given structure, then invoke the same function with a different builder (see the sketch after this list);
  • a kind of "copy constructor", in which a builder can take another builder (of the same or a different context) as a construction argument;
  • expose a "built" graph structure (e.g. reuse the MLGraph name) vs. a "compiled" graph structure (e.g. CompiledGraph or MLOutput), or internally identify the structure built by a builder by an opaque id and allow passing that id to a new builder with a different context. Not sure whether all this is worth it; we should check.
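
The first option is already expressible today; a minimal sketch (builder method names follow the then-current draft, so treat buildAsync() as an assumption):

// Sketch of the client-script reuse pattern (first bullet above).
// The build sequence is captured in a function taking a builder.
function defineNet(builder) {
  const x = builder.input("x", { type: "float32", dimensions: [1, 4] });
  return { y: builder.relu(x) };
}

// Invoke the same function with builders bound to different contexts.
const cpuBuilder = new MLGraphBuilder(cpuContext);
const cpuGraph = await cpuBuilder.buildAsync(defineNet(cpuBuilder));

const gpuBuilder = new MLGraphBuilder(gpuContext);
const gpuGraph = await gpuBuilder.buildAsync(defineNet(gpuBuilder));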

I think the current way is good enough (perhaps even the best, since it avoids complex transforms), so we don't necessarily need to change the API, but I would be interested in what people think today. It would certainly be a more generic approach to explicitly expose the boundaries in the typical developer flow described in this comment.

The general steps of operation execution are build > compile > dispatch > execute. The compile step is where activities like constant folding and fusion occur. Fusion in particular is the result of fusing multiple semantic operations, previously built as a subgraph, into an atomic executable unit. Additionally, for compute platforms like the GPU, the compile step is when JITing of shader fragments and preprocessing of constant weight data (aka weight packing) can occur.

Once the subgraph is compiled down, the compiled operations can be dispatched or placed into the command buffers. The placement of dispatches can be done by multiple CPU threads simultaneously while the GPU is executing the command buffers that were previously dispatched, preventing stalls and bubbles while appropriately saturating the GPU.

Under the hood, eager execution is all four of these steps executed sequentially in one go.
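
Schematically (the function names here are purely illustrative, not API surface):

// Purely illustrative pseudocode of the four stages.
const ir       = build(ops);              // programmatic graph definition
const compiled = compile(ir, device);     // folding, fusion, JIT, weight packing
const cmds     = dispatch(compiled, io);  // record into command buffers
execute(cmds);                            // run on the device timeline

// Eager execution runs the same four stages back to back for each operation.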

In any case, we should include an example of how to solve this developer use case, and mention that implementations may apply early optimizations in the build process.

zolkis (Collaborator, Author) commented Oct 31, 2022

Another argument in #149 is whether a context was created using defaults or explicit options.

Context creation: implicit vs explicit

This comment states,

When the caller creates an MLContext they have an option to either create a default context with preference options e.g. power preference, etc. -or- create a context from an existing device.
This distinction is important because if a default context is created, it is implied that the API implementation chooses and completely owns the device it uses for subsequent execution. In this use case, the caller has no way to know if the underlying device manages a HW resource or CPU memory because -- it is up to the API to decide.
So, the output must be in a form that can readily be used, which is an array buffer view.
If the underlying device happens to be a GPU, the output is automatically downloaded.
This contract is important because it is how the API is used in a device-agnostic way -- from the caller's point of view.

But if the context is explicitly created from an existing device, it means the device is under the control of the caller, which implies that the caller is the one who would manage the lifetime of the resources created by that device. In this use case, the output will be on the device's resource, and the caller would own downloading the data from the resources at the appropriate time in the user's scenario. This use case is what is being described as the one that needs the atomic operations (aka the op-level execution), because it is when the caller would have control over the device resource, including data uploads/downloads, layout assignment, and block layout processing, etc.

So, in short, yes, for the default context, the data is automatically uploaded/downloaded if the underlying device needs them, but that is not the scenario that would call for the efficient op-level execution where the explicit resource management is a big implicit part of the said "efficiency" to be had.

This is a fundamental difference in behavior, which would normally warrant defining separate, easily identifiable algorithms, or even separate interfaces. I think we can keep one interface, but we need a way to cleanly separate these behaviors in the spec without creating confusion. We could make referenceable definitions for the two classes/behaviors of context types.

To keep changes minimal, I propose introducing an "auto" device type, i.e.

enum MLDeviceType {
  "auto",
  "cpu",
  "gpu"
};
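
With this, the implicit/explicit choice becomes visible at the call site, e.g.:

// Sketch: "auto" makes the UA-managed (implicit) choice explicit.
const autoContext = await navigator.ml.createContext({ deviceType: "auto" });
// ...versus an explicitly requested device:
const gpuContext  = await navigator.ml.createContext({ deviceType: "gpu" });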

Note that this aligns well with MLDevicePreference from the Model Loader API.

I think this would better clarify (make explicit) the user's choice, and would make the spec more concise as well.
The keyword "auto" may be discussed, of course.

zolkis (Collaborator, Author) commented Nov 2, 2022

At the moment this is speculation, but based on what I have read so far, it seems to me that an MLContext object actually revolves around a single graph's lifecycle: construct it, build it, represent the background resources, preprocess/dispatch/encode, compute. Or at least it could do so without losing generality, since we can create multiple contexts.

One version of a minimal Web IDL that would enable this could be the following (not a proposal at this point, only an illustration for discussion).

interface mixin NavigatorML {
  [SecureContext, SameObject] readonly attribute ML ml;
};

Navigator includes NavigatorML;
WorkerNavigator includes NavigatorML;

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface ML {
  Promise<MLContext> createContext(optional MLContextOptions options = {});
  Promise<MLContext> createContext(GPUDevice gpuDevice);
};

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLContext {
  constructor(optional MLContextOptions options = {});

  readonly attribute MLContextOptions options;
  // graph is an internal slot (eventually in different forms during lifecycle) 
  // lifecycle state is an internal slot, e.g. "building", "compiled", "encoding", "encoded", "compute" etc.

  readonly attribute MLGraphBuilder builder;  // dedicated builder operates on internal slots

  Promise<undefined> build(MLNamedOperands outputs); // async version split out from MLGraphBuilder

  Promise<MLNamedArrayBufferViews> compute(MLNamedArrayBufferViews inputs);  
};

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLContextGPU : MLContext { };

partial interface MLContextGPU {
  constructor(GPUDevice gpuDevice);

  // the following deal with command encoding, using internal slots
  Promise<undefined> preprocess();  // former initializeGraph
  Promise<MLNamedGPUResources> dispatch(MLNamedGPUResources inputs);
  Promise<GPUCommandBuffer> encode(optional GPUCommandBufferDescriptor descriptor = {});
};
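
A usage sketch of this lifecycle (illustration only, mirroring the IDL above; the operand names are assumptions):

// Illustration only, mirroring the IDL sketch above.
const ctx = await navigator.ml.createContext({ powerPreference: "low-power" });

// The dedicated builder populates the context's internal graph slot.
const x = ctx.builder.input("x", { type: "float32", dimensions: [1, 4] });
await ctx.build({ y: ctx.builder.relu(x) });

// compute() allocates and returns the outputs for the given inputs.
const outputs = await ctx.compute({ x: new Float32Array([1, -2, 3, -4]) });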

Alternatively, we could fuse these into a single MLContext and expose an MLCommandEncoder interface attribute that can be null for non-GPU contexts.

zolkis (Collaborator, Author) commented Nov 3, 2022

TODO (placeholder): check FW use cases

  • custom layers/activation functions
  • graph/model composition/concatenation
  • performance adaptation
  • integration with real-time video/audio (buffering etc).

zolkis (Collaborator, Author) commented Nov 10, 2022

TL;DR (edited after discussion with @huningxin)

Proposal for exposing context

This proposal would simplify the discussion in #257, #255, #162.

It provides the context type as a single descriptor for the combination of resources used in the context, e.g. a valid combination of device(s), power preference, etc.
(It is analogous to the adapter-plus-device(s) concept in WebGPU.)

It enables simple and more intuitive differentiation in the algorithms between script-controlled CPU and GPU contexts and UA-managed (WebGPU) contexts.

In addition, it enables future use of hybrid contexts (e.g. CPU + accelerator) as well.

enum MLContextType {
  "cpu",     // script-controlled context
  "gpu",     // script-controlled context
  "webgpu",  // managed by the user agent
  // later, other context types may be defined, even using multiple devices, e.g. "cpu+npu" etc.
  // Note: in fact, all these context types could be separate interface classes as well...
};

enum MLPowerPreference {  // a hint
  "default",
  "high-performance",
  "low-power"
};

dictionary MLContextOptions {  // not a hint
  MLContextType contextType = "cpu";
  MLPowerPreference powerPreference = "default";
  GPUDevice? gpuDevice = null;
};

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface ML {
  Promise<MLContext> createContext(optional MLContextOptions options);

  [Exposed=(DedicatedWorker)]
  MLContext createContextSync(optional MLContextOptions options);

  // Internal slots
  // [[boolean managed]]  // `true` if the user agent controls the context (not needed)
  // [[MLContextType contextType]] 
  // [[MLPowerPreference powerPreference]]
  // [[implementation]] // perhaps "adapter" would be better

  // further methods (and eventually properties) will follow
};

So far, not much change, except:

  • contextType becomes explicit, encapsulating the participating device(s), but it could be named after the device if there is a single one; this would simplify the algorithmic steps.
  • GPUDevice is included in MLContextOptions, with the differentiation done in the algorithm, resulting in simpler factory methods.
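
For illustration, the factory calls would then look like this (a sketch over the IDL above):

// Sketch over the IDL above.
const cpuContext = await navigator.ml.createContext();  // defaults: "cpu", "default"

// A UA-managed WebGPU context, handing over an existing GPUDevice:
const adapter = await navigator.gpu.requestAdapter();
const device = await adapter.requestDevice();
const webgpuContext = await navigator.ml.createContext({
  contextType: "webgpu",
  gpuDevice: device
});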

zolkis (Collaborator, Author) commented Nov 29, 2022

The spec contains certain constraints that are hard to describe and enforce via algorithms. For instance, the note from the MLContext section (https://webmachinelearning.github.io/webnn/#api-mlcontext):

When the [[contextType]] is set to default with the MLContextOptions.deviceType set to gpu, the user agent is responsible for creating an internal GPU device that operates within the context and is capable of ML workload submission on behalf of the calling application. In this setting however, only ArrayBufferView inputs and outputs are allowed in and out of the graph execution since the application has no way to know what type of internal GPU device is being created on their behalf. In this case, the user agent is responsible for automatic uploads and downloads of the inputs and outputs to and from the GPU memory using this said internal device.

Or, from the MLCommandEncoder section,

Record the initialization of the MLGraph. This is a necessary step for optimal performance during graph execution as it gives the platform an opportunity to prepare and optimize constant input data for the subsequent execution of the graph. This method should only be called once per graph.

To achieve that, it should be possible to bind an MLGraph and an MLCommandEncoder to an MLContext of type "webgpu".

Therefore I'd add an internal slot [[model]] to MLContext that represents a compute graph bound to a context. If that context is of type "webgpu", then it will have MLCommandEncoder-specific initialization, dispatch and finish (e.g. the MLCommandEncoder interface could be exposed in the context as an attribute).

Also, the discussion in #149 reveals a possible use case for discerning between a compute graph (as built by a builder, which could be executed in multiple contexts) and a graph that is initialized for execution in a given context.
A builder could be generic (initialized without a context) or bound to a context (where adaptation to the context could already happen during the build).
A context-bound builder's build() produces an MLGraph that is already bound to that context.
A generic builder's build() could take a parameter for the context for which the MLGraph is built.

In summary: a builder's output, i.e. an MLGraph, is always bound to an MLContext, so it could just as well be (part of) an internal slot of MLContext, and the builder could be an attribute of MLContext.
As noted previously, MLCommandEncoder could also be an attribute of MLContext.

Proposal: include the graph builder and the command encoder as attributes of MLContext, and make MLGraph an internal slot of MLContext.

This would simplify things a great deal: it would avoid exposing the empty MLGraph interface in the spec, and using it as an internal slot would allow differentiating how it can be used in different context types, as well as managing its lifecycle.

[SecureContext, Exposed=(Window, DedicatedWorker)]
interface MLContext {
  // graph is an internal slot (eventually in different forms during lifecycle) 
  // lifecycle state could also be an internal slot, e.g. 
  //    "building", "compiled", "encoding", "encoded", "compute" etc.

  readonly attribute MLGraphBuilder builder;  // dedicated builder operates on internal slots
  readonly attribute MLCommandEncoder? commandEncoder;  // only defined for "webgpu" context

  Promise<MLNamedArrayBufferViews> compute(MLNamedArrayBufferViews inputs);  
};
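
For example (a sketch against the IDL above; build() as in the earlier MLContext sketch):

// Sketch against the IDL above; commandEncoder is non-null only for a
// "webgpu" context; build() as in the earlier MLContext sketch.
const x = ctx.builder.input("x", { type: "float32", dimensions: [1, 4] });
await ctx.build({ y: ctx.builder.relu(x) });

if (ctx.commandEncoder) {
  // "webgpu" context: record initialization and dispatches via the encoder.
} else {
  const outputs = await ctx.compute({ x: new Float32Array([1, -2, 3, -4]) });
}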

This and the previous proposal for MLContext would not change the implementation by much (mostly rearranging things), but would make the API and the spec algorithms much clearer, benefiting users and implementers alike.

zolkis (Collaborator, Author) commented Dec 1, 2022

Closing this issue, since the conclusions are followed up in separate issues.

zolkis closed this as completed Dec 1, 2022