[RFC][VM] Heterogeneous execution in Relay VM #4178
Comments
I think if we look at my recent PR, we probably need to track the device context when we allocate storage. The storage's context will prevent merging different pieces of storage.
@jroesch thanks. I have added references to the PR in the RFC.
I'm interested in this. @wweic I'll reach out to you for advice.
Ah, thanks for the reminder. This is closed by #6337.
Heterogeneous execution in Relay VM
Goal
The Relay graph runtime supports executing different parts of a graph on different devices, i.e. heterogeneous execution. We'd like to port this feature to the Relay VM.
Non-goals
The device annotation pass has a limitation: it assumes all the computation happens inside a single function, so it cannot compute device assignments across multiple Relay functions. This could be a problem if we allocate a GPU tensor in the main function but then call out to a tensor array concatenate operation that lives in another Relay function; it might crash or copy to CPU memory (I haven't experimented yet). The proper fix is to implement an interprocedural analysis for the device annotation pass.
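To make the concern concrete, here is a rough Python sketch (names such as `relay.annotation.on_device`, `tvm.IRModule`, and `tvm.gpu` have moved around between TVM releases, so treat this as an illustration rather than working reference code) of an annotated main function calling a separate Relay function:

```python
import tvm
from tvm import relay

# A separate Relay function that concatenates its input with itself.
x = relay.var("x", shape=(3, 4))
concat_fn = relay.Function([x], relay.concatenate([x, x], axis=0))

mod = tvm.IRModule()
concat_gv = relay.GlobalVar("concat2")
mod[concat_gv] = concat_fn

# In main, the addition is annotated to run on the GPU. The device
# annotation pass analyzes one function at a time, so the callee
# "concat2" gets no device assignment derived from this annotation.
y = relay.var("y", shape=(3, 4))
on_gpu = relay.annotation.on_device(relay.add(y, y), tvm.gpu(0))
mod["main"] = relay.Function([y], concat_gv(on_gpu))
```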
Current Design in Relay Graph Runtime
Compilation
Reference: #2361
Summary: If users want to specify a device for an operator to run on, they can wrap an expression with an annotation operator named `on_device(expr, dev_id)`. In the `RunDeviceAnnotationPass` step during `relay.build`, we replace each `on_device` node with a `device_copy` node. In the `GraphPlanMemory` pass, we compute the device assignment (`device_type`, see next section) of each memory block. This is possible because the graph runtime only supports static graphs, so all of the information can be captured statically. Then, during native code generation, the `device_copy` node is mapped to a special packed function named `__copy`.
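For illustration, a minimal end-to-end sketch of this flow, assuming the Python API of roughly the TVM version this RFC targets (`relay.annotation.on_device`, `relay.build_config(fallback_device=...)`, and a per-device target dict; later releases renamed or relocated some of these entry points):

```python
import tvm
from tvm import relay

x = relay.var("x", shape=(3, 4))
y = relay.var("y", shape=(3, 4))

# Ask for the addition to run on the GPU; everything else stays on the
# fallback device. on_device is later rewritten into device_copy nodes.
add = relay.annotation.on_device(relay.add(x, y), tvm.gpu(0))
out = relay.multiply(add, relay.const(2.0))
func = relay.Function([x, y], out)

# Heterogeneous build: one target per device type plus a fallback device.
with relay.build_config(opt_level=3, fallback_device=tvm.cpu(0)):
    graph_json, lib, params = relay.build(
        tvm.IRModule.from_expr(func),
        target={"cpu": "llvm", "cuda": "cuda"},
    )
```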
Runtime
Reference: #1695
Summary: The graph JSON file gains a new field named `device_type` that specifies which device each static memory node should be scheduled to, and the runtime allocates the memory on that device accordingly. When the graph runtime sees the special operator named `__copy`, it calls `TVMArrayCopyFromTo` to move memory across devices correctly.
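The copy itself is the same primitive the NDArray API exposes; a small sketch of what `__copy` boils down to (assuming a CUDA-enabled build and the `tvm.gpu`/`copyto` names of this era):

```python
import numpy as np
import tvm

# __copy moves a tensor's data between devices, which is what
# TVMArrayCopyFromTo does in the C runtime API.
cpu_arr = tvm.nd.array(np.ones((3, 4), dtype="float32"), tvm.cpu(0))
gpu_arr = tvm.nd.empty((3, 4), "float32", tvm.gpu(0))
cpu_arr.copyto(gpu_arr)  # cross-device copy, CPU -> GPU
```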
Proposal for Relay VM
Compilation
References: the `AllocStorage` opcode, which allocates physical memory ([Relay][Memory][VM] #3560).

We should be able to reuse the entire workflow up until `RunDeviceAnnotationPass`. The VM compiler, which translates Relay expressions into VM opcodes, needs to map each `device_copy` node to an opcode named `DeviceCopy(src_register, dst_register)`. The tensor object in each register should carry its device context so the VM knows how to copy the data. We also need to change `AllocTensor` (later `AllocStorage`): the device context must be attached to the instruction so we know where to allocate the memory; right now we just use the default context. A hypothetical sketch of these instruction layouts is shown below.
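The sketch uses plain Python dataclasses rather than the VM's actual C++ instruction structs; every field name here is illustrative, not TVM's API:

```python
from dataclasses import dataclass

@dataclass
class DeviceCopy:
    """Proposed opcode: copy a tensor between devices."""
    src_register: int  # register holding the source tensor
    dst_register: int  # register holding the (pre-allocated) destination tensor
    # The source and destination contexts are read off the tensor objects
    # themselves, as proposed above.

@dataclass
class AllocStorage:
    """AllocStorage extended with an explicit device context."""
    dst_register: int    # register that receives the allocated storage
    size_register: int   # register holding the number of bytes to allocate
    alignment: int
    dtype_hint: str
    device_type: int     # proposed: which device type to allocate on
    device_id: int       # proposed: which instance of that device
```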
VM Runtime
The VM needs to implement the changes to `AllocTensor` and `DeviceCopy`, as sketched below.
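A minimal sketch, again hypothetical and in Python rather than the VM's C++ interpreter, of how the two instructions might be serviced (assuming registers hold `tvm.nd.NDArray` objects and that `tvm.context` resolves an integer device type, as it did at the time):

```python
import tvm

def exec_alloc_storage(instr, registers):
    # Honor the device recorded on the instruction instead of always
    # allocating on the default context.
    ctx = tvm.context(instr.device_type, instr.device_id)
    nbytes = int(registers[instr.size_register])
    registers[instr.dst_register] = tvm.nd.empty((nbytes,), "uint8", ctx)

def exec_device_copy(instr, registers):
    # Both registers already hold tensors; their device contexts tell the
    # VM where the data lives and where it has to go (cf. __copy above).
    src = registers[instr.src_register]
    dst = registers[instr.dst_register]
    src.copyto(dst)  # cross-device copy into the pre-allocated destination
```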
Tasks
- Map `device_copy` nodes to the `DeviceCopy` opcode in the VM compiler.
- Attach the device context to `AllocTensor`/`AllocStorage` in the VM compiler.
- Allocate memory on the annotated device for `AllocTensor`/`AllocStorage` in the VM runtime.
- Implement the `DeviceCopy` opcode in the VM runtime.

cc @icemelon9 @zhiics @zxy844288792 @jroesch @tqchen @yzhliu