[Retiarii] Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine #4086
Conversation
```python
def __init__(self, node_id, gpu_id, status='idle'):
    self.node_id = node_id
    self.gpu_id = gpu_id
    self.status = status
```
Is this the typical usage of `@dataclass`?
I'm not sure if it is typical. Since the dataclass `GPUDevice` inherits from the dataclass `Device`, the default argument order in the generated `__init__` would be `node_id, status, gpu_id`, which does not look natural to a human reader. So I explicitly declare `__init__` with the expected order.
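The ordering issue described above can be sketched as follows. This is a minimal illustration, not the actual NNI classes: the field names follow the snippet in this thread, but the defaults and the `GPUDeviceExplicit` name are assumptions for demonstration.

```python
from dataclasses import dataclass

# A dataclass subclass appends its fields AFTER the base-class fields,
# so a defaulted field in the base forces an unnatural argument order.

@dataclass
class Device:
    node_id: str
    status: str = 'idle'

@dataclass
class GPUDevice(Device):
    gpu_id: int = -1  # appended after Device's fields

# The generated __init__ order is (node_id, status, gpu_id):
d = GPUDevice('node0', 'idle', 3)
assert d.gpu_id == 3

# Defining __init__ by hand restores the natural order; @dataclass
# does not overwrite an __init__ defined in the class body.
@dataclass
class GPUDeviceExplicit(Device):
    gpu_id: int = -1

    def __init__(self, node_id, gpu_id, status='idle'):
        self.node_id = node_id
        self.gpu_id = gpu_id
        self.status = status

e = GPUDeviceExplicit('node0', 7)
assert e.gpu_id == 7 and e.status == 'idle'
```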
```python
'''
Since CUDA_VISIBLE_DEVICES will be set to the list of real GPU IDs,
we need to remap the GPU IDs when generating code to match them correctly.
For example, when CUDA_VISIBLE_DEVICES="0,3", we need to use "cuda:0", "cuda:1" in the generated code.
'''
```
I don't get the point: why does `CUDA_VISIBLE_DEVICES="0,3"` correspond to `cuda:0`, `cuda:1`?
nni_manager sets `CUDA_VISIBLE_DEVICES` to the allocated GPUs when running a trial, and these are physical GPU IDs. When `CUDA_VISIBLE_DEVICES=0,3`, PyTorch sees two GPUs and names them `cuda:0` and `cuda:1`. Thus, when generating code that explicitly places operations (e.g., `x.to("cuda:1")`), we should use the `cuda:X` ID instead of the physical GPU ID.
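The remapping can be sketched like this. The function name `remap_cuda_ordinal` is hypothetical, not NNI's actual API; it only illustrates the physical-ID-to-CUDA-ordinal translation described above.

```python
# Given the physical GPU IDs that the trial was allocated (the same
# list exported via CUDA_VISIBLE_DEVICES), compute the CUDA ordinal
# PyTorch will assign to each physical device.

def remap_cuda_ordinal(physical_ids):
    """Map physical GPU ID -> CUDA ordinal, e.g. [0, 3] -> {0: 0, 3: 1}."""
    return {phys: idx for idx, phys in enumerate(sorted(physical_ids))}

mapping = remap_cuda_ordinal([0, 3])
# With CUDA_VISIBLE_DEVICES="0,3", physical GPU 3 is visible as "cuda:1"
assert mapping == {0: 0, 3: 1}
device_str = f'cuda:{mapping[3]}'
assert device_str == 'cuda:1'
```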
```diff
 def __repr__(self):
-    return f'to("{self.device}")'
+    if self.overridden_device_repr is None:
+        return f'to("{self.device.device_repr()}")'
```
In what cases is `overridden_device_repr` None?
A CUDA `GPUDevice` may remap the physical GPU ID to a CUDA ID, so the final device_repr can differ from `GPUDevice.device_repr`. `override_device_repr` is called in `pytorch.graph_to_pytorch_model` to replace `device_repr` with the correct CUDA ID. For example, when a job uses physical GPUs 1 and 2, their CUDA IDs should be `cuda:0` and `cuda:1`: `self.device.device_repr()` would return `cuda:1` and `cuda:2`, but `override_device_repr` should be `cuda:0` and `cuda:1`. I have added comments in the code to explain this.
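A minimal sketch of this override mechanism, under the assumption (from the diff in this thread) that the operation holds an `overridden_device_repr` slot which code generation fills in; the `ToOperation` class name is hypothetical.

```python
# The placement initially records the physical device repr; code
# generation calls override_device_repr with the CUDA ordinal before
# emitting the `.to(...)` call.

class ToOperation:
    def __init__(self, physical_repr):
        self.physical_repr = physical_repr     # e.g. "cuda:2" (physical GPU 2)
        self.overridden_device_repr = None     # filled during code generation

    def override_device_repr(self, repr_):
        self.overridden_device_repr = repr_    # e.g. "cuda:1" (CUDA ordinal)

    def __repr__(self):
        if self.overridden_device_repr is None:
            return f'to("{self.physical_repr}")'
        return f'to("{self.overridden_device_repr}")'

op = ToOperation('cuda:2')
assert repr(op) == 'to("cuda:2")'
op.override_device_repr('cuda:1')  # job got physical GPUs 1,2 -> cuda:0, cuda:1
assert repr(op) == 'to("cuda:1")'
```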
```diff
-def assemble(self, multi_model_placement: Dict[Model, GPUDevice]) \
-        -> Tuple[Model, Dict[Node, Union[GPUDevice, CPUDevice]]]:
+def assemble(self, multi_model_placement: Dict[Model, Device]) \
+        -> Tuple[Model, Dict[Node, Device]]:
```
It would be better to add a docstring for this function.
Done. I have added a docstring for `assemble` in `AbstractLogicalNode`. I have also added comments to each type of logical node explaining its function and how it should be assembled.
```diff
 from ...graph import Cell, Edge, Graph, Model, Node
 from ...operation import Operation, _IOPseudoOperation

-class CPUDevice:
+class CPUDevice(Device):
```
It is a little strange: why not put `CPUDevice` into device.py? Should it also be a dataclass?
Good comment. I have moved `CPUDevice` into nni.common.device and marked it as a dataclass.
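The resulting hierarchy could look roughly like the sketch below: both device kinds share one base, so placement code can type against `Device` alone. The field defaults and `device_repr` bodies here are illustrative assumptions, not the exact NNI implementation.

```python
from dataclasses import dataclass

# Shared Device base with dataclass subclasses for CPU and GPU devices.

@dataclass(frozen=True)
class Device:
    node_id: str

@dataclass(frozen=True)
class CPUDevice(Device):
    def device_repr(self):
        return 'cpu'

@dataclass(frozen=True)
class GPUDevice(Device):
    gpu_id: int = -1

    def device_repr(self):
        return f'cuda:{self.gpu_id}'

assert CPUDevice('node0').device_repr() == 'cpu'
assert GPUDevice('node0', 1).device_repr() == 'cuda:1'
```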
This PR:

- adds a base class `Device` for `GPUDevice` and `CPUDevice`;
- remaps GPU IDs to CUDA ordinals when `CUDA_VISIBLE_DEVICES` is set;
- fixes that `CGOExecutionEngine` misses the `placement_constraint` parameter when calling `send_trial`.

This PR is based on #4075, which should be merged first before this PR.