
[Retiarii] Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine #4086

Merged: 17 commits into microsoft:master on Oct 11, 2021

Conversation

@hzhua (Contributor) commented Aug 19, 2021

  • Use an abstract class Device for GPUDevice and CPUDevice.
  • Fix an invalid CUDA ordinal: the CUDA device ID in PyTorch differs from the physical GPU ID when CUDA_VISIBLE_DEVICES is set.
  • Fix wrong device placement: CGOExecutionEngine omitted the placement_constraint parameter when calling send_trial.
  • Fix an error wrongly thrown because single-host multi-GPU placement was misidentified as multi-host placement.

This PR is based on #4075, which should be merged before this one.

@hzhua hzhua requested review from J-shang, QuanluZhang and SparkSnail and removed request for J-shang August 19, 2021 06:27
@hzhua hzhua marked this pull request as draft August 20, 2021 07:46
@hzhua hzhua marked this pull request as ready for review August 20, 2021 08:24
def __init__(self, node_id, gpu_id, status='idle'):
    self.node_id = node_id
    self.gpu_id = gpu_id
    self.status = status
Contributor:

Is this a typical usage of @dataclass?

Contributor Author (hzhua):

I'm not sure if it is typical. Since the dataclass GPUDevice inherits from the dataclass Device, the generated __init__ would by default order the fields as node_id, status, gpu_id, which does not look natural to a human. So I explicitly declare __init__ here with the expected order.
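For context, a minimal sketch of the ordering issue (the field names follow the PR; the exact class bodies in NNI may differ, and init=False is one way, assumed here, to let the hand-written __init__ coexist with @dataclass):

from dataclasses import dataclass

@dataclass
class Device:
    node_id: str
    status: str = 'idle'

# If @dataclass generated __init__ here, base-class fields would come first,
# giving GPUDevice(node_id, status, gpu_id) -- and since status has a
# default, gpu_id would need one too. Declaring __init__ by hand restores
# the natural signature GPUDevice(node_id, gpu_id, status='idle').
@dataclass(init=False)
class GPUDevice(Device):
    gpu_id: int

    def __init__(self, node_id, gpu_id, status='idle'):
        self.node_id = node_id
        self.gpu_id = gpu_id
        self.status = status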

@hzhua hzhua changed the title Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine [Retiarii] Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine Aug 23, 2021
@hzhua hzhua marked this pull request as draft August 25, 2021 02:23
@hzhua hzhua marked this pull request as ready for review August 26, 2021 05:33
'''
Since CUDA_VISIBLE_DEVICES will be set to the list of physical GPU IDs,
we need to remap the GPU IDs when generating code so that they match correctly.
For example, when CUDA_VISIBLE_DEVICES="0,3", we need to use "cuda:0" and "cuda:1" in the generated code.
'''
Contributor:

I can't get the point: why does CUDA_VISIBLE_DEVICES="0,3" map to cuda:0 and cuda:1?

Contributor Author (hzhua):

nni_manager sets CUDA_VISIBLE_DEVICES to the allocated GPUs when running a trial, and these are the physical GPU IDs.

When CUDA_VISIBLE_DEVICES=0,3, PyTorch sees two GPUs and names them cuda:0 and cuda:1.

Thus, when generating code that explicitly places operations (e.g., x.to("cuda:1")), we should use the "cuda:X" ID instead of the physical GPU ID.
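A small illustration of the remapping (plain PyTorch, not NNI code; assumes a machine with at least four GPUs so that physical GPUs 0 and 3 exist):

import os
import torch

# Suppose nni_manager allocated physical GPUs 0 and 3 to this trial.
# CUDA_VISIBLE_DEVICES must be set before CUDA is first initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,3'

# PyTorch now sees exactly two devices, renumbered from zero:
# "cuda:0" is physical GPU 0 and "cuda:1" is physical GPU 3.
assert torch.cuda.device_count() == 2

# Generated code must therefore use the remapped ordinal, not the
# physical ID: placing a tensor on physical GPU 3 is written as "cuda:1".
x = torch.randn(4).to('cuda:1')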

 def __repr__(self):
-    return f'to("{self.device}")'
+    if self.overridden_device_repr is None:
+        return f'to("{self.device.device_repr()}")'
Contributor:

In what cases is overridden_device_repr None?

Contributor Author (hzhua):

GPUDevice may remap a physical GPU ID to a CUDA ID, so the correct representation differs from GPUDevice.device_repr. override_device_repr is called in pytorch.graph_to_pytorch_model to replace device_repr with the correct CUDA ID. For example, when a job uses physical GPUs 1 and 2, their CUDA IDs should be cuda:0 and cuda:1: self.device.device_repr() would return cuda:1 and cuda:2, while overridden_device_repr holds cuda:0 and cuda:1.

I have added comments in the code to explain this.
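A hedged sketch of this mechanism (the attribute and method names follow the diff and this reply; the enclosing class name is made up for illustration):

class ToOperation:
    '''Illustrative stand-in for the node that emits .to(...) calls.'''

    def __init__(self, device):
        self.device = device
        self.overridden_device_repr = None

    def override_device_repr(self, device_repr):
        # Called from pytorch.graph_to_pytorch_model once the CUDA ordinal
        # is known, e.g. physical GPU 1 under CUDA_VISIBLE_DEVICES="1,2"
        # becomes "cuda:0".
        self.overridden_device_repr = device_repr

    def __repr__(self):
        if self.overridden_device_repr is None:
            return f'to("{self.device.device_repr()}")'
        return f'to("{self.overridden_device_repr}")'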

-def assemble(self, multi_model_placement: Dict[Model, GPUDevice]) \
-        -> Tuple[Model, Dict[Node, Union[GPUDevice, CPUDevice]]]:
+def assemble(self, multi_model_placement: Dict[Model, Device]) \
+        -> Tuple[Model, Dict[Node, Device]]:
Contributor:

Better to add a docstring for this function.

Contributor Author (@hzhua, Sep 23, 2021):

Done. I have added a docstring for assemble in AbstractLogicalNode. I also added comments to each type of logical node to explain its function and how it should be assembled.
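For readers following along, a hypothetical shape of such a docstring, based only on the signature shown above (the wording is a guess, not NNI's actual text):

from typing import Dict, Tuple

# Stand-ins so this sketch runs without NNI installed; in NNI, Model and
# Node come from nni.retiarii.graph and Device from nni.common.device.
class Model: ...
class Node: ...
class Device: ...


class AbstractLogicalNode:
    def assemble(self, multi_model_placement: Dict[Model, Device]) \
            -> Tuple[Model, Dict[Node, Device]]:
        '''
        Assemble this logical node into a node of the physical model.

        Parameters
        ----------
        multi_model_placement
            Maps each model being cross-graph optimized to the device it
            is placed on.

        Returns
        -------
        The physical model containing the assembled node(s), together with
        the device placement of each produced node.
        '''
        raise NotImplementedError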


from ...graph import Cell, Edge, Graph, Model, Node
from ...operation import Operation, _IOPseudoOperation

-class CPUDevice:
+class CPUDevice(Device):
Contributor:

It is a little strange: why not put CPUDevice into device.py? Should it also be a dataclass?

Contributor Author (hzhua):

Good comment. I have moved CPUDevice into nni.common.device and marked it as a dataclass.
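A rough sketch of the result in nni.common.device, extending the earlier Device/GPUDevice sketch (the "cpu" representation string is an assumption):

from dataclasses import dataclass

@dataclass
class Device:
    node_id: str
    status: str = 'idle'

    def device_repr(self) -> str:
        raise NotImplementedError

@dataclass
class CPUDevice(Device):
    # No extra fields: a CPU placement is identified by its node alone.
    def device_repr(self) -> str:
        return 'cpu'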

@QuanluZhang QuanluZhang merged commit 1458312 into microsoft:master Oct 11, 2021