
[Retiarii] Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine #4086

Merged: 17 commits into microsoft:master on Oct 11, 2021

Conversation

@hzhua (Contributor) commented Aug 19, 2021

  • Use an abstract class Device for GPUDevice and CPUDevice.
  • Fix an invalid CUDA ordinal: the CUDA device ID in PyTorch differs from the physical GPU ID when CUDA_VISIBLE_DEVICES is set.
  • Fix wrong device placement: CGOExecutionEngine omitted the placement_constraint parameter when calling send_trial.
  • Fix an error wrongly thrown because single-host multi-GPU placement was misidentified as multi-host placement.

This PR is based on #4075, which should be merged before this one.

@hzhua hzhua requested review from J-shang, QuanluZhang and SparkSnail and removed request for J-shang August 19, 2021 06:27
@hzhua hzhua marked this pull request as draft August 20, 2021 07:46
@hzhua hzhua marked this pull request as ready for review August 20, 2021 08:24
def __init__(self, node_id, gpu_id, status='idle'):
    self.node_id = node_id
    self.gpu_id = gpu_id
    self.status = status
Contributor:

Is this a typical usage of @dataclass?

Contributor Author (hzhua):

I'm not sure if it is typical. Since the dataclass GPUDevice inherits from the dataclass Device, the generated __init__ would by default order the fields as node_id, status, gpu_id, which does not look natural to a human. So I explicitly declare __init__ here with the expected order.
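For context, a minimal sketch of the ordering issue (the field names follow the PR; the exact class bodies in NNI may differ, and init=False is one way, assumed here, to let the hand-written __init__ coexist with @dataclass):

from dataclasses import dataclass

@dataclass
class Device:
    node_id: str
    status: str = 'idle'

# If @dataclass generated __init__ here, base-class fields would come first,
# giving GPUDevice(node_id, status, gpu_id) -- and since status has a
# default, gpu_id would need one too. Declaring __init__ by hand restores
# the natural signature GPUDevice(node_id, gpu_id, status='idle').
@dataclass(init=False)
class GPUDevice(Device):
    gpu_id: int

    def __init__(self, node_id, gpu_id, status='idle'):
        self.node_id = node_id
        self.gpu_id = gpu_id
        self.status = status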

@hzhua hzhua changed the title Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine [Retiarii] Bugfix: wrong device placement and invalid CUDA ordinal when using CGO engine Aug 23, 2021
@hzhua hzhua marked this pull request as draft August 25, 2021 02:23
@hzhua hzhua marked this pull request as ready for review August 26, 2021 05:33
'''
Since CUDA_VISIBLE_DEVICES will be set to the list of physical GPU IDs,
we need to remap the GPU IDs when generating code so that they match correctly.
For example, when CUDA_VISIBLE_DEVICES="0,3", we need to use "cuda:0" and "cuda:1" in the generated code.
'''
Contributor:

I can't get the point: why does CUDA_VISIBLE_DEVICES="0,3" map to cuda:0 and cuda:1?

Contributor Author (hzhua):

nni_manager sets CUDA_VISIBLE_DEVICES to the allocated GPUs when running a trial, and these are the physical GPU IDs.

When CUDA_VISIBLE_DEVICES=0,3, PyTorch sees two GPUs and names them cuda:0 and cuda:1.

Thus, when generating code that explicitly places operations (e.g., x.to("cuda:1")), we should use the "cuda:X" ID instead of the physical GPU ID.
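A small illustration of the remapping (plain PyTorch, not NNI code; assumes a machine with at least four GPUs so that physical GPUs 0 and 3 exist):

import os
import torch

# Suppose nni_manager allocated physical GPUs 0 and 3 to this trial.
# CUDA_VISIBLE_DEVICES must be set before CUDA is first initialized.
os.environ['CUDA_VISIBLE_DEVICES'] = '0,3'

# PyTorch now sees exactly two devices, renumbered from zero:
# "cuda:0" is physical GPU 0 and "cuda:1" is physical GPU 3.
assert torch.cuda.device_count() == 2

# Generated code must therefore use the remapped ordinal, not the
# physical ID: placing a tensor on physical GPU 3 is written as "cuda:1".
x = torch.randn(4).to('cuda:1')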

 def __repr__(self):
-    return f'to("{self.device}")'
+    if self.overridden_device_repr is None:
+        return f'to("{self.device.device_repr()}")'
Contributor:

In what cases is overridden_device_repr None?

Contributor Author (hzhua):

GPUDevice may remap a physical GPU ID to a CUDA ID, so the correct representation differs from GPUDevice.device_repr. override_device_repr is called in pytorch.graph_to_pytorch_model to replace device_repr with the correct CUDA ID. For example, when a job uses physical GPUs 1 and 2, their CUDA IDs should be cuda:0 and cuda:1: self.device.device_repr() would return cuda:1 and cuda:2, while overridden_device_repr holds cuda:0 and cuda:1.

I have added comments in the code to explain this.
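A hedged sketch of this mechanism (the attribute and method names follow the diff and this reply; the enclosing class name is made up for illustration):

class ToOperation:
    '''Illustrative stand-in for the node that emits .to(...) calls.'''

    def __init__(self, device):
        self.device = device
        self.overridden_device_repr = None

    def override_device_repr(self, device_repr):
        # Called from pytorch.graph_to_pytorch_model once the CUDA ordinal
        # is known, e.g. physical GPU 1 under CUDA_VISIBLE_DEVICES="1,2"
        # becomes "cuda:0".
        self.overridden_device_repr = device_repr

    def __repr__(self):
        if self.overridden_device_repr is None:
            return f'to("{self.device.device_repr()}")'
        return f'to("{self.overridden_device_repr}")'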

-def assemble(self, multi_model_placement: Dict[Model, GPUDevice]) \
-        -> Tuple[Model, Dict[Node, Union[GPUDevice, CPUDevice]]]:
+def assemble(self, multi_model_placement: Dict[Model, Device]) \
+        -> Tuple[Model, Dict[Node, Device]]:
Contributor:

Better to add a docstring for this function.

Contributor Author (@hzhua, Sep 23, 2021):

Done. I have added a docstring for assemble in AbstractLogicalNode. I also added comments to each type of logical node to explain its function and how it should be assembled.
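For readers following along, a hypothetical shape of such a docstring, based only on the signature shown above (the wording is a guess, not NNI's actual text):

from typing import Dict, Tuple

# Stand-ins so this sketch runs without NNI installed; in NNI, Model and
# Node come from nni.retiarii.graph and Device from nni.common.device.
class Model: ...
class Node: ...
class Device: ...


class AbstractLogicalNode:
    def assemble(self, multi_model_placement: Dict[Model, Device]) \
            -> Tuple[Model, Dict[Node, Device]]:
        '''
        Assemble this logical node into a node of the physical model.

        Parameters
        ----------
        multi_model_placement
            Maps each model being cross-graph optimized to the device it
            is placed on.

        Returns
        -------
        The physical model containing the assembled node(s), together with
        the device placement of each produced node.
        '''
        raise NotImplementedError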


from ...graph import Cell, Edge, Graph, Model, Node
from ...operation import Operation, _IOPseudoOperation

-class CPUDevice:
+class CPUDevice(Device):
Contributor:

It is a little strange: why not put CPUDevice into device.py? Should it also be a dataclass?

Contributor Author (hzhua):

Good comment. I have moved CPUDevice into nni.common.device and marked it as a dataclass.
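A rough sketch of the result in nni.common.device, extending the earlier Device/GPUDevice sketch (the "cpu" representation string is an assumption):

from dataclasses import dataclass

@dataclass
class Device:
    node_id: str
    status: str = 'idle'

    def device_repr(self) -> str:
        raise NotImplementedError

@dataclass
class CPUDevice(Device):
    # No extra fields: a CPU placement is identified by its node alone.
    def device_repr(self) -> str:
        return 'cpu'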

@QuanluZhang QuanluZhang merged commit 1458312 into microsoft:master Oct 11, 2021