
Multi-GPU support of one-shot NAS #4603

Merged

99 commits merged into microsoft:master on May 20, 2022

Conversation

@Frandium (Contributor) commented Mar 2, 2022

Description

Add multi-GPU support for one-shot NAS.

At its core, the technical challenge of multi-GPU support is doing data sharding the Lightning way. We replaced the original ad-hoc dataloaders with the CombinedLoader provided by Lightning, plus our subclass ConcatLoader; these dataloaders support data sharding natively.
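For illustration, here is a minimal sketch of the CombinedLoader pattern, assuming the PL 1.x import path and toy datasets (this is not the PR's code):

# A minimal sketch, assuming the PL 1.x import path and toy tensors.
# CombinedLoader merges several dataloaders into one; under DDP the trainer
# can then shard each inner loader with a DistributedSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.trainer.supporters import CombinedLoader  # PL 1.x path

train_loader = DataLoader(TensorDataset(torch.randn(64, 4)), batch_size=8)
val_loader = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)

# 'max_size_cycle' cycles the shorter loader so every step yields both batches,
# which bi-level one-shot algorithms (e.g. DARTS) need.
combined = CombinedLoader({'train': train_loader, 'val': val_loader},
                          mode='max_size_cycle')

for batch in combined:
    train_batch, val_batch = batch['train'], batch['val']
    break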

This PR also fixes some minor bugs and deprecated variable/method names.

Checklist

  • test case
  • doc: not needed

How to test

Test one-shot on multi-GPU following the guide here.

@ultmaster ultmaster changed the title Multicard oneshot nas dataloader Multi-GPU support of one-shot NAS May 10, 2022
@ultmaster ultmaster marked this pull request as ready for review May 10, 2022 13:52
@matluster

Tests require PL 1.6. Upgrade is in #4814.

@@ -74,51 +79,67 @@ class Lightning(Evaluator):

     Parameters
     ----------
-    lightning_module : LightningModule
+    lightning_module
Contributor:

Can the docs auto-add type hints for these parameters?

Contributor:

yes

                          f'You might have forgotten to import DataLoader from {__name__}: {train_dataloaders}',
                          RuntimeWarning)
        if not _check_dataloader(val_dataloaders):
            warnings.warn(f'Unexpected dataloader type: {type(train_dataloaders)}. '
Contributor:

Should be type(val_dataloaders).

-                 val_dataloaders: Union[DataLoader, List[DataLoader], None] = None):
+                 train_dataloaders: Optional[Any] = None,
+                 val_dataloaders: Optional[Any] = None,
+                 train_dataloader: Optional[Any] = None):
Contributor:

Why are there both train_dataloaders and train_dataloader?

Contributor:

I see, backward compatibility.

-        assert _check_dataloader(val_dataloaders), f'Wrong dataloader type. Try import DataLoader from {__name__}.'
+        if not _check_dataloader(train_dataloaders):
+            warnings.warn(f'Unexpected dataloader type: {type(train_dataloaders)}. '
+                          f'You might have forgotten to import DataLoader from {__name__}: {train_dataloaders}',
Contributor:

I may be missing some background. What does this message mean?


_check_dataloader used to check whether the dataloader is wrapped with nni.trace.

But the serializer has improved since then, and because we now support complex dataloader structures (e.g., a list of dataloaders, a dict of dataloaders, a list of dicts of dataloaders), there is no easy way to check them all. So I gave up. :(
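For readers without that background, a hedged illustration of what the warning refers to (the module path is taken from NNI 2.x and may differ between releases): the DataLoader re-exported by NNI is wrapped with nni.trace so its constructor arguments can be serialized, which is what _check_dataloader used to verify.

# Hedged sketch, not the PR's code: use NNI's traced DataLoader instead of
# torch.utils.data.DataLoader so the evaluator can serialize its arguments.
from nni.retiarii.evaluator.pytorch import DataLoader  # nni.trace-wrapped
from torchvision import datasets, transforms

train_loader = DataLoader(
    datasets.MNIST('data/mnist', train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=64)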

@@ -36,6 +37,9 @@ class LightningModule(pl.LightningModule):
     See https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html
     """

+    running_mode: Literal['multi', 'oneshot'] = 'multi'
Contributor:

Is it possible for a lightning class to be used both for multi and for oneshot?

@matluster commented May 19, 2022:

When I added this flag, I meant to check it before two actions:

  1. Whether to save an ONNX graph.
  2. Whether to report intermediate / final results.

But on second thought, maybe the evaluator could figure out by itself whether its inner module is a one-shot supernet, so this flag might not be needed.

No, we can't. Revert.
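For context, a hedged sketch (illustrative only, not the actual implementation) of how such a flag could gate the two actions mentioned above:

from typing import Literal
import pytorch_lightning as pl

class LightningModule(pl.LightningModule):
    # 'multi' = standalone multi-trial run, 'oneshot' = supernet training
    running_mode: Literal['multi', 'oneshot'] = 'multi'

    def on_fit_end(self):
        if self.running_mode == 'multi':
            # Only a multi-trial run would export an ONNX graph and report the
            # final result to the experiment; a one-shot supernet skips both.
            ...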

        return {
            'train': train_dataloaders,
            'val': val_dataloaders
        }, None
Contributor:

Is the return signature changed from the base class?


I put both the train and val dataloaders into the train dataloader, so we don't need a val dataloader.

Contributor:

I know, but the first returned value is a dict, not a train dataloader?
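As a generic Lightning aside (a sketch under the assumption that train_dataloader returns a dict, not the PR's actual helper): Lightning combines a dict of loaders internally and delivers each training batch as a dict with the same keys, which is why the train-dataloader slot can carry the validation data as well.

# Generic Lightning sketch: a dict returned from train_dataloader() is
# combined internally, and training_step receives a dict batch with the same
# keys, so validation data rides along with the training data for bi-level
# optimization.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class DictLoaderModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def train_dataloader(self):
        trn = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)
        val = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
        return {'train': trn, 'val': val}  # combined by Lightning internally

    def training_step(self, batch, batch_idx):
        (x, y), (xv, yv) = batch['train'], batch['val']
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)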

@QuanluZhang QuanluZhang merged commit 39ec21c into microsoft:master May 20, 2022