
Multi-GPU support of one-shot NAS #4603

Merged

99 commits merged into microsoft:master on May 20, 2022

Conversation

@Frandium (Contributor) commented Mar 2, 2022

Description

Add multi-GPU support for one-shot NAS.

At its core, the technical challenge of multi-GPU support is doing data sharding the Lightning way. We replaced the original ad-hoc dataloaders with the CombinedLoader provided by Lightning, plus our subclass ConcatLoader; these dataloaders support data sharding natively.
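For illustration, here is a minimal sketch of the CombinedLoader pattern, assuming the PL 1.x import path and toy datasets (this is not the PR's code):

# A minimal sketch, assuming the PL 1.x import path and toy tensors.
# CombinedLoader merges several dataloaders into one; under DDP the trainer
# can then shard each inner loader with a DistributedSampler.
import torch
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.trainer.supporters import CombinedLoader  # PL 1.x path

train_loader = DataLoader(TensorDataset(torch.randn(64, 4)), batch_size=8)
val_loader = DataLoader(TensorDataset(torch.randn(32, 4)), batch_size=8)

# 'max_size_cycle' cycles the shorter loader so every step yields both batches,
# which bi-level one-shot algorithms (e.g. DARTS) need.
combined = CombinedLoader({'train': train_loader, 'val': val_loader},
                          mode='max_size_cycle')

for batch in combined:
    train_batch, val_batch = batch['train'], batch['val']
    break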

This PR also fixes some minor bugs and deprecated variable/method names.

Checklist

  • test case
  • doc: not needed

How to test

Test one-shot on multi-GPU following the guide here.

@ultmaster ultmaster changed the title Multicard oneshot nas dataloader Multi-GPU support of one-shot NAS May 10, 2022
@ultmaster ultmaster marked this pull request as ready for review May 10, 2022 13:52
@matluster

Tests require PL 1.6. Upgrade is in #4814.

@@ -74,51 +79,67 @@ class Lightning(Evaluator):

     Parameters
     ----------
-    lightning_module : LightningModule
+    lightning_module
Contributor:

Can the docs auto-add type hints for these parameters?

Contributor:

yes

                          f'You might have forgotten to import DataLoader from {__name__}: {train_dataloaders}',
                          RuntimeWarning)
        if not _check_dataloader(val_dataloaders):
            warnings.warn(f'Unexpected dataloader type: {type(train_dataloaders)}. '
Contributor:

Should be type(val_dataloaders).

-                 val_dataloaders: Union[DataLoader, List[DataLoader], None] = None):
+                 train_dataloaders: Optional[Any] = None,
+                 val_dataloaders: Optional[Any] = None,
+                 train_dataloader: Optional[Any] = None):
Contributor:

Why are there both train_dataloaders and train_dataloader?

Contributor:

I see, backward compatibility.

-        assert _check_dataloader(val_dataloaders), f'Wrong dataloader type. Try import DataLoader from {__name__}.'
+        if not _check_dataloader(train_dataloaders):
+            warnings.warn(f'Unexpected dataloader type: {type(train_dataloaders)}. '
+                          f'You might have forgotten to import DataLoader from {__name__}: {train_dataloaders}',
Contributor:

I may be missing some background. What does this message mean?


_check_dataloader used to check whether the dataloader is wrapped with nni.trace.

But the serializer has improved since then, and because we now support complex dataloader structures (e.g., a list of dataloaders, a dict of dataloaders, a list of dicts of dataloaders), there is no easy way to check them all. So I gave up. :(
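For readers without that background, a hedged illustration of what the warning refers to (the module path is taken from NNI 2.x and may differ between releases): the DataLoader re-exported by NNI is wrapped with nni.trace so its constructor arguments can be serialized, which is what _check_dataloader used to verify.

# Hedged sketch, not the PR's code: use NNI's traced DataLoader instead of
# torch.utils.data.DataLoader so the evaluator can serialize its arguments.
from nni.retiarii.evaluator.pytorch import DataLoader  # nni.trace-wrapped
from torchvision import datasets, transforms

train_loader = DataLoader(
    datasets.MNIST('data/mnist', train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=64)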

@@ -36,6 +37,9 @@ class LightningModule(pl.LightningModule):
     See https://pytorch-lightning.readthedocs.io/en/stable/common/lightning_module.html
     """

+    running_mode: Literal['multi', 'oneshot'] = 'multi'
Contributor:

Is it possible for a lightning class to be used both for multi and for oneshot?

@matluster commented May 19, 2022:

When I added this flag, I meant to check it before two actions:

  1. Whether to save an ONNX graph.
  2. Whether to report intermediate / final results.

But on second thought, maybe the evaluator could figure out by itself whether its inner module is a one-shot supernet, so this flag might not be needed.

No, we can't. Revert.
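For context, a hedged sketch (illustrative only, not the actual implementation) of how such a flag could gate the two actions mentioned above:

from typing import Literal
import pytorch_lightning as pl

class LightningModule(pl.LightningModule):
    # 'multi' = standalone multi-trial run, 'oneshot' = supernet training
    running_mode: Literal['multi', 'oneshot'] = 'multi'

    def on_fit_end(self):
        if self.running_mode == 'multi':
            # Only a multi-trial run would export an ONNX graph and report the
            # final result to the experiment; a one-shot supernet skips both.
            ...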

        return {
            'train': train_dataloaders,
            'val': val_dataloaders
        }, None
Contributor:

Is the return signature changed from the base class?


I put both the train and val dataloaders into the train dataloader, so we don't need a val dataloader.

Contributor:

I know, but the first returned value is a dict, not a train dataloader?
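As a generic Lightning aside (a sketch under the assumption that train_dataloader returns a dict, not the PR's actual helper): Lightning combines a dict of loaders internally and delivers each training batch as a dict with the same keys, which is why the train-dataloader slot can carry the validation data as well.

# Generic Lightning sketch: a dict returned from train_dataloader() is
# combined internally, and training_step receives a dict batch with the same
# keys, so validation data rides along with the training data for bi-level
# optimization.
import torch
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset

class DictLoaderModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def train_dataloader(self):
        trn = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)
        val = DataLoader(TensorDataset(torch.randn(32, 4), torch.randn(32, 1)), batch_size=8)
        return {'train': trn, 'val': val}  # combined by Lightning internally

    def training_step(self, batch, batch_idx):
        (x, y), (xv, yv) = batch['train'], batch['val']
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)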

@QuanluZhang QuanluZhang merged commit 39ec21c into microsoft:master May 20, 2022