How to train insgen with only 1 GPU #5
I ran this on 2 GPUs and it was much slower than the baseline StyleGAN2, taking nearly twice as long. Then I followed the solution in issue #1 and ran it on Colab. Again, it took twice as long.
If I apply this change I can run on a single GPU:

```diff
--- a/train.py
+++ b/train.py
@@ -413,7 +413,7 @@ def subprocess_fn(rank, args, temp_dir):
     dnnlib.util.Logger(file_name=os.path.join(args.run_dir, 'log.txt'), file_mode='a', should_flush=True)

     # Init torch.distributed.
-    if args.num_gpus > 1:
+    if args.num_gpus > 0:
```

The key to the above patch is that even with 1 GPU, the following code still needs to run so that the process groups are initialized (Lines 406 to 430 in 52bda7c). I changed the `args.num_gpus > 1` check so that this initialization also runs with a single GPU.
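For reference, here is a minimal sketch (not the repo's exact code) of what that single-process initialization amounts to: creating a `torch.distributed` process group of size 1 via a file-based rendezvous, the same pattern StyleGAN2-ADA's `subprocess_fn` follows, so that the later DDP wrapping and collective ops still work. The helper name and temp-file location are illustrative assumptions.

```python
import os
import tempfile
import torch

def init_single_gpu_process_group(rank=0, world_size=1):
    # File-based rendezvous: no free network port is needed for a single process.
    init_file = os.path.abspath(os.path.join(tempfile.mkdtemp(), '.torch_distributed_init'))
    backend = 'nccl' if torch.cuda.is_available() else 'gloo'
    torch.distributed.init_process_group(
        backend=backend,
        init_method=f'file://{init_file}',
        rank=rank,
        world_size=world_size,
    )

if __name__ == '__main__':
    init_single_gpu_process_group()
    print(torch.distributed.is_initialized())  # expected: True
```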
This method does not work on my machine. Many different strange errors appeared, and I don't know why.
@jkla139 I actually used the copy from https://github.com/Zhendong-Wang/Diffusion-GAN . I cloned from the main branch of this repo and got an error, which I tracked down to the following:

```diff
--- a/training/training_loop.py
+++ b/training/training_loop.py
@@ -403,8 +403,6 @@ def training_loop(
         snapshot_data = dict(training_set_kwargs=dict(training_set_kwargs))
         for name, module in [('G', G), ('D', D), ('G_ema', G_ema), ('augment_pipe', augment_pipe), ('D_ema', D_ema), ('DHead', DHead), ('GHead', GHead)]:
             if module is not None:
-                if name in ['DHead', 'GHead']:
-                    module = module.module
                 if num_gpus > 1:
                     misc.check_ddp_consistency(module, ignore_regex=r'.*\.w_avg')
                 module = copy.deepcopy(module).eval().requires_grad_(False).cpu()
```

Diffusion-GAN doesn't seem to have this, so I just removed those lines. It seems to be working.
Yes, those two lines need to be deleted; now it works.
Thanks for sharing, but I only have 1 GPU, so I cannot train it.
I see that the reason multi-GPU is required is to avoid the "effect of disabling shuffle BN" from MoCo, but I cannot understand why the batch data must be shuffled across all GPUs rather than within a single GPU.
Could you provide a way to shuffle the batch data on 1 GPU so that this effect is avoided?
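For what it's worth: MoCo shuffles the key batch across GPUs so that the BatchNorm statistics used for a key come from a different sub-batch than the one used for the matching query, which stops BN statistics from leaking a matching shortcut. On a single GPU that leakage cannot be fully removed by shuffling alone, but a rough stand-in is to permute the key batch before the key-encoder forward pass and restore the order afterwards. The sketch below is a hypothetical illustration of that idea; `key_encoder` and the function name are assumptions, not code from this repo.

```python
import torch

def key_forward_with_shuffle(key_encoder, x_key):
    # Permute the key batch, run the (no-grad) key encoder, then undo the permutation
    # so each key stays aligned with its query.
    perm = torch.randperm(x_key.shape[0], device=x_key.device)
    inv_perm = torch.argsort(perm)
    with torch.no_grad():
        k = key_encoder(x_key[perm])
    return k[inv_perm]
```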