RuntimeError: one of the variables needed for gradient computation has been modified #6
Comments
Emmm, I have also encountered this problem. Maybe you can check whether your code is the latest? Besides, I will look into the problem tomorrow. You can also use an older version by checking the commit history. The problem may be caused by my recent changes to the scheduler.
Can you share your environment?
I think you can debug step by step to understand the specific problem.
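One concrete way to do that step-by-step debugging is PyTorch's anomaly detection, which makes a failing backward() print the forward-pass operation whose saved tensor was later modified in place. A minimal sketch, using a stand-in linear model rather than the repository's training code:

import torch

# Anomaly mode is slow, so enable it only while hunting the bug.
torch.autograd.set_detect_anomaly(True)

net = torch.nn.Linear(4, 1)                      # stand-in for the real model
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x = torch.randn(8, 4)
loss = net(x).mean()
loss.backward()   # if a saved tensor had been modified in place, the error now includes a second traceback pointing at the offending forward op
opt.step()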
I have used the latest code on two RTX 3090s to test multi-GPU training and on one RTX 3090 to test single-GPU training, and I do not encounter this bug. Here is my setup.
the log : 2023-06-02 12:48:15,326: INFO: [train_multi_gpu.py: 119]: {'common': {'save_interval': 5, 'test_interval': 2, 'max_epoch': 100, 'seed': 3401, 'amp': False}, 'datasets': {'train_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv', 'test_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/LibriTTS_dev-other.csv', 'batch_size': 6, 'tensor_cut': 100000, 'num_workers': 0, 'fixed_length': 1000, 'pin_memory': True}, 'checkpoint': {'resume': False, 'checkpoint_path': '', 'disc_checkpoint_path': '', 'save_folder': './checkpoints/', 'save_location': '${checkpoint.save_folder}batch${datasets.batch_size}_cut${datasets.tensor_cut}_length${datasets.fixed_length}_'}, 'optimization': {'lr': 1e-05, 'disc_lr': 1e-05}, 'lr_scheduler': {'warmup_epoch': 2}, 'model': {'target_bandwidths': [1.5, 3.0, 6.0, 12.0, 24.0], 'sample_rate': 24000, 'channels': 1, 'train_discriminator': True, 'audio_normalize': True, 'filters': 32}, 'distributed': {'data_parallel': True, 'world_size': 2, 'find_unused_parameters': True, 'torch_distributed_debug': False}}
2023-06-02 12:48:15,331: INFO: [train_multi_gpu.py: 120]: Encodec Model Parameters: 14855843
2023-06-02 12:48:15,331: INFO: [train_multi_gpu.py: 121]: Disc Model Parameters: 283398
2023-06-02 12:48:15,331: INFO: [train_multi_gpu.py: 122]: model train mode :True | quantizer train mode :True
2023-06-02 12:48:15,593: INFO: [train_multi_gpu.py: 119]: {'common': {'save_interval': 5, 'test_interval': 2, 'max_epoch': 100, 'seed': 3401, 'amp': False}, 'datasets': {'train_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv', 'test_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/LibriTTS_dev-other.csv', 'batch_size': 6, 'tensor_cut': 100000, 'num_workers': 0, 'fixed_length': 1000, 'pin_memory': True}, 'checkpoint': {'resume': False, 'checkpoint_path': '', 'disc_checkpoint_path': '', 'save_folder': './checkpoints/', 'save_location': '${checkpoint.save_folder}batch${datasets.batch_size}_cut${datasets.tensor_cut}_length${datasets.fixed_length}_'}, 'optimization': {'lr': 1e-05, 'disc_lr': 1e-05}, 'lr_scheduler': {'warmup_epoch': 2}, 'model': {'target_bandwidths': [1.5, 3.0, 6.0, 12.0, 24.0], 'sample_rate': 24000, 'channels': 1, 'train_discriminator': True, 'audio_normalize': True, 'filters': 32}, 'distributed': {'data_parallel': True, 'world_size': 2, 'find_unused_parameters': True, 'torch_distributed_debug': False}}
2023-06-02 12:48:15,594: INFO: [train_multi_gpu.py: 120]: Encodec Model Parameters: 14855843
2023-06-02 12:48:15,595: INFO: [train_multi_gpu.py: 121]: Disc Model Parameters: 283398
2023-06-02 12:48:15,595: INFO: [train_multi_gpu.py: 122]: model train mode :True | quantizer train mode :True
2023-06-02 12:48:15,708: INFO: [distributed_c10d.py: 432]: Added key: store_based_barrier_key:1 to store for rank: 1
2023-06-02 12:48:15,716: INFO: [distributed_c10d.py: 432]: Added key: store_based_barrier_key:1 to store for rank: 0
2023-06-02 12:48:15,717: INFO: [distributed_c10d.py: 466]: Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-02 12:48:15,733: INFO: [distributed_c10d.py: 466]: Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
2023-06-02 12:49:27,759: INFO: [train_multi_gpu.py: 65]: | epoch: 1 | loss: 33.31180191040039 | loss_g: 31.2178897857666 | loss_w: 2.093912363052368 | lr: 1.0000000000000001e-07 | disc_lr: 1.0000000000000001e-07
2023-06-02 12:50:30,795: INFO: [train_multi_gpu.py: 65]: | epoch: 2 | loss: 18.22700309753418 | loss_g: 17.53032684326172 | loss_w: 0.6966761350631714 | lr: 9.990754267376514e-06 | disc_lr: 9.990754267376514e-06
2023-06-02 12:51:12,410: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 2 | loss_g: 16.217060089111328 | loss_disc: 1.999979853630066
2023-06-02 12:52:35,755: INFO: [train_multi_gpu.py: 65]: | epoch: 3 | loss: 11.52556324005127 | loss_g: 11.375588417053223 | loss_w: 0.1499750018119812 | lr: 9.979206008271393e-06 | disc_lr: 9.979206008271393e-06
2023-06-02 12:52:35,756: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9994720220565796
2023-06-02 12:53:58,440: INFO: [train_multi_gpu.py: 65]: | epoch: 4 | loss: 12.130687713623047 | loss_g: 12.015471458435059 | loss_w: 0.11521672457456589 | lr: 9.963055062204609e-06 | disc_lr: 9.963055062204609e-06
2023-06-02 12:53:58,440: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9973113536834717
2023-06-02 12:54:26,933: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 4 | loss_g: 10.545166969299316 | loss_disc: 1.9967026710510254
2023-06-02 12:55:49,203: INFO: [train_multi_gpu.py: 65]: | epoch: 5 | loss: 10.983535766601562 | loss_g: 10.894118309020996 | loss_w: 0.08941741287708282 | lr: 9.942318025365027e-06 | disc_lr: 9.942318025365027e-06
2023-06-02 12:55:49,204: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9918556213378906
2023-06-02 12:57:13,010: INFO: [train_multi_gpu.py: 65]: | epoch: 6 | loss: 10.523397445678711 | loss_g: 10.419259071350098 | loss_w: 0.10413823276758194 | lr: 9.917016206459796e-06 | disc_lr: 9.917016206459796e-06
2023-06-02 12:57:13,011: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9791228771209717
2023-06-02 12:57:40,469: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 6 | loss_g: 9.559894561767578 | loss_disc: 1.9766931533813477
2023-06-02 12:59:04,421: INFO: [train_multi_gpu.py: 65]: | epoch: 7 | loss: 9.345637321472168 | loss_g: 9.30517292022705 | loss_w: 0.040464136749506 | lr: 9.887175604818207e-06 | disc_lr: 9.887175604818207e-06
2023-06-02 12:59:04,422: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9636516571044922
2023-06-02 13:00:27,272: INFO: [train_multi_gpu.py: 65]: | epoch: 8 | loss: 9.28451919555664 | loss_g: 9.215991973876953 | loss_w: 0.06852763891220093 | lr: 9.852826883675634e-06 | disc_lr: 9.852826883675634e-06
2023-06-02 13:00:27,273: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9527044296264648
2023-06-02 13:00:54,698: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 8 | loss_g: 8.622124671936035 | loss_disc: 1.9357608556747437
2023-06-02 13:02:17,948: INFO: [train_multi_gpu.py: 65]: | epoch: 9 | loss: 6.146459579467773 | loss_g: 6.11637544631958 | loss_w: 0.03008396551012993 | lr: 9.814005338664973e-06 | disc_lr: 9.814005338664973e-06
2023-06-02 13:02:17,949: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9932515621185303
2023-06-02 13:03:42,270: INFO: [train_multi_gpu.py: 65]: | epoch: 10 | loss: 6.795133113861084 | loss_g: 6.749742031097412 | loss_w: 0.04539122059941292 | lr: 9.77075086154801e-06 | disc_lr: 9.77075086154801e-06
2023-06-02 13:03:42,271: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9943803548812866
2023-06-02 13:04:09,624: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 10 | loss_g: 6.215439796447754 | loss_disc: 1.9836242198944092
the log: 2023-06-02 13:08:02,786: INFO: [train_multi_gpu.py: 119]: {'common': {'save_interval': 5, 'test_interval': 2, 'max_epoch': 100, 'seed': 3401, 'amp': False}, 'datasets': {'train_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/libritts_train100h.csv', 'test_csv_path': '/mnt/lustre/sjtu/home/zkn02/train_encodec/datasets/LibriTTS_dev-other.csv', 'batch_size': 6, 'tensor_cut': 100000, 'num_workers': 0, 'fixed_length': 1000, 'pin_memory': True}, 'checkpoint': {'resume': False, 'checkpoint_path': '', 'disc_checkpoint_path': '', 'save_folder': './checkpoints/', 'save_location': '${checkpoint.save_folder}batch${datasets.batch_size}_cut${datasets.tensor_cut}_length${datasets.fixed_length}_'}, 'optimization': {'lr': 1e-05, 'disc_lr': 1e-05}, 'lr_scheduler': {'warmup_epoch': 2}, 'model': {'target_bandwidths': [1.5, 3.0, 6.0, 12.0, 24.0], 'sample_rate': 24000, 'channels': 1, 'train_discriminator': True, 'audio_normalize': True, 'filters': 32}, 'distributed': {'data_parallel': False, 'world_size': 4, 'find_unused_parameters': True, 'torch_distributed_debug': False}}
2023-06-02 13:08:02,791: INFO: [train_multi_gpu.py: 120]: Encodec Model Parameters: 14855843
2023-06-02 13:08:02,792: INFO: [train_multi_gpu.py: 121]: Disc Model Parameters: 283398
2023-06-02 13:08:02,792: INFO: [train_multi_gpu.py: 122]: model train mode :True | quantizer train mode :True
2023-06-02 13:10:11,058: INFO: [train_multi_gpu.py: 65]: | epoch: 1 | loss: 35.07368850708008 | loss_g: 33.48434066772461 | loss_w: 1.5893492698669434 | lr: 1.0000000000000001e-07 | disc_lr: 1.0000000000000001e-07
2023-06-02 13:12:14,356: INFO: [train_multi_gpu.py: 65]: | epoch: 2 | loss: 8.310249328613281 | loss_g: 8.114068984985352 | loss_w: 0.1961808204650879 | lr: 9.990754267376514e-06 | disc_lr: 9.990754267376514e-06
2023-06-02 13:13:08,171: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 2 | loss_g: 12.444842338562012 | loss_disc: 1.999990463256836
2023-06-02 13:15:45,271: INFO: [train_multi_gpu.py: 65]: | epoch: 3 | loss: 13.03136157989502 | loss_g: 12.901001930236816 | loss_w: 0.13035933673381805 | lr: 9.979206008271393e-06 | disc_lr: 9.979206008271393e-06
2023-06-02 13:15:45,272: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9959542751312256
2023-06-02 13:18:24,182: INFO: [train_multi_gpu.py: 65]: | epoch: 4 | loss: 10.319283485412598 | loss_g: 10.254348754882812 | loss_w: 0.06493447721004486 | lr: 9.963055062204609e-06 | disc_lr: 9.963055062204609e-06
2023-06-02 13:18:24,182: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9799656867980957
2023-06-02 13:19:17,341: INFO: [train_multi_gpu.py: 83]: | TEST | epoch: 4 | loss_g: 8.972946166992188 | loss_disc: 1.9816977977752686
2023-06-02 13:21:53,971: INFO: [train_multi_gpu.py: 65]: | epoch: 5 | loss: 7.428076267242432 | loss_g: 7.343043804168701 | loss_w: 0.08503223955631256 | lr: 9.942318025365027e-06 | disc_lr: 9.942318025365027e-06
2023-06-02 13:21:53,973: INFO: [train_multi_gpu.py: 67]: | loss_disc: 1.9636757373809814
My best suggestion is to make sure your torch version is the same as mine and to use either the latest code or an older commit from before I added the WarmupScheduler. Maybe there are some changes between torch 1.x and torch 2.x. Good luck, and I hope this helps you @sjjbsj
If you have another question, please contact me; otherwise I will close this issue.
Thank you for your reply. I tried not using the newest WarmupScheduler, but I still get this error.
Maybe you can add more info about your environment; it would help me test the code.
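For reference, a quick way to gather the kind of environment info that helps here (standard PyTorch attributes, not project code; python -m torch.utils.collect_env prints a fuller report):

import torch

print(torch.__version__)                    # PyTorch version, e.g. 1.13.x or 2.x
print(torch.version.cuda)                   # CUDA version PyTorch was built against
print(torch.cuda.is_available())            # whether a GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))    # GPU model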
DDP initialization log:
[I logger.cpp:213] [Rank 1]: DDP Initialized with:
[I reducer.cpp:126] Reducer initialized with bucket_bytes_cap: 26214400 first_bucket_bytes_cap: 1048576
[I logger.cpp:213] [Rank 0]: DDP Initialized with:
When the epoch is greater than lr_scheduler.warmup_epoch, the above error is thrown. Please let me know if I am missing any information.
Emmm, I will construct the same environment as yours to test the code. In the meantime, maybe you can use the commit
Besides, I think you can debug step by step to understand the specific problem. It would help us solve it @sjjbsj
Hello @sjjbsj! I tested the code with torch 1.13.0 and I see the same bug as you. I guess you can change the code as follows:

model.train()
disc_model.train()
for input_wav in tqdm(trainloader):
    # warmup learning rate; warmup_epoch is defined in the config file, default is 5
    input_wav = input_wav.cuda()  # [B, 1, T]: e.g. [2, 1, 203760]
    optimizer.zero_grad()
    optimizer_disc.zero_grad()
    output, loss_w, _ = model(input_wav)  # output: [B, 1, T]: e.g. [2, 1, 203760] | loss_w: [1]
    logits_real, fmap_real = disc_model(input_wav)
    logits_fake, fmap_fake = disc_model(output)
    loss_g = total_loss(fmap_real, logits_fake, fmap_fake, input_wav, output)
    loss = loss_g + loss_w
    loss.backward()
    optimizer.step()
    scheduler.step()
    # train the discriminator only when epoch > warmup_epoch and train_discriminator is True
    if config.model.train_discriminator and epoch > config.lr_scheduler.warmup_epoch:
        # detach output and logits_real so the discriminator update does not
        # backpropagate into the generator's graph
        logits_fake, _ = disc_model(output.detach())
        loss_disc = disc_loss([logit_real.detach() for logit_real in logits_real], logits_fake)  # compute discriminator loss
        loss_disc.backward()
        optimizer_disc.step()
        disc_scheduler.step()
You can also find this problem discussed in NVlabs/FUNIT#23
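The issues linked above describe a common GAN-training pitfall: when the generator loss and the discriminator loss share one autograd graph, whichever optimizer steps first modifies weights in place that the other backward pass still needs. The hypothetical sketch below (not this repository's code) reproduces the same RuntimeError with two tiny linear layers standing in for the generator and the discriminator; detaching as in the fix above, or running every backward() before any step(), avoids it.

import torch

gen = torch.nn.Linear(4, 4)    # stand-in for the generator
disc = torch.nn.Linear(4, 1)   # stand-in for the discriminator
opt_disc = torch.optim.SGD(disc.parameters(), lr=0.1)

x = torch.randn(8, 4)
fake = gen(x)
score = disc(fake)             # one shared graph: x -> gen -> disc, no detach

loss_disc = score.mean()
loss_disc.backward(retain_graph=True)
opt_disc.step()                # updates disc.weight in place

loss_gen = (1 - score).mean()
loss_gen.backward()            # needs the old disc.weight to reach gen's parameters -> RuntimeError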
I will test the code's results; if it performs well, I will merge the issue6 branch into main.
Thanks, the problem has been successfully resolved!
Hi,
I am writing to seek your advice on an issue I am experiencing during backpropagation of my model. Specifically, I am encountering an error in the loss function after the warmup stage and am unsure how to proceed. It seems to happen once training enters
if config.model.train_discriminator and epoch > config.lr_scheduler.warmup_epoch:
I would greatly appreciate any guidance or suggestions you may have to help me address this problem.
log:
Error executing job with overrides: ['distributed.torch_distributed_debug=False', 'distributed.find_unused_parameters=True', 'distributed.world_size=2', 'common.max_epoch=15', 'datasets.tensor_cut=8000', 'datasets.batch_size=40', 'datasets.train_csv_path=/home/anna.peng/PycharmProjects/encodec-pytorch-main/librispeech_train100h_anna.csv', 'lr_scheduler.warmup_epoch=2', 'optimization.lr=1e-4', 'optimization.disc_lr=1e-4']
Traceback (most recent call last):
File "train_multi_gpu.py", line 258, in main
join=True
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
while not context.join():
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
fn(i, *args)
File "/home/anna.peng/PycharmProjects/encodec-pytorch-main/train_multi_gpu.py", line 209, in train
scheduler,disc_scheduler)
File "/home/anna.peng/PycharmProjects/encodec-pytorch-main/train_multi_gpu.py", line 59, in train_one_step
loss.backward()
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/_tensor.py", line 489, in backward
self, gradient, retain_graph, create_graph, inputs=inputs
File "/home/anna.peng/.local/lib/python3.7/site-packages/torch/autograd/init.py", line 199, in backward
allow_unreachable=True, accumulate_grad=True) # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 32, 3, 3]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!