
Train problem #14

Open
zhangbaijin opened this issue Nov 1, 2022 · 5 comments

@zhangbaijin

Thanks for your contribution. There is a problem when I train on a 512x512 dataset in the second stage with Transformer_XX.yaml: RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 1. How should I change the yaml if I want to train on my 512x512 dataset?

liuqk3 (Owner) commented Nov 2, 2022

Thanks for your interest in our work.

If you use our provided P-VQVAE to train the second-stage transformer on 512x512 images, one thing to keep in mind is that our provided P-VQVAE divides images into 16x16 patches. The number of tokens (i.e., the length of the sequence) for one image is 64x64. Change 1024 in the following line to 4096,

and change [32, 32] in the following line to [64, 64].

These modifications should work.
Best wishes!
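
For reference, here is a sketch of what those two edits could look like in the second-stage yaml. The mask_low_size key is the one quoted later in this thread; content_seq_len is an assumed name for the line holding the 1024 value, so match it against your own Transformer_XX.yaml:

# hypothetical excerpt of the second-stage config, adapted for 512x512 inputs
content_seq_len: 4096      # assumed key name; was 1024 (32x32 tokens), now 64x64 = 4096 tokens
mask_low_size: [64, 64]    # was [32, 32]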

@zhangbaijin (Author)

Thanks for your help. Can you give an example of how to train the transformer? The P-VQVAE training output is below:
pvqvae_Bihua: val generator: Epoch 1/150 | rec_loss: 0.3322 | loss: 0.3337 | quantize_loss: 0.0015 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 13061.0000 | masked_num_token: 3323.0000 discriminator: Epoch 1/150 | loss: 0.0000
pvqvae_Bihua: train: Epoch 2/150 iter 6/92 || generator | rec_loss: 0.2488 | loss: 0.2501 | quantize_loss: 0.0013 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 14845.0000 | masked_num_token: 1539.0000 || discriminator | loss: 0.0000 || generator_lr: 7.64e-06, discriminator_lr: 7.64e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 02s | spend_time: 01m:10s | left_time: 01h:23m:57s
pvqvae_Bihua: train: Epoch 2/150 iter 16/92 || generator | rec_loss: 0.2533 | loss: 0.2547 | quantize_loss: 0.0014 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 13854.0000 | masked_num_token: 2530.0000 || discriminator | loss: 0.0000 || generator_lr: 8.04e-06, discriminator_lr: 8.04e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 06s | spend_time: 01m:14s | left_time: 01h:23m:47s
pvqvae_Bihua: train: Epoch 2/150 iter 26/92 || generator | rec_loss: 0.3294 | loss: 0.3312 | quantize_loss: 0.0018 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 11445.0000 | masked_num_token: 4939.0000 || discriminator | loss: 0.0000 || generator_lr: 8.44e-06, discriminator_lr: 8.44e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 10s | spend_time: 01m:17s | left_time: 01h:23m:34s
pvqvae_Bihua: train: Epoch 2/150 iter 36/92 || generator | rec_loss: 0.2410 | loss: 0.2427 | quantize_loss: 0.0017 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 14810.0000 | masked_num_token: 1574.0000 || discriminator | loss: 0.0000 || generator_lr: 8.84e-06, discriminator_lr: 8.84e-07 || data_time: 0.1s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 13s | spend_time: 01m:21s | left_time: 01h:23m:19s
pvqvae_Bihua: train: Epoch 2/150 iter 46/92 || generator | rec_loss: 0.2721 | loss: 0.2745 | quantize_loss: 0.0024 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 11951.0000 | masked_num_token: 4433.0000 || discriminator | loss: 0.0000 || generator_lr: 9.24e-06, discriminator_lr: 9.24e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 17s | spend_time: 01m:24s | left_time: 01h:23m:09s
pvqvae_Bihua: train: Epoch 2/150 iter 56/92 || generator | rec_loss: 0.2303 | loss: 0.2324 | quantize_loss: 0.0021 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 15426.0000 | masked_num_token: 958.0000 || discriminator | loss: 0.0000 || generator_lr: 9.64e-06, discriminator_lr: 9.64e-07 || data_time: 0.1s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 20s | spend_time: 01m:28s | left_time: 01h:22m:57s

My training dataset has 1048 images; is this too fast? And what is the command to train the transformer? Is this right:
python train_net.py --name pvqvae_Bihua --config_file ./configs/put_cvpr2022/bihua/transformer_bihua.yaml --num_node 1 --tensorboard --auto_resume

liuqk3 (Owner) commented Nov 2, 2022

Hi @zhangbaijin,
As you said, your dataset is small. The training time depends on the number of images in the training set and the number of epochs you set. After training, you can check the reconstruction results of an image.

For the training of the transformer, you should provide a suitable config file based on the pre-trained P-VQVAE. If you have indeed written the transformer config file correctly, the command you gave should work. But if it were me training the transformer, I would set --name transformer_Bihua to distinguish it from the P-VQVAE pre-trained on the Bihua dataset.
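
For example, the command posted above would then become (only the experiment name changes; keep your own config path and flags):

python train_net.py --name transformer_Bihua --config_file ./configs/put_cvpr2022/bihua/transformer_bihua.yaml --num_node 1 --tensorboard --auto_resume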

For more details, please refer to the README.md file.

@zhangbaijin (Author)

Thanks a lot. There is a problem when running inference.py:

if len(list(set(data_i['relative_path']) & processed_image)) == len(data_i['relative_path']): KeyError: 'relative_path'

I added "relative_path" in transformer_xxx.yaml like this:

use_provided_mask_ratio: [0.4, 1.0]
use_provided_mask: 0.8
mask: 1.0
mask_low_to_high: 0.0
mask_low_size: [32, 32]
multi_image_mask: False
return_data_keys: [image, mask, relative_path]

But it doesn't work. Can you tell me how to set it? Thanks for your time. Best wishes.

liuqk3 (Owner) commented Nov 6, 2022

Did you add return_data_keys: [image, mask, relative_path] in the validation_datasets? In my experience, if you did, it should work. But you may need to debug this a bit yourself.
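
For illustration, a minimal sketch of where the key goes, assuming the validation dataset entry in transformer_xxx.yaml mirrors the training one (everything except return_data_keys stays as you already have it):

validation_datasets:
  ...                                              # your existing validation dataset settings
  return_data_keys: [image, mask, relative_path]   # lets inference.py read data_i['relative_path']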
