
Train problem #14

Open
zhangbaijin opened this issue Nov 1, 2022 · 5 comments

@zhangbaijin

Thanks for your contribution. There is a problem when I train on a 512x512 dataset in the second stage with Transformer_XX.yaml: RuntimeError: The size of tensor a (1024) must match the size of tensor b (512) at non-singleton dimension 1. How should I change the yaml if I want to train on my 512x512 dataset?

liuqk3 (Owner) commented Nov 2, 2022

Thanks for your interest in our work.

If you use our provided P-VQVAE to train the second-stage transformer on 512x512 images, one thing to keep in mind is that our provided P-VQVAE divides images into 16x16 patches. The number of tokens (i.e., the length of the sequence) for one image is 64x64. Change 1024 in the following line to 4096,

and change [32, 32] in the following line to [64, 64].

These modifications should work.
Best wishes!
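
For reference, here is a sketch of what those two edits could look like in the second-stage yaml. The mask_low_size key is the one quoted later in this thread; content_seq_len is an assumed name for the line holding the 1024 value, so match it against your own Transformer_XX.yaml:

# hypothetical excerpt of the second-stage config, adapted for 512x512 inputs
content_seq_len: 4096      # assumed key name; was 1024 (32x32 tokens), now 64x64 = 4096 tokens
mask_low_size: [64, 64]    # was [32, 32]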

@zhangbaijin (Author)

Thanks for your help. Can you give an example of how to train the transformer? The P-VQVAE training output is below:
pvqvae_Bihua: val generator: Epoch 1/150 | rec_loss: 0.3322 | loss: 0.3337 | quantize_loss: 0.0015 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 13061.0000 | masked_num_token: 3323.0000 discriminator: Epoch 1/150 | loss: 0.0000
pvqvae_Bihua: train: Epoch 2/150 iter 6/92 || generator | rec_loss: 0.2488 | loss: 0.2501 | quantize_loss: 0.0013 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 14845.0000 | masked_num_token: 1539.0000 || discriminator | loss: 0.0000 || generator_lr: 7.64e-06, discriminator_lr: 7.64e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 02s | spend_time: 01m:10s | left_time: 01h:23m:57s
pvqvae_Bihua: train: Epoch 2/150 iter 16/92 || generator | rec_loss: 0.2533 | loss: 0.2547 | quantize_loss: 0.0014 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 13854.0000 | masked_num_token: 2530.0000 || discriminator | loss: 0.0000 || generator_lr: 8.04e-06, discriminator_lr: 8.04e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 06s | spend_time: 01m:14s | left_time: 01h:23m:47s
pvqvae_Bihua: train: Epoch 2/150 iter 26/92 || generator | rec_loss: 0.3294 | loss: 0.3312 | quantize_loss: 0.0018 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 11445.0000 | masked_num_token: 4939.0000 || discriminator | loss: 0.0000 || generator_lr: 8.44e-06, discriminator_lr: 8.44e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 10s | spend_time: 01m:17s | left_time: 01h:23m:34s
pvqvae_Bihua: train: Epoch 2/150 iter 36/92 || generator | rec_loss: 0.2410 | loss: 0.2427 | quantize_loss: 0.0017 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 14810.0000 | masked_num_token: 1574.0000 || discriminator | loss: 0.0000 || generator_lr: 8.84e-06, discriminator_lr: 8.84e-07 || data_time: 0.1s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 13s | spend_time: 01m:21s | left_time: 01h:23m:19s
pvqvae_Bihua: train: Epoch 2/150 iter 46/92 || generator | rec_loss: 0.2721 | loss: 0.2745 | quantize_loss: 0.0024 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 11951.0000 | masked_num_token: 4433.0000 || discriminator | loss: 0.0000 || generator_lr: 9.24e-06, discriminator_lr: 9.24e-07 || data_time: 0.2s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 17s | spend_time: 01m:24s | left_time: 01h:23m:09s
pvqvae_Bihua: train: Epoch 2/150 iter 56/92 || generator | rec_loss: 0.2303 | loss: 0.2324 | quantize_loss: 0.0021 | used_unmasked_quantize_embed: 1.0000 | used_masked_quantize_embed: 1.0000 | unmasked_num_token: 15426.0000 | masked_num_token: 958.0000 || discriminator | loss: 0.0000 || generator_lr: 9.64e-06, discriminator_lr: 9.64e-07 || data_time: 0.1s | fbward_time: 0.2s | iter_time: 0.4s | iter_avg_time: 0.4s | epoch_time: 20s | spend_time: 01m:28s | left_time: 01h:22m:57s

My training dataset has 1048 images; is this too fast? And what is the command to train the transformer? Is this right:
python train_net.py --name pvqvae_Bihua --config_file ./configs/put_cvpr2022/bihua/transformer_bihua.yaml --num_node 1 --tensorboard --auto_resume

liuqk3 (Owner) commented Nov 2, 2022

Hi @zhangbaijin,
As you said, your dataset is small. The training time depends on the number of images in the training set and the number of epochs you set. After training, you can check the reconstruction results of an image.

For the training of the transformer, you should provide a suitable config file based on the pre-trained P-VQVAE. If you have indeed written the transformer config file correctly, the command you gave should work. But if it were me training the transformer, I would set --name transformer_Bihua to distinguish it from the P-VQVAE pre-trained on the Bihua dataset.
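
For example, the command posted above would then become (only the experiment name changes; keep your own config path and flags):

python train_net.py --name transformer_Bihua --config_file ./configs/put_cvpr2022/bihua/transformer_bihua.yaml --num_node 1 --tensorboard --auto_resume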

For more details, please refer to the README.md file.

@zhangbaijin (Author)

Thanks a lot. There is a problem when running inference.py:

if len(list(set(data_i['relative_path']) & processed_image)) == len(data_i['relative_path']): KeyError: 'relative_path'

I added "relative_path" in transformer_xxx.yaml like this:

use_provided_mask_ratio: [0.4, 1.0]
use_provided_mask: 0.8
mask: 1.0
mask_low_to_high: 0.0
mask_low_size: [32, 32]
multi_image_mask: False
return_data_keys: [image, mask, relative_path]

But it doesn't work. Can you tell me how to set it? Thanks for your time. Best wishes.

liuqk3 (Owner) commented Nov 6, 2022

Did you add return_data_keys: [image, mask, relative_path] in the validation_datasets? In my experience, if you did, it should work. But you may need to debug this a bit yourself.
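
For illustration, a minimal sketch of where the key goes, assuming the validation dataset entry in transformer_xxx.yaml mirrors the training one (everything except return_data_keys stays as you already have it):

validation_datasets:
  ...                                              # your existing validation dataset settings
  return_data_keys: [image, mask, relative_path]   # lets inference.py read data_i['relative_path']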
