Some beginner questions #12
Just some additional info: I'm training a 512x512 network with ~10,000 normalized fantasy portrait images. I've attached a representative sample of the dataset, as well as some of the generated images. The fine details don't seem to have improved much at all for a couple of days, despite the dataset being 512x512. I'm at 38,500 iterations with a batch size of 16.
Hey! Don't worry about questions being basic. It was actually a while ago that I fiddled with GANs, so it's good to repeat some of the basics so I don't forget them!

The first thing worth mentioning is that the reason the results of the original paper were so amazing was the quality of the data. All of the portraits were of humans (which drastically reduces the amount of structural variation compared to fantasy), and all of the faces were perfectly aligned, so that the pupils were in the exact same pixel positions across all images. I can see in your data sample that there are multiple art styles present, which the model never had to learn when generating faces (where the only style was "real photo"). I am honestly amazed by the results you already have, so I might have overestimated how good the data actually has to be.

Either way, I think there is too much variation in your dataset for the model to learn to generate high fidelity at those resolutions. For example, the original face-generating model learned how to create very realistic hair. That is possible because human hair is quite similar from image to image (the variation is more in length, curls, etc.). In your dataset, the hair in each portrait comes in rather more varied styles, and the same goes for the skin, clothes, etc. You can see how the generator has learned to produce only one "style" of portraits. If you could extract all the images from the dataset with the same kind of style, I think you could reach higher fidelity in your output. There is another example of similar training where someone trained the generator to produce anime portraits; that worked really well because the style was more limited.

Now I am speculating quite a bit, but I do think the images in your training data that are of different styles than the ones currently being generated may end up polluting the training. Anyway, I hope you manage to improve the results. Still pretty cool how far you got! Maybe it could be better at lower resolutions with the same number of model parameters, to account for the increased variety?
This is very hard to say, if not impossible. Since the training is completely adversarial, the loss may not be indicative of any kind of progress (which is a big problem for GANs, there are no really good metrics for overall progress). I think you can just ignore the loss values (unless they go up to infinity or something crazy).
For every backward pass we calculate the gradient for each parameter. The gradient norm (grad_norm) is the Euclidean norm of all those gradients. So if we had two parameters in total (we have millions in reality), we would have two gradients after a backward pass, gradient_A and gradient_B. We get the gradient norm by calculating sqrt(gradient_A^2 + gradient_B^2) (the Pythagorean theorem). A higher gradient norm indicates that the gradients calculated during training are large. This may not be indicative of the progress of the training, but I have found it to sometimes be really high when training is about to collapse.
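A minimal PyTorch sketch of that computation (the tiny model and dummy loss here are placeholders, not this repo's actual classes):

```python
import torch
from torch import nn

def grad_norm(model: nn.Module) -> float:
    # Euclidean norm over all parameter gradients:
    # sqrt(g_1^2 + g_2^2 + ... + g_n^2), i.e. the Pythagorean
    # theorem generalized to millions of dimensions.
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.pow(2).sum().item()
    return total ** 0.5

model = nn.Linear(4, 1)
loss = model(torch.randn(8, 4)).mean()
loss.backward()
print(grad_norm(model))  # the value logged as grad_norm
```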
Since training GANs is a very unstable task, it is often useful to regularize it. Regularization is usually some kind of "regulation to keep things in check". A classic regularization in neural network training is weight decay, which in practice keeps weights from becoming too large (negative or positive). Some regularizers produce a loss term which we try to minimize (which in turn regularizes the training).
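For example, weight decay can be enabled directly through the optimizer in PyTorch (a generic sketch, not this repo's training setup; the decay value is an illustrative assumption):

```python
import torch
from torch import nn

model = nn.Linear(512, 512)
# weight_decay adds an L2 penalty on the parameters, nudging them
# toward zero each step so they cannot grow unboundedly large.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
```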
This one is hard to answer. You probably have to test a lot of different values. A general rule of thumb is that the higher your batch size, the higher you can set your learning rate. The default learning rate for the Adam optimizer (which is used here) is 1e-3, so maybe try something around that.
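A hedged sketch of that rule of thumb, using a linear-scaling heuristic (the reference batch size here is an illustrative assumption, not a tuned value):

```python
import torch
from torch import nn

# Scale the learning rate with the batch size, starting from Adam's
# default of 1e-3 at an assumed reference batch size of 32.
base_lr, base_batch_size = 1e-3, 32
batch_size = 16
lr = base_lr * batch_size / base_batch_size  # -> 5e-4 here

model = nn.Linear(512, 512)
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
```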
So as I've been looking around, I'm seeing a lot of people talking about having gotten good results with transfer learning (oddly enough, even with completely dissimilar datasets). When I try resuming training on an existing model (NVIDIA's 512x512 FFHQ model, for instance), my generated images look as if it has completely dumped the old model, as if it's starting from the first iteration. Is there some way I can get around that?
I do not have any experience doing transfer learning for these types of models, but maybe a lower learning rate will make sure the pretrained model isn't discarded too quickly through parameter changes during training? Would love to hear if you make any progress on this!
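A minimal sketch of that idea, assuming a placeholder model, checkpoint path, and learning rate rather than the repo's actual API:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(512, 512))       # stand-in for the generator
state = torch.load("G.pth", map_location="cpu")  # hypothetical checkpoint path
model.load_state_dict(state, strict=False)       # start from pretrained weights
# Fine-tune with a learning rate well below Adam's 1e-3 default so the
# pretrained parameters drift slowly instead of being overwritten.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```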
I'm getting an out of memory error:

(pytorch) D:\AI\sg2\stylegan2_pytorch>python run_training.py ffhq.yaml --g_file=checkpoints\ffhq_512x512\0000\G.pth --d_file=checkpoints\ffhq_512x512\0000\D.pth --resume --gpu 1

My yaml file is as follows:
I tried adjusting the batch size down to 8 and then 4, and ran into the same error. My other network is also 512x512, so I'm not sure what the difference here would be.
This happens when loading a pretrained model? Have you updated to the latest code? Someone recently found a way to reduce memory usage, and that was added about a day ago.
You are using 6.37 out of 8 GB but only have 23.69 MB free when PyTorch is trying to allocate 32 MB. I'm guessing some memory is being used by another application as well? I don't remember how much memory I was using when training a 512x512 resolution model :/
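One way to see how much of that memory PyTorch itself is holding (anything used by other processes or the OS will only show up in a tool like nvidia-smi, not here):

```python
import torch

# Memory occupied by live tensors vs. memory the caching allocator has
# reserved from the driver; the gap between "reserved" and the card's
# total is what other processes (e.g. the desktop) are using.
print(torch.cuda.memory_allocated() / 2**20, "MiB allocated by tensors")
print(torch.cuda.memory_reserved() / 2**20, "MiB reserved by the allocator")
```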
Are you running this on Linux? Maybe it's a Windows thing. I left some space unpartitioned on my hard drive for Linux, so maybe it's time I installed it. I'll try today's build and see if that fixes it.
Also, I typed this up yesterday and didn't save it: on a lark, I tried running training on my CPU to see what the memory usage would look like, and it was 41 gigabytes (fortunately I have 64 gigs on here), so I wonder if there's something else going on there. Still trying to get Linux to work, but I'll report back after that.
Yeah, I gave up on Windows for this kind of stuff. Running Linux just makes things a lot easier! You should probably look into learning the basics of the Linux terminal first, but that's super easy. If you want to use 0% of your GPU memory for the operating system, you will have to run it without any graphical user interface. Otherwise there's always going to be a bit of memory used by the OS (unless you can somehow run the OS graphics through an integrated GPU and then only use your dedicated GPU for PyTorch). 41 GB sounds like there might be a bug or some very weird settings. Get back to me when you have played around with it :)
I had to manually install the NVIDIA driver file and things finally worked. Fortunately I'm a Linux admin at work, so I'm already comfortable with the command line; it's just hard with a completely black screen. :) Anyway, I'm getting the same out-of-memory problem on Linux:

$ python run_training.py ffhq.yaml --g_file=ffhq512/Gs.pth --d_file=ffhq512/D.pth --gpu 1
Also, would it be possible while training to load the generator and discriminator into CPU memory and then load and unload them on the GPU depending on whether they're being used? I feel like that could cut GPU memory use way down (at a cost of some performance), but it would still be vastly better than training on the CPU.
I think it could be possible to do that, but the vast majority of memory usage actually comes from the training itself, not the model weights! I'm guessing a model is around 100 MB. Is the batch size 4 by default? 4 should be the lowest batch size you can run on these models; can you try that? If that doesn't work I can try to find the same model, run it, and see how much memory is used on my machine (I have 11 GB of GPU memory).
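For reference, a rough sketch of the proposed swapping (stand-in modules, not the repo's actual generator and discriminator):

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(512, 512))  # stand-in for the generator
D = nn.Sequential(nn.Linear(512, 1))    # stand-in for the discriminator

def swap_in(active: nn.Module, idle: nn.Module, device: str = "cuda") -> None:
    # Keep the idle network's weights in CPU RAM and move only the one
    # being updated onto the GPU. This saves the idle model's weight
    # memory, but the activations and gradients of the active network
    # still dominate GPU usage, so the savings are modest.
    idle.to("cpu")
    active.to(device)
    torch.cuda.empty_cache()  # hand freed blocks back to the driver
```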
I tested whether a batch size of 4 would be enough to slip in under the RAM limit, and it wasn't; it still failed. The model was NVIDIA's 512x512 face-photo model. I think I'm just going to have to give up and get a new graphics card. Maybe I'll suck it up and get one of those 3090s they just announced, with 24 gigs of RAM. Then I won't have to worry about this for a while. Until, of course, they start making even bigger models in like 6 months. :)
If you get a 3090 I will be mighty jealous! If I had more time to play around with GANs and fun things like that, I would probably buy one.
I'm trying to get an idea of the training times. To get the results you show, how much training time did you dedicate, and with what resources? Thank you very much.
Hey, if you use transfer learning you should be able to train the model a lot faster. The authors of the paper state that they trained at 61 images per second using 8 NVIDIA V100 GPUs, and the discriminator saw 25M images in total during training.
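For a rough sense of scale, 25M images at 61 images per second works out to 25,000,000 / 61 ≈ 410,000 seconds, i.e. roughly 4.7 days of continuous training on those 8 V100s.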
Yes, exactly. I just want to get a rough idea of the approximate time it takes to obtain those results. Unfortunately, I can't have 8 V100s, haha. Additionally, I want to thank you for all the effort put into this project, because it is great to see that StyleGAN2 is available in PyTorch.
Oh yeah, I wish I had access to 8 V100s! I could not train a model at resolution 1024x1024 as I did not have enough GPU memory. I have actually never tried using transfer learning with GANs, so I don't dare say how long that would take. You can start the training, and there is an option to log the generator output to TensorBoard at a certain iteration interval, so you can look at the progress while the training is running.

I am glad you found the code useful! It took a while to figure out exactly how some of the different parts worked before I could get them running correctly in PyTorch. I am not sure this code is as performant as the TensorFlow version, since they use custom CUDA code for some operations and this is written in pure PyTorch.
I think WGAN can provide a rather good metric for evaluating training progress. It cannot tell you exactly where you are, because training does not converge to the theoretical Nash equilibrium, but a declining Wasserstein distance can safely tell you that the model is getting better.
@hyx07 Glad you enjoy the code! If you test out running training with the WGAN loss instead of the default loss, post an update on whether the loss was any indication of progress; it would be fun to hear about!
@hyx07 Eagerly waiting for your reply.
Sorry, I didn't have time to try WGAN on StyleGAN2, but I have tried it on other generation tasks like pix2pix. In general, if I have a generator as competitive as the critic (the discriminator is called the critic in WGAN), I find the output of the critic, i.e. the estimated Wasserstein distance, rises in the first few epochs and then slowly declines until the end of training.
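For context, a minimal sketch of the WGAN critic objective whose output estimates that Wasserstein distance (this is the generic WGAN formulation, not this repo's default loss):

```python
import torch
from torch import nn

def critic_loss(critic: nn.Module, real: torch.Tensor, fake: torch.Tensor):
    # The critic maximizes E[C(real)] - E[C(fake)]; that difference is the
    # estimated Wasserstein distance discussed above, so we minimize its
    # negation and return w_estimate to log as a progress signal.
    w_estimate = critic(real).mean() - critic(fake).mean()
    return -w_estimate, w_estimate.detach()
```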
Hi, I'm trying to train a network and I can't find much documentation about StyleGAN2's metrics on TensorBoard, so I figured I'd ask if you have any insight into this stuff.
Based on this screenshot here:
https://i.imgur.com/gXFEyys.png
Apologies for how basic these questions are, but:
Thanks!