Train time? #4
Comments
I am investigating this issue. There are mainly three reasons in my opinion:
Training facades took 10 hours on a GTX 1080, roughly 180 seconds per epoch.
My test with a GRID K520 GPU (Amazon g2.2xlarge) using my own dataset shows that pix2pix/Torch runs about 30 times faster than the pix2pix/TensorFlow version. Monitoring with 'watch nvidia-smi' shows that the TensorFlow version is not using the GPU at all.
@kaihuchen sorry for the obvious question, but did you install "tensorflow-gpu"?
@kaihuchen I'm sure that I'm training this code with a GPU. Can you tell me how you installed TensorFlow? @eyaler I've updated the codebase a lot recently (which gains speed comparable to the Torch version; I will upload it later).
@yenchenlin My bad! I have many servers, and it would seem that I did the test on a server with the CPU version of TensorFlow, not the GPU one.
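For anyone hitting the same problem, here is a minimal sketch (assuming a TensorFlow 1.x install; the device listing and `log_device_placement` flag are standard TF utilities) for confirming that TensorFlow actually sees the GPU:

```python
# Minimal check that TensorFlow can see a GPU; the CPU-only "tensorflow"
# package will list no /device:GPU:0 entry, unlike "tensorflow-gpu".
import tensorflow as tf
from tensorflow.python.client import device_lib

print(device_lib.list_local_devices())  # look for a /device:GPU:0 entry

# log_device_placement prints which device each op is actually placed on.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    print(sess.run(tf.constant(1.0)))
```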
@yenchenlin Any update on this? I do not see any recent commits pertaining to speed. Otherwise, I may be forced to use the code provided by @ppwwyyxx. I have tested the Tensorpack implementation and it is 4-5x faster and uses approximately 1/3 the memory of this implementation.
The code looks clean and straightforward... I really can't get my head around why it's slow. It's pretty much a standard GAN, so why is it so slow?! The answer to this question has become one of the reasons I check this thread every now and then...
I have one idea. Feed_dicts are incredibly slow. We should do what Tensorpack does: load, say, 50 images at a time, keep them in a queue of numpy arrays, and then feed them in with a queue runner. This alone might be responsible for the speed difference, since feed_dict doubles the number of copies needed and causes a lot of expensive switching between Python and the TensorFlow C++ backend. Reference to the issue from TensorFlow: tensorflow/tensorflow#2919
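A minimal sketch of the queue-runner approach described above, in TF 1.x style; the dataset path and the 256x512 paired-image shape are assumptions for illustration, not the repo's actual input pipeline:

```python
import tensorflow as tf

# Assumed path and image shape; the real repo may differ.
filenames = tf.train.match_filenames_once("./datasets/facades/train/*.jpg")
filename_queue = tf.train.string_input_producer(filenames, shuffle=True)

reader = tf.WholeFileReader()
_, contents = reader.read(filename_queue)
image = tf.image.decode_jpeg(contents, channels=3)
image = tf.image.convert_image_dtype(image, tf.float32)
image.set_shape([256, 512, 3])  # paired A|B image as in the pix2pix datasets

# Background threads (queue runners) keep this batch queue full, so the GPU
# is fed without round-tripping numpy arrays through feed_dict.
batch = tf.train.batch([image], batch_size=1, capacity=50, num_threads=2)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(),
              tf.local_variables_initializer()])
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    # A training loop would consume `batch` directly instead of placeholders.
    print(sess.run(batch).shape)
    coord.request_stop()
    coord.join(threads)
```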
@Skylion007
That's another issue; there was a pull request to address this, but it was rejected because it made the edges slightly more blurry. I'm open to trying that and seeing if it improves the speed. Do you want to try experimenting with that pull request and see if it yields any results? My GPU is currently in use by another experiment.
That's exactly my case too! I've been running one for 3 days, and last night it started showing acceptable improvements. I really don't want to stop it for at least 3 more days...
The graphs for each network look very different as well. @ppwwyyxx's implementation's graph looks like this, for instance, while the network in this repo seems to have so many dependencies that the graph looks more like a straight line than a tree. A very different appearance from the one below. I'm not entirely sure how much of that is due to good TensorBoard formatting and how much is a fundamental difference in architecture between the networks.
@Skylion007 TensorBoard tends to organize ops under the same name scope together, so what you see in the above figure isn't the real architecture but rather summaries and utilities. You can open the "gen" and "discrim" blocks in the above figure, and they will contain the model architecture for the generator and discriminator.
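As a rough illustration (not the repo's actual code) of how name scopes produce those collapsible "gen" and "discrim" blocks in TensorBoard, assuming TF 1.x layers:

```python
import tensorflow as tf

# Everything built under a scope collapses into one block in the graph view.
with tf.variable_scope("gen"):
    x = tf.placeholder(tf.float32, [None, 256, 256, 3], name="input")
    h = tf.layers.conv2d(x, 64, 4, strides=2, padding="same", name="conv1")

with tf.variable_scope("discrim"):
    d = tf.layers.conv2d(h, 64, 4, strides=2, padding="same", name="conv1")

# Writing the graph lets TensorBoard render the two scoped blocks.
tf.summary.FileWriter("./logs", tf.get_default_graph())
```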
Yeah, I see that now. I am just so confused about why the other code is so much faster. I just discovered TensorBoard, so I was trying to see what I could gain from it. I will say that the GPU memory use is much higher in this implementation, and I am really curious why that would be the case; maybe that could explain why it's slower. Any ideas @ppwwyyxx? Any special tricks your code is doing?
@Skylion007 it's probably the fully connected layer... Those things take a lot of memory...
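A back-of-the-envelope illustration (hypothetical shapes, not taken from either implementation) of why a fully connected layer can dominate memory compared to a convolution:

```python
# Weight counts for a hypothetical FC layer vs. a conv layer on the same
# 16x16x512 feature map; float32 weights cost 4 bytes each.
fc_in = 16 * 16 * 512              # flattened feature map (assumed shape)
fc_out = 1024
fc_params = fc_in * fc_out         # ~134M weights, ~512 MB in float32

conv_params = 4 * 4 * 512 * 512    # a 4x4 conv, 512 -> 512 channels: ~4.2M weights

print(fc_params, conv_params)
```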
With (2) and (3) I could bring the epoch time down from 180 s to 110 s (facades on a GTX 1080).
Thanks @eyaler, it's 2 and 3 IMO. I'm really sorry that I'm dealing with some other annoying stuff recently 😭
@eyaler This was an eye-opener... I had so many misconceptions about performance in this project! Thanks for your time... please keep going!
Can anyone tell me why we do this concat? Doesn't it force the discriminator to accept dual images (one half the real image and the other half the generated one) and double the time needed to process them?
Hello @Neltherion, please see this image from the paper:
Hmm... you're right... and just giving the fake images to the discriminator is probably not enough... my bad! Thanks for the quick reply.
Normally, a conditional GAN will send the conditional data (e.g., class, attribute, text, image) together with the synthesized image to the discriminator. See this paper for a more complicated discriminator.
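To make the point concrete, here is a minimal sketch (with stand-in generator and discriminator functions and assumed 256x256 shapes, not the repo's actual code) of a conditional discriminator input: the condition image A is concatenated with the real or generated B along the channel axis, so the discriminator judges the (A, B) pair rather than a double-width image:

```python
import tensorflow as tf

def generator(a):
    # Hypothetical stand-in: one conv layer instead of the full U-Net generator.
    with tf.variable_scope("gen"):
        return tf.layers.conv2d(a, 3, 4, strides=1, padding="same", name="conv")

def discriminator(pair, reuse):
    # Hypothetical stand-in: one conv layer instead of the full PatchGAN.
    with tf.variable_scope("discrim", reuse=reuse):
        return tf.layers.conv2d(pair, 1, 4, strides=1, padding="same", name="conv")

A = tf.placeholder(tf.float32, [None, 256, 256, 3])       # condition image
B_real = tf.placeholder(tf.float32, [None, 256, 256, 3])  # ground-truth target
B_fake = generator(A)

# Concatenating along channels gives a 6-channel (A, B) pair, so the
# discriminator scores "is B a plausible translation of A", not just "is B real".
real_pair = tf.concat([A, B_real], axis=3)
fake_pair = tf.concat([A, B_fake], axis=3)

logits_real = discriminator(real_pair, reuse=False)
logits_fake = discriminator(fake_pair, reuse=True)
```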
Some benchmarks for the community (image iterations/sec): it seems that tensorpack is the fastest, and that the 1080 is twice as fast as the K80. All experiments are on the facades dataset and use cuDNN 5.1. phillipi = https://github.com/phillipi/pix2pix
Curious what sort of train times you're seeing with this implementation.
I'm using a GRID K520 GPU (Amazon g2.2xlarge) -- I'm seeing each epoch take around 1200 seconds, which seems wrong.
From the original paper:
"Data requirements and speed We note that decent results
can often be obtained even on small datasets. Our facade
training set consists of just 400 images (see results in
Figure 12), and the day to night training set consists of only
91 unique webcams (see results in Figure 13). On datasets
of this size, training can be very fast: for example, the results
shown in Figure 12 took less than two hours of training
on a single Pascal Titan X GPU."
Granted, I'm not using a Pascal GPU (which has 2496 CUDA cores), but the g2.2xlarge has around 1500 CUDA cores. At the current rate, 200 epochs would take about 3 days, as opposed to the 2 hours quoted in the original paper.
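As a quick sanity check of that estimate (using the ~1200 s/epoch figure above):

```python
# 200 epochs at ~1200 seconds per epoch.
seconds = 1200 * 200
print(seconds / 3600)   # ~66.7 hours
print(seconds / 86400)  # ~2.8 days, i.e. roughly 3 days
```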
Are you seeing similar train times when running this code? I'm wondering why there is such a discrepancy compared to the original paper/Torch implementation.