
Train time? #4

Open · eturner303 opened this issue Dec 7, 2016 · 26 comments

@eturner303

Curious what sort of train times you're seeing with this implementation.

I'm using a GRID K520 GPU (Amazon g2.2xlarge), and I'm seeing each epoch take around 1200 seconds, which seems wrong.

From the original paper:

"Data requirements and speed We note that decent results
can often be obtained even on small datasets. Our facade
training set consists of just 400 images (see results in
Figure 12), and the day to night training set consists of only
91 unique webcams (see results in Figure 13). On datasets
of this size, training can be very fast: for example, the results
shown in Figure 12 took less than two hours of training
on a single Pascal Titan X GPU."

Granted, I'm not using a Pascal GPU, which has 2496 CUDA cores, while the g2.2xlarge has around 1500 CUDA cores. But at the current rate, 200 epochs would take 3 days, as opposed to the 2 hours quoted in the original paper.

Are you seeing similar train times when running this code? I'm wondering why there is such a discrepancy compared to the original paper/Torch implementation.

@yenchenlin
Owner

yenchenlin commented Dec 7, 2016

I am investigating this issue.
It took me around 10 hours to run 200 epochs on a Pascal GPU.

There are mainly three reasons, in my opinion:

  1. In this implementation (inherited from DCGAN-tensorflow), the generator is updated twice in each iteration (see the sketch below), which slows down training a lot.
  2. Since the project is inherited from DCGAN-tensorflow, it uses a fully connected layer in the discriminator.
  3. The data preprocessing step is currently performed on the fly during training, which could be improved.
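For context, a minimal sketch of the update pattern in point 1 (a simplified illustration in the DCGAN-tensorflow style; `d_optim`, `g_optim`, `inputs`, and `batch` are illustrative names, not the exact code in this repo):

    def train_step(sess, d_optim, g_optim, inputs, batch):
        """One iteration in the DCGAN-tensorflow style this repo inherits.
        All names here are illustrative, not the exact code in this repo."""
        # one discriminator update ...
        sess.run(d_optim, feed_dict={inputs: batch})
        # ... but TWO generator updates per iteration (a DCGAN-tensorflow
        # heuristic meant to keep d_loss from going to zero), which roughly
        # doubles the generator-side compute per iteration
        sess.run(g_optim, feed_dict={inputs: batch})
        sess.run(g_optim, feed_dict={inputs: batch})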

@eyaler
Contributor

eyaler commented Dec 15, 2016

Training facades took 10 hours on a GTX 1080, ~180 sec per epoch.

@kaihuchen

My test with a GRID K520 GPU (Amazon g2.2xlarge) on my own dataset shows that pix2pix/Torch runs about 30 times faster than the pix2pix/TensorFlow version. Monitoring with 'watch nvidia-smi' shows that the TensorFlow version is not using the GPU at all.

@eyaler
Contributor

eyaler commented Dec 30, 2016

@kaihuchen sorry for the obvious question, but did you install "tensorflow-gpu"?

@yenchenlin
Owner

@kaihuchen I'm sure that I'm training this code with a GPU. Can you tell me how you installed TensorFlow?
Side note: it looks like you're a senior alumnus of National Tsing Hua University in Taiwan 😄

@eyaler I've updated the codebase a lot recently (it now reaches speed comparable to the Torch version; I will upload it later).

@kaihuchen

@yenchenlin My bad! I have many servers, and it seems I ran the test on a server with the CPU version of TensorFlow, not the GPU one.

@ppwwyyxx

@eyaler I also have a TensorFlow implementation here. It takes 43 seconds per epoch (400 iterations at batch=1 on the facades dataset) on a GTX 1080, while the Torch version takes 42 seconds.

@yenchenlin
Owner

Thanks @ppwwyyxx for the info!

@eyaler I think the code mentioned above currently works better!
However, I'll still update the code here within the next 3 days.

@Skylion007

Skylion007 commented Jan 13, 2017

@yenchenlin Any update on this? I don't see any recent commits pertaining to speed. Otherwise, I'm tempted to just use the code provided by @ppwwyyxx. I have tested the Tensorpack implementation, and it is 4-5X faster and uses approximately 1/3 the memory of this implementation.

@Neltherion

The code looks clean and straightforward... I really can't get my head around why it's slow... It's pretty much a standard GAN, so why is it so slow?! The answer to this question has become one of the reasons I check this thread every now and then...

@Skylion007

Skylion007 commented Jan 13, 2017

I have one idea.

feed_dicts are incredibly slow. We should do what Tensorpack does: load, say, 50 images at a time, keep them in a queue of numpy arrays, and feed them in with a queue runner (see the sketch below). feed_dict alone might be responsible for the speed difference, since it doubles the number of copies needed and causes a lot of expensive switching between Python and TensorFlow's C++ code.

Reference to issue from Tensorflow: tensorflow/tensorflow#2919
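A minimal sketch of that approach with the TF 1.x queue-runner API (the file pattern, image sizes, and queue capacities are illustrative assumptions, not this repo's actual loader):

    import tensorflow as tf

    # Filename queue + reader: decoding and cropping run in background threads.
    filenames = tf.train.string_input_producer(
        tf.train.match_filenames_once('facades/train/*.jpg'), shuffle=True)
    _, contents = tf.WholeFileReader().read(filenames)
    image = tf.image.decode_jpeg(contents, channels=3)
    image = tf.image.resize_images(image, [286, 286])
    image = tf.random_crop(image, [256, 256, 3])

    # Batches are assembled by queue-runner threads, so sess.run() below
    # never blocks on Python-side preprocessing or on feed_dict copies.
    batch = tf.train.shuffle_batch([image], batch_size=1,
                                   capacity=50, min_after_dequeue=10)

    with tf.Session() as sess:
        sess.run(tf.local_variables_initializer())  # for match_filenames_once
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        images = sess.run(batch)  # dequeues a ready-made batch, no feed_dict
        coord.request_stop()
        coord.join(threads)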

@Neltherion

Neltherion commented Jan 13, 2017

@Skylion007
Hmmm... how about the fact that this network uses a fully connected layer in the discriminator? Last I checked, Tensorpack uses a 1x1 convolution in the last layer instead of a fully connected layer... couldn't it be because of this?

@Skylion007

Skylion007 commented Jan 13, 2017

That's another issue; there was a pull request to address it, but it was rejected because it made the edges slightly more blurry. I'm open to trying that to see if it improves the speed. Do you want to experiment with that pull request and see if it yields any results? My GPU is currently in use by another experiment.

@Neltherion

Neltherion commented Jan 13, 2017

> My GPU is currently in use by another experiment.

That's exactly my case too! I've been running one for 3 days, and last night it started showing acceptable improvements; I really don't want to stop it for at least 3 more days...

@Skylion007

The graphs for each network look very different as well. @ppwwyyxx's implementation's graph looks like the one below, for instance, while the network in this repo has so many dependencies that its graph looks more like a straight line than a tree. A very different appearance from the one below:

[TensorBoard graph screenshot]

Not entirely sure how much of that is due to good TensorBoard formatting and how much is a fundamental difference in architecture between the networks.

@ppwwyyxx

ppwwyyxx commented Jan 15, 2017

@Skylion007 TensorBoard tends to organize ops under the same name scope together, so what you see in the above figure isn't the real architecture but mostly summaries and utilities. If you open the "gen" and "discrim" blocks in the figure, they contain the model architecture of the generator and discriminator (a toy sketch of this scoping below).
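For illustration, here is how scopes produce those collapsible blocks in TensorBoard's graph view (a toy sketch, not the repo's code; the scope names and layer shapes are made up):

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 256, 256, 3], name='input')

    # Everything created inside a scope is drawn as one collapsible node
    # in TensorBoard's graph view, e.g. the "gen" and "discrim" blocks.
    with tf.variable_scope('gen'):
        g = tf.layers.conv2d(x, 64, 4, strides=2, padding='same')
    with tf.variable_scope('discrim'):
        d = tf.layers.conv2d(x, 64, 4, strides=2, padding='same')

    # Write the graph so TensorBoard can display it.
    tf.summary.FileWriter('/tmp/tb_logs', tf.get_default_graph())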

@Skylion007

Yeah, I see that now. I am just so confused about why the other code is so much faster. I just discovered TensorBoard, so I was trying to see what I could learn from it. I will say that GPU memory use is much higher in this implementation, and I am really curious why; that could perhaps explain why it's slower. Any ideas, @ppwwyyxx? Any special tricks your code is doing?

@Neltherion

@Skylion007 It's probably the fully connected layer... those things take a lot of memory...

@eyaler
Contributor

eyaler commented Jan 17, 2017

  1. Changing the last layer from fully connected to a convolution, as in the original pix2pix implementation, did not give me any speedup.
  2. I think we should not run the G optimizer twice. It is against common wisdom to try to balance D and G by hand, and some even suggest training D twice and G once.
  3. Preprocessing alone can take up to ~50% of epoch time (in a specific case I had); it should be done only once, before training.
  4. I tried holding all facade training images in memory (instead of loading preprocessed versions from an SSD); this did not help. (This approach is not scalable, but it could be done in chunks.)
  5. Not evaluating losses after each batch; I assume there is a better way to get them from the training run()? (See the sketch after this comment.)

With (2) and (3) I could bring the epoch time down from 180 s to 110 s (facades on a GTX 1080).
Also doing (5) brought it down to 85 s. Still a factor of 2 too slow.
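Regarding (5): the losses can be fetched in the same session.run() call as the train op, so they come out of the training step's forward pass instead of costing extra ones. A minimal sketch (`sess`, `d_optim`, `d_loss`, `g_loss`, and `feeds` are illustrative names in the DCGAN-tensorflow style, not necessarily the exact variables here):

    # Before: separate loss evaluations after the update cost extra forward passes
    #   sess.run(d_optim, feed_dict=feeds)
    #   errD = d_loss.eval(feeds)
    #   errG = g_loss.eval(feeds)
    # After: one run() call fetches the train op and both losses together
    _, errD, errG = sess.run([d_optim, d_loss, g_loss], feed_dict=feeds)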

@yenchenlin
Owner

yenchenlin commented Jan 18, 2017

Thanks @eyaler, it's 2 and 3 IMO, and 2 is the crucial point.

I'm really sorry that I've been dealing with some other annoying stuff recently 😭

@Neltherion

@eyaler This was an eye-opener... I had so many misconceptions about performance in this project! Thanks for your time... please keep going!

@Neltherion

Neltherion commented Jan 21, 2017

Can anyone tell me why we do this:

        # concatenate the input image and the generated image along the
        # channel axis (old tf.concat signature: the axis comes first in TF <= 0.12)
        self.fake_AB = tf.concat(3, [self.real_A, self.fake_B])
        self.D_, self.D_logits_ = self.discriminator(self.fake_AB, reuse=True)

Why do we concat real_A and fake_B and give them BOTH to the discriminator, when what we want is to give it just one image (the generated fake, self.fake_B)?

Doesn't this force the discriminator to accept dual images (one half the real image and the other half the generated one) and double the time needed to process them?

@yenchenlin
Owner

Hello @Neltherion, please see this image from the paper:

[screenshot from the paper]

@Neltherion

Neltherion commented Jan 21, 2017

Hmm... you're right... just giving the fake images to the discriminator is probably not enough... my bad! Thanks for the quick reply...

@yenchenlin
Owner

Normally, a conditional GAN sends the conditioning data (e.g., class, attribute, text, or image) to the discriminator together with the synthesized image (see the sketch below). See this paper for a more sophisticated discriminator.
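In code terms, the discriminator always scores an (input, output) pair concatenated along the channel axis, never the output alone. A minimal sketch (illustrative names; TF 1.x concat signature, whereas the TF 0.12 code quoted above is `tf.concat(3, ...)`):

    import tensorflow as tf

    def discriminator_inputs(real_A, real_B, fake_B):
        """Conditional-GAN pairing: D judges whether B is a plausible
        translation OF A, not whether B looks real in isolation.
        Tensors are NHWC, so axis=3 is the channel axis. Illustrative sketch."""
        real_pair = tf.concat([real_A, real_B], axis=3)  # D should score as "real"
        fake_pair = tf.concat([real_A, fake_B], axis=3)  # D should score as "fake"
        return real_pair, fake_pair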

@eyaler
Contributor

eyaler commented Feb 23, 2017

Some benchmarks for the community (image iterations/sec):

| it/s | implementation | GPU      | framework | CUDA |
|-----:|----------------|----------|-----------|------|
| 5.2  | phillipi       | K80      | Torch     | 8    |
| 1.1  | yenchenlin     | K80      | TF 0.12.1 | 7.5  |
| 1.2  | yenchenlin     | K80      | TF 0.12.1 | 8    |
| 1.2  | yenchenlin     | K80      | TF 1.0    | 8    |
| 2.2  | yenchenlin     | GTX 1080 | TF 0.12.0 | 8    |
| 2.3  | yenchenlin_mod | K80      | TF 0.12.1 | 7.5  |
| 2.5  | yenchenlin_mod | K80      | TF 0.12.1 | 8    |
| 2.5  | yenchenlin_mod | K80      | TF 1.0    | 8    |
| 4.7  | yenchenlin_mod | GTX 1080 | TF 0.12.0 | 8    |
| 4.7  | affinelayer    | K80      | TF 1.0    | 8    |
| 5.5  | tensorpack     | K80      | TF 1.0    | 8    |

So it seems that tensorpack is the fastest, and that the GTX 1080 is about twice as fast as the K80.

All experiments are on the facades dataset and use cuDNN 5.1.

phillipi = https://github.com/phillipi/pix2pix
yenchenlin = https://github.com/yenchenlin/pix2pix-tensorflow
yenchenlin_mod = #4 (comment)
tensorpack = https://github.com/ppwwyyxx/tensorpack
affinelayer = https://github.com/affinelayer/pix2pix-tensorflow
