Very confused by the discriminator loss #93

xesdiny · 2021-08-07T08:26:17Z

I am training the VQGAN pipeline on the FFHQ dataset.
I checked that disc_loss is computed by a function similar to vanilla_d_loss:

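import torch
import torch.nn.functional as F

def hinge_d_loss(logits_real, logits_fake):
    # Penalize real logits below +1 and fake logits above -1.
    loss_real = torch.mean(F.relu(1. - logits_real))
    loss_fake = torch.mean(F.relu(1. + logits_fake))
    d_loss = 0.5 * (loss_real + loss_fake)
    return d_loss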

But the loss metric in TensorBoard looks very strange!

I am not sure whether this discriminator loss is actually being optimized in a way that helps generator training.

The discriminator loss joins the training process once the training step reaches 30K. Note that, for the picture above, I logged the discriminator loss metric from the start of training.

hyakuchiki · 2021-08-07T23:50:42Z

A lot of people seem to have the same problem with the discriminator not being trained properly.
#73
Have you looked at the d_weight value in TensorBoard? If it is fluctuating at high values, that might be a problem.
I suspect that if the disc_start parameter is higher, the reconstruction will settle first and d_weight will take a sensible value. The authors suggest training 3-5 epochs without the discriminator in the case of ImageNet, so that would mean disc_start should be several million? I guess the discriminator should only be used once the VQVAE is starting to produce alright results.
#31
The default value for disc_start is 10000 in custom_vqgan.yaml, which seems way too low.
I had the same problem, so I set disc_start to 50000 and disc_weight to 0.2, and I'm getting somewhat better results (although I'm worried that disc_weight is a bit too low now?).
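To make the role of disc_start concrete, here is a minimal plain-Python sketch of step-based gating, assuming the discriminator term is simply multiplied by a factor that stays at zero until the global step reaches disc_start. The helper name `adopt_weight` and its signature are modelled on the repository's loss module but should be treated as an assumption here, not as a quote of the actual code.

```python
def adopt_weight(weight, global_step, threshold=0, value=0.0):
    """Return `value` until `global_step` reaches `threshold`, then `weight`."""
    if global_step < threshold:
        return value
    return weight


disc_start = 50000  # the value suggested in this thread

for step in (0, 10000, 49999, 50000, 60000):
    # disc_factor multiplies the adversarial term; 0.0 means "discriminator off".
    disc_factor = adopt_weight(1.0, step, threshold=disc_start)
    print(step, disc_factor)
```

With gating like this, raising disc_start only changes when the adversarial term switches on; it does not change how strongly it is weighted afterwards, which is what disc_weight controls.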

xesdiny · 2021-08-09T07:12:01Z

Hmm, yeah!
I understand you to mean that the discriminator is not useful until the generator reaches a decent baseline, so the point at which the discriminator enters training should be delayed.
The d_weight factor is used as the coefficient that weights the discriminator term in total_loss.
It is computed as the ratio of the 2-norms of the gradients of rec_loss and g_loss with respect to the parameters of the model's last layer:

    def calculate_adaptive_weight(self, nll_loss, g_loss, last_layer=None):
        if last_layer is not None:
            nll_grads = torch.autograd.grad(nll_loss, last_layer, retain_graph=True)[0]
            g_grads = torch.autograd.grad(g_loss, last_layer, retain_graph=True)[0]
        else:
            nll_grads = torch.autograd.grad(nll_loss, self.last_layer[0], retain_graph=True)[0]
            g_grads = torch.autograd.grad(g_loss, self.last_layer[0], retain_graph=True)[0]

        d_weight = torch.norm(nll_grads) / (torch.norm(g_grads) + 1e-4)
        d_weight = torch.clamp(d_weight, 0.0, 1e4).detach()
        d_weight = d_weight * self.discriminator_weight
        return d_weight

The d_weight_step value in your TensorBoard approaches zero.
I think this value should be stable at about 1 in order to guide the generator. (But in fact, when the value was floating around 1, disc_loss did not decrease.) Maybe I didn't understand the meaning behind d_weight correctly.
Hmm, I will adopt your suggestions for this pipeline.
Thx~
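As a rough illustration of what the d_weight formula above does, the following plain-Python sketch reproduces only its arithmetic (norm ratio, clamp to [0, 1e4], scaling by discriminator_weight). The gradient values are hypothetical numbers chosen for illustration, and the helper names `l2_norm` and `adaptive_weight` are not from the repository.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of gradient entries."""
    return math.sqrt(sum(v * v for v in values))


def adaptive_weight(nll_grads, g_grads, discriminator_weight=1.0):
    """Same arithmetic as calculate_adaptive_weight, on plain lists."""
    ratio = l2_norm(nll_grads) / (l2_norm(g_grads) + 1e-4)
    ratio = min(max(ratio, 0.0), 1e4)  # clamp to [0, 1e4]
    return ratio * discriminator_weight


# Hypothetical last-layer gradients (flattened), for illustration only.
balanced = adaptive_weight([0.3, 0.4], [0.3, 0.4])      # norms match
tiny_rec = adaptive_weight([0.003, 0.004], [0.3, 0.4])  # reconstruction gradient is small
tiny_adv = adaptive_weight([0.3, 0.4], [0.0, 0.0])      # adversarial gradient vanishes
scaled = adaptive_weight([0.3, 0.4], [0.3, 0.4], discriminator_weight=0.2)

print(balanced, tiny_rec, tiny_adv, scaled)
```

Read this way, a d_weight near zero means the reconstruction gradient at the last layer is much smaller than the adversarial gradient, while very large values mean the adversarial gradient has nearly vanished. A value near 1 only says the two gradient norms are comparable; it does not by itself say the discriminator is learning.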

fortunechen · 2021-09-07T13:46:18Z

Hi, how are your results now? Could you please share what you learned from tuning disc_start and disc_weight?

Thx

MaxyLee · 2021-10-15T04:50:21Z

I succeeded in getting good results on the CUB dataset by setting disc_start=50,000 and disc_weight=0.2:
Original images:

Reconstructed images:

PanXiebit · 2021-10-17T06:44:23Z

@MaxyLee Congratulations! Could you share more details of your settings? How many examples are in your CUB dataset, and how many steps are in one epoch? After exactly how many epochs do you start the discriminator?

MaxyLee · 2021-10-17T08:29:47Z

Here is my config:

model:
  base_learning_rate: 4.5e-6
  target: taming.models.vqgan.VQModel
  params:
    embed_dim: 256
    n_embed: 1024
    ddconfig:
      double_z: False
      z_channels: 256
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1,1,2,2,4]  # num_down = len(ch_mult)-1
      num_res_blocks: 2
      attn_resolutions: [16]
      dropout: 0.0

    lossconfig:
      target: taming.modules.losses.vqperceptual.VQLPIPSWithDiscriminator
      params:
        disc_conditional: False
        disc_in_channels: 3
        disc_start: 50000
        disc_weight: 0.2
        codebook_weight: 1.0

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 5
    num_workers: 8
    train:
      target: taming.data.custom.CustomTrain
      params:
        training_images_list_file: /data/share/data/birds/CUB_200_2011/cub_train.txt
        size: 256
    validation:
      target: taming.data.custom.CustomTest
      params:
        test_images_list_file: /data/share/data/birds/CUB_200_2011/cub_test.txt
        size: 256

I trained this model on the CUB train split (8,855 images) using 4 GPUs, with approximately 400 steps per epoch. The discriminator therefore started after more than 100 epochs.
Hope this helps.
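The epoch figures above can be checked with a short calculation. This is a sketch under two assumptions that the thread does not state explicitly: that batch_size: 5 in the config is the per-GPU batch size, and that disc_start counts one step per batch.

```python
import math

num_images = 8855    # CUB train split, from the comment above
batch_size = 5       # from the config; assumed to be per GPU
num_gpus = 4
disc_start = 50000   # from the config

# One step consumes batch_size * num_gpus images under the assumption above.
steps_per_epoch = math.ceil(num_images / (batch_size * num_gpus))
epochs_before_disc = disc_start / steps_per_epoch

print(steps_per_epoch, round(epochs_before_disc, 1))
```

Under these assumptions the result is a little over 440 steps per epoch and a bit more than 110 epochs before the discriminator starts, which is consistent with "approximately 400 steps" and "more than 100 epochs". The same arithmetic can be run in reverse to pick disc_start for a different dataset size or batch size.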

PanXiebit · 2021-10-17T10:02:23Z

@MaxyLee thank u very much!!!

PanXiebit · 2021-10-19T02:19:32Z

Hi @MaxyLee, I trained the VQGAN with your settings on my own dataset; the discriminator starts at about 100 epochs, and disc_weight is 0.2. However, I still face the problem: the generated quality was alright, but after the discriminator started, it became worse. This is my training curve:

In fact, the generated images are alright without the discriminator. In your training process, did your generated images become much better after GAN training?

MaxyLee · 2021-10-19T06:26:30Z

Yes, my model performed much better when the discriminator loss was introduced. As shown in the figure, my model could not generate fine-grained images without the discriminator.

Maybe you can try to train the generator longer before adding d loss and select the best checkpoint.
Below are my training curves:

PanXiebit · 2021-10-20T01:26:28Z

@MaxyLee thank you for your patience and kindness! I will try more experiments.

kaihe · 2022-01-13T09:55:59Z

I think that for successful discriminator training, logits_fake should be negative and logits_real should be positive. But I noticed that in the training curves above, logits_fake and logits_real always look the same. Does that mean the discriminator has failed and just outputs the same value regardless of the input image? @MaxyLee, would you also share your training curves for the logits?
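To see what the hinge loss from the first post reports in the situation described here, the sketch below re-implements it in plain Python (lists instead of tensors, so it is not the repository code) and evaluates it on hypothetical logits for a well-separated discriminator and for one that outputs the same value for real and fake inputs.

```python
def relu(x):
    return max(x, 0.0)


def hinge_d_loss(logits_real, logits_fake):
    """Plain-Python version of the hinge discriminator loss quoted earlier."""
    loss_real = sum(relu(1.0 - r) for r in logits_real) / len(logits_real)
    loss_fake = sum(relu(1.0 + f) for f in logits_fake) / len(logits_fake)
    return 0.5 * (loss_real + loss_fake)


# Hypothetical logits, for illustration only.
separated = hinge_d_loss([1.5, 2.0], [-1.5, -2.0])  # real > +1, fake < -1
collapsed = hinge_d_loss([0.1, 0.1], [0.1, 0.1])    # same output for both

print(separated, collapsed)
```

If the discriminator outputs the same value x for real and fake inputs, the loss is 0.5 * (relu(1 - x) + relu(1 + x)), which equals 1 whenever |x| <= 1 and is larger otherwise. A disc_loss curve that sits near 1 is therefore consistent with a discriminator that is not separating the two classes, whereas a well-separated one drives the loss toward 0.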

MaxyLee · 2022-01-13T14:56:45Z

These are my training curves:

kaihe · 2022-01-14T03:15:54Z

Thanks very much, that confirms my suspicion: a good discriminator is enough for sharp images; there is no need for GAN equilibrium.

ThisisBillhe · 2024-06-17T09:09:45Z

Hi, how can the problem of logits_real and logits_fake being almost the same be solved?

xesdiny closed this as completed Aug 18, 2021

dribnet mentioned this issue Sep 16, 2021

debugging custom models #107

Open

snoop2head mentioned this issue Dec 13, 2021

Deciding when to initiate discriminator loss to kick in #129

Open

kaihe mentioned this issue Jan 26, 2022

Discriminator Loss Bug #137

Open

zhuqiangLu mentioned this issue Jan 9, 2023

vq_gan reconstruction results blurry using default code dome272/MaskGIT-pytorch#3

Open
