Why does prior distribution have no encoder loss? #6

Open
HaopengZhang96 opened this issue Oct 21, 2019 · 12 comments

Comments

@HaopengZhang96

Consider the following code:

term_a = torch.log(self.prior_d(prior)).mean()
term_b = torch.log(1.0 - self.prior_d(y)).mean()
PRIOR = - (term_a + term_b) * self.gamma

"-(term_a + term_b)" is the discriminator's loss, and "term_b" is the encoder's loss (similar to the generator's loss in a GAN).

In the code, only the discriminator's loss (the prior-distribution part) is backpropagated; the loss that belongs to the encoder for the prior distribution is never backpropagated.

loss.backward()  # loss = global + local + prior, prior = -(term_a + term_b)
optim.step()
loss_optim.step()

I think it should be the following process:

term_a = torch.log(self.prior_d(prior)).mean()
term_b = torch.log(1.0 - self.prior_d(y.detach())).mean()  # y should be detached
PRIOR = - (term_a + term_b) * self.gamma
encoder_loss_for_p = term_b
.............

loss.backward()  # loss = global + local + prior, prior = -(term_a + term_b)
optim.step()   # updates the encoder with gradients from global + local, but not prior
loss_optim.step()

encoder_loss_for_p.backward()   # optimize the encoder adversarially
optim.step()

Is my understanding wrong?
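
To make the two roles concrete, here is a minimal sketch in plain Python (no PyTorch) of the prior terms quoted above. The discriminator outputs passed in are made-up numbers standing in for `self.prior_d(prior)` and `self.prior_d(y)`, and `prior_terms` is a hypothetical helper, not a function from the repository.

```python
import math

def prior_terms(d_prior, d_y, gamma=1.0):
    """Return (term_a, term_b, PRIOR) for discriminator outputs in (0, 1).

    d_prior: hypothetical discriminator outputs on samples from the prior.
    d_y:     hypothetical discriminator outputs on encoder outputs.
    """
    term_a = sum(math.log(p) for p in d_prior) / len(d_prior)
    term_b = sum(math.log(1.0 - p) for p in d_y) / len(d_y)
    prior = -(term_a + term_b) * gamma
    return term_a, term_b, prior

# A discriminator that separates the two sources well gives a small loss.
_, _, good = prior_terms([0.9, 0.95], [0.1, 0.05])
# A discriminator that outputs 0.5 everywhere (fully fooled) gives a larger loss.
_, _, fooled = prior_terms([0.5, 0.5], [0.5, 0.5])
print(good, fooled)
```

Both logarithms are taken of values in (0, 1), so `term_a` and `term_b` are never positive and `PRIOR` is never negative. The discriminator tries to lower `PRIOR`; an encoder trained adversarially would push in the opposite direction.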

@HaopengZhang96
Author

I just found out that someone asked the same question earlier.

@DuaneNielsen
Owner

Yeah, this is why this is such a good technique. Unlike a GAN, it's not a minimax optimization.

Gradients are propagated directly through the loss function to the encoder network, and the two are optimized jointly.

The experimental setup is based on the idea that the mutual information between an image and a randomly selected image should be zero.

This forces the encoder to learn a latent space where encodings that share mutual information are close together, while those that don't are farther apart.
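
One simple way to get the "randomly selected image" pairing, used in code posted later in this thread, is to rotate the batch by one position. The sketch below shows the idea in plain Python, assuming the batch is already shuffled; the names `rotate_by_one`, `feature_maps` and `encodings` are illustrative only.

```python
def rotate_by_one(items):
    """Move the first element to the end, so index i now holds the old item i + 1."""
    return items[1:] + items[:1]

feature_maps = ["M0", "M1", "M2", "M3"]  # stand-ins for per-image feature maps
encodings = ["y0", "y1", "y2", "y3"]     # stand-ins for per-image encodings

# Positive pairs: an encoding with the feature map of the same image.
positive_pairs = list(zip(encodings, feature_maps))
# Negative pairs: an encoding with the feature map of a different image.
negative_pairs = list(zip(encodings, rotate_by_one(feature_maps)))
print(negative_pairs)
```

With a batch size of at least two, no encoding is paired with its own feature map in `negative_pairs`.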

@tianlili1

Excuse me, I ran into a problem when using the mutual information: its loss value is negative at the beginning. Is this normal?

@SchafferZhang

Hi @DuaneNielsen, have you checked @HaopengZhang96's question? Is it right that the prior distribution does not need an encoder loss?

@HaopengZhang96
Author

Excuse me, I ran into a problem when using the mutual information: its loss value is negative at the beginning. Is this normal?

That is not normal. In my experiments the mutual information is always positive.

@HaopengZhang96
Author

HaopengZhang96 commented Oct 15, 2020

Hi @DuaneNielsen, have you checked @HaopengZhang96's question? Is it right that the prior distribution does not need an encoder loss?

I read the code for the original paper, and I think I am right. The encoder and discriminator losses should be separated, as in a GAN.

@SchafferZhang

So, did you reimplement the code in this repo or use the official code? How does it work?

@HaopengZhang96
Author

I follow the DIM work and apply it to user behavior modeling. In my experiments, the local mutual information objective performs well on sequence modeling when the downstream task is classification. Actually, the prior loss is not important if you only care about downstream task performance. The prior loss plays the role of a normalization to some extent.

My paper is under submission and I haven't cleaned up the relevant code yet.

@SchafferZhang

Looking forward to your work!

@DuaneNielsen
Owner

Just to put a pin in this one. I think the answer is quite clear from the paper.

[image from the paper]

All three terms are added. There is no double backward pass in Infomax.
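
For readers who cannot see the image: assuming it shows the complete objective from the Deep InfoMax paper (arXiv:1808.06670), that objective has, as best it can be reconstructed here, the form below. In this notation $E_\psi$ is the encoder, $\omega_1$ and $\omega_2$ parameterize the global and local mutual-information estimators, $\phi$ parameterizes the prior discriminator, and $\alpha$, $\beta$, $\gamma$ are the weights that appear as `self.alpha`, `self.beta` and `self.gamma` in the code. Treat it as a sketch and check it against the paper.

```latex
\[
\arg\max_{\omega_1,\omega_2,\psi}
\Big(
  \alpha\, \hat{I}_{\omega_1,\psi}\big(X; E_\psi(X)\big)
  + \frac{\beta}{M^2} \sum_{i=1}^{M^2}
    \hat{I}_{\omega_2,\psi}\big(X^{(i)}; E_\psi(X)\big)
\Big)
+ \arg\min_{\psi} \arg\max_{\phi}\;
  \gamma\, \hat{D}_\phi\big(\mathbb{V} \,\|\, \mathbb{U}_{\psi,\mathbb{P}}\big)
\]
```

The three weighted terms are summed. In this form the first two are maximized jointly over the encoder and estimator parameters, while the prior term is written with an arg min over $\psi$ and an arg max over $\phi$.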

As to the loss becoming negative: this can happen because the PyTorch f-divergences can return "probabilities" greater than 1.0. This is due to the way f-divergences are calculated in practice. See the explanation of the formula in pytorch/pytorch#7637 for how log_prob can return a value greater than one.
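
Assuming the linked explanation concerns continuous distributions, the general point can be checked without PyTorch: a probability density is not a probability and can exceed 1, so its logarithm can be positive. The helper `normal_pdf` below is written out by hand for illustration and is not a PyTorch function.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# A narrow normal distribution has a density well above 1 at its mean.
density = normal_pdf(0.0, mean=0.0, std=0.1)
log_density = math.log(density)
print(density, log_density)
```

This only illustrates that a log-density can be positive. It does not by itself establish where a negative loss in this repository comes from.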

@yuu-Wang

I have changed the encoder loss as you said, but at the beginning the encoder loss is -9, and from epoch 40 to epoch 133 it is about -34. Could you tell me how you changed it? What am I doing wrong? Here is my code:
def forward(self, y, M, M_prime):

    # see appendix 1A of https://arxiv.org/pdf/1808.06670.pdf

    y_exp = y.unsqueeze(-1).unsqueeze(-1)
    y_exp = y_exp.expand(-1, -1, 26, 26)

    y_M = torch.cat((M, y_exp), dim=1)
    y_M_prime = torch.cat((M_prime, y_exp), dim=1)

    Ej = -F.softplus(-self.local_d(y_M)).mean()
    Em = F.softplus(self.local_d(y_M_prime)).mean()
    LOCAL = (Em - Ej) * self.beta

    Ej = -F.softplus(-self.global_d(y, M)).mean()
    Em = F.softplus(self.global_d(y, M_prime)).mean()
    GLOBAL = (Em - Ej) * self.alpha

    prior = torch.rand_like(y)

    epsilon = 1e-8
    # term_a = torch.log(self.prior_d(prior)).mean()
    # term_b = torch.log(1.0 - self.prior_d(y)).mean()
    term_a = torch.log(self.prior_d(prior) + epsilon).mean()
    term_b = torch.log(1.0 - self.prior_d(y) + epsilon).mean()
    PRIOR = - (term_a + term_b) * self.gamma

    discriminator_loss = LOCAL + GLOBAL + PRIOR
    encoder_loss = torch.log(self.prior_d(y.detach()) ).mean()

    return discriminator_loss, encoder_loss

encoder = Encoder().to(device)
loss_fn = DeepInfoMaxLoss().to(device)
encoder_optim = Adam(encoder.parameters(), lr=1e-4)
discriminator_optim = Adam(loss_fn.parameters(), lr=1e-4)

epoch_restart = 0
root = Path('/root/桌面/code/wangxy/DIM1/models/encoder/run3')

# if epoch_restart is not None and root is not None:
#     enc_file = root / Path('encoder' + str(epoch_restart) + '.wgt')
#     loss_file = root / Path('loss' + str(epoch_restart) + '.wgt')
#     encoder.load_state_dict(torch.load(str(enc_file)))
#     loss_fn.load_state_dict(torch.load(str(loss_file)))


for epoch in range(epoch_restart + 1, 1001):
    batch = tqdm(cifar_10_train_l, total=len(cifar_10_train_dt) // batch_size)
    dis_loss = []
    enc_loss = []

    for x, target in batch:
        x = x.to(device)

        y, M = encoder(x)
        # rotate images to create pairs for comparison
        M_prime = torch.cat((M[1:], M[0].unsqueeze(0)), dim=0)

        discriminator_loss, encoder_loss = loss_fn(y, M, M_prime)
        dis_loss.append(discriminator_loss.item())
        enc_loss.append(encoder_loss.item())

        # report both losses in one description; a second set_description call would overwrite the first
        batch.set_description(str(epoch)
                              + ' dis_Loss: ' + str(stats.mean(dis_loss[-20:]))
                              + ' enc_Loss: ' + str(stats.mean(enc_loss[-20:])))

        discriminator_optim.zero_grad()
        discriminator_loss.backward()
        discriminator_optim.step()

        encoder_optim.zero_grad()
        encoder_loss.backward()
        encoder_optim.step()
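
The sign of the reported values follows from the expression itself: `encoder_loss` is the mean of `log(D(y))`, and the logarithm of a number in (0, 1) is never positive. A minimal sketch, using only the two loss values reported above, of what they imply about the discriminator's outputs (`geometric_mean_output` is a hypothetical helper):

```python
import math

def geometric_mean_output(mean_log_d):
    """Geometric mean of discriminator outputs whose mean log equals mean_log_d."""
    return math.exp(mean_log_d)

early = geometric_mean_output(-9.0)    # loss reported at the beginning
later = geometric_mean_output(-34.0)   # loss reported around epochs 40 to 133
print(early, later)
```

Both values are tiny, which means the prior discriminator is assigning the encoder outputs a probability very close to zero. Note also that `y.detach()` cuts the graph between this term and the encoder, so `encoder_loss.backward()` can only produce gradients for the parameters of `prior_d`; in the loop as posted, the encoder receives no gradient from this term.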

@yuu-Wang

Hi @DuaneNielsen, have you checked @HaopengZhang96's question? Is it right that the prior distribution does not need an encoder loss?

I read the code for the original paper, and I think I am right. The encoder and discriminator losses should be separated, as in a GAN.

So, did you reimplement the code in this repo or use the official code? How does it work?

Do you know how this code should be adjusted to reach the 0.7 reported in his paper?
