[Recommended] Questions on Re-implementation #3

Open
quqixun opened this issue Sep 13, 2021 · 21 comments

Comments

@quqixun

quqixun commented Sep 13, 2021

Hello, I am re-implementing this paper and ran into the following questions during the process:

  1. When cropping faces from the Celebrity-Asian and VggFace2 datasets, there are many blurry faces. How did you handle the blurry data, and roughly how much data did you finally use for training? This part is not covered in detail in the paper.

  2. In the last paragraph of the Feature-Level section:

After the feature-level fusion, we generate Ilow to compute auxiliary loss for better disentangling the identity and attributes. Then we use a 4× Upsample Module Fup which contains several res-blocks to better fuse the feature maps. Based on Fup, it's convenient for our HifiFace to generate even higher resolution results (e.g., 512 × 512).

And in Experiments:

For our more precise model (i.e., Ours-512), we adopt a portrait enhancement network [Li et al., 2020] to improve the resolution of the training images to 512×512 as supervision, and also correspondingly add another res-block in Fup of SFF compared to Ours-256.

Combined with your answers in issue #2, could you check whether my understanding below is correct:
2.1 For both Ours-256 and Ours-512, the input It is a 256-resolution image;
2.2 In Ours-256, after z_fuse is obtained, it first passes through two AdaIN res-blocks and then through Fup, which consists of two res-blocks with upsampling;
2.3 The only difference between Ours-512 and Ours-256 is that the Fup module of Ours-512 has one extra res-block with upsampling.

  3. Other questions:
    3.1 What is the structure of the output layers used to obtain Mlow, Mr, Ilow and Ir?
    3.2 After obtaining facial landmarks with the 3D face model, their coordinates are in the range [0, 224]; do you convert them to the range [0, 1]?
    3.3 When computing Lcyc, the cycle step outputs G(Ir, It); is Ir detached? During training I found Lcyc to be extremely low, one to two orders of magnitude smaller than the other losses.
    3.4 How is data sampled during training?
    3.5 The discriminator also uses res-blocks, and the res-blocks use InstanceNorm2d; could you confirm that InstanceNorm2d is used in the discriminator?
    3.6 Could you list the detailed structure of SFF?
    3.7 The face segmentation quality of HRNet is rather poor; did you apply any other optimization?
    3.8 Is training done in stages, or does the discriminator participate from the very beginning?
    3.9 When the face shapes differ greatly, the generated results show a double-chin artifact; is this artifact suppressed by the discriminator?

That is quite a few questions; looking forward to your reply. Thank you.

@johannwyh
Owner

johannwyh commented Sep 13, 2021

Thank you for your interest in HifiFace. This project is still going through the open-source approval process; thank you for your patience. Additionally, we sincerely hope you can ask your questions in English as well, so that this information can be shared with more of the community.
Here I will answer your questions. Feel free to keep in touch.

  1. We stated our data cleaning process in the Implementation Details part of 4. Experiments:

For our model with resolution 256 (i.e., Ours-256), we remove images with either size smaller than 256 for better image quality.

The size of Celebrity-Asian is roughly 680k, while that of VGG-Face2 is 640k.

  2. Regarding your understanding in 2.1-2.3 (an illustrative sketch follows below):
    2.1 Exactly. Ours-256 and Ours-512 both use 256-resolution input.
    2.2 Exactly.
    2.3 Exactly. By the way, we use an enhancement model to create pairs for same-identity samples.
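For other re-implementers, a minimal PyTorch sketch of the decoder skeleton confirmed in 2.1-2.3 might look like the following. The block definitions, channel widths, and names here are illustrative assumptions, not the official code:

```python
import torch.nn as nn

class UpsampleResBlock(nn.Module):
    # Placeholder block: nearest-neighbor upsample followed by a plain res-block.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        x = self.up(x)
        h = self.conv2(self.act(self.conv1(self.act(x))))
        return h + self.skip(x)

class FupSketch(nn.Module):
    """Upsample module F_up: two upsampling res-blocks for Ours-256 (4x),
    plus one extra block for Ours-512, as confirmed in 2.3 above."""
    def __init__(self, in_ch=256, use_512=False):
        super().__init__()
        n_blocks = 3 if use_512 else 2
        blocks, ch = [], in_ch
        for _ in range(n_blocks):
            blocks.append(UpsampleResBlock(ch, ch // 2))
            ch //= 2
        self.blocks = nn.Sequential(*blocks)

    def forward(self, z_fuse):
        # z_fuse is assumed to have already passed the two AdaIN res-blocks
        # mentioned in 2.2 (not shown here).
        return self.blocks(z_fuse)
```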

Feedback for question 3 is listed below.

@johannwyh
Owner

For the questions in part 3:

  3.1 We generate I and M from feature maps of the corresponding size (a minimal sketch follows after this list), where
  • for I, the feature map goes through a LeakyReLU(0.2) activation and a conv layer sequentially, and the output ranges in [-1, 1];
  • for M, the feature map goes through a conv layer and a sigmoid layer sequentially, and the output ranges in [0, 1].
  3.2 We do not transform the landmark coordinates. The loss is adjusted by weighting parameters.
  3.3 Your findings are correct. We do not detach I_r, and the loss value is relatively small.
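A minimal PyTorch sketch of these two output heads. The kernel size and the final Tanh on the image head are my assumptions, since the reply only specifies the activation/conv order and the output ranges:

```python
import torch.nn as nn

class ImageHead(nn.Module):
    # I head: LeakyReLU(0.2) then a conv; output expected in [-1, 1].
    # How the range is enforced is not stated in the reply; the Tanh here
    # is an assumption and can be dropped if the raw conv output suffices.
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.LeakyReLU(0.2),
            nn.Conv2d(in_ch, 3, kernel_size=3, padding=1),
            nn.Tanh(),
        )

    def forward(self, feat):
        return self.net(feat)

class MaskHead(nn.Module):
    # M head: conv then Sigmoid; output in [0, 1].
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat):
        return self.net(feat)
```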

@johannwyh
Owner

johannwyh commented Sep 13, 2021

For the remaining questions in part 3:

  3.4 In every mini-batch, we put 50% pairs of the same identity and 50% of different identities (see the sketch after this list).
  3.5 In the discriminator, we set normalize=False when building the ResBlks, so there are actually no normalization layers in the discriminator.
  3.6 You can refer to the supplementary materials of our arXiv version and the Questions of SFF #2 issue for the information you need.
  3.7 Our HRNet is trained on face data; you can use any face segmentation model that performs well.
  3.8 Our entire model is trained end-to-end and the discriminator is trained from the first epoch. One thing worth mentioning is that training may need 1000-5000 iterations to warm up, before which the generator produces nothing meaningful.
  3.9 When the source face is much thinner than the target, it is the most challenging case of face-shape-preserving swapping. Our SFF is designed to handle this but is still not a perfect solution. You are sincerely welcome to discuss your ideas on this issue with us; we are also continually working on producing more robust and impressive results.
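A minimal sketch of the 50/50 identity sampling described in 3.4. The `identity_to_images` mapping and the function name are hypothetical, for illustration only:

```python
import random

def sample_pair(identity_to_images, p_same=0.5):
    """Sample one (source, target) path pair; ~50% share the identity.
    `identity_to_images` maps identity id -> list of image paths
    (a hypothetical structure, not the authors' loader)."""
    ids = list(identity_to_images.keys())
    src_id = random.choice(ids)
    if random.random() < p_same:
        tgt_id = src_id                      # same-identity pair
    else:
        tgt_id = random.choice([i for i in ids if i != src_id])
    src = random.choice(identity_to_images[src_id])
    tgt = random.choice(identity_to_images[tgt_id])
    same_flag = (src_id == tgt_id)           # later gates L_rec / L_lpips
    return src, tgt, same_flag
```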

@quqixun
Author

quqixun commented Sep 14, 2021

Thanks a lot for your response. It is very helpful to understand the paper better.

Some questions about dataset:

  • 4.1 The sizes of Celebrity-Asian and VGG-Face2 are much larger than 680k and 640k.
    • One image might contain multiple faces.
    • After filtering images by the size threshold (256), about 3.8 million faces can be detected.
    • Were the face images filtered by any other rules?

Some questions about training:

  • 5.1 In the paper, the optimizer hyperparameters are ADAM(lr=0.0001, betas=(0.0, 0.99)); were the same settings used for the discriminator?
  • 5.2 Were any tricks applied to improve training stability, especially for the discriminator?
    • I found that the discriminator losses, D(real, 1), D(fake.detach(), 0) and D(fake, 1), become very high after some iterations.
    • Simultaneously, other losses, such as Lseg, Lsid, Lshape etc., are also very high.
    • Simultaneously, the output of the image-level fusion is almost exactly the same as the target image, which means the generator tends to predict an empty mask.
    • Then the discriminator losses drop back to a normal level, while Lseg decreases very slowly.
    • The above process repeats non-periodically.
    • Any suggestions for this issue?
  • 5.3 Lrec is calculated only when Is and It share the same identity.
    • What about Lcyc and Llpips? Are they calculated in the same way as Lrec?

Some questions about implementation:

  • 6.1 Open-source pretrained models of DFD, 3DMM, HRNet(LIP) and CurricularFace are available.
    • Were these models fine-tuned on your private data?

@johannwyh
Owner

About Section 5

  1. Yes.
  2. I am sorry that I cannot make direct suggestions about the situation you encountered, but I can give you two hints that might ease your stress (see also the sketch after this list):
    • Our Lshape is clamped to (0, 10); otherwise, at early stages an extreme face-shape difference might cause training to collapse.
    • At early stages (roughly the first 5000 iterations), the generator will "learn" to generate an image exactly the same as the target. This persists for some iterations, and then the losses force the generator to give it up.
    • I hope this information helps.
  3. Some losses only make sense when the source and target share the same identity:
    • Llpips can only be calculated when the source and target are the same person.
    • Lcyc can be applied to different-identity pairs; you can work through the generation logic yourself to see why.
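A minimal PyTorch sketch of how these two points (identity-gated reconstruction losses and the clamped Lshape) could be wired together. The loss forms, reduction, and tensor shapes here are assumptions for illustration only:

```python
import torch.nn.functional as F

def generator_aux_losses(i_r, i_t, q_fuse, q_r, same_id, lpips_fn):
    """`same_id` is a bool tensor of shape (B,); `lpips_fn` is any LPIPS metric
    returning one value per sample. Weights and exact loss forms are placeholders."""
    mask = same_id.float().view(-1, 1, 1, 1)

    # L_rec / L_lpips only supervise pairs where source and target share identity.
    l_rec = (mask * (i_r - i_t).abs()).mean()
    l_lpips = (same_id.float() * lpips_fn(i_r, i_t).view(-1)).mean()

    # L_shape on 3D landmarks (B, 17, 2/3), clamped per sample to (0, 10)
    # to avoid collapse on extreme face-shape gaps.
    l_shape = F.l1_loss(q_r, q_fuse, reduction="none").mean(dim=(1, 2))
    l_shape = l_shape.clamp(0.0, 10.0).mean()
    return l_rec, l_lpips, l_shape
```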

About Section 6

  • No, the open-source weights perform well.
  • Perhaps the only thing you need to do is convert the 3DMM weights from TensorFlow to PyTorch.

johannwyh pinned this issue Sep 17, 2021
@johannwyh
Owner

About Section 4

  • Celebrity-Asian and VGG-Face2 contain many images of low face quality.
  • You can use open-source quality-assessment models to eliminate the bad-quality images.
  • This is important for face generation models. For example, in this project, high-quality Lrec supervision plays a key role in high-fidelity face-swapping generation.

johannwyh changed the title from 复现相关问题 (Questions on re-implementation) to [Recommended] Questions on Re-implementation Sep 17, 2021
@yfji

yfji commented Sep 29, 2021

(quoting johannwyh's reply on the part-3 questions above)

What is the effect on the generator and the synthesis results of using InstanceNorm in the discriminator?
I have had this question for a long time. In my own re-implementation, I found that the discriminator loss was always very high without InstanceNorm and the synthesis was terrible. When I added InstanceNorm, the results were OK.
Besides, have you tried PatchGAN?

@johannwyh
Owner

(quoting yfji's question above)

Sorry for the late reply.

It is quite strange that you encounter this issue. In fact, in our implementation, we do not use InstanceNorm in the discriminator. In the generator, besides AdaIN, we use InstanceNorm in the encoder and bottleneck parts.

I wonder whether your backpropagation of the D loss is correct. Remember to detach the generator output when backpropagating the D loss (see the sketch below).
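For anyone hitting the same problem, a minimal sketch of a discriminator step with the detach in place. The softplus-based standard GAN loss here is just one common choice, not necessarily the exact loss used in HifiFace:

```python
import torch.nn.functional as F

def d_loss_step(D, real, fake, opt_d):
    """One discriminator update; the key point is fake.detach(), so gradients
    never flow back into the generator on the D step."""
    opt_d.zero_grad()
    logits_real = D(real)
    logits_fake = D(fake.detach())          # detach the generator output
    # Standard GAN discriminator loss: -log D(real) - log(1 - D(fake)).
    loss = F.softplus(-logits_real).mean() + F.softplus(logits_fake).mean()
    loss.backward()
    opt_d.step()
    return loss.item()
```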

@yfji

yfji commented Oct 18, 2021

(quoting the exchange above between yfji and johannwyh)

Yeah, I fixed some issues and the discriminator without IN works now, but there seems to be no difference from the discriminator using IN. I still wonder about the effect of IN in the discriminator, and about a single output vs. PatchGAN. I'd be very happy to hear your opinion!
Besides, I find that the synthesis in videos is not as realistic as for single images. In videos, the faces often suffer from color jittering and shaking between consecutive frames (I already used alpha-filtering in face alignment and color re-normalization). Do you have any suggestions? Thank you very much!

@johannwyh
Owner

HifiFace was mainly researched for image-based swapping rather than video, so it is normal that the model does not perform perfectly on videos, as we did not tune it for this.

If you need any help on video synthesis, feel free to drop me an email and maybe we can provide an official generation for you to compare with your result.

@yfji

yfji commented Oct 22, 2021

(quoting johannwyh's reply above)

Hi, I've sent an email to hififace.youtu@gmail.com, hoping for your reply! Thanks!

@gitlabspy

Hi, according to what you mentioned above, M_{low} is generated by passing z_{dec} through a conv layer with a sigmoid activation, and I_{low} is generated by passing z_{dec} through a LeakyReLU and a conv layer?

@taotaonice

Your work is great, really nice.
My re-implementation results seem to be OK, but I have some questions about the re-implementation.

  1. Is the adversarial or LPIPS loss applied to the 64x64-level images?
  2. Is the GAN loss implemented with WGAN-GP?
  3. Is the adversarial loss applied to I_cyc?

And in my experiments, PatchGAN seems to perform better.
Hoping for your answer!

@johannwyh
Owner

Hi, according to what you mentioned above, M_{low} is generated by passing z_{dec} through a conv layer with a sigmoid activation, and I_{low} is generated by passing z_{dec} through a LeakyReLU and a conv layer?

Exactly

@johannwyh
Owner

(quoting taotaonice's three questions above)

  1. Neither is applied to the 64x64 image in our implementation.
  2. The GAN loss is implemented as the original GAN loss with the log D trick (simply following the setting of StarGAN v2). For the discriminator loss, we apply a gradient penalty (see the sketch below).
  3. No, I_cyc is only used in the cycle loss.
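A small sketch of the log D trick and an R1-style gradient penalty in PyTorch, following the StarGAN v2 recipe the reply refers to. Whether HifiFace uses exactly this penalty form is an assumption; the reply only says "we apply gradient penalty":

```python
import torch
import torch.nn.functional as F

def g_nonsat_loss(logits_fake):
    # "log D trick": the generator maximizes log D(fake),
    # i.e. minimizes softplus(-D(fake)).
    return F.softplus(-logits_fake).mean()

def r1_gradient_penalty(D, real):
    # Gradient penalty on real images, as in the StarGAN v2 training code.
    real = real.detach().requires_grad_(True)
    logits = D(real)
    grad, = torch.autograd.grad(outputs=logits.sum(), inputs=real,
                                create_graph=True)
    return grad.pow(2).flatten(1).sum(1).mean()
```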

Thank you very much for your advice; we will definitely try more SOTA backbones for better results!

@gitlabspy

Thanks! I still have some questions:

  1. For the face model generated from coefficients in the Lshape loss, does q_fuse (or q_r) correspond to face_shape in these lines of code? https://github.com/sicxu/Deep3DFaceRecon_pytorch/blob/master/models/bfm.py#L86-L99
  2. You mentioned that the LPIPS loss only makes sense for two images of the same person, so is all data in the training set paired by the same person (source and target are the same person)?

@johannwyh
Owner

(quoting gitlabspy's questions above)
  1. q_fuse is the set of landmarks reconstructed from a fused embedding using the face reconstruction model (a rough sketch follows below). "Landmarks" here means the 17 facial-contour points among the 68 landmark points. The "fused embedding" is an embedding that uses the source's shape coefficients and the target's expression and pose coefficients. The reconstruction process does not use textures.
  2. See Implementation Details: 50% of training pairs are of the same identity, while the others are of different identities. For the different-identity pairs, simply do not apply these losses to their results.
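For readers implementing Lshape, here is a rough sketch of how such fused landmarks could be computed with a Deep3DFaceRecon-style BFM wrapper. The coefficient slicing (80 id / 64 exp / ... out of 257 dims) and the helper names on `bfm` are my assumptions about that repo's conventions, not the authors' code:

```python
import torch.nn.functional as F

def fused_contour_landmarks(bfm, coeff_src, coeff_tgt):
    """Build q_fuse: source shape identity + target expression/pose,
    projected to 2D and reduced to the 17 facial-contour landmarks.
    `bfm.compute_shape`, `compute_rotation`, `to_camera`, `to_image`,
    `get_landmarks` are hypothetical helper names. Textures are not used."""
    id_src = coeff_src[:, :80]                  # source face-shape identity
    exp_tgt = coeff_tgt[:, 80:144]              # target expression
    angle_tgt = coeff_tgt[:, 224:227]           # target pose (angles)
    trans_tgt = coeff_tgt[:, 254:257]           # target translation

    shape = bfm.compute_shape(id_src, exp_tgt)  # fused 3D face shape (B, N, 3)
    rot = bfm.compute_rotation(angle_tgt)       # (B, 3, 3)
    verts = bfm.to_camera(shape @ rot + trans_tgt.unsqueeze(1))
    lm68 = bfm.get_landmarks(bfm.to_image(verts))   # 68 projected landmarks
    return lm68[:, :17]                             # 17 facial-contour points

def shape_loss(q_fuse, q_r):
    # L1 distance between the fused contour and the contour of the result I_r.
    return F.l1_loss(q_r, q_fuse)
```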

@dypromise

Hi author, thanks for your great work! I have some questions about implementation details:

  1. The Mask_gt in loss_G_seg you mentioned is dilated from the mask of the It image; how big is the dilation kernel?
  2. About loss_rec and loss_lpips, taking loss_rec as an example: you mentioned it is only enabled when Is and It come from the same identity, and in theory it should be like this. But I have re-implemented some papers such as FaceShifter and SimSwap; their papers say the same, yet in practice the results are better when loss_rec is always applied no matter whether I_t and I_s come from the same ID, especially for preserving the attributes of the I_t face. So, in your implementation, do you REALLY apply loss_rec only when the pair comes from the same ID, or is the loss always on?
  3. Did you normalize the 3D coefficients before concatenating them with the ArcFace embedding, or normalize both kinds of coefficients and then concatenate?

@Continue7777

I am trying to re-implement the paper but have run into some problems; asking for help.
First, after training for a long time, the mask tends to become all empty (all 0) and the rec loss stays near 0.
To work around that, I used only the segmentation losses, and I find that the high-resolution seg loss works well while the low-resolution seg loss is ineffective. I think the high-resolution seg loss tends to force the low-resolution mask to become empty for a better overall effect:
it establishes a short path for the high-resolution loss and a longer, more difficult path for the low-resolution one.

@Continue7777

(attached screenshot)

@xuehy

xuehy commented May 15, 2023

(quoting the exchange above between yfji and johannwyh)

How did you apply color re-normalization? Is there any reference article or code?
