Current results of training - epoch 4 #36

Closed
johndpope opened this issue Jun 4, 2024 · 56 comments

@johndpope (Owner) commented Jun 4, 2024

I used another of the videos as the driving video - and, perhaps unsurprisingly, it does not rotate the head past the point the original movie reached - see below.

Screenshot from 2024-06-04 22-45-21

cross_reenacted_image_57

pred_frame_191

Tomorrow I'll plug in a bigger dataset.

UPDATE - #37

When I normalize the images, I end up with this - it looks bad - so I added some code in train.py to un-normalize; happy with the current results for now...

1

FYI - these are the frames dumped out from the mp4 - head cropped / maybe some warping.
Screenshot from 2024-06-04 23-11-08

@johndpope (Owner, Author) commented Jun 4, 2024

@Jie-zju commented Jun 5, 2024

I trained on data like VoxCeleb. Looking forward to more results!

@johndpope (Owner, Author)

pred_frame_361
epoch 21 - it's converging.....

@Jie-zju commented Jun 5, 2024

So, as I mentioned before: a loss on the face foreground?

@johndpope (Owner, Author) commented Jun 5, 2024

There are 6-7 different losses - https://github.com/johndpope/MegaPortrait-hack/blob/main/train.py
I didn't do the gaze loss yet (I drafted it, hit a snag, and need to take another look):

self.weights['gaze'] * 1 #gaze_loss
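A minimal sketch of how such a weighted sum of losses might be combined; the loss names and weight values here are placeholders, not the actual train.py configuration:

import torch

def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    # Scale each individual loss term by its configured weight and sum them;
    # terms without a configured weight contribute nothing.
    return sum(weights.get(name, 0.0) * value for name, value in losses.items())

# Example with placeholder names:
# loss = total_loss({'perceptual': p_loss, 'adversarial': g_loss, 'gaze': gaze_loss},
#                   {'perceptual': 20.0, 'adversarial': 1.0, 'gaze': 1.0})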

@johndpope (Owner, Author)

epoch 50
pred_frame_355

@JZArray commented Jun 5, 2024

epoch 50 pred_frame_355

Is this the self-reconstruction result?

@JZArray commented Jun 5, 2024

How do your reenactment results look on the eval dataset? BTW, how many IDs have you used to train this model?

@johndpope (Owner, Author)

I'd need 2 years of GPU time to actually complete 200,000 epochs -
I have a dataset here of 35,000 videos
#37

For augmentation I'm rendering every frame - but I hit a snag with different video lengths, so I'm only overfitting to 1 source video, 1 driving, 1 star source, 1 star driving.
I don't really want to burn out my 3090 card - I'm looking at Vertex AI - the preprocessing to warp and crop is a significant time sink.

I'm exploring cheaper GPU training hacks to collapse the training timeline. My 3090 can spit out high-res, high-fidelity images with Stable Diffusion - so this code is useless to me if I can't train it.

https://github.com/johndpope/LadaGAN-pytorch

@JZArray commented Jun 5, 2024

(quoting @johndpope's comment above)

OK, I see. In my case, when increasing the number of IDs, ID leakage appears - not sure whether you have the same problem with your code.

@Kwentar commented Jun 5, 2024

Hi, congratulations! I decided to change their pipeline and am currently far from the paper. In my experience:

  • Two losses are important: perceptual (I use the excellent lpips) and cross-entropy on the Z vectors (the way to fix ID leakage) - a hedged lpips sketch follows after this list
  • We don't need Es at all
  • We don't need the rotation/translation warping operation; the output of the warping generator is enough
  • One grid_sample is enough (before g3d)
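A minimal sketch of the lpips perceptual term mentioned above, using the lpips package; the variable names (pred, target) are placeholders, not code from this thread:

import lpips
import torch

# LPIPS perceptual distance with a VGG backbone (one common choice).
lpips_fn = lpips.LPIPS(net='vgg').cuda()

def perceptual_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # LPIPS expects RGB tensors scaled to [-1, 1], shape (N, 3, H, W).
    return lpips_fn(pred, target).mean()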

I am currently training on VoxCeleb2, results:
Drive:
image
Predicted drive:
image
Predicted S* based on drive:
image

If you have questions or need details feel free to ask

@JZArray commented Jun 5, 2024

(quoting @Kwentar's comment above)

@Kwentar Nice work! Could you say more about how you do the warp in your case? Or maybe share your code?

@johndpope (Owner, Author) commented Jun 5, 2024

You may have more luck cherry-picking my warp-and-crop / remove-background code in EmoDataset - I'm saving out an npz file for faster iterations next run. They say in the paper they don't do backgrounds. Did you look at the gaze loss?
On epochs - my code cycles through a short video of 90+ frames. Is that one epoch? Or is that 90 epochs?? They mention batch size 16 - is that 16 frames of video in total?
The more training I do on a single video, the better it gets. How much training did you get to? What GPU compute do you have? Are you training in the cloud? Azure / AWS? The video dataset / code from EMOPortraits is going to render this codebase obsolete - what are your motives? Academic or commercial? I have some videos from VoxCeleb2.

How big are your checkpoints? Presumably you're saving the discriminator? This one is PatchGAN / CycleGAN based - would my training results improve dramatically if you just shared that?
I was only able to get where I am thanks to @Kevinfringe and his cycle consistency loss. He used a concatenation of 2 images - not sure this is necessary / desirable. It's in the main branch.

I'm interested in plugging in novel architectures - the VASA stuff - but there are others too. How about you? This architecture can't do audio - is that important to you?

Share code if you can.

Regarding throwing out the rotation / translation - I kind of see why this would still work (being more like a face swapper) - but have a look at this video - michaildoukas/headGAN#10 - where control of the target's head pose is completely disentangled. The MS VASA, I think, had this capability too.
There are many libraries / repos doing the "video A drives video B" thing - AniPortrait is outstanding - but the magic with this architecture is the high frame rate / real-time control. I'll switch back to VASA (which needs to be completely rebuilt) in a little bit.

https://github.com/johndpope/VASA-1-hack

@Kwentar commented Jun 5, 2024

@johndpope I am not ready to share code because it is unreadable :D

  • I didn't do anything with the data yet, just using the plain VoxCeleb2 academic dataset without any processing or augmentation (closest plans: remove backgrounds and add losses on the eyes and mouth)
  • Currently I have only two losses: lpips and CosFace on the Zs; I will add more losses later (a hedged sketch of the CosFace term follows after this list)
  • About epochs -- don't worry, it really doesn't matter, it's just a convention; i.e. my epoch is "one pair from each video", so my batch is 12*8 (I have 8 GPUs) and an epoch is 11,000 iters. It is still training; the images above are at 34,000 iters (3+ epochs)
  • What are your motives? -- I do it commercially, and as I said I used MegaPortrait only as a starting point
  • How big are your checkpoints? Presumably you're saving the discriminator? -- I still have no success with the discriminator and I don't use it; checkpoints are around 500MB for all networks (g2d is the biggest at 300MB)
  • I'm interested to plug in novel architectures - the VASA stuff... This architecture can't do audio - is that important to you? -- Yes, I came to MegaPortrait from VASA, so audio is important, and it is quite easy -- all we need is to make Eaudio with the same output as Emtn. Currently I am trying to move this task to a diffusion network, but have no success here yet
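As one possible reading of "CosFace on the Zs" above, here is a hedged sketch of a CosFace-style margin loss applied to the motion latents; the class count, scale s, and margin m are assumptions, not values from this thread:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CosFaceHead(nn.Module):
    def __init__(self, feat_dim: int, num_ids: int, s: float = 30.0, m: float = 0.35):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_ids, feat_dim))
        self.s, self.m = s, m

    def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between normalized latents and per-identity weights.
        cosine = F.linear(F.normalize(z), F.normalize(self.weight))
        # Subtract the margin from the target-class logit only, then scale.
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.s * (cosine - one_hot * self.m)
        return F.cross_entropy(logits, labels)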

Warp Generator:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3D(nn.Module):
    def __init__(self, input_channels, output_channels, padding=1):
        super().__init__()
        self.conv1 = nn.Conv3d(input_channels, output_channels, 3, padding=padding)
        self.conv2 = nn.Conv3d(output_channels, output_channels, 3, padding=padding)
        # 1x1x1 conv shortcut when the channel count changes, identity otherwise.
        if input_channels != output_channels:
            self.shortcut = nn.Sequential(
                nn.Conv3d(input_channels, output_channels, kernel_size=1),
                nn.GroupNorm(num_channels=output_channels, num_groups=32)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.shortcut(x)
        out = self.conv1(x)
        out = F.group_norm(out, num_groups=32)
        out = F.relu(out)
        out = self.conv2(out)
        out = F.group_norm(out, num_groups=32)
        out += residual
        out = F.relu(out)
        return out


class WarpGenerator(nn.Module):
    def __init__(self, input_channels):
        super().__init__()
        # 1x1 conv lifts the concatenated latent to 2048 channels,
        # which is then reshaped into a 512-channel volume with depth 4.
        self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=2048, kernel_size=1, padding=0, stride=1)
        self.resblock1 = ResBlock3D(512, 256)
        self.resblock2 = ResBlock3D(256, 128)
        self.resblock3 = ResBlock3D(128, 64)
        self.resblock4 = ResBlock3D(64, 32)

        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)
        self.conv2 = nn.Conv3d(32, 3, kernel_size=3, padding=1)

    def forward(self, zs_es):
        x = self.conv1(zs_es)
        x = x.view(x.size(0), 512, 4, x.size(2), x.size(3))

        # Upsample depth/spatial dims between residual blocks
        # (F.interpolate replaces the deprecated F.upsample).
        x = self.resblock1(x)
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock2(x)
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock3(x)
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock4(x)
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=True)

        x = self.gn(x)
        x = F.relu(x, inplace=True)
        x = self.conv2(x)
        x = torch.tanh(x)  # flow field values in [-1, 1]
        return x

GBase Inference:

motion_latent_source = z_network(source)
motion_latent_drive = z_network(drive)
volume_source = model_app(source)

Wem_source = warping_generator(torch.cat([-motion_latent_source, motion_latent_drive], dim=1))
g3d_input = F.grid_sample(volume_source, Wem_source.permute(0, 2, 3, 4, 1), align_corners=True)

volume_generated = g3d(g3d_input)
x_generated = g2d(volume_generated)

@JZArray commented Jun 5, 2024

@Kwentar thanks for sharing the information. Could you explain the motivation for removing the rotation and translation and warping only once? It gives up the ability to control the head pose explicitly. Have you tried following the paper exactly, and how did that turn out?

@Kwentar commented Jun 5, 2024

@JZArray

  • Could you explain the motivation for removing the rotation and translation, and warping only once? -- I am a fan of "end2end" and don't need the ability to control head pose explicitly :)
  • Have you tried following the paper exactly? -- Yes, but with no success -- the article has A LOT of problems (wrong architecture images, wrong formulas, an abnormal quantity of losses, etc.), so I decided to do it based on my own experience

@hazard-10

Judging from these preliminary results, it seems like RT-BENE for the gaze loss isn't necessary at all?

@hazard-10

@Kwentar Hey, great work! Do you mind sharing a bit more on hardware usage - GPU spec (SXM/PCIe), VRAM, and training time per iter / epoch? And roughly how many epochs are you planning to train on 8 GPUs for convergence before transferring that to VASA?

@johndpope (Owner, Author) commented Jun 5, 2024

There's a cross re-enactment image that gets spat out. Quality is low at 50 epochs. The eyes are lining up - this code has an mpgazeloss using MediaPipe which may do the job. But in the other ticket I describe the data preparation issue with different eye blinks - not trivial. Also, I want the video preprocessing to happen 1000x faster - it's taking 5 mins per video dmlc/decord#302
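For context, a hedged sketch of what a MediaPipe-based gaze term could look like (this is not the repo's actual mpgazeloss): compare iris landmark positions between the predicted and driving frames. Note MediaPipe runs outside autograd, so as written this works only as a metric and would need a differentiable surrogate to backpropagate.

import mediapipe as mp
import numpy as np

# refine_landmarks=True adds the iris landmarks (indices 468-477).
face_mesh = mp.solutions.face_mesh.FaceMesh(static_image_mode=True, refine_landmarks=True)

def iris_points(rgb_image: np.ndarray):
    results = face_mesh.process(rgb_image)
    if not results.multi_face_landmarks:
        return None
    lm = results.multi_face_landmarks[0].landmark
    return np.array([[lm[i].x, lm[i].y] for i in range(468, 478)])

def gaze_distance(pred_rgb: np.ndarray, drive_rgb: np.ndarray) -> float:
    # Mean absolute difference of normalized iris coordinates as a gaze proxy.
    p, d = iris_points(pred_rgb), iris_points(drive_rgb)
    return float(np.abs(p - d).mean()) if p is not None and d is not None else 0.0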

The main branch should run training as is - let me know if it doesn't. There's a feature branch I'm stabilising to get more videos / IDs - though I hit a bump.

UPDATE - FYI @Kwentar - diffusion + talking - tencent-ailab/V-Express#6 (no training code)

@JZArray commented Jun 6, 2024

(quoting @Kwentar's comment and the Warp Generator / GBase inference code above)

@Kwentar hello, a quick question about F.grid_sample: don't you need to first interpolate Wem_source [batch, 3, 4, 16, 16] to the same spatial shape as volume_source [batch, 96, 16, 32, 32]? Otherwise, after the F.grid_sample operation, the shape of volume_source will be changed to [batch, 96, 4, 16, 16] and cannot be correctly processed by the following g2d module, because g2d requires input of shape [batch, 96, 16, 32, 32].
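For reference, a hedged sketch of the shape alignment being asked about (whether Kwentar's pipeline actually does this is not confirmed in this thread): F.grid_sample outputs the grid's spatial size, so one option is to upsample the warp field to the volume's (D, H, W) first.

import torch
import torch.nn.functional as F

def warp_volume(volume: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # volume: (B, C, D, H, W); flow: (B, 3, d, h, w) with values in [-1, 1].
    # Upsample the field so the sampled output keeps the volume's spatial size.
    flow = F.interpolate(flow, size=volume.shape[2:], mode='trilinear', align_corners=True)
    grid = flow.permute(0, 2, 3, 4, 1)  # (B, D, H, W, 3), as grid_sample expects
    return F.grid_sample(volume, grid, align_corners=True)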

@johndpope (Owner, Author)

It may not be the specific case here, but I found this discrepancy creeps in with different image input sizes, 256 vs 512. My code in main is only handling 512 at the moment.

@JackAILab commented Jun 6, 2024

epoch 50 pred_frame_355

Hi @johndpope, may I ask if your current results are in-domain results saved during training? Have you tried testing at inference time with some out-of-domain inputs?

My current model structure is mostly consistent with yours. Unfortunately, the visualizations saved during my training process look very good, but the results of the inference process (after one epoch) are very bad. It may be that I haven't trained for enough epochs, or there is a problem with the model structure described in the paper. @Kwentar can you share more experience regarding the current results?

Currently, I use 2024 song videos and 2880 speech videos from the RAVDESS data (as S* frames). After one epoch, loss_perceptual converges from 150 to 73.3125.

epoch 0 -> epoch 1 in the training process (batch size is set to 6, and 6 images are output side by side):
cross_reenacted_0

cross_reenacted_40

cross_reenacted_79

cross_reenacted_40_0

cross_reenacted_83_55

epoch 1 in the inference process (the first is the source image, the second is the driving image, and the last is the output; the result has not converged):

output_25_epoch3_0
output_51_epoch3_0

@JZArray commented Jun 6, 2024

(quoting @JackAILab's comment above)

Are the 4th and 5th rows the self-reconstruction results on the training set?

@JackAILab

@JZArray yes

@johndpope (Owner, Author)

FYI - anyone actually training might want to check this out: #41

@flyingshan commented Jun 12, 2024

@Kwentar Hi, did you make any other modifications compared with the original paper? We have followed your instructions for the warping module and implemented the other modules following the paper, but the predicted results cannot capture small movements in local areas, especially lip movement and eye blinks. I am wondering if the 32*32 resolution we use for the appearance features is too small - any suggestions? BTW, we train on 256*256 input.

@JZArray Have you solved this? I encountered the same problem: the eyes won't close (though closed eyes can open). Did you train the model end to end as Kwentar did, or control the pose explicitly?

@JZArray commented Jun 12, 2024

(quoting @flyingshan's comment above)

@flyingshan Not yet - we are still sticking with our own model and have not tried his. Furthermore, may I ask how your model's generalization ability looks: can it generalize well to unseen IDs, i.e. does ID leakage appear?

@flyingshan

(replying to @JZArray)

I have implemented the cycle loss mentioned in the paper, but I want to solve the expression-fitting problem first, so that loss is not included in training yet. To generalize to unseen IDs, I think it is important to have a large dataset with various IDs. I think VoxCeleb2 with 6k IDs will do.

@JZArray commented Jun 12, 2024

(quoting @flyingshan's comment above)

@flyingshan OK, did you use the model provided by @johndpope, or implement it yourself? I have also implemented the cycle loss, but have not used it yet, because I want the model first to generalize well to unseen IDs on self-reenactment tasks, and then pay more attention to cross-ID reenactment. Otherwise, I think it makes no sense to use the cycle loss when the model cannot do self-reenactment for seen/unseen IDs.

@flyingshan

(quoting the exchange above)

I implemented G3d myself according to the paper. The warping module is implemented following Kwentar. As for Eapp/G2d, I adopted the structure from face-vid2vid and modified some parts to match the structure in MegaPortrait.

@JZArray commented Jun 12, 2024

(quoting the exchange above)

@flyingshan can you share some visual results here if possible?

@flyingshan commented Jun 12, 2024

(quoting the exchange above)

I only got some visualizations in the evaluation process.
[source/prediction/drive]

@JZArray commented Jun 12, 2024

(quoting the exchange above)

@flyingshan we got similar results haha

@JZArray commented Jun 12, 2024

(quoting the exchange above)

@flyingshan Forgot to ask: how many IDs are you using now? And in your evaluation results, did those IDs also appear in the training dataset?

@flyingshan

@JZArray I have not really counted yet; it may be around a few hundred. The IDs in evaluation may appear in the training set.

@coachqiao2018

(quoting the exchange above)

The results look like mine. The lip and mouth areas are not driven well. I followed the paper to implement it.
image

@JZArray commented Jun 12, 2024

(quoting the exchange above)

Your background looks good! How many IDs have you used for training to get such results, and in your evaluation results, did those IDs also appear in the training dataset? Also, have you tried reenactment between different IDs - does ID leakage appear? (Could you share your code if possible?)

@coachqiao2018

(quoting the exchange above)

These results are sampled during training; I haven't performed quantitative evaluations yet, because the mouth and eye areas are bad and the code needs more improvements. About 40,000 IDs for training, not from the VoxCeleb dataset. I plan to reprocess our dataset and remove the backgrounds to train the model.

@JZArray commented Jun 12, 2024

(quoting the exchange above)

How long did you train to get such results? Could you share some details about how you implement the warp operations?

@coachqiao2018

(quoting the exchange above)

About 200 epochs, roughly one week. The warp operation behind the results above is similar to johndpope's, but I predict a 6D pose and use tanh to constrain the rotation and translation.
image
(copied from pytorch3d). This is my attempt, following EMOPortrait.
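A hedged sketch of such a 6D pose head, using pytorch3d's rotation_6d_to_matrix; the layer size and the input feature are assumptions, not coachqiao2018's actual code:

import torch
import torch.nn as nn
from pytorch3d.transforms import rotation_6d_to_matrix

class PoseHead(nn.Module):
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 9)  # 6 values for rotation, 3 for translation

    def forward(self, feat: torch.Tensor):
        out = torch.tanh(self.fc(feat))           # tanh bounds rotation and translation
        rot = rotation_6d_to_matrix(out[:, :6])   # (B, 3, 3) rotation matrices
        trans = out[:, 6:]                        # (B, 3) translation
        return rot, trans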

@johndpope (Owner, Author) commented Jun 15, 2024

Try the new main code - Jay @hazard-10 spotted an error with CosFace in training, and Claude fixed it.
That's on top of these fixes:

  • Save / restore checkpoint - specify in config ./configs/training/stage10base.yaml to restore a checkpoint
  • auto-crop video frames to the sweet spot
  • TensorBoard losses
  • LPIPS added to the perceptual loss - it's currently weighted 10x - this wasn't specified in the paper:
    class PerceptualLoss(nn.Module):
        def __init__(self, device, weights={'vgg19': 20.0, 'vggface': 5.0, 'gaze': 4.0, 'lpips': 10.0}):
  • gaze (not yet done)
  • additional ImagePyramide from the one-shot-view code for the loss (hopefully to sharpen the image) - seems to be working; a hedged sketch follows after this list
  • https://github.com/johndpope/MegaPortrait-hack/blob/main/model.py#L1070
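A hedged sketch of a multi-scale (image pyramid) perceptual term in the spirit of the ImagePyramide item above; the scales and the perceptual function are placeholders, not the repo's exact code:

import torch.nn.functional as F

def pyramid_perceptual_loss(pred, target, perceptual_fn, scales=(1.0, 0.5, 0.25)):
    # Evaluate the perceptual loss at several resolutions and average them,
    # so both coarse structure and fine detail contribute to the gradient.
    total = 0.0
    for s in scales:
        p = pred if s == 1.0 else F.interpolate(pred, scale_factor=s, mode='bilinear', align_corners=False)
        t = target if s == 1.0 else F.interpolate(target, scale_factor=s, mode='bilinear', align_corners=False)
        total = total + perceptual_fn(p, t)
    return total / len(scales)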

For the discriminator I've drafted code to take it to a multi-scale PatchGAN - maybe that will also boost image quality...
#46

The leakage - I'm seeing it with my overfitted videos. I think the Es is the source of the problems.
When I worked on the Emote paper -
https://github.com/johndpope/Emote-hack/blob/main/train_stage_1_referencenet.py

UPDATE -
From re-reading the above, I understand adding more losses may be counterproductive.
That said - https://arxiv.org/pdf/2404.10667 - I put the DPE losses from the VASA paper into the training code. It doesn't seem to be hurting.
#51

@johndpope (Owner, Author)

Dear CommitCrew -

I bring you a cleaner / faster / smarter way to disentangle images using 3x ResNet-50 backbones.
https://arxiv.org/pdf/2405.07257

https://github.com/johndpope/speak-hack
I just started training 5 minutes ago - so far... not converging.

@JZArray commented Jun 28, 2024

@Kwentar @flyingshan how is your progress now?

Repository owner deleted a comment from JZArray Jun 28, 2024
@johndpope (Owner, Author)

I had it incorrectly configured to overfit - updated now: johndpope/SPEAK-hack#1

@JZArray commented Jul 12, 2024

@Kwentar hello, may I ask how you implement the cross-entropy for Z_exp?
