Current results of training - epoch 4 #36
I trained on data like vox. Looking forward to more results!
So, as I mentioned before: loss on the face foreground?
There are 6-7 different losses - see https://github.com/johndpope/MegaPortrait-hack/blob/main/train.py, line 1885 in ff9cf22.
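For context, a generic sketch of how several weighted loss terms might be combined in a training step; the term names echo losses discussed in this thread (perceptual, adversarial, cycle, gaze), but the weights and the exact set used in train.py are assumptions, not the repo's actual configuration.

```python
import torch

def total_loss(losses: dict, weights: dict) -> torch.Tensor:
    # Weighted sum of named loss terms; unknown names default to weight 1.0.
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Example usage with placeholder values (not measured numbers):
losses = {
    "perceptual": torch.tensor(73.3),
    "adversarial": torch.tensor(0.9),
    "cycle": torch.tensor(0.4),
    "gaze": torch.tensor(0.1),
}
weights = {"perceptual": 1.0, "adversarial": 0.1, "cycle": 1.0, "gaze": 0.5}
print(total_loss(losses, weights))
```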
How do your reenactment results look on the eval dataset? BTW, how many IDs have you used to train this model?
I need 2 years of GPU time to actually complete 200,000 epochs. I'm rendering the augmentation every frame, but I hit a snag with different lengths, so I'm only overfitting to 1 source video - 1 driving - 1 star source - 1 star driving. I'm exploring cheaper GPU training hacks to collapse the training timeline. My 3090 can spit out very high-res, high-fidelity images with Stable Diffusion, so this code is useless to me if I can't train it.
OK, I see. When increasing the number of IDs, in my case, ID leakage appears; not sure whether you have the same problem with your code.
@Kwentar Nice work! Could you say more about how you do the warp in your case? Or maybe you can share your code?
You may have more luck cherry-picking my warp and crop / remove-background code in EmoDataset - I'm saving out an npz file for faster iterations next run. They say in the paper they don't do backgrounds. Did you look at gaze loss? How big are your checkpoints? Presumably you're saving the discriminator? It's PatchGAN / CycleGAN based - would my training results dramatically improve if you just shared that? I'm interested in plugging in novel architectures - the VASA stuff - but there are also others. How about you? This architecture can't do audio - is that important to you? Share code if you can. Regarding throwing out the rotation / translation - I kind of see why this would still work (being more like a face swapper) - but have a look at this video - michaildoukas/headGAN#10 - where the control of the target's head pose is completely disentangled. The MS VASA I think had this capability too.
@johndpope I am not ready to share code because it is unreadable :D
Warp Generator:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock3D(nn.Module):
    def __init__(self, input_channels, output_channels, padding=1):
        super().__init__()
        self.conv1 = nn.Conv3d(input_channels, output_channels, 3, padding=padding)
        self.conv2 = nn.Conv3d(output_channels, output_channels, 3, padding=padding)
        if input_channels != output_channels:
            # 1x1 conv shortcut to match channel counts
            self.shortcut = nn.Sequential(
                nn.Conv3d(input_channels, output_channels, kernel_size=1),
                nn.GroupNorm(num_channels=output_channels, num_groups=32)
            )
        else:
            self.shortcut = nn.Identity()

    def forward(self, x):
        residual = self.shortcut(x)
        out = self.conv1(x)
        out = F.group_norm(out, num_groups=32)
        out = F.relu(out)
        out = self.conv2(out)
        out = F.group_norm(out, num_groups=32)
        out += residual
        out = F.relu(out)
        return out


class WarpGenerator(nn.Module):
    def __init__(self, input_channels):
        super(WarpGenerator, self).__init__()
        # 1x1 conv lifts the concatenated latents to 2048 channels,
        # which are then reshaped into a 512-channel volume of depth 4
        self.conv1 = nn.Conv2d(in_channels=input_channels, out_channels=2048, kernel_size=1, padding=0, stride=1)
        self.resblock1 = ResBlock3D(512, 256)
        self.resblock2 = ResBlock3D(256, 128)
        self.resblock3 = ResBlock3D(128, 64)
        self.resblock4 = ResBlock3D(64, 32)
        self.gn = nn.GroupNorm(num_groups=32, num_channels=32)
        self.conv2 = nn.Conv3d(32, 3, kernel_size=3, padding=1)

    def forward(self, zs_es):
        x = self.conv1(zs_es)
        x = x.view(x.size(0), 512, 4, x.size(2), x.size(3))
        x = self.resblock1(x)
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock2(x)
        x = F.interpolate(x, scale_factor=(2, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock3(x)
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=True)
        x = self.resblock4(x)
        x = F.interpolate(x, scale_factor=(1, 2, 2), mode='trilinear', align_corners=True)
        x = self.gn(x)
        x = F.relu(x, inplace=True)
        x = self.conv2(x)
        # tanh keeps the predicted 3D warp field in [-1, 1]
        x = torch.tanh(x)
        return x
```

GBase inference:

```python
# motion latents from the motion encoder, appearance volume from Eapp
motion_latent_source = z_network(source)
motion_latent_drive = z_network(drive)
volume_source = model_app(source)
# single warp generator driven by the negated source latent and the drive latent
Wem_source = warping_generator(torch.cat([-motion_latent_source, motion_latent_drive], dim=1))
g3d_input = F.grid_sample(volume_source, Wem_source.permute(0, 2, 3, 4, 1), align_corners=True)
volume_generated = g3d(g3d_input)
x_generated = g2d(volume_generated)
```
@Kwentar thanks for sharing the information. Could you explain the motivation for removing the rotation and translation and warping only once? It loses the ability to control the head pose explicitly. Have you tried following the paper exactly, and how were the results?
Judging from these preliminary results, it seems like RT-BENE for gaze loss isn't necessary at all?
@Kwentar Hey, great work! Do you mind sharing a bit more on hardware usage - GPU spec (SXM/PCIe), VRAM, and training time per iter / epoch? And roughly how many epochs are you planning to train on 8 GPUs for convergence before transferring that to VASA?
There's a cross-reenactment image that gets spat out. Quality is low at 50 epochs. The eyes are lining up - this code has MPGazeLoss using MediaPipe, which may do the job. But in the other ticket I describe data preparation differences - eyes blinking - not trivial. Also I want preprocessing of videos to happen 1000x faster - it's taking 5 mins per video: dmlc/decord#302. The main branch should run with training as is - let me know if it doesn't. There's a feature branch I'm stabilising to get more videos / IDs, though I hit a bump. UPDATE - fyi @Kwentar - diffusion + talking - tencent-ailab/V-Express#6
@Kwentar hello, a quick question about F.grid_sample: don't you need to first interpolate Wem_source [batch, 3, 4, 16, 16] to the same shape as volume_source [batch, 96, 16, 32, 32]? Otherwise, after the F.grid_sample operation, the shape of volume_source will be changed to [batch, 96, 4, 16, 16], and it cannot be correctly processed by the following G2d module, because G2d requires an input of shape [batch, 96, 16, 32, 32].
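For reference, a minimal shape check of the fix being asked about here; the tensor shapes are the ones quoted in the question, and the interpolation step is one possible way to align them (an assumption, not confirmed behavior of the repo).

```python
import torch
import torch.nn.functional as F

# Hypothetical tensors with the shapes quoted above.
volume_source = torch.randn(1, 96, 16, 32, 32)   # appearance volume: [B, C, D, H, W]
wem = torch.randn(1, 3, 4, 16, 16)                # predicted warp field: [B, 3, d, h, w]

# Upsample the warp field to the volume's spatial size before sampling,
# so grid_sample preserves the [B, 96, 16, 32, 32] shape expected downstream.
wem_up = F.interpolate(wem, size=volume_source.shape[2:], mode='trilinear', align_corners=True)
grid = wem_up.permute(0, 2, 3, 4, 1)              # [B, D, H, W, 3] grid for grid_sample
warped = F.grid_sample(volume_source, grid, align_corners=True)
print(warped.shape)  # torch.Size([1, 96, 16, 32, 32])
```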
May not be the specific case here, but I found this discrepancy creeps in with different image inputs, 256 vs 512. My code in main only handles 512 atm.
Hi @johndpope, may I ask if your current results are the in-domain results saved during training? My current model structure is mostly consistent with yours. Currently I use 2024 song videos and 2880 speech videos from the RAVDESS data (as s* frames). After one epoch, loss_perceptual converges from 150 to 73.3125. Epoch 0 -> epoch 1 in the training process (batch size is set to 6, and 6 images are output side by side). Epoch 1 in the inference process (the first one is the original image, the second one is the driving image, and the last one is the output result; the result has not converged).
Are the 4th and 5th rows the self-reconstruction results on the training set?
@JZArray yes
fyi - anyone actually training might want to check this out: #41
@JZArray Have you solved this? I encountered the same problem: the eyes won't close (but closed eyes can open). Did you train the model in an end-to-end way as Kwentar did, or control the pose explicitly?
@flyingshan Not yet; we are still sticking with our model rather than trying his. Furthermore, may I ask how your model's generalization ability looks - can it generalize well to unseen IDs, i.e. does ID leakage appear?
I have implemented the cycle loss mentioned in the paper, but I want to solve the expression-fitting problem first, so this loss is not included in the training yet. To generalize to unseen IDs, I think it is important to have a large dataset with varied IDs. I think VoxCeleb2 with 6k IDs will do.
@flyingshan OK, did you use the model provided by @johndpope, or implement it yourself? I also implemented the cycle loss, but haven't used it yet, because I want the model first to generalize well to unseen IDs on self-reenactment before paying more attention to cross-ID reenactment. Otherwise, I think it makes no sense to use the cycle loss when the model cannot do self-reenactment on seen/unseen IDs.
I implemented G3d myself according to the paper. The warping module is implemented according to Kwentar. As for Eapp/G2d, I adopt the structure from face-vid2vid and modify some parts to make them look like the structure in MegaPortraits.
@flyingshan can you share some visual results here if possible?
I only got some visualizations in the evaluation process.
@flyingshan we got similar results hhhh
@flyingshan Forgot to ask, how many IDs are you using now? And in your evaluation results, did these IDs also appear in the training dataset?
@JZArray I have not really counted this yet; it may be around hundreds. The IDs in evaluation may appear in the training set.
The results seem like mine. The lip and mouth areas are not driven well. I followed the paper to implement it.
Your background looks good! How many IDs have you used for training to get such results, and in your evaluation results, did these IDs also appear in the training dataset? Also, have you tried reenactment between different IDs - does ID leakage appear? (Could you share code if possible?)
These results are sampled during training; I didn't perform quantitative evaluations for now, because the mouth and eye areas are bad and the code needs more improvements. About 40,000 IDs for training, not from the VoxCeleb dataset. I plan to reprocess our dataset and remove backgrounds to train the model.
How long did you train to get such results? Could you share some details about how you implement the warp operations?
About 200 epochs, and one week. My warp operation corresponding to the above results is similar to johndpope's, but I predict a 6D pose and use tanh to constrain the rotation and translation.
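As a point of reference, a minimal sketch of what predicting a 6D pose with tanh-constrained rotation and translation could look like; the module name, feature dimension, and value ranges are illustrative assumptions, not the commenter's actual code.

```python
import torch
import torch.nn as nn

class PoseHead(nn.Module):
    """Hypothetical head-pose regressor: 3 rotation angles + 3 translations,
    each squashed by tanh and rescaled to an assumed range."""
    def __init__(self, feat_dim=512, max_angle=3.14159 / 2, max_trans=0.3):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 6)
        self.max_angle = max_angle
        self.max_trans = max_trans

    def forward(self, feats):
        pose = torch.tanh(self.fc(feats))           # values in (-1, 1)
        rotation = pose[:, :3] * self.max_angle     # yaw, pitch, roll in radians
        translation = pose[:, 3:] * self.max_trans  # normalized x, y, z offsets
        return rotation, translation
```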
Try the new main code - Jay @hazard-10 spotted an error with CosFace in training, and Claude fixed it.
The discriminator - I've drafted code to take it to a multi-scale PatchGAN; maybe that will also boost image quality. The leakage - I'm seeing it with my overfitted videos; I think the es is the source of the problems. UPDATE -
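For reference, a pix2pixHD-style multi-scale PatchGAN sketch; the layer widths, normalization choice, and number of scales are assumptions for illustration, not the drafted code mentioned above.

```python
import torch
import torch.nn as nn

class NLayerDiscriminator(nn.Module):
    """A pix2pix-style PatchGAN block that outputs patch-wise real/fake logits."""
    def __init__(self, in_ch=3, base=64, n_layers=3):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 4, stride=2, padding=1), nn.LeakyReLU(0.2, True)]
        ch = base
        for _ in range(n_layers - 1):
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2), nn.LeakyReLU(0.2, True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

class MultiScalePatchDiscriminator(nn.Module):
    """Runs the same PatchGAN at several downsampled image scales."""
    def __init__(self, num_scales=3):
        super().__init__()
        self.discs = nn.ModuleList([NLayerDiscriminator() for _ in range(num_scales)])
        self.down = nn.AvgPool2d(3, stride=2, padding=1, count_include_pad=False)

    def forward(self, x):
        outs = []
        for d in self.discs:
            outs.append(d(x))   # logits at the current scale
            x = self.down(x)    # halve resolution for the next discriminator
        return outs
```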
Dear CommitCrew - I bring you a cleaner / faster / smarter way to disentangle images using 3x ResNet-50 backbones: https://github.com/johndpope/speak-hack
@Kwentar @flyingshan how is your progress now?
I had it incorrectly configured to overfit - updated now: johndpope/SPEAK-hack#1
@Kwentar hello, may I ask how you implement the cross-entropy for Z_exp?
I used another of the videos as driving, and it's (almost) obviously not rotating the head past the point where the original movie went - see below.
[image: cross_reenacted_image_57]
[image: pred_frame_191]
Tomorrow I plug in a bigger dataset.
UPDATE - #37
When I normalize the images, I end up with this - it looks bad. I added some code in train.py to un-normalize - happy with the current results...
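For anyone hitting the same issue, a minimal un-normalize helper for visualization, assuming ImageNet normalization statistics (the actual mean/std used in train.py may differ).

```python
import torch

# Assumed ImageNet statistics; replace with whatever the dataloader actually uses.
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
IMAGENET_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def unnormalize(x: torch.Tensor) -> torch.Tensor:
    """Map a normalized [B, 3, H, W] batch back to [0, 1] for saving/display."""
    return (x * IMAGENET_STD.to(x.device) + IMAGENET_MEAN.to(x.device)).clamp(0.0, 1.0)
```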
fyi - these are the frames dumped out from the mp4 - head cropped / maybe some warping.
[image: Screenshot from 2024-06-04 23-11-08]