HeadPose net - with resnet18 backbone / pitch / yaw / roll #19
Comments
I think we need to train a custom resnet18 in order to predict translation. Hopenet applies a series of augmentations during training which do not alter yaw, roll and pitch (except for flipping), but I think they do alter translation. Any idea whether some existing code applies augmentation correctly for translation?
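To make the question concrete, here is a minimal sketch of label-aware augmentation, assuming the labels are (yaw, pitch, roll) plus a normalized head-center offset (tx, ty); the helper and the sign conventions are illustrative assumptions, not hopenet's actual code:

```python
import random
from PIL import Image

def augment_with_labels(img, yaw, pitch, roll, tx, ty):
    """Illustrative augmentation that keeps pose labels consistent.

    Assumes tx, ty are the head-center offset from the image center,
    normalized by image width/height; the sign conventions are
    assumptions for illustration only.
    """
    w, h = img.size

    # Horizontal flip: yaw and roll change sign, pitch is unchanged,
    # and the horizontal offset mirrors.
    if random.random() < 0.5:
        img = img.transpose(Image.FLIP_LEFT_RIGHT)
        yaw, roll, tx = -yaw, -roll, -tx

    # Random shift-crop: angles stay the same, but the head center moves
    # relative to the new image center, so tx/ty must be corrected.
    dx = int(random.uniform(-0.1, 0.1) * w)
    dy = int(random.uniform(-0.1, 0.1) * h)
    img = img.crop((dx, dy, dx + w, dy + h))  # out-of-bounds areas are padded
    tx -= dx / w
    ty -= dy / h

    return img, yaw, pitch, roll, tx, ty
```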
I don't know yet about translation. I did follow the paper and added some crop and warp functions for augmenting training. I'll work on this some more tomorrow.
When the warp and crop are applied, they are center-aligned according to the paper. That's how I've coded this most recently in the training fork - now merged. I think translation may not be a problem.
In the MegaPortraits paper it says: "We use a pre-trained network to estimate head rotation data, but the latent expression vectors z_s/d and the warpings to and from the canonical coordinate space are trained without direct supervision. ... The head pose prediction network is pre-trained, while the expression prediction network is trained from scratch." It seems the network should be pretrained and frozen during training. In the referenced paper, the module is designed the same as hopenet, except for the output heads. It also has a loss that uses the pretrained hopenet to generate ground truth for rotation angles, but not translation. I assume that in the referenced paper, this module is trained from scratch. Now the problem is how to obtain the "pretrained" resnet18 module that predicts rotation and translation. We can:
So that particular part - after inspecting the bad results for yaw/pitch/roll - I replaced with the off-the-shelf SixDRepNet. It's possible to freeze this (see the sketch after this comment). I attempted to extract the translation using this custom model - but failed.... so there's still work to do.
Consider that when training is underway, there's a mask that dictates where the head should be drawn into... so it kind of must learn where to draw from the source.
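For what it's worth, a minimal sketch of freezing a pretrained head-pose module in PyTorch; this is just the standard requires_grad / eval() approach, not necessarily what the training fork actually does, and head_pose_net is a placeholder name:

```python
import torch

def freeze(module: torch.nn.Module) -> torch.nn.Module:
    """Freeze a pretrained module so it only supplies fixed predictions."""
    for p in module.parameters():
        p.requires_grad = False
    module.eval()  # also fixes batch-norm statistics and disables dropout
    return module

# Hypothetical usage, with head_pose_net standing in for the pretrained SixDRepNet:
# head_pose_net = freeze(head_pose_net)
# with torch.no_grad():
#     pose = head_pose_net(images)
```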
I contacted the author - @chientv99 - and he sent this - https://github.com/chientv99/maskpose - unfortunately the pretrained weights are missing. :(
I browsed the code and paper a bit. If I am not wrong, this project does not address translation at all. I am pretraining a hopenet on resnet18 using the 300W-LP dataset. My observation is that the angles converge easily, but translation is much harder. Translation along x/y still seems to be converging, though slowly. Translation along z does not converge at all. This is probably because my augmentation does not crop aggressively enough (translation along z can only be modified by cropping + resizing). @johndpope If you can contact the author, do you mind asking them which dataset they used to pretrain the model and how translation is predicted?
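To illustrate the crop + resize point above, a rough sketch, under the assumption that the z-translation label is proportional to apparent head size (that parameterization is my assumption, not something stated in the thread):

```python
import random
from PIL import Image

def scale_crop(img, tz, min_zoom=0.7, max_zoom=1.3):
    """Hypothetical crop+resize that exposes z-translation to the model.

    Assumes tz is proportional to apparent head size, so zooming in by a
    factor `zoom` multiplies tz by `zoom`. Illustration only.
    """
    w, h = img.size
    zoom = random.uniform(min_zoom, max_zoom)

    # Crop a centered window of size (w/zoom, h/zoom), then resize back,
    # which makes the head appear `zoom` times larger or smaller.
    # Zooming out (zoom < 1) pads the borders with black.
    cw, ch = int(w / zoom), int(h / zoom)
    left, top = (w - cw) // 2, (h - ch) // 2
    img = img.crop((left, top, left + cw, top + ch)).resize((w, h), Image.BILINEAR)

    return img, tz * zoom
```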
He forwarded a message to a lady to find the weights. I'm pretty sure from the train script it's the same one as you're using. With the preprocessing steps to get images, there are some caveats with MP.
I'm working on gaze loss - it's converging, though my 3090 GPU is getting cooked.... Need some cloud compute.
I don't quite get it - are you saying the author also seems to be pretraining a hopenet with the 300W-LP dataset? Any detail on the design of the prediction head - same as the original hopenet, or using 6DRepNet like what you implemented? I understand that MegaPortraits needs matting and doesn't do backgrounds, but that should not affect how this module is pretrained. And my translation is indeed converging, with the z axis slowest; I should have waited a bit longer.
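On the prediction-head question, a rough sketch of one plausible layout: a resnet18 backbone with hopenet-style binned heads for the angles plus a plain regression head for translation. The bin count and the head design are my assumptions, not what either paper actually specifies:

```python
import torch
import torch.nn as nn
import torchvision

class HeadPoseNet(nn.Module):
    """Sketch: resnet18 backbone, hopenet-style binned angle heads,
    plus a direct regression head for (tx, ty, tz). Illustrative only."""

    def __init__(self, num_bins: int = 66):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled feature
        self.backbone = backbone
        self.fc_yaw = nn.Linear(512, num_bins)
        self.fc_pitch = nn.Linear(512, num_bins)
        self.fc_roll = nn.Linear(512, num_bins)
        self.fc_trans = nn.Linear(512, 3)    # translation regressed directly

    def forward(self, x):
        feat = self.backbone(x)
        return (self.fc_yaw(feat), self.fc_pitch(feat),
                self.fc_roll(feat), self.fc_trans(feat))
```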
In the paper they say the head is centered and cropped (if I recall correctly), so shifting left/right shouldn't matter. I can extend the augmentation of frames to include both a zoomed-in crop and a sweet-spot crop; the model should learn to handle both. My interest now is to work on VASA, which will disentangle with a transformer. The portrait code is going to drop, so this will be a completely academic exercise.
This has different cropping from the v-express findings - #36
FYI - #36 - this part of the architecture may be redundant.
Thanks a lot! Results in the referenced issue gave me a lot of confidence that warping is not a critical component. We can retrain with the module plugged in later, if it turns out that explicit control of the pose is needed (it's sort of good to have in my case, but not absolutely necessary).
So I did some digging and found this paper from 2022:
https://arxiv.org/pdf/2210.13705
It spells out exactly how to do this.
I recreated it here:
https://github.com/johndpope/HPENet-hack
but the model needs training.
Looking for an eval set led me here, and frankly this looks much better:
https://github.com/thohemp/6drepnet
so I will rewire the HeadPose module to just use this instead.
UPDATE
But it doesn't support translation....
https://github.com/search?q=repo%3Athohemp%2F6DRepNet+translation&type=discussions
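For reference, a minimal sketch of wiring in 6DRepNet for yaw/pitch/roll via the sixdrepnet pip package; the API here is as I remember it from the repo's README and should be double-checked against https://github.com/thohemp/6DRepNet, and translation would still have to come from somewhere else:

```python
# Sketch only - API as remembered from the 6DRepNet README; verify before use.
import cv2
from sixdrepnet import SixDRepNet  # pip install sixdrepnet

model = SixDRepNet()                   # loads pretrained weights

img = cv2.imread("face_crop.jpg")      # hypothetical pre-cropped face image
pitch, yaw, roll = model.predict(img)  # rotation only - no translation output
print(pitch, yaw, roll)
```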