
HeadPose net - with resnet18 backbone / pitch / yaw / roll #19

Closed
johndpope opened this issue May 28, 2024 · 13 comments

@johndpope
Owner

johndpope commented May 28, 2024

So I did some digging and found this paper from 2022:
https://arxiv.org/pdf/2210.13705

It spells out exactly how to do this, and I recreated it here:
https://github.com/johndpope/HPENet-hack

But the model needs training. While looking for an evaluation set, I landed here, and frankly this looks much better:
https://github.com/thohemp/6drepnet

So I will rewire the HeadPose to just use this instead.

# pip install sixdrepnet
import cv2
from sixdrepnet import SixDRepNet

# Create model
# Weights are automatically downloaded
model = SixDRepNet()
img = cv2.imread('/path/to/image.jpg')
pitch, yaw, roll = model.predict(img)
model.draw_axis(img, yaw, pitch, roll)  # optional: visualize the pose axes

UPDATE

But it doesn't support translations...
https://github.com/search?q=repo%3Athohemp%2F6DRepNet+translation&type=discussions

@robinchm

I think we need to train the custom resnet18 in order to predict translation. Hopenet applies a series of augmentations during training which do not alter yaw, roll, and pitch (except for flipping), but which do, I think, alter translation. Any idea whether any code applies augmentation correctly for translation?
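For illustration, here is a minimal sketch of a crop-and-resize augmentation that also updates the translation labels. The pixel-space convention for (tx, ty) and the weak-perspective treatment of tz (zooming in looks like the head moving closer) are my assumptions, not verified against 300W-LP's label format:

import random
import cv2

def crop_and_fix_labels(img, tx, ty, tz, out_size=224, min_crop=160):
    # Random crop + resize that keeps translation labels consistent.
    # Assumes (tx, ty) are pixel coordinates in the full image and tz
    # behaves like depth under a weak-perspective camera (both assumed).
    h, w = img.shape[:2]
    crop = random.randint(min_crop, min(h, w))   # side length of the crop window
    x0 = random.randint(0, w - crop)
    y0 = random.randint(0, h - crop)
    s = out_size / crop                          # resize factor
    patch = cv2.resize(img[y0:y0 + crop, x0:x0 + crop], (out_size, out_size))
    tx2, ty2 = (tx - x0) * s, (ty - y0) * s      # shift into crop coords, then rescale
    tz2 = tz / s                                 # zooming in ~ head moved closer
    return patch, tx2, ty2, tz2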

@johndpope johndpope reopened this May 28, 2024
@johndpope
Owner Author

I don't know yet about translation.

I did follow the paper and added a crop-and-warp function for augmenting training. I'll work on this some more tomorrow.
https://github.com/johndpope/MegaPortrait-hack/blob/feat/14-training/train_base.py#L51

@johndpope
Owner Author

When the warp and crop are applied, the result is center-aligned according to the paper. That's how I've coded it most recently in the training fork, now merged. I think translation may not be a problem.

@robinchm

@johndpope

In the MegaPortraits paper it says:

We use a pre-trained network to estimate head rotation data, but the latent expression vectors z s/d and the warpings to and from the canonical coordinate space are trained without direct supervision.

...

The head pose prediction network is pre-trained, while the expression prediction network is trained from scratch.

It seems the network should be pretrained and frozen during the training of Gbase. There is no mention of the architecture of this head pose estimator, but the paper says it's inspired by "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing".

In the referenced paper, the module is designed the same as hopenet, except for the output heads. It also has a loss that uses the pretrained hopenet to generate ground truth for rotation angles, but not translation. I assume that in the referenced paper, this module is trained from scratch.
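For concreteness, that rotation loss could look something like this (hopenet_pretrained and pose_net are placeholder names of mine; the L1 choice is an assumption — the point is only that rotation gets pseudo ground truth from the frozen hopenet while translation has none):

import torch
import torch.nn.functional as F

with torch.no_grad():
    gt_angles = hopenet_pretrained(imgs)      # (bs, 3) pseudo ground-truth yaw/pitch/roll
pred_angles, pred_trans = pose_net(imgs)      # translation gets no such supervision
rotation_loss = F.l1_loss(pred_angles, gt_angles)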

Now the problem is how to obtain the "pretrained" resnet18 module that predicts rotation and translation. We can:

  • train it from scratch, but then the translation parameters in the 300W-LP dataset used by hopenet need to be understood correctly, since almost all augmentation modifies translation (see the sketch after this list)
  • use the unofficial implementation of the "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing" paper, which contains the weights, but for a head pose estimator that is based on resnet50 and additionally computes expression coefficients.
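To make the first option concrete, here is a minimal sketch of what such a module could look like. The binned-angle heads and the soft-expectation trick (66 bins of 3 degrees, mapped back via idx*3 - 99) follow hopenet; the class name PoseNet6DoF and the 3-unit translation regression head are my own assumptions, not anything from the papers:

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PoseNet6DoF(nn.Module):
    # Hopenet-style binned rotation heads plus a regressed translation head.
    def __init__(self, num_bins=66):
        super().__init__()
        backbone = resnet18(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # -> (bs, 512, 1, 1)
        self.fc_yaw = nn.Linear(512, num_bins)
        self.fc_pitch = nn.Linear(512, num_bins)
        self.fc_roll = nn.Linear(512, num_bins)
        self.fc_trans = nn.Linear(512, 3)  # (tx, ty, tz) regression -- assumed head
        self.register_buffer('idx', torch.arange(num_bins, dtype=torch.float32))

    def _expected_angle(self, logits):
        # soft-argmax over bins, mapped back to degrees as in hopenet
        return (torch.softmax(logits, dim=1) * self.idx).sum(dim=1) * 3 - 99

    def forward(self, x):
        f = torch.flatten(self.features(x), 1)  # (bs, 512)
        angles = torch.stack([
            self._expected_angle(self.fc_yaw(f)),
            self._expected_angle(self.fc_pitch(f)),
            self._expected_angle(self.fc_roll(f)),
        ], dim=1)  # (bs, 3) yaw/pitch/roll in degrees
        return angles, self.fc_trans(f)

In hopenet training the binned logits get cross-entropy plus an MSE on the expected angle; translation could get a plain MSE, but that part is an assumption.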

@johndpope
Owner Author

johndpope commented May 30, 2024

So for that particular part, after inspecting the bad results for yaw/pitch/roll, I replaced it with the off-the-shelf SixDRepNet.

It's possible to freeze this using .eval() after instantiation, but I'm not saving that model...
https://github.com/johndpope/MegaPortrait-hack/blob/main/mysixdrepnet.py#L796

I attempted to extract the translation using this custom model, but failed... so there's still work to do. It apparently needs retraining. What I did was augment the two data sources to get proper rotation parameters.
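One caveat on the freezing point above: .eval() by itself only switches batchnorm/dropout to inference behavior; to actually freeze the estimator you also need to disable gradients. A minimal sketch (head_pose_estimator is a placeholder name for whatever pretrained pose module is plugged in):

head_pose_estimator.eval()                 # inference-mode batchnorm/dropout only
for p in head_pose_estimator.parameters():
    p.requires_grad_(False)                # stop the optimizer from ever updating it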

import logging

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Assumes device, FEATURE_SIZE, COMPRESS_DIM and SixDRepNet_Detector are
# defined earlier in the file (see mysixdrepnet.py in the repo).

class Emtn(nn.Module):
    def __init__(self):
        super().__init__()
        # https://github.com/johndpope/MegaPortrait-hack/issues/19
        # Rotation is taken from the off-the-shelf SixDRepNet instead of this head
        self.head_pose_net = resnet18(pretrained=True).to(device)
        self.head_pose_net.fc = nn.Linear(self.head_pose_net.fc.in_features, 6).to(device)  # 6 = rotation (3) + translation (3) parameters
        self.rotation_net = SixDRepNet_Detector()

        model = resnet18(pretrained=False, num_classes=512).to(device)  # feature_maps = resnet18(input_image) -> should print torch.Size([1, 512, 7, 7])
        # Remove the fully connected layer and the adaptive average pooling layer
        self.expression_net = nn.Sequential(*list(model.children())[:-1])
        self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)  # https://github.com/neeek2303/MegaPortraits/issues/3
        # self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d((7, 7))  # OPTIONAL 🤷 - 16x16 is better?

        ## TODO 2
        outputs = COMPRESS_DIM  # 512; convenient for the later WarpS2C operation (512 -> 2048 channels)
        self.fc = torch.nn.Linear(2048, outputs)

    def forward(self, x):
        # Rotation comes from the off-the-shelf SixDRepNet
        rotations, _ = self.rotation_net.predict(x)
        logging.debug(f"📐 rotation :{rotations}")

        head_pose = self.head_pose_net(x)

        # Split head pose into rotation and translation parameters;
        # the rotation half is unreliable, so only translation is used
        # rotation = head_pose[:, :3]
        translation = head_pose[:, 3:]

        # Forward pass image through expression network
        expression_resnet = self.expression_net(x)
        ### TODO 2
        expression_flatten = torch.flatten(expression_resnet, start_dim=1)
        expression = self.fc(expression_flatten)  # (bs, 2048) -> (bs, COMPRESS_DIM)

        return rotations, translation, expression
    # This encoder outputs head rotations R_s/d, translations t_s/d, and latent expression descriptors z_s/d

Consider that when training is underway, there's a mask that dictates where the head should be drawn... so the model kind of has to learn where to draw the head from the source.

@johndpope
Owner Author

I contacted the author, @chientv99, and he sent this: https://github.com/chientv99/maskpose

Unfortunately the pretrained weights are missing. :(

@robinchm

robinchm commented Jun 5, 2024


I browsed the code and paper a bit. If I'm not wrong, this project does not address translation at all.

I am pretraining a hopenet on resnet18 using the 300W-LP dataset. My observation is that the angles converge easily, but translation is much harder. Translation along x/y still seems to be converging, though slowly. Translation along z does not converge at all. This is probably because my augmentation does not crop aggressively enough (translation along z can only be modified by cropping + resizing).

@johndpope If you can contact the author, do you mind asking them which dataset they used to pretrain the model and how translation is predicted?

@johndpope
Owner Author

He forwarded a message to a lady to find the weights. I'm pretty sure from the train script that it's the same one as you're using. With the preprocessing steps to get images, there are some caveats.

MP (MegaPortraits):

  1. They don't do backgrounds
  2. Shoulders are also off
  3. They recreate videos with matting and third-party libraries

I'm working on gaze loss; it's converging, though my 3090 GPU is getting cooked...

Need some cloud compute.
#36

@robinchm

robinchm commented Jun 5, 2024


I don't quite get it. Are you saying that the author also seems to be pretraining a hopenet with the 300W-LP dataset? Any detail on the design of the prediction head: the same as the original hopenet, or using 6drepnet like what you implemented?

I understand that MegaPortraits needs matting and doesn't do backgrounds, but that should not affect how this module is pretrained.

And my translation is indeed converging, with the z axis slowest; I should have waited a bit longer.


@johndpope
Owner Author

In the paper they say the head is centered and cropped (if I recall correctly), so shifting left/right shouldn't matter. I can extend the frame augmentation to include both a zoomed-in and a sweet-spot crop; the model should learn to do both. My interest now is working on VASA, which will disentangle with a transformer. The portrait code is going to drop, so this will be a completely academic exercise.

@johndpope
Owner Author

This has different cropping from the V-Express findings - #36

@johndpope
Owner Author

FYI - #36 - this part of the architecture may be redundant.

@robinchm

robinchm commented Jun 6, 2024

Thanks a lot! The results in the referenced issue gave me a lot of confidence that warping is not a critical component. We can retrain with the module plugged in later if it turns out that explicit control of the pose is needed (it's sort of good to have in my case, but not absolutely necessary).
