
HeadPose net - with resnet18 backbone / pitch / yaw / roll #19

Closed
johndpope opened this issue May 28, 2024 · 13 comments

@johndpope
Owner

johndpope commented May 28, 2024

So I did some digging and found this paper from 2022:
https://arxiv.org/pdf/2210.13705

It spells out exactly how to do this, and I recreated it here:
https://github.com/johndpope/HPENet-hack

But the model needs training. While looking for an evaluation set, I landed here, and frankly this looks much better:
https://github.com/thohemp/6drepnet

So I will rewire the HeadPose to just use this instead.

# pip install sixdrepnet
import cv2
from sixdrepnet import SixDRepNet

# Create model
# Weights are automatically downloaded
model = SixDRepNet()
img = cv2.imread('/path/to/image.jpg')
pitch, yaw, roll = model.predict(img)
model.draw_axis(img, yaw, pitch, roll)  # optional: visualize the pose axes

UPDATE

But it doesn't support translations...
https://github.com/search?q=repo%3Athohemp%2F6DRepNet+translation&type=discussions

@robinchm

I think we need to train the custom resnet18 in order to predict translation. Hopenet applies a series of augmentations during training which do not alter yaw, roll, and pitch (except for flipping), but which do, I think, alter translation. Any idea whether any code applies augmentation correctly for translation?
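For illustration, here is a minimal sketch of a crop-and-resize augmentation that also updates the translation labels. The pixel-space convention for (tx, ty) and the weak-perspective treatment of tz (zooming in looks like the head moving closer) are my assumptions, not verified against 300W-LP's label format:

import random
import cv2

def crop_and_fix_labels(img, tx, ty, tz, out_size=224, min_crop=160):
    # Random crop + resize that keeps translation labels consistent.
    # Assumes (tx, ty) are pixel coordinates in the full image and tz
    # behaves like depth under a weak-perspective camera (both assumed).
    h, w = img.shape[:2]
    crop = random.randint(min_crop, min(h, w))   # side length of the crop window
    x0 = random.randint(0, w - crop)
    y0 = random.randint(0, h - crop)
    s = out_size / crop                          # resize factor
    patch = cv2.resize(img[y0:y0 + crop, x0:x0 + crop], (out_size, out_size))
    tx2, ty2 = (tx - x0) * s, (ty - y0) * s      # shift into crop coords, then rescale
    tz2 = tz / s                                 # zooming in ~ head moved closer
    return patch, tx2, ty2, tz2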

@johndpope johndpope reopened this May 28, 2024
@johndpope
Owner Author

I don't know yet about translation.

I did follow the paper and added a crop-and-warp function for augmenting training. I'll work on this some more tomorrow.
https://github.com/johndpope/MegaPortrait-hack/blob/feat/14-training/train_base.py#L51

@johndpope
Owner Author

When the warp and crop are applied, the result is center-aligned according to the paper. That's how I've coded it most recently in the training fork, now merged. I think translation may not be a problem.

@robinchm

@johndpope

In the MegaPortraits paper it says:

We use a pre-trained network to estimate head rotation data, but the latent expression vectors z s/d and the warpings to and from the canonical coordinate space are trained without direct supervision.

...

The head pose prediction network is pre-trained, while the expression prediction network is trained from scratch.

It seems the network should be pretrained and frozen during the training of Gbase. There is no mention of the architecture of this head pose estimator, but the paper says it's inspired by "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing".

In the referenced paper, the module is designed the same as hopenet, except for the output heads. It also has a loss that uses the pretrained hopenet to generate ground truth for rotation angles, but not translation. I assume that in the referenced paper, this module is trained from scratch.
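For concreteness, that rotation loss could look something like this (hopenet_pretrained and pose_net are placeholder names of mine; the L1 choice is an assumption — the point is only that rotation gets pseudo ground truth from the frozen hopenet while translation has none):

import torch
import torch.nn.functional as F

with torch.no_grad():
    gt_angles = hopenet_pretrained(imgs)      # (bs, 3) pseudo ground-truth yaw/pitch/roll
pred_angles, pred_trans = pose_net(imgs)      # translation gets no such supervision
rotation_loss = F.l1_loss(pred_angles, gt_angles)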

Now the problem is how to obtain the "pretrained" resnet18 module that predicts rotation and translation. We can:

  • train it from scratch, but then the translation parameters in the 300W-LP dataset used by hopenet need to be understood correctly, since almost all augmentation modifies translation (see the sketch after this list)
  • use the unofficial implementation of the "One-Shot Free-View Neural Talking-Head Synthesis for Video Conferencing" paper, which contains the weights, but for a head pose estimator that is based on resnet50 and additionally computes expression coefficients.
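To make the first option concrete, here is a minimal sketch of what such a module could look like. The binned-angle heads and the soft-expectation trick (66 bins of 3 degrees, mapped back via idx*3 - 99) follow hopenet; the class name PoseNet6DoF and the 3-unit translation regression head are my own assumptions, not anything from the papers:

import torch
import torch.nn as nn
from torchvision.models import resnet18

class PoseNet6DoF(nn.Module):
    # Hopenet-style binned rotation heads plus a regressed translation head.
    def __init__(self, num_bins=66):
        super().__init__()
        backbone = resnet18(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # -> (bs, 512, 1, 1)
        self.fc_yaw = nn.Linear(512, num_bins)
        self.fc_pitch = nn.Linear(512, num_bins)
        self.fc_roll = nn.Linear(512, num_bins)
        self.fc_trans = nn.Linear(512, 3)  # (tx, ty, tz) regression -- assumed head
        self.register_buffer('idx', torch.arange(num_bins, dtype=torch.float32))

    def _expected_angle(self, logits):
        # soft-argmax over bins, mapped back to degrees as in hopenet
        return (torch.softmax(logits, dim=1) * self.idx).sum(dim=1) * 3 - 99

    def forward(self, x):
        f = torch.flatten(self.features(x), 1)  # (bs, 512)
        angles = torch.stack([
            self._expected_angle(self.fc_yaw(f)),
            self._expected_angle(self.fc_pitch(f)),
            self._expected_angle(self.fc_roll(f)),
        ], dim=1)  # (bs, 3) yaw/pitch/roll in degrees
        return angles, self.fc_trans(f)

In hopenet training the binned logits get cross-entropy plus an MSE on the expected angle; translation could get a plain MSE, but that part is an assumption.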

@johndpope
Owner Author

johndpope commented May 30, 2024

So for that particular part, after inspecting the bad results for yaw/pitch/roll, I replaced it with the off-the-shelf SixDRepNet.

It's possible to freeze this using .eval() after instantiation, but I'm not saving that model...
https://github.com/johndpope/MegaPortrait-hack/blob/main/mysixdrepnet.py#L796

I attempted to extract the translation using this custom model, but failed... so there's still work to do. It apparently needs retraining. What I did was augment the two data sources to get proper rotation parameters.
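One caveat on the freezing point above: .eval() by itself only switches batchnorm/dropout to inference behavior; to actually freeze the estimator you also need to disable gradients. A minimal sketch (head_pose_estimator is a placeholder name for whatever pretrained pose module is plugged in):

head_pose_estimator.eval()                 # inference-mode batchnorm/dropout only
for p in head_pose_estimator.parameters():
    p.requires_grad_(False)                # stop the optimizer from ever updating it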

import logging

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Assumes device, FEATURE_SIZE, COMPRESS_DIM and SixDRepNet_Detector are
# defined earlier in the file (see mysixdrepnet.py in the repo).

class Emtn(nn.Module):
    def __init__(self):
        super().__init__()
        # https://github.com/johndpope/MegaPortrait-hack/issues/19
        # Rotation is taken from the off-the-shelf SixDRepNet instead of this head
        self.head_pose_net = resnet18(pretrained=True).to(device)
        self.head_pose_net.fc = nn.Linear(self.head_pose_net.fc.in_features, 6).to(device)  # 6 = rotation (3) + translation (3) parameters
        self.rotation_net = SixDRepNet_Detector()

        model = resnet18(pretrained=False, num_classes=512).to(device)  # feature_maps = resnet18(input_image) -> should print torch.Size([1, 512, 7, 7])
        # Remove the fully connected layer and the adaptive average pooling layer
        self.expression_net = nn.Sequential(*list(model.children())[:-1])
        self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d(FEATURE_SIZE)  # https://github.com/neeek2303/MegaPortraits/issues/3
        # self.expression_net.adaptive_pool = nn.AdaptiveAvgPool2d((7, 7))  # OPTIONAL 🤷 - 16x16 is better?

        ## TODO 2
        outputs = COMPRESS_DIM  # 512; convenient for the later WarpS2C operation (512 -> 2048 channels)
        self.fc = torch.nn.Linear(2048, outputs)

    def forward(self, x):
        # Rotation comes from the off-the-shelf SixDRepNet
        rotations, _ = self.rotation_net.predict(x)
        logging.debug(f"📐 rotation :{rotations}")

        head_pose = self.head_pose_net(x)

        # Split head pose into rotation and translation parameters;
        # the rotation half is unreliable, so only translation is used
        # rotation = head_pose[:, :3]
        translation = head_pose[:, 3:]

        # Forward pass image through expression network
        expression_resnet = self.expression_net(x)
        ### TODO 2
        expression_flatten = torch.flatten(expression_resnet, start_dim=1)
        expression = self.fc(expression_flatten)  # (bs, 2048) -> (bs, COMPRESS_DIM)

        return rotations, translation, expression
    # This encoder outputs head rotations R_s/d, translations t_s/d, and latent expression descriptors z_s/d

Consider that when training is underway, there's a mask that dictates where the head should be drawn... so the model kind of has to learn where to draw the head from the source.

@johndpope
Owner Author

I contacted the author, @chientv99, and he sent this: https://github.com/chientv99/maskpose

Unfortunately the pretrained weights are missing. :(

@robinchm

robinchm commented Jun 5, 2024


I browsed the code and paper a bit. If I'm not wrong, this project does not address translation at all.

I am pretraining a hopenet on resnet18 using the 300W-LP dataset. My observation is that the angles converge easily, but translation is much harder. Translation along x/y still seems to be converging, though slowly. Translation along z does not converge at all. This is probably because my augmentation does not crop aggressively enough (translation along z can only be modified by cropping + resizing).

@johndpope If you can contact the author, do you mind asking them which dataset they used to pretrain the model and how translation is predicted?

@johndpope
Owner Author

He forwarded a message to a lady to find the weights. I'm pretty sure from the train script that it's the same one as you're using. With the preprocessing steps to get images, there are some caveats.

MP (MegaPortraits):

  1. They don't do backgrounds
  2. Shoulders are also off
  3. They recreate videos with matting and third-party libraries

I'm working on gaze loss; it's converging, though my 3090 GPU is getting cooked...

Need some cloud compute.
#36

@robinchm

robinchm commented Jun 5, 2024


I don't quite get it. Are you saying that the author also seems to be pretraining a hopenet with the 300W-LP dataset? Any detail on the design of the prediction head: the same as the original hopenet, or using 6drepnet like what you implemented?

I understand that MegaPortraits needs matting and doesn't do backgrounds, but that should not affect how this module is pretrained.

And my translation is indeed converging, with the z axis slowest; I should have waited a bit longer.


@johndpope
Owner Author

In the paper they say the head is centered and cropped (if I recall correctly), so shifting left/right shouldn't matter. I can extend the frame augmentation to include both a zoomed-in and a sweet-spot crop; the model should learn to do both. My interest now is working on VASA, which will disentangle with a transformer. The portrait code is going to drop, so this will be a completely academic exercise.

@johndpope
Owner Author

This has different cropping from the V-Express findings - #36

@johndpope
Owner Author

FYI - #36 - this part of the architecture may be redundant.

@robinchm

robinchm commented Jun 6, 2024

Thanks a lot! The results in the referenced issue gave me a lot of confidence that warping is not a critical component. We can retrain with the module plugged in later if it turns out that explicit control of the pose is needed (it's sort of good to have in my case, but not absolutely necessary).
