Training concept Issue: Use of Repetition for Short Motion Sequences #31

Open · rohitpaul24 opened this issue Jan 15, 2025 · 4 comments
rohitpaul24 commented Jan 15, 2025

Thank you again for the training script; I just have a few doubts about the training approach.

In the current implementation, short motion sequences are handled by repeating the motion data until it reaches the required length of 2 * n_motions. While this ensures a uniform batch size, it can break motion continuity and smoothness: the first frame of the repeated clip may differ sharply from the last frame of the original clip, creating a discontinuity at the seam.
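To make the discontinuity concrete, here is a minimal sketch of how I understand the repetition strategy; the function name, the (T, D) clip layout, and the 200-frame target (2 * n_motions with n_motions = 100) are my own illustration, not the repository's actual code:

```python
import numpy as np

def pad_by_repetition(motion: np.ndarray, target_len: int) -> np.ndarray:
    """Tile a (T, D) motion clip until it reaches target_len frames.

    Hypothetical sketch of the repetition strategy; the actual
    training script may implement this differently.
    """
    reps = int(np.ceil(target_len / motion.shape[0]))
    return np.tile(motion, (reps, 1))[:target_len]

# A drifting 80-frame clip: the jump between frame 79 (end of the
# original clip) and frame 80 (start of the repeat) can be arbitrarily large.
clip = np.cumsum(np.random.randn(80, 63), axis=0)
padded = pad_by_repetition(clip, 200)  # assuming 2 * n_motions = 200
seam_jump = np.linalg.norm(padded[80] - padded[79])
print(f"discontinuity at the seam: {seam_jump:.2f}")
```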

Potential Issue:
When the repeated frames are processed during training, the model might struggle to maintain smooth transitions, resulting in artifacts or jitter in the generated motion. This discontinuity could negatively impact the model's ability to learn realistic and smooth motion sequences.

Instead of repeating the motion data, would the following approach be better?
Neutral Source Motion Padding: use a predefined neutral motion state (e.g., a rest pose) for padding, plus an indicator to mark the padded frames as non-informative.
This would preserve the continuity of the original motion data and prevent the model from learning unrealistic transitions.
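Something like this is what I have in mind, assuming (T, D) coefficients and a rest-pose vector `neutral`; the mask could then exclude the padded frames from the loss:

```python
import numpy as np

def pad_with_neutral(motion: np.ndarray, target_len: int,
                     neutral: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pad a (T, D) clip to target_len with a fixed neutral pose and
    return a boolean mask marking which frames are informative."""
    T, D = motion.shape
    pad = np.broadcast_to(neutral, (target_len - T, D)).copy()
    padded = np.concatenate([motion, pad], axis=0)
    mask = np.zeros(target_len, dtype=bool)
    mask[:T] = True  # True = real frame, False = neutral padding
    return padded, mask

# Usage idea: exclude padded frames from the reconstruction loss,
# e.g. loss = ((pred - target) ** 2)[mask].mean()
```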

Questions:
What was the rationale behind using repetition instead of padding?
Did you test zero padding or neutral-source padding and decide against them?

Looking forward to your thoughts on this!

xuyangcao (Collaborator) commented:

Hi, thank you for your question. Yes, simply repeating the motion data may be one cause of unsmooth motions. In the latest version of the model we removed clips shorter than 4 s, so the repetition strategy is never triggered during training.

As for the two alternatives you mentioned, we have not tried them yet; feel free to experiment with them.

Looking forward to your feedback.

xuyangcao (Collaborator) commented:

Another strategy worth trying is to smooth the motions generated by LivePortrait during data preparation, inspired by this issue: KwaiVGI/LivePortrait#439
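For example, a temporal Savitzky-Golay filter over the extracted motion coefficients would be one simple way to do this; the window length and polynomial order below are only illustrative, not values from the referenced issue:

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_motion(coefs: np.ndarray, window: int = 9, polyorder: int = 2) -> np.ndarray:
    """Temporally smooth (T, D) motion coefficients extracted with
    LivePortrait before they are used for training."""
    return savgol_filter(coefs, window_length=window, polyorder=polyorder, axis=0)
```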

rohitpaul24 (Author) commented:

@xuyangcao Thanks for the reply

I have been testing a replacement for the repetition-based data: instead of cropping 200 frames, I use a sliding-window approach, since my video data is quite long. I also zero-pad the end of each sequence so its length is a multiple of 100 frames, similar to what we do at inference.
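A minimal sketch of my setup, assuming a (T, D) sequence; the 200-frame window and 100-frame stride are my own choices:

```python
import numpy as np

def make_windows(seq: np.ndarray, win: int = 200, stride: int = 100) -> list[np.ndarray]:
    """Zero-pad a long (T, D) sequence to a multiple of 100 frames
    (as at inference) and cut it into overlapping windows."""
    T, D = seq.shape
    pad = (-T) % 100  # frames needed to reach the next multiple of 100
    if pad:
        seq = np.concatenate([seq, np.zeros((pad, D), dtype=seq.dtype)], axis=0)
    return [seq[s:s + win] for s in range(0, len(seq) - win + 1, stride)]
```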

With this setup the validation loss is converging. However, the exp smooth loss is rising, even though its value stays on the order of 1e-6, and the exp velocity loss shows a zig-zag pattern while still converging.
Is this common, or is it something to do with my approach?

Thank you


johndpope commented Feb 9, 2025

In the VASA paper they use 50-frame windows with a stride of 25 to augment the data.
Presumably they found the 100-frame windows from DiffPoseTalk to be inferior; am I mistaken?
As I understand it, truncate_motion_coef_and_audio in utils/common.py helps the model produce output that is not tied to the window size (e.g., 7.3 seconds), while prev_motion / prev_audio should give the transformer enough context to create a smooth continuation; is that right?
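Rough arithmetic on how much the smaller stride augments the data (the clip length below is just an example):

```python
# Windows obtainable from a T-frame clip (illustrative numbers only).
def n_windows(T: int, win: int, stride: int) -> int:
    return max(0, (T - win) // stride + 1)

print(n_windows(500, 50, 25))    # 19 windows, VASA-style (win=50, stride=25)
print(n_windows(500, 100, 100))  # 5 windows, non-overlapping 100-frame crops
```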

Did removing the alignment mask do anything for training?
I'm running the experiment locally; roughly how long does it take to converge, and how many steps?
