How to train with custom dataset? #21

Closed
Unmesh28 opened this issue Jul 2, 2022 · 46 comments

Comments

@Unmesh28

Unmesh28 commented Jul 2, 2022

Hi, I am quite new to this. I am looking for a step-by-step guide to train on a custom dataset, OR to train on the AVSpeech dataset and fine-tune for other videos. The steps could be:

  • Download the dataset
  • Clean and convert to 25 fps [if the source is 30 fps, what should be done?]
  • Train
  • Fine-tune on custom videos
  • Test

I think such a guide will help a lot of people avoid confusion.

Thank You.

@ghost

ghost commented Jul 2, 2022

  • download the dataset
  • convert to 25 fps
  • change the audio sample rate to 16000 Hz
  • split the videos into clips shorter than 5 s
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • detect faces
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model
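
For concreteness, here is a minimal sketch of the first few steps (25 fps conversion, 16 kHz mono audio, splitting into clips of at most 5 s) using ffmpeg from Python. The folder names are placeholders and the re-encode settings are assumptions, not something prescribed by this repo:

```python
# Sketch only: normalise every video to 25 fps / 16 kHz mono audio, then cut
# it into <=5 s clips. Requires ffmpeg on PATH; paths are placeholders.
import subprocess
from pathlib import Path

RAW_DIR = Path("raw_videos")         # assumption: your downloaded dataset
OUT_DIR = Path("clips_25fps_16khz")  # assumption: output folder
OUT_DIR.mkdir(parents=True, exist_ok=True)

for video in sorted(RAW_DIR.glob("*.mp4")):
    fixed = OUT_DIR / f"{video.stem}_25fps.mp4"
    # Re-encode to 25 fps video and 16000 Hz mono audio.
    subprocess.run([
        "ffmpeg", "-y", "-i", str(video),
        "-r", "25", "-ar", "16000", "-ac", "1",
        str(fixed),
    ], check=True)

    # Cut into segments of at most 5 seconds (re-encoded, so cut points are exact).
    subprocess.run([
        "ffmpeg", "-y", "-i", str(fixed),
        "-f", "segment", "-segment_time", "5", "-reset_timestamps", "1",
        "-r", "25", "-ar", "16000", "-ac", "1",
        str(OUT_DIR / f"{video.stem}_%04d.mp4"),
    ], check=True)
```

Each resulting clip then goes through the sync filtering and face detection steps.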

@Unmesh28
Author

Unmesh28 commented Jul 2, 2022

@primepake Thanks.

But I was looking for the commands to run for each step, along with the directory structure and other related info required.

@ghost

ghost commented Jul 2, 2022

You should process your dataset carefully; it will affect your training.

@Unmesh28
Author

Unmesh28 commented Jul 2, 2022

Yeah, that's why I'm looking for a step-by-step guide from you. It would really help me. Can you provide a doc or README or something, so that anybody can just follow the steps and start training? I have gone through and done training using the 96x96 wav2lip repo, but I'm looking for higher-resolution results.

@lsw5835

lsw5835 commented Jul 7, 2022

Hi @primepake, thanks for your comments.
Could you let me know the location of the 'syncnet_python' file used to filter the dataset?

@Unmesh28
Author

Unmesh28 commented Jul 7, 2022

@donggeon I think he meant the "color_syncnet_train.py" file.

If you have figured out the previous steps, can you tell me what you did exactly?

@ghost

ghost commented Jul 7, 2022

You can use this repo: https://github.com/joonson/syncnet_python
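
As a rough illustration of how that repo is typically driven for this filtering step: run its pipeline on each short clip to get the AV offset, then keep only clips with an offset in [-3, 3] (ideally [-1, 1]). The `run_pipeline.py` / `run_syncnet.py` script names and the `--videofile/--reference/--data_dir` flags come from joonson/syncnet_python's README; the "AV offset" output parsing and the thresholds below are assumptions for this sketch, so verify them against your checkout:

```python
# Sketch only: score each clip with joonson/syncnet_python and keep clips
# whose audio-video offset is small. Assumes you run this from inside the
# syncnet_python checkout and that run_syncnet.py prints an "AV offset:" line
# (as the repo's demo does) -- verify against your copy.
import re
import subprocess
from pathlib import Path

CLIPS_DIR = Path("clips_25fps_16khz")  # assumption: the <=5 s clips from preprocessing
WORK_DIR = Path("syncnet_work")        # assumption: scratch dir for the pipeline
MAX_ABS_OFFSET = 3                     # keep [-3, 3]; tighten to 1 for best results

kept = []
clips = sorted(CLIPS_DIR.glob("*.mp4"))
for clip in clips:
    ref = clip.stem
    subprocess.run(
        ["python", "run_pipeline.py", "--videofile", str(clip),
         "--reference", ref, "--data_dir", str(WORK_DIR)],
        check=True,
    )
    result = subprocess.run(
        ["python", "run_syncnet.py", "--videofile", str(clip),
         "--reference", ref, "--data_dir", str(WORK_DIR)],
        check=True, capture_output=True, text=True,
    )
    match = re.search(r"AV offset:\s*(-?\d+)", result.stdout)
    if match and abs(int(match.group(1))) <= MAX_ABS_OFFSET:
        kept.append(clip)

print(f"kept {len(kept)} of {len(clips)} clips")
```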

@lsw5835

lsw5835 commented Jul 8, 2022

You can use this repo: https://github.com/joonson/syncnet_python

Thanks for your answer. I was wondering if you could give me some more detailed instructions.
Because we detected faces using the preprocessing code in wav2lip, it seems that we only need to check sync using some functions of the syncnet_python code.

@Unmesh28
Author

@primepake Can you please share detailed instructions for training?

@ghost

ghost commented Jul 12, 2022

I will release the code, it's a ton of code

@lsw5835

lsw5835 commented Jul 12, 2022

Thanks @primepake, it'd be a great help.

@Unmesh28
Author

I will release the code, it's a ton of code

Thanks 👍🏻

@skyler14

skyler14 commented Jul 14, 2022

Do you also have your checkpoint from the AVSpeech runs, prior to running on your private dataset? I'm interested in comparing how it turned out on your end vs. training via the instructions you provide.

@ghost

ghost commented Jul 14, 2022

I will publish the model pretrained on AVSpeech.

@skyler14

Great, thank you. Can you also leave an estimate of the GPU hardware and compute time it took for you to do the checkpoint training and fine-tuning?

@ghost

ghost commented Jul 19, 2022

I used 10 A6000 GPUs with nearly 200 GB of GPU memory.

@ghost

ghost commented Jul 19, 2022

The trial and error took just a day.

@skyler14

OK, great. For the public AVSpeech pretrained checkpoint: is it being put in the repo, as a link in the README, or just here in the issues?

@ghost

ghost commented Jul 19, 2022

For certain reasons, I will publish it another day.

@Unmesh28
Author

Unmesh28 commented Jul 19, 2022

@primepake When can you upload detailed training instructions for preprocessing, with the code for each step?

@crazyxprogrammer

Can you please provide detailed instructions for https://github.com/joonson/syncnet_python? How do I use this repo?

@sylvie-lauf

  • download the dataset
  • detect faces
  • convert to 25 fps
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model

Is this the correct order?

@ghost

ghost commented Jul 26, 2022

yes
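
For the remaining steps (face detection, the expert syncnet, then the generator), here is a sketch under the assumption that this fork keeps the original Wav2Lip entry points and flags (`preprocess.py`, `color_syncnet_train.py`, `hq_wav2lip_train.py`); confirm the script names, flags, and checkpoint filenames against this repo's README before running:

```python
# Sketch only: the downstream steps, assuming the original Wav2Lip entry
# points are kept in this fork. Paths and the checkpoint filename are
# placeholders.
import subprocess

steps = [
    # Face detection / cropping into the preprocessed folder layout.
    ["python", "preprocess.py",
     "--data_root", "filtered_clips/", "--preprocessed_root", "preprocessed/"],
    # Expert syncnet: train until the evaluation loss drops below ~0.25.
    ["python", "color_syncnet_train.py",
     "--data_root", "preprocessed/", "--checkpoint_dir", "checkpoints_syncnet/"],
    # Wav2Lip generator, supervised by the frozen expert syncnet.
    ["python", "hq_wav2lip_train.py",
     "--data_root", "preprocessed/", "--checkpoint_dir", "checkpoints_wav2lip/",
     "--syncnet_checkpoint_path", "checkpoints_syncnet/checkpoint.pth"],
]

for cmd in steps:
    subprocess.run(cmd, check=True)
```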

@skyler14

Any update on the AVSpeech-only checkpoint?

@ghost

ghost commented Jul 27, 2022

Hi, I updated my preprocessing steps; sorry about the wrong ordering.

ghost mentioned this issue Jul 27, 2022
@skyler14

How many videos did you need per the method you wrote for the fine-tuning step?

@crazyxprogrammer

crazyxprogrammer commented Jul 29, 2022

Can you give more details about the 4th step (split video less than 5 s)? Is this step included in clean_data.py? And for the fifth step (using syncnet_python to filter the dataset to the range [-3, 3]), should I only filter the dataset based on the offset given by syncnet_python, or should I also correct the synchronization?

@Unmesh28
Author

Unmesh28 commented Jul 29, 2022

  • download the dataset
  • convert to 25 fps
  • change the audio sample rate to 16000 Hz
  • split the videos into clips shorter than 5 s
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • detect faces
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model

@primepake What do you mean by "split the videos into clips shorter than 5 s"? Does it mean splitting longer videos into smaller clips with a duration of less than 5 seconds?

@ghost

ghost commented Jul 29, 2022

The lip-sync expert has many problems; you need to find them. As the author mentioned, it doesn't care about similarity between frames. You need to read the paper to understand more.

Does not reflect the real-world usage. As discussed before, during generation at test time, the model must not change the pose, as the generated face needs to be seamlessly pasted into the frame. However, the current evaluation framework feeds random reference frames in the input, thus demanding the network to change the pose. Thus, the above system does not evaluate how the model would be used in the real world.

@sylvie-lauf

@primepake I want to buy your model. Can you please share details at sylvie.nexus11@gmail.com?

@skyler14

skyler14 commented Aug 1, 2022

How many videos did you need per the method you wrote for the fine-tuning step?

I was just wondering if I could get this estimate for the fine-tuning after AVSpeech (number of videos and/or minutes of footage).

Also, any updates on the AVSpeech checkpoint?

@Unmesh28
Author

Unmesh28 commented Aug 4, 2022

When I am running syncnet_python, I am getting the error below:

WARNING: Audio (3.6720s) and video (3.7200s) lengths are different.
Traceback (most recent call last):
  File "run_syncnet.py", line 40, in <module>
    offset, conf, dist = s.evaluate(opt,videofile=fname)
  File "/home/ubuntu/wav2lip_288x288/syncnet_python/SyncNetInstance.py", line 112, in evaluate
    im_out = self.__S__.forward_lip(im_in.cuda());
  File "/home/ubuntu/wav2lip_288x288/syncnet_python/SyncNetModel.py", line 108, in forward_lip
    out = self.netfclip(mid);
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (4x278528 and 512x512)

Does anybody know how to resolve this?

@skyler14

skyler14 commented Aug 9, 2022

@primepake Thanks for the note about fine-tuning. Do you have any updates on:
  • roughly how many videos, what GPU hardware, and how much compute time the AVSpeech-only training took
  • the status of adding the AVSpeech checkpoint

Also, I was wondering whether fine-tuning generally needs to be done on a per-person basis, or whether your proprietary data was just a lot of people combined into one fine-tuned model?

@Unmesh28
Author

Unmesh28 commented Aug 22, 2022

@primepake I guess the issues are not resolved yet; why did you close them?

@ghost

ghost commented Aug 22, 2022

This is a problem in your code; you have to figure it out yourself. Just take a screenshot and leave it here so we can solve it. Thank you.

@Unmesh28
Author

I am using your exact code; I haven't changed it.

@ghost

ghost commented Aug 22, 2022

Did you change the input size in the inference file?
[screenshot]

@Unmesh28
Author

No, I did not change it; I have kept it as it is.

The only thing I tried changing is img_size = 288 in hparams.py.
I tried changing it to 192 when you suggested it, but I'm getting the same error for both 288 and 192.

@ghost

ghost commented Aug 22, 2022

You need to change args.img_size = 288 in inference.py.
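
In the original Wav2Lip, img_size is hard-coded in inference.py after the arguments are parsed rather than exposed as a flag; assuming this fork follows the same pattern, the change looks roughly like:

```python
# inference.py (sketch; the exact line location may differ in this fork)
args = parser.parse_args()
args.img_size = 288  # was 96 in the original Wav2Lip; must match the trained model
```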

@aishoot

aishoot commented Aug 23, 2022

  • download the dataset
  • convert to 25 fps
  • change the audio sample rate to 16000 Hz
  • split the videos into clips shorter than 5 s
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • detect faces
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model

Thanks for your nice work. I want to ask: why "split the videos into clips shorter than 5 s"? What effect does it have on the results? I split videos to a maximum of 20 s; is that OK?

@ghost

ghost commented Aug 23, 2022

  • download the dataset
  • convert to 25 fps
  • change the audio sample rate to 16000 Hz
  • split the videos into clips shorter than 5 s
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • detect faces
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model

Thanks for your nice work. I want to ask: why "split the videos into clips shorter than 5 s"? What effect does it have on the results? I split videos to a maximum of 20 s; is that OK?

To understand more, you should read the paper, but if a video is too long it can contain duplicated sounds, so a positive pair and a negative pair can end up being identical with high probability.
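
To make that concrete: the expert syncnet trains on (video window, audio window) pairs, where a "negative" pairs a 5-frame video window with an audio window taken from a different position in the same clip (this mirrors how Wav2Lip's color_syncnet_train.py samples pairs; the sketch below is simplified and its constants are assumptions). In a long clip that repeats the same phrase, the "wrong" audio window can still match the lips, so the negative label becomes noisy.

```python
# Simplified sketch of positive/negative pair sampling for the expert syncnet
# (modelled loosely on Wav2Lip's color_syncnet_train.py; constants assumed).
import random

SYNCNET_T = 5       # consecutive video frames per sample (0.2 s at 25 fps)
MEL_STEP_SIZE = 16  # mel-spectrogram frames covering those 5 video frames

def sample_pair(num_frames: int, mel_len: int):
    """Return (video_start, mel_start, label) for one training sample."""
    v_start = random.randint(0, num_frames - SYNCNET_T)
    if random.random() < 0.5:
        # Positive: the audio window aligned with the chosen video window
        # (80 mel frames per second with a 200-sample hop on 16 kHz audio).
        m_start = int(80.0 * v_start / 25.0)
        label = 1
    else:
        # "Negative": audio from some other position in the SAME clip. In a
        # long clip with repeated speech this window may still match the
        # lips, which is why short (<5 s) clips give cleaner negatives.
        m_start = random.randint(0, mel_len - MEL_STEP_SIZE)
        label = 0
    return v_start, m_start, label
```

The shorter the clip, the smaller the chance that a randomly chosen "wrong" window happens to contain the same mouth movements and audio as the aligned one.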

@aishoot

aishoot commented Aug 23, 2022

Thanks

@wllps1988315

I used 10 A6000 GPUs with nearly 200 GB of GPU memory.

How long did you train expert_syncnet and wav2lip using AVSpeech?

@ldz666666

ldz666666 commented Sep 28, 2023

  • download the dataset
  • convert to 25 fps
  • change the audio sample rate to 16000 Hz
  • split the videos into clips shorter than 5 s
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • detect faces
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model

Hi, why do we need to split the videos into clips shorter than 5 s to train the syncnet? What if I train with longer video clips of about 1 min?

@easonhyx

I will publish the model pretrained on AVSpeech.

Dear author, may I ask if the model pre-trained on the AVSpeech dataset can be made public? If there is a plan to make it public, may I ask when it will be available?

@1129571

1129571 commented Mar 4, 2024

  • download the dataset
  • convert to 25 fps
  • change the audio sample rate to 16000 Hz
  • split the videos into clips shorter than 5 s
  • use syncnet_python to filter the dataset to offsets in [-3, 3]; the model works best with [-1, 1]
  • detect faces
  • train expert_syncnet until the evaluation loss is < 0.25, then you can stop training
  • train the wav2lip model

Hello, I would like to know whether the "filter the dataset to offsets in [-3, 3]" you mentioned here refers to the offset, conf, or dist in the syncnet_python project.
My current understanding is:

  1. Offset in [-3, 3]?
  2. Confidence in [6, 9]?
  3. Can I refer to this issue in the original wav2lip: Advice on sync correcting videos Rudrabha/Wav2Lip#91?

Is my understanding correct?

This issue was closed.