
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size #27

Closed
programmerworld123 opened this issue Aug 11, 2022 · 16 comments

Comments

@programmerworld123

programmerworld123 commented Aug 11, 2022

I trained with wloss_hq_wav2lip_train.py and used the checkpoint checkpoint_step000003000.pth for inference:

python inference.py --checkpoint_path "/content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth"  --face "/content/gdrive/MyDrive/Wav2Lip/video.mp4" --audio "/content/gdrive/MyDrive/Wav2Lip/input_audio.wav" 
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 5760
/usr/local/lib/python3.7/dist-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
(80, 222)
Length of mel chunks: 157
  0% 0/2 [00:00<?, ?it/s]
  0% 0/10 [00:00<?, ?it/s]
 10% 1/10 [00:06<00:56,  6.25s/it]
 20% 2/10 [00:07<00:26,  3.37s/it]
 30% 3/10 [00:08<00:17,  2.45s/it]
 40% 4/10 [00:10<00:12,  2.02s/it]
 50% 5/10 [00:11<00:08,  1.77s/it]
 60% 6/10 [00:13<00:06,  1.63s/it]
 70% 7/10 [00:14<00:04,  1.54s/it]
 80% 8/10 [00:15<00:02,  1.48s/it]
 90% 9/10 [00:17<00:01,  1.44s/it]
100% 10/10 [00:21<00:00,  2.10s/it]
Load checkpoint from: /content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth
Model loaded
  0% 0/2 [00:24<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 280, in <module>
    main()
  File "inference.py", line 263, in main
    pred = model(mel_batch, img_batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
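For reference, the failure itself is easy to reproduce in isolation: a 3x3 convolution with no padding cannot slide over a feature map that has already shrunk to 2x2. A minimal sketch (the channel counts are illustrative, not taken from the model):

import torch
import torch.nn as nn

# A 3x3 kernel needs at least a 3x3 (padded) input to produce any output.
conv = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 512, 2, 2)  # feature map already downsampled to 2x2
conv(x)  # raises the same RuntimeError as above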
@ghost

ghost commented Aug 11, 2022

Maybe your input size is not 192x192.

@programmerworld123
Author

programmerworld123 commented Aug 11, 2022

Do I have to resize the input video to 192x192 before running inference.py? Isn't this handled in inference.py?

If we have to resize the input video, how should it be done? I found this FFmpeg command:

ffmpeg -i movie.mp4 -vf scale=640:192 video.mp4

@ghost

ghost commented Aug 11, 2022

No, you should set it up in hparams.py.

@programmerworld123
Author

I have changed img_size from 288 to 192 in hparams.py, but I still get the same error:

       min_level_db=-100,
	ref_level_db=20,
	fmin=55,
	# Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To 
	# test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
	fmax=7600,  # To be increased/reduced depending on data.

	###################### Our training parameters #################################
	img_size=192,
	fps=25,

@Unmesh28

Yeah, I am also facing a similar issue even after changing img_size to 192 in hparams.py.

Any solution, @primepake?

@ghost

ghost commented Aug 12, 2022

It could be your hyperparameter setup.

@Unmesh28

# Default hyperparameters
hparams = HParams(
num_mels=80, # Number of mel-spectrogram channels and local conditioning dimensionality
# network
rescale=True, # Whether to rescale audio prior to preprocessing
rescaling_max=0.9, # Rescaling value

# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
# It"s preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
# Does not work if n_ffit is not multiple of hop_size!!
use_lws=False,

n_fft=800,  # Extra window size is filled with 0 paddings to match this parameter
hop_size=200,  # For 16000Hz, 200 = 12.5 ms (0.0125 * sample_rate)
win_size=800,  # For 16000Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
sample_rate=16000,  # 16000Hz (corresponding to librispeech) (sox --i <filename>)

frame_shift_ms=None,  # Can replace hop_size parameter. (Recommended: 12.5)

# Mel and Linear spectrograms normalization/scaling and clipping
signal_normalization=True,
# Whether to normalize mel spectrograms to some predefined range (following below parameters)
allow_clipping_in_normalization=True,  # Only relevant if mel_normalization = True
symmetric_mels=True,
# Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2, 
# faster and cleaner convergence)
max_abs_value=4.,
# max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not 
# be too big to avoid gradient explosion, 
# not too small for fast convergence)
# Contribution by @begeekmyfriend
# Spectrogram Pre-Emphasis (Lfilter: Reduce spectrogram noise and helps model certitude 
# levels. Also allows for better G&L phase reconstruction)
preemphasize=True,  # whether to apply filter
preemphasis=0.97,  # filter coefficient.

# Limits
min_level_db=-100,
ref_level_db=20,
fmin=55,
# Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To 
# test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
fmax=7600,  # To be increased/reduced depending on data.

###################### Our training parameters #################################
img_size=192,
fps=25,

batch_size=4,
initial_learning_rate=1e-4,
nepochs=200000000000000000,  ### ctrl + c, stop whenever eval loss is consistently greater than train loss for ~10 epochs
num_workers=16,
checkpoint_interval=3000,
eval_interval=3000,
save_optimizer_state=True,

syncnet_wt=0.0, # is initially zero, will be set automatically to 0.03 later. Leads to faster convergence. 
syncnet_batch_size=64,
syncnet_lr=1e-4,
syncnet_eval_interval=10000,
syncnet_checkpoint_interval=10000,

disc_wt=0.07,
disc_initial_learning_rate=1e-4,

)

This is my hparams.py; I just changed img_size to 192 from your code.

@ghost

ghost commented Aug 12, 2022

Have you modified the model's hidden layers? My repo is only for a 288x288 input size.
If you want another input size, you need to remove some hidden layers.
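Roughly speaking, each stride-2 stage in the face encoder halves the spatial size, so an input smaller than 288x288 shrinks below the 3x3 kernel before the deepest blocks are reached. A back-of-the-envelope sketch (the number of stride-2 stages is illustrative, not read from the repo's model definition):

# Output size of a kernel-3, stride-2, padding-1 conv is ceil(size / 2).
def downsampled_size(size, n_stride2_stages=7):
    for _ in range(n_stride2_stages):
        size = (size + 1) // 2
    return size

for s in (96, 192, 288):
    print(s, "->", downsampled_size(s))
# 96 -> 1, 192 -> 2, 288 -> 3  (once the map drops below 3x3, a 3x3 kernel can no longer run)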

@Unmesh28

Have you modified the model's hidden layers? My repo is only for a 288x288 input size. If you want another input size, you need to remove some hidden layers.

So you're saying if I convert my input video to 288x288, it will work? Let me try that. Thanks.

@skyler14

@Unmesh28 is there a good way to reach out to you besides here? I was planning on trying to make an AVSpeech checkpoint, so we might as well pool our efforts and resources together.

@programmerworld123
Author

Have you modified the model's hidden layers? My repo is only for a 288x288 input size. If you want another input size, you need to remove some hidden layers.

Hi @primepake, I have not changed any layers. I used the training code as-is.

@Unmesh28

Unmesh28 commented Aug 12, 2022

I also used the 288x288 training code with all the preprocessing you mentioned here: #21 (comment)

I'm getting the same error:

Model loaded
  0%| | 0/14 [14:16<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 280, in <module>
    main()
  File "inference.py", line 263, in main
    pred = model(mel_batch, img_batch)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

I tried with the original img_size = 288 and also with 192 in hparams.py, and I get the same error for both.

@primepake please let me know if you know the reason or if I am doing anything wrong here.

@Unmesh28

@primepake Any suggestions on this?

@ghost ghost closed this as completed Aug 22, 2022
@Unmesh28

@primepake The issue is not resolved yet

@skyler14

So far I've recreated the issue with his sample checkpoint. Inside the forward function, a 3x3 kernel is trying to run over an input of shape (128, 512, 2, 2), resulting in the error.

This occurs at the self.conv_block(x) line in the Conv2d module during forward.

There actually isn't any spot I've observed where the project uses the hparams-specified img_size/resolution rather than 96 pixels (args.img_size is hardcoded to 96).

A few things I'm trying to figure out:

What is the input shape for a pretrained checkpoint supposed to be? From what I read, x seems to be part of the face embedding (coming out of self.face_encoder_blocks). Are the hardcoded 96 pixels and the lack of direct references to the hparams img_size leading to mismatched configs?

@ghost

ghost commented Aug 22, 2022

Replace args.img_size = 96 with 288.
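In the stock inference.py the image size is assigned directly on the parsed args rather than read from hparams, which is the hardcoded value @skyler14 pointed out. A sketch of the suggested one-line change (the exact location may differ between versions of the script):

# inference.py, after argument parsing
args.img_size = 288  # was: args.img_size = 96; must match the checkpoint's training resolution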

This issue was closed.