
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size #27

Closed
programmerworld123 opened this issue Aug 11, 2022 · 16 comments

Comments

@programmerworld123

programmerworld123 commented Aug 11, 2022

I trained with wloss_hq_wav2lip_train.py and used the checkpoint checkpoint_step000003000.pth for inference:

python inference.py --checkpoint_path "/content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth"  --face "/content/gdrive/MyDrive/Wav2Lip/video.mp4" --audio "/content/gdrive/MyDrive/Wav2Lip/input_audio.wav" 
Using cuda for inference.
Reading video frames...
Number of frames available for inference: 5760
/usr/local/lib/python3.7/dist-packages/librosa/core/audio.py:165: UserWarning: PySoundFile failed. Trying audioread instead.
  warnings.warn("PySoundFile failed. Trying audioread instead.")
(80, 222)
Length of mel chunks: 157
  0% 0/2 [00:00<?, ?it/s]
  0% 0/10 [00:00<?, ?it/s]
 10% 1/10 [00:06<00:56,  6.25s/it]
 20% 2/10 [00:07<00:26,  3.37s/it]
 30% 3/10 [00:08<00:17,  2.45s/it]
 40% 4/10 [00:10<00:12,  2.02s/it]
 50% 5/10 [00:11<00:08,  1.77s/it]
 60% 6/10 [00:13<00:06,  1.63s/it]
 70% 7/10 [00:14<00:04,  1.54s/it]
 80% 8/10 [00:15<00:02,  1.48s/it]
 90% 9/10 [00:17<00:01,  1.44s/it]
100% 10/10 [00:21<00:00,  2.10s/it]
Load checkpoint from: /content/gdrive/MyDrive/wav2lip_288x288/checkpoints/checkpoint.pth
Model loaded
  0% 0/2 [00:24<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 280, in <module>
    main()
  File "inference.py", line 263, in main
    pred = model(mel_batch, img_batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/content/gdrive/MyDrive/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size
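For reference, the failure itself is easy to reproduce in isolation: a 3x3 convolution with no padding cannot slide over a feature map that has already shrunk to 2x2. A minimal sketch (the channel counts are illustrative, not taken from the model):

import torch
import torch.nn as nn

# A 3x3 kernel needs at least a 3x3 (padded) input to produce any output.
conv = nn.Conv2d(512, 512, kernel_size=3, stride=1, padding=0)
x = torch.randn(1, 512, 2, 2)  # feature map already downsampled to 2x2
conv(x)  # raises the same RuntimeError as above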
@ghost

ghost commented Aug 11, 2022

Maybe your input size is not 192x192.

@programmerworld123
Author

programmerworld123 commented Aug 11, 2022

Do I have to resize the input video to 192x192 before running inference.py? Isn't this handled in inference.py?

If we have to resize the input video, how should it be done? I found this FFmpeg command:

ffmpeg -i movie.mp4 -vf scale=640:192 video.mp4

@ghost

ghost commented Aug 11, 2022

No, you should set it up in hparams.py.

@programmerworld123
Author

I have changed img_size from 288 to 192 in hparams.py, but I still get the same error:

       min_level_db=-100,
	ref_level_db=20,
	fmin=55,
	# Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To 
	# test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
	fmax=7600,  # To be increased/reduced depending on data.

	###################### Our training parameters #################################
	img_size=192,
	fps=25,

@Unmesh28

Yeah, I am also facing a similar issue even after changing img_size to 192 in hparams.py.

Any solution, @primepake?

@ghost

ghost commented Aug 12, 2022

It could be your hyperparameter setup.

@Unmesh28

# Default hyperparameters
hparams = HParams(
num_mels=80, # Number of mel-spectrogram channels and local conditioning dimensionality
# network
rescale=True, # Whether to rescale audio prior to preprocessing
rescaling_max=0.9, # Rescaling value

# Use LWS (https://github.com/Jonathan-LeRoux/lws) for STFT and phase reconstruction
# It"s preferred to set True to use with https://github.com/r9y9/wavenet_vocoder
# Does not work if n_ffit is not multiple of hop_size!!
use_lws=False,

n_fft=800,  # Extra window size is filled with 0 paddings to match this parameter
hop_size=200,  # For 16000Hz, 200 = 12.5 ms (0.0125 * sample_rate)
win_size=800,  # For 16000Hz, 800 = 50 ms (If None, win_size = n_fft) (0.05 * sample_rate)
sample_rate=16000,  # 16000Hz (corresponding to librispeech) (sox --i <filename>)

frame_shift_ms=None,  # Can replace hop_size parameter. (Recommended: 12.5)

# Mel and Linear spectrograms normalization/scaling and clipping
signal_normalization=True,
# Whether to normalize mel spectrograms to some predefined range (following below parameters)
allow_clipping_in_normalization=True,  # Only relevant if mel_normalization = True
symmetric_mels=True,
# Whether to scale the data to be symmetric around 0. (Also multiplies the output range by 2, 
# faster and cleaner convergence)
max_abs_value=4.,
# max absolute value of data. If symmetric, data will be [-max, max] else [0, max] (Must not 
# be too big to avoid gradient explosion, 
# not too small for fast convergence)
# Contribution by @begeekmyfriend
# Spectrogram Pre-Emphasis (Lfilter: Reduce spectrogram noise and helps model certitude 
# levels. Also allows for better G&L phase reconstruction)
preemphasize=True,  # whether to apply filter
preemphasis=0.97,  # filter coefficient.

# Limits
min_level_db=-100,
ref_level_db=20,
fmin=55,
# Set this to 55 if your speaker is male! if female, 95 should help taking off noise. (To 
# test depending on dataset. Pitch info: male~[65, 260], female~[100, 525])
fmax=7600,  # To be increased/reduced depending on data.

###################### Our training parameters #################################
img_size=192,
fps=25,

batch_size=4,
initial_learning_rate=1e-4,
nepochs=200000000000000000,  ### ctrl + c, stop whenever eval loss is consistently greater than train loss for ~10 epochs
num_workers=16,
checkpoint_interval=3000,
eval_interval=3000,
save_optimizer_state=True,

syncnet_wt=0.0, # is initially zero, will be set automatically to 0.03 later. Leads to faster convergence. 
syncnet_batch_size=64,
syncnet_lr=1e-4,
syncnet_eval_interval=10000,
syncnet_checkpoint_interval=10000,

disc_wt=0.07,
disc_initial_learning_rate=1e-4,

)

This is my hparams.py; I just changed img_size to 192 from your code.

@ghost

ghost commented Aug 12, 2022

Have you modified the model's hidden layers? My repo is only for a 288x288 input size.
If you want another input size, you need to remove some hidden layers.
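Roughly speaking, each stride-2 stage in the face encoder halves the spatial size, so an input smaller than 288x288 shrinks below the 3x3 kernel before the deepest blocks are reached. A back-of-the-envelope sketch (the number of stride-2 stages is illustrative, not read from the repo's model definition):

# Output size of a kernel-3, stride-2, padding-1 conv is ceil(size / 2).
def downsampled_size(size, n_stride2_stages=7):
    for _ in range(n_stride2_stages):
        size = (size + 1) // 2
    return size

for s in (96, 192, 288):
    print(s, "->", downsampled_size(s))
# 96 -> 1, 192 -> 2, 288 -> 3  (once the map drops below 3x3, a 3x3 kernel can no longer run)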

@Unmesh28

Have you modified the model's hidden layers? My repo is only for a 288x288 input size. If you want another input size, you need to remove some hidden layers.

So you're saying if I convert my input video to 288x288, it will work? Let me try that. Thanks.

@skyler14

@Unmesh28 is there a good way to reach out to you besides here? I was planning on trying to make an AVSpeech checkpoint, so we might as well pool our efforts and resources together.

@programmerworld123
Author

Have you modified the model's hidden layers? My repo is only for a 288x288 input size. If you want another input size, you need to remove some hidden layers.

Hi @primepake, I have not changed any layers. I used the training code as-is.

@Unmesh28

Unmesh28 commented Aug 12, 2022

I also used the 288x288 training code with all the preprocessing you mentioned here: #21 (comment)

I'm getting the same error:

Model loaded
  0%| | 0/14 [14:16<?, ?it/s]
Traceback (most recent call last):
  File "inference.py", line 280, in <module>
    main()
  File "inference.py", line 263, in main
    pred = model(mel_batch, img_batch)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/wav2lip_288x288/models/wav2lipv2.py", line 117, in forward
    x = f(x)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/wav2lip_288x288/models/conv2.py", line 16, in forward
    out = self.conv_block(x)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/ubuntu/anaconda3/envs/wav2lip/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 454, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Calculated padded input size per channel: (2 x 2). Kernel size: (3 x 3). Kernel size can't be greater than actual input size

I tried with the original img_size = 288 and also with 192 in hparams.py, and I get the same error for both.

@primepake please let me know if you know the reason or if I am doing anything wrong here.

@Unmesh28

@primepake Any suggestions on this?

@ghost ghost closed this as completed Aug 22, 2022
@Unmesh28

@primepake The issue is not resolved yet

@skyler14

So far I've recreated the issue with his sample checkpoint. Inside the forward function, a 3x3 kernel is trying to run over an input of shape (128, 512, 2, 2), resulting in the error.

This occurs at the self.conv_block(x) line in the Conv2d module during forward.

There actually isn't any spot I've observed where the project uses the hparams-specified img_size/resolution rather than 96 pixels (args.img_size is hardcoded to 96).

A few things I'm trying to figure out:

What is the input shape for a pretrained checkpoint supposed to be? From what I read, x seems to be part of the face embedding (coming out of self.face_encoder_blocks). Are the hardcoded 96 pixels and the lack of direct references to the hparams img_size leading to mismatched configs?

@ghost

ghost commented Aug 22, 2022

Replace args.img_size = 96 with 288.
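In the stock inference.py the image size is assigned directly on the parsed args rather than read from hparams, which is the hardcoded value @skyler14 pointed out. A sketch of the suggested one-line change (the exact location may differ between versions of the script):

# inference.py, after argument parsing
args.img_size = 288  # was: args.img_size = 96; must match the checkpoint's training resolution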

This issue was closed.