
about hop_length #1

Open
ioyy900205 opened this issue Mar 29, 2022 · 10 comments

Comments

@ioyy900205

Hello author, in the config.py file, shouldn't hop_length be 256 instead of 32?

@jackhwalters
Owner

Hi, the hop_length variable in config.py is set to 32; in past commits it has been higher, though.

@ioyy900205
Author

I mean that, assuming a frame length of 32 ms and a frame shift of 16 ms, shouldn't hop_length be set to 256?
Also, I found a problem in the mag_phase_2_wav() function: in my tests the pad value should not be (0, 0, 0, 1) but (0, 0, 1, 0).
Secondly, in the istft call inside this function, you need to pass a window to recover the audio perfectly.
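
For reference, the arithmetic I have in mind, assuming a 16 kHz sample rate (the sample rate is an assumption and is not stated in this thread):

```python
# Frame-size arithmetic for a 32 ms frame and 16 ms shift at an assumed 16 kHz.
sample_rate = 16000          # Hz (assumed)
frame_ms, shift_ms = 32, 16  # frame length and frame shift in milliseconds

fft_size = sample_rate * frame_ms // 1000    # 32 ms -> 512 samples
hop_length = sample_rate * shift_ms // 1000  # 16 ms -> 256 samples

print(fft_size, hop_length)  # 512 256
```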

@jackhwalters
Owner

I've made sure to keep hop_length a power of 2 less than fft_size so that I am at liberty to experiment with various window types. I'd be reluctant to make hop_length 256, as that would come at the cost of time resolution.

I padded with (0, 0, 0, 1) so that I'm padding the DC bin with 0. I'd rather do that than add a 0 to the top frequency bin, which would have altered the signal. Great spot on the missing window in the ISTFT; I'm implementing that now.

@ioyy900205
Author

ioyy900205 commented Mar 30, 2022

Thank you for your reply, I understand what you mean, but I have experimented with the padding step in your code. Because you removed the DC component after the STFT, you need to add it back before the ISTFT, and the correct padding should be (0, 0, 1, 0): with the (0, 0, 0, 1) you used I could not recover the audio correctly, whereas with (0, 0, 1, 0) I recovered it perfectly.
Perhaps the issue is that the ISTFT treats the top row of its input as the low-frequency (DC) part.
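
Roughly what I mean, as a sketch (fft_size = 512 and the mag/phase names and shapes are assumptions here; only the pad order and the window argument are the points from this thread):

```python
import math
import torch
import torch.nn.functional as F

fft_size, hop_length = 512, 32
window = torch.hann_window(fft_size)

# Hypothetical magnitude/phase tensors with the DC bin already removed,
# i.e. shape (fft_size // 2, frames) instead of (fft_size // 2 + 1, frames).
frames = 100
mag = torch.rand(fft_size // 2, frames)
phase = (torch.rand(fft_size // 2, frames) * 2 - 1) * math.pi

# F.pad takes (last_dim_left, last_dim_right, 2nd_last_front, 2nd_last_back),
# so (0, 0, 1, 0) prepends one zero row along the frequency axis. That restores
# a zero DC bin at index 0, which is where torch.stft places DC; (0, 0, 0, 1)
# would instead append a row above the top frequency bin.
mag = F.pad(mag, (0, 0, 1, 0))
phase = F.pad(phase, (0, 0, 1, 0))

spec = torch.polar(mag, phase)  # complex spectrogram, (fft_size // 2 + 1, frames)

# Pass the same window as the forward STFT so the overlap-add is undone correctly.
wav = torch.istft(spec, n_fft=fft_size, hop_length=hop_length, window=window)
```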

@jackhwalters
Owner

I just checked this and you're right. I was assuming padding_bottom from the documentation would pad the bottom of the STFT, but it pads the bottom of the tensor, which is the reverse! I assume you calculated something like the L1 distance between the waveforms to show the difference?
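
i.e. something along these lines (a sketch with placeholder data, not code from this repo):

```python
import torch

# Round-trip a waveform through STFT and ISTFT and measure the L1 difference.
fft_size, hop_length = 512, 32
window = torch.hann_window(fft_size)

original = torch.randn(16000)  # stand-in for a real utterance
spec = torch.stft(original, n_fft=fft_size, hop_length=hop_length,
                  window=window, return_complex=True)
recon = torch.istft(spec, n_fft=fft_size, hop_length=hop_length,
                    window=window, length=original.shape[-1])

print((original - recon).abs().mean())  # close to 0 when the pipeline is correct
```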

@ioyy900205
Author

Hi, I plotted spectrograms of the original speech and the transformed speech to check whether the signal can be perfectly reconstructed.
I also found another problem: please check your complex_adaptive_max_pool2d function, which is currently identical to complex_adaptive_avg_pool2d. :(
I think this repo will help you:
https://github.com/TVS-AI/Pytorch_rppgs/blob/52568e4cc4a4069007bb4f65a6963901b0d0df91/nets/funcs/complexFunctions.py

@jackhwalters
Owner

I've been using https://github.com/wavefrontshaping/complexPyTorch/blob/master/complexPyTorch/complexFunctions.py as a reference, which looks similar to what you've sent. Good spot though, I must not have changed the function name when copying it over from complex_adaptive_avg_pool2d. I'm not convinced that max pooling the magnitudes of intermediate feature maps and then recombining them with the layer input's phase is the best way to go, considering that one of the advantages of complex networks is phase prediction from complex-valued feature maps. However, I can see why they wanted to perform max pooling on exclusively non-negative feature maps. I will try both!
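
For reference, a rough sketch of the magnitude-based max pooling, modelled on that complexPyTorch reference (the names here are illustrative, not necessarily what ends up in this repo):

```python
import torch
import torch.nn.functional as F

def complex_adaptive_max_pool2d(x: torch.Tensor, output_size):
    """Max-pool a complex tensor by the magnitude of its entries, then gather
    the corresponding complex values so that each pooled output keeps the
    phase of the element whose magnitude was maximal."""
    _, indices = F.adaptive_max_pool2d(x.abs(), output_size, return_indices=True)
    flat = x.flatten(start_dim=-2)  # flatten the spatial dims to (..., H * W)
    pooled = flat.gather(dim=-1, index=indices.flatten(start_dim=-2))
    return pooled.view_as(indices)

# Example: pool a (batch, channels, 8, 8) complex feature map down to 4 x 4.
x = torch.randn(2, 3, 8, 8, dtype=torch.complex64)
y = complex_adaptive_max_pool2d(x, (4, 4))
print(y.shape)  # torch.Size([2, 3, 4, 4])
```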

@ioyy900205
Author

The reason for using max pooling is to implement the CBAM attention mechanism. There are many attention mechanisms out there now, and your attempt would be a great contribution.

I also have a doubt: there seems to be no way to use DDP acceleration with the data in its current X.real, X.imag form, because the NCCL backend does not support PyTorch's complex tensors. I don't know whether PyTorch Lightning supports this.

@ioyy900205
Author

[Attached image: DCSNet network diagram]
I drew a diagram describing your network for the case where the input is (1, 256, 1876).

@jackhwalters
Owner

jackhwalters commented Apr 3, 2022

Yes, my use of max pooling is inspired by the channel attention modules in https://arxiv.org/abs/2102.01993, which are themselves inspired by CBAM. Regarding DDP, I ran into that same issue of not being able to use DDP with complex tensors; I actually created an issue in NCCL's repo about it. I considered using torch.view_as_real throughout the repo so that DDP could be used, but it didn't really seem worth it. Your diagram is correct, except that the input is 256x256 to account for variable sample length.
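
For illustration, the kind of view_as_real round trip I had in mind (a sketch, not code from this repo):

```python
import torch

# Reinterpret complex tensors as real ones with a trailing size-2 dimension
# (real, imag) before they have to cross an NCCL collective, then convert back.
x = torch.randn(4, 257, 256, dtype=torch.complex64)

x_real = torch.view_as_real(x)          # shape (4, 257, 256, 2), dtype float32
# ... gradient sync / all_reduce would happen on the real view here ...
x_back = torch.view_as_complex(x_real)  # shape (4, 257, 256), dtype complex64

print(torch.equal(x, x_back))  # True
```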
