wav_to_spectogram.py stops converting before it should #11

Closed
ibro45 opened this issue Nov 23, 2018 · 7 comments


ibro45 commented Nov 23, 2018

Hi,

I'm working with four languages and for each I have downloaded only one video so that I can check that the scripts work as they should before running them on my VM on the cloud.

The issue I have is that the script wav_to_spectogram.py acts weird with one language.
The languages and the number of segmented .wav files for each are:

  • Croatian - 64
  • English - 42
  • French - 39
  • Spanish - 45

So, the expected result after running the script is 38 or 39 .png spectrograms for each language, since the language with the fewest .wav files is French. It does execute as it should when I run it for all the languages except English:

[Screenshot: script output without English]

But running the script with English manages to count only 13 files in English, even though there are 42:

[Screenshot: script output with English]

I still haven't come up with an explanation for why it's happening, so any clue would be a great help!
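As a rough sketch of the expected count described above: if the script balances the classes, the number of spectrograms per language is bounded by the smallest class. The directory layout assumed by `count_wavs` (one folder of .wav segments per language) and both function names are hypothetical, not taken from the repository.

```python
from pathlib import Path


def count_wavs(root):
    """Return {language: number of .wav files} for each subdirectory of root.

    Assumes a hypothetical layout: root/<language>/*.wav
    """
    return {
        lang_dir.name: sum(1 for _ in lang_dir.glob("*.wav"))
        for lang_dir in sorted(Path(root).iterdir())
        if lang_dir.is_dir()
    }


def balanced_count(counts):
    """Upper bound on spectrograms per language when classes are balanced."""
    return min(counts.values())


# Segment counts reported in this issue.
counts = {"croatian": 64, "english": 42, "french": 39, "spanish": 45}
print(balanced_count(counts))  # 39
```

With English stopping at 13 files, the same bound would drop to 13 for every language, which matches the symptom reported here.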

Here's the sources.yml that I used to download the videos, in case anyone would like to check it themselves.

croatian:
  users:
    -
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdEh39boAuP-JPeDR7dy6wih

english:
  users:
    - 
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdHSp1oIY4L_t5xX0dFV3GMH

french:
  users:
    - 
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdEgT-oLhk11Xjbev7Q02F3-

spanish:
  users:
    - 
  playlists:
    - https://www.youtube.com/playlist?list=PLv3j2_RROTdHpKmps4DaomrqXd8VmZV1g

I'd also note that I'm working with 3-second segments, so anyone recreating what I'm doing needs to change the number of seconds by which the files are split. That is on line 66 in download_youtube.py, from:

command = ["ffmpeg", "-y", "-i", f, "-map", "0", "-ac", "1", "-ar", "16000", "-f", "segment", "-segment_time", "10", output_filename]

to:

command = ["ffmpeg", "-y", "-i", f, "-map", "0", "-ac", "1", "-ar", "16000", "-f", "segment", "-segment_time", "3", output_filename]

For the same reason, it is necessary to change the size of the output spectrogram on line 70 in wav_to_spectogram.py from:

parser.add_argument('--shape', dest='shape', default=[129, 500, 1], type=int, nargs=3)

to:

parser.add_argument('--shape', dest='shape', default=[129, 150, 1], type=int, nargs=3)
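The two defaults above follow from segment length times pixels per second. A minimal sketch of that arithmetic, assuming 50 pixels per second and a height of 129; the helper name `spectrogram_shape` is hypothetical:

```python
def spectrogram_shape(segment_seconds, pixels_per_second=50, height=129, channels=1):
    """Nominal [height, width, channels] for one audio segment."""
    return [height, segment_seconds * pixels_per_second, channels]


print(spectrogram_shape(10))  # [129, 500, 1]
print(spectrogram_shape(3))   # [129, 150, 1]
```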

Thank you!

Author

ibro45 commented Nov 24, 2018

Since I have mentioned that I'm using 3-second segments, I'm interested in what you think about increasing pixel_per_second from 50 to 100. Then I'd have 129x300x1 spectrograms, which might make it easier for the C(R)NN to detect patterns, wouldn't it? I'm still a newbie in this, sorry!

Member

Bartzi commented Nov 27, 2018

Hmm, interesting behaviour... could it be that your English samples contain lots of silence?
Have a look at this line of code. Everything that contains silence is just skipped.

To your second question:
That depends on the size of the actual regions in the voice samples. It could get better, but it might also not help... You might need to increase the size of the receptive field of the network in order to capture meaningful features.
But it is worth a try =)

Author

ibro45 commented Nov 30, 2018

I've checked the samples, they seem to be alright. I also tried commenting out the two lines that ignore samples containing lots of silence and the same behaviour was repeated.

I also tried it on my whole dataset. The first output comes after segmentation of the files and shows how many of them there are. The second is the output of wav_to_spectogram.py; as you can see, the same thing happened once again.

[Screenshot: the output]

And thanks for the advice regarding the 3-second segments! :)

Member

Bartzi commented Dec 8, 2018

I really don't know what the problem is...
The iterator definitely stops when working on English for some reason... but I'm afraid I cannot help you further from this end without access to the data...

Author

ibro45 commented Dec 8, 2018

Thanks for replying! If you're interested in taking a look at it, I have included the contents of sources.yml in the initial post. Each playlist contains just one video per language, meant for testing that everything behaves as it should before running it on the cloud, so downloading the data shouldn't be any trouble.

Author

ibro45 commented Dec 27, 2018

I seem to have figured out what was happening.

It isn't a problem isolated to the English samples I used. I eventually removed them from my big dataset and ran wav_to_spectogram.py again, and the same thing happened with French.

Basically, when the SpectrogramGenerator is run, those segmented files are turned into spectrograms by SoX. Since SoX calculates the width from the -X (capital X) parameter, which is the pixels-per-second parameter, it sometimes, for reasons unknown to me, outputs the wrong dimensions: [129, 149, 1] instead of [129, 150, 1].
(Note that I'm using 3-second segments and 50 pixels per second.)
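One plausible cause, not confirmed anywhere in this thread, is that some segments are a few samples shorter than 3 seconds and the duration-to-pixels conversion truncates. The sketch below only illustrates that assumption; `expected_width` and `wav_width` are hypothetical helpers built on Python's standard `wave` module, and the 16000 Hz rate is the one used in the ffmpeg command from the first post.

```python
import wave


def expected_width(n_frames, sample_rate, pixels_per_second=50):
    """Width in pixels, assuming the duration-to-pixels conversion truncates."""
    return n_frames * pixels_per_second // sample_rate


def wav_width(path, pixels_per_second=50):
    """Read a .wav header and return the truncated width for that file."""
    with wave.open(str(path), "rb") as wav:
        return expected_width(wav.getnframes(), wav.getframerate(), pixels_per_second)


# A full 3 s segment at 16 kHz has 48000 frames;
# under this assumption, one frame fewer already costs a pixel column.
print(expected_width(48000, 16000))  # 150
print(expected_width(47999, 16000))  # 149
```

Running `wav_width` over the segmented files and listing those that come out below 150 would be one way to check whether short segments line up with the spectrograms that have the wrong shape.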

Therefore, I tried adding the -x (lowercase x) parameter, which defines the overall width, with the appropriate value at this line.
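A minimal sketch of what such a call could look like. The flags `-x`, `-X`, `-y`, `-m`, `-r` and `-o` belong to SoX's `spectrogram` effect, but the overall command layout and the helper name `sox_spectrogram_command` are assumptions for illustration, not the repository's actual line.

```python
def sox_spectrogram_command(wav_path, png_path, height=129,
                            pixels_per_second=50, segment_seconds=3):
    """Build (but do not run) a SoX call that also pins the width with -x."""
    width = pixels_per_second * segment_seconds
    return [
        "sox", wav_path, "-n", "spectrogram",
        "-r",                           # raw spectrogram, no axes or legend
        "-m",                           # monochrome
        "-y", str(height),              # Y-axis size in pixels
        "-X", str(pixels_per_second),   # pixels per second
        "-x", str(width),               # X-axis size in pixels
        "-o", png_path,
    ]


cmd = sox_spectrogram_command("segment.wav", "segment.png")
print(" ".join(cmd))
# To execute it with SoX installed: subprocess.run(cmd, check=True)
```

Whether `-x` forces an exact width for inputs that are slightly too short is not established here, so checking the shape of the generated images remains worthwhile.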

It seems to have solved the issue, but I'd like to hear your thoughts on this. If that's fine, I'll make a pull request.

Thanks!

Member

Bartzi commented Jan 14, 2019

Hmm,

interesting problem. I'm not sure, but reading the SoX manual page, it seems that -x only sets the maximum width of the spectrogram. All in all, that should not be a problem, since the audio snippets should always have the same length, so I would be very happy to have a look at a nice PR 😄
