Hi, I am trying to reproduce the results presented in the paper "Controllable and Interpretable Singing Voice Decomposition via Assem-VC" with the CSD and NUS-48E datasets, as well as with custom datasets. The paper states that "all singing voices are split between 1-12 seconds and used for training with corresponding lyrics". I understand that the original .wav files of the datasets need to be split into shorter .wav files before building the metadata files in the "path_to_wav|transcription|speaker_id" format. However, I can't find any code in the repository for doing this. How is this splitting process done? Is it done manually for all the datasets?
Thanks!
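
For context, I imagine the splitting could be scripted roughly along the following lines, though I don't know whether this matches what was actually done for the paper. This is only a sketch: the use of pydub, the silence thresholds, and the `split_wav` helper are all my own assumptions, not code from this repository.

```python
# Hypothetical helper (not part of the Assem-VC repo): split a long .wav on
# silences with pydub, keep only 1-12 s chunks, and emit metadata lines in the
# "path_to_wav|transcription|speaker_id" format mentioned above.
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

def split_wav(src_path, out_dir, speaker_id, transcriptions):
    """transcriptions: list of lyric strings, one per expected segment (assumed known in advance)."""
    os.makedirs(out_dir, exist_ok=True)
    audio = AudioSegment.from_wav(src_path)
    chunks = split_on_silence(
        audio,
        min_silence_len=500,              # ms of silence that ends a segment (tuning assumption)
        silence_thresh=audio.dBFS - 16,   # relative threshold (tuning assumption)
        keep_silence=200,                 # pad each chunk with a little silence
    )
    lines = []
    seg_idx = 0
    for chunk in chunks:
        dur = len(chunk) / 1000.0         # pydub lengths are in milliseconds
        if not (1.0 <= dur <= 12.0):      # keep only 1-12 s segments, per the paper
            continue
        out_path = os.path.join(out_dir, f"{seg_idx:04d}.wav")
        chunk.export(out_path, format="wav")
        text = transcriptions[seg_idx] if seg_idx < len(transcriptions) else ""
        lines.append(f"{out_path}|{text}|{speaker_id}")
        seg_idx += 1
    return lines
```

Even with something like this, the per-segment lyrics would still need to be assigned (or checked) by hand, which is why I'm asking whether the whole process was manual.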