Silence / Background Noise similarity #62
Happy to hear that! So from what I can say, the model was trained on clean speech, without silence or background noise. So technically, the model has only heard clear voices so far. To draw a parallel with a simple cat/dog classifier, it would be like showing the model a car: it would still predict either a cat or a dog.
Yes, it's true. I'm sure the model can be smart enough to learn this too.
Hello! I've taken the repo/dataset and combined it with the VoxCeleb2 dataset (6,112 speakers). I also added a 'speaker' that was composed of a bunch of noise/silence samples. After I processed the VoxCeleb data into the same format (FLAC, 16 kHz, 24-bit samples) as the LibriSpeech data, I made another pass over both datasets, and for every utterance I created 2 new training examples that were combined with random noise selected from https://github.com/microsoft/MS-SNSD. That resulted in around 730 GB of training data. I've added 1k speakers to the initial classifier/softmax training and am currently running that training. Once it's complete I'll run the triplet loss training and share the code/weights. I'm running it on a 2080 Ti with 64 GB of RAM, and I needed a bit over 200 GB of swap space to keep the OOM killer at bay. An epoch is currently taking slightly over 1 hour. Talk to you in a week or two :)
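The noise-augmentation step described above (mixing random MS-SNSD noise into each clean utterance) can be sketched with plain NumPy. The function name and SNR handling here are illustrative assumptions, not the thread's actual create_noise.py:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise clip into a clean utterance at a target SNR in dB.

    Both arguments are 1-D float sample arrays; the noise is tiled or
    truncated to match the utterance length before scaling.
    """
    if len(noise) < len(clean):
        noise = np.tile(noise, len(clean) // len(noise) + 1)
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Running each utterance through this twice with freshly sampled noise clips would yield the "2 new training examples" per utterance mentioned above.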
@w1nk AWESOME! Please let us know how it goes :)
Just an update: I ended up needing to switch versions of TensorFlow (switched to 2.3); 2.2 has a nasty memory leak that was getting triggered. Once I got things running stably, the softmax network converged and I early-stopped it at epoch 38, then started training the triplet loss. That network is currently still training, but is getting close:
2000/2000 [==============================] - 815s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0230 - val_loss: 0.0221
It looks like it's fitting nicely, and spot-checking some of the later epochs looks pretty good as well. I'll find somewhere to put the checkpoints and a couple of the preparation scripts.
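For readers following along, the triplet objective being optimized here can be sketched in a few lines of NumPy. The cosine formulation and the 0.2 margin are illustrative assumptions, not necessarily the repo's exact loss:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity: a same-speaker pair
    (anchor, positive) should score higher than a different-speaker pair
    (anchor, negative) by at least `margin`; otherwise a penalty accrues."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)
```

A val_loss near 0.02, as in the log above, means most triplets in the batch already satisfy the margin.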
@w1nk very cool!
How are you splitting the train/val/test dataset? I found in the code that train/val/test come from the same speakers. Have you tried splitting the dataset by speaker, so different speakers end up in different splits? I'm also curious about your results.
Hey @ntdat017, I haven't modified the training harness at all, so the validation split is being calculated as it's written. For test, I've got a holdout set of data from the VoxCeleb dataset that I'll use to perform the evaluation.
Sorry for the delay, it's been a busy week. The triplet training finally converged after a bit over 600 epochs. I haven't had a chance to fully evaluate the output yet, but I've gone ahead and uploaded the checkpoints and some helper scripts I used, in case anyone reading along is interested: https://drive.google.com/drive/folders/1EExljgrj3kP-ciUzrsdoWYE5OT14_7Aa
sha256 hashes:
There are 3 files there: the 2 checkpoints (softmax + triplet) and a tar file with some helper scripts. The helper Python scripts probably don't run out of the box, but they're pretty simple and should be easy to fix up.
process_vox.py - generates a file that can be split/executed as bash commands to convert the VoxCeleb speech files into the correct naming scheme and proper encoding (requires ffmpeg with FLAC support).
create_noise.py - uses random noise samples from https://github.com/microsoft/MS-SNSD to generate 'noisy' versions of each input audio clip.
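A minimal sketch of what a script like process_vox.py might do: walk the VoxCeleb2 tree and emit one ffmpeg command per clip, converting to 16 kHz mono FLAC under a speaker-prefixed naming scheme. The directory layout and naming here are my assumptions, not the uploaded script's:

```python
from pathlib import Path

def vox_to_commands(vox_root, out_root):
    """Yield shell commands converting VoxCeleb2 .m4a clips to FLAC.

    Assumes the usual VoxCeleb2 layout:
    <root>/<speaker_id>/<session>/<clip>.m4a
    """
    for m4a in sorted(Path(vox_root).rglob("*.m4a")):
        speaker = m4a.parent.parent.name
        out = Path(out_root) / f"{speaker}_{m4a.parent.name}_{m4a.stem}.flac"
        # -ac 1: downmix to mono; -ar 16000: resample to 16 kHz.
        yield f"ffmpeg -i '{m4a}' -ac 1 -ar 16000 '{out}'"
```

The emitted lines can then be split into chunks and executed in parallel from bash, as described above.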
@w1nk that's really awesome!!!! I'm going to have a look this weekend.
I got an error when loading this model:
model = keras.models.load_model('ResCNN_triplet_checkpoint_613.h5', compile=False)
What are your versions of Keras, TensorFlow, and Python?
@demonstan the ones specified in the requirements.txt of the repo.
@w1nk Did you perform evaluation on any dataset?
@demonstan I've not had a chance to perform the evaluation fully yet. Since I trained on all of LibriSpeech and all the VoxCeleb2 training data, I need to take the VoxCeleb2 test set, convert/rename it to the correct format, and evaluate on that. As for loading, it should load with TF 2.1/2.2/2.3 (I tried all of them), along with 1.15 as well. I was loading the model across those versions trying to get the tflite/Coral compilation to work (hint: I haven't yet, due to a Coral compiler issue).
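Once the test trials are prepared, the evaluation reduces to scoring embedding pairs with cosine similarity and sweeping a threshold to find the equal error rate. A minimal sketch of that scoring loop (not the official VoxCeleb tooling):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """Threshold sweep over trial scores (label 1 = same speaker).

    Returns the error rate at the point where the false-accept rate and
    the false-reject rate cross.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(scores):
        far = float(np.mean(scores[labels == 0] >= t))  # impostors accepted
        frr = float(np.mean(scores[labels == 1] < t))   # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```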
May I ask why you are not using the EarlyStopping and ReduceLROnPlateau callbacks here? (Lines 40 to 42 in 7742796)
@demonstan they could be used, indeed. It's just that I always saw the loss decreasing steadily and didn't think it was a necessity. Overfitting on this dataset would have been a pretty big challenge; the loss looked like an exponentially decreasing function on both the training and testing sets.
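For completeness, the patience rule those Keras callbacks apply is simple. A pure-Python sketch of EarlyStopping's stopping logic (parameter names mirror Keras, but this is an illustration, not the library code):

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would halt: stop once the
    validation loss has failed to improve by `min_delta` for `patience`
    consecutive epochs; otherwise run to the final epoch."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1
```

With the steadily decreasing loss described above, the counter would keep resetting and the callback would never fire, which matches the decision to leave it out.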
It may also be helpful to use SoX to remove silence and background noise. That's what I usually do: denoise and split by silence, then compute embeddings.
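A sketch of that preprocessing as a SoX invocation, built from Python. The silence thresholds (1% amplitude, 0.1 s) are illustrative assumptions that usually need tuning per corpus:

```python
def sox_trim_cmd(infile, outfile, threshold="1%", duration="0.1"):
    """Build a SoX command stripping leading and trailing silence.

    `silence 1 <duration> <threshold>` trims leading silence only, so the
    classic `reverse` trick is used to trim trailing silence as well.
    """
    eff = f"silence 1 {duration} {threshold}"
    return f"sox {infile} {outfile} {eff} reverse {eff} reverse"
```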
Good point. |
Linked to the README for reference. |
I've been having fun playing with your pre-trained model and implementation!
I've noticed a phenomenon that could be a point of improvement. When you record silence or background noise and extract the features from that, say silent_features, it has a strong cosine_similarity to anything. I was wondering whether, if you trained the model with various background noises / silence in the train set and labeled them all silent_features, it would learn to predict the various silent_features and distinguish them from voices.
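One way to act on this idea without retraining: average the embeddings of a handful of noise/silence clips into a centroid and reject any query embedding that lands too close to it. A sketch with toy vectors; the 0.7 threshold is an assumption that would need calibration against real embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_non_speech(embedding, silence_embeddings, threshold=0.7):
    """Flag an embedding as silence/noise when its cosine similarity to
    the averaged 'silence speaker' centroid exceeds the threshold."""
    centroid = np.mean(np.asarray(silence_embeddings, dtype=float), axis=0)
    return cosine(embedding, centroid) >= threshold
```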