Silence / Background Noise similarity #62
Happy to hear that! So from what I can say, the model was trained on clean speech, without silence or background noise. So technically, the model has only heard clear voices so far. To draw a parallel with a simple cat/dog classifier, it would be like showing the model a car: it would still predict either a cat or a dog.
Yes, it's true. I'm sure the model can be smart enough to learn this too.
Hello! I've taken the repo/dataset and combined it with the VoxCeleb2 dataset (6,112 speakers). I also added a 'speaker' that was composed of a bunch of noise/silence samples. After I processed the VoxCeleb data into the same format (FLAC, 16 kHz, 24-bit samples) as the LibriSpeech data, I made another pass over both datasets, and for every utterance I created 2 new training examples that were combined with random noise selected from https://github.com/microsoft/MS-SNSD. That resulted in around 730 GB of training data. I've added 1k speakers to the initial classifier/softmax training and am currently running that training. Once it's complete I'll run the triplet loss training and share the code/weights. I'm running it on a 2080 Ti with 64 GB of RAM, and I needed a bit over 200 GB of swap space to keep the OOM killer at bay. An epoch is currently taking slightly over 1 hour. Talk to you in a week or two :)
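The noise-augmentation step described above (mixing random MS-SNSD noise into each clean utterance) can be sketched with plain NumPy. The function name and SNR handling here are illustrative assumptions, not the thread's actual create_noise.py:

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Mix a noise clip into a clean utterance at a target SNR in dB.

    Both arguments are 1-D float sample arrays; the noise is tiled or
    truncated to match the utterance length before scaling.
    """
    if len(noise) < len(clean):
        noise = np.tile(noise, len(clean) // len(noise) + 1)
    noise = noise[: len(clean)]
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so 10*log10(clean_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise
```

Running each utterance through this twice with freshly sampled noise clips would yield the "2 new training examples" per utterance mentioned above.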
@w1nk AWESOME! Please let us know how it goes :)
Just an update: I ended up needing to switch versions of TensorFlow (switched to 2.3); 2.2 has a nasty memory leak that was getting triggered. Once I got things running stably, the softmax network converged and I early-stopped it at epoch 38, then started training the triplet loss. That network is currently still training, but is getting close:
2000/2000 [==============================] - 815s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0230 - val_loss: 0.0221
It looks like it's fitting nicely, and spot-checking some of the later epochs looks pretty good as well. I'll find somewhere to put the checkpoints and a couple of the preparation scripts.
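For readers following along, the triplet objective being optimized here can be sketched in a few lines of NumPy. The cosine formulation and the 0.2 margin are illustrative assumptions, not necessarily the repo's exact loss:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine similarity: a same-speaker pair
    (anchor, positive) should score higher than a different-speaker pair
    (anchor, negative) by at least `margin`; otherwise a penalty accrues."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, cos(anchor, negative) - cos(anchor, positive) + margin)
```

A val_loss near 0.02, as in the log above, means most triplets in the batch already satisfy the margin.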
@w1nk very cool!
How are you splitting the train/val/test dataset? I found in the code that train/val/test come from the same speakers. Have you tried splitting the dataset by speaker, so different speakers end up in different splits? I'm also curious about your results.
Hey @ntdat017, I haven't modified the training harness at all, so the validation split is being calculated as it's written. For test, I've got a holdout set of data from the VoxCeleb dataset that I'll use to perform the evaluation.
Sorry for the delay, it's been a busy week. The triplet training finally converged after a bit over 600 epochs. I haven't had a chance to fully evaluate the output yet, but I've gone ahead and uploaded the checkpoints and some helper scripts I used, in case anyone reading along is interested: https://drive.google.com/drive/folders/1EExljgrj3kP-ciUzrsdoWYE5OT14_7Aa
sha256 hashes:
There are 3 files there: the 2 checkpoints (softmax + triplet) and a tar file with some helper scripts. The helper Python scripts probably don't run out of the box, but they're pretty simple and should be easy to fix up.
process_vox.py - generates a file that can be split/executed as bash commands to convert the VoxCeleb speech files into the correct naming scheme and proper encoding (requires ffmpeg with FLAC support).
create_noise.py - uses random noise samples from https://github.com/microsoft/MS-SNSD to generate 'noisy' versions of each input audio clip.
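A minimal sketch of what a script like process_vox.py might do: walk the VoxCeleb2 tree and emit one ffmpeg command per clip, converting to 16 kHz mono FLAC under a speaker-prefixed naming scheme. The directory layout and naming here are my assumptions, not the uploaded script's:

```python
from pathlib import Path

def vox_to_commands(vox_root, out_root):
    """Yield shell commands converting VoxCeleb2 .m4a clips to FLAC.

    Assumes the usual VoxCeleb2 layout:
    <root>/<speaker_id>/<session>/<clip>.m4a
    """
    for m4a in sorted(Path(vox_root).rglob("*.m4a")):
        speaker = m4a.parent.parent.name
        out = Path(out_root) / f"{speaker}_{m4a.parent.name}_{m4a.stem}.flac"
        # -ac 1: downmix to mono; -ar 16000: resample to 16 kHz.
        yield f"ffmpeg -i '{m4a}' -ac 1 -ar 16000 '{out}'"
```

The emitted lines can then be split into chunks and executed in parallel from bash, as described above.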
@w1nk that's really awesome!!!! I'm going to have a look this weekend.
I got an error when loading this model:
model = keras.models.load_model('ResCNN_triplet_checkpoint_613.h5', compile=False)
What are your versions of Keras, TensorFlow, and Python?
@demonstan the ones specified in the requirements.txt of the repo.
@w1nk Did you perform evaluation on any dataset?
@demonstan I've not had a chance to perform the evaluation fully yet. Since I trained on all of LibriSpeech and all the VoxCeleb2 training data, I need to take the VoxCeleb2 test set, convert/rename it to the correct format, and evaluate on that. As for loading, it should load with TF 2.1/2.2/2.3 (I tried all of them), along with 1.15 as well. I was loading the model across those versions trying to get the tflite/Coral compilation to work (hint: I haven't yet, due to a Coral compiler issue).
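Once the test trials are prepared, the evaluation reduces to scoring embedding pairs with cosine similarity and sweeping a threshold to find the equal error rate. A minimal sketch of that scoring loop (not the official VoxCeleb tooling):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def equal_error_rate(scores, labels):
    """Threshold sweep over trial scores (label 1 = same speaker).

    Returns the error rate at the point where the false-accept rate and
    the false-reject rate cross.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, best_eer = np.inf, 1.0
    for t in np.sort(scores):
        far = float(np.mean(scores[labels == 0] >= t))  # impostors accepted
        frr = float(np.mean(scores[labels == 1] < t))   # targets rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```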
May I ask why you are not using the EarlyStopping and ReduceLROnPlateau callbacks here? (Lines 40 to 42 in 7742796)
@demonstan they could be used, indeed. It's just that I always saw the loss decreasing steadily and didn't think it was a necessity. Overfitting on this dataset would have been a pretty big challenge; the loss looked like an exponentially decreasing function on both the training and testing sets.
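For completeness, the patience rule those Keras callbacks apply is simple. A pure-Python sketch of EarlyStopping's stopping logic (parameter names mirror Keras, but this is an illustration, not the library code):

```python
def early_stop_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would halt: stop once the
    validation loss has failed to improve by `min_delta` for `patience`
    consecutive epochs; otherwise run to the final epoch."""
    best, wait = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best, wait = loss, 0  # improvement: reset the patience counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses) - 1
```

With the steadily decreasing loss described above, the counter would keep resetting and the callback would never fire, which matches the decision to leave it out.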
It may also be helpful to use SoX to remove silence and background noise. That's what I usually do: denoise and split by silence, then compute embeddings.
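A sketch of that preprocessing as a SoX invocation, built from Python. The silence thresholds (1% amplitude, 0.1 s) are illustrative assumptions that usually need tuning per corpus:

```python
def sox_trim_cmd(infile, outfile, threshold="1%", duration="0.1"):
    """Build a SoX command stripping leading and trailing silence.

    `silence 1 <duration> <threshold>` trims leading silence only, so the
    classic `reverse` trick is used to trim trailing silence as well.
    """
    eff = f"silence 1 {duration} {threshold}"
    return f"sox {infile} {outfile} {eff} reverse {eff} reverse"
```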
Good point. |
Linked to the README for reference. |
I've been having fun playing with your pre-trained model and implementation!
I've noticed a phenomenon that could be a point of improvement. When you record silence or background noise and extract the features from that, say silent_features, it has a strong cosine_similarity to anything. I was wondering whether, if you trained the model with various background noises / silence in the train set and labeled them all silent_features, it would learn to predict the various silent_features and distinguish them from voices.
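One way to act on this idea without retraining: average the embeddings of a handful of noise/silence clips into a centroid and reject any query embedding that lands too close to it. A sketch with toy vectors; the 0.7 threshold is an assumption that would need calibration against real embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_non_speech(embedding, silence_embeddings, threshold=0.7):
    """Flag an embedding as silence/noise when its cosine similarity to
    the averaged 'silence speaker' centroid exceeds the threshold."""
    centroid = np.mean(np.asarray(silence_embeddings, dtype=float), axis=0)
    return cosine(embedding, centroid) >= threshold
```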