These files experiment with speech separation using various neural network structures. The experiments feed in a dataset of sound files, each containing a mixture of 2 clean voices, and attempt to build a network that separates out the 2 voices. The following networks have been built so far:
- Feed-forward network
- RNN network
- Bi-directional RNN network
- Bi-directional RNN with the deep-clustering loss function
- Full deep-clustering model
The scripts were created using the Spyder IDE from Anaconda. Before executing each script, set the console working directory to the directory of the script.
J. R. Hershey, Z. Chen, J. Le Roux and S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, 2016, pp. 31-35
Within the DataGenerator folder are two Python scripts that create the dataset. It is assumed that a top-level folder exists called TIMIT_WAV that contains the TIMIT dataset. The top-level folder should look something like this:
The datagenerator.py script contains a class to create the dataset. The dataset is saved as several pickle files, which are written to a top-level folder called Data.
The datagenerator2.py script takes the data from a given number of pickle files and feeds it into a TensorFlow session in batches.
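The batching class itself lives in datagenerator2.py; as a loose illustration of the idea only (the function below and the assumed pickle format are illustrative, not the repository's API), batches could be drawn from the pickle files like this:

```python
import pickle
import random
import numpy as np

def batch_generator(pickle_paths, batch_size):
    """Yield (mixture_spectrogram, ibm) batches from a list of pickle files.

    Assumes each pickle file holds a list of (spectrogram, ibm) pairs;
    the real on-disk format is whatever datagenerator.py writes.
    """
    for path in pickle_paths:
        with open(path, "rb") as f:
            examples = pickle.load(f)
        random.shuffle(examples)
        for start in range(0, len(examples) - batch_size + 1, batch_size):
            batch = examples[start:start + batch_size]
            specs = np.stack([spec for spec, _ in batch])
            masks = np.stack([mask for _, mask in batch])
            yield specs, masks
```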
The feedforward folder contains a Python script called train_net.py that trains a feed-forward network. The network contains 2 hidden layers of 300 neurons and an output layer of 129 neurons (one for each frequency bin in the spectrogram). The output layer uses a sigmoid activation function, and a mean squared error loss is computed against the known ideal binary mask (IBM). The following schematic represents the flow of the code:
After 50 epochs, the network struggles to find any pattern in the data; accuracy on the training set is still close to 50%.
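For reference, a minimal sketch of the architecture described above, not the repository's exact train_net.py (TensorFlow 1.x style; the hidden-layer activation, optimizer and learning rate are assumptions):

```python
import tensorflow as tf

N_BINS = 129  # frequency bins per spectrogram frame

# One spectrogram frame of the mixture in, the known IBM for that frame as target.
mixture = tf.placeholder(tf.float32, [None, N_BINS])
ibm_target = tf.placeholder(tf.float32, [None, N_BINS])

hidden1 = tf.layers.dense(mixture, 300, activation=tf.nn.relu)
hidden2 = tf.layers.dense(hidden1, 300, activation=tf.nn.relu)
ibm_pred = tf.layers.dense(hidden2, N_BINS, activation=tf.nn.sigmoid)

loss = tf.losses.mean_squared_error(labels=ibm_target, predictions=ibm_pred)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```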
A test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:
After applying the IBM, the reconstructed sound wave looks (and sounds) the same as the original mixture, implying that a feed-forward network is not a good model for speech separation.
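For completeness, applying a predicted binary mask to a mixture and resynthesising a waveform can be sketched as follows (a scipy-based illustration, not the repository's code; an FFT length of 256 is assumed, which gives the 129 frequency bins used above):

```python
from scipy.signal import stft, istft

def apply_mask(mixture, mask, fs=16000, nperseg=256):
    """Apply a binary time-frequency mask to a mixture and resynthesise one source.

    mixture : 1-D array of audio samples
    mask    : array of shape (129, n_frames) matching the STFT below
    """
    _, _, Zxx = stft(mixture, fs=fs, nperseg=nperseg)  # complex, (129, n_frames)
    _, separated = istft(Zxx * mask, fs=fs, nperseg=nperseg)
    return separated
```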
The RNN folder contains a Python script called train_rnn.py. This script trains a 2-layer RNN using LSTM cells containing 300 neurons. A final feedforward layer with 129 neurons and a sigmoid activation function produces an IBM. A mean squared error loss function is used against the known IBM. The flow is shown in the following schematic:
The network uses the same datagenerator.py class to create the data. The spectrograms are split into chunks of 100 time frames, which are fed into the RNN; any frames left over after the last full chunk of 100 are not used for training. Like the feed-forward network, the network struggles to separate the two sound sources, and accuracy on the training set after 50 epochs is still almost 50%.
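A rough sketch of this setup, covering the chunking of the spectrogram and the stacked-LSTM network (again illustrative rather than the repository's exact train_rnn.py; the optimizer and learning rate are assumptions):

```python
import tensorflow as tf

N_BINS, CHUNK_LEN = 129, 100  # frequency bins, time frames per chunk

def chunk_spectrogram(spec):
    """Split a (n_frames, 129) spectrogram into (n_chunks, 100, 129) chunks,
    discarding any frames after the last full chunk."""
    n_chunks = spec.shape[0] // CHUNK_LEN
    return spec[:n_chunks * CHUNK_LEN].reshape(n_chunks, CHUNK_LEN, N_BINS)

# Inputs: chunks of the mixture spectrogram and the corresponding IBM chunks.
mixture = tf.placeholder(tf.float32, [None, CHUNK_LEN, N_BINS])
ibm_target = tf.placeholder(tf.float32, [None, CHUNK_LEN, N_BINS])

# Two stacked LSTM layers of 300 units each.
stacked = tf.nn.rnn_cell.MultiRNNCell(
    [tf.nn.rnn_cell.LSTMCell(300) for _ in range(2)])
outputs, _ = tf.nn.dynamic_rnn(stacked, mixture, dtype=tf.float32)  # (batch, 100, 300)

# Final feedforward layer with sigmoid activation produces the predicted IBM.
ibm_pred = tf.layers.dense(outputs, N_BINS, activation=tf.nn.sigmoid)
loss = tf.losses.mean_squared_error(labels=ibm_target, predictions=ibm_pred)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```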
As with the feed-forward network, a test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:
As with the feed-forward network, after applying the IBM, the reconstructed sound wave looks (and sounds) the same as the original mixture, implying that an RNN is not a good model for speech separation.
The Bi-Directional-RNN folder contains a Python script called train_bi_directional_RNN.py. This script trains a 2-layer bi-directional RNN using LSTM cells containing 300 neurons. A final feedforward layer with 129 neurons and a sigmoid activation function produces an IBM. A mean squared error loss function is used against the known IBM. The flow is shown in the following schematic:
As with the uni-directional RNN, the network uses the same datagenerator.py class to create the data, and the spectrograms are split into chunks of 100 time frames (frames after the last full chunk are discarded). Accuracy on the training set after 50 epochs is still only around 50%.
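The only change from the previous sketch is the recurrent part: the stacked LSTM is run in both directions and the forward and backward outputs are concatenated before the sigmoid output layer (again a loose sketch under the same assumptions, not the repository's exact code):

```python
import tensorflow as tf

N_BINS, CHUNK_LEN = 129, 100

mixture = tf.placeholder(tf.float32, [None, CHUNK_LEN, N_BINS])
ibm_target = tf.placeholder(tf.float32, [None, CHUNK_LEN, N_BINS])

def stacked_lstm():
    """Two stacked LSTM layers of 300 units for one direction."""
    return tf.nn.rnn_cell.MultiRNNCell(
        [tf.nn.rnn_cell.LSTMCell(300) for _ in range(2)])

(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
    stacked_lstm(), stacked_lstm(), mixture, dtype=tf.float32)
outputs = tf.concat([out_fw, out_bw], axis=-1)  # (batch, 100, 600)

ibm_pred = tf.layers.dense(outputs, N_BINS, activation=tf.nn.sigmoid)
loss = tf.losses.mean_squared_error(labels=ibm_target, predictions=ibm_pred)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```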
The same test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:
As with the other networks, after applying the IBM, the reconstructed sound wave looks (and sounds) the same as the original mixture, implying that a bi-directional RNN on its own is not a good model for speech separation.
The Bi-Directional-RNN-with-loss-function folder contains a Python script called train_bi_with_loss_function.py. This script trains the same 2-layer bi-directional RNN as before, but this time the loss function from deep clustering (Hershey et al., 2016) is implemented. The flow is shown in the following schematic:
Accuracy on the training set after 50 epochs was erratic. However, the purpose of this loss function is not to predict the IBM directly but to push the embeddings produced by the final layer apart for different sources and together for the same source.
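Concretely, the deep clustering loss compares the affinity matrix of the network's embeddings V (one D-dimensional embedding per time-frequency bin) with the affinity matrix of the ideal assignments Y: L = ||VVᵀ − YYᵀ||²_F. A sketch of the loss in TensorFlow, using the expanded form that avoids building the full (T·F)×(T·F) affinity matrices (variable names are illustrative, not the repository's):

```python
import tensorflow as tf

def deep_clustering_loss(V, Y):
    """Deep clustering loss ||V V^T - Y Y^T||_F^2 (Hershey et al., 2016).

    V : (batch, T*F, D) unit-normalised embeddings, one per time-frequency bin
    Y : (batch, T*F, C) one-hot assignment of each bin to its dominant source

    Expanding the Frobenius norm gives
        ||V^T V||_F^2 - 2 * ||V^T Y||_F^2 + ||Y^T Y||_F^2,
    which only needs small D x D, D x C and C x C matrices.
    """
    def sq_frob(A):
        return tf.reduce_sum(tf.square(A), axis=[-2, -1])

    vtv = tf.matmul(V, V, transpose_a=True)  # (batch, D, D)
    vty = tf.matmul(V, Y, transpose_a=True)  # (batch, D, C)
    yty = tf.matmul(Y, Y, transpose_a=True)  # (batch, C, C)
    return tf.reduce_mean(sq_frob(vtv) - 2.0 * sq_frob(vty) + sq_frob(yty))
```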
The same test signal containing a mixture of 2 voices was fed into the network and the following IBM was produced:
As with the other networks, after applying the IBM, the reconstructed sound wave looks (and sounds) the same as the original mixture, implying that the embedding network alone, without the clustering step, is still not sufficient for speech separation.
The full deep-clustering model is implemented in the Deep-clustering folder within the Python script called train_deep_clustering.py. The programmatic flow is shown in the following schematic:
The bi-directional LSTM model created before is used to produce embeddings. Test signals are then fed through the network to obtain their embeddings. An example of the embeddings from a test signal is shown below:
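The embedding layer simply replaces the sigmoid mask output: the BLSTM outputs are projected to a D-dimensional embedding for each time-frequency bin and unit-normalised. A sketch under assumptions (D = 40 is the value used in the deep clustering paper; the repository's choice of dimension and activation may differ):

```python
import tensorflow as tf

N_BINS, CHUNK_LEN, EMBED_DIM = 129, 100, 40  # EMBED_DIM is an assumption

def embedding_layer(outputs):
    """Project BLSTM outputs (batch, 100, 600) to one unit-length
    EMBED_DIM-dimensional embedding per time-frequency bin."""
    emb = tf.layers.dense(outputs, N_BINS * EMBED_DIM, activation=tf.nn.tanh)
    emb = tf.reshape(emb, [-1, CHUNK_LEN * N_BINS, EMBED_DIM])
    return tf.nn.l2_normalize(emb, axis=-1)  # one embedding per T-F bin
```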
K-means clustering is then applied to the embeddings to assign each embedding (and hence each time-frequency bin) to a speaker:
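A minimal sketch of that clustering step, using scikit-learn (the shapes and helper name are assumptions matching the 129-bin spectrograms above):

```python
import numpy as np
from sklearn.cluster import KMeans

def embeddings_to_mask(embeddings, n_frames, n_bins=129, n_sources=2):
    """Cluster per-bin embeddings into sources and build a binary mask.

    embeddings : (n_frames * n_bins, D) array, one row per time-frequency bin
    Returns a (n_frames, n_bins) 0/1 mask selecting the bins of one cluster.
    """
    labels = KMeans(n_clusters=n_sources).fit_predict(embeddings)
    mask = (labels == 0).astype(np.float32)
    return mask.reshape(n_frames, n_bins)
```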
The loss function is designed to move embeddings from different sources further apart and embeddings from the same source closer together:
The same test signal containing a mixture of 2 voices as before was fed into the network. Clustering was performed on the resulting embeddings and the following IBM was produced:
Below is the output after applying the binary mask. If enough data is fed into the network, some separation is audible (honest!):