cant start training #3

Open
008karan opened this issue Dec 5, 2019 · 37 comments

Comments

@008karan

008karan commented Dec 5, 2019

I was testing the setup on the mini_librispeech data. This is the log from when I started training:

# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train 
# Started at Thu Dec  5 19:24:21 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7ffb7b99c610>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730  chunks
1863  chunks
Traceback (most recent call last):
  File "../../../eend/bin/train.py", line 72, in <module>
    train(args)
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
    gpuid = use_single_gpu()
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 56, in use_single_gpu
    cvd = get_free_gpus()[0]
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 40, in get_free_gpus
    del gpus[busid]
KeyError: ' 00000000:01:00.0'
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec  5 19:24:22 IST 2019, elapsed time 1 seconds

Can you suggest what's going wrong?

@sw005320

sw005320 commented Dec 5, 2019

What kind of cluster environment are you using?
You may need to change https://github.com/hitachi-speech/EEND/blob/master/egs/mini_librispeech/v1/cmd.sh to match your environment. See https://kaldi-asr.org/doc/queue.html

@yubouf, I strongly recommend adding more documentation about cmd.sh and making run.pl the default.

@008karan
Author

008karan commented Dec 5, 2019

I am using a conda environment on a local machine, so I have changed cmd.sh to use run.pl.

@yubouf
Contributor

yubouf commented Dec 5, 2019

@008karan Thank you for testing EEND.
Consider setting CUDA_VISIBLE_DEVICES.
The GPU selection failure might come from the CUDA (nvidia-smi) version; I have not tested on CUDA 10.

@sw005320 Thank you for your suggestion. I will change the default to 'run.pl'.
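
For reference, the first traceback fails at `del gpus[busid]` with a key that has a leading space. Without seeing utils.py, two plausible causes are CSV fields that were not stripped and the same bus ID being reported for more than one process. The sketch below tolerates both. It is not the EEND code: the function name is hypothetical, and it assumes the inputs are the text printed by `nvidia-smi --query-gpu=index,pci.bus_id --format=csv,noheader` and `nvidia-smi --query-compute-apps=gpu_bus_id --format=csv,noheader`.

```python
def free_gpu_indices(gpu_csv, apps_csv):
    """Return the indices of GPUs that have no compute process.

    gpu_csv:  text with one "index, bus_id" line per GPU
    apps_csv: text with one "bus_id" line per running compute process
    (Hypothetical helper; the input format is an assumption.)
    """
    gpus = {}
    for line in gpu_csv.strip().splitlines():
        # Strip each field so " 00000000:01:00.0" and "00000000:01:00.0" match.
        idx, busid = (field.strip() for field in line.split(","))
        gpus[busid] = int(idx)
    for line in apps_csv.strip().splitlines():
        # pop() with a default tolerates a bus ID that is listed twice
        # (two processes on one GPU) or that is unknown.
        gpus.pop(line.strip(), None)
    return sorted(gpus.values())


# Two processes on the only GPU: no free GPU, and no KeyError.
busy = free_gpu_indices("0, 00000000:01:00.0\n",
                        " 00000000:01:00.0\n 00000000:01:00.0\n")
```

Parsing is kept separate from the `nvidia-smi` call so it can be checked without a GPU.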

@sw005320

sw005320 commented Dec 5, 2019

Oh, I see.
Can you set CUDA_VISIBLE_DEVICES explicitly then?

@008karan
Author

008karan commented Dec 5, 2019

After exporting CUDA_VISIBLE_DEVICES=1, here is the log:

# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train 
# Started at Thu Dec  5 20:00:36 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7f9d28248910>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730  chunks
1863  chunks
Traceback (most recent call last):
  File "../../../eend/bin/train.py", line 72, in <module>
    train(args)
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/train.py", line 100, in train
    gpuid = use_single_gpu()
  File "/home/gamut/Downloads/EEND/eend/chainer_backend/utils.py", line 64, in use_single_gpu
    chainer.cuda.get_device_from_id(cvd).use()
  File "cupy/cuda/device.pyx", line 135, in cupy.cuda.device.Device.use
  File "cupy/cuda/device.pyx", line 141, in cupy.cuda.device.Device.use
  File "cupy/cuda/runtime.pyx", line 193, in cupy.cuda.runtime.setDevice
  File "cupy/cuda/runtime.pyx", line 145, in cupy.cuda.runtime.check_status
cupy.cuda.runtime.CUDARuntimeError: cudaErrorInvalidDevice: invalid device ordinal
# Accounting: time=1 threads=1
# Ended (code 1) at Thu Dec  5 20:00:37 IST 2019, elapsed time 1 seconds

@sw005320

sw005320 commented Dec 5, 2019

CUDA_VISIBLE_DEVICES=0?
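
Some background on why 0 is the natural value to try: CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, and it stops enumerating at the first invalid index. The sketch below is a simplified model of that rule (integer indices only, no UUIDs). The function name is hypothetical, and the single-GPU machine is an assumption based on the one bus ID in the first traceback.

```python
def visible_devices(cuda_visible_devices, num_physical_gpus):
    """Map logical CUDA ordinals to physical GPU indices.

    Simplified model of CUDA_VISIBLE_DEVICES: integer indices only,
    enumeration stops at the first invalid index.
    """
    mapping = {}
    for token in cuda_visible_devices.split(","):
        token = token.strip()
        if not token.isdigit() or int(token) >= num_physical_gpus:
            break
        mapping[len(mapping)] = int(token)
    return mapping


# Assumed single-GPU machine:
print(visible_devices("1", 1))  # {}      -> no usable device
print(visible_devices("0", 1))  # {0: 0}  -> logical device 0 is the GPU
```

Even on a two-GPU machine, `CUDA_VISIBLE_DEVICES=1` exposes only logical ordinal 0, so asking for ordinal 1 would still be invalid.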

@008karan
Author

008karan commented Dec 5, 2019

It looks like training started but then stopped:

training model at exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train.
bash: line 1:  6217 Aborted                 (core dumped) ( train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train ) 2>> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log >> exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log

log:

[
    {
        "main/loss": 0.8094631433486938,
        "main/speech_scored": 429.4651162790698,
        "main/speech_miss": 135.0,
        "main/speech_falarm": 20.930232558139537,
        "main/speaker_scored": 683.7209302325581,
        "main/speaker_miss": 351.7906976744186,
        "main/speaker_falarm": 55.25581395348837,
        "main/speaker_error": 28.13953488372093,
        "main/correct": 221.59302325581396,
        "main/diarization_error": 435.1860465116279,
        "main/frames": 453.25581395348837,
        "validation/main/loss": 0.7502496242523193,
        "validation/main/speech_scored": 377.26666666666665,
        "validation/main/speech_miss": 97.96666666666667,
        "validation/main/speech_falarm": 35.733333333333334,
        "validation/main/speaker_scored": 545.8,
        "validation/main/speaker_miss": 234.56666666666666,
        "validation/main/speaker_falarm": 83.8,
        "validation/main/speaker_error": 33.86666666666667,
        "validation/main/correct": 224.55,
        "validation/main/diarization_error": 352.23333333333335,
        "validation/main/frames": 417.6,
        "main/DER": 0.6364965986394557,
        "validation/main/DER": 0.6453523879320875,
        "main/SAD_MR": 0.3143445064168517,
        "validation/main/SAD_MR": 0.2596748542145256,
        "main/SAD_FR": 0.048735582390209566,
        "validation/main/SAD_FR": 0.09471638098603995,
        "main/MI": 0.5145238095238095,
        "validation/main/MI": 0.42976670331012584,
        "main/FA": 0.08081632653061224,
        "validation/main/FA": 0.1535360938072554,
        "main/CF": 0.04115646258503401,
        "validation/main/CF": 0.06204959081470625,
        "main/accuracy": 0.4888917393535146,
        "validation/main/accuracy": 0.5377155172413793,
        "epoch": 1,
        "iteration": 43,
        "elapsed_time": 107.64393779402599
    },
    {
        "main/loss": 0.6841620802879333,
        "main/speech_scored": 429.09302325581393,
        "main/speech_miss": 59.44186046511628,
        "main/speech_falarm": 22.41860465116279,
        "main/speaker_scored": 699.4651162790698,
        "main/speaker_miss": 238.53488372093022,
        "main/speaker_falarm": 89.3953488372093,
        "main/speaker_error": 21.53488372093023,
        "main/correct": 267.8953488372093,
        "main/diarization_error": 349.4651162790698,
        "main/frames": 453.3953488372093,
        "validation/main/loss": 0.6442975997924805,
        "validation/main/speech_scored": 377.26666666666665,
        "validation/main/speech_miss": 17.0,
        "validation/main/speech_falarm": 40.2,
        "validation/main/speaker_scored": 545.8,
        "validation/main/speaker_miss": 92.33333333333333,
        "validation/main/speaker_falarm": 159.63333333333333,
        "validation/main/speaker_error": 20.066666666666666,
        "validation/main/correct": 271.55,
        "validation/main/diarization_error": 272.03333333333336,
        "validation/main/frames": 417.6,
        "main/DER": 0.4996176480367058,
        "validation/main/DER": 0.4984121167704899,
        "main/SAD_MR": 0.13852907701479594,
        "validation/main/SAD_MR": 0.045060964834776465,
        "main/SAD_FR": 0.05224649070511084,
        "validation/main/SAD_FR": 0.10655592860929494,
        "main/MI": 0.3410247032616285,
        "validation/main/MI": 0.16917063637474045,
        "main/FA": 0.1278052997306912,
        "validation/main/FA": 0.2924758763893978,
        "main/CF": 0.030787645044386077,
        "validation/main/CF": 0.03676560400635154,
        "main/accuracy": 0.5908647927780057,
        "validation/main/accuracy": 0.6502634099616859,
        "epoch": 2,
        "iteration": 86,
        "elapsed_time": 178.86651986418292
    }
]
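
As a reading aid, the ratio metrics in this log can be reproduced from the averaged counts in the same entry. The sketch below uses the epoch 1 training values copied from the log above. The function name is hypothetical, and reading MI, FA, and CF as miss, false alarm, and confusion rates is an interpretation of the key names, not something taken from the EEND source.

```python
def error_rates(stats, prefix="main/"):
    """Recompute the ratio metrics from the averaged counts of one log entry."""
    s = {k[len(prefix):]: v for k, v in stats.items() if k.startswith(prefix)}
    return {
        "DER": s["diarization_error"] / s["speaker_scored"],
        "SAD_MR": s["speech_miss"] / s["speech_scored"],
        "SAD_FR": s["speech_falarm"] / s["speech_scored"],
        "MI": s["speaker_miss"] / s["speaker_scored"],
        "FA": s["speaker_falarm"] / s["speaker_scored"],
        "CF": s["speaker_error"] / s["speaker_scored"],
    }


# Epoch 1 training counts, copied from the log above.
epoch1 = {
    "main/speech_scored": 429.4651162790698,
    "main/speech_miss": 135.0,
    "main/speech_falarm": 20.930232558139537,
    "main/speaker_scored": 683.7209302325581,
    "main/speaker_miss": 351.7906976744186,
    "main/speaker_falarm": 55.25581395348837,
    "main/speaker_error": 28.13953488372093,
    "main/diarization_error": 435.1860465116279,
}
rates = error_rates(epoch1)
# DER is also the sum of the three speaker-level rates.
der_from_parts = rates["MI"] + rates["FA"] + rates["CF"]
```

Both `rates["DER"]` and `der_from_parts` agree with the logged "main/DER" of about 0.636.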

@yubouf
Contributor

yubouf commented Dec 5, 2019

The losses of the two epochs look good.
I have no idea yet what caused the core dump.

What does exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train/.work/train.log say?

@008karan
Author

008karan commented Dec 5, 2019

Here it is:

# train.py -c conf/train.yaml data/simu/data/train_clean_5_ns2_beta2_500 data/simu/data/dev_clean_2_ns2_beta2_500 exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train 
# Started at Thu Dec  5 20:08:49 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=64, config=[<yamlargparse.Path object at 0x7fd94ec1a850>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/train_clean_5_ns2_beta2_500.dev_clean_2_ns2_beta2_500.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data/simu/data/train_clean_5_ns2_beta2_500', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data/simu/data/dev_clean_2_ns2_beta2_500')
2730  chunks
1863  chunks
GPU device 0 is used
Prepared model
dot: graph is too large for cairo-renderer bitmaps. Scaling by 0.607145 to fit
epoch       main/loss   validation/main/loss  main/diarization_error_rate  validation/main/diarization_error_rate  elapsed_time
Tcl_AsyncDelete: async handler deleted by the wrong thread
# Accounting: time=187 threads=1
# Ended (code 134) at Thu Dec  5 20:11:56 IST 2019, elapsed time 187 seconds

@008karan
Author

008karan commented Dec 5, 2019

Where are the model's hyperparameters? Maybe reducing the batch size would help.

@yubouf
Contributor

yubouf commented Dec 5, 2019

See the conf directory. conf/train.yaml holds the hyperparameters.

@008karan
Author

008karan commented Dec 6, 2019

Thanks for the help. I really appreciate your quick reply. @yubouf

After reducing the batch size, training completed with 29% DER.
Now I need to test it on my custom data. I have some questions:

  1. Is this repo an implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives' or 'End-to-End Neural Speaker Diarization with Self-attention'?

  2. How do I do inference? I want to see how accurately it separates the speakers if I pass in audio with two speakers.

  3. I am confused by the directory structure of the repo. I need to test it on my custom data. I have collected some audio data with a single speaker in each recording. I don't have transcripts.
    From the comments I found that training needs segments, reco2dur, wav.scp, utt2spk, and spk2utt files. My understanding:
    segments: the audio of a single speaker saying one utterance
    reco2dur: the duration of that audio
    wav.scp: the list of audio files
    utt2spk and spk2utt: the mappings
    In the repo these files were only in dev_clean_2, not in train_clean_5.

Also, there is diarization_data with mixed audio. What is that for?

I think I am missing something. Can you shed some light on what the dataset format and structure should be for speaker diarization?

@yubouf
Contributor

yubouf commented Dec 6, 2019

  1. Is this repo an implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives' or 'End-to-End Neural Speaker Diarization with Self-attention'?

Both. The latest network configuration is based on 'End-to-End Neural Speaker Diarization with Self-attention'.

  2. How do I do inference? I want to see how accurately it separates the speakers if I pass in audio with two speakers.

The "mini_librispeech" model is prepared just for code integration tests and is not related to the papers.
It's better to train a model with the "callhome" recipe,
but that requires a large amount of data and a long training time.

I'm afraid the current code is not intended for inference-only use.
For inference, see below:

for dset in dev_clean_2_ns2_beta2_500; do
    work=$infer_dir/$dset/.work
    mkdir -p $work
    $infer_cmd $work/infer.log \
        infer.py \
        -c $infer_config \
        $infer_args \
        data/simu/data/$dset \
        $model_dir/$ave_id.nnet.npz \
        $infer_dir/$dset \
        || exit 1
done

data/simu/data/dev_clean_2_ns2_beta2_500 is the kaldi-style data directory for inference.

  3. I am confused by the directory structure of the repo. I need to test it on my custom data. I have collected some audio data with a single speaker in each recording. I don't have transcripts.
    From the comments I found that training needs segments, reco2dur, wav.scp, utt2spk, and spk2utt files. My understanding:
    segments: the audio of a single speaker saying one utterance
    reco2dur: the duration of that audio
    wav.scp: the list of audio files
    utt2spk and spk2utt: the mappings
    In the repo these files were only in dev_clean_2, not in train_clean_5.
    Also, there is diarization_data with mixed audio. What is that for?

train_clean_5 and dev_clean_2 are not the actual training and test data for our model.
They are the mini_librispeech dataset.
Our training and test data are generated by simulation:
Training: data/simu/data/train_clean_5_ns2_beta2_500
Test: data/simu/data/dev_clean_2_ns2_beta2_500

@008karan
Author

008karan commented Dec 7, 2019

  1. OK, so the training data should contain call recordings of two people, and that is what you simulated, right? Can you tell me how much data is needed and how long training takes? Also, the model is independent of who is speaking, right?

  2. I would like to try both of the papers you have published. Where can I find the implementation of 'End-to-End Neural Speaker Diarization with Permutation-free Objectives'? I am assuming that both take the same input data.

  3. I have gone through data/simu/data/train_clean_5_ns2_beta2_500. As there is no documentation, I don't understand what is in the following files.
    In rttm:

SPEAKER data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066 1    2.08   15.75 <NA> <NA> 1088-134315 <NA>

In the segments file, for example:

1088-134315_data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066_0000208_0001782 data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066 2.08325 17.82825

In spk2utt, as I understand it, the mixture is generated from 1088 and 134315, and the audio numbers are 66, 208, and 1782:

1088-134315 1088-134315_data_simu_wav_train_clean_5_ns2_beta2_500_14_mix_0000066_0000208_0001782

The same goes for utt2spk.
Lastly, wav.scp has the mapping from recording IDs to file paths:


data_simu_wav_train_clean_5_ns2_beta2_500_100_mix_0000496 /home/gamut/Downloads/EEND/egs/mini_librispeech/v1/data/simu/wav/train_clean_5_ns2_beta2_500/100/mix_0000496.wav

Please explain where I am wrong and what is actually in those files.
At the moment I have audio recordings with two speakers in each. Do I need to label each speaker and generate the same mappings as in the files above?

Thanks!

@yubouf
Contributor

yubouf commented Dec 8, 2019

Explanation of Kaldi's data directory:
https://kaldi-asr.org/doc/data_prep.html
RTTM:
https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf

To see how we generate the simulated training data, see run_prepare_shared.sh together with our paper, particularly Algorithm 1.
Training time was not described in the papers. It depends on the computing environment. In our experiments, 100,000 mixtures (generated with beta=2) with 100 epochs took 4-6 days.

@008karan
Author

008karan commented Dec 9, 2019

I already have audio recordings so no need to simulate but do I need to get the transcript?

@yubouf
Contributor

yubouf commented Dec 9, 2019

I already have audio recordings so no need to simulate but do I need to get the transcript?

No. You don’t have to prepare the text file.

@008karan
Author

Thanks for the links.
I have a question about the RTTM:

SPEAKER data_simu_wav_train_clean_5_ns2_beta2_500_100_mix_0000500 1    2.82    4.27 <NA> <NA> 1867-154075 <NA>

Are tbeg (2.82) and tdur (4.27) randomly generated here? I couldn't hear a difference at those times when listening to the mixed audio file. The same goes for the segments file: are the segments you pass in randomly generated?

Lastly, spk2utt and utt2spk require <utterance-id> <speaker-id>. How do I get these when all I have is the audio recordings?

Cheers!

@yubouf
Contributor

yubouf commented Dec 10, 2019

Yes, the training data is a simulated two-speaker mixture of "mini_librispeech" utterances with randomly chosen silence intervals. segments and rttm reflect the result of the random simulation.
Each "mini_librispeech" utterance may be long and contain several sentences, so the mixture may sound strange. But again, this is just intended for the integration test.
Our actual recipe related to the paper is the "callhome" recipe.
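
To make "randomly chosen silence intervals" concrete, here is a rough sketch of how one speaker's timeline could be laid out. It is not the code in run_prepare_shared.sh. It assumes that each silence is drawn from an exponential distribution whose mean is beta seconds, which is one reading of the beta parameter in Algorithm 1. The function name and the utterance durations are made up for illustration.

```python
import random


def simulate_speaker_timeline(utt_durations, beta, rng):
    """Place one speaker's utterances on a timeline with random silences.

    Each silence length is drawn from an exponential distribution with
    mean `beta` seconds (an assumed reading of the beta parameter).
    Returns a list of (start, end) tuples in seconds.
    """
    segments = []
    t = 0.0
    for dur in utt_durations:
        t += rng.expovariate(1.0 / beta)  # silence before this utterance
        segments.append((t, t + dur))
        t += dur
    return segments


rng = random.Random(777)
# Hypothetical utterance durations (seconds) for two speakers.
mixture = {
    "spk_a": simulate_speaker_timeline([3.0, 2.5, 4.0], beta=2.0, rng=rng),
    "spk_b": simulate_speaker_timeline([1.5, 5.0], beta=2.0, rng=rng),
}
```

The two timelines are generated independently, so the speakers can overlap in the mixture. Each (start, end) pair would become one row of segments and one RTTM line, which is why those files look random.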

Suppose you already have your two-speaker mixtures for training data:
audio recordings:
rec1.wav, rec2.wav, ...
and segmentation for two speakers per recording.
You should prepare these below:
wav.scp: the list of <recording> <file> like

rec1 rec1.wav
rec2 rec2.wav
...

segments: the list of <utterance> <recording> <start_time> <end_time> like

rec1_Alice_001 rec1 2.0 4.5
rec1_Bob_001 rec1 4.3 8.0
rec1_Alice_002 rec1 10.0 11.5
rec2_Charlie_001 rec2 3.3 4.4
rec2_Charlie_002 rec2 5.5 6.0
rec2_Daisy_001 rec2 7.0 7.5

utt2spk: the list of <utterance> <speaker> like

rec1_Alice_001 Alice
rec1_Alice_002 Alice
rec1_Bob_001 Bob
rec2_Charlie_001 Charlie
rec2_Charlie_002 Charlie
rec2_Daisy_001 Daisy
...

Then you can generate spk2utt, reco2dur, and rttm using the Kaldi tools:
rttm from steps/segmentation/convert_utt2spk_and_segments_to_rttm.py
reco2dur from utils/data/get_reco2dur.sh
spk2utt from utils/utt2spk_to_spk2utt.pl
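
If it helps to see what those tools produce, here is a small pure-Python sketch of the spk2utt and rttm conversions for the example above. It is not a replacement for the Kaldi scripts: the function names are hypothetical, the RTTM field layout is copied from the rttm line quoted earlier in this thread, and reco2dur is left out because it needs the audio files.

```python
from collections import defaultdict


def make_spk2utt(utt2spk):
    """Invert {utterance: speaker} into {speaker: [utterances]}."""
    spk2utt = defaultdict(list)
    for utt in sorted(utt2spk):
        spk2utt[utt2spk[utt]].append(utt)
    return dict(spk2utt)


def make_rttm(segments, utt2spk):
    """Build RTTM lines from segments rows (utterance, recording, start, end)."""
    lines = []
    for utt, reco, start, end in segments:
        # RTTM stores the start time and the duration, not the end time.
        lines.append(
            "SPEAKER {} 1 {:.2f} {:.2f} <NA> <NA> {} <NA>".format(
                reco, start, end - start, utt2spk[utt]))
    return lines


segments = [
    ("rec1_Alice_001", "rec1", 2.0, 4.5),
    ("rec1_Bob_001", "rec1", 4.3, 8.0),
    ("rec1_Alice_002", "rec1", 10.0, 11.5),
]
utt2spk = {
    "rec1_Alice_001": "Alice",
    "rec1_Alice_002": "Alice",
    "rec1_Bob_001": "Bob",
}
rttm = make_rttm(segments, utt2spk)
spk2utt = make_spk2utt(utt2spk)
```

Here the first RTTM line comes out as `SPEAKER rec1 1 2.00 2.50 <NA> <NA> Alice <NA>`.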

@008karan
Author

Hi, I got all the files and started training, but nothing is happening. There is nothing inside data.data.train except cg.dot and cg.png.

@008karan
Author

008karan commented Dec 20, 2019

train log

# train.py -c conf/train.yaml data data exp/diarize/model/data.data.train 
# Started at Fri Dec 20 12:27:28 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=16, config=[<yamlargparse.Path object at 0x7fa60e12d550>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/data.data.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data')
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843  chunks
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843  chunks
GPU device 0 is used
Prepared model

@yubouf
Contributor

yubouf commented Dec 20, 2019

The log indicates that train.py is still waiting.
Since the mini_librispeech recipe worked, the difference is probably in your data preparation.

[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on

I have no idea what those lines are.

@008karan
Author

Everything before trainer.run() prints out:

    trainer.extend(extensions.dump_graph('main/loss', out_name="cg.dot"))
    print('###########################5')
    trainer.run()
    print('Finished!')

Can you suggest how to debug further?

@yubouf
Contributor

yubouf commented Dec 23, 2019

If you interrupt the program with Ctrl+C, you will get a stack trace that shows where it stopped and the possible cause.

I'm afraid the problem is hard to find because it might be related to how your data was prepared. If you could share the data with me, I could run it for debugging.
But I don't want to take the risk of receiving sensitive speech data that you own.
If you try our code with other publicly available data, we may be able to solve your issue.
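
As an alternative to waiting for Ctrl+C, Python's standard `faulthandler` module can write the stack of every thread to a file at a fixed interval, which also shows where a hung process is stuck. This is a generic Python sketch, not part of EEND. The helper name, the log file name, and the placement before `trainer.run()` are assumptions.

```python
import faulthandler


def install_hang_diagnostics(log_file, interval_sec=600):
    """Dump all thread stacks to log_file on a crash and every interval_sec.

    log_file must be an open file with a real file descriptor and must
    stay open for as long as the diagnostics are active.
    """
    faulthandler.enable(file=log_file, all_threads=True)
    faulthandler.dump_traceback_later(interval_sec, repeat=True, file=log_file)


# Hypothetical usage, placed before trainer.run():
# hang_log = open("hang_traceback.log", "w")
# install_hang_diagnostics(hang_log)
# trainer.run()
# faulthandler.cancel_dump_traceback_later()
```

If the same frame keeps showing up in the periodic dumps, that is the call the program is blocked in.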

@008karan
Author

008karan commented Jan 6, 2020

I am getting these results. Can you help me with inference?

        "main/loss": 0.5038370490074158,
        "main/speech_scored": 242.6723163841808,
        "main/speech_miss": 149.28248587570621,
        "main/speech_falarm": 22.28813559322034,
        "main/speaker_scored": 242.6723163841808,
        "main/speaker_miss": 149.28248587570621,
        "main/speaker_falarm": 22.51412429378531,
        "main/speaker_error": 18.163841807909606,
        "main/correct": 346.4124293785311,
        "main/diarization_error": 189.96045197740114,
        "main/frames": 450.47457627118644,
        "validation/main/loss": 0.4737112522125244,
        "validation/main/speech_scored": 286.44943820224717,
        "validation/main/speech_miss": 86.78651685393258,
        "validation/main/speech_falarm": 40.17977528089887,
        "validation/main/speaker_scored": 286.44943820224717,
        "validation/main/speaker_miss": 86.78651685393258,
        "validation/main/speaker_falarm": 42.40449438202247,
        "validation/main/speaker_error": 40.95505617977528,
        "validation/main/correct": 348.02247191011236,
        "validation/main/diarization_error": 170.14606741573033,
        "validation/main/frames": 453.5730337078652,
        "main/DER": 0.7827858356808605,
        "validation/main/DER": 0.5939828979367695,
        "main/SAD_MR": 0.6151607571066049,
        "validation/main/SAD_MR": 0.3029732486075155,
        "main/SAD_FR": 0.09184457430214421,
        "validation/main/SAD_FR": 0.14026829842315838,
        "main/MI": 0.6151607571066049,
        "validation/main/MI": 0.3029732486075155,
        "main/FA": 0.09277582473866784,
        "validation/main/FA": 0.1480348317251118,
        "main/CF": 0.07484925383558774,
        "validation/main/CF": 0.14297481760414218,
        "main/accuracy": 0.7689944064012842,
        "validation/main/accuracy": 0.7672909235037653,
        "epoch": 10,
        "iteration": 1777,
        "elapsed_time": 903.8602520569693

@yubouf
Contributor

yubouf commented Jan 6, 2020

Copied from my earlier comment:

  2. How do I do inference? I want to see how accurately it separates the speakers if I pass in audio with two speakers.

The "mini_librispeech" model is prepared just for code integration tests and is not related to the papers.
It's better to train a model with the "callhome" recipe,
but that requires a large amount of data and a long training time.

I'm afraid the current code is not intended for inference-only use.
For inference, see below:

for dset in dev_clean_2_ns2_beta2_500; do
    work=$infer_dir/$dset/.work
    mkdir -p $work
    $infer_cmd $work/infer.log \
        infer.py \
        -c $infer_config \
        $infer_args \
        data/simu/data/$dset \
        $model_dir/$ave_id.nnet.npz \
        $infer_dir/$dset \
        || exit 1
done

data/simu/data/dev_clean_2_ns2_beta2_500 is the kaldi-style data directory for inference.

@yubouf
Contributor

yubouf commented Jan 6, 2020

"main/DER": 0.7827858356808605, means the performance is very poor.
"iteration": 1777, indicates that your training data size is too small.

@008karan
Author

OK, can you suggest how many hours of data are needed to build a good speaker diarization system? Also, can we do this without timestamps? As you know, getting audio with accurate timestamps is difficult.
Thanks
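
For context on the "training data size is too small" remark, the iteration counts in the logs in this thread are consistent with a simple formula based on the chunk count, batch size, and number of epochs. The sketch below is arithmetic on the logged values only. The function name is hypothetical, and the formula is inferred from the logs, not read from the trainer code.

```python
import math


def total_iterations(num_chunks, batchsize, epochs):
    """Iterations needed to pass over num_chunks examples `epochs` times."""
    return math.ceil(epochs * num_chunks / batchsize)


# mini_librispeech run: 2730 chunks, batchsize=64
print(total_iterations(2730, 64, 1))   # 43   (logged at epoch 1)
print(total_iterations(2730, 64, 2))   # 86   (logged at epoch 2)
# custom data run: 2843 chunks, batchsize=16, max_epochs=10
print(total_iterations(2843, 16, 10))  # 1777 (logged at epoch 10)
```

Note also that the config has noam_warmup_steps=25000.0, so a run that ends at 1777 iterations finishes well before the warmup phase of the learning-rate schedule is over.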

@yubouf
Contributor

yubouf commented Jan 11, 2020

We didn't use manual timestamps for simulated mixtures and two-channel recordings.
In both cases, we have single-speaker recordings. So we can get timestamps via a speech activity detection system.
In our papers, I suggested using a simulated training set of 100k recordings, sampled from large-scale telephone recordings, which have separate channels. The "callhome" recipe is good for general diarization tasks, but for two-speaker recordings only.
Although we observed the real training set was better than the simulated dataset, we believe that better large-scale simulation with better model architecture can outperform the smaller real training set.

@AntonOkhotnikov

Explanation of Kaldi's data directory:
https://kaldi-asr.org/doc/data_prep.html
RTTM:
https://web.archive.org/web/20100606041157if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf

To see how we generate the simulated training data, see run_prepare_shared.sh together with our paper, particularly Algorithm 1.
Training time was not described in the papers. It depends on the computing environment. In our experiments, 100,000 mixtures (generated with beta=2) with 100 epochs took 4-6 days.

@yubouf Could you please reveal the GPU you used, so I can roughly estimate the training time in my case?

@yubouf
Contributor

yubouf commented Feb 12, 2020

GeForce GTX 1080 Ti.
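
For a rough estimate on the same GPU, the reported figures (100,000 mixtures, 100 epochs, 4-6 days) can be scaled to other dataset sizes. The sketch assumes training time grows linearly with mixtures times epochs, which is only an approximation. It also ignores differences in hardware, batch size, and mixture length. The names are hypothetical.

```python
# Reference run reported in this thread.
REF_MIXTURES = 100_000
REF_EPOCHS = 100
REF_DAYS = (4.0, 6.0)  # low and high estimate


def seconds_per_epoch(total_days, epochs):
    """Average wall-clock seconds per epoch for a run of total_days."""
    return total_days * 24 * 3600 / epochs


def estimate_days(num_mixtures, epochs):
    """Scale the reference run linearly (an assumption) to another setup."""
    scale = (num_mixtures * epochs) / (REF_MIXTURES * REF_EPOCHS)
    return tuple(days * scale for days in REF_DAYS)


low = seconds_per_epoch(REF_DAYS[0], REF_EPOCHS)   # 3456.0 s, about 58 min
high = seconds_per_epoch(REF_DAYS[1], REF_EPOCHS)  # 5184.0 s, about 86 min
small_run = estimate_days(10_000, 100)             # roughly 0.4 to 0.6 days
```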

@AntonOkhotnikov

Thank you very much

@Durgesh92

We didn't use manual timestamps for simulated mixtures and two-channel recordings.
In both cases, we have single-speaker recordings. So we can get timestamps via a speech activity detection system.
In our papers, I suggested using a simulated training set of 100k recordings, sampled from large-scale telephone recordings, which have separate channels. The "callhome" recipe is good for general diarization tasks, but for two-speaker recordings only.
Although we observed the real training set was better than the simulated dataset, we believe that better large-scale simulation with better model architecture can outperform the smaller real training set.

Is there any way to train on multi-speaker recordings with the callhome recipe? I get this error when the number of speakers is more than two:

File "/home/sysadmin/EEND/eend/feature.py", line 282, in get_labeledSTFT
T[rel_start:rel_end, speaker_index] = 1
IndexError: index 2 is out of bounds for axis 1 with size 2
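
An index of 2 on an axis of size 2 is what one would expect when a recording contains a third speaker while the label matrix has only two columns. A quick way to find such recordings before training is to count speakers per recording from segments and utt2spk. This is a sketch with a hypothetical function name. It assumes the two files are already parsed into Python objects.

```python
from collections import defaultdict


def recordings_over_limit(segments, utt2spk, num_speakers=2):
    """Return {recording: sorted speakers} for recordings with too many speakers.

    segments: rows of (utterance, recording, start, end)
    utt2spk:  {utterance: speaker}
    """
    speakers = defaultdict(set)
    for utt, reco, _start, _end in segments:
        speakers[reco].add(utt2spk[utt])
    return {reco: sorted(spks)
            for reco, spks in speakers.items()
            if len(spks) > num_speakers}
```

Any recording returned here would need to be dropped, split, or handled by a model configured with a larger num_speakers.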

@yubouf
Contributor

yubouf commented Feb 24, 2020

The model has a fixed number of speakers, set in the config as num_speakers: 2.
To extend it to a variable number of speakers, you can treat num_speakers as the maximum number of speakers, and train the model on recordings with a variable number of speakers, with the labels zero-padded to the maximum number of speakers.
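
A minimal sketch of the zero-padding idea, using plain Python lists in place of the NumPy arrays used in eend/feature.py. The function name and the frame-based segment format are assumptions for illustration. It is not the EEND implementation.

```python
def padded_frame_labels(segments, speakers, num_frames, max_speakers):
    """Build a num_frames x max_speakers matrix of 0/1 speech activity labels.

    segments: (speaker, start_frame, end_frame) tuples, end exclusive
    speakers: the speakers present in this recording, in column order
    Columns for absent speakers stay all zero (the zero padding).
    """
    if len(speakers) > max_speakers:
        raise ValueError("recording has more speakers than max_speakers")
    labels = [[0] * max_speakers for _ in range(num_frames)]
    for speaker, start, end in segments:
        column = speakers.index(speaker)
        for frame in range(max(start, 0), min(end, num_frames)):
            labels[frame][column] = 1
    return labels


# A two-speaker recording padded to a three-speaker model.
labels = padded_frame_labels(
    [("Alice", 0, 3), ("Bob", 2, 5)], ["Alice", "Bob"],
    num_frames=6, max_speakers=3)
```

In this example frame 2 is `[1, 1, 0]` (both speakers active) and the third column is zero in every frame.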

@fxyouruo

fxyouruo commented Jul 7, 2021

train log

# train.py -c conf/train.yaml data data exp/diarize/model/data.data.train 
# Started at Fri Dec 20 12:27:28 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=16, config=[<yamlargparse.Path object at 0x7fa60e12d550>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/data.data.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data')
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843  chunks
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843  chunks
GPU device 0 is used
Prepared model

@008karan Hi, I have the same problem as you. Could you share your solution? Thank you!

@maerduduqi

# train.py -c conf/train.yaml data data exp/diarize/model/data.data.train
# Started at Fri Dec 20 12:27:28 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11) [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=16, config=[<yamlargparse.Path object at 0x7fa60e12d550>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/data.data.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data')
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843 chunks
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843 chunks
GPU device 0 is used
Prepared model

@maerduduqi

train log

# train.py -c conf/train.yaml data data exp/diarize/model/data.data.train 
# Started at Fri Dec 20 12:27:28 IST 2019
#
python version: 3.7.5 (default, Oct 25 2019, 15:51:11)  [GCC 7.3.0]
chainer version: 6.2.0
cupy version: 6.2.0
cuda version: 10000
cudnn version: 7500
namespace(backend='chainer', batchsize=16, config=[<yamlargparse.Path object at 0x7fa60e12d550>], context_size=7, dc_loss_ratio=0.5, embedding_layers=2, embedding_size=256, frame_shift=80, frame_size=200, gpu=0, gradclip=5, gradient_accumulation_steps=1, hidden_size=256, initmodel='', input_transform='logmel23_mn', label_delay=0, lr=0.001, max_epochs=10, model_save_dir='exp/diarize/model/data.data.train', model_type='Transformer', noam_scale=1.0, noam_warmup_steps=25000.0, num_frames=500, num_lstm_layers=1, num_speakers=2, optimizer='noam', resume='', sampling_rate=8000, seed=777, subsampling=10, train_data_dir='data', transformer_encoder_dropout=0.1, transformer_encoder_n_heads=4, transformer_encoder_n_layers=2, valid_data_dir='data')
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843  chunks
[['agent-1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_1', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_3', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_5', '2c4d1476-a2a4-4a04-91cb-72b7a88aeedd_agent_7'].....so on
2843  chunks
GPU device 0 is used
Prepared model

@008karan Hi, I have the same problem as you. Could you share your solution? Thank you!

Have you solved the problem?


7 participants