
OOM killed when diarizing 4 hours recording #8235

Closed
tttalshaul opened this issue Jan 24, 2024 · 33 comments
Labels: bug (Something isn't working), stale

@tttalshaul

I'm trying nemo_toolkit 1.22.0 on a T4 (14 GB VRAM) with 40 GB CPU RAM to diarize with ClusteringDiarizer.
After completing speaker embeddings, tqdm shows "clustering: 0%" and then my script gets killed by the OS with error 137 (OOM, out of memory) because RAM consumption exceeds 40 GB.

I installed NeMo using pip install nemo_toolkit==1.22.0 (Python 3.10 x64) on an on-premise K8s cluster.

I'm using diar_infer_meeting.yaml with max_num_speakers=20 and chunk_cluster_count=50.
I tried lowering embeddings_per_chunk from the default 10000 to 500, but it didn't help.
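For context, this is roughly how I'm invoking the diarizer (a minimal sketch; the manifest/output paths are placeholders, and the exact keys under diarizer.clustering.parameters are my reading of diar_infer_meeting.yaml in 1.22):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Load the bundled meeting config and point it at the input manifest (placeholder paths).
cfg = OmegaConf.load("diar_infer_meeting.yaml")
cfg.diarizer.manifest_filepath = "input_manifest.json"
cfg.diarizer.out_dir = "./diar_output"

# The overrides mentioned above.
cfg.diarizer.clustering.parameters.max_num_speakers = 20
cfg.diarizer.clustering.parameters.chunk_cluster_count = 50
cfg.diarizer.clustering.parameters.embeddings_per_chunk = 500  # lowered from the default 10000

diarizer = ClusteringDiarizer(cfg=cfg).to("cuda:0")
diarizer.diarize()
```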

tttalshaul added the bug label on Jan 24, 2024
@nithinraok (Collaborator)

@tttalshaul can you share more info on the length of the audio and the size of the embeddings in the .pkl file you saved?

@tttalshaul (Author)

Audio length is 3:43 hours after cutting silence (using VAD).
Embeddings:

| Scale  | Size | Length |
|--------|------|--------|
| scale0 | 6.6M | 8920   |
| scale1 | 7.9M | 10704  |
| scale2 | 9.9M | 13385  |
| scale3 | 14M  | 17846  |
| scale4 | 20M  | 26772  |
| scale5 | 40M  | 53548  |

@tttalshaul (Author)

@nithinraok
What tests can I run to understand the cause of this problem?

I'm calling ClusteringDiarizer(cfg=cfg).to(device), and I see that device is cuda:0.
I understand that this only affects VAD and embedding extraction, but does clustering happen in CPU RAM anyway?

Thank you

@nithinraok (Collaborator)

@tango4j what is the maximum audio duration we currently support for single-pass offline diarization?

@tango4j (Collaborator) commented Feb 6, 2024

@tttalshaul

Sorry for the late reply.
If you have visible CUDA devices, both embedding extraction and clustering will happen in GPU RAM.

```yaml
clustering:
  parameters:
    ...
    chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
    embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)
```

You should increase the embeddings_per_chunk value to consume less memory.
Also note that even with long-form audio clustering, there is a limit on audio length.
Thus, you need to make a compromise between performance and length:
increasing embeddings_per_chunk will consume less memory but could lead to deteriorated performance.

@tttalshaul (Author)

I've now tried embeddings_per_chunk at 50000, and also at 500000.
gpustat showed 0% GPU utilization after embedding extraction; there was a short get_argmin_mat step and then the OOM. (There wasn't enough time to see more steps.)

What can I do to debug it further (assuming I can't export the audio)?
Thank you.

@tttalshaul (Author)

Also, what is the supported limit on audio length with long-form audio clustering? (Is it less than 4 hours?)
Does increasing embeddings_per_chunk consume less CPU RAM, or less GPU VRAM?
By "deteriorated performance", do you mean worse results, or that it will take more time?

@tttalshaul (Author)

@tango4j 🙏

@tango4j (Collaborator) commented Feb 13, 2024

> gpustat showed 0% GPU utilization after embedding extraction; there was a short get_argmin_mat step and then the OOM. (There wasn't enough time to see more steps.)

First, check torch.cuda.is_available() in your PyTorch environment.
Second, for

    diarizer_model = NeuralDiarizer(cfg).to(device)

check whether this model variable is assigned to the device you want.

> Also, what is the supported limit on audio length with long-form audio clustering? (Is it less than 4 hours?)

This depends on (1) your GPU memory size, (2) the portion of detected silence in your audio file, and (3) the embeddings_per_chunk value you set. It cannot be answered without running the sample.
However, your script is loading the embedding vectors into CPU RAM, so GPU RAM size is not even the issue. Please make sure everything is loaded into GPU RAM.
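For example, something like the following can show whether memory is piling up on the GPU or in CPU RAM (a minimal sketch; report_memory is just a made-up helper, and the commented-out _speaker_model attribute is an assumption about ClusteringDiarizer's internals that may differ by version):

```python
import os
import psutil
import torch

def report_memory(tag: str) -> None:
    """Print current GPU allocation and this process's CPU RSS."""
    gpu_mb = torch.cuda.memory_allocated() / 2**20 if torch.cuda.is_available() else 0.0
    cpu_mb = psutil.Process(os.getpid()).memory_info().rss / 2**20
    print(f"[{tag}] GPU allocated: {gpu_mb:.0f} MiB | CPU RSS: {cpu_mb:.0f} MiB")

print("CUDA available:", torch.cuda.is_available())
# report_memory("before diarize")
# diarizer.diarize()
# report_memory("after diarize")
# To confirm the underlying speaker model is on GPU (attribute name is an assumption):
# print(next(diarizer._speaker_model.parameters()).device)
```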

> By "deteriorated performance", do you mean worse results, or that it will take more time?

This means the speaker counting can be wrong (wrong number of speakers) and the diarization accuracy can drop.

@tttalshaul (Author)

@tango4j
cuda.is_available() is True, and device is cuda:0 (as I mentioned).
I'm using ClusteringDiarizer; is NeuralDiarizer more recommended, or might it behave differently with regard to doing all computations on the GPU instead of the CPU?

My GPU is a T4 with 14 GB VRAM.
The ClusteringDiarizer model is moved to the GPU, VAD runs on the GPU, and embedding extraction runs on the GPU.
Why would the get_argmin_mat method (or other clustering steps) use so much CPU RAM?

What else can I do to make sure everything is loaded into GPU RAM?

@tttalshaul (Author)

@tango4j any idea?

@tango4j (Collaborator) commented Feb 21, 2024

@tttalshaul
Hi. Now I can see that what is piling up in CPU RAM is the speaker embedding vectors for clustering.
Those speaker embedding vectors cannot be removed, since they need to be used in the clustering process.
For very long audio like yours, this can only be fixed by saving the speaker embeddings to your disk drive and loading them back again; this feature should be implemented in the future.
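As a rough illustration of what that disk-offload workaround could look like (a minimal sketch, not something implemented in NeMo; the function, file name, and chunk size are made up):

```python
import numpy as np
import torch

def offload_embeddings(embs: torch.Tensor, path: str = "scale_embs.npy") -> np.memmap:
    """Write speaker embeddings to disk and return a memory-mapped view,
    so the full matrix does not have to stay resident in CPU RAM."""
    arr = embs.detach().cpu().numpy().astype(np.float32)
    np.save(path, arr)                    # persist to disk
    del arr                               # drop this function's in-memory copy
    return np.load(path, mmap_mode="r")   # read back as a lazily-paged memmap

# Hypothetical usage: only materialize the rows a clustering chunk actually needs.
# mm = offload_embeddings(scale5_embeddings)
# chunk = torch.from_numpy(np.asarray(mm[0:10000])).to("cuda")
```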

However, you have another quick option:
https://github.com/chimechallenge/C8DASR-Baseline-NeMo/tree/main/scripts/chime8

The NeMo team is supporting the CHiME-8 Challenge and we are providing the baseline for everyone.
This version of diarization uses less memory, so there is a chance that it runs without OOM (although that is not guaranteed).
If you are in a hurry on this task, please try the CHiME-8 baseline diarization.

@khaykingleb commented Feb 22, 2024

Hello @tango4j! Are you talking about the msdd_v2_models or the msdd_v3_models module? Does the new MSDD model provide the same DER?

@dkurt commented Feb 22, 2024

I have also faced a CUDA OOM on long audio samples. Maybe my solution can help you too: #8469

Surprisingly, small changes in the math were enough to avoid hitting the limit of GPU capacity.

@tango4j (Collaborator) commented Feb 22, 2024

@dkurt
There are two types of OOM issues:

1. CPU-RAM OOM: after extracting speaker embeddings on the GPU, the vectors need to be offloaded to CPU RAM, because GPU RAM is heavily limited. However, if an audio file is longer than what your CPU RAM can hold, CPU RAM will be maxed out. One way to solve this is to write the speaker embeddings to your disk drive and then load them again; we do not implement this because it would slow the process down a lot.
2. GPU-RAM OOM: this happens when the affinity matrix grows too big, since it is O(n^2) in memory (rough numbers sketched below). This can be mitigated to a certain degree by increasing embeddings_per_chunk. But this also has a limit, since GPU RAM is not infinite. Try increasing this parameter as I stated above.
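To put the O(n^2) growth in perspective, a back-of-the-envelope sketch using the scale5 count reported earlier in this thread (the chunk figure simply assumes each chunk holds at most embeddings_per_chunk embeddings, per the config comment above):

```python
def affinity_matrix_bytes(n_embeddings: int, bytes_per_element: int = 4) -> int:
    """Memory for a dense n x n float32 affinity matrix."""
    return n_embeddings * n_embeddings * bytes_per_element

# scale5 in this thread had 53548 embeddings:
print(affinity_matrix_bytes(53548) / 2**30)   # ~10.7 GiB for one dense matrix
# per-chunk matrix if a chunk holds 10000 embeddings (the default embeddings_per_chunk):
print(affinity_matrix_bytes(10000) / 2**30)   # ~0.37 GiB per chunk
```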

@tango4j (Collaborator) commented Feb 22, 2024

@khaykingleb The MSDD-v2 model for the CHiME-8 baseline will give you a different DER number for the same audio file, since these CHiME-8 models are tuned on the CHiME-7 train/dev datasets.

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Mar 24, 2024
@tttalshaul (Author)

@tango4j Thank you.
Do you have an example of how to run CHiME-8 diarization inference on custom input files? A Jupyter notebook or other example?
I'll try that.

github-actions bot removed the stale label on Mar 26, 2024
@khaykingleb commented Apr 1, 2024

I believe it has the same interface. For example, you can see examples/speaker_tasks/diarization/neural_diarizer/multiscale_diar_decoder_v2_infer.py.

But running it doesn't work, even after downloading MSDD_v2_PALO_100ms_intrpl_3scales.nemo from the HuggingFace Hub and using examples/speaker_tasks/diarization/conf/inference/diar_infer_msdd_v2.yaml instead.

I get the following error:

```
File ~/experiments/C8DASR-Baseline-NeMo/nemo/collections/asr/models/msdd_v2_models.py:1132, in EncDecDiarLabelModel.forward_vad(self, audio_signal, audio_signal_length)
   1130 def forward_vad(self, audio_signal, audio_signal_length):
   1131     log_probs = self.msdd._vad_model(input_signal=audio_signal, input_signal_length=audio_signal_length)
-> 1132     vad_probs = torch.softmax(log_probs, dim=-1)[:,:,1]
   1133     vad_probs_frame = torch.repeat_interleave(vad_probs, self.subsample_vad, dim=1)
   1134     return vad_probs_frame

IndexError: too many indices for tensor of dimension 2
```

And if I fix it, I get another error. @tango4j, can we really use MSDD V2?

@tttalshaul (Author)

I've managed to run the CHiME-8 diarization somehow on a custom recording I chose, but the results weren't good.
I'm currently using the LongClusteringDiarizer from the gabitza-tech fork. It splits the file into small chunks and of course doesn't OOM, but I'm not sure that the speakers in each split are matched consistently across the other splits.

github-actions bot (Contributor) commented May 3, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on May 3, 2024
@AatikaNazneen

Hey, I made a fix that worked for me. Previously the code never reached long-form clustering because it hit the OOM in get_argmin_mat first. Let me know if this works for you. I was able to process a 5-hour audio on a g4.xlarge using this. Merge-Request

@nithinraok (Collaborator)

Thanks for creating the PR. @tango4j, could you please help review it?

github-actions bot removed the stale label on May 10, 2024
github-actions bot (Contributor) commented Jun 9, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Jun 9, 2024
github-actions bot (Contributor)

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Jun 16, 2024
@AatikaNazneen

The issue still exists.
Please review the PR.

@nithinraok (Collaborator)

@tango4j please add your comments/views for the PR.

nithinraok reopened this on Jun 24, 2024
@tango4j (Collaborator) commented Jun 24, 2024

@nithinraok
I have not been able to get the 4-hour-long audio samples yet. Once I get access to them, I will work on this PR.

github-actions bot removed the stale label on Jun 25, 2024
@tttalshaul (Author)

@tango4j Have you gotten the 4-hour audio samples?
When do you think you will be available to approve the PR?
It has been waiting for 3 months already.
Thank you.

@AatikaNazneen

Hey @tango4j @nithinraok, please review the PR. It's been waiting for quite a while now.

@tango4j (Collaborator) commented Aug 15, 2024

Sorry for the delay on @AatikaNazneen's PR.
#9114 (comment)
I have changed the get_argmin_mat function to use torch.expand instead of torch.tile.
torch.tile allocates new memory while torch.expand does not.
I have run the sample provided by @AatikaNazneen (longer than 4 hours).
Note that @AatikaNazneen's for-loop version takes significantly more time than torch.expand, so it is not desirable as the implementation in get_argmin_mat.
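For anyone following along, the difference is easy to see in isolation (a minimal sketch, not the actual get_argmin_mat code; the size is arbitrary):

```python
import torch

n = 8192
col = torch.randn(n, 1)

tiled = col.tile(1, n)        # materializes an n x n copy: n^2 new elements
expanded = col.expand(n, n)   # broadcast view over the original storage: no new allocation

print(tiled.nelement() * tiled.element_size() / 2**20)   # 256 MiB actually allocated
print(col.data_ptr() == expanded.data_ptr())             # True: expand reuses the same memory
```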

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Sep 14, 2024
github-actions bot (Contributor)

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Sep 21, 2024