
OOM killed when diarizing 4 hours recording #8235

Closed
tttalshaul opened this issue Jan 24, 2024 · 33 comments
Labels: bug (Something isn't working), stale

@tttalshaul

I'm trying nemo_toolkit 1.22.0 on a T4 (14 GB VRAM) with 40 GB CPU RAM to diarize with ClusteringDiarizer.
After completing speaker embeddings, tqdm shows "clustering: 0%" and then my script gets killed by the OS with error 137 (OOM, out of memory) because RAM consumption exceeds 40 GB.

I installed NeMo using pip install nemo_toolkit==1.22.0 (Python 3.10 x64) on an on-premise K8s cluster.

I'm using diar_infer_meeting.yaml with max_num_speakers=20 and chunk_cluster_count=50.
I tried lowering embeddings_per_chunk from the default 10000 to 500, but it didn't help.
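For context, this is roughly how I'm invoking the diarizer (a minimal sketch; the manifest/output paths are placeholders, and the exact keys under diarizer.clustering.parameters are my reading of diar_infer_meeting.yaml in 1.22):

```python
from omegaconf import OmegaConf
from nemo.collections.asr.models import ClusteringDiarizer

# Load the bundled meeting config and point it at the input manifest (placeholder paths).
cfg = OmegaConf.load("diar_infer_meeting.yaml")
cfg.diarizer.manifest_filepath = "input_manifest.json"
cfg.diarizer.out_dir = "./diar_output"

# The overrides mentioned above.
cfg.diarizer.clustering.parameters.max_num_speakers = 20
cfg.diarizer.clustering.parameters.chunk_cluster_count = 50
cfg.diarizer.clustering.parameters.embeddings_per_chunk = 500  # lowered from the default 10000

diarizer = ClusteringDiarizer(cfg=cfg).to("cuda:0")
diarizer.diarize()
```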

tttalshaul added the bug label on Jan 24, 2024
@nithinraok (Collaborator)

@tttalshaul can you share more info on the length of the audio and the size of the embeddings in the .pkl file you saved?

@tttalshaul (Author)

Audio length is 3:43 hours after cutting silence (using VAD).
Embeddings:

| Scale  | Size | Length |
|--------|------|--------|
| scale0 | 6.6M | 8920   |
| scale1 | 7.9M | 10704  |
| scale2 | 9.9M | 13385  |
| scale3 | 14M  | 17846  |
| scale4 | 20M  | 26772  |
| scale5 | 40M  | 53548  |

@tttalshaul (Author)

@nithinraok
What tests can I run to understand the cause of this problem?

I'm calling ClusteringDiarizer(cfg=cfg).to(device), and I see that device is cuda:0.
I understand that this only affects VAD and embedding extraction, but does clustering happen in CPU RAM anyway?

Thank you

@nithinraok (Collaborator)

@tango4j what is the maximum audio duration we currently support for single-pass offline diarization?

@tango4j (Collaborator) commented Feb 6, 2024

@tttalshaul

Sorry for the late reply.
If you have visible CUDA devices, both embedding extraction and clustering will happen in GPU RAM.

```yaml
clustering:
  parameters:
    ...
    chunk_cluster_count: 50 # Number of forced clusters (overclustering) per unit chunk in long-form audio clustering.
    embeddings_per_chunk: 10000 # Number of embeddings in each chunk for long-form audio clustering. Adjust based on GPU memory capacity. (default: 10000, approximately 40 mins of audio)
```

You should increase the embeddings_per_chunk value to consume less memory.
Also note that even with long-form audio clustering, there is a limit on audio length.
Thus, you need to make a compromise between performance and length:
increasing embeddings_per_chunk will consume less memory but could lead to deteriorated performance.

@tttalshaul (Author)

I've now tried embeddings_per_chunk at 50000, and also at 500000.
gpustat showed 0% GPU utilization after embedding extraction; there was a short get_argmin_mat step and then the OOM. (There wasn't enough time to see more steps.)

What can I do to debug it further (assuming I can't export the audio)?
Thank you.

@tttalshaul (Author)

Also, what is the supported limit on audio length with long-form audio clustering? (Is it less than 4 hours?)
Does increasing embeddings_per_chunk consume less CPU RAM, or less GPU VRAM?
By "deteriorated performance", do you mean worse results, or that it will take more time?

@tttalshaul (Author)

@tango4j 🙏

@tango4j (Collaborator) commented Feb 13, 2024

> gpustat showed 0% GPU utilization after embedding extraction; there was a short get_argmin_mat step and then the OOM. (There wasn't enough time to see more steps.)

First, check torch.cuda.is_available() in your PyTorch environment.
Second, for

    diarizer_model = NeuralDiarizer(cfg).to(device)

check whether this model variable is assigned to the device you want.

> Also, what is the supported limit on audio length with long-form audio clustering? (Is it less than 4 hours?)

This depends on (1) your GPU memory size, (2) the portion of detected silence in your audio file, and (3) the embeddings_per_chunk value you set. It cannot be answered without running the sample.
However, your script is loading the embedding vectors into CPU RAM, so GPU RAM size is not even the issue. Please make sure everything is loaded into GPU RAM.
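For example, something like the following can show whether memory is piling up on the GPU or in CPU RAM (a minimal sketch; report_memory is just a made-up helper, and the commented-out _speaker_model attribute is an assumption about ClusteringDiarizer's internals that may differ by version):

```python
import os
import psutil
import torch

def report_memory(tag: str) -> None:
    """Print current GPU allocation and this process's CPU RSS."""
    gpu_mb = torch.cuda.memory_allocated() / 2**20 if torch.cuda.is_available() else 0.0
    cpu_mb = psutil.Process(os.getpid()).memory_info().rss / 2**20
    print(f"[{tag}] GPU allocated: {gpu_mb:.0f} MiB | CPU RSS: {cpu_mb:.0f} MiB")

print("CUDA available:", torch.cuda.is_available())
# report_memory("before diarize")
# diarizer.diarize()
# report_memory("after diarize")
# To confirm the underlying speaker model is on GPU (attribute name is an assumption):
# print(next(diarizer._speaker_model.parameters()).device)
```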

> By "deteriorated performance", do you mean worse results, or that it will take more time?

This means the speaker counting can be wrong (wrong number of speakers) and the diarization accuracy can drop.

@tttalshaul (Author)

@tango4j
cuda.is_available() is True, and device is cuda:0 (as I mentioned).
I'm using ClusteringDiarizer; is NeuralDiarizer more recommended, or might it behave differently with regard to doing all computations on the GPU instead of the CPU?

My GPU is a T4 with 14 GB VRAM.
The ClusteringDiarizer model is moved to the GPU, VAD runs on the GPU, and embedding extraction runs on the GPU.
Why would the get_argmin_mat method (or other clustering steps) use so much CPU RAM?

What else can I do to make sure everything is loaded into GPU RAM?

@tttalshaul (Author)

@tango4j any idea?

@tango4j (Collaborator) commented Feb 21, 2024

@tttalshaul
Hi. Now I can see that what is piling up in CPU RAM is the speaker embedding vectors for clustering.
Those speaker embedding vectors cannot be removed, since they need to be used in the clustering process.
For very long audio like yours, this can only be fixed by saving the speaker embeddings to your disk drive and loading them back again; this feature should be implemented in the future.
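As a rough illustration of what that disk-offload workaround could look like (a minimal sketch, not something implemented in NeMo; the function, file name, and chunk size are made up):

```python
import numpy as np
import torch

def offload_embeddings(embs: torch.Tensor, path: str = "scale_embs.npy") -> np.memmap:
    """Write speaker embeddings to disk and return a memory-mapped view,
    so the full matrix does not have to stay resident in CPU RAM."""
    arr = embs.detach().cpu().numpy().astype(np.float32)
    np.save(path, arr)                    # persist to disk
    del arr                               # drop this function's in-memory copy
    return np.load(path, mmap_mode="r")   # read back as a lazily-paged memmap

# Hypothetical usage: only materialize the rows a clustering chunk actually needs.
# mm = offload_embeddings(scale5_embeddings)
# chunk = torch.from_numpy(np.asarray(mm[0:10000])).to("cuda")
```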

However, you have another quick option:
https://github.com/chimechallenge/C8DASR-Baseline-NeMo/tree/main/scripts/chime8

The NeMo team is supporting the CHiME-8 Challenge and we are providing the baseline for everyone.
This version of diarization uses less memory, so there is a chance that it runs without OOM (although that is not guaranteed).
If you are in a hurry on this task, please try the CHiME-8 baseline diarization.

@khaykingleb commented Feb 22, 2024

Hello @tango4j! Are you talking about the msdd_v2_models or the msdd_v3_models module? Does the new MSDD model provide the same DER?

@dkurt commented Feb 22, 2024

I have also faced a CUDA OOM on long audio samples. Maybe my solution can help you too: #8469

Surprisingly, small changes in the math were enough to avoid hitting the limit of GPU capacity.

@tango4j (Collaborator) commented Feb 22, 2024

@dkurt
There are two types of OOM issues:

1. CPU-RAM OOM: after extracting speaker embeddings on the GPU, the vectors need to be offloaded to CPU RAM, because GPU RAM is heavily limited. However, if an audio file is longer than what your CPU RAM can hold, CPU RAM will be maxed out. One way to solve this is to write the speaker embeddings to your disk drive and then load them again; we do not implement this because it would slow the process down a lot.
2. GPU-RAM OOM: this happens when the affinity matrix grows too big, since it is O(n^2) in memory (rough numbers sketched below). This can be mitigated to a certain degree by increasing embeddings_per_chunk. But this also has a limit, since GPU RAM is not infinite. Try increasing this parameter as I stated above.
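To put the O(n^2) growth in perspective, a back-of-the-envelope sketch using the scale5 count reported earlier in this thread (the chunk figure simply assumes each chunk holds at most embeddings_per_chunk embeddings, per the config comment above):

```python
def affinity_matrix_bytes(n_embeddings: int, bytes_per_element: int = 4) -> int:
    """Memory for a dense n x n float32 affinity matrix."""
    return n_embeddings * n_embeddings * bytes_per_element

# scale5 in this thread had 53548 embeddings:
print(affinity_matrix_bytes(53548) / 2**30)   # ~10.7 GiB for one dense matrix
# per-chunk matrix if a chunk holds 10000 embeddings (the default embeddings_per_chunk):
print(affinity_matrix_bytes(10000) / 2**30)   # ~0.37 GiB per chunk
```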

@tango4j (Collaborator) commented Feb 22, 2024

@khaykingleb The MSDD-v2 model for the CHiME-8 baseline will give you a different DER number for the same audio file, since these CHiME-8 models are tuned on the CHiME-7 train/dev datasets.

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Mar 24, 2024
@tttalshaul (Author)

@tango4j Thank you.
Do you have an example of how to run CHiME-8 diarization inference on custom input files? A Jupyter notebook or other example?
I'll try that.

github-actions bot removed the stale label on Mar 26, 2024
@khaykingleb commented Apr 1, 2024

I believe it has the same interface. For example, you can see examples/speaker_tasks/diarization/neural_diarizer/multiscale_diar_decoder_v2_infer.py.

But running it doesn't work, even after downloading MSDD_v2_PALO_100ms_intrpl_3scales.nemo from the HuggingFace Hub and using examples/speaker_tasks/diarization/conf/inference/diar_infer_msdd_v2.yaml instead.

I get the following error:

```
File ~/experiments/C8DASR-Baseline-NeMo/nemo/collections/asr/models/msdd_v2_models.py:1132, in EncDecDiarLabelModel.forward_vad(self, audio_signal, audio_signal_length)
   1130 def forward_vad(self, audio_signal, audio_signal_length):
   1131     log_probs = self.msdd._vad_model(input_signal=audio_signal, input_signal_length=audio_signal_length)
-> 1132     vad_probs = torch.softmax(log_probs, dim=-1)[:,:,1]
   1133     vad_probs_frame = torch.repeat_interleave(vad_probs, self.subsample_vad, dim=1)
   1134     return vad_probs_frame

IndexError: too many indices for tensor of dimension 2
```

And if I fix it, I get another error. @tango4j, can we really use MSDD V2?

@tttalshaul (Author)

I've managed to run the CHiME-8 diarization somehow on a custom recording I chose, but the results weren't good.
I'm currently using the LongClusteringDiarizer from the gabitza-tech fork. It splits the file into small chunks and of course doesn't OOM, but I'm not sure that the speakers in each split are matched consistently across the other splits.

github-actions bot (Contributor) commented May 3, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on May 3, 2024
@AatikaNazneen

Hey, I made a fix that worked for me. Previously the code never reached long-form clustering because it hit the OOM in get_argmin_mat first. Let me know if this works for you. I was able to process a 5-hour audio on a g4.xlarge using this. Merge-Request

@nithinraok (Collaborator)

Thanks for creating the PR. @tango4j, could you please help review it?

github-actions bot removed the stale label on May 10, 2024
github-actions bot (Contributor) commented Jun 9, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Jun 9, 2024
github-actions bot (Contributor)

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Jun 16, 2024
@AatikaNazneen

The issue still exists.
Please review the PR.

@nithinraok (Collaborator)

@tango4j please add your comments/views for the PR.

nithinraok reopened this on Jun 24, 2024
@tango4j (Collaborator) commented Jun 24, 2024

@nithinraok
I have not been able to get the 4-hour-long audio samples yet. Once I get access to them, I will work on this PR.

github-actions bot removed the stale label on Jun 25, 2024
@tttalshaul (Author)

@tango4j Have you gotten the 4-hour audio samples?
When do you think you will be available to approve the PR?
It has been waiting for 3 months already.
Thank you.

@AatikaNazneen

Hey @tango4j @nithinraok, please review the PR. It's been waiting for quite a while now.

@tango4j (Collaborator) commented Aug 15, 2024

Sorry for the delay on @AatikaNazneen's PR.
#9114 (comment)
I have changed the get_argmin_mat function to use torch.expand instead of torch.tile.
torch.tile allocates new memory while torch.expand does not.
I have run the sample provided by @AatikaNazneen (longer than 4 hours).
Note that @AatikaNazneen's for-loop version takes significantly more time than torch.expand, so it is not desirable as the implementation in get_argmin_mat.
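For anyone following along, the difference is easy to see in isolation (a minimal sketch, not the actual get_argmin_mat code; the size is arbitrary):

```python
import torch

n = 8192
col = torch.randn(n, 1)

tiled = col.tile(1, n)        # materializes an n x n copy: n^2 new elements
expanded = col.expand(n, n)   # broadcast view over the original storage: no new allocation

print(tiled.nelement() * tiled.element_size() / 2**20)   # 256 MiB actually allocated
print(col.data_ptr() == expanded.data_ptr())             # True: expand reuses the same memory
```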

github-actions bot (Contributor)

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions bot added the stale label on Sep 14, 2024
github-actions bot (Contributor)

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions bot closed this as not planned on Sep 21, 2024