Currently, each cut (segment) is processed one at a time. This is fine for long segments, which usually occupy the entire GPU memory, but it may be wasteful for shorter segments. The conventional method (e.g. in ASR) is to do mini-batch processing by padding several segments to the same length. However, we have to be careful doing that here for two reasons:
1. The CACGMM implementation currently expects 3-dimensional input (channels, time, frequency), and adding a batch dimension would require modifying a lot of the internal implementation (which is written efficiently using einops).
2. The CACGMM inference step computes sums over the whole time duration, so adding padding would require some kind of masking.
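To make the second point concrete, here is a minimal sketch (not the current implementation) of what padded batching would entail: segments are zero-padded along the time axis to a common length, and every statistic that sums over time must then be computed through a validity mask. The function name and shapes are illustrative only.

```python
import numpy as np

def pad_and_mask(segments):
    """segments: list of arrays, each of shape (channels, time, freq)."""
    channels, _, freq = segments[0].shape
    max_t = max(s.shape[1] for s in segments)
    batch = np.zeros((len(segments), channels, max_t, freq), dtype=segments[0].dtype)
    mask = np.zeros((len(segments), max_t), dtype=bool)
    for i, s in enumerate(segments):
        batch[i, :, : s.shape[1]] = s   # copy real frames
        mask[i, : s.shape[1]] = True    # mark them as valid
    return batch, mask

# Any sum over time must exclude the padded frames, e.g. a per-frequency mean:
# valid_counts = mask.sum(axis=1)                          # (batch,)
# time_sum = (batch * mask[:, None, :, None]).sum(axis=2)  # (batch, channels, freq)
# time_mean = time_sum / valid_counts[:, None, None]
```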
For these reasons, it may be better to use a different kind of "batching". Instead of combining segments in parallel, we can combine them sequentially, but only if they are from the same recording and have the same target speaker. This ensures that we do not create a permutation problem in the mask estimation.
If we combine in this way, we can even drop the individual context from each segment and instead add a single context around the combined "super-segment", which would further save compute/memory.
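As a rough sketch of this sequential batching, segments could be grouped by (recording, speaker) and concatenated along the time axis before mask estimation; the dict keys and the `max_frames` budget below are placeholders for illustration, not the actual cut/segment API.

```python
from collections import defaultdict
import numpy as np

def make_super_segments(segments, max_frames=None):
    """segments: list of dicts with keys 'recording_id', 'speaker', 'features',
    where 'features' is an array of shape (channels, time, freq)."""
    groups = defaultdict(list)
    for seg in segments:
        groups[(seg["recording_id"], seg["speaker"])].append(seg)

    super_segments = []
    for (recording_id, speaker), segs in groups.items():
        # Concatenate along time: no padding needed, and no permutation problem,
        # since all pieces share the same recording and target speaker.
        feats = np.concatenate([s["features"] for s in segs], axis=1)
        # Optionally split the super-segment again if it exceeds a frame budget,
        # so a single group cannot blow up GPU memory.
        if max_frames is not None and feats.shape[1] > max_frames:
            n_pieces = int(np.ceil(feats.shape[1] / max_frames))
            pieces = np.array_split(feats, n_pieces, axis=1)
        else:
            pieces = [feats]
        for piece in pieces:
            super_segments.append(
                {"recording_id": recording_id, "speaker": speaker, "features": piece}
            )
    return super_segments
```

With this grouping, the context would be added once per super-segment (before concatenation of the pieces is passed to the mask estimator) rather than once per original segment.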