Memory leak with parallel transcribe #1055

saddy001 · 2024-10-11T13:03:28Z

Hey there, thank you for the project.

I have spotted a memory leak in the latest release (1.0.3). When transcribing sequentially, memory usage behaves as expected. When transcribe is called in parallel, however, memory usage keeps increasing until the process runs out of memory (OOM), even with manual garbage collection.

I have built a minimal reproducible example. Note also that "mem after gc" increases further when you raise PARALLEL to 5 or 6.

wget "https://cdn.pixabay.com/download/audio/2024/07/23/audio_9f165cf892.mp3?filename=medieval-gamer-voice-donx27t-forget-to-subscribe-226581.mp3" -O test.mp3

import gc
from threading import Thread

from faster_whisper import WhisperModel
import psutil

PARALLEL = 4
THREADS = []
MODEL = WhisperModel('large-v2', device='auto', compute_type='int8', cpu_threads=4)


def get_rss():
    ''' Get current memory usage (RSS) in MB '''
    return int(psutil.Process().memory_info().rss / 1048576)


def transcribe():
    print(f'mem before {get_rss()}')
    segments, _info = MODEL.transcribe('test.mp3')
    # segments is a generator; consuming it runs the actual transcription
    _ = list(segments)
    print(f'mem after  {get_rss()}')


def sequential():
    print('sequential:')
    for _ in range(PARALLEL):
        transcribe()


def parallel():
    print('\nparallel:')
    for _ in range(PARALLEL):
        THREADS.append(Thread(target=transcribe))
        THREADS[-1].start()


def main():
    sequential()
    gc.collect()
    print(f'\nmem after gc {get_rss()}')
    parallel()

    for t in THREADS:
        t.join()

    gc.collect()
    print(f'\nmem after gc {get_rss()}')


if __name__ == '__main__':
    main()

Output:

sequential:
mem before 1761
mem after  2617
mem before 2617
mem after  2617
mem before 2617
mem after  2617
mem before 2617
mem after  2617

mem after gc 2617  # everything fine up until here

parallel:
mem before 2617
mem before 2617
mem before 2617
mem before 2617
mem after  2691
mem after  2691
mem after  2691
mem after  2691

mem after gc 2691  # leak
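
A single before/after pair cannot distinguish a one-time increase from one that grows with every parallel batch. One way to check is to repeat the parallel batch several times and record memory after each round. This is a minimal sketch under that assumption: `measure_rounds` and its parameters are hypothetical names, not part of faster_whisper, and `work` stands in for the `transcribe` function from the script above.

```python
import gc
from threading import Thread


def get_rss():
    """Current resident set size in MB (same measure as the script above)."""
    import psutil  # imported lazily so measure_rounds can also take another reader
    return int(psutil.Process().memory_info().rss / 1048576)


def measure_rounds(work, parallel=4, rounds=5, read_mem=get_rss):
    """Run `work` in `parallel` threads, `rounds` times.

    Returns one memory reading per round, taken after all threads of that
    round have been joined and gc.collect() has run.
    """
    readings = []
    for _ in range(rounds):
        threads = [Thread(target=work) for _ in range(parallel)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        gc.collect()
        readings.append(read_mem())
    return readings
```

With the script above, `print(measure_rounds(transcribe))` would print one reading per round. Readings that keep rising round after round would point towards a leak, while a plateau after the first round would suggest a one-time cost.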

Edit:
There's another catch. When you run the same script multiple times, the final "mem after gc" value is sometimes very different.

for run in {1..10}; do python test.py | tail -n1; done

mem after gc 2669
mem after gc 2670
mem after gc 2669
mem after gc 2671
mem after gc 2670
mem after gc 2670
mem after gc 4098  # !
mem after gc 2670
mem after gc 2669
mem after gc 2670
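
To spot such runs without reading the list by eye, the collected lines can be compared against their median. This is a rough sketch: `parse_readings`, `outliers` and the 100 MB tolerance are assumptions chosen for illustration, and the sample data is the output shown above.

```python
import re
from statistics import median

LINE_RE = re.compile(r"mem after gc (\d+)")

# Output of the ten runs above
SAMPLE = """\
mem after gc 2669
mem after gc 2670
mem after gc 2669
mem after gc 2671
mem after gc 2670
mem after gc 2670
mem after gc 4098  # !
mem after gc 2670
mem after gc 2669
mem after gc 2670
"""


def parse_readings(text):
    """Extract the MB values from 'mem after gc N' lines."""
    return [int(m.group(1)) for m in LINE_RE.finditer(text)]


def outliers(readings, tolerance_mb=100):
    """Return (run_number, value) pairs that deviate from the median
    by more than tolerance_mb (an arbitrary threshold for this sketch)."""
    if not readings:
        return []
    mid = median(readings)
    return [(i, v) for i, v in enumerate(readings, start=1)
            if abs(v - mid) > tolerance_mb]


print(outliers(parse_readings(SAMPLE)))  # [(7, 4098)]
```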

80boys · 2024-10-12T02:20:47Z

High Concurrency Exception