Using ndarray as input to transcribe method #380

ColeDrain · 2022-10-20T20:55:28Z

Hello there..

Looking at the signature of the transcribe() method, I can see that it accepts ndarrays.
I have been trying to wrap my head around how to do this, but I keep getting errors.

Base: I get audio as bytes (audio_bytes)

What I have tried:

1. audio_array = np.copy(np.frombuffer(audio_bytes, dtype=np.uint8))
or
2. audio_array = np.frombuffer(audio_bytes, np.int16).flatten().astype(np.float32) / 32768.0

Option 2 was described in this thread.

model = load_model()
trans_dict = model.transcribe(audio_array)

Error 1: RuntimeError: "reflection_pad1d" not implemented for 'Byte'
Error 2: transcribe returns an empty segment

Would really appreciate a helping hand

Answered by jianfch

jianfch · 2022-10-20T21:46:58Z

The comment failed to take into account that load_audio() preprocesses the audio with ffmpeg. So the input shouldn't be the bytes of the audio file itself, but the bytes that ffmpeg outputs.
Here's a modified version of load_audio that should work with the bytes of the audio file directly.

from typing import Union

import ffmpeg
import numpy as np


def load_audio(file: Union[str, bytes], sr: int = 16000):
    """
    Open an audio file and read as mono waveform, resampling as necessary

    Parameters
    ----------
    file: (str, bytes)
        The audio file to open or bytes of audio file

    sr: int
        The sample rate to resample the audio if necessary

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """
    
    if isinstance(file, bytes):
        inp = file
        file = 'pipe:'
    else:
        inp = None
    
    try:
        # This launches a subprocess to decode audio while down-mixing and resampling as necessary.
        # Requires the ffmpeg CLI and `ffmpeg-python` package to be installed.
        out, _ = (
            ffmpeg.input(file, threads=0)
            .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
            .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=inp)
        )
    except ffmpeg.Error as e:
        raise RuntimeError(f"Failed to load audio: {e.stderr.decode()}") from e

    return np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0

To use it:

# audio_bytes are the bytes of the audio file
mel = whisper.log_mel_spectrogram(load_audio(audio_bytes))
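
To illustrate why the bytes of an audio file cannot simply be reinterpreted as samples, here is a minimal sketch that uses only the standard-library `wave` module and NumPy. It assumes a 16-bit PCM WAV file. `decode_wav_bytes` is a hypothetical helper, not part of whisper, and it does not resample, so the audio would already need to be at the 16 kHz rate that `load_audio` defaults to.

```python
import io
import wave

import numpy as np


def decode_wav_bytes(wav_bytes: bytes):
    """Decode 16-bit PCM WAV bytes into (float32 mono waveform, sample rate).

    Hypothetical helper for illustration only; it does not resample.
    """
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM WAV is handled in this sketch")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())
    samples = np.frombuffer(frames, dtype="<i2").astype(np.float32) / 32768.0
    if channels > 1:
        # Samples are interleaved; average the channels to down-mix to mono.
        samples = samples.reshape(-1, channels).mean(axis=1)
    return samples, rate


# Build one second of 16 kHz silence as a WAV file in memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(np.zeros(16000, dtype="<i2").tobytes())
wav_bytes = buf.getvalue()

audio, rate = decode_wav_bytes(wav_bytes)

# Reinterpreting the whole file also turns the header bytes into "samples".
naive = np.frombuffer(wav_bytes, dtype="<i2")
```

Here `naive` is longer than `audio` because the WAV header is counted as audio data. With compressed formats such as MP3, the reinterpreted bytes would not resemble the waveform at all, which is why a decoding step (ffmpeg in the answer above) is needed.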

8 replies

ColeDrain Oct 21, 2022
Author

Aiit, thanks so much jian

yesha999 Feb 22, 2023

Thanks, it works with most formats, but with MOV format I get the error RuntimeError: 2D or 3D (batch mode) tensor expected for input, but got: [ torch.FloatTensor{1,1,0} ]

audio = load_audio(file)

print(audio)

This prints an empty array: []

I then found that the problem is in this call:

    out, _ = (
        ffmpeg.input('pipe:', threads=0)
        .output("-", format="s16le", acodec="pcm_s16le", ac=1, ar=sr)
        .run(cmd="ffmpeg", capture_stdout=True, capture_stderr=True, input=file)
    )

Here out is b''
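
One plausible cause, which is an assumption and not confirmed in this thread: MOV and MP4 files often store their index (the `moov` atom) at the end of the file, and ffmpeg cannot seek within a pipe to reach it. A sketch of a workaround is to write the bytes to a temporary file and pass the path to the loader instead. `load_from_bytes_via_tempfile` and its `loader` parameter are hypothetical names; `loader` stands for any function that accepts a file path, such as `load_audio` above when called with a string.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_from_bytes_via_tempfile(
    data: bytes, loader: Callable[[str], T], suffix: str = ""
) -> T:
    """Write `data` to a temporary file and return `loader(path)`."""
    if not data:
        raise ValueError("no input bytes")
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # The file is closed at this point, so another process
        # (for example ffmpeg) can open it by path.
        return loader(path)
    finally:
        # Always clean up the temporary file, even if the loader fails.
        os.remove(path)
```

It may also be worth checking that the decoded array is non-empty before transcribing, since an empty array is what leads to the tensor error quoted above.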

chrisgchiang Apr 15, 2023

When passing my bytes through, I'm getting: pipe:: Invalid data found when processing input. Anyone know what could possibly be the problem?

I'm passing in my audio like this:


p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,
                channels=channels,
                rate=rate,
                input=True,
                input_device_index=device_index,
                frames_per_buffer=chunk)

print("Recording... Press Enter to stop.")
frames = []
buffer = []

while not stop_flag.is_set():
    data = stream.read(chunk)
    frames.append(data)
    buffer.append(data)

    if len(buffer) * chunk >= rate * 5:  # 5 seconds of audio
        audio_data = b''.join(buffer)
        audio_queue.put(audio_data)
        buffer = []

# Put remaining audio into the queue
audio_data = b''.join(buffer)
audio_queue.put(audio_data)
stream.stop_stream()
stream.close()
p.terminate()
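
A likely explanation, offered as an assumption: the data returned by `stream.read()` is raw PCM with no container header, so ffmpeg has nothing from which to detect the format. Because the stream was opened with `pyaudio.paInt16`, the bytes can be converted with NumPy directly. The sketch below assumes 16-bit samples in native byte order, interleaved if there is more than one channel. `pcm16_to_float32` is a hypothetical helper, and it does not resample, so `rate` would need to be 16000 to match the `sr` default of `load_audio`.

```python
import numpy as np


def pcm16_to_float32(pcm: bytes, channels: int = 1) -> np.ndarray:
    """Convert raw 16-bit PCM bytes to a mono float32 waveform in [-1.0, 1.0)."""
    if len(pcm) % (2 * channels) != 0:
        raise ValueError("byte length is not a whole number of frames")
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    if channels > 1:
        # Interleaved frames: average across channels to down-mix to mono.
        samples = samples.reshape(-1, channels).mean(axis=1)
    return samples


# Stand-in for b''.join(buffer): four mono samples.
chunk_bytes = np.array([0, 16384, -16384, -32768], dtype=np.int16).tobytes()
audio = pcm16_to_float32(chunk_bytes)
```

The division by 32768.0 is the same scaling that `load_audio` applies to the s16le output of ffmpeg.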

UmutAlihan May 5, 2023

this solution works flawlessly, thank you very much!

I think this method should be merged to master branch

Shivansh-yadav13 Aug 4, 2023

I'm using a video mp4 file and getting a similar error

RuntimeError: 2D or 3D (batch mode) tensor expected for input, but got: [ torch.FloatTensor{1,1,0} ]

brian316 · 2022-11-02T03:42:41Z

How can we do this from just an ndarray? I have the same issue, but I am using a NumPy array of the audio signal, not a bytes object.

2 replies

brian316 Nov 3, 2022

This is somewhat helpful

import os
import subprocess

import numpy as np


def get_format(in_type: np.dtype) -> str:
    """Map a NumPy dtype to the matching ffmpeg raw PCM format name."""
    format_strings = [
        (np.float64, 'f64le'),
        (np.float32, 'f32le'),
        (np.int16, 's16le'),
        (np.int32, 's32le'),
        (np.uint32, 'u32le')]
    for dtype, string in format_strings:
        if in_type == dtype:
            return string
    raise RuntimeError(f'Not supported type: {in_type}')


def load_audio_test(audio: np.ndarray, sampling_rate: int = 16000, sampling_rate_out: int = 16000, nchannel: int = 1, verbose: bool = False, timeout: float = None):
    """
    Convert an audio array to a mono float32 waveform, resampling as necessary

    Parameters
    ----------
    audio: (np.ndarray)
        The audio array (interleaved samples if nchannel > 1)

    sampling_rate: int
        The sample rate of the input array

    sampling_rate_out: int
        The sample rate to resample the audio to

    nchannel: int
        The number of channels in the input array

    Returns
    -------
    A NumPy array containing the audio waveform, in float32 dtype.
    """

    format_string = get_format(audio.dtype)
    # Describe the raw input stream, then request mono float32 PCM at the output rate.
    command = ['ffmpeg',
               '-f', format_string, '-ar', str(sampling_rate), '-ac', str(nchannel)] +\
              ['-i', 'pipe:0'] +\
              ['-f', 'f32le', '-ar', str(sampling_rate_out), '-ac', '1', '-']
    with subprocess.Popen(
            command,
            bufsize=-1,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=None if verbose else subprocess.PIPE) as p:
        try:
            stdout_bytes, stderr_bytes = p.communicate(
                input=audio.tobytes(), timeout=timeout)
        except subprocess.TimeoutExpired:
            p.kill()
            stdout_bytes, stderr_bytes = p.communicate()
            mes = f'TimeoutExpired: {timeout}[s]. {" ".join(command)}'
            # stderr is only captured when verbose is False.
            if stderr_bytes is not None:
                mes += f'{os.linesep}{stderr_bytes.decode("utf-8")}'
            raise RuntimeError(mes)
        if p.returncode != 0 or stdout_bytes is None:
            mes = ' '.join(command)
            if stderr_bytes is not None:
                mes += f"{os.linesep}{stderr_bytes.decode('utf-8')}"
            raise RuntimeError(mes)

        # The output is already float32 (ffmpeg rescales integer input), so no division by 32768 is needed.
        return np.frombuffer(stdout_bytes, dtype=np.float32)
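
If the array is already sampled at 16 kHz, launching ffmpeg may not be necessary at all. The following is a minimal NumPy-only sketch, assuming that signed integer input uses the full range of its dtype, that float input is already within [-1.0, 1.0], and that multi-channel arrays are shaped (samples, channels). `to_float32_mono` is a hypothetical helper, not part of whisper, and unsigned integer input is deliberately left unsupported here.

```python
import numpy as np


def to_float32_mono(audio: np.ndarray) -> np.ndarray:
    """Return `audio` as a 1-D float32 waveform, without resampling."""
    if np.issubdtype(audio.dtype, np.signedinteger):
        # Scale by the magnitude of the most negative value, e.g. 32768 for int16.
        scale = float(-np.iinfo(audio.dtype).min)
        audio = audio.astype(np.float32) / np.float32(scale)
    elif np.issubdtype(audio.dtype, np.floating):
        audio = audio.astype(np.float32)
    else:
        raise TypeError(f"unsupported dtype: {audio.dtype}")

    if audio.ndim == 2:
        # Shape (samples, channels): average the channels to down-mix to mono.
        audio = audio.mean(axis=1)
    elif audio.ndim != 1:
        raise ValueError(f"expected a 1-D or 2-D array, got {audio.ndim}-D")
    return np.ascontiguousarray(audio, dtype=np.float32)


mono = to_float32_mono(np.array([0, 16384, -32768], dtype=np.int16))
```

The result has the same dtype and value range as the array that `load_audio` returns, so it could be passed to `transcribe()` in the same way.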

elpidiovaldez Nov 4, 2022

You might be trying to do the same as me. See this thread.
