Performance Improvement ideas / feature requests #49

Open
UsernamesLame opened this issue Aug 30, 2024 · 102 comments
Comments

@UsernamesLame
Contributor

As promised, here's the thread I'm making for this.

RE: pre-processing:

In pywhispercpp/model.py we have transcribe, which can take a numpy ndarray. What I was thinking is: rather than load the audio, crush it to mono, and resample it to 16 kHz at transcription time, why not do all of that ahead of time and generate binary blob files that contain just the numpy ndarray and can be fed straight in?

It's not a big performance increase, but anything we can do outside of Python land ahead of time will give us a win. And I'm ok chasing micro-optimizations in Python land. I'm useless in C++ land.

Also, let's put all logging behind a flag so it can be disabled. If possible, let's add a flag to silence whisper.cpp's incessant logging to stderr. I know it has no impact on the transcription itself, but it should be controllable.
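One possible way to silence a native library's stderr output from Python is to redirect file descriptor 2 itself, since writes made by C/C++ code bypass Python's sys.stderr object. The sketch below is only an illustration under that assumption; redirect_native_stderr is a hypothetical helper name, not an existing pywhispercpp flag or API.

```python
import os
import sys
from contextlib import contextmanager


@contextmanager
def redirect_native_stderr(target_path=os.devnull):
    """Temporarily point file descriptor 2 at target_path.

    Hypothetical helper: unlike contextlib.redirect_stderr, this also
    catches writes made by native code that never touch sys.stderr.
    """
    sys.stderr.flush()                      # do not lose buffered Python output
    saved_fd = os.dup(2)                    # keep a handle on the real stderr
    target_fd = os.open(target_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND)
    try:
        os.dup2(target_fd, 2)               # fd 2 now writes to target_path
        yield
    finally:
        os.dup2(saved_fd, 2)                # restore the real stderr
        os.close(saved_fd)
        os.close(target_fd)
```

Wrapping a transcription call in `with redirect_native_stderr():` would hide the native logs for the duration of the block, and passing a file path instead of the default keeps them for later inspection.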

RE: copy.deepcopy

We need to drop @staticmethod everywhere and implement the deep copy methods on the C++ side. This is a minor request from me; it would just let us initialize the model in memory and create a deep copy that we can treat as a completely independent instance.

The other option is I can write a helper class using BytesIO to hold the model in memory and we can feed that to the Model class I guess? It would still be better than re-initializing the model to create a sterile instance.

RE: micro-optimizations

Under _get_segments we have assert end <= n, f"{end} > {n}: `End` index must be less or equal than the total number of segments", but I have to ask: is it even possible to end up in a situation where this assert would fail?

RE: features

Let's make the model usable as a context manager so we can do quick and dirty things like:

with Model("base.en", n_threads=6) as model:
    for segment in model.transcribe("file.mp3"):
        print(segment)

Not really necessary, just gives a more pleasant way of interacting with the model class.
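For illustration, context-manager support only needs __enter__ and __exit__. The sketch below wraps any model-like object; ManagedModel and its cleanup argument are hypothetical names introduced here, and nothing in it reflects how pywhispercpp implements (or would implement) the feature.

```python
class ManagedModel:
    """Hypothetical wrapper that adds `with` support to a model-like object."""

    def __init__(self, model, cleanup=None):
        self._model = model
        self._cleanup = cleanup          # optional callable that frees resources

    def __enter__(self):
        return self._model               # the value bound by `as`

    def __exit__(self, exc_type, exc, tb):
        if self._cleanup is not None:
            self._cleanup(self._model)   # runs even if the block raised
        return False                     # never swallow exceptions
```

The same two methods could live directly on the model class, which is what the `with Model(...) as model:` syntax above would require.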

@UsernamesLame
Contributor Author

Looked into the numpy array saving: https://numpy.org/doc/stable/reference/generated/numpy.save.html

We can save the converted audio files to disk before feeding them to the model. This way we technically bypass the need for PyDub and ffmpeg. It also means no launching background processes (PyDub with ffmpeg) to manipulate audio so it's ready for the model to ingest.

@absadiki
Owner

@UsernamesLame, Thanks for the ideas!

  • I don't think I understand the first point correctly; maybe some code will make it clear.
  • About logging: yes, it's annoying that logs are written to stderr. It's possible to add the flag, but it needs some tweaks.
  • copy.deepcopy: What's that for? You can create as many instances as you want! Maybe some code would be useful here as well.
  • _get_segments: Yes, it can happen if you request more segments than whisper.cpp actually generated.
  • The context manager: Good feature. I actually started on it back then, but I don't remember why it's not there 😅
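To make the _get_segments point concrete, here is a minimal sketch of the same bounds check written with an explicit exception. Python strips assert statements when run with -O, so a raised ValueError is the more robust form. The function name and signature are hypothetical and do not mirror the actual pywhispercpp code.

```python
def slice_segments(segments, start=0, end=None):
    """Return segments[start:end], rejecting an out-of-range request."""
    n = len(segments)
    if end is None:
        end = n
    if not 0 <= start <= end:
        raise ValueError(f"invalid range: start={start}, end={end}")
    if end > n:
        # The case the assert guards against: the caller asked for more
        # segments than were actually generated.
        raise ValueError(f"{end} > {n}: end must not exceed the number of segments")
    return segments[start:end]
```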

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 30, 2024

This is what I was trying to explain:

sound = AudioSegment.from_file(media_file_path)

sound = sound.set_frame_rate(constants.WHISPER_SAMPLE_RATE).set_channels(1)

samples = sound.get_array_of_samples()
arr = np.array(samples).T.astype(np.float32)
arr /= np.iinfo(samples.typecode).max

with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)

I haven't tested it yet, but the idea is to do all the operations we need on the numpy array ahead of time, and then later we can just do something like:

array = np.load("file.npy")

_transcribe(array)

This way we can mass process our audio files before we load them into memory for whisper to process.

@abdeladim-s hopefully this makes sense now! The idea is to batch process hundreds if not thousands of audio files ahead of time in parallel (I can write a script to do this for us) and save them in a format we can just load into the model and get transcriptions back from.

Yes, I know numpy should be fast, but the more context switches we can avoid, the better.
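The normalize-then-dump idea can be sketched without any audio decoding. The example below assumes the samples are already available as an integer numpy array (for instance int16 PCM) and uses an in-memory BytesIO buffer in place of a file on disk; pcm_to_float32 and roundtrip_npy are illustrative names, not library functions.

```python
import io

import numpy as np


def pcm_to_float32(samples: np.ndarray) -> np.ndarray:
    """Scale integer PCM samples to float32 in roughly [-1.0, 1.0]."""
    if not np.issubdtype(samples.dtype, np.integer):
        raise TypeError("expected integer PCM samples")
    scale = np.iinfo(samples.dtype).max      # 32767 for int16
    return samples.astype(np.float32) / scale


def roundtrip_npy(arr: np.ndarray) -> np.ndarray:
    """Save to the .npy format and load it back (in memory, for brevity)."""
    buffer = io.BytesIO()
    np.save(buffer, arr, allow_pickle=False)
    buffer.seek(0)
    return np.load(buffer, allow_pickle=False)
```

Because np.save writes the dtype and shape into a header, np.load restores the array exactly as it was saved, with no extra metadata needed.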

@absadiki
Owner

PyDub and ffmpeg are actually there for the conversion to numpy arrays!
If we already have numpy arrays, why would we need to save them to disk?

@UsernamesLame
Contributor Author

Pre-processing. The more context switches we can avoid, the better! Imagine transcribing thousands of files.

The current solution looks like this:

pywhispercpp -> PyDub -> ffmpeg -> PyDub -> pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp

With my proposal it would look more like this:

pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp

The goal isn't to make this a full replacement for the existing solution, but tomorrow I'll write a demo showing an alternative way to load data into the model, cutting out as many context switches as possible to gain some performance.

@absadiki
Owner

Okay, so the idea is to process a large number of files?
But I think it's the same, if not worse, once you take into account the overhead of saving the files to disk and loading them back. And you will need to wait for the conversion either way.
IO operations are slower than working in memory.

@UsernamesLame
Contributor Author

IO operations are generally cheaper than context switches. I'll test this unless you want to.

I can also read the files into memory, store them in a BytesIO object, and read from it like a filesystem object. There are a lot of ways this can be taken. But I genuinely believe that avoiding context switches > IO.

RE: deepcopy

You can create completely independent objects that are clones of existing objects. Think instead of myModel = Model, we do myModel = existingModel.deepclone(). So we don't read the model weights from disk again, but instead do an in memory copy.

@absadiki
Owner

Yes please, go ahead and test! Experiments and numbers will save us a lot of talk :)
Looking forward to it!

@UsernamesLame
Contributor Author

So I'm testing with a sample 33 MB mp3 and the results are promising. Pre-processing it into a numpy array and saving to disk shrinks it to 5.4 MB, so we can definitely have an impact on memory footprint with a helper script! Let me test transcription performance.

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 30, 2024

I have numbers for you @abdeladim-s!

Here's the script:

```py
from pywhispercpp.model import Model
import numpy as np
import time


def usenumpy():
    model = Model('base')
    audio_data = np.load("file.npy")
    segments = model.transcribe(audio_data)
    for segment in segments:
        print(segment)


def useaudiofile():
    model = Model('base')
    segments = model.transcribe("audio.mp3")
    for segment in segments:
        print(segment)


begin = time.time()
usenumpy()
end = time.time()
print("" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("\n" * 20)

begin = time.time()
useaudiofile()
end = time.time()
print("" * 20)
print(f"using mp3 file inished in {end - begin}")
print("\n" * 20)
```



Here's the results!


using raw numpy array finished in 2.6472320556640625



using mp3 file inished in 26.56456184387207



On an M1 Pro MBP with 16 GB of RAM, **not** using the Metal backend, using the **base** whisper model.

Told ya it would have an improvement on processing time to pre-process the audio files into Numpy arrays!

This computer has a memory bandwidth of 200GB/s, and disk bandwidth of around 4GB/s. Context switching costs more than just loading raw data into memory :)

I am going to chase every optimization I can like a dog chases its tail.

@UsernamesLame
Contributor Author

from pydub import AudioSegment
import numpy as np

sound = AudioSegment.from_file("audio.mp3")

sound = sound.set_frame_rate(1600).set_channels(1)
arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)

with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)

This is the pre-conversion script. I'm going to update WhisperWav to output numpy arrays that can be fed directly into the model.

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 30, 2024

So I tried pre-converting a few files. Most work, but at random, converting to an ndarray and saving seems to get completely mangled, and numpy.load then fails with UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: unexpected end of data.

If anyone has any idea why it's randomly mangling things I'd love help here.

Edit:

Yea I'm at a complete dead end as to why numpy insists on butchering audio files at random when saving. When it works, the speedups are insane. When it doesn't work, the errors are absolutely useless.

Edit 2:

I decided to see if Copilot could help. It suggested:

with open("audio.npy", "rb") as f:
    audio_data = np.fromfile(f, dtype=np.float32)

And so far it seems to be working?
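One thing worth checking here, offered as a general property of the two functions rather than a diagnosis of this particular error: np.fromfile does not parse the .npy header that np.save writes, so it returns the header bytes reinterpreted as samples in front of the real data. The small sketch below, using a synthetic array and a temporary file, shows the difference.

```python
import os
import tempfile

import numpy as np

# Synthetic stand-in for normalized audio samples.
arr = np.linspace(-1.0, 1.0, num=1000, dtype=np.float32)

path = os.path.join(tempfile.mkdtemp(), "audio.npy")
np.save(path, arr, allow_pickle=False)

loaded = np.load(path, allow_pickle=False)   # parses the .npy header
raw = np.fromfile(path, dtype=np.float32)    # treats every byte as sample data

# Number of bogus leading values produced by misreading the header.
header_items = raw.size - arr.size
```

If raw is fed to a model, those leading values are garbage prepended to the audio. np.fromfile pairs with arr.tofile, which writes no header, while np.load pairs with np.save.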

@UsernamesLame
Contributor Author

Ok so final comment for now. A 42m audio file at 101 MB, once crushed to mono and audio bitrate set to 1600khz, becomes a ~17 MB npy file.

Processing the npy file takes around 10 seconds. Processing the raw wav file takes around 63 seconds.

This doesn't seem like an error or unreasonable. Can someone else please try and reproduce? Are we literally spending that much time prepping the file?!

@absadiki
Owner

I still don't get what you are trying to achieve, but if I understand it correctly, it's basically the same as what I did, except that you are trying to dump and load the npy array, and you've made a deadly bug! lol

Also, when you ran the experiment, why didn't you measure the time needed to convert the files to npy? People are not moving around with dumped npy arrays of their media files 😅

Here is what I think this should be:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from pywhispercpp.model import Model
import numpy as np
import time
from pydub import AudioSegment


def usenumpy():
    # This part from your script should be included as well! ##########
    sound = AudioSegment.from_file("audio.mp3")
    # Here 16Khz not 1600 !!!! That's what you were doing wrong !!!
    sound = sound.set_frame_rate(16000).set_channels(1)
    arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    arr /= np.iinfo(np.int16).max  # Normalization is important! otherwise you will get 'utf-8' codec can't decode bytes
    # dump array to npy file
    with open("file.npy", "wb") as file:
        np.save(file, arr, allow_pickle=False)
    ####################
    # load model
    model = Model('base')
    # load array from npy file
    audio_data = np.load("file.npy")
    segments = model.transcribe(audio_data)
    for segment in segments:
        print(segment)

def useaudiofile():
    model = Model('base')
    segments = model.transcribe("audio.mp3")
    for segment in segments:
        print(segment)

begin = time.time()
usenumpy()
end = time.time()
print("*" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("*" * 20)

begin = time.time()
useaudiofile()
end = time.time()
print("*" * 20)
print(f"using mp3 file finished in {end - begin}")
print("*" * 20)

I used this file from my other project, here are the results:

[2024-08-30 17:34:17,168] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:33,929] {model.py:133} INFO - Inference time: 16.761 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using raw numpy array finished in 17.416718244552612
********************
[2024-08-30 17:34:34,516] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:50,128] {model.py:133} INFO - Inference time: 15.612 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using mp3 file finished in 16.196656465530396
********************

This is not a real experiment per se, but as you can see, they are almost the same. There is no need to dump and load the numpy array!

Lmk what you think!
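A comparison like this is easier to interpret when conversion and inference are timed separately. The harness below is a small sketch of that idea; convert and transcribe are caller-supplied stand-ins (for example a PyDub-based converter and a wrapper around model.transcribe), and neither name refers to an actual pywhispercpp function.

```python
import time


def timed(label, func, *args):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.3f} s")
    return result, elapsed


def benchmark(convert, transcribe, path):
    """Time conversion and inference separately for one input file."""
    audio, t_convert = timed("convert", convert, path)
    segments, t_infer = timed("transcribe", transcribe, audio)
    return {
        "convert": t_convert,
        "transcribe": t_infer,
        "total": t_convert + t_infer,
        "segments": segments,
    }
```

time.perf_counter is used instead of time.time because it is a monotonic, high-resolution clock intended for measuring intervals.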

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

I caught the deadly bug locally and fixed it locally.

As for performance, it's odd you aren't getting better results and I am.

I'm guessing it has something to do with the memory bandwidth of M1 Pro vs x86 chips?

But yea you're understanding now. I haven't tested it on x86. Also I didn't include the converting to numpy arrays because the idea is to mass transform it then transcribe.

At least one benefit is the numpy arrays are generally smaller in my experience.

What are your system specs btw? And Python version? I'm using 3.12 and getting good results.

If I can't increase performance I can at least lower memory usage I guess. 😅

My idea is to let the model be long-lived and keep feeding it fresh array dumps as it transcribes them one after another. This way, in a different process (I'm going to edit e action to show this), we can spawn subprocesses to mass convert media files to numpy arrays.

The idea is that the model is the limiting factor, as in most people don't have the CPU / RAM to load 2 - 4 models, so if we can pre-process the files so the model can transcribe faster with less memory, it's still a (small) win!

I have access to a 128-core ARM box that is piss slow at transcribing but can quickly spit out these numpy arrays.

It's not gonna benefit everyone, but it's worth exploring the thought. It's also possible to store all the numpy arrays in a single database that clients running the models pull from to transcribe, creating a transcription cluster. The big benefit is that the clients can be small, like a Raspberry Pi, and still get considerably faster transcriptions.
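The long-lived-model idea can be sketched as a small pipeline in which conversions are submitted to a pool of workers while a single consumer transcribes results in order. Everything here is illustrative: transcribe_all, convert, and transcribe are hypothetical names, and the real conversion and model calls would be supplied by the caller.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_all(paths, convert, transcribe, executor):
    """Convert files on the executor while one long-lived model consumes them.

    convert(path) -> audio and transcribe(audio) -> segments are stand-ins
    supplied by the caller; neither is a pywhispercpp API.
    """
    # Submit every conversion up front so the workers run ahead of the model.
    futures = [(path, executor.submit(convert, path)) for path in paths]
    results = {}
    for path, future in futures:
        audio = future.result()          # blocks only if this file is not ready yet
        results[path] = transcribe(audio)
    return results


if __name__ == "__main__":
    # Toy stand-ins: "conversion" upper-cases the name, "transcription" measures it.
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(transcribe_all(["a.mp3", "b.mp3"], str.upper, len, pool))
```

A ProcessPoolExecutor can be passed instead when the conversion is CPU-bound, provided convert is a picklable module-level function.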

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

I'm running a few more tests, including ensuring the numpy arrays produce the same results as the mp3, mostly because I can't believe that after crushing the frame rate, the response frequency, and the channels, I can go from 100mb to 70mb.

As of now, numpy is getting me 17 seconds while mp3 is 69 seconds. Timing the conversion to a numpy array gets me 5 seconds. So 21 seconds vs 69.

The performance gap has shrunk, but it's not gone. It's still ~3x faster to pre-process numpy arrays and then load them. I'm not saying everyone should, but it would make a fun example!

Edit:

I forgot to mention, I got us added to Whisper.cpp's README.md :)

ggerganov/whisper.cpp#2396

Merged already. I felt like we were ready for more visibility.

@absadiki
Owner

  • I have an i7 8c/16t with 32 GB DDR4, running Python 3.10. When I tested the provided code with a sample rate of 1600, I got results similar to yours, which is expected because that's effectively 10x down-sampling. But once I fixed it, the timings were almost the same. It's the same algorithm running under the hood anyway!

  • I can see the benefits of batch pre-processing, and this is exactly why I made the transcribe function accept both an audio file and a numpy array. If you want something quick, you can throw in whatever file and the library will convert it for you. If you are a power user and you know what you are doing, you can pass numpy arrays directly, in which case the pre-processing step is skipped! I think from a library point of view this gives users more flexibility!
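For the "power user" path, a small helper can normalize an array before it is handed to transcribe. This is a sketch, assuming the array should look like what the conversion scripts in this thread produce (mono, float32, scaled by the int16 maximum). `prepare_for_transcribe` is a hypothetical name, not part of pywhispercpp, and it cannot check the sample rate, which still has to be 16000.

```python
import numpy as np


def prepare_for_transcribe(samples: np.ndarray) -> np.ndarray:
    """Return a 1-D float32 array, scaling integer PCM to roughly [-1, 1]."""
    arr = np.asarray(samples)
    if np.issubdtype(arr.dtype, np.integer):
        # Integer PCM: divide by the dtype's maximum (32767 for int16)
        arr = arr.astype(np.float32) / np.iinfo(arr.dtype).max
    else:
        arr = arr.astype(np.float32)
    if arr.ndim == 2:
        # Assumed layout: (n_samples, n_channels); average channels to mono
        arr = arr.mean(axis=1)
    elif arr.ndim != 1:
        raise ValueError(f"expected 1-D or 2-D audio, got {arr.ndim}-D")
    return np.ascontiguousarray(arr, dtype=np.float32)
```

The result can then be passed the same way the scripts in this thread do, i.e. `model.transcribe(prepare_for_transcribe(raw))`.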

@absadiki
Owner

  • You can't tell from one example! You have to test multiple times and average the results. It's the same algorithm I used, so you should get basically the same results, unless there is some magic in dumping and loading the npy files.

  • Oh, I just noticed you made a PR for this, you really think we are ready?!
    It's a small project, does not deserve that visibility 😅 But Thanks anyways!
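The "test multiple times and average" advice can be sketched with a tiny timing helper. `time_runs` is a hypothetical name; the callable passed in would wrap whatever is being compared (for example transcribing from a file vs. from a pre-converted array).

```python
import statistics
import time
from typing import Callable, Dict


def time_runs(fn: Callable[[], object], repeats: int = 10, warmup: int = 1) -> Dict[str, float]:
    """Call fn several times and summarize the wall-clock timings."""
    for _ in range(warmup):
        fn()  # warm caches / lazy initialization; not included in the stats
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "runs": float(len(samples)),
    }
```

Reporting the spread alongside the mean makes it easier to tell a real gap from run-to-run noise.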

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

Re testing: I know one test isn't enough, but still it's promising!

Re pywhispercpp: It 100% deserves the visibility!

Also I double checked I'm using 16000 locally, and:

 ********************
using raw numpy array finished in 10.000927925109863
********************

That's still a pretty drastic difference. Also, when I accidentally did it with 1600, there was no real drop in accuracy on simpler audio files.

@absadiki
Owner

  • I don't think there should be a drastic difference, as long as you are using the same algorithm as _load_audio.
  • If you have numpy arrays you can pass them through the transcribe function without any problem. As I said, the pre-processing step won't be executed!
  • Or maybe I am wrong and I missed something, and I need to make an optimization somewhere!

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

Let's put my numpy theories to the test. I'm going to crush around 6h of audio into numpy arrays and transcribe it.

It's really down to batch processing and pre-normalizing the numpy arrays making a very big difference on ARM (M1 Pro). I'm going to test feeding around 7.5h of audio into it and post the results.

Edit:

Just over 6gb of files converted into numpy arrays in 33 seconds. Time to transcribe!

Edit 2:

Whisper just spat out some debug logs. 174 seconds to transcribe 1h of audio with normalized numpy arrays!

Extrapolating this, it should take 17 minutes to transcribe >6h of audio. Let's see what actually happens, as whisper spat out another debug log saying it finished in 147 seconds.

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

********************
using raw numpy array finished in 1105.0404160022736
********************

neat!

Edit:

We have an initial number for the 1gb wav files: 187s. Extrapolating again, 1122 seconds or 18 minutes.

So far the speed up isn't that promising, but the next check should be memory usage!

@UsernamesLame
Contributor Author

https://github.com/EtienneAb3d/WhisperHallu?tab=readme-ov-file

I found this, a project about optimizing for whisper!

@absadiki
Owner

Interesting result!

@absadiki
Owner

Sounds great, I'll take a look

@UsernamesLame
Contributor Author

********************
using wav finished in 1575.6269478797913
********************

ouch!

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

26 minutes for raw wav files, about 18 minutes with numpy arrays.
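To put the two timings side by side, here is the arithmetic using the figures posted in this thread. One assumption is labeled in the code: that the 33-second batch conversion reported earlier applies to this same set of files.

```python
# Timings reported in this thread (seconds)
wav_seconds = 1575.6269478797913
numpy_seconds = 1105.0404160022736
# Assumption: the 33 s batch conversion reported earlier covers the same files
conversion_seconds = 33.0

wav_minutes = wav_seconds / 60
numpy_minutes = numpy_seconds / 60

# Speed-up when only the transcription step is compared
speedup_transcribe_only = wav_seconds / numpy_seconds
# Speed-up when the one-off conversion cost is charged to the numpy path
speedup_end_to_end = wav_seconds / (numpy_seconds + conversion_seconds)

print(f"wav:   {wav_minutes:.1f} min")
print(f"numpy: {numpy_minutes:.1f} min")
print(f"speed-up (transcribe only): {speedup_transcribe_only:.2f}x")
print(f"speed-up (incl. conversion): {speedup_end_to_end:.2f}x")
```

Either way the gap is roughly 1.4x on this one run, which is noticeable but much smaller than the ~3x seen on the single mp3 earlier.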

I think we have a winner? Opinion?

Next test will be memory usage I guess.

@absadiki
Owner

Interesting! I think it's because of the parallel pre-conversion of the files to numpy. For a small number of files, this won't have a huge effect!
But I have an idea: if you can replicate the same on Colab, that will give us a clear view of what's really happening in a fresh environment!

@UsernamesLame
Contributor Author

UsernamesLame commented Aug 31, 2024

I've never used colab before, so here's the code.

from pywhispercpp.model import Model
import numpy as np
import time
import os
from glob import glob

model = Model('base')

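def usenumpy():
    # The arrays were written with np.save (see the conversion script below),
    # so read them back with np.load; np.fromfile would also read the .npy
    # header bytes as if they were samples.
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".npy")]
    for file in files:
        audio_data = np.load(file, allow_pickle=False)
        numpy_segments = model.transcribe(audio_data)


def usewav():
    # Let the library decode and convert each WAV file itself
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".wav")]
    for file in files:
        wav_segments = model.transcribe(file)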

begin = time.time()
usewav()
end = time.time()
print("*" * 20)
print(f"using wav finished in {end - begin}")
print("*" * 20)

I used cobalt.tools to download a 1.5h video's audio from YouTube as a WAV, then converted it with this:

from pydub import AudioSegment
import numpy as np
from glob import glob
import os
import time

begin = time.time()


files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".npy", ".md", ".txt", ".py", ".cfg"))]

for file in files:
    sound = AudioSegment.from_file(file)

    sound = sound.set_frame_rate(16000).set_channels(1)
    numpy_array = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    numpy_array /= np.iinfo(np.int16).max

    with open(f"{file}.npy", "wb") as f:
        np.save(f, numpy_array, allow_pickle=False)

end = time.time()
print(f"{end - begin} seconds elapsed")

I feel like it should be ok to feed it the same audio file 6 times to get a general idea as it seems like whisper performs worse with each pass, not better.

If you want to make a Colab / Jupyter notebook, I'll gladly poke around with you. My theory is that the audio files being massive is causing the issue. The numpy arrays I save to disk are much smaller by comparison. The .wav is around 1gb, the .npy is around 393mb.
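The size difference is in the same ballpark as what uncompressed PCM arithmetic predicts. A rough sketch, assuming the downloaded WAV is 44.1 kHz, stereo, 16-bit (the thread does not state its format) and using the 1.5h duration mentioned above:

```python
def pcm_bytes(seconds: float, sample_rate: int, channels: int, bytes_per_sample: int) -> int:
    """Payload size of uncompressed audio, ignoring file headers."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


duration = 1.5 * 3600  # 1.5 hours, as mentioned above

# Assumed source format: 44.1 kHz, stereo, 16-bit samples
wav_bytes = pcm_bytes(duration, 44_100, 2, 2)
# Converted format used in this thread: 16 kHz, mono, float32
npy_bytes = pcm_bytes(duration, 16_000, 1, 4)

print(f"wav: {wav_bytes / 1e6:.0f} MB, npy: {npy_bytes / 1e6:.0f} MB")
```

So dropping to mono 16 kHz more than offsets the doubling from 16-bit integers to 32-bit floats.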

Anyways, for now I must say goodnight my friend! Don't let the geese bite!

@absadiki
Owner

So, the large files are causing the issue?! Probably!
But I am still confused about why convert -> save -> load -> transcribe is faster than convert -> transcribe.
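On the save -> load leg specifically, one detail is worth checking when reproducing this: `np.save` writes a small header before the data, so it pairs with `np.load`, while `ndarray.tofile` writes raw bytes and pairs with `np.fromfile`. Mixing the two silently adds garbage samples. A minimal sketch (`roundtrip` is a hypothetical helper name):

```python
import os

import numpy as np


def roundtrip(audio: np.ndarray, directory: str):
    """Write the same float32 array both ways and read it back."""
    npy_path = os.path.join(directory, "audio.npy")
    raw_path = os.path.join(directory, "audio.raw")

    np.save(npy_path, audio, allow_pickle=False)  # header + data
    audio.tofile(raw_path)                        # data only, no dtype/shape info

    from_npy = np.load(npy_path, allow_pickle=False)
    from_raw = np.fromfile(raw_path, dtype=np.float32)
    # Mismatched pair: the .npy header bytes are interpreted as extra samples
    mismatched = np.fromfile(npy_path, dtype=np.float32)
    return from_npy, from_raw, mismatched
```

If the load step only needs part of a large file, `np.load(path, mmap_mode="r")` avoids reading the whole array into memory at once.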

Anyways, good luck with your exploration, and let me know if you find any optimizations we can add to the repo.
Goodnight :)

@BBC-Esq

BBC-Esq commented Sep 6, 2024

Wait...so you stopped llama.cpp from running and pydub sped up by 5 seconds but av did not? This is getting stranger by the minute. lol

@UsernamesLame
Contributor Author

UsernamesLame commented Sep 6, 2024

Ignore all the prior numbers. I'm re-running the tests

numpy==2.1.0
pydub==0.25.1
av==13.0.0

pydub_to_numpy took 0.260052 seconds
Pydub conversion took 13.436505 seconds
av_to_numpy took 18.294100 seconds
AV conversion took 18.299570 seconds

numpy==1.26.4
pydub==0.25.1
av==13.0.0

pydub_to_numpy took 0.164001 seconds
Pydub conversion took 13.522791 seconds
av_to_numpy took 18.088221 seconds
AV conversion took 18.093782 seconds

@BBC-Esq

BBC-Esq commented Sep 6, 2024

Hmm...I have no explanation for your numbers and mine too...I'm re-running it multiple times and I get the same thing. Might be beyond my technical expertise. I'm happy to do a screenshare with anyone who wants to actually see it happen, but barring that...just not sure...

@UsernamesLame
Contributor Author

Ok so numpy 1.0 branch has a slight advantage over 2.0. The changes in performance between pydub and av are all over the place based on platform / CPU.

Do you wanna do a /10 test on your machine?

@BBC-Esq

BBC-Esq commented Sep 6, 2024

What's a "/10" test?

@UsernamesLame
Contributor Author

UsernamesLame commented Sep 6, 2024

def benchmark(input_file):

    pydub_times = []
    av_times = []

    for i in range(0, 10):
        converter = AudioConverter(input_file)
        
        pydub_time = converter.convert_pydub()
        pydub_times.append(pydub_time)
        
        av_time = converter.convert_av()
        av_times.append(av_time)


    pydub_avg_time = sum(pydub_times) / len(pydub_times)
    av_avg_time = sum(av_times) / len(av_times)        
    print(f"Average Pydub conversion time: {pydub_avg_time:.6f} seconds")
    print(f"Average AV conversion time: {av_avg_time:.6f} seconds")

Fixed it, re-running. My focus is split, sorry about the mistakes.

@UsernamesLame
Contributor Author

@BBC-Esq

Average Pydub conversion time: 12.883802 seconds
Average AV conversion time: 18.154749 seconds

What are your results? This is with the numpy 2.0 branch.

@absadiki
Owner

absadiki commented Sep 6, 2024

@BBC-Esq,

Here are my results again:

pydub_to_numpy took 0.209830 seconds
Pydub conversion took 16.383211 seconds
av_to_numpy took 20.261770 seconds
AV conversion took 20.263226 seconds
Script used
import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def pydub_to_numpy():
            return np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max

        audio_array = pydub_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def av_to_numpy():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            # Concatenate all frames into a single numpy array, convert to float32, and normalize
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames]).astype(np.float32) / np.iinfo(np.int16).max

        audio_array = av_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

def benchmark(input_file):
    converter = AudioConverter(input_file)

    pydub_time = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")

    av_time = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

if __name__ == "__main__":
    input_file = r"sam_altman_lex_podcast_367.flac"
    benchmark(input_file)

Running Linux, everything is installed from scratch manually, killed all not necessary background processes before running the script. Here is the pip freeze:

av==13.0.0
numpy==1.26.4
pydub==0.25.1

  • It seems @UsernamesLame and I have similar results, with Pydub being faster. BTW, we're not looking for an exact match in numbers, or for who is right or wrong; we're just benchmarking which library is faster for pre-processing.

  • So, @BBC-Esq, pyav is likely very optimized (or maybe using some sort of acceleration) on your powerful CPU.

@absadiki
Owner

absadiki commented Sep 6, 2024

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.

Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds
Here is the script
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile


def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result

    return wrapper


class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        # @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())

        samples = np_array_conversion()

        # @timeit
        def np_float_conversion():
            return samples.astype(np.float32)

        audio_array = np_float_conversion()

        # @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]

        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        # @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)

            if not audio_frames:
                return np.array([])

            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])

        audio_array = get_array_of_samples()

        # @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)

        audio_array = np_float_conversion(audio_array)

        # @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max

        audio_array = np_normalization(audio_array)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_ffmpeg(self):
        def to_np(file_path):
            with open(file_path, 'rb') as f:
                # Skip the WAV header (assumes the canonical 44-byte layout with no extra chunks)
                header = f.read(44)
                raw_data = f.read()
                samples = np.frombuffer(raw_data, dtype=np.int16)
            audio_array = samples.astype(np.float32) / np.iinfo(np.int16).max
            return audio_array

        start_time = time.perf_counter()
        if self.input_file.endswith('.wav'):
            res = to_np(self.input_file)
        else:
            temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
            temp_file_path = temp_file.name
            temp_file.close()
            try:
                subprocess.run([
                    'ffmpeg', '-i', self.input_file, '-ac', '1', '-ar', '16000',
                    temp_file_path, '-y'
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                res = to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)

        end_time = time.perf_counter()
        return end_time - start_time, res


def benchmark(input_file):
    converter = AudioConverter(input_file)
    pydub_time, pydub_array = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    av_time, av_array = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")
    numpy_time, numpy_array = converter.convert_ffmpeg()
    print(f"Raw FFMPEG conversion took {numpy_time:.6f} seconds")


if __name__ == "__main__":
    input_file = "sam_altman_lex_podcast_367.flac"
    benchmark(input_file)

@BBC-Esq, @UsernamesLame, please give it a try and let me know your results.
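A possible variation on the script above, offered as a sketch rather than something benchmarked in this thread: ffmpeg can write raw 16-bit PCM to stdout, which avoids both the temporary file and the WAV header parsing. The function names here are hypothetical, and it assumes `ffmpeg` is on PATH.

```python
import subprocess

import numpy as np


def ffmpeg_command(path: str, sample_rate: int = 16000) -> list:
    """Build an ffmpeg call that emits mono, signed 16-bit little-endian PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-i", path,
        "-f", "s16le", "-acodec", "pcm_s16le",
        "-ac", "1", "-ar", str(sample_rate),
        "-",  # "-" means: write the output to stdout
    ]


def pcm16_to_float32(raw: bytes) -> np.ndarray:
    """Convert raw little-endian int16 PCM bytes to float32, same scaling as above."""
    samples = np.frombuffer(raw, dtype="<i2")
    return samples.astype(np.float32) / np.iinfo(np.int16).max


def load_audio_via_pipe(path: str, sample_rate: int = 16000) -> np.ndarray:
    """Decode any ffmpeg-readable file straight into a numpy array."""
    proc = subprocess.run(
        ffmpeg_command(path, sample_rate),
        check=True, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
    )
    return pcm16_to_float32(proc.stdout)
```

Whether this is actually faster than the temp-file version would need the same kind of repeated timing as the other methods.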

@BBC-Esq

BBC-Esq commented Sep 6, 2024

Yep, exact numbers are something you'd only publish after extensive and repetitive benchmarking, not like the informal discussion we're having here...

It's possible that it's something on my computer as well...perhaps I installed intelMKL way back in the day and av somehow leverages it while pydub doesn't...At this point who knows and I'm out of ideas of how to troubleshoot the issue. I know my benchmarks are reliable, don't doubt that your guys' are as well...but if we can't replicate it's probably not advisable to change libraries unless we could be certain. Something to remember though...Sort of like when I benchmarked GGUF files like I told you on the video call...

Q4 running faster than Q3...and then Q3_k_m (or whatever the current naming scheme is) using more VRAM than Q4...This was the behavior with llama-based models...but then it's reversed or we see an entirely new behavior when benching mistral-based models. lol.

BTW, I also benchmarked solely the beam_size parameter using the ctranslate2 library with "chat models"...GET THIS...one of my favorite models (neural chat) had a curved decline in tokens per second as beam size increased (i.e. more compute required)...yet at beam size 4 - AND ONLY 4 - its tokens/second went up 25%. No other model did this and I tested and re-tested dozens of times. lol.

Takeaways...

  1. Always take all benchmarks with a grain of salt because you may not know all the parameters/details of the testing...
  2. ditto regarding hardware setups...background tasks, etc...even professionals with a dedicated benchmarking computer can experience differences...there's also the "silicon lottery" to contend with...
  3. There can always be factors that the tester, despite good intentions, just isn't aware of (if anyone is at all).
  4. Benchmarking is fun and I had fun delving into this interesting issue! 😄

@BBC-Esq

BBC-Esq commented Sep 6, 2024

Did I not already suggest directly using FFMPEG - OMG! I thought I did...but I thought you wanted a library that bundles it for the sake of simplicity. lol. Wait, I kinda did above when I commented on the average user probably not wanting to mess with PATH...

Anyways, here's the guy who informed me of this issue first...

shashikg/WhisperS2T#40 (comment)

BTW, if you're looking to implement true batching I highly recommend you check out his repository. His pipeline for implementing ctranslate2 is awesome-sauce.

@UsernamesLame I'll send you an invite to the whisper benchmarking private repo I created. Feel free to participate, just observe or not join at all at your pleasure.

Will bench the FFMPEG directly but turn to other issues after that. 😉

EDIT:

BTW, I believe FFMPEG's binary is written in C so yes it'll be faster...but it would be fun to compare it to Rust, which is much easier to incorporate for people only familiar with Python like myself, but unfortunately Rust doesn't have anything remotely close to FFMPEG's comprehensiveness...

@BBC-Esq

BBC-Esq commented Sep 6, 2024

The "never wrong" 😉 Claude identified a few issues:

  1. The timing in convert_ffmpeg is not consistent with the other methods. You're including the file I/O operations in the timing, which isn't done in the other methods.

  2. Missing @timeit decorator: To be consistent with the other methods, you should use the @timeit decorator for the actual conversion part.

  3. Return value: The convert_ffmpeg method returns both the time and the result, while the others only return the time.

After addressing those issues I received the following impressive results for FFMPEG:

pydub_to_numpy took 0.304464 seconds
Pydub conversion took 17.602802 seconds
av_to_numpy took 8.928977 seconds
AV conversion took 8.931782 seconds
ffmpeg_to_numpy took 2.070266 seconds
FFmpeg conversion took 2.070527 seconds

I used this script:

SCRIPT HERE
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)
        
        @timeit
        def pydub_to_numpy():
            return np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max
        
        audio_array = pydub_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )
        
        @timeit
        def av_to_numpy():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames]).astype(np.float32) / np.iinfo(np.int16).max
        
        audio_array = av_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

    def convert_ffmpeg(self):
        start_time = time.perf_counter()

        @timeit
        def ffmpeg_to_numpy():
            if self.input_file.endswith('.wav'):
                return AudioConverter.to_np(self.input_file)
            else:
                with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                    temp_file_path = temp_file.name
                try:
                    subprocess.run([
                        'ffmpeg', '-i', self.input_file, '-ac', '1', '-ar', '16000',
                        temp_file_path, '-y'
                    ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                    return AudioConverter.to_np(temp_file_path)
                finally:
                    os.remove(temp_file_path)

        audio_array = ffmpeg_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

    @staticmethod
    def to_np(file_path):
        with open(file_path, 'rb') as f:
            # Skip the WAV header (assumes the canonical 44-byte layout with no extra chunks)
            header = f.read(44)
            raw_data = f.read()
            samples = np.frombuffer(raw_data, dtype=np.int16)
        audio_array = samples.astype(np.float32) / np.iinfo(np.int16).max
        return audio_array

def benchmark(input_file):
    converter = AudioConverter(input_file)
    
    pydub_time = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    
    av_time = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")
    
    ffmpeg_time = converter.convert_ffmpeg()
    print(f"FFmpeg conversion took {ffmpeg_time:.6f} seconds")

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
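One caveat on `to_np` in the script above: skipping a fixed 44 bytes assumes a minimal WAV header, but the header can be longer (for example when a metadata chunk precedes the audio data), in which case a few header bytes get read as samples. A sketch of an alternative using the standard-library `wave` module, which walks the chunks itself; `wav_to_float32` is a hypothetical name and it only handles 16-bit PCM:

```python
import wave

import numpy as np


def wav_to_float32(path: str):
    """Read a 16-bit PCM WAV file and return (mono float32 samples, sample rate)."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.readframes(wf.getnframes())  # audio data only, header excluded

    samples = np.frombuffer(frames, dtype="<i2")
    if channels > 1:
        # Frames are interleaved; average the channels down to mono
        samples = samples.reshape(-1, channels).mean(axis=1)
    audio = samples.astype(np.float32) / np.iinfo(np.int16).max
    return audio, rate
```

Returning the sample rate also makes it possible to notice when a `.wav` input is not already at 16000 Hz, which the `.endswith('.wav')` shortcut does not check.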

[EDIT]

When I altered the script to keep track of all time - including I/O operations - I received these results:

convert_pydub took 17.707413 seconds
convert_av took 9.034201 seconds
convert_ffmpeg took 2.159566 seconds

Results are basically the same.

Second script here:

SCRIPT INCLUDING I/O IN TIME
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    @timeit
    def convert_pydub(self):
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)
        return np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max

    @timeit
    def convert_av(self):
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )
        
        audio_frames = []
        for frame in container.decode(audio):
            resampled_frames = resampler.resample(frame)
            for resampled_frame in resampled_frames:
                audio_frames.append(resampled_frame)
        if not audio_frames:
            return np.array([])
        return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames]).astype(np.float32) / np.iinfo(np.int16).max

    @timeit
    def convert_ffmpeg(self):
        if self.input_file.endswith('.wav'):
            return AudioConverter.to_np(self.input_file)
        else:
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_file_path = temp_file.name
            try:
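                # '-y' (overwrite the existing temp file) is placed before the
                # input so it is not parsed as a trailing option after the output.
                subprocess.run([
                    'ffmpeg', '-y', '-i', self.input_file, '-ac', '1', '-ar', '16000',
                    temp_file_path
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)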
                return AudioConverter.to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)

    @staticmethod
    def to_np(file_path):
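        with open(file_path, 'rb') as f:
            # Skip the WAV header. This assumes the canonical 44-byte PCM header;
            # files with extra chunks (e.g. metadata) before the audio data will
            # leave a few stray bytes at the start of the sample buffer.
            f.seek(44)
            raw_data = f.read()
        samples = np.frombuffer(raw_data, dtype=np.int16)
        return samples.astype(np.float32) / np.iinfo(np.int16).max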

def benchmark(input_file):
    converter = AudioConverter(input_file)
    
    pydub_array = converter.convert_pydub()
    av_array = converter.convert_av()
    ffmpeg_array = converter.convert_ffmpeg()

    # Optional: You can add checks here to ensure all methods produce similar results
    # print(f"Pydub array shape: {pydub_array.shape}")
    # print(f"AV array shape: {av_array.shape}")
    # print(f"FFmpeg array shape: {ffmpeg_array.shape}")

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\sam_altman_lex_podcast_367.flac"
    benchmark(input_file)

@absadiki
Owner

absadiki commented Sep 6, 2024

Yep, exact numbers is something you'd only publish after extensive and repetitive benchmarking, not like the informal discussion we're having here...

It's possible that it's something on my computer as well...perhaps I installed intelMKL way back in the day and av somehow leverages it while pydub doesn't...At this point who knows and I'm out of ideas of how to troubleshoot the issue. I know my benchmarks are reliable, don't doubt that your guys' are as well...but if we can't replicate it's probably not advisable to change libraries unless we could be certain. Something to remember though...Sort of like when I benchmarked GGUF files like I told you on the video call...

Q4 running faster than Q3...and then Q3_k_m (or whatever the current naming scheme is) using more VRAM than Q4...This was the behavior with llama-based models...but then it's reversed or we see an entirely new behavior when benching mistral-based models. lol.

BTW, I also benchmarked solely the beam_size parameter using the ctranslate2 library with "chat models"...GET THIS...one of my favorite models (neural chat) had a curved decline in tokens per second as beam size increased (i.e. more compute required)...yet at beam size 4 - AND ONLY 4 - its tokens/second went up 25%. No other model did this and I tested and re-tested dozens of times. lol.

Takeaways...

  1. Always take all benchmarks with a grain of salt because you may not know all the parameters/details of the testing...
  2. ditto regarding hardware setups...background tasks, etc...even professionals with a dedicated benchmarking computer can experience differences...there's also the "silicon lottery" to contend with...
  3. There can always be factors that the tester, despite good intentions, just isn't aware of (if anyone is at all).
  4. Benchmarking is fun and I had fun delving into this interesting issue! 😄

Well said, I couldn't agree more. 👍
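
In the spirit of the takeaways above, one cheap way to make informal timings a little more robust is to repeat each measurement and report the median instead of a single run. A minimal sketch, assuming a hypothetical helper named `bench` (it is not part of any script posted in this thread):

```python
import statistics
import time


def bench(func, *args, repeats=5, warmup=1, **kwargs):
    """Time func(*args, **kwargs) several times; return (median, min, max) in seconds."""
    for _ in range(warmup):
        func(*args, **kwargs)  # warm caches, lazy imports, etc.; result is discarded
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), min(timings), max(timings)


if __name__ == "__main__":
    # Stand-in workload; swap in a conversion function to compare backends.
    median, fastest, slowest = bench(sum, range(1_000_000), repeats=5)
    print(f"median {median:.6f}s (min {fastest:.6f}s, max {slowest:.6f}s)")
```

A wide gap between min and max is a hint that background load or disk caching is affecting the numbers.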

@absadiki
Owner

absadiki commented Sep 6, 2024

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.
Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds

Here is the script
@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

Did I not already suggest directly using FFMPEG - OMG! I thought I did...but I thought you wanted a library that bundles it for the sake of simplicity. lol. Wait, I kinda did above when I commented on the average user probably not wanting to mess with PATH...

Anyways, here's the guy who informed me of this issue first...

shashikg/WhisperS2T#40 (comment)

BTW, if you're looking to implement true batching I highly recommend you check out his repository. His pipeline for implementing ctranslate2 is awesome-sauce.

@UsernamesLame I'll send you an invite to the whisper benchmarking private repo I created. Feel free to participate, just observe or not join at all at your pleasure.

Will bench the FFMPEG directly but turn to other issues after that. 😉

EDIT:

BTW, I believe FFMPEG's binary is written in C so yes it'll be faster...but it would be fun to compare it to Rust, which is much easier to incorporate for people only familiar with Python like myself, but unfortunately Rust doesn't have anything remotely close to FFMPEG's comprehensiveness...

Why would I stick with a library if ffmpeg delivers better results? Lol.
I just didn't expect such a huge difference to be honest, I knew they were all wrappers around ffmpeg, so I assumed there wouldn't be a big gap. But a 2-3x difference is significant!

@absadiki
Owner

absadiki commented Sep 6, 2024

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.
Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds

Here is the script
@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

The "never wrong" 😉 Claude identified a few issues:

  1. The timing in convert_ffmpeg is not consistent with the other methods. You're including the file I/O operations in the timing, which isn't done in the other methods.
  2. Missing @timeit decorator: To be consistent with the other methods, you should use the @timeit decorator for the actual conversion part.
  3. Return value: The convert_ffmpeg method returns both the time and the result, while the others only return the time.

After addressing those issues I received the following impressive results for FFMPEG:

pydub_to_numpy took 0.304464 seconds
Pydub conversion took 17.602802 seconds
av_to_numpy took 8.928977 seconds
AV conversion took 8.931782 seconds
ffmpeg_to_numpy took 2.070266 seconds
FFmpeg conversion took 2.070527 seconds

"Claude is wrong!" Or maybe you didn’t use the script I provided!

  1. Even though you don't see it, I/O operations are included in the libraries.
  2. I commented out all the timeit wrappers—the timing is only for the intermediate functions using numpy, which I'm not interested in! The timing for all methods is consistent!
  3. All functions return both the time and the array; I just re-checked!

Please don’t rely entirely on LLMs, especially for coding. Read the script yourself first!


Anyway, ffmpeg is fast on your computer too, which is great.
Hopefully, @UsernamesLame gets the same results.

I'll make the necessary changes and remove Pydub.

@BBC-Esq

BBC-Esq commented Sep 6, 2024

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.
Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds

Here is the script
@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

The "never wrong" 😉 Claude identified a few issues:

  1. The timing in convert_ffmpeg is not consistent with the other methods. You're including the file I/O operations in the timing, which isn't done in the other methods.
  2. Missing @timeit decorator: To be consistent with the other methods, you should use the @timeit decorator for the actual conversion part.
  3. Return value: The convert_ffmpeg method returns both the time and the result, while the others only return the time.

After addressing those issues I received the following impressive results for FFMPEG:

pydub_to_numpy took 0.304464 seconds
Pydub conversion took 17.602802 seconds
av_to_numpy took 8.928977 seconds
AV conversion took 8.931782 seconds
ffmpeg_to_numpy took 2.070266 seconds
FFmpeg conversion took 2.070527 seconds

"Claude is wrong!" Or maybe you didn’t use the script I provided!

1. Even though you don't see it, I/O operations are included in the libraries.

2. I commented out all the timeit wrappers—the timing is only for the intermediate functions using numpy, which I'm not interested in! The timing for all methods is consistent!

3. All functions return both the time and the array; I just re-checked!

Please don’t rely entirely on LLMs, especially for coding. Read the script yourself first!

Anyway, ffmpeg is fast on your computer too, which is great. Hopefully, @UsernamesLame gets the same results.

I'll make the necessary changes and remove Pydub.

That's one option. Plenty of repositories require FFMPEG as a dependency, but users have to install it and add it to PATH, which is not feasible for some users (like I used to be just over a year ago)...they don't even know what a PATH is.

Another option is to do what WhisperS2T does, which is allow FFMPEG...the passing of straight numpy arrays if a user so chooses...and uses a library. Either way, the pipeline handles the particular kind of input and eventually it's all converted to a numpy array anyways.

Regarding Claude...apparently you don't know me enough to recognize my sarcasm. 😉

...and yes...I just used my script because i did the benchmarking in 5 minutes and I didn't want to create yet another python file...that's on me. 😉 didn't change the results though.

@absadiki
Owner

absadiki commented Sep 6, 2024

That's one option. Plenty of repositories require FFMPEG as a dependency, but users have to install it and add it to PATH, which is not feasible for some users (like I used to be just over a year ago)...they don't even know what a PATH is.

Are you kidding? With that level of expertise and coding, you didn’t know about the PATH until a year ago? No way, lol!

Another option is to do what WhisperS2T does, which is allow FFMPEG...the passing of straight numpy arrays if a user so chooses...and uses a library. Either way, the pipeline handles the particular kind of input and eventually it's all converted to a numpy array anyways.

Yes, that was already the case, the transcribe function accepts a numpy array. The pre-processing step is just for the actual media files. All the libraries require ffmpeg to be on PATH anyway; they don’t ship it with the build. So, there's no need for any third-party library anymore.

Regarding Claude...apparently you don't know me enough to recognize my sarcasm. 😉

...and yes...I just used my script because i did the benchmarking in 5 minutes and I didn't want to create yet another python file...that's on me. 😉 didn't change the results though.

Okay, I got it! 😆
Yes, the results match this time, which is great!

@BBC-Esq

BBC-Esq commented Sep 6, 2024

That's one option. Plenty of repositories require FFMPEG as a dependency, but users have to install it and add it to PATH, which is not feasible for some users (like I used to be just over a year ago)...they don't even know what a PATH is.

Are you kidding? With that level of expertise and coding, you didn’t know about the PATH until a year ago? No way, lol!

Another option is to do what WhisperS2T does, which is allow FFMPEG...the passing of straight numpy arrays if a user so chooses...and uses a library. Either way, the pipeline handles the particular kind of input and eventually it's all converted to a numpy array anyways.

Yes, that was already the case, the transcribe function accepts a numpy array. The pre-processing step is just for the actual media files. All the libraries require ffmpeg to be on PATH anyway; they don’t ship it with the build. So, there's no need for any third-party library anymore.

Regarding Claude...apparently you don't know me enough to recognize my sarcasm. 😉
...and yes...I just used my script because i did the benchmarking in 5 minutes and I didn't want to create yet another python file...that's on me. 😉 didn't change the results though.

Okay, I got it! 😆 Yes, the results match this time, which is great!

One last thing...some repos use av because it actually does bundle FFMPEG with it, that's the draw...so just keep that in mind if you ever feel the need to incorporate a library as an option for people.

But "yes" to results matching...

@absadiki
Owner

absadiki commented Sep 7, 2024

One last thing...some repos use av because it actually does bundle FFMPEG with it, that's the draw...so just keep that in mind if you ever feel the need to incorporate a library as an option for people.

But "yes" to results matching...

Oh! Good to know that! PyAv is really great,
but a ~3x difference is huge, and this might upset the other "dev and benchmarking people" 😆
I will stick with the optimized version for now!

But yeah, I'm always open to suggestions and will definitely consider it if needed!

Thanks a lot for all your suggestions and contributions!

@UsernamesLame
Contributor Author

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.

Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds

AV conversion took 19.177406 seconds

Raw FFMPEG conversion took 6.713088 seconds

Here is the script

@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

Did I not already suggest directly using FFMPEG - OMG! I thought I did...but I thought you wanted a library that bundles it for the sake of simplicity. lol. Wait, I kinda did above when I commented on the average user probably not wanting to mess with PATH...

Anyways, here's the guy who informed me of this issue first...

shashikg/WhisperS2T#40 (comment)

BTW, if you're looking to implement true batching I highly recommend you check out his repository. His pipeline for implementing ctranslate2 is awesome-sauce.

@UsernamesLame I'll send you an invite to the whisper benchmarking private repo I created. Feel free to participate, just observe or not join at all at your pleasure.

Will bench the FFMPEG directly but turn to other issues after that. 😉

EDIT:

BTW, I believe FFMPEG's binary is written in C so yes it'll be faster...but it would be fun to compare it to Rust, which is much easier to incorporate for people only familiar with Python like myself, but unfortunately Rust doesn't have anything remotely close to FFMPEG's comprehensiveness...

I will definitely join my friend from the best city in America! (I followed you on GitHub and saw your city and was like 🎉🎉🎉 because I have a soft spot for there)

Idk if I'll be able to help. I think we all lost the plot here. Batch processing can take longer. The goal is to shrink the actual transcription time!

I don't care if it takes hours to make numpy arrays to store in a database so transcription nodes can ingest from a central location. As long as the transcription goes faster.

Let's try and make that the new goal. Any ideas?
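
One low-effort way to try the "convert once, transcribe many times" idea is to cache the already-converted samples on disk as `.npy` files, so transcription workers only pay for a file read. This is a sketch only: `cache_audio` and `load_cached` are hypothetical names, and a real setup might use a database or shared storage instead of local files.

```python
import os
import tempfile

import numpy as np


def cache_audio(samples, cache_path):
    """Store already-converted samples as a float32 .npy file."""
    np.save(cache_path, np.asarray(samples, dtype=np.float32))


def load_cached(cache_path, mmap=False):
    """Load cached samples; mmap=True maps the file instead of reading it eagerly."""
    return np.load(cache_path, mmap_mode="r" if mmap else None)


if __name__ == "__main__":
    samples = np.zeros(16000, dtype=np.float32)  # one second of silence at 16 kHz
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "clip.npy")
        cache_audio(samples, path)
        restored = load_cached(path)
        print(restored.shape, restored.dtype)
```

The restored array can then be handed to `transcribe`, which, as noted earlier in the thread, accepts a numpy array directly.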

@UsernamesLame
Contributor Author

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.
Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds

Here is the script
@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

Did I not already suggest directly using FFMPEG - OMG! I thought I did...but I thought you wanted a library that bundles it for the sake of simplicity. lol. Wait, I kinda did above when I commented on the average user probably not wanting to mess with PATH...

Anyways, here's the guy who informed me of this issue first...

shashikg/WhisperS2T#40 (comment)

BTW, if you're looking to implement true batching I highly recommend you check out his repository. His pipeline for implementing ctranslate2 is awesome-sauce.

@UsernamesLame I'll send you an invite to the whisper benchmarking private repo I created. Feel free to participate, just observe or not join at all at your pleasure.

Will bench the FFMPEG directly but turn to other issues after that. 😉

EDIT:

BTW, I believe FFMPEG's binary is written in C so yes it'll be faster...but it would be fun to compare it to Rust, which is much easier to incorporate for people only familiar with Python like myself, but unfortunately Rust doesn't have anything remotely close to FFMPEG's comprehensiveness...

A good chunk of FFMPEG is actually assembly!

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.
Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds

Here is the script
@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

The "never wrong" 😉 Claude identified a few issues:

1. The timing in convert_ffmpeg is not consistent with the other methods. You're including the file I/O operations in the timing, which isn't done in the other methods.

2. Missing @timeit decorator: To be consistent with the other methods, you should use the @timeit decorator for the actual conversion part.

3. Return value: The convert_ffmpeg method returns both the time and the result, while the others only return the time.

After addressing those issues I received the following impressive results for FFMPEG:

pydub_to_numpy took 0.304464 seconds
Pydub conversion took 17.602802 seconds
av_to_numpy took 8.928977 seconds
AV conversion took 8.931782 seconds
ffmpeg_to_numpy took 2.070266 seconds
FFmpeg conversion took 2.070527 seconds

I used this script:
SCRIPT HERE

[EDIT]

When I altered the script to keep track of all time - including I/O operations - I received these results:

convert_pydub took 17.707413 seconds
convert_av took 9.034201 seconds
convert_ffmpeg took 2.159566 seconds

Results are basically the same:

Second script here:
SCRIPT INCLUDING I/O IN TIME

I removed the timeit mostly because it was spamming the sysout. The benchmark was more of a quick and dirty thing than anything

So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now .. calling ffmpeg directly is much faster than both libraries on my local machine.
Here are the results for the same flac audio

Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds

Here is the script
@BBC-Esq , @UsernamesLame please give it a try and let me know your results ?

Did I not already suggest directly using FFMPEG - OMG! I thought I did...but I thought you wanted a library that bundles it for the sake of simplicity. lol. Wait, I kinda did above when I commented on the average user probably not wanting to mess with PATH...
Anyways, here's the guy who informed me of this issue first...
shashikg/WhisperS2T#40 (comment)
BTW, if you're looking to implement true batching I highly recommend you check out his repository. His pipeline for implementing ctranslate2 is awesome-sauce.
@UsernamesLame I'll send you an invite to the whisper benchmarking private repo I created. Feel free to participate, just observe or not join at all at your pleasure.
Will bench the FFMPEG directly but turn to other issues after that. 😉
EDIT:
BTW, I believe FFMPEG's binary is written in C so yes it'll be faster...but it would be fun to compare it to Rust, which is much easier to incorporate for people only familiar with Python like myself, but unfortunately Rust doesn't have anything remotely close to FFMPEG's comprehensiveness...

Why would I stick with a library if ffmpeg delivers better results? Lol. I just didn't expect such a huge difference to be honest, I knew they were all wrappers around ffmpeg, so I assumed there wouldn't be a big gap. But a 2-3x difference is significant!

Let's drop the libraries if we can get better results.

One last thing...some repos use av because it actually does bundle FFMPEG with it, that's the draw...so just keep that in mind if you ever feel the need to incorporate a library as an option for people.
But "yes" to results matching...

Oh! Good to know that! PyAv is really great, but a ~3x difference is huge, and this might upset the other "dev and benchmarking people" 😆 I will stick with the optimized version for now!

But yeah, I'm always open to suggestions and will definitely consider it if needed!

Thanks a lot for all your suggestions and contributions!

If you find any faster method of conversion please implement it! Just let me keep feeding Whisper raw numpy arrays.

@absadiki
Owner

absadiki commented Sep 8, 2024

@UsernamesLame,
I already did 😆
I removed Pydub and replaced it with ffmpeg. Raw numpy arrays are the defaults, as usual.
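
For readers following along, here is a rough sketch of what calling ffmpeg directly can look like when the decoded audio is piped back over stdout instead of being written to a temporary WAV file. It is not the code that went into pywhispercpp: the names `build_ffmpeg_cmd`, `pcm16_to_float32`, and `load_audio` are hypothetical, and it assumes an `ffmpeg` binary is on PATH.

```python
import subprocess

import numpy as np


def build_ffmpeg_cmd(path, sample_rate=16000):
    """Command that decodes `path` to mono 16-bit little-endian PCM on stdout."""
    return [
        "ffmpeg", "-nostdin", "-loglevel", "error",
        "-i", path,
        "-ac", "1",               # downmix to mono
        "-ar", str(sample_rate),  # resample to the target rate
        "-f", "s16le",            # raw PCM, so there is no container header to skip
        "-",                      # write to stdout
    ]


def pcm16_to_float32(raw):
    """Convert raw little-endian int16 PCM bytes to float32 samples in [-1.0, 1.0).

    Note: this divides by 32768; the scripts in this thread divide by 32767.
    """
    return np.frombuffer(raw, dtype="<i2").astype(np.float32) / 32768.0


def load_audio(path, sample_rate=16000):
    """Decode any ffmpeg-readable file into a float32 numpy array."""
    proc = subprocess.run(
        build_ffmpeg_cmd(path, sample_rate),
        check=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
    )
    return pcm16_to_float32(proc.stdout)
```

Piping holds the whole decoded stream in memory, which is fine for the file sizes discussed here but may not suit very long recordings.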

@BBC-Esq

BBC-Esq commented Sep 8, 2024

@abdeladim-s and @UsernamesLame , can you try this modified script? I'm still losing sleep (just kidding) regarding the difference between pydub and av...There will be a slight time increase because more time measurements are taken, but here's the script and my results...

REVISED SCRIPT
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile

# ANSI escape code for green text
GREEN = '\033[92m'
RESET = '\033[0m'

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{GREEN}{func.__name__} took {end - start:.6f} seconds{RESET}")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def time_step(self, step_name):
        start = time.perf_counter()
        return start, step_name

    def end_step(self, start, step_name, additional_info=""):
        end = time.perf_counter()
        print(f"{step_name} took {end - start:.6f} seconds. {additional_info}")

    @timeit
    def convert_pydub(self):
        """
        This method:
        1. Loads the entire audio file into memory.
            - "AudioSegment.from_file()" initially loads the entire file into memory
        2. Performs resampling and channel conversion in-memory.
        3. Converts the audio data to a numpy array.

        Note: The initial loading time includes file reading and decoding.
        """
        start, step = self.time_step("1. File Opening and Initial Setup")
        audio = AudioSegment.from_file(self.input_file)
        self.end_step(start, step, "Entire audio file loaded into memory.")

        start, step = self.time_step("2. Decoding and Resampling")
        audio = audio.set_frame_rate(16000).set_channels(1)
        self.end_step(start, step, "Resampling and channel conversion performed in-memory.")

        start, step = self.time_step("3. Converting to Numpy Array")
        result = np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "Audio data converted to numpy array and normalized.")

        return result

    @timeit
    def convert_av(self):
        """
        This method:
        1. Opens the audio file without loading it entirely into memory.
            - "container.decode(audio)" yields frames one at a time, allowing for true streaming processing without loading the entire file into memory
        2. Creates a resampler for the desired output format.
        3. Processes the audio in chunks, decoding and resampling each chunk.
        4. Concatenates the processed chunks into a numpy array.

        Note: The decoding and resampling step includes the actual reading and processing of the audio data.
        """
        start, step = self.time_step("1. File Opening and Initial Setup")
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )
        self.end_step(start, step, "File header opened and resampler created. Audio data not yet loaded.")

        start, step = self.time_step("2. Decoding and Resampling")
        audio_frames = []
        for frame in container.decode(audio):
            resampled_frames = resampler.resample(frame)
            for resampled_frame in resampled_frames:
                audio_frames.append(resampled_frame)
        self.end_step(start, step, "Audio data read, decoded, and resampled in chunks.")

        start, step = self.time_step("3. Converting to Numpy Array")
        if not audio_frames:
            result = np.array([])
        else:
            result = np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames]).astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "Processed audio frames converted to numpy array and normalized.")

        return result

    @timeit
    def convert_ffmpeg(self):
        """
        This method:
        1. Creates a temporary WAV file.
        2. Uses FFmpeg to convert the input to the temporary WAV file.
        3. Reads the temporary WAV file and converts it to a numpy array.

        Note: This method first converts the input to a WAV file before processing, 
        which can add overhead but ensures a consistent input format.
        """
        if self.input_file.endswith('.wav'):
            return self.to_np(self.input_file)
        else:
            start, step = self.time_step("1. File Opening and Initial Setup")
            with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as temp_file:
                temp_file_path = temp_file.name
            self.end_step(start, step, "Temporary WAV file created.")

            try:
                start, step = self.time_step("2. Decoding and Resampling")
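                # '-y' (overwrite the existing temp file) is placed before the
                # input so it is not parsed as a trailing option after the output.
                subprocess.run([
                    'ffmpeg', '-y', '-i', self.input_file, '-ac', '1', '-ar', '16000',
                    temp_file_path
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)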
                self.end_step(start, step, "Input file converted to WAV format using FFmpeg.")

                return self.to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)

    def to_np(self, file_path):
        start, step = self.time_step("3. Converting to Numpy Array")
        with open(file_path, 'rb') as f:
            # Skip the header; assumes the canonical 44-byte PCM WAV layout.
            f.seek(44)
            raw_data = f.read()
        samples = np.frombuffer(raw_data, dtype=np.int16)
        result = samples.astype(np.float32) / np.iinfo(np.int16).max
        self.end_step(start, step, "WAV file read, converted to numpy array, and normalized.")

        return result

def benchmark(input_file):
    converter = AudioConverter(input_file)
    
    print("\nPydub Backend:")
    pydub_array = converter.convert_pydub()

    print("\nAV Backend:")
    av_array = converter.convert_av()

    print("\nFFmpeg Backend:")
    ffmpeg_array = converter.convert_ffmpeg()

if __name__ == "__main__":
    input_file = r"D:\Scripts\bench_cupy\test1.flac"
    benchmark(input_file)
Pydub Backend:
1. File Opening and Initial Setup took 13.699101 seconds
2. Decoding and Resampling took 4.103232 seconds
3. Converting to Numpy Array took 0.353683 seconds
convert_pydub took 18.160763 seconds

AV Backend:
1. File Opening and Initial Setup took 0.003208 seconds
2. Decoding and Resampling took 6.435688 seconds
3. Converting to Numpy Array took 2.587717 seconds
convert_av took 9.112194 seconds

FFmpeg Backend:
1. File Opening and Initial Setup took 0.000831 seconds
2. Decoding and Resampling took 1.731438 seconds
3. Converting to Numpy Array took 0.321049 seconds
convert_ffmpeg took 2.069913 seconds
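
One caveat about the `to_np` helper in the script above: it skips a fixed 44 bytes, which is only correct for a canonical PCM WAV header. WAV files may carry extra chunks before the audio data (tools such as ffmpeg may add a LIST metadata chunk), in which case a few header bytes end up interpreted as samples. The standard-library `wave` module locates the data chunk itself. A sketch, assuming 16-bit PCM input and using the hypothetical name `wav_to_float32`:

```python
import wave

import numpy as np


def wav_to_float32(file_path):
    """Read a 16-bit PCM WAV file into float32 samples without assuming header size."""
    with wave.open(file_path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        raw = wav.readframes(wav.getnframes())
    # Multi-channel files come back interleaved; the thread's use case is mono.
    samples = np.frombuffer(raw, dtype="<i2")
    return samples.astype(np.float32) / np.iinfo(np.int16).max
```

The effect on timing should be negligible, since the cost is still dominated by reading the file and the float conversion.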

@absadiki
Owner

absadiki commented Sep 8, 2024

@abdeladim-s and @UsernamesLame , can you try this modified script? I'm still losing sleep (just kidding) regarding the difference between pydub and av...There will be a slight time increase because more time measurements are taken, but here's the script and my results...

@BBC-Esq, Here are my results:

Pydub Backend:
1. File Opening and Initial Setup took 7.507274 seconds. Entire audio file loaded into memory.
2. Decoding and Resampling took 6.863667 seconds. Resampling and channel conversion performed in-memory.
3. Converting to Numpy Array took 0.381071 seconds. Audio data converted to numpy array and normalized.
convert_pydub took 14.770355 seconds

AV Backend:
1. File Opening and Initial Setup took 0.001853 seconds. File header opened and resampler created. Audio data not yet loaded.
2. Decoding and Resampling took 14.618280 seconds. Audio data read, decoded, and resampled in chunks.
3. Converting to Numpy Array took 4.329899 seconds. Processed audio frames converted to numpy array and normalized.
convert_av took 19.044509 seconds

FFmpeg Backend:
1. File Opening and Initial Setup took 0.000406 seconds. Temporary WAV file created.
2. Decoding and Resampling took 6.519111 seconds. Input file converted to WAV format using FFmpeg.
3. Converting to Numpy Array took 0.406190 seconds. WAV file read, converted to numpy array, and normalized.
convert_ffmpeg took 6.981947 seconds

Raw FFmpeg is still the fastest, followed by Pydub and then AV.

@BBC-Esq

BBC-Esq commented Sep 8, 2024

Thanks, I'm trying a last ditch effort to see what might be leading to the difference...let's say I wanted to use a library and not FFMPEG, but still wanted the 2x speedup I'm getting on my computer (but not your guys...)...might be good to know what exactly on my computer is creating the disparity.

Can you please try this? run python in the command prompt where numpy is installed...

Then run import numpy as np
Then run np.show_config()

Not sure how it is in Linux, but on Windows it looks like this:

image

There should be something that says "Build dependencies"

Here's the relevant portion of what mine says...remember, I'm only showing the relevant portions:

  blas:
    detection method: pkgconfig
    found: true
...
    name: scipy-openblas
    openblas configuration: OpenBLAS 0.3.27  USE64BITINT DYNAMIC_ARCH NO_AFFINITY

and...

  lapack:
    detection method: pkgconfig
    found: true
...
    name: scipy-openblas
    openblas configuration: OpenBLAS 0.3.27  USE64BITINT DYNAMIC_ARCH NO_AFFINITY

@absadiki
Owner

absadiki commented Sep 8, 2024

Sure! I would like to know the reason as well.

Here is the output of what you asked:

openblas64__info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
    runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
    libraries = ['openblas64_', 'openblas64_']
    library_dirs = ['/usr/local/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
    runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
This is the full output; I didn't see any build dependencies on my side.

@BBC-Esq

BBC-Esq commented Sep 9, 2024

No dice...dang...Oh well, will just chalk it up to an unknown. Maybe it'll reveal itself at a later date.

@absadiki
Owner

absadiki commented Sep 9, 2024

Yeah, there are so many variables, it's hard to keep track of everything.
Hopefully, we'll find out eventually! 😄

@UsernamesLame
Contributor Author

@abdeladim-s and @UsernamesLame , can you try this modified script? I'm still losing sleep (just kidding) regarding the difference between pydub and av...There will be a slight time increase because more time measurements are taken, but here's the script and my results...

REVISED SCRIPT

Pydub Backend:
1. File Opening and Initial Setup took 13.699101 seconds
2. Decoding and Resampling took 4.103232 seconds
3. Converting to Numpy Array took 0.353683 seconds
convert_pydub took 18.160763 seconds

AV Backend:
1. File Opening and Initial Setup took 0.003208 seconds
2. Decoding and Resampling took 6.435688 seconds
3. Converting to Numpy Array took 2.587717 seconds
convert_av took 9.112194 seconds

FFmpeg Backend:
1. File Opening and Initial Setup took 0.000831 seconds
2. Decoding and Resampling took 1.731438 seconds
3. Converting to Numpy Array took 0.321049 seconds
convert_ffmpeg took 2.069913 seconds
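For anyone reproducing these numbers, a per-stage breakdown like the one above could be collected along the following lines. This is a minimal sketch and not the revised script itself: `timed`, `profile_stages`, and the stand-in stage functions are hypothetical names, and the real stages would be the pydub, av, or ffmpeg calls being compared.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


def profile_stages(stages):
    """Run (label, zero-argument callable) pairs in order and time each one."""
    results = {}
    for label, stage in stages:
        with timed(label, results):
            stage()
    return results


if __name__ == "__main__":
    # Stand-in stages (assumptions); replace with the real backend calls.
    stages = [
        ("File Opening and Initial Setup", lambda: time.sleep(0.01)),
        ("Decoding and Resampling", lambda: sum(range(100_000))),
        ("Converting to Numpy Array", lambda: list(range(10_000))),
    ]
    results = profile_stages(stages)
    for index, (label, seconds) in enumerate(results.items(), start=1):
        print(f"{index}. {label} took {seconds:.6f} seconds")
    print(f"total took {sum(results.values()):.6f} seconds")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations. Note that the first stage on each machine may include one-time costs such as library imports or process startup.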

Sorry, college has been busy. Do you still need me to test this? Also I'm not getting notifications anymore.
