Performance Improvement ideas / feature requests #49
Looked into the numpy array saving: https://numpy.org/doc/stable/reference/generated/numpy.save.html We can save the converted audio files to disk before feeding them to the model. This way we technically bypass the need for PyDub and ffmpeg. It also means no launching background processes (PyDub with ffmpeg) to manipulate audio so it's ready for the model to ingest. |
@UsernamesLame, Thanks for the ideas!
|
This is what I was trying to explain:
sound = AudioSegment.from_file(media_file_path)
sound = sound.set_frame_rate(constants.WHISPER_SAMPLE_RATE).set_channels(1)
arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
arr /= np.iinfo(np.int16).max  # assumes 16-bit samples
with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)
I haven't tested it yet, but the idea is to do all the operations we need on the numpy array ahead of time, and then later we can just do something like:
array = np.load("file.npy")
_transcribe(array)
This way we can mass-process our audio files before we load them into memory for whisper to process. @abdeladim-s hopefully this makes sense now! The idea is to batch process hundreds if not thousands of audio files ahead of time in parallel (I can write a script to do this for us) and save them in a format we can just load into the model and get transcriptions back from. Yes, I know numpy should be fast, but every context switch we can avoid, the better. |
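The save-then-load idea can be sketched without PyDub or ffmpeg by using synthetic samples. This is only a minimal illustration: the random buffer stands in for decoded 16-bit mono audio, and `pcm16_to_float32` is a hypothetical helper name, not part of pywhispercpp.

```python
import os
import tempfile

import numpy as np


def pcm16_to_float32(samples: np.ndarray) -> np.ndarray:
    """Scale signed 16-bit PCM samples to float32 values of roughly [-1, 1]."""
    return samples.astype(np.float32) / np.iinfo(np.int16).max


# Stand-in for decoded audio: one second of 16-bit mono samples at 16 kHz.
rng = np.random.default_rng(0)
pcm = rng.integers(-32768, 32767, size=16000, dtype=np.int16)

# Pre-processing step: normalize once and dump the array to disk.
arr = pcm16_to_float32(pcm)
path = os.path.join(tempfile.mkdtemp(), "file.npy")
np.save(path, arr, allow_pickle=False)

# Later step: load the prepared array back, ready to hand to a model.
loaded = np.load(path)
print(loaded.dtype, loaded.shape)
```

The loaded array has the same dtype and contents as the one that was saved, so the expensive decoding would only need to happen once per file.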
PyDub and ffmpeg are actually there for the conversion to numpy arrays! |
Pre-processing. Every context switch we can avoid, the better! Imagine transcribing thousands of files. The current solution looks like this:
pywhispercpp -> PyDub -> ffmpeg -> PyDub -> pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp
With my proposal it would look more like this:
pywhispercpp -> numpy -> pywhispercpp -> PyBind11 -> whisper -> PyBind11 -> pywhispercpp
The goal isn't to make this a full replacement for the existing solution, but tomorrow I'll write a demo showing an alternative way to load data into the model, cutting out as many context switches as possible to gain some performance. |
Okay, so the idea is to process a large number of files? |
IO operations are generally cheaper than context switches. I'll test this unless you want to. I can also read the files into memory, store them in a BytesIO object, and read from it like a file object. There are a lot of ways this can be taken, but I genuinely believe that avoiding context switches > IO.
RE: deepcopy
You can create completely independent objects that are clones of existing objects. Think: instead of myModel = Model, we do myModel = existingModel.deepclone(). So we don't read the model weights from disk again, but instead do an in-memory copy. |
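The deep copy idea can be illustrated with the standard library's `copy.deepcopy`. This is a sketch under assumptions: `FakeModel` is a hypothetical stand-in, and whether a pybind11-backed object such as the pywhispercpp model supports deep copying is not established anywhere in this thread.

```python
import copy


class FakeModel:
    """Hypothetical stand-in for a loaded model; not the pywhispercpp API."""

    def __init__(self, weights):
        # Pretend these weights were read from disk once.
        self.weights = weights


existing_model = FakeModel(weights=[0.1, 0.2, 0.3])

# In-memory clone: no second read from disk is needed.
my_model = copy.deepcopy(existing_model)

# Mutating the clone leaves the original untouched.
my_model.weights[0] = 9.9
print(existing_model.weights[0], my_model.weights[0])
```

A deep copy duplicates the weights in memory, so it saves disk reads but not RAM.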
Yes please, go ahead and test! experiments and Numbers will save us a lot of talk :) |
So I'm testing with a sample 33 MB mp3 and the results are promising. Pre-processing into a numpy array and saving to disk shrinks it to 5.4 MB, so we can definitely have an impact on memory footprint with a helper script! Let me test transcription performance. |
I have numbers for you @abdeladim-s! Here's the script:
```py
from pywhispercpp.model import Model
import numpy as np
import time
```
(Only the imports survive here; the bodies of usenumpy() and useaudiofile() and the timing code around them are missing.)
Results:
using raw numpy array finished in 2.6472320556640625
using mp3 file finished in 26.56456184387207
|
from pydub import AudioSegment
import numpy as np
sound = AudioSegment.from_file("audio.mp3")
sound = sound.set_frame_rate(1600).set_channels(1)
arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
with open("file.npy", "wb") as file:
    np.save(file, arr, allow_pickle=False)
This is the pre-conversion script. I'm going to update WhisperWav to output numpy arrays that can be fed directly into the model. |
So I tried pre-converting a few files. Most work, but at random numpy will completely mangle the conversion to an ndarray and the save, leading to errors. If anyone has any idea why it's randomly mangling things, I'd love help here. Edit: Yeah, I'm at a complete dead end as to why numpy insists on butchering audio files at random when saving. When it works, the speedups are insane. When it doesn't work, the errors are absolutely useless. Edit 2: I decided to see if Copilot could help. It suggested:
with open("audio.npy", "rb") as f:
    audio_data = np.fromfile(f, dtype=np.float32)
And so far it seems to be working? |
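One thing worth knowing about this workaround: a file written by `np.save` starts with a `.npy` header, and `np.fromfile` does not parse it. The sketch below uses a synthetic array (an assumption, not the audio from the thread) to show how the two loaders differ on the same file.

```python
import os
import tempfile

import numpy as np

# Synthetic float32 data standing in for a normalized audio array.
arr = np.linspace(-1.0, 1.0, 1000, dtype=np.float32)
path = os.path.join(tempfile.mkdtemp(), "audio.npy")
np.save(path, arr, allow_pickle=False)

loaded = np.load(path)                     # parses the .npy header
raw = np.fromfile(path, dtype=np.float32)  # treats every byte as sample data

# Header bytes get reinterpreted as extra float32 values at the start.
extra = raw.size - loaded.size
print(f"np.load: {loaded.size} samples, np.fromfile: {raw.size} samples")
print(f"{extra} leading values come from the header, not the audio")
```

With `np.fromfile`, the real samples are still there at the end of the array, but the leading values are header bytes read as floats. `np.load` is the matching reader for files produced by `np.save`.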
Ok so final comment for now. A 42m audio file at 101 mb once crushed to mono and audio bitrate set to 1600khz becomes a 17mb~ npy file. Processing the npy file takes around 10 seconds. Processing the raw wav file takes around 63 seconds. This doesn't seem like an error or unreasonable. Can someone else please try and reproduce? Are we literally spending that much time prepping the file?! |
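A quick size check can help sanity-test numbers like these. The sketch below assumes the file is exactly 42 minutes of mono audio stored as float32 (4 bytes per sample) and ignores the small `.npy` header; `npy_payload_bytes` is a hypothetical helper name.

```python
def npy_payload_bytes(duration_s: float, sample_rate_hz: int,
                      bytes_per_sample: int = 4) -> int:
    """Approximate size of a mono float32 array: duration x rate x 4 bytes."""
    return int(duration_s * sample_rate_hz * bytes_per_sample)


duration_s = 42 * 60  # 42 minutes of audio

at_1600 = npy_payload_bytes(duration_s, 1600)
at_16000 = npy_payload_bytes(duration_s, 16000)

print(f"{at_1600 / 1e6:.1f} MB at 1600 Hz")    # 16.1 MB
print(f"{at_16000 / 1e6:.1f} MB at 16000 Hz")  # 161.3 MB
```

Under these assumptions, a file of about 17 MB is close to what a sample rate of 1600 Hz would produce, while 16000 Hz would give roughly ten times that.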
I still don't get what you are trying to achieve, but if I understand it correctly, it's basically the same as what I did, except that you are trying to dump and load the npy array, and you've made a deadly bug! lol Also, when you did the experiment, why didn't you include the time needed to convert the files to npy? People are not moving around with dumped npy arrays of their media files 😅 Here is what I think this should be:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from pywhispercpp.model import Model
import numpy as np
import time
from pydub import AudioSegment

def usenumpy():
    # This part from your script should be included as well! ##########
    sound = AudioSegment.from_file("audio.mp3")
    # Here 16Khz not 1600 !!!! That's what you were doing wrong !!!
    sound = sound.set_frame_rate(16000).set_channels(1)
    arr = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    arr /= np.iinfo(np.int16).max  # Normalization is important! otherwise you will get 'utf-8' codec can't decode bytes
    # dump array to npy file
    with open("file.npy", "wb") as file:
        np.save(file, arr, allow_pickle=False)
    ####################
    # load model
    model = Model('base')
    # load array from npy file
    audio_data = np.load("file.npy")
    segments = model.transcribe(audio_data)
    for segment in segments:
        print(segment)

def useaudiofile():
    model = Model('base')
    segments = model.transcribe("audio.mp3")
    for segment in segments:
        print(segment)

begin = time.time()
usenumpy()
end = time.time()
print("*" * 20)
print(f"using raw numpy array finished in {end - begin}")
print("*" * 20)
begin = time.time()
useaudiofile()
end = time.time()
print("*" * 20)
print(f"using mp3 file finished in {end - begin}")
print("*" * 20)
I used this file from my other project, here are the results:
[2024-08-30 17:34:17,168] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:33,929] {model.py:133} INFO - Inference time: 16.761 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using raw numpy array finished in 17.416718244552612
********************
[2024-08-30 17:34:34,516] {model.py:130} INFO - Transcribing ...
[2024-08-30 17:34:50,128] {model.py:133} INFO - Inference time: 15.612 s
t0=0, t1=424, text=[Music]
t0=424, t1=800, text=What exactly is artificial intelligence?
t0=800, t1=1192, text=We speak of AI when computer systems perform tasks
t0=1192, t1=1448, text=that usually require human intelligence.
t0=1448, t1=1624, text=This includes, for example,
t0=1624, t1=2056, text=recognizing images, making decisions or engaging in dialogue.
t0=2056, t1=2624, text=To do this, the AI systems must be equipped with knowledge and experience.
t0=2624, t1=2824, text=This can be achieved in two ways.
t0=2824, t1=3000, text=[Music]
t0=3000, t1=3280, text=You can program each individual instruction
t0=3280, t1=3544, text=so that the machine solve the task step by step.
t0=3544, t1=3984, text=This is comparable to a cooking recipe or assembly instructions.
t0=3984, t1=4480, text=Alternatively, you can use programs that learn from data themselves.
t0=4480, t1=5032, text=This enables them to detect relevant information, draw conclusions, or make predictions.
t0=5032, t1=5336, text=This is known as machine learning.
t0=5512, t1=5904, text=We all have probably dealt with AI at some point in our lives.
t0=5904, t1=6224, text=When we watch films, listen to music or shop online.
t0=6224, t1=6528, text=AI gives us recommendations about what we might like.
t0=6528, t1=7080, text=AI is capable of converting spoken language into text
t0=7080, t1=7312, text=and translating it into other languages.
t0=7312, t1=8040, text=AI is a central component of robotics.
t0=8040, t1=8288, text=Robots make our everyday lives easier
t0=8288, t1=8488, text=or take on strenuous activities.
t0=8488, t1=8984, text=Self-driving vehicles recognise their environment through AI
t0=8984, t1=9096, text=and can react to it.
t0=9096, t1=9568, text=AI is becoming increasingly important within medicine.
t0=9568, t1=9840, text=It supports doctors when diagnosing diseases.
t0=9840, t1=10696, text=Also, more and more patients use AI-based apps for initial diagnosis.
t0=10696, t1=11264, text=In the educational sector, AI helps to individualise learning activities.
t0=11280, t1=11544, text=For example, on digital learning platforms.
t0=11544, t1=11928, text=AI is becoming increasingly important.
t0=11928, t1=12504, text=Once we understand how AI works, we can better gauge where it can support everyday activities
t0=12504, t1=12688, text=at home and at work.
t0=12688, t1=12896, text=And where we would rather make our own decisions.
t0=12896, t1=13512, text=AI will not replace humans, but it is getting better and better at supporting us.
t0=13512, t1=13840, text=For this, we need an AI-competent society.
t0=13840, t1=14176, text=[MUSIC PLAYING]
t0=14176, t1=14376, text=you
********************
using mp3 file finished in 16.196656465530396
********************
This is not a real experiment per se, but as you can see, they are almost the same. There is no need to dump and load the numpy array! Let me know what you think. |
I caught the deadly bug locally and fixed it locally. As for performance, it's odd that you aren't getting better results and I am. I'm guessing it has something to do with the memory bandwidth of the M1 Pro vs x86 chips? But yeah, you're understanding now. I haven't tested it on x86. Also, I didn't include the conversion to numpy arrays because the idea is to mass-transform first and then transcribe. At least one benefit is that the numpy arrays are generally smaller in my experience. What are your system specs, btw? And Python version? I'm using 3.12 and getting good results. If I can't increase performance, I can at least lower memory usage, I guess. 😅
My idea is to let the model be long-lived and keep feeding it fresh array dumps as it transcribes them one after another. This way, in a different process (I'm going to edit the action to show this), we can spawn subprocesses to mass-convert media files to numpy arrays. The idea is that the model is the limiting factor, as in most people don't have the CPU / RAM to load 2 - 4 models, so if we can pre-process the files so the model can transcribe faster with less memory, it's still a (small) win! I have access to a 128-core ARM box that is piss slow at transcribing but can quickly spit out these numpy arrays. It's not gonna benefit everyone, but it's worth exploring the thought.
It's also possible to store all the numpy arrays in a single database that clients running the models pull from to transcribe, creating a transcription cluster. The big benefit is that the clients can be small, like a Raspberry Pi, and still get considerably faster transcriptions. |
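The split between parallel pre-conversion and a single long-lived consumer can be sketched as follows. Everything here is a stand-in: `convert` and `fake_transcribe` are hypothetical placeholders, the inputs are synthetic buffers, and a thread pool is used only to keep the sketch self-contained (a real job might use separate processes or ffmpeg subprocesses).

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def convert(pcm: np.ndarray) -> np.ndarray:
    """Stand-in for the per-file conversion step (decode, resample, normalize)."""
    return pcm.astype(np.float32) / np.iinfo(np.int16).max


def fake_transcribe(audio: np.ndarray) -> str:
    """Hypothetical placeholder for a long-lived model's transcribe call."""
    return f"{audio.size} samples"


# Stand-ins for decoded files: three short 16-bit mono buffers.
inputs = [np.full(n, 1000, dtype=np.int16) for n in (160, 320, 480)]

# Convert in parallel; pool.map returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    arrays = list(pool.map(convert, inputs))

# A single consumer works through the prepared arrays one after another.
results = [fake_transcribe(a) for a in arrays]
print(results)
```

The point of the design is that only the consumer needs the model in memory, while the conversion workers can run elsewhere and scale independently.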
I'm running a few more tests, including ensuring the numpy arrays produce the same results as the mp3, mostly because I can't believe that after crushing the frame rate, the response frequency, and the channels, I can go from 100mb to 70mb. As of now, numpy is getting me 17 seconds while mp3 is 69 seconds. Timing the conversion to a numpy array gets me 5 seconds. So 21 seconds vs 69. The performance gap has shrunk, but it's not gone. It's still ~3x faster to pre-process numpy arrays and then load them. I'm not saying everyone should, but it would make a fun example! Edit: I forgot to mention, I got us added to Whisper.cpp's README.md :) Merged already. I felt like we were ready for more visibility. |
|
|
Re testing: I know one test isn't enough, but still it's promising! Re pywhispercpp: It 100% deserves the visibility! Also I double checked I'm using 16000 locally, and:
That's still a pretty drastic difference. Also, when I accidentally did it with 1600, there was no real drop in accuracy on simpler audio files. |
|
Let's put my numpy theories to the test. I'm going to crush around 6h of audio into numpy arrays and transcribe it.
It's really down to batch processing and pre-normalizing the numpy arrays making a very big difference on ARM (M1 Pro). I'm going to test feeding around 7.5h of audio into it and post the results. Edit: Just over 6gb of files converted into numpy arrays in 33 seconds. Time to transcribe! Edit 2: Whisper just spat out some debug logs. 174 seconds to transcribe 1h of audio with normalized numpy arrays! Extrapolating this, it should take about 17 minutes to transcribe >6h of audio. Let's see what actually happens, as whisper spat out another debug log saying it finished in 147 seconds. |
neat! Edit: We have an initial number for the 1gb wav files: 187s. Extrapolating again, 1122 seconds, or almost 19 minutes. So far the speedup isn't that promising, but the next check should be memory usage! |
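The extrapolation in these two comments is a simple linear scaling, which can be written out as below. It assumes six hours of audio and that every hour takes as long as the first measured one; `extrapolate_minutes` is a hypothetical helper name.

```python
def extrapolate_minutes(seconds_per_hour_of_audio: float,
                        hours_of_audio: float) -> float:
    """Scale a per-hour transcription time linearly and convert to minutes."""
    return seconds_per_hour_of_audio * hours_of_audio / 60


numpy_estimate = extrapolate_minutes(174, 6)  # numpy arrays: 174 s per hour
wav_estimate = extrapolate_minutes(187, 6)    # wav files: 187 s per hour

print(f"numpy arrays: about {numpy_estimate:.1f} minutes")
print(f"wav files:    about {wav_estimate:.1f} minutes")
```

Linear scaling is only a rough guide here, since the thread itself reports different timings for different hours of audio.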
https://github.com/EtienneAb3d/WhisperHallu?tab=readme-ov-file I found this, a project about optimizing for whisper! |
Interesting result! |
Sounds great, I'll take a look |
ouch! |
26 minutes for raw wav files, 17 minutes with numpy arrays. I think we have a winner? Opinion? Next test will be memory usage I guess. |
interesting! .. I think it's because of the parallel pre-conversion of the files to numpy. For small number of files, this won't have a huge effect! |
I've never used colab before, so here's the code.
from pywhispercpp.model import Model
import numpy as np
import time
import os
from glob import glob

model = Model('base')

def usenumpy():
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".npy")]
    for file in files:
        with open(f"{file}", "rb") as f:
            audio_data = np.fromfile(f, dtype=np.float32)
        numpy_segments = model.transcribe(audio_data)

def usewav():
    files = [f for f in glob("*") if os.path.isfile(f) and f.endswith(".wav")]
    for file in files:
        mp3_segments = model.transcribe(file)

begin = time.time()
usewav()
end = time.time()
print("*" * 20)
print(f"using wav finished in {end - begin}")
print("*" * 20)
I used cobalt.tools to download a 1.5h video's audio from YouTube as a WAV, then converted it with this:
from pydub import AudioSegment
import numpy as np
from glob import glob
import os
import time

begin = time.time()
files = [f for f in glob("*") if os.path.isfile(f) and not f.endswith((".npy", ".md", ".txt", ".py", ".cfg"))]
for file in files:
    sound = AudioSegment.from_file(file)
    sound = sound.set_frame_rate(16000).set_channels(1)
    numpy_array = np.array(sound.get_array_of_samples()).T.astype(np.float32)
    numpy_array /= np.iinfo(np.int16).max
    with open(f"{file}.npy", "wb") as f:
        np.save(f, numpy_array, allow_pickle=False)
end = time.time()
print(f"{end - begin} seconds elapsed")
I feel like it should be ok to feed it the same audio file 6 times to get a general idea, as it seems like whisper performs worse with each pass, not better. If you want to make a colab / Jupyter notebook, I'll gladly poke around with you. My theory is that the audio files being massive is causing the issue. The numpy arrays I save to disk are much smaller by comparison. The .wav is around 1gb, the .npy is around 393mb. Anyways, for now I must say goodnight my friend! Don't let the geese bite! |
So, the large files are causing the issue ?! Probably! Anyways, good luck with your exploration, let me know if find any optimizations we can add to the repo, |
Wait...so you stopped llama.cpp from running and |
Ignore all the prior numbers. I'm re-running the tests numpy==2.1.0
numpy==1.26.4
|
Hmm...I have no explanation for your numbers and mine too...I'm re-running it multiple times and I get the same thing. Might be beyond my technical expertise. I'm happy to do a screenshare with anyone who wants to actually see it happen, but barring that...just not sure... |
Ok so Do you wanna do a /10 test on your machine? |
What's a "/10" test? |
def benchmark(input_file):
    pydub_times = []
    av_times = []
    for i in range(0, 10):
        converter = AudioConverter(input_file)
        pydub_time = converter.convert_pydub()
        pydub_times.append(pydub_time)
        av_time = converter.convert_av()
        av_times.append(av_time)
    pydub_avg_time = sum(pydub_times) / len(pydub_times)
    av_avg_time = sum(av_times) / len(av_times)
    print(f"Average Pydub conversion time: {pydub_avg_time:.6f} seconds")
    print(f"Average AV conversion time: {av_avg_time:.6f} seconds")
Fixed it, re-running. My focus is split, sorry about the mistakes. |
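A repeated-run benchmark like the "/10" test can also report the spread, not just the average, which helps when two machines disagree. The sketch below is generic and makes no claims about PyDub or PyAV: `time_repeated` is a hypothetical helper and `workload` is a dummy function standing in for a conversion call.

```python
import statistics
import time


def time_repeated(func, repeats=10):
    """Call func() `repeats` times and return the wall-clock durations."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def workload():
    # Dummy work standing in for one audio conversion.
    sum(i * i for i in range(10_000))


times = time_repeated(workload, repeats=10)
print(f"mean  {statistics.mean(times):.6f} s")
print(f"stdev {statistics.stdev(times):.6f} s")
print(f"min   {min(times):.6f} s")
```

The minimum is often the most stable figure to compare, since background activity can only make a run slower.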
Average Pydub conversion time: 12.883802 seconds
What are your results? This is with the numpy 2.0 branch. |
Here are my results again: pydub_to_numpy took 0.209830 seconds
Pydub conversion took 16.383211 seconds
av_to_numpy took 20.261770 seconds
AV conversion took 20.263226 seconds
Script used:
import numpy as np
import time
import os
from pydub import AudioSegment
import av

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        @timeit
        def pydub_to_numpy():
            return np.array(audio.get_array_of_samples()).astype(np.float32) / np.iinfo(np.int16).max

        audio_array = pydub_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        @timeit
        def av_to_numpy():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            # Concatenate all frames into a single numpy array, convert to float32, and normalize
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames]).astype(np.float32) / np.iinfo(np.int16).max

        audio_array = av_to_numpy()
        end_time = time.perf_counter()
        return end_time - start_time

def benchmark(input_file):
    converter = AudioConverter(input_file)
    pydub_time = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    av_time = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")

if __name__ == "__main__":
    input_file = r"sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
Running Linux, everything installed from scratch manually, and all unnecessary background processes killed before running the script. Here is the pip freeze:
av==13.0.0
numpy==1.26.4
pydub==0.25.1
|
So I decided to write my own conversion script without relying on any third-party library. It seems we have a new winner now: calling ffmpeg directly is much faster than both libraries on my local machine. Here are the results for the same flac audio:
Pydub conversion took 14.357902 seconds
AV conversion took 19.177406 seconds
Raw FFMPEG conversion took 6.713088 seconds
Here is the script:
import numpy as np
import time
import os
from pydub import AudioSegment
import av
import subprocess
import tempfile

def timeit(func):
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        end = time.perf_counter()
        print(f"{func.__name__} took {end - start:.6f} seconds")
        return result
    return wrapper

class AudioConverter:
    def __init__(self, input_file):
        self.input_file = input_file
        self.base_name = os.path.splitext(os.path.basename(input_file))[0]

    def convert_pydub(self):
        start_time = time.perf_counter()
        audio = AudioSegment.from_file(self.input_file)
        audio = audio.set_frame_rate(16000).set_channels(1)

        # @timeit
        def np_array_conversion():
            return np.array(audio.get_array_of_samples())
        samples = np_array_conversion()

        # @timeit
        def np_float_conversion():
            return samples.astype(np.float32)
        audio_array = np_float_conversion()

        # @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max
        audio_array = np_normalization(audio_array)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_av(self):
        start_time = time.perf_counter()
        container = av.open(self.input_file)
        audio = container.streams.audio[0]
        # Set up the resampler
        resampler = av.audio.resampler.AudioResampler(
            format='s16',
            layout='mono',
            rate=16000
        )

        # @timeit
        def get_array_of_samples():
            audio_frames = []
            for frame in container.decode(audio):
                resampled_frames = resampler.resample(frame)
                for resampled_frame in resampled_frames:
                    audio_frames.append(resampled_frame)
            if not audio_frames:
                return np.array([])
            # Concatenate all frames into a single numpy array
            return np.concatenate([frame.to_ndarray().flatten() for frame in audio_frames])
        audio_array = get_array_of_samples()

        # @timeit
        def np_float_conversion(arr):
            return arr.astype(np.float32)
        audio_array = np_float_conversion(audio_array)

        # @timeit
        def np_normalization(arr):
            return arr / np.iinfo(np.int16).max
        audio_array = np_normalization(audio_array)

        end_time = time.perf_counter()
        return end_time - start_time, audio_array

    def convert_ffmpeg(self):
        def to_np(file_path):
            with open(file_path, 'rb') as f:
                header = f.read(44)  # skip the WAV header, assumed to be 44 bytes
                raw_data = f.read()
            samples = np.frombuffer(raw_data, dtype=np.int16)
            audio_array = samples.astype(np.float32) / np.iinfo(np.int16).max
            return audio_array

        start_time = time.perf_counter()
        if self.input_file.endswith('.wav'):
            res = to_np(self.input_file)
        else:
            temp_file = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
            temp_file_path = temp_file.name
            temp_file.close()
            try:
                subprocess.run([
                    'ffmpeg', '-i', self.input_file, '-ac', '1', '-ar', '16000',
                    temp_file_path, '-y'
                ], check=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
                res = to_np(temp_file_path)
            finally:
                os.remove(temp_file_path)
        end_time = time.perf_counter()
        return end_time - start_time, res

def benchmark(input_file):
    converter = AudioConverter(input_file)
    pydub_time, pydub_array = converter.convert_pydub()
    print(f"Pydub conversion took {pydub_time:.6f} seconds")
    av_time, av_array = converter.convert_av()
    print(f"AV conversion took {av_time:.6f} seconds")
    numpy_time, numpy_array = converter.convert_ffmpeg()
    print(f"Raw FFMPEG conversion took {numpy_time:.6f} seconds")

if __name__ == "__main__":
    input_file = "sam_altman_lex_podcast_367.flac"
    benchmark(input_file)
@BBC-Esq, @UsernamesLame please give it a try and let me know your results. |
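The script above skips a fixed 44 bytes before reading samples. A WAV header is not always exactly 44 bytes (extra chunks can appear before the audio data), so one alternative is to let the standard library's `wave` module find the sample data. This is a sketch, not the project's implementation: `wav_to_float32` is a hypothetical helper, and it assumes 16-bit mono PCM.

```python
import os
import tempfile
import wave

import numpy as np


def wav_to_float32(path: str) -> np.ndarray:
    """Read a 16-bit PCM WAV file and return normalized float32 samples."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        frames = wf.readframes(wf.getnframes())
    # WAV sample data is little-endian; "<i2" makes that explicit.
    samples = np.frombuffer(frames, dtype="<i2")
    return samples.astype(np.float32) / np.iinfo(np.int16).max


# Write a tiny mono 16 kHz test file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
pcm = (np.sin(np.linspace(0, 2 * np.pi, 1600)) * 16000).astype("<i2")
with wave.open(path, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    wf.writeframes(pcm.tobytes())

audio = wav_to_float32(path)
print(audio.dtype, audio.size)
```

For stereo input the samples would come back interleaved, so the ffmpeg `-ac 1` step from the script above would still be needed before calling a helper like this.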
Yep, exact numbers is something you'd only publish after extensive and repetitive benchmarking, not like the informal discussion we're having here... It's possible that it's something on my computer as well...perhaps I installed Q4 running faster than Q3...and then Q3_k_m (or whatever the current naming scheme is) using more VRAM than Q4...This was the behavior with BTW, I also benchmarked solely the Takeaways...
|
Did I not already suggest directly using FFMPEG - OMG! I thought I did...but I thought you wanted a library that bundles it for the sake of simplicity. lol. Wait, I kinda did above when I commented on the average user probably not wanting to mess with PATH... Anyways, here's the guy who informed me of this issue first... shashikg/WhisperS2T#40 (comment) BTW, if you're looking to implement true batching I highly recommend you check out his repository. His pipeline for implementing @UsernamesLame I'll send you an invite to the whisper benchmarking private repo I created. Feel free to participate, just observe or not join at all at your pleasure. Will bench the FFMPEG directly but turn to other issues after that. 😉 EDIT: BTW, I believe FFMPEG's binary is written in C so yes it'll be faster...but it would be fun to compare it to Rust, which is much easier to incorporate for people only familiar with Python like myself, but unfortunately Rust doesn't have anything remotely close to FFMPEG's comprehensiveness... |
The "never wrong" 😉 Claude identified a few issues:
After addressing those issues I received the following impressive results for FFMPEG:
I used this script: SCRIPT HERE
[EDIT] When I altered the script to keep track of all time - including I/O operations - I received these results:
Results are basically the same: Second script here: SCRIPT INCLUDING I/O IN TIME
|
Well said, I couldn't agree more. 👍 |
Why would I stick with a library if ffmpeg delivers better results? Lol. |
"Claude is wrong!" Or maybe you didn’t use the script I provided!
Please don’t rely entirely on LLMs, especially for coding. Read the script yourself first! Anyway, ffmpeg is fast on your computer too, which is great. I'll make the necessary changes and remove Pydub. |
That's one option. Plenty of repositories require FFMPEG as a dependency, but users have to install it and add it to PATH, which is not feasible for some users (like I used to be just over a year ago)...they don't even know what a PATH is. Another option is to do what Regarding Claude...apparently you don't know me enough to recognize my sarcasm. 😉 ...and yes...I just used my script because I did the benchmarking in 5 minutes and I didn't want to create yet another python file...that's on me. 😉 It didn't change the results though. |
Are you kidding? With that level of expertise and coding, you didn’t know about the PATH until a year ago? No way, lol!
Yes, that was already the case, the transcribe function accepts a numpy array. The pre-processing step is just for the actual media files. All the libraries require ffmpeg to be on PATH anyway; they don’t ship it with the build. So, there's no need for any third-party library anymore.
Okay, I got it! 😆 |
One last thing...some repos use But "yes" to results matching... |
Oh! Good to know that! PyAv is really great, But yeah, I'm always open to suggestions and will definitely consider it if needed! Thanks a lot for all your suggestions and contributions! |
I will definitely join my friend from the best city in America! (I followed you on GitHub and saw your city and was like 🎉🎉🎉 because I have a soft spot for there) Idk if I'll be able to help. I think we all lost the plot here. Batch processing can take longer. The goal is to shrink the actual transcription time! I don't care if it takes hours to make numpy arrays to store in a database so transcription nodes can ingest from a central location. As long as the transcription goes faster. Let's try and make that the new goal. Any ideas? |
A good chunk of FFMPEG is actually assembly!
I removed the timeit mostly because it was spamming the sysout. The benchmark was more of a quick and dirty thing than anything
Let's drop the libraries if we can get better results.
If you find any faster method of conversion please implement it! Just let me keep feeding Whisper raw numpy arrays. |
@UsernamesLame, |
@abdeladim-s and @UsernamesLame , can you try this modified script? I'm still losing sleep (just kidding) regarding the difference between pydub and av...There will be a slight time increase because more time measurements are taken, but here's the script and my results... REVISED SCRIPT
|
@BBC-Esq, Here are my results: Pydub Backend:
1. File Opening and Initial Setup took 7.507274 seconds. Entire audio file loaded into memory.
2. Decoding and Resampling took 6.863667 seconds. Resampling and channel conversion performed in-memory.
3. Converting to Numpy Array took 0.381071 seconds. Audio data converted to numpy array and normalized.
convert_pydub took 14.770355 seconds
AV Backend:
1. File Opening and Initial Setup took 0.001853 seconds. File header opened and resampler created. Audio data not yet loaded.
2. Decoding and Resampling took 14.618280 seconds. Audio data read, decoded, and resampled in chunks.
3. Converting to Numpy Array took 4.329899 seconds. Processed audio frames converted to numpy array and normalized.
convert_av took 19.044509 seconds
FFmpeg Backend:
1. File Opening and Initial Setup took 0.000406 seconds. Temporary WAV file created.
2. Decoding and Resampling took 6.519111 seconds. Input file converted to WAV format using FFmpeg.
3. Converting to Numpy Array took 0.406190 seconds. WAV file read, converted to numpy array, and normalized.
convert_ffmpeg took 6.981947 seconds
Raw FFmpeg is still the fastest, followed by PyDub and then AV. |
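For anyone who wants to reproduce per-stage numbers like these, here is a minimal sketch of a stage timer. `StageTimer` and the stage bodies are hypothetical (they are not taken from the revised script); it only relies on `time.perf_counter` from the standard library and collects results quietly instead of printing on every call, so it doesn't spam stdout.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock timings per named stage (hypothetical helper)."""

    def __init__(self):
        self.timings = {}  # stage name -> elapsed seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a stage entered several times is summed.
            elapsed = time.perf_counter() - start
            self.timings[name] = self.timings.get(name, 0.0) + elapsed

    def report(self):
        lines = [
            f"{name} took {secs:.6f} seconds"
            for name, secs in self.timings.items()
        ]
        lines.append(f"total took {sum(self.timings.values()):.6f} seconds")
        return "\n".join(lines)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("1. File Opening and Initial Setup"):
        data = list(range(100_000))  # stand-in for opening a file
    with timer.stage("2. Decoding and Resampling"):
        data = [x * 0.5 for x in data]  # stand-in for decoding
    print(timer.report())
```

Wrapping each backend's three stages in `timer.stage(...)` keeps the measurement overhead identical across backends, which matters when the differences being chased are small.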
Thanks, I'm trying a last-ditch effort to see what might be leading to the difference. Let's say I wanted to use a library and not FFMPEG, but still wanted the 2x speedup I'm getting on my computer (but you guys aren't). It might be good to know what exactly on my computer is creating the disparity. Can you please try this? run Then run Not sure how it is in Linux, but on Windows it looks like this: There should be something that says "Build dependencies" Here's the relevant portion of what mine says (remember, I'm only showing the relevant portions):
and...
|
Sure! I would like to know the reason as well. Here is the output of what you asked:
openblas64__info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
runtime_library_dirs = ['/usr/local/lib']
blas_ilp64_opt_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None)]
runtime_library_dirs = ['/usr/local/lib']
openblas64__lapack_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
runtime_library_dirs = ['/usr/local/lib']
lapack_ilp64_opt_info:
libraries = ['openblas64_', 'openblas64_']
library_dirs = ['/usr/local/lib']
language = c
define_macros = [('HAVE_CBLAS', None), ('BLAS_SYMBOL_SUFFIX', '64_'), ('HAVE_BLAS_ILP64', None), ('HAVE_LAPACKE', None)]
runtime_library_dirs = ['/usr/local/lib']
Supported SIMD extensions in this NumPy install:
baseline = SSE,SSE2,SSE3
found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL
|
No dice...dang...Oh well, will just chalk it up to an unknown. Maybe it'll reveal itself at a later date. |
Yeah, there are so many variables, it's hard to keep track of everything. |
Sorry, college has been busy. Do you still need me to test this? Also I'm not getting notifications anymore. |
As promised, here's the thread I'm making for this.
RE: pre-processing:
In pywhispercpp/model.py we have transcribe, and it can take a numpy ndarray. What I was thinking is, rather than load in audio, crush it to mono, and set it to 16 kHz, why not pre-process all that and generate binary blob files that we can feed in that just contain the numpy ndarray?
It's not a big performance increase, but anything we can do outside of Python land ahead of time will give us a win. And I'm ok chasing micro-optimizations in Python land. I'm useless in C++ land.
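A minimal sketch of that pre-processing step, assuming the input is already a 16-bit PCM WAV at 16 kHz, so no resampling is shown and neither PyDub nor ffmpeg is involved. `wav_to_npy` is a hypothetical helper name, not part of pywhispercpp, and `WHISPER_SAMPLE_RATE` is assumed to match the project's constant. The saved array is mono float32 scaled to [-1, 1), in the same spirit as the snippet earlier in this thread.

```python
import wave

import numpy as np

# Assumption: same value as constants.WHISPER_SAMPLE_RATE in pywhispercpp.
WHISPER_SAMPLE_RATE = 16000


def wav_to_npy(wav_path, npy_path):
    """Convert a 16-bit PCM, 16 kHz WAV file to a mono float32 .npy file.

    Returns the number of samples written.
    """
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        if wav.getframerate() != WHISPER_SAMPLE_RATE:
            raise ValueError("resample to 16 kHz first; not covered here")
        channels = wav.getnchannels()
        raw = wav.readframes(wav.getnframes())

    # WAV stores interleaved little-endian int16: reshape to (frames, channels).
    samples = np.frombuffer(raw, dtype="<i2").reshape(-1, channels)
    # Downmix to mono by averaging channels, then scale to [-1.0, 1.0).
    mono = (samples.astype(np.float32).mean(axis=1) / 32768.0).astype(np.float32)
    np.save(str(npy_path), mono, allow_pickle=False)
    return mono.shape[0]
```

At transcription time the array would then come straight from `np.load(...)` and be handed to transcribe, with no audio decoding on that path.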
Also let's put all logging behind a flag to disable it. If possible, let's add a flag to disable whisper.cpp's incessant logging info to stderr. I know it has no impact on the transcription audio, but it should be controllable.
RE: copy.deepcopy
We need to drop @staticmethod everywhere, and implement the deep copy methods on the C++ side. This is a minor request from me, it would just let us initialize the model in memory and create a deep copy that we can treat as a completely independent instance.
The other option is I can write a helper class using BytesIO to hold the model in memory and we can feed that to the Model class I guess? It would still be better than re-initializing the model to create a sterile instance.
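A rough sketch of the BytesIO idea, under a big assumption: that the Model class (or the bindings) could be taught to load weights from a binary stream instead of a path. Nothing in this thread confirms pywhispercpp supports that, so `load_from_stream` below is a placeholder callable and `InMemoryModel` is a hypothetical name.

```python
import io


class InMemoryModel:
    """Keeps model file bytes in RAM and builds fresh instances from them."""

    def __init__(self, model_path, load_from_stream):
        # Read the model file from disk exactly once.
        with open(model_path, "rb") as f:
            self._blob = f.read()
        # Placeholder: a callable that turns a binary stream into a model.
        self._load = load_from_stream

    def new_instance(self):
        # Each call gets its own BytesIO cursor over the same immutable
        # bytes, so instances cannot disturb each other's read position.
        return self._load(io.BytesIO(self._blob))
```

This avoids re-reading the file for every "sterile" instance, but each instance still pays the cost of parsing the weights; a true deep copy on the C++ side would avoid that as well.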
RE: micro-optimizations
Under _get_segments we have assert end <= n, f"{end} > {n}: `End` index must be less or equal than the total number of segments" but I have to ask, is it even possible to end up in a situation where this assert would actually trigger?
RE: features
Let's make the model usable in a context manager so we can do quick and dirty things like:
Not really necessary, just gives a more pleasant way of interacting with the model class.
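A minimal sketch of what context-manager support could look like. `ManagedModel` and its `close` method are hypothetical; the real Model class may release its native resources differently, so a stand-in object is used here instead of pywhispercpp itself.

```python
class ManagedModel:
    """Stand-in for a model class with context-manager support (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.closed = False

    def transcribe(self, audio):
        if self.closed:
            raise RuntimeError("model is closed")
        # Placeholder result; a real model would run inference here.
        return f"<{len(audio)} samples transcribed with {self.name}>"

    def close(self):
        # A real implementation would free the native whisper context here.
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions raised inside the block


if __name__ == "__main__":
    with ManagedModel("base.en") as model:
        print(model.transcribe([0.0] * 16000))
```

The main benefit is that cleanup runs even if transcription raises, which is easy to forget when the model is created and discarded by hand.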