
Python benchmark script #1298

Merged: 21 commits merged into ggerganov:master on Sep 25, 2023

Conversation

@nchudleigh (Contributor) commented Sep 16, 2023

A simple benchmarking script written in Python: it runs the compiled ./main with various settings, parses the timings from the output, and saves them as CSV.
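(Roughly, the core of that approach looks like this; a simplified sketch, not the script verbatim. The regex assumes whisper.cpp's `whisper_print_timings` output, and the flag/path names are illustrative:)

```python
import re
import subprocess

def time_model(model_path, threads=4, processors=1, sample="./samples/jfk.wav"):
    # Run the compiled binary; whisper.cpp prints its timing summary
    # (whisper_print_timings) on stderr.
    cmd = ["./main", "-m", model_path, "-t", str(threads),
           "-p", str(processors), "-f", sample]
    proc = subprocess.run(cmd, capture_output=True, text=True)

    # Pull out lines such as "whisper_print_timings: load time = 50.53 ms"
    timings = {}
    for key in ("load", "sample", "encode", "decode", "total"):
        m = re.search(rf"{key} time\s*=\s*([\d.]+)\s*ms", proc.stderr)
        if m:
            timings[key] = float(m.group(1))
    return timings
```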

@nchudleigh (Contributor, Author) commented Sep 17, 2023

Example output

| Commit | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| c9f02c2 | ggml-tiny.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 50.53 | 13.38 | 55.13 | 47.63 | 0.5 | 55.13 | 1.76 | 239.25 |
| c9f02c2 | ggml-base.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 73.22 | 11.31 | 93.36 | 69.6 | 0.42 | 93.36 | 2.58 | 324.1 |
| c9f02c2 | ggml-small.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 202.81 | 12.37 | 256.52 | 173.2 | 0.41 | 256.52 | 5.77 | 729.33 |
| c9f02c2 | ggml-medium.bin | Apple M1 Pro | 11.0 | 4 | 1 | 655.14 | 13.34 | 705.0 | 370.83 | 0.48 | 705.0 | 13.73 | 1884.92 |
| c9f02c2 | ggml-medium.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 774.22 | 11.31 | 713.09 | 357.81 | 0.42 | 713.09 | 13.25 | 1986.39 |
| c9f02c2 | ggml-large.bin | Apple M1 Pro | 11.0 | 4 | 1 | 1367.25 | 11.32 | 1262.89 | 562.71 | 0.42 | 1262.89 | 21.64 | 3421.02 |

@nchudleigh (Contributor, Author) commented

If anyone has notes on clean-up or on what is required, let me know! @bobqianic @ggerganov

@pudepiedj commented

I ran the examples from the same samples/jfk.wav obtained from this script on an Apple M2 Max with 32GB RAM. I was mostly interested in the relative timings, to see whether the models ranked in the same order as the M1 results above.

A few comments on things I needed to do to make the code run on my machine:

- Is it worth trapping stderr as not None?
- I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.
- I also added a cwd path to the original git repo in the get_git_short_hash() function, so it can be run from elsewhere if required:

```python
return (
    subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd="/path/to/gitcloned/whisper.cpp",
    )
    .decode()
    .strip()
)
```
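(For reference, "trapping stderr" here might look something like this sketch, using subprocess.run so a git failure is surfaced instead of producing empty output later:)

```python
import subprocess

result = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    # e.g. not inside a git repository; stderr carries git's message
    raise RuntimeError(f"git rev-parse failed: {result.stderr.strip()}")
short_hash = result.stdout.strip()
```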
| Commit | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 903c957 | ggml-tiny.en.bin | Apple M2 Max | 11 | 4 | 1 | 40.8 | 11.14 | 41.39 | 37.52 | 0.41 | 41.39 | 1.39 | 189.05 |
| 903c957 | ggml-base.en.bin | Apple M2 Max | 11 | 4 | 1 | 61.53 | 10.01 | 63.44 | 45.85 | 0.37 | 63.44 | 1.7 | 229.55 |
| 903c957 | ggml-small.en.bin | Apple M2 Max | 11 | 4 | 1 | 151.79 | 10.97 | 137.03 | 114.48 | 0.37 | 137.03 | 3.82 | 495.76 |
| 903c957 | ggml-medium.bin | Apple M2 Max | 11 | 4 | 1 | 434.65 | 10.45 | 343.07 | 243.53 | 0.37 | 343.07 | 9.02 | 1145.64 |
| 903c957 | ggml-medium.en.bin | Apple M2 Max | 11 | 4 | 1 | 451.14 | 9.99 | 345.55 | 245.87 | 0.37 | 345.55 | 9.11 | 1152.96 |
| 903c957 | ggml-large.bin | Apple M2 Max | 11 | 4 | 1 | 877.59 | 10.04 | 573.76 | 353.65 | 0.37 | 573.76 | 13.6 | 1961.93 |

@nchudleigh (Contributor, Author) commented

@pudepiedj

> Is it worth trapping stderr as not None?

Could you expand on that a bit?

@nchudleigh (Contributor, Author) commented

> I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.

I'll look into that, thanks.

@pudepiedj commented

> > I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.
>
> I'll look into that, thanks.

Because medium and medium.en are very close in timings, I found that for some longer files their relative speeds swap, but the lambda only appears to sort them correctly when medium is faster than medium.en; when it's slower, they come out unsorted. Unfortunately (my bad, sorry) adding the float() doesn't fix the problem either, so it's got something to do with how the lambda works. Using samples/gb0.wav, for example, this is what I get on my M2 Max (38-core) 32GB:

| Commit | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 903c957 | ggml-tiny.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 42.02 | 178.27 | 136.03 | 585.59 | 0.36 | 27.21 | 1.19 | 1063.01 |
| 903c957 | ggml-base.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 58.2 | 184.24 | 236.9 | 917.41 | 0.35 | 47.38 | 1.74 | 1526.07 |
| 903c957 | ggml-small.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 166.02 | 184.41 | 597.09 | 2049.14 | 0.35 | 119.42 | 3.87 | 3201.84 |
| 903c957 | ggml-medium.bin | Apple M2 Max | 127.355375 | 4 | 1 | 463.7 | 193.56 | 1820.74 | 4724.41 | 0.37 | 303.46 | 9.23 | 7612.92 |
| 903c957 | ggml-medium.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 468.36 | 185.67 | 1533.49 | 4904.27 | 0.35 | 306.7 | 9.2 | 7431.89 |
| 903c957 | ggml-large.bin | Apple M2 Max | 127.355375 | 4 | 1 | 876.76 | 190.78 | 2569.15 | 7540.93 | 0.35 | 513.83 | 13.89 | 11711.99 |

@pudepiedj commented

> > I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.
>
> I'll look into that, thanks.

Actually, it's a typo in the field name/key: the lambda needs to be `x[1].get("Total Time (ms)", 0)`.
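(In context, the corrected sort key would read something like this; the surrounding variable names are illustrative:)

```python
sorted_results = sorted(
    results.items(),
    key=lambda x: float(x[1].get("Total Time (ms)", 0)),
)
```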

@nchudleigh (Contributor, Author) commented

@pudepiedj nice catch, fixed.

@ggerganov is the root of the project appropriate for this? It feels like it should go elsewhere, but I don't want to confuse the existing benchmarking implementation.

@bobqianic (Collaborator) commented

Overall, it's pretty good. I'm going to test it now. If there are no issues, we can merge this PR immediately.

@bobqianic (Collaborator) commented Sep 21, 2023

> Feels like it should go elsewhere but I don't want to confuse the existing benchmarking implementation.

It just occurred to me that you could actually create your own Python package and upload it to PyPI. That way, during testing, you could simply pip install ABC and it would automatically set up the required files, which would be very convenient.
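(A minimal sketch of that idea, assuming a hypothetical package name whisper-bench and a main() entry point in bench.py; not part of this PR:)

```python
# setup.py -- hypothetical packaging sketch, not part of this PR
from setuptools import setup

setup(
    name="whisper-bench",          # hypothetical package name
    version="0.1.0",
    py_modules=["bench"],
    entry_points={
        # assumes bench.py exposes a main() function
        "console_scripts": ["whisper-bench=bench:main"],
    },
)
```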

@pudepiedj commented

I added the option to specify the number of threads, processors, and sample file as command-line parameters instead of hard-coding them, if that's of any interest. No harm if not.

```python
import argparse

# Create the argument parser
parser = argparse.ArgumentParser()

# Add the argument(s) you want to parse
parser.add_argument('-t', '--threads', type=int, default=4, help='Number of threads')
parser.add_argument('-pr', '--procs', type=int, default=1, help='Number of processors')
parser.add_argument('-f', '--filename', type=str, default="./samples/jfk.wav", help='File to transcribe')

# Parse the command-line arguments
args = parser.parse_args()

# Access the values of the arguments
num_threads = args.threads
filename = args.filename
num_processors = args.procs

threads = [num_threads]
processor_counts = [num_processors]
sample_file = filename
```
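(With this patch the script could then be invoked along these lines, assuming the file is named bench.py as in this PR:)

```shell
python3 bench.py -t 8 -pr 1 -f ./samples/jfk.wav
```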

All my experiments suggest it's fastest with threads=1; I guess that's because the GPU does all the work, and the overhead of the threads is more a hindrance than a help.

@bobqianic (Collaborator) commented

> it's fastest with threads=1

This holds true for OpenBLAS as well.

> I guess because the GPU does all the work and the overheads for the threads are more a hindrance than help.

Yes, those threads are GGML threads, and all they're doing is spinning.

@nchudleigh (Contributor, Author) commented

@pudepiedj I like the argument input, but it would need to handle lists; a lot of the reason I added those was for testing different processor and thread counts across models in the same run.

@nchudleigh (Contributor, Author) commented

Maybe for now we just add the file input?

@pudepiedj commented

> Maybe for now we just add the file input?

Obviously your call. It's hard to set a universal default for the file entry that will suit all local file structures; that's the only downside, so maybe we just stick with your original plan. I entirely see your point about being able to test lists of values, and I don't think it's possible to enter a list as a command-line parameter (is it?), although there's almost certainly a workaround if it were deemed desirable.
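(For what it's worth, argparse can in fact take a space-separated list directly via nargs, which avoids a custom action; a minimal sketch:)

```python
import argparse

parser = argparse.ArgumentParser()
# nargs='+' collects one or more space-separated values into a list,
# e.g. invoked as: python3 bench.py -t 1 2 4 8
parser.add_argument('-t', '--threads', type=int, nargs='+', default=[4],
                    help='One or more thread counts')
args = parser.parse_args()
print(args.threads)  # e.g. [1, 2, 4, 8]
```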

@pudepiedj commented Sep 21, 2023

I've never done it, and I haven't rigorously checked it for all eventualities, but Python is amazing. It's not very pretty, but with a bit of help from GPT-3.5-turbo, this allows either a list (-l) or a single number of threads (-t) to be entered from the command line (the list taking precedence), in the form:

```shell
python3 python_benchmark_script.py -l 1,2,4,8 -t 1 -pr 1 -f /Users/edsilm2/whisper.cpp/samples/jfk.wav
```

Here's the additional code:

```python
import argparse

# Custom action to handle a comma-separated list
class ListAction(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, [int(val) for val in values.split(',')])

# Create the argument parser
parser = argparse.ArgumentParser()

# Define the argument that accepts a comma-separated list
parser.add_argument('-l', '--threads_list', dest='thread_list', action=ListAction,
                    help='Comma-separated list of thread counts')

# Add the other argument(s) to parse
# (note: type=int rather than type=list, which would split a string into characters)
parser.add_argument('-t', '--threads', type=int, default=4, help='Number of threads')
parser.add_argument('-pr', '--procs', type=int, default=1, help='Number of processors')
parser.add_argument('-f', '--filename', type=str, default="./samples/jfk.wav", help='File to transcribe')

# Parse the command-line arguments
args = parser.parse_args()

# Access the list of values in the script
thread_list = args.thread_list

# Use the list of values as required (check)
print(f"List: {thread_list}")

# Access the values of the other arguments
filename = args.filename
num_processors = args.procs

# The list takes precedence over the single-value option
if args.thread_list is not None:
    threads = args.thread_list
else:
    threads = [args.threads]

processor_counts = [num_processors]
```

Sample output:

| Commit | Sample File | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Prompt Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 8 | 1 | 41.73 | 9.78 | 39.1 | 30.32 | 3.14 | 0.38 | 39.1 | 1.21 | 167.18 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 4 | 1 | 40.53 | 9.81 | 41.56 | 31.81 | 0.0 | 0.36 | 41.56 | 1.18 | 167.58 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 8 | 1 | 41.86 | 9.82 | 39.92 | 33.57 | 0.0 | 0.36 | 39.92 | 1.24 | 168.35 |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 4 | 1 | 41.79 | 9.73 | 40.84 | 29.17 | 2.87 | 0.37 | 40.84 | 1.17 | 169.36 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 2 | 1 | 43.94 | 9.88 | 49.54 | 31.56 | 0.0 | 0.37 | 49.54 | 1.17 | 181.52 |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 2 | 1 | 39.17 | 9.93 | 52.01 | 28.84 | 2.78 | 0.38 | 52.01 | 1.15 | 181.7 |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 1 | 1 | 46.09 | 10.03 | 61.89 | 28.69 | 2.81 | 0.39 | 61.89 | 1.15 | 205.9 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 1 | 1 | 47.33 | 10.08 | 63.46 | 32.39 | 0.0 | 0.37 | 63.46 | 1.2 | 207.13 |
| 903c957 | jfk.wav | ggml-base.en.bin | Apple M2 Max | 11.0 | 8 | 1 | 57.1 | 10.03 | 51.26 | 48.98 | 0.0 | 0.37 | 51.26 | 1.81 | 212.71 |
| 903c957 | jfk.wav | ggml-base.bin | Apple M2 Max | 11.0 | 8 | 1 | 56.88 | 10.02 | 54.45 | 46.77 | 4.63 | 0.37 | 54.45 | 1.8 | 218.86 |
| 903c957 | jfk.wav | ggml-base.bin | Apple M2 Max | 11.0 | 4 | 1 | 58.01 | 10.01 | 58.12 | 43.52 | 4.25 | 0.37 | 58.12 | 1.67 | 220.83 |
The whole lot!

[image: screenshot of the full results table]

@nchudleigh (Contributor, Author) commented

@bobqianic ready for review!

@pudepiedj commented

Both good points.

Is it worth noting in the README.md that the quantised models don't (currently) run? At least, they don't on my machine, and I've raised this as issue #1314, which @bobqianic has noted by identifying it as a bug and potential enhancement.

@nchudleigh (Contributor, Author) commented

> Both good points. Is it worth noting in the README.md that the quantised models don't (currently) run? At least, they don't on my machine and I've raised this as an Issue #1314 as @bobqianic has noted by identifying this as a bug and potential enhancement.

I don't have the quantized models in the list for this reason. It would be nice to add them once they are functioning, but for now I think it's simpler and cleaner to leave them out.

@bobqianic merged commit 9edbd0a into ggerganov:master on Sep 25, 2023 (35 checks passed)
@pudepiedj commented Sep 25, 2023

> > Both good points. Is it worth noting in the README.md that the quantised models don't (currently) run?
>
> I don't have the quantized models in the list for this reason. Would be nice once they are functioning to add them in, but simple enough and I think for now cleaner to leave them out.

Yes, that's perfectly reasonable. I was just trying to anticipate the situation where someone adds their own edit and wonders whether the problem is theirs or in the code (because I often get errors and am not sure whether it's "me" or "it", given the vagaries of installation).

didzis pushed a commit to didzis/whisper.cpp that referenced this pull request Sep 30, 2023

* Create bench.py
* Various benchmark results
* Update benchmark script with hardware name, and file checks
* Remove old benchmark results
* Add git shorthash
* Round to 2 digits on calculated floats
* Fix the header reference when sorting results
* FIx order of models
* Parse file name
* Simplify filecheck
* Improve print run print statement
* Use simplified model name
* Update benchmark_results.csv
* Process single or lists of processors and threads
* Ignore benchmark results, dont check in
* Move bench.py to extra folder
* Readme section on how to use
* Move command to correct location
* Use separate list for models that exist
* Handle subprocess error in git short hash check
* Fix filtered models list initialization
@aehlke commented Oct 9, 2023

Is there benchmarking of memory usage?
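(Not part of this script, but one rough way to capture the peak memory of the benchmarked child process from Python on Unix-like systems is the resource module; the model/file paths here are illustrative:)

```python
import resource
import subprocess

subprocess.run(["./main", "-m", "models/ggml-base.en.bin",
                "-f", "./samples/jfk.wav"], capture_output=True)

# Peak resident set size across finished child processes.
# Note the units differ by OS: bytes on macOS, kilobytes on Linux.
peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak child RSS: {peak}")
```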

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023 (same commit list as above)
jacobwu-b pushed a second commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023 (same commit list as above)
vonstring pushed a commit to vonstring/whisper.cpp that referenced this pull request Nov 7, 2023 (same commit list as above)
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023 (same commit list as above)
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024 (same commit list as above)