
Python benchmark script #1298

Merged: 21 commits merged into ggerganov:master on Sep 25, 2023

Conversation

@nchudleigh (Contributor) commented Sep 16, 2023

A simple benchmarking script written in Python: it runs the compiled ./main with various settings, parses the timings from the output, and saves them as CSV.
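(Roughly, the core of that approach looks like this; a simplified sketch, not the script verbatim. The regex assumes whisper.cpp's `whisper_print_timings` output, and the flag/path names are illustrative:)

```python
import re
import subprocess

def time_model(model_path, threads=4, processors=1, sample="./samples/jfk.wav"):
    # Run the compiled binary; whisper.cpp prints its timing summary
    # (whisper_print_timings) on stderr.
    cmd = ["./main", "-m", model_path, "-t", str(threads),
           "-p", str(processors), "-f", sample]
    proc = subprocess.run(cmd, capture_output=True, text=True)

    # Pull out lines such as "whisper_print_timings: load time = 50.53 ms"
    timings = {}
    for key in ("load", "sample", "encode", "decode", "total"):
        m = re.search(rf"{key} time\s*=\s*([\d.]+)\s*ms", proc.stderr)
        if m:
            timings[key] = float(m.group(1))
    return timings
```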

@nchudleigh (Contributor, Author) commented Sep 17, 2023

Example output

| Commit | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| c9f02c2 | ggml-tiny.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 50.53 | 13.38 | 55.13 | 47.63 | 0.5 | 55.13 | 1.76 | 239.25 |
| c9f02c2 | ggml-base.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 73.22 | 11.31 | 93.36 | 69.6 | 0.42 | 93.36 | 2.58 | 324.1 |
| c9f02c2 | ggml-small.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 202.81 | 12.37 | 256.52 | 173.2 | 0.41 | 256.52 | 5.77 | 729.33 |
| c9f02c2 | ggml-medium.bin | Apple M1 Pro | 11.0 | 4 | 1 | 655.14 | 13.34 | 705.0 | 370.83 | 0.48 | 705.0 | 13.73 | 1884.92 |
| c9f02c2 | ggml-medium.en.bin | Apple M1 Pro | 11.0 | 4 | 1 | 774.22 | 11.31 | 713.09 | 357.81 | 0.42 | 713.09 | 13.25 | 1986.39 |
| c9f02c2 | ggml-large.bin | Apple M1 Pro | 11.0 | 4 | 1 | 1367.25 | 11.32 | 1262.89 | 562.71 | 0.42 | 1262.89 | 21.64 | 3421.02 |

@nchudleigh (Contributor, Author) commented

If anyone has notes on clean-up or on what is required, let me know! @bobqianic @ggerganov

@pudepiedj commented

I ran the examples from the same samples/jfk.wav obtained from this script on an Apple M2 Max with 32GB RAM. I was mostly interested in the relative timings, to see whether the models ranked in the same order as the M1 results above.

A few comments on things I needed to do to make the code run on my machine:

- Is it worth trapping stderr as not None?
- I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.
- I also added a cwd path to the original git repo in the get_git_short_hash() function, so it can be run from elsewhere if required:

```python
return (
    subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd="/path/to/gitcloned/whisper.cpp",
    )
    .decode()
    .strip()
)
```
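(For reference, "trapping stderr" here might look something like this sketch, using subprocess.run so a git failure is surfaced instead of producing empty output later:)

```python
import subprocess

result = subprocess.run(
    ["git", "rev-parse", "--short", "HEAD"],
    capture_output=True,
    text=True,
)
if result.returncode != 0:
    # e.g. not inside a git repository; stderr carries git's message
    raise RuntimeError(f"git rev-parse failed: {result.stderr.strip()}")
short_hash = result.stdout.strip()
```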
| Commit | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 903c957 | ggml-tiny.en.bin | Apple M2 Max | 11 | 4 | 1 | 40.8 | 11.14 | 41.39 | 37.52 | 0.41 | 41.39 | 1.39 | 189.05 |
| 903c957 | ggml-base.en.bin | Apple M2 Max | 11 | 4 | 1 | 61.53 | 10.01 | 63.44 | 45.85 | 0.37 | 63.44 | 1.7 | 229.55 |
| 903c957 | ggml-small.en.bin | Apple M2 Max | 11 | 4 | 1 | 151.79 | 10.97 | 137.03 | 114.48 | 0.37 | 137.03 | 3.82 | 495.76 |
| 903c957 | ggml-medium.bin | Apple M2 Max | 11 | 4 | 1 | 434.65 | 10.45 | 343.07 | 243.53 | 0.37 | 343.07 | 9.02 | 1145.64 |
| 903c957 | ggml-medium.en.bin | Apple M2 Max | 11 | 4 | 1 | 451.14 | 9.99 | 345.55 | 245.87 | 0.37 | 345.55 | 9.11 | 1152.96 |
| 903c957 | ggml-large.bin | Apple M2 Max | 11 | 4 | 1 | 877.59 | 10.04 | 573.76 | 353.65 | 0.37 | 573.76 | 13.6 | 1961.93 |

@nchudleigh (Contributor, Author) commented

@pudepiedj

> Is it worth trapping stderr as not None?

Could you expand on that a bit?

@nchudleigh (Contributor, Author) commented

> I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.

I'll look into that, thanks.

@pudepiedj commented

> > I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.
>
> I'll look into that, thanks.

Because medium and medium.en are very close in timings, I found that for some longer files their relative speeds swap, but the lambda only appears to sort them correctly when medium is faster than medium.en; when it's slower, they come out unsorted. Unfortunately (my bad, sorry) adding the float() doesn't fix the problem either, so it's got something to do with how the lambda works. Using samples/gb0.wav, for example, this is what I get on my M2 Max (38-core) 32GB:

| Commit | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 903c957 | ggml-tiny.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 42.02 | 178.27 | 136.03 | 585.59 | 0.36 | 27.21 | 1.19 | 1063.01 |
| 903c957 | ggml-base.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 58.2 | 184.24 | 236.9 | 917.41 | 0.35 | 47.38 | 1.74 | 1526.07 |
| 903c957 | ggml-small.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 166.02 | 184.41 | 597.09 | 2049.14 | 0.35 | 119.42 | 3.87 | 3201.84 |
| 903c957 | ggml-medium.bin | Apple M2 Max | 127.355375 | 4 | 1 | 463.7 | 193.56 | 1820.74 | 4724.41 | 0.37 | 303.46 | 9.23 | 7612.92 |
| 903c957 | ggml-medium.en.bin | Apple M2 Max | 127.355375 | 4 | 1 | 468.36 | 185.67 | 1533.49 | 4904.27 | 0.35 | 306.7 | 9.2 | 7431.89 |
| 903c957 | ggml-large.bin | Apple M2 Max | 127.355375 | 4 | 1 | 876.76 | 190.78 | 2569.15 | 7540.93 | 0.35 | 513.83 | 13.89 | 11711.99 |

@pudepiedj commented

> > I found the two medium models didn't sort correctly unless I wrapped a float() around the x[1].get() in the lambda function.
>
> I'll look into that, thanks.

Actually, it's a typo in the field name/key: the lambda needs to be `x[1].get("Total Time (ms)", 0)`.
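(In context, the corrected sort key would read something like this; the surrounding variable names are illustrative:)

```python
sorted_results = sorted(
    results.items(),
    key=lambda x: float(x[1].get("Total Time (ms)", 0)),
)
```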

@nchudleigh (Contributor, Author) commented

@pudepiedj nice catch, fixed.

@ggerganov is the root of the project appropriate for this? It feels like it should go elsewhere, but I don't want to confuse the existing benchmarking implementation.

@bobqianic (Collaborator) commented

Overall, it's pretty good. I'm going to test it now. If there are no issues, we can merge this PR immediately.

@bobqianic (Collaborator) commented Sep 21, 2023

> Feels like it should go elsewhere but I don't want to confuse the existing benchmarking implementation.

It just occurred to me that you could actually create your own Python package and upload it to PyPI. That way, during testing, you could simply pip install ABC and it would automatically set up the required files, which would be very convenient.
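(A minimal sketch of that idea, assuming a hypothetical package name whisper-bench and a main() entry point in bench.py; not part of this PR:)

```python
# setup.py -- hypothetical packaging sketch, not part of this PR
from setuptools import setup

setup(
    name="whisper-bench",          # hypothetical package name
    version="0.1.0",
    py_modules=["bench"],
    entry_points={
        # assumes bench.py exposes a main() function
        "console_scripts": ["whisper-bench=bench:main"],
    },
)
```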

@pudepiedj commented

I added the option to specify the number of threads, processors, and sample file as command-line parameters instead of hard-coding them, if that's of any interest. No harm if not.

```python
import argparse

# Create the argument parser
parser = argparse.ArgumentParser()

# Add the argument(s) you want to parse
parser.add_argument('-t', '--threads', type=int, default=4, help='Number of threads')
parser.add_argument('-pr', '--procs', type=int, default=1, help='Number of processors')
parser.add_argument('-f', '--filename', type=str, default="./samples/jfk.wav", help='File to transcribe')

# Parse the command-line arguments
args = parser.parse_args()

# Access the values of the arguments
num_threads = args.threads
filename = args.filename
num_processors = args.procs

threads = [num_threads]
processor_counts = [num_processors]
sample_file = filename
```
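(With this patch the script could then be invoked along these lines, assuming the file is named bench.py as in this PR:)

```shell
python3 bench.py -t 8 -pr 1 -f ./samples/jfk.wav
```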

All my experiments suggest it's fastest with threads=1; I guess that's because the GPU does all the work, and the overhead of the threads is more a hindrance than a help.

@bobqianic (Collaborator) commented

> it's fastest with threads=1

This holds true for OpenBLAS as well.

> I guess because the GPU does all the work and the overheads for the threads are more a hindrance than help.

Yes, those threads are GGML threads, and all they're doing is spinning.

@nchudleigh (Contributor, Author) commented

@pudepiedj I like the argument input, but it would need to handle lists; a lot of the reason I added those was for testing different processor and thread counts across models in the same run.

@nchudleigh (Contributor, Author) commented

Maybe for now we just add the file input?

@pudepiedj commented

> Maybe for now we just add the file input?

Obviously your call. It's hard to set a universal default for the file entry that will suit all local file structures; that's the only downside, so maybe we just stick with your original plan. I entirely see your point about being able to test lists of values, and I don't think it's possible to enter a list as a command-line parameter (is it?), although there's almost certainly a workaround if it were deemed desirable.
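(For what it's worth, argparse can in fact take a space-separated list directly via nargs, which avoids a custom action; a minimal sketch:)

```python
import argparse

parser = argparse.ArgumentParser()
# nargs='+' collects one or more space-separated values into a list,
# e.g. invoked as: python3 bench.py -t 1 2 4 8
parser.add_argument('-t', '--threads', type=int, nargs='+', default=[4],
                    help='One or more thread counts')
args = parser.parse_args()
print(args.threads)  # e.g. [1, 2, 4, 8]
```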

@pudepiedj commented Sep 21, 2023

I've never done it, and I haven't rigorously checked it for all eventualities, but Python is amazing. It's not very pretty, but with a bit of help from GPT-3.5-turbo, this allows either a list (-l) or a single number of threads (-t) to be entered from the command line (the list taking precedence), in the form:

```shell
python3 python_benchmark_script.py -l 1,2,4,8 -t 1 -pr 1 -f /Users/edsilm2/whisper.cpp/samples/jfk.wav
```

Here's the additional code:

```python
import argparse

# Custom action to handle a comma-separated list
class ListAction(argparse.Action):
    def __call__(self, parser, namespace, values, option_string=None):
        setattr(namespace, self.dest, [int(val) for val in values.split(',')])

# Create the argument parser
parser = argparse.ArgumentParser()

# Define the argument that accepts a comma-separated list
parser.add_argument('-l', '--threads_list', dest='thread_list', action=ListAction,
                    help='Comma-separated list of thread counts')

# Add the other argument(s) to parse
# (note: type=int rather than type=list, which would split a string into characters)
parser.add_argument('-t', '--threads', type=int, default=4, help='Number of threads')
parser.add_argument('-pr', '--procs', type=int, default=1, help='Number of processors')
parser.add_argument('-f', '--filename', type=str, default="./samples/jfk.wav", help='File to transcribe')

# Parse the command-line arguments
args = parser.parse_args()

# Access the list of values in the script
thread_list = args.thread_list

# Use the list of values as required (check)
print(f"List: {thread_list}")

# Access the values of the other arguments
filename = args.filename
num_processors = args.procs

# The list takes precedence over the single-value option
if args.thread_list is not None:
    threads = args.thread_list
else:
    threads = [args.threads]

processor_counts = [num_processors]
```

Sample output:

| Commit | Sample File | Model | Hardware | Recording Length (seconds) | Threads | Processors | Load Time (ms) | Sample Time (ms) | Encode Time (ms) | Decode Time (ms) | Prompt Time (ms) | Sample Time per Run (ms) | Encode Time per Run (ms) | Decode Time per Run (ms) | Total Time (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 8 | 1 | 41.73 | 9.78 | 39.1 | 30.32 | 3.14 | 0.38 | 39.1 | 1.21 | 167.18 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 4 | 1 | 40.53 | 9.81 | 41.56 | 31.81 | 0.0 | 0.36 | 41.56 | 1.18 | 167.58 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 8 | 1 | 41.86 | 9.82 | 39.92 | 33.57 | 0.0 | 0.36 | 39.92 | 1.24 | 168.35 |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 4 | 1 | 41.79 | 9.73 | 40.84 | 29.17 | 2.87 | 0.37 | 40.84 | 1.17 | 169.36 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 2 | 1 | 43.94 | 9.88 | 49.54 | 31.56 | 0.0 | 0.37 | 49.54 | 1.17 | 181.52 |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 2 | 1 | 39.17 | 9.93 | 52.01 | 28.84 | 2.78 | 0.38 | 52.01 | 1.15 | 181.7 |
| 903c957 | jfk.wav | ggml-tiny.bin | Apple M2 Max | 11.0 | 1 | 1 | 46.09 | 10.03 | 61.89 | 28.69 | 2.81 | 0.39 | 61.89 | 1.15 | 205.9 |
| 903c957 | jfk.wav | ggml-tiny.en.bin | Apple M2 Max | 11.0 | 1 | 1 | 47.33 | 10.08 | 63.46 | 32.39 | 0.0 | 0.37 | 63.46 | 1.2 | 207.13 |
| 903c957 | jfk.wav | ggml-base.en.bin | Apple M2 Max | 11.0 | 8 | 1 | 57.1 | 10.03 | 51.26 | 48.98 | 0.0 | 0.37 | 51.26 | 1.81 | 212.71 |
| 903c957 | jfk.wav | ggml-base.bin | Apple M2 Max | 11.0 | 8 | 1 | 56.88 | 10.02 | 54.45 | 46.77 | 4.63 | 0.37 | 54.45 | 1.8 | 218.86 |
| 903c957 | jfk.wav | ggml-base.bin | Apple M2 Max | 11.0 | 4 | 1 | 58.01 | 10.01 | 58.12 | 43.52 | 4.25 | 0.37 | 58.12 | 1.67 | 220.83 |
The whole lot!

[image: screenshot of the full results table]

@nchudleigh (Contributor, Author) commented

@bobqianic ready for review!

@pudepiedj commented

Both good points.

Is it worth noting in the README.md that the quantised models don't (currently) run? At least, they don't on my machine, and I've raised this as issue #1314, which @bobqianic has noted by identifying it as a bug and potential enhancement.

@nchudleigh (Contributor, Author) commented

> Both good points. Is it worth noting in the README.md that the quantised models don't (currently) run? At least, they don't on my machine and I've raised this as an Issue #1314 as @bobqianic has noted by identifying this as a bug and potential enhancement.

I don't have the quantized models in the list for this reason. It would be nice to add them once they are functioning, but for now I think it's simpler and cleaner to leave them out.

@bobqianic merged commit 9edbd0a into ggerganov:master on Sep 25, 2023 (35 checks passed)
@pudepiedj commented Sep 25, 2023

> > Both good points. Is it worth noting in the README.md that the quantised models don't (currently) run?
>
> I don't have the quantized models in the list for this reason. Would be nice once they are functioning to add them in, but simple enough and I think for now cleaner to leave them out.

Yes, that's perfectly reasonable. I was just trying to anticipate the situation where someone adds their own edit and wonders whether the problem is theirs or in the code (because I often get errors and am not sure whether it's "me" or "it", given the vagaries of installation).

didzis pushed a commit to didzis/whisper.cpp that referenced this pull request Sep 30, 2023

* Create bench.py
* Various benchmark results
* Update benchmark script with hardware name, and file checks
* Remove old benchmark results
* Add git shorthash
* Round to 2 digits on calculated floats
* Fix the header reference when sorting results
* FIx order of models
* Parse file name
* Simplify filecheck
* Improve print run print statement
* Use simplified model name
* Update benchmark_results.csv
* Process single or lists of processors and threads
* Ignore benchmark results, dont check in
* Move bench.py to extra folder
* Readme section on how to use
* Move command to correct location
* Use separate list for models that exist
* Handle subprocess error in git short hash check
* Fix filtered models list initialization
@aehlke commented Oct 9, 2023

Is there benchmarking of memory usage?
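(Not part of this script, but one rough way to capture the peak memory of the benchmarked child process from Python on Unix-like systems is the resource module; the model/file paths here are illustrative:)

```python
import resource
import subprocess

subprocess.run(["./main", "-m", "models/ggml-base.en.bin",
                "-f", "./samples/jfk.wav"], capture_output=True)

# Peak resident set size across finished child processes.
# Note the units differ by OS: bytes on macOS, kilobytes on Linux.
peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
print(f"peak child RSS: {peak}")
```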

jacobwu-b pushed a commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023 (same commit list as above)
jacobwu-b pushed a second commit to jacobwu-b/Transcriptify-by-whisper.cpp that referenced this pull request Oct 24, 2023 (same commit list as above)
vonstring pushed a commit to vonstring/whisper.cpp that referenced this pull request Nov 7, 2023 (same commit list as above)
landtanin pushed a commit to landtanin/whisper.cpp that referenced this pull request Dec 16, 2023 (same commit list as above)
iThalay pushed a commit to iThalay/whisper.cpp that referenced this pull request Sep 23, 2024 (same commit list as above)