Adding model benchmarks #691
base: main_perf
Conversation
…bias[dtype0-True-True-2-4-7-16219-64] by adding the qk += (tl.dot(q, k) * QK_SCALE).to(q.type.element_ty) conversion
@@ -282,7 +284,7 @@ def _attn_fwd_inner(acc, l_i, m_i, q, k_ptrs, v_ptrs, bias_ptrs, stride_kn, stri
     else:
         if INT8_KV:
             k = (k * k_descale).to(q.type.element_ty)
-        qk += tl.dot(q, k) * QK_SCALE
+        qk += tl.dot((q * QK_SCALE).to(q.type.element_ty), k)
We tried this before. Unfortunately, it does not work because of precision loss from the conversion. While the math on paper is the same, this version first upcasts Q to f32, does the scalar multiplication with qk_scale, and then downcasts back to q.dtype. This downcast affects performance in some cases.
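A minimal NumPy sketch (illustrative only, not the Triton kernel itself; the shapes and scale value are assumptions) of the precision concern: scaling the f32 accumulator after the dot versus scaling q and rounding it back to fp16 before the dot give slightly different results.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((64, 128)).astype(np.float16)
k = rng.standard_normal((128, 64)).astype(np.float16)
qk_scale = 1.0 / np.sqrt(128) * 1.44269504  # example scale, 1/sqrt(d) * log2(e)

# Scale the f32 product after the dot (the behaviour in the current hunk).
ref = (q.astype(np.float32) @ k.astype(np.float32)) * qk_scale
# Scale q first, downcast back to fp16, then dot (the suggested change).
alt = (q.astype(np.float32) * qk_scale).astype(np.float16).astype(np.float32) @ k.astype(np.float32)

print(np.abs(ref - alt).max())  # nonzero: the fp16 round-trip of q * qk_scale loses bits
```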
@@ -305,13 +308,26 @@ def benchmark(M, N, K, provider):



# TODO(vgokhale): Add more options to benchmarking
Unintentional blank lines?
@@ -305,13 +308,26 @@ def benchmark(M, N, K, provider):



# TODO(vgokhale): Add more options to benchmarking
Maybe you could delete this TODO since you addressed it :)
        model_name=args.model)
    benchmark.benchmarks.x_vals = x_vals

    if args.M and args.N and args.K:
Maybe add an assert that MNK and model cannot be provided together (the model already fixes MNK, so the user likely made a mistake if they provided both)?
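A sketch of the suggested guard (the flag names follow the commands shown in this PR, but the exact parser setup here is an assumption, not the PR's code):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-model", type=str, default=None)
parser.add_argument("-M", type=int, default=0)
parser.add_argument("-N", type=int, default=0)
parser.add_argument("-K", type=int, default=0)
args = parser.parse_args()

# -model already fixes M, N and K, so passing both is almost certainly a mistake.
if args.model and (args.M or args.N or args.K):
    parser.error("Provide either -model or explicit -M/-N/-K, not both.")
```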
{
    "llama3_8B": {
        "head_count": 32,
        "head_dimension": 128,
Usually this is backwards: the hidden_size and head_count are provided, and we work out the head_dimension from that.
Here is an example:
https://huggingface.co/unsloth/llama-3-8b/blob/main/config.json
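A small sketch of deriving head_dimension from the fields a Hugging Face config actually exposes (the file path is illustrative; the commented values are the ones in the linked llama-3-8b config):

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

head_count = cfg["num_attention_heads"]     # 32 for llama-3-8b
hidden_size = cfg["hidden_size"]            # 4096 for llama-3-8b
head_dimension = hidden_size // head_count  # 128
```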
# Infer M, N, K based on the feedforward network (FFN) dimensions
M = batch_size * seq_len  # Total tokens in a batch
K = head_dimension * head_count  # Hidden size (d)
N = 4 * K  # FFN dimension is typically 4x hidden size
Hmm... I think this depends. On llama3-8b, the intermediate size (which is half the first FF output) is 14336, so the N dim would be 14336 x 2, or more generally 2 x intermediate_size.
The intermediate size is a config parameter, so we should read it from the JSON.
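A sketch of reading the FFN width from the config instead of assuming 4 * K (file path and variable names are illustrative; the factor of 2 follows the point above about the fused gate/up projections of the first FF layer):

```python
import json

with open("config.json") as f:
    cfg = json.load(f)

K = cfg["hidden_size"]                   # 4096 for llama3-8b
intermediate = cfg["intermediate_size"]  # 14336 for llama3-8b
N = 2 * intermediate                     # gate + up projections of the first FF layer
```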
raise ValueError(f"Model '{model_name}' not found in {config_file}")
# Handle a specific model
config = configs[model_name]
HQ = HK = config["head_count"]
This depends on the config as well. All models have a "num_attention_heads" parameter, but some also have "num_key_value_heads". If they list the latter, then the KV head count is different from the Q head count.
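A sketch of the suggested handling; the keys are the standard Hugging Face config names, the variable names follow the PR snippet above, and the example values are illustrative of a GQA model:

```python
config = {"num_attention_heads": 32, "num_key_value_heads": 8}  # illustrative values

HQ = config["num_attention_heads"]
# Models without grouped-query attention omit num_key_value_heads, so fall back to HQ.
HK = config.get("num_key_value_heads", HQ)
```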
This PR adds advanced benchmarking for the kernels inside perf-kernels. The aim is to have more comparable benchmark results by taking shapes from actual LLMs. model_configs.json holds the configs for various models, which gemm.py and flash-attention.py can then read to get benchmarking shapes for these kernels, e.g.:
python python/perf-kernels/gemm.py -model llama3_8B
To run the GEMM kernel with a GEMM shape taken from the first layer of the FFN of a Llama 3 8B model.
python python/perf-kernels/flash-attention.py -model all -b 2 -sq 1024
To run the flash attention kernel with the shapes from all the models in model_configs.json (currently llama3_8B, llama3_70B, llama3_405B).
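A hedged sketch of how the GEMM benchmark could turn model_configs.json entries into (M, N, K) shapes; the batch and sequence values mirror the -b/-sq flags above, and the 4 * K FFN width follows the snippet quoted in the review, so this is an illustration of the idea rather than the PR's exact code:

```python
import json

with open("model_configs.json") as f:
    configs = json.load(f)

batch_size, seq_len = 2, 1024  # would normally come from -b / -sq
x_vals = []
for name, cfg in configs.items():
    K = cfg["head_count"] * cfg["head_dimension"]  # hidden size
    M = batch_size * seq_len                       # total tokens in a batch
    N = 4 * K                                      # FFN width per the quoted snippet
    x_vals.append((M, N, K))
```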