
ggerganov
Member

@ggerganov ggerganov commented Sep 10, 2025

While queueing the graph nodes, keep track of the memory ranges from which we read data and to which we write data. Using this information, we can determine whether a new node can safely run concurrently with the ops queued before it:

  • It should not read data from a memory range that a previous node is writing to
  • It should not write data to a memory range that a previous node is reading from or writing to
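The two rules above are the classic read-after-write and write-after-read/write-after-write hazard checks over buffer ranges. A minimal sketch of the idea (names and data structures are illustrative, not the actual ggml-metal implementation):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Half-open interval [beg, end) within a Metal buffer.
struct range { uint64_t beg, end; };

static bool overlaps(const range & a, const range & b) {
    return a.beg < b.end && b.beg < a.end;
}

// Tracks the ranges read and written by the nodes already queued in the
// current concurrent batch.
struct batch_tracker {
    std::vector<range> reads;
    std::vector<range> writes;

    // A new node may join the batch iff:
    //  - none of its source ranges overlap a prior write (RAW hazard)
    //  - none of its destination ranges overlap a prior read or write
    //    (WAR / WAW hazards)
    bool can_run_concurrently(const std::vector<range> & srcs,
                              const std::vector<range> & dsts) const {
        for (const auto & s : srcs) {
            for (const auto & w : writes) {
                if (overlaps(s, w)) return false;
            }
        }
        for (const auto & d : dsts) {
            for (const auto & r : reads)  { if (overlaps(d, r)) return false; }
            for (const auto & w : writes) { if (overlaps(d, w)) return false; }
        }
        return true;
    }

    // Record the node's ranges after queueing it.
    void add(const std::vector<range> & srcs, const std::vector<range> & dsts) {
        reads.insert (reads.end(),  srcs.begin(), srcs.end());
        writes.insert(writes.end(), dsts.begin(), dsts.end());
    }
};
```

Note that two nodes reading from the same range is fine; only writes create hazards that force a new serial step.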

This feature can be disabled with the GGML_METAL_CONCURRENCY_DISABLE=1 environment variable.

The improvement depends on the order of the nodes in the graph. Some models do not benefit much from this logic alone, so this PR also introduces logic for optimizing the graph to improve concurrency, similar to #15850. The benefits are large for TG and decent for PP.
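One way to picture the reordering: hoist each node as early as its dependencies allow, so independent ops (e.g. the Q/K/V MUL_MATs of one attention layer) end up adjacent and can be batched into one concurrent group. A hypothetical level-scheduling sketch under that assumption (illustrative only, not the actual ggml implementation):

```cpp
#include <algorithm>
#include <vector>

// deps[i] lists the indices of nodes that node i reads from (all < i,
// since the input order is already topological).
std::vector<int> reorder_for_concurrency(const std::vector<std::vector<int>> & deps) {
    const int n = (int) deps.size();

    // level[i] = length of the longest dependency chain ending at node i.
    std::vector<int> level(n, 0);
    int max_level = 0;
    for (int i = 0; i < n; ++i) {
        for (int d : deps[i]) {
            level[i] = std::max(level[i], level[d] + 1);
        }
        max_level = std::max(max_level, level[i]);
    }

    // Emit nodes level by level: nodes within one level have no dependencies
    // on each other, so the hazard check above can batch them concurrently.
    std::vector<int> order;
    order.reserve(n);
    for (int l = 0; l <= max_level; ++l) {
        for (int i = 0; i < n; ++i) {
            if (level[i] == l) {
                order.push_back(i);
            }
        }
    }
    return order;
}
```

In the trace below this is visible as the three MUL_MATs (Q/K/V projections) becoming adjacent after the patch, whereas before they were interleaved with their dependent ADD/ROPE nodes.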

TODO:

  • Env to disable graph optimization
  • More comments about the implemented logic
  • Stats?

Example

For example, before this patch, the graph of one layer of gpt-oss-20b is executed like this:

(concurrent) means that the node runs in parallel with the previous one

0.02.203.953 D ggml_metal_encode_node: node[  897] - ADD          
0.02.203.954 D ggml_metal_encode_node: node[  898] - RMS_NORM     
0.02.203.955 D ggml_metal_encode_node:               fuse 2 ops
0.02.203.955 D ggml_metal_encode_node: node[  900] - MUL_MAT      
0.02.203.956 D ggml_metal_encode_node: node[  901] - ADD          
0.02.203.957 D ggml_metal_encode_node: node[  903] - ROPE         
0.02.203.957 D ggml_metal_encode_node: node[  904] - MUL_MAT      (concurrent)
0.02.203.958 D ggml_metal_encode_node: node[  905] - ADD          
0.02.203.959 D ggml_metal_encode_node: node[  907] - ROPE         
0.02.203.959 D ggml_metal_encode_node: node[  908] - MUL_MAT      (concurrent)
0.02.203.960 D ggml_metal_encode_node: node[  909] - ADD          
0.02.203.961 D ggml_metal_encode_node: node[  912] - SET_ROWS     (concurrent)
0.02.203.961 D ggml_metal_encode_node: node[  914] - SET_ROWS     
0.02.203.962 D ggml_metal_encode_node: node[  921] - FLASH_ATTN_EXT 
0.02.203.966 D ggml_metal_encode_node: node[  923] - MUL_MAT      
0.02.203.966 D ggml_metal_encode_node: node[  924] - ADD          
0.02.203.968 D ggml_metal_encode_node: node[  925] - ADD          
0.02.203.968 D ggml_metal_encode_node: node[  926] - RMS_NORM     
0.02.203.969 D ggml_metal_encode_node:               fuse 2 ops
0.02.203.969 D ggml_metal_encode_node: node[  929] - MUL_MAT      
0.02.203.970 D ggml_metal_encode_node: node[  930] - ADD          
0.02.203.971 D ggml_metal_encode_node: node[  931] - ARGSORT      
0.02.203.972 D ggml_metal_encode_node: node[  933] - MUL_MAT_ID   
0.02.203.972 D ggml_metal_encode_node: node[  934] - ADD_ID       
0.02.203.973 D ggml_metal_encode_node: node[  935] - MUL_MAT_ID   (concurrent)
0.02.203.974 D ggml_metal_encode_node: node[  936] - ADD_ID       
0.02.203.974 D ggml_metal_encode_node: node[  937] - GLU          
0.02.203.975 D ggml_metal_encode_node: node[  938] - MUL_MAT_ID   
0.02.203.976 D ggml_metal_encode_node: node[  939] - ADD_ID       
0.02.203.976 D ggml_metal_encode_node: node[  941] - GET_ROWS     (concurrent)
0.02.203.977 D ggml_metal_encode_node: node[  943] - SOFT_MAX     
0.02.203.978 D ggml_metal_encode_node: node[  945] - MUL          
0.02.203.978 D ggml_metal_encode_node: node[  950] - ADD          
0.02.203.979 D ggml_metal_encode_node:               fuse 3 ops

After this patch, the nodes are reordered and executed like this:

0.02.119.870 D ggml_metal_encode_node: node[  897] - ADD          
0.02.119.871 D ggml_metal_encode_node: node[  898] - RMS_NORM     
0.02.119.872 D ggml_metal_encode_node:               fuse 2 ops
0.02.119.872 D ggml_metal_encode_node: node[  900] - MUL_MAT      
0.02.119.873 D ggml_metal_encode_node: node[  901] - MUL_MAT      (concurrent)
0.02.119.874 D ggml_metal_encode_node: node[  902] - MUL_MAT      (concurrent)
0.02.119.875 D ggml_metal_encode_node: node[  903] - ADD          
0.02.119.875 D ggml_metal_encode_node: node[  905] - ADD          (concurrent)
0.02.119.876 D ggml_metal_encode_node: node[  907] - ADD          (concurrent)
0.02.119.877 D ggml_metal_encode_node: node[  909] - ROPE         
0.02.119.877 D ggml_metal_encode_node: node[  910] - ROPE         (concurrent)
0.02.119.878 D ggml_metal_encode_node: node[  913] - SET_ROWS     
0.02.119.879 D ggml_metal_encode_node: node[  914] - SET_ROWS     (concurrent)
0.02.119.880 D ggml_metal_encode_node: node[  921] - FLASH_ATTN_EXT 
0.02.119.883 D ggml_metal_encode_node: node[  923] - MUL_MAT      
0.02.119.884 D ggml_metal_encode_node: node[  924] - ADD          
0.02.119.885 D ggml_metal_encode_node: node[  925] - ADD          
0.02.119.891 D ggml_metal_encode_node: node[  926] - RMS_NORM     
0.02.119.892 D ggml_metal_encode_node:               fuse 2 ops
0.02.119.892 D ggml_metal_encode_node: node[  929] - MUL_MAT      
0.02.119.893 D ggml_metal_encode_node: node[  930] - ADD          
0.02.119.893 D ggml_metal_encode_node: node[  931] - ARGSORT      
0.02.119.894 D ggml_metal_encode_node: node[  933] - MUL_MAT_ID   
0.02.119.894 D ggml_metal_encode_node: node[  934] - MUL_MAT_ID   (concurrent)
0.02.119.895 D ggml_metal_encode_node: node[  935] - ADD_ID       
0.02.119.896 D ggml_metal_encode_node: node[  936] - ADD_ID       (concurrent)
0.02.119.897 D ggml_metal_encode_node: node[  937] - GLU          
0.02.119.911 D ggml_metal_encode_node: node[  938] - MUL_MAT_ID   
0.02.119.915 D ggml_metal_encode_node: node[  940] - ADD_ID       
0.02.119.917 D ggml_metal_encode_node: node[  941] - GET_ROWS     (concurrent)
0.02.119.920 D ggml_metal_encode_node: node[  943] - SOFT_MAX     
0.02.119.921 D ggml_metal_encode_node: node[  945] - MUL          
0.02.119.923 D ggml_metal_encode_node: node[  950] - ADD          
0.02.119.924 D ggml_metal_encode_node:               fuse 3 ops

Perf

| Model | Test | t/s master | t/s gg/metal-concurrent-graphs | Speedup |
|---|---|---|---|---|
| gemma3 1B Q4_0 | pp512 | 10347.13 | 10927.45 | 1.06 |
| gemma3 1B Q4_0 | pp2048 | 11105.25 | 11289.86 | 1.02 |
| gemma3 1B Q4_0 | pp4096 | 11278.28 | 11428.73 | 1.01 |
| gemma3 1B Q4_0 | tg128 | 204.67 | 225.84 | 1.10 |
| gemma3 270M Q4_0 | pp512 | 36085.32 | 37940.85 | 1.05 |
| gemma3 270M Q4_0 | pp2048 | 40402.50 | 41045.04 | 1.02 |
| gemma3 270M Q4_0 | pp4096 | 42624.23 | 43358.27 | 1.02 |
| gemma3 270M Q4_0 | tg128 | 333.98 | 392.28 | 1.17 |
| gemma3 4B Q4_0 | pp512 | 2664.39 | 2738.56 | 1.03 |
| gemma3 4B Q4_0 | pp2048 | 2837.75 | 2876.11 | 1.01 |
| gemma3 4B Q4_0 | pp4096 | 2823.03 | 2859.76 | 1.01 |
| gemma3 4B Q4_0 | tg128 | 124.45 | 137.87 | 1.11 |
| gpt-oss 20B MXFP4 MoE | pp512 | 2262.85 | 2303.68 | 1.02 |
| gpt-oss 20B MXFP4 MoE | pp2048 | 2660.63 | 2661.22 | 1.00 |
| gpt-oss 20B MXFP4 MoE | pp4096 | 2653.64 | 2662.97 | 1.00 |
| gpt-oss 20B MXFP4 MoE | tg128 | 120.91 | 133.14 | 1.10 |
| qwen2 3B Q4_0 | pp512 | 3019.72 | 3108.98 | 1.03 |
| qwen2 3B Q4_0 | pp2048 | 3239.79 | 3265.49 | 1.01 |
| qwen2 3B Q4_0 | pp4096 | 3055.22 | 3081.93 | 1.01 |
| qwen2 3B Q4_0 | tg128 | 152.27 | 167.65 | 1.10 |
| qwen2 7B Q8_0 | pp512 | 1427.79 | 1455.75 | 1.02 |
| qwen2 7B Q8_0 | pp2048 | 1500.79 | 1510.12 | 1.01 |
| qwen2 7B Q8_0 | pp4096 | 1445.67 | 1454.36 | 1.01 |
| qwen2 7B Q8_0 | tg128 | 75.93 | 78.35 | 1.03 |
| qwen3 0.6B Q8_0 | pp512 | 13398.86 | 13937.51 | 1.04 |
| qwen3 0.6B Q8_0 | pp2048 | 13190.99 | 13393.04 | 1.02 |
| qwen3 0.6B Q8_0 | pp4096 | 11061.29 | 11260.44 | 1.02 |
| qwen3 0.6B Q8_0 | tg128 | 245.57 | 274.48 | 1.12 |
| qwen3moe 30B.A3B Q4_0 | pp512 | 2119.15 | 2148.71 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | pp2048 | 2447.42 | 2468.89 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | pp4096 | 2183.56 | 2202.32 | 1.01 |
| qwen3moe 30B.A3B Q4_0 | tg128 | 91.51 | 101.70 | 1.11 |

@github-actions github-actions bot added ggml changes relating to the ggml tensor library for machine learning Apple Metal https://en.wikipedia.org/wiki/Metal_(API) labels Sep 10, 2025
@calvin2021y

hi @ggerganov, thanks for the nice work

This PR can no longer be applied cleanly to the master branch:

Applied patch to 'ggml/src/ggml-metal/CMakeLists.txt' cleanly.
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.cpp: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.cpp'
error: ggml/src/ggml-metal/ggml-metal-common.cpp: patch does not apply
Performing three-way merge...
error: ggml/src/ggml-metal/ggml-metal-common.h: does not exist in index
error: cannot read the current contents of 'ggml/src/ggml-metal/ggml-metal-common.h'
error: ggml/src/ggml-metal/ggml-metal-common.h: patch does not apply
Applied patch to 'ggml/src/ggml-metal/ggml-metal.m' cleanly.

@ggerganov ggerganov force-pushed the gg/metal-concurrent-graphs branch from 0a6f0eb to 417df40 Compare September 12, 2025 14:27
@ggerganov
Member Author

@calvin2021y The branch is now rebased on latest master. Would appreciate feedback if you give it a try.

@calvin2021y

calvin2021y commented Sep 13, 2025

I get a 1% t/s speedup with this patch. Will try more models and update later.

@ggerganov ggerganov force-pushed the gg/metal-concurrent-graphs branch from 17cf93d to faffbec Compare September 13, 2025 09:50
@ggerganov ggerganov merged commit f161463 into master Sep 13, 2025
1 check passed
@ggerganov ggerganov deleted the gg/metal-concurrent-graphs branch September 13, 2025 10:54
@ggerganov ggerganov mentioned this pull request Sep 13, 2025
2 tasks
