Improves 2d tiled matmulnbits by repeating A, loads N times for each B load #23071

sushraja-msft · 2024-12-10T23:10:57Z

Description

Improves on the previous change, "Implement 2d tiled matmulnbits specialized for prefill", by keeping B in shared memory and reloading only A, N times.

This is based on the observation that loading B is more expensive than loading A. For a run of sequence length 16 with dimensions [3072, 3072, 8192], this matrix multiplication takes 1.9 ms. Removing loadA drops it to 1.8 ms; removing loadB drops it to 1.44 ms.
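As a rough reading of those timings, the differences can be attributed to each load. This is only a sketch: removing a load also changes what the compiler and memory system do, so the numbers below are approximate attributions, not exact costs. The variable names are illustrative and do not come from the kernel.

```python
# Timings quoted in the description, in milliseconds.
full_ms = 1.9
without_load_a_ms = 1.8
without_load_b_ms = 1.44

# Approximate time attributable to each load (assumes the costs are additive).
load_a_ms = full_ms - without_load_a_ms
load_b_ms = full_ms - without_load_b_ms

# Share of the full kernel time for each load.
load_a_share = load_a_ms / full_ms
load_b_share = load_b_ms / full_ms

print(f"loadA ~{load_a_ms:.2f} ms ({load_a_share:.0%}), "
      f"loadB ~{load_b_ms:.2f} ms ({load_b_share:.0%})")
```

Under that additive assumption, loadB accounts for several times more of the kernel time than loadA, which is why B is the load worth amortizing.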

By sharing B across multiple A tiles, the cost of loading and dequantizing B is reduced N-fold.
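The effect on load counts can be sketched with a small counting model. This is not the shader code: `count_tile_loads`, the tile counts, and the loop order are hypothetical, chosen only to show how a repeat factor (called `a_repeat` here, after the A_REPEAT setting in the benchmark below) changes the number of B loads while leaving the number of A loads unchanged.

```python
def count_tile_loads(num_a_tiles: int, num_b_tiles: int, a_repeat: int) -> tuple[int, int]:
    """Count A and B tile loads when each B load is shared by up to a_repeat A tiles.

    Hypothetical model: every (A tile, B tile) pair must be multiplied once.
    """
    a_loads = 0
    b_loads = 0
    for _b_tile in range(num_b_tiles):
        # Walk the A tiles in groups of a_repeat.
        for a_start in range(0, num_a_tiles, a_repeat):
            # Load and dequantize B once, then keep it in shared memory.
            b_loads += 1
            a_end = min(a_start + a_repeat, num_a_tiles)
            for _a_tile in range(a_start, a_end):
                # Only A is reloaded for each tile that shares this B.
                a_loads += 1
    return a_loads, b_loads


# Assumed tile counts, for illustration only.
baseline = count_tile_loads(num_a_tiles=32, num_b_tiles=4, a_repeat=1)
repeated = count_tile_loads(num_a_tiles=32, num_b_tiles=4, a_repeat=8)
print("baseline (A loads, B loads):", baseline)
print("a_repeat=8 (A loads, B loads):", repeated)
```

In this model the A load count is the same in both cases, while the B load count drops by the repeat factor whenever the number of A tiles is a multiple of it.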

-- Baseline with prefill optimization from previous change --

C:\onnxruntime>C:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500                                                                                                                                                               
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.2135e+07
        avg (tokens/s): 41.2856                                 << 
        p50 (us):       1.21288e+07
        stddev (us):    21282.1
        n:              5 * 501 token(s)
Token generation:
        avg (us):       78945.3
        avg (tokens/s): 12.667
        p50 (us):       78900.7
        stddev (us):    2232.43
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       20.5608
        avg (tokens/s): 48636.3
        p50 (us):       18.7
        stddev (us):    19.0409
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       22163.8
        p50 (ms):       22160.1
        stddev (ms):    31.3122
        n:              5
Peak working set size (bytes): 5478862848
WebGPU device lost (2): Device was destroyed.

-- With A_REPEAT of 8 --
C:\onnxruntime>c:\model_benchmark\model_benchmark.exe -i C:\Phi-3.5-mini-instruct-onnx-web\Phi-3.5-mini-instruct-onnx-web -l 500                                                                                                                                                               
Batch size: 1, prompt tokens: 501, tokens to generate: 128
Prompt processing (time to first token):
        avg (us):       1.1233e+07
        avg (tokens/s): 44.6006              <<<
        p50 (us):       1.12267e+07
        stddev (us):    13445.2
        n:              5 * 501 token(s)
Token generation:
        avg (us):       78740.4
        avg (tokens/s): 12.7
        p50 (us):       78763
        stddev (us):    2196.62
        n:              635 * 1 token(s)
Token sampling:
        avg (us):       21.4592
        avg (tokens/s): 46600
        p50 (us):       20.3
        stddev (us):    10.3021
        n:              640 * 1 token(s)
E2E generation (entire generation loop):
        avg (ms):       21235.9
        p50 (ms):       21226.8
        stddev (ms):    44.8555
        n:              5
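The two prompt-processing results can be compared directly. The snippet below only does arithmetic on the averages reported in the two runs; the variable names are illustrative, and with n = 5 per run the result should be read as indicative rather than precise.

```python
# Prompt processing averages copied from the two benchmark runs.
baseline_tokens_per_s = 41.2856
a_repeat_tokens_per_s = 44.6006
baseline_us = 1.2135e7
a_repeat_us = 1.1233e7

# Relative change in throughput and in time to first token.
throughput_gain = a_repeat_tokens_per_s / baseline_tokens_per_s - 1.0
latency_reduction = 1.0 - a_repeat_us / baseline_us

print(f"prompt throughput gain: {throughput_gain:.1%}")
print(f"time-to-first-token reduction: {latency_reduction:.1%}")
```

This works out to roughly 8% higher prompt throughput and a little over 7% less time to first token, while token generation is essentially unchanged between the two runs (12.667 vs 12.7 tokens/s).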

guschmue · 2024-12-11T03:53:02Z

/azp run ONNX Runtime Web CI Pipeline,Windows GPU CI Pipeline,Linux Android Emulator QNN CI Pipeline

guschmue · 2024-12-11T03:53:14Z

/azp run Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,Windows ARM64 QNN CI Pipeline,Windows CPU CI Pipeline

azure-pipelines · 2024-12-11T03:53:16Z

Azure Pipelines successfully started running 2 pipeline(s).

guschmue · 2024-12-11T03:53:25Z

/azp run Windows GPU TensorRT CI Pipeline,onnxruntime-binary-size-checks-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,Windows x64 QNN CI Pipeline,Big Models

guschmue · 2024-12-11T03:53:35Z

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline

azure-pipelines · 2024-12-11T03:53:44Z

Azure Pipelines successfully started running 4 pipeline(s).

azure-pipelines · 2024-12-11T03:53:48Z

Azure Pipelines successfully started running 3 pipeline(s).

azure-pipelines · 2024-12-11T03:53:54Z

Azure Pipelines successfully started running 9 pipeline(s).

sushraja-msft and others added 6 commits December 9, 2024 14:13

Implement 2d tiled matmulnbits specialized for prefill

1ee552c

Run linter

ffb2dab

Mac fix and improve comments

aa51ec8

Improves 2d tiled matmulnbits by repeating A, loads N times for each …

401938f

…B load

fix typo

9acf194

Merge branch 'main' into user/sushraja/mat_mul_2d_repeat

6fb5394

guschmue added the ep:WebGPU ort-web webgpu provider label Dec 12, 2024

sushanthr mentioned this pull request Dec 13, 2024

[js/webgpu] Optimize matmulnbits with M > 1 #23092

Open

sushraja-msft closed this Feb 6, 2025
