From 5789d9ee7f60cf79b3e5a4bdaf30e03274968708 Mon Sep 17 00:00:00 2001
From: "rshaw@neuralmagic.com"
Date: Mon, 8 Jul 2024 03:09:14 +0000
Subject: [PATCH] nit

---
 docs/conceptual_guides/inference_acceleration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/conceptual_guides/inference_acceleration.md b/docs/conceptual_guides/inference_acceleration.md
index 1d7085a60..d2bcbd3f7 100644
--- a/docs/conceptual_guides/inference_acceleration.md
+++ b/docs/conceptual_guides/inference_acceleration.md
@@ -14,7 +14,9 @@ Roughly speaking, the time required to execute a matrix multiplication on a GPU
 * Latency of moving the weights from main memory (DRAM) to the compute (SRAM)
 * Latency of the tensor-core compute operations
 
-While weight-only quanitzation does not change the latency of the tensor-core operations (since the compute still runs at `bf/fp16`), it can reduce the latency of moving the weights from DRAM to SRAM with "fused" inference kernels that upconvert the weights to `fp16`after moving them into SRAM (thereby reducing the total amount of data movement between DRAM and SRAM). LLM Inference Serving is usually dominated by batch size < 64 "decode" operations, which are "memory bandwidth bound", meaning we can speed up the `Linear` matmuls with weight-only quantization.
+While weight-only quantization does not change the latency of the tensor-core operations (since the compute still runs at `bf/fp16`), it can reduce the latency of moving the weights from DRAM to SRAM with "fused" inference kernels that upconvert the weights to `fp16` after moving them into SRAM (thereby reducing the total amount of data movement between DRAM and SRAM).
+
+LLM inference serving is usually dominated by batch size < 64 "decode" operations, which are "memory bandwidth bound", meaning we can speed up the `Linear` matmuls with weight-only quantization.
 
 ### Accelerating Inference Serving in vLLM with `Marlin`
 
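
A rough back-of-the-envelope sketch of the data-movement argument in the patched paragraphs: in the memory-bandwidth-bound decode regime, the time to stream a `Linear` layer's weights from DRAM scales with bits per weight, so 4-bit weights give roughly a 4x ceiling on the speedup of that movement. The layer shape (4096x4096) and DRAM bandwidth (~2 TB/s) below are illustrative assumptions, not figures from the guide.

```python
# Estimate the DRAM->SRAM weight-streaming time for one Linear layer at
# different weight precisions. All numbers here are illustrative assumptions.

DRAM_BANDWIDTH_BYTES_PER_S = 2.0e12   # assumed ~2 TB/s HBM bandwidth
HIDDEN = 4096                         # assumed hidden size of the Linear layer
WEIGHT_ELEMS = HIDDEN * HIDDEN        # square weight matrix


def weight_stream_time_us(bits_per_weight: float) -> float:
    """Time (microseconds) to move the layer's weights from DRAM to SRAM."""
    bytes_moved = WEIGHT_ELEMS * bits_per_weight / 8
    return bytes_moved / DRAM_BANDWIDTH_BYTES_PER_S * 1e6


fp16_us = weight_stream_time_us(16)   # unquantized bf16/fp16 weights
int4_us = weight_stream_time_us(4)    # 4-bit weight-only quantization

print(f"fp16 weights: {fp16_us:.1f} us per layer")
print(f"int4 weights: {int4_us:.1f} us per layer")
print(f"upper bound on memory-bound speedup: {fp16_us / int4_us:.1f}x")
```

Because the tensor-core math still runs in `bf/fp16`, this ~4x figure is only an upper bound on the weight-movement portion of the latency; the realized end-to-end speedup depends on how well the fused kernel overlaps the upconversion with the matmul.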