From 5789d9ee7f60cf79b3e5a4bdaf30e03274968708 Mon Sep 17 00:00:00 2001
From: "rshaw@neuralmagic.com"
Date: Mon, 8 Jul 2024 03:09:14 +0000
Subject: [PATCH] nit

---
 docs/conceptual_guides/inference_acceleration.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/conceptual_guides/inference_acceleration.md b/docs/conceptual_guides/inference_acceleration.md
index 1d7085a60..d2bcbd3f7 100644
--- a/docs/conceptual_guides/inference_acceleration.md
+++ b/docs/conceptual_guides/inference_acceleration.md
@@ -14,7 +14,9 @@ Roughly speaking, the time required to execute a matrix multiplication on a GPU
 * Latency of moving the weights from main memory (DRAM) to the compute (SRAM)
 * Latency of the tensor-core compute operations
 
-While weight-only quanitzation does not change the latency of the tensor-core operations (since the compute still runs at `bf/fp16`), it can reduce the latency of moving the weights from DRAM to SRAM with "fused" inference kernels that upconvert the weights to `fp16`after moving them into SRAM (thereby reducing the total amount of data movement between DRAM and SRAM). LLM Inference Serving is usually dominated by batch size < 64 "decode" operations, which are "memory bandwidth bound", meaning we can speed up the `Linear` matmuls with weight-only quantization.
+While weight-only quantization does not change the latency of the tensor-core operations (since the compute still runs at `bf/fp16`), it can reduce the latency of moving the weights from DRAM to SRAM with "fused" inference kernels that upconvert the weights to `fp16` after moving them into SRAM (thereby reducing the total amount of data movement between DRAM and SRAM).
+
+LLM inference serving is usually dominated by batch size < 64 "decode" operations, which are "memory bandwidth bound", meaning we can speed up the `Linear` matmuls with weight-only quantization.
 
 ### Accelerating Inference Serving in vLLM with `Marlin`
 
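
A rough back-of-the-envelope sketch of the data-movement argument in the patched paragraphs: in the memory-bandwidth-bound decode regime, the time to stream a `Linear` layer's weights from DRAM scales with bits per weight, so 4-bit weights give roughly a 4x ceiling on the speedup of that movement. The layer shape (4096x4096) and DRAM bandwidth (~2 TB/s) below are illustrative assumptions, not figures from the guide.

```python
# Estimate the DRAM->SRAM weight-streaming time for one Linear layer at
# different weight precisions. All numbers here are illustrative assumptions.

DRAM_BANDWIDTH_BYTES_PER_S = 2.0e12   # assumed ~2 TB/s HBM bandwidth
HIDDEN = 4096                         # assumed hidden size of the Linear layer
WEIGHT_ELEMS = HIDDEN * HIDDEN        # square weight matrix


def weight_stream_time_us(bits_per_weight: float) -> float:
    """Time (microseconds) to move the layer's weights from DRAM to SRAM."""
    bytes_moved = WEIGHT_ELEMS * bits_per_weight / 8
    return bytes_moved / DRAM_BANDWIDTH_BYTES_PER_S * 1e6


fp16_us = weight_stream_time_us(16)   # unquantized bf16/fp16 weights
int4_us = weight_stream_time_us(4)    # 4-bit weight-only quantization

print(f"fp16 weights: {fp16_us:.1f} us per layer")
print(f"int4 weights: {int4_us:.1f} us per layer")
print(f"upper bound on memory-bound speedup: {fp16_us / int4_us:.1f}x")
```

Because the tensor-core math still runs in `bf/fp16`, this ~4x figure is only an upper bound on the weight-movement portion of the latency; the realized end-to-end speedup depends on how well the fused kernel overlaps the upconversion with the matmul.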