Commit ada71fd

[Core] Default to using per_token quantization for fp8 when cutlass is supported. (vllm-project#8651)

Signed-off-by: mgoin <michael@neuralmagic.com>
Co-authored-by: Michael Goin <mgoin@redhat.com>
Co-authored-by: mgoin <michael@neuralmagic.com>
3 people authored and Ubuntu committed Jan 19, 2025
1 parent c2cd42b commit ada71fd
Showing 1 changed file with 2 additions and 1 deletion.
vllm/model_executor/layers/quantization/fp8.py (3 changes: 2 additions & 1 deletion)
@@ -355,7 +355,8 @@ def apply(self,
            input_scale=layer.input_scale,
            bias=bias,
            cutlass_fp8_supported=self.cutlass_fp8_supported,
-           use_per_token_if_dynamic=False)
+           # Default to using per_token quantization if cutlass is supported
+           use_per_token_if_dynamic=self.cutlass_fp8_supported)


class Fp8MoEMethod(FusedMoEMethodBase):

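For context, a minimal sketch of what the use_per_token_if_dynamic flag conceptually toggles for dynamic fp8 activation quantization: per-tensor quantization keeps a single scale for the whole activation tensor, while per-token quantization computes one scale per row (token), so an outlier token does not inflate the quantization error of every other token. This is plain PyTorch for illustration only, not the vLLM kernel path; FP8_MAX and the helper names are assumptions made for the sketch.

# Illustrative sketch only; not vLLM's apply_fp8_linear implementation.
# Requires a PyTorch build with float8_e4m3fn support (>= 2.1).
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn


def dynamic_per_tensor_fp8(x: torch.Tensor):
    # One scale for the whole activation tensor.
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    x_q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_q, scale  # scale: scalar


def dynamic_per_token_fp8(x: torch.Tensor):
    # One scale per row (token); outlier tokens only affect their own scale.
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / FP8_MAX
    x_q = (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return x_q, scale  # scale: [num_tokens, 1]


x = torch.randn(4, 8)        # [num_tokens, hidden_size]
q_t, s_t = dynamic_per_tensor_fp8(x)
q_p, s_p = dynamic_per_token_fp8(x)
print(s_t.shape, s_p.shape)  # torch.Size([]) vs torch.Size([4, 1])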