
Fix per_token slowdown #57

Merged
merged 2 commits into main on May 20, 2024

Conversation

@Satrat Satrat commented May 17, 2024

We had vectorized the scale and zero-point calculations for the min-max observer, but the memoryless observer was still using the un-vectorized code. This change moves `get_qparams_along_dim` to the base observer class so that all observers use it, and resolves some shape issues.

With this change, the token, channel, and tensor strategies can all share the same logic for calling quant/dequant, which also simplifies the forward pass code significantly.
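To illustrate the idea behind a shared `get_qparams_along_dim`, here is a minimal sketch (not the repository's actual implementation; the function body, helper names, and the numpy stand-in for torch are assumptions) of computing per-token scales and zero points in one vectorized reduction instead of a Python loop over tokens:

```python
import numpy as np

def get_qparams_along_dim(x, dim, num_bits=8):
    """Hypothetical sketch: compute one (scale, zero_point) pair per slice
    along `dim` by reducing min/max over all other dimensions at once."""
    reduce_dims = tuple(i for i in range(x.ndim) if i != dim)
    # Include 0 in the range so that zero is exactly representable
    min_vals = np.minimum(x.min(axis=reduce_dims), 0.0)
    max_vals = np.maximum(x.max(axis=reduce_dims), 0.0)
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    scales = (max_vals - min_vals) / (qmax - qmin)
    scales = np.where(scales == 0, 1e-8, scales)  # guard against div-by-zero
    zero_points = np.round(qmin - min_vals / scales).astype(np.int32)
    return scales, zero_points

# Per-token strategy: one qparam pair per row of the activation matrix
x = np.array([[0.0, 1.0], [0.0, 2.0]])
scales, zps = get_qparams_along_dim(x, dim=0)
```

Because the min/max reduction runs over the whole tensor in one call, the per-token, per-channel, and per-tensor strategies differ only in which dimension (if any) is kept, so they can share a single quant/dequant code path.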

Testing

Running a Llama 1.1B model with W8A8 dynamic per-token quantization:

Before change: 10 sec/iteration (0.1 iterations/sec)
After change: 4.5 iterations/sec

A roughly 45x speedup!

@Satrat Satrat requested review from bfineran, horheynm, dbogunowicz, dsikka and rahul-tuli and removed request for horheynm May 17, 2024 17:38
@Satrat Satrat requested a review from bfineran May 17, 2024 19:23
@bfineran bfineran merged commit f9d8d8b into main May 20, 2024
1 check passed
@bfineran bfineran deleted the sa/fix_token_speed branch May 20, 2024 14:34