
Commit cdf686f

Authored by rahul-tuli and dsikka
Fix: Re-enable Sparse Compression for 2of4 Examples (#1153)
This PR restores sparse compression for our `2of4` examples, which was previously disabled due to a bug in the vLLM Cutlass integration.

#### Background

A bug in the Cutlass integration caused certain sparse-only compressed models to produce gibberish results. To mitigate this issue, we temporarily turned off sparse compression for our `2of4` examples. The bug has since been fixed by @tlrmchlsmth in vllm-project/vllm#13198. With this fix in place, we can safely re-enable sparse compression for these examples.

#### Changes

- Re-enable sparse compression for `2of4` examples.

#### Testing

- Verified that sparse-only compressed models now produce expected outputs.

---------

Signed-off-by: Rahul Tuli <rahul@neuralmagic.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
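As background for readers unfamiliar with the `2of4` naming: 2:4 semi-structured sparsity requires that every contiguous group of 4 weights contain at least 2 zeros, which is the pattern the Cutlass sparse kernels accelerate. A minimal illustrative check (not part of this PR or of llmcompressor):

```python
# Illustrative check for 2:4 ("2of4") semi-structured sparsity:
# in every contiguous group of 4 weights, at least 2 must be zero.
def is_2of4_sparse(weights):
    assert len(weights) % 4 == 0, "weight count must be a multiple of 4"
    return all(
        sum(1 for w in weights[i:i + 4] if w == 0) >= 2
        for i in range(0, len(weights), 4)
    )

print(is_2of4_sparse([0, 0, 1.5, -2.0, 0, 3.0, 0, 1.0]))  # True
print(is_2of4_sparse([1.0, 2.0, 3.0, 0, 0, 0, 0, 0]))     # False
```

Real checkpoints apply this pattern per row of each weight matrix; the list here is just a flattened stand-in.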
1 parent c8091d3 commit cdf686f

File tree

2 files changed: +2 −4 lines changed


examples/sparse_2of4_quantization_fp8/llama3_8b_2of4.py

Lines changed: 1 addition & 3 deletions
@@ -116,7 +116,5 @@ def get_recipe(fp8_enabled):
     print("==========================================\n")

     # Save compressed model and tokenizer
-    model.save_pretrained(
-        save_dir, save_compressed=args.fp8, disable_sparse_compression=True
-    )
+    model.save_pretrained(save_dir, save_compressed=args.fp8)
     tokenizer.save_pretrained(save_dir)
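The behavioral effect of dropping `disable_sparse_compression=True` can be sketched with a small stub. This is not the real llmcompressor/transformers API, only a hypothetical `FakeModel` that mirrors the two keyword arguments seen in the diff:

```python
# Hypothetical stub illustrating which compression paths the two
# save_pretrained calls from the diff would enable. Not the real API.
class FakeModel:
    def save_pretrained(self, save_dir, save_compressed=False,
                        disable_sparse_compression=False):
        # Quantized (e.g. FP8) compression applies only when
        # save_compressed=True; sparse compression applies unless
        # explicitly disabled.
        return {
            "quant_compressed": save_compressed,
            "sparse_compressed": not disable_sparse_compression,
        }

model = FakeModel()
# Old call: sparse compression disabled as a workaround for the Cutlass bug
old = model.save_pretrained("out", save_compressed=True,
                            disable_sparse_compression=True)
# New call: sparse compression re-enabled (the default)
new = model.save_pretrained("out", save_compressed=True)
```

With the vLLM fix in place, the default (sparse compression on) is safe again, so the extra flag can simply be removed.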

tests/e2e/vLLM/configs/sparse_24.yaml

Lines changed: 1 addition & 1 deletion
@@ -5,4 +5,4 @@ recipe: tests/e2e/vLLM/recipes/Sparse_2of4/recipe_sparse_2of4.yaml
 scheme: sparse2of4_only
 dataset_id: HuggingFaceH4/ultrachat_200k
 dataset_split: train_sft
-save_compressed: False
+save_compressed: True
