
Don't have a sparsity_config if model is quantized #680

Closed
wants to merge 2 commits into from


rahul-tuli (Collaborator) commented Sep 26, 2024

Previously, a no-operation (no-op) dense compression configuration was used as the sparsity_config for sparse-quantized models.

However, since sparse compression is not yet supported for sparse-quantized models, the sparsity_config should be left empty. This PR ensures that the sparsity_config stays empty until sparse compression support is implemented for sparse-quantized models.
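The rule described above can be sketched as a small decision function. This is a minimal, hypothetical illustration, not the actual llm-compressor or compressed-tensors code: the function name, its parameters, and the returned dict are assumptions. The 0.05 threshold is taken from the PR's note on quantization-induced sparsity.

```python
from typing import Optional

# Threshold taken from the PR note; where the real constant lives is not shown here.
SPARSITY_THRESHOLD = 0.05


def choose_sparsity_config(
    global_sparsity: float,
    is_quantized: bool,
    sparse_quantized_supported: bool = False,
) -> Optional[dict]:
    """Hypothetical helper: return a sparsity_config dict, or None to omit it."""
    if global_sparsity < SPARSITY_THRESHOLD:
        # Too few zeros to treat the model as sparse at all.
        return None
    if is_quantized and not sparse_quantized_supported:
        # Behavior proposed in this PR: leave sparsity_config empty
        # instead of emitting a no-op "dense" config.
        return None
    # Placeholder payload; a real config would carry format, targets, etc.
    return {"global_sparsity": global_sparsity}
```

With the numbers from the config.json shown next (global_sparsity of about 0.143 on a quantized model), this sketch returns None, so no sparsity_config would be written.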

Note: This only becomes relevant when the global_sparsity induced by quantization exceeds 0.05.
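Quantization alone can produce zeros: with symmetric integer quantization, any weight whose magnitude is below half of its group's scale rounds to 0. The sketch below measures that effect on a flat list of weights. It is a simplified assumption, not the library's observer: it uses scale = absmax / qmax (7 for int4), and the real minmax observer in compressed-tensors may compute the scale differently.

```python
from typing import List, Sequence


def quantize_group_symmetric(group: Sequence[float], num_bits: int = 4) -> List[int]:
    """Simplified symmetric quantizer for one group (scale = absmax / qmax)."""
    qmax = 2 ** (num_bits - 1) - 1   # 7 for int4
    qmin = -(2 ** (num_bits - 1))    # -8 for int4
    absmax = max(abs(w) for w in group)
    if absmax == 0.0:
        return [0] * len(group)
    scale = absmax / qmax
    # Round to the nearest integer level and clamp to the signed range.
    return [max(qmin, min(qmax, round(w / scale))) for w in group]


def quantization_induced_sparsity(
    weights: Sequence[float], group_size: int = 128, num_bits: int = 4
) -> float:
    """Fraction of weights that quantize to exactly 0, group by group."""
    if not weights:
        return 0.0
    zeros = 0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        zeros += sum(1 for q in quantize_group_symmetric(group, num_bits) if q == 0)
    return zeros / len(weights)
```

For example, `quantization_induced_sparsity([1.0, 0.01, -0.02, 0.5], group_size=4)` gives 0.5: the two small weights fall below half a quantization step and become 0.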

Before this PR, the compression_config in config.json was:

  "compression_config": {
    "config_groups": {
      "group_0": {
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "group",
          "symmetric": true,
          "type": "int"
        }
      }
    },
    "format": "pack-quantized",
    "global_compression_ratio": 1.883165566487463,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "sparsity_config": {
      "format": "dense",
      "global_sparsity": 0.14297135853065746,
      "ignore": [
        "model.layers.0.self_attn.o_proj",
        "model.layers.0.mlp.gate_proj",
        "model.layers.0.mlp.up_proj",
        "model.layers.0.mlp.down_proj",
        "model.layers.1.self_attn.v_proj",
        "model.layers.1.self_attn.o_proj",
        "model.layers.1.mlp.gate_proj",
        "model.layers.1.mlp.up_proj",
        "model.layers.1.mlp.down_proj",
        "model.layers.2.self_attn.q_proj",
        "model.layers.2.self_attn.k_proj",
        "model.layers.2.self_attn.v_proj",
        "model.layers.2.self_attn.o_proj",
        "model.layers.2.mlp.gate_proj",
        "model.layers.2.mlp.up_proj",
        "model.layers.2.mlp.down_proj",
        "model.layers.3.self_attn.q_proj",
        "model.layers.3.self_attn.k_proj",
        "model.layers.3.self_attn.v_proj",
        "model.layers.3.self_attn.o_proj",
        "model.layers.3.mlp.gate_proj",
        "model.layers.3.mlp.up_proj",
        "model.layers.3.mlp.down_proj",
        "model.layers.4.self_attn.q_proj",
        "model.layers.4.self_attn.k_proj",
        "model.layers.4.self_attn.v_proj",
        "model.layers.4.self_attn.o_proj",
        "model.layers.4.mlp.gate_proj",
        "model.layers.4.mlp.up_proj",
        "model.layers.4.mlp.down_proj",
        "model.layers.5.self_attn.q_proj",
        "model.layers.5.self_attn.k_proj",
        "model.layers.5.self_attn.v_proj",
        "model.layers.5.self_attn.o_proj",
        "model.layers.5.mlp.gate_proj",
        "model.layers.5.mlp.up_proj",
        "model.layers.5.mlp.down_proj",
        "model.layers.6.self_attn.q_proj",
        "model.layers.6.self_attn.k_proj",
        "model.layers.6.self_attn.v_proj",
        "model.layers.6.self_attn.o_proj",
        "model.layers.6.mlp.gate_proj",
        "model.layers.6.mlp.up_proj",
        "model.layers.6.mlp.down_proj",
        "model.layers.7.self_attn.q_proj",
        "model.layers.7.self_attn.k_proj",
        "model.layers.7.self_attn.v_proj",
        "model.layers.7.self_attn.o_proj",
        "model.layers.7.mlp.gate_proj",
        "model.layers.7.mlp.up_proj",
        "model.layers.7.mlp.down_proj",
        "model.layers.8.self_attn.q_proj",
        "model.layers.8.self_attn.k_proj",
        "model.layers.8.self_attn.v_proj",
        "model.layers.8.self_attn.o_proj",
        "model.layers.8.mlp.gate_proj",
        "model.layers.8.mlp.up_proj",
        "model.layers.8.mlp.down_proj",
        "model.layers.9.self_attn.q_proj",
        "model.layers.9.self_attn.k_proj",
        "model.layers.9.self_attn.v_proj",
        "model.layers.9.self_attn.o_proj",
        "model.layers.9.mlp.gate_proj",
        "model.layers.9.mlp.up_proj",
        "model.layers.9.mlp.down_proj",
        "model.layers.10.self_attn.q_proj",
        "model.layers.10.self_attn.k_proj",
        "model.layers.10.self_attn.v_proj",
        "model.layers.10.self_attn.o_proj",
        "model.layers.10.mlp.gate_proj",
        "model.layers.10.mlp.up_proj",
        "model.layers.10.mlp.down_proj",
        "model.layers.11.self_attn.q_proj",
        "model.layers.11.self_attn.k_proj",
        "model.layers.11.self_attn.v_proj",
        "model.layers.11.self_attn.o_proj",
        "model.layers.11.mlp.gate_proj",
        "model.layers.11.mlp.up_proj",
        "model.layers.11.mlp.down_proj",
        "model.layers.12.self_attn.q_proj",
        "model.layers.12.self_attn.k_proj",
        "model.layers.12.self_attn.v_proj",
        "model.layers.12.self_attn.o_proj",
        "model.layers.12.mlp.gate_proj",
        "model.layers.12.mlp.up_proj",
        "model.layers.12.mlp.down_proj",
        "model.layers.13.self_attn.q_proj",
        "model.layers.13.self_attn.k_proj",
        "model.layers.13.self_attn.v_proj",
        "model.layers.13.self_attn.o_proj",
        "model.layers.13.mlp.gate_proj",
        "model.layers.13.mlp.up_proj",
        "model.layers.13.mlp.down_proj",
        "model.layers.14.self_attn.q_proj",
        "model.layers.14.self_attn.k_proj",
        "model.layers.14.self_attn.v_proj",
        "model.layers.14.self_attn.o_proj",
        "model.layers.14.mlp.gate_proj",
        "model.layers.14.mlp.up_proj",
        "model.layers.14.mlp.down_proj",
        "model.layers.15.self_attn.q_proj",
        "model.layers.15.self_attn.k_proj",
        "model.layers.15.self_attn.v_proj",
        "model.layers.15.self_attn.o_proj",
        "model.layers.15.mlp.gate_proj",
        "model.layers.15.mlp.up_proj",
        "model.layers.15.mlp.down_proj",
        "model.layers.16.self_attn.q_proj",
        "model.layers.16.self_attn.k_proj",
        "model.layers.16.self_attn.v_proj",
        "model.layers.16.self_attn.o_proj",
        "model.layers.16.mlp.gate_proj",
        "model.layers.16.mlp.up_proj",
        "model.layers.16.mlp.down_proj",
        "model.layers.17.self_attn.q_proj",
        "model.layers.17.self_attn.k_proj",
        "model.layers.17.self_attn.v_proj",
        "model.layers.17.self_attn.o_proj",
        "model.layers.17.mlp.gate_proj",
        "model.layers.17.mlp.up_proj",
        "model.layers.17.mlp.down_proj",
        "model.layers.18.self_attn.q_proj",
        "model.layers.18.self_attn.k_proj",
        "model.layers.18.self_attn.v_proj",
        "model.layers.18.self_attn.o_proj",
        "model.layers.18.mlp.gate_proj",
        "model.layers.18.mlp.up_proj",
        "model.layers.18.mlp.down_proj",
        "model.layers.19.self_attn.q_proj",
        "model.layers.19.self_attn.k_proj",
        "model.layers.19.self_attn.v_proj",
        "model.layers.19.self_attn.o_proj",
        "model.layers.19.mlp.gate_proj",
        "model.layers.19.mlp.up_proj",
        "model.layers.19.mlp.down_proj",
        "model.layers.20.self_attn.q_proj",
        "model.layers.20.self_attn.k_proj",
        "model.layers.20.self_attn.v_proj",
        "model.layers.20.self_attn.o_proj",
        "model.layers.20.mlp.gate_proj",
        "model.layers.20.mlp.up_proj",
        "model.layers.20.mlp.down_proj",
        "model.layers.21.self_attn.q_proj",
        "model.layers.21.self_attn.k_proj",
        "model.layers.21.self_attn.v_proj",
        "model.layers.21.self_attn.o_proj",
        "model.layers.21.mlp.gate_proj",
        "model.layers.21.mlp.up_proj",
        "model.layers.21.mlp.down_proj",
        "lm_head"
      ],
      "registry_requires_subclass": false,
      "sparsity_structure": "unstructured",
      "targets": [
        "model.layers.0.self_attn.q_proj",
        "model.layers.0.self_attn.k_proj",
        "model.layers.0.self_attn.v_proj",
        "model.layers.1.self_attn.q_proj",
        "model.layers.1.self_attn.k_proj"
      ]
    },
    "version": "0.6.0.20240926"
  }

And now it is:

  "compression_config": {
    "config_groups": {
      "group_0": {
        "input_activations": null,
        "output_activations": null,
        "targets": [
          "Linear"
        ],
        "weights": {
          "actorder": null,
          "block_structure": null,
          "dynamic": false,
          "group_size": 128,
          "num_bits": 4,
          "observer": "minmax",
          "observer_kwargs": {},
          "strategy": "group",
          "symmetric": true,
          "type": "int"
        }
      }
    },
    "format": "pack-quantized",
    "global_compression_ratio": 2.1743527231963227,
    "ignore": [
      "lm_head"
    ],
    "kv_cache_scheme": null,
    "quant_method": "compressed-tensors",
    "quantization_status": "compressed",
    "version": "0.6.0.20240926"
  },
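To check which of the two shapes a saved checkpoint has, one could read config.json and test for a sparsity_config entry. A minimal standard-library sketch, assuming the top-level compression_config key shown above; other versions of the tooling may store this section under a different key. The helper names are hypothetical.

```python
import json
from pathlib import Path
from typing import Any, Dict


def compression_config_from_json(text: str) -> Dict[str, Any]:
    """Return the compression_config section of a config.json string ({} if absent)."""
    config = json.loads(text)
    return config.get("compression_config") or {}


def load_compression_config(model_dir: str) -> Dict[str, Any]:
    """Read <model_dir>/config.json and return its compression_config section."""
    return compression_config_from_json((Path(model_dir) / "config.json").read_text())


def has_sparsity_config(compression_config: Dict[str, Any]) -> bool:
    """True if a non-null sparsity_config is present."""
    return compression_config.get("sparsity_config") is not None
```

Applied to the two configs above, `has_sparsity_config` would return True for the first and False for the second.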

Test script:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

# Small model so the example runs quickly
MODEL_ID = "Xenova/llama2.c-stories110M"
# Calibrate on the first 1% of the open_platypus train split
DATASET_ID, SPLIT = "open_platypus", {"calibration": "train[:1%]"}
NUM_CALIBRATION_SAMPLES = 16
OUTPUT_DIR = MODEL_ID.split("/")[-1] + f"-{NUM_CALIBRATION_SAMPLES}-quantized"
MAX_SEQUENCE_LENGTH = 512

# Quantize all Linear layers to W4A16, except the output head
targets = ["Linear"]
ignore = ["lm_head"]
scheme = "W4A16"

# Quantization only (no pruning modifiers), so any sparsity detected
# at save time comes from quantization rounding weights to zero
recipe = [
    QuantizationModifier(scheme=scheme, targets=targets, ignore=ignore),
]

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

# Calibrate, quantize, and save the compressed model (including config.json)
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits=SPLIT,
    recipe=recipe,
    save_compressed=True,
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

print("Compression Done, Model Saved at:", OUTPUT_DIR)
```


👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

rahul-tuli marked this pull request as ready for review September 26, 2024 16:58

kylesayrs (Collaborator) left a comment:

Rather than modifying llm-compressor logic, shouldn't we instead modify the ModelCompressor logic so that it does not save the sparsity config when the sparsity compressor is the dense compressor?

Making the change upstream has the advantage of standardizing the behavior across all CT users.
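The reviewer's suggestion, filtering at serialization time so that a no-op dense sparsity config is never written, might look roughly like this. The function and its arguments are hypothetical and do not reproduce ModelCompressor's real API; the only assumption about the data is that a dense sparsity config carries "format": "dense", as in the earlier config.json output.

```python
from typing import Any, Dict, Optional


def serialize_compression_config(
    quantization_config: Optional[Dict[str, Any]],
    sparsity_config: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Hypothetical serializer: omit the sparsity config when it is the no-op dense format."""
    out: Dict[str, Any] = dict(quantization_config or {})
    # A "dense" sparsity config compresses nothing, so skip writing it.
    if sparsity_config is not None and sparsity_config.get("format") != "dense":
        out["sparsity_config"] = sparsity_config
    return out
```

Because the check sits in the shared save path rather than in each caller, every tool that saves through it gets the same behavior, which is the standardization argument above.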

rahul-tuli (Collaborator, Author) replied:

I like that, good call-out.

rahul-tuli (Collaborator, Author) commented Sep 26, 2024

Closing this one out. As recommended by @kylesayrs, the change was made on the compressed-tensors side instead: neuralmagic/compressed-tensors#169

rahul-tuli closed this Sep 26, 2024