
[AWQ] Insane memory requirement: over 900GB for 32B model #1409

Description

@mratsim

I tried to quantize GLM-4-32B-0414: https://huggingface.co/THUDM/GLM-4-32B-0414

Recipe:

from llmcompressor.modifiers.awq import AWQModifier, AWQMapping
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

recipe = [
    AWQModifier(
        bits=4,
        symmetric=False,
        # Mappings read input->output from https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/glm4/modeling_glm4.py
        # which is somewhat easier to follow than the vLLM code, as it's all in a single file
        mappings=[
            AWQMapping("re:.*input_layernorm", ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]),
            AWQMapping("re:.*v_proj", ["re:.*o_proj"]),
            AWQMapping("re:.*post_attention_layernorm", ["re:.*gate_up_proj"]),
            AWQMapping("re:.*gate_up_proj", ["re:.*down_proj"]),
        ],
    ),
    QuantizationModifier(
        ignore=ignore_layers,  # layers to skip (e.g. lm_head), defined elsewhere in my script
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            ),
        },
    ),
]
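
For context, the recipe is applied roughly like this (a minimal sketch assuming llmcompressor's oneshot entry point; the dataset name and sequence length here are illustrative, not exactly what I ran):

from llmcompressor import oneshot

# Run one-shot calibration + quantization with the recipe above.
# Dataset and max_seq_length are placeholders for illustration.
oneshot(
    model="THUDM/GLM-4-32B-0414",
    dataset="open_platypus",
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
)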

I tried using 128 samples, as suggested under "Calibration set" in these slides: https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf

However, memory usage grew by 1~5 GB with every sample, eventually exceeding 900 GB before I gave up on AWQ. Even with a swapfile, the time was dominated by kernel swap-in/swap-out and IO, compute was slow, and once that CPU-side memory pressure was worked around, the run frustratingly crashed with a CUDA OOM anyway.
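
For reference, this is the kind of probe I'd use to track the per-sample growth (a sketch using psutil; assumes `model` and a `samples` iterable of tokenized dicts are already set up):

import os
import psutil
import torch

proc = psutil.Process(os.getpid())
for i, sample in enumerate(samples):
    # Forward pass only; calibration statistics don't need gradients.
    with torch.no_grad():
        model(**sample)
    # Log resident memory after each sample to see the 1~5 GB steps.
    rss_gb = proc.memory_info().rss / 1024**3
    print(f"sample {i}: RSS = {rss_gb:.1f} GB")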

Screenshot: [memory usage screenshot; image omitted]

Side-note: couldn't the calibration be made multi-threaded?
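
Something along these lines (a very rough sketch; it assumes the forward passes and AWQ's activation-capturing hooks are thread-safe, which they may well not be):

from concurrent.futures import ThreadPoolExecutor
import torch

def calibrate_sample(model, sample):
    # One calibration forward pass; PyTorch releases the GIL inside
    # C++ ops, so threads can overlap some compute.
    with torch.no_grad():
        return model(**sample)

def calibrate_parallel(model, samples, workers=4):
    # Hypothetical helper, not an existing llm-compressor API.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda s: calibrate_sample(model, s), samples))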

Labels: bug (Something isn't working)
