Description
I tried to quantize GLM-4-32B-0414: https://huggingface.co/THUDM/GLM-4-32B-0414
Recipe:
# Import paths below are my assumption of the current llm-compressor /
# compressed-tensors layout; adjust them to match the installed version.
from llmcompressor.modifiers.awq import AWQMapping, AWQModifier
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)

# ignore_layers is defined elsewhere in my script.
recipe = [
    AWQModifier(
        bits=4,
        symmetric=False,
        # Read input->output from https://github.com/huggingface/transformers/blob/v4.51.3/src/transformers/models/glm4/modeling_glm4.py
        # which is somewhat easier than the vLLM one, as it's all in a single file
        mappings=[
            AWQMapping("re:.*input_layernorm", ["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"]),
            AWQMapping("re:.*v_proj", ["re:.*o_proj"]),
            AWQMapping("re:.*post_attention_layernorm", ["re:.*gate_up_proj"]),
            AWQMapping("re:.*gate_up_proj", ["re:.*down_proj"]),
        ],
    ),
    QuantizationModifier(
        ignore=ignore_layers,
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            ),
        },
    ),
]

I tried using 128 samples as suggested in these slides ("Calibration set"): https://minjiazhang.github.io/courses/fall24-resource/slides/awq.pdf
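For context, the recipe was applied through llm-compressor's oneshot entry point, roughly as in the minimal sketch below; the import path, the open_platypus dataset name, and the sequence length are illustrative assumptions and may differ from my actual script:

from transformers import AutoModelForCausalLM
from llmcompressor import oneshot  # import path may differ across llm-compressor versions

MODEL_ID = "THUDM/GLM-4-32B-0414"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# 128 calibration samples, per the "Calibration set" slide linked above
oneshot(
    model=model,
    dataset="open_platypus",   # illustrative built-in calibration dataset
    recipe=recipe,             # the recipe defined above
    max_seq_length=2048,
    num_calibration_samples=128,
)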
However, memory usage grew by 1–5 GB per sample, eventually exceeding 900 GB before I decided to give up on AWQ. Even with a swapfile, most of the time went to kernel swap-in/swap-out and I/O, compute was slow, and once that CPU-side part was worked around it frustratingly crashed with a CUDA OOM.
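A plain RSS logger wrapped around the calibration run makes the per-sample growth easy to see; the helper below is illustrative (psutil-based) and not part of llm-compressor:

import os
import psutil

_proc = psutil.Process(os.getpid())

def log_rss(tag: str) -> None:
    # Print the current resident set size in GiB so the per-sample
    # growth during calibration shows up in the console output.
    print(f"[{tag}] RSS: {_proc.memory_info().rss / 1024**3:.1f} GiB")

log_rss("before calibration")
# run the oneshot(...) call from the sketch above here
log_rss("after calibration")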
Side-note: couldn't the calibration be made multi-threaded?