TensorRT Quantization Breaks for LlamaLinearScalingRotaryEmbedding #1083

Closed
1 of 4 tasks
Sanger2000 opened this issue Feb 14, 2024 · 9 comments
Labels: bug, stale

@Sanger2000

System Info

NVIDIA 4090
TensorRT-LLM 0.7.1

In nvidia-ammo, it appears these lines in ammo/torch/export/layer_utils.py fail unexpectedly for some Llama variants:
[screenshot: the is_linear / build_linear_config checks in layer_utils.py]

In particular, the deepseek models use LlamaLinearScalingRotaryEmbedding. This means the module is picked up by the is_linear check and treated as the dense case. However, this module has no .weight, so build_linear_config fails.

There are plenty of easy fixes for this (for example, just checking whether "Rotary" is in the class name and skipping that case). I'm happy to contribute one, but I don't think there is an OSS repo to do so.
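For illustration, here is a minimal sketch of one such guard, assuming the export code could simply skip modules that do not carry a weight tensor (is_quantizable_linear is a hypothetical name, not the actual ammo helper):

import torch.nn as nn

def is_quantizable_linear(module: nn.Module) -> bool:
    """Hypothetical guard: treat a module as an exportable linear layer only if
    it both looks like one and actually carries a weight tensor."""
    looks_linear = any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"])
    # Rotary embedding modules match "Linear" by name but have no .weight, so this filters them out.
    has_weight = getattr(module, "weight", None) is not None
    return looks_linear and has_weight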

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Try compiling deepseek-coder-6.7b-base with fp8 quantization and then running it.
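For reference, a sketch of what that might look like with the example quantization script quoted later in this thread (the paths, output directory, and calibration size are illustrative, not the exact repro command):

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
                --dtype bfloat16 \
                --qformat fp8 \
                --output_dir /data/deepseek-coder-6.7b-base-fp8 \
                --calib_size 32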

Expected behavior

I expect the model to generate tokens normally.

Actual behavior

The code throws: AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'

Additional notes

N/A

@Sanger2000 Sanger2000 added the bug Something isn't working label Feb 14, 2024
@Tracin Tracin assigned RalphMao and unassigned Tracin Mar 5, 2024
@shatealaboxiaowang

Is there a solution? I have the same problem.

@activezhao

@Sanger2000 I have the same problem with the deepseek-coder-6.7b-base model. Have you solved it?

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
                --dtype bfloat16 \
                --qformat int4_awq \
                --batch_size 8 \
                --tp_size 2 \
                --awq_block_size 128 \
                --output_dir /data/deepseek-coder-6.7b-base-int4-awq-tp2 \
                --calib_size 32

................................................................

/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Quantization done. Total time used: 65.55 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /data/deepseek-coder-6.7b-base-int4-awq-tp2/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 945, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 638, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 581, in build_linear_config
    torch_weight = module.weight.detach()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Quantized model exported to /data/deepseek-coder-6.7b-base-int4-awq-tp2 
Total time used 10.00 s.

@RalphMao
Collaborator

RalphMao commented Apr 2, 2024

Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.

@silverriver

silverriver commented Apr 2, 2024

I am facing the same issue with v0.8.0. Help needed.

@activezhao
Copy link

Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.

Hi @RalphMao, is there a temporary way to avoid this problem for now?

@Opdoop

Opdoop commented Apr 19, 2024

@activezhao A hotfix would be to modify the is_linear function to skip the 'Rotary' layers.

def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    return (
        any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"])
        and "Rotary" not in type(module).__name__
    )

@activezhao

activezhao commented Apr 25, 2024

@Opdoop OK, thanks.

@activezhao

Hi @Opdoop, I have a question.

If I set --qformat to fp8 in quantize.py, are the weights and the activations both quantized to fp8?

Thanks

@hello-11
Collaborator

@Sanger2000 Do you still have the problem? If not, we will close it soon.
