TensorRT Quantization Breaks for LlamaLinearScalingRotaryEmbedding #1083

Closed
1 of 4 tasks
Sanger2000 opened this issue Feb 14, 2024 · 9 comments
Labels: bug, stale

@Sanger2000

System Info

NVIDIA 4090
TensorRT-LLM 0.7.1

In nvidia-ammo, it appears these lines in ammo/torch/export/layer_utils.py fail unexpectedly for some Llama variants:
[screenshot: the is_linear / build_linear_config checks in layer_utils.py]

In particular, the deepseek models use LlamaLinearScalingRotaryEmbedding. This means the module is picked up by the is_linear check and treated as the dense case. However, this module has no .weight, so build_linear_config fails.

There are plenty of easy fixes for this (for example, just checking whether "Rotary" is in the class name and skipping that case). I'm happy to contribute one, but I don't think there is an OSS repo to do so.
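For illustration, here is a minimal sketch of one such guard, assuming the export code could simply skip modules that do not carry a weight tensor (is_quantizable_linear is a hypothetical name, not the actual ammo helper):

import torch.nn as nn

def is_quantizable_linear(module: nn.Module) -> bool:
    """Hypothetical guard: treat a module as an exportable linear layer only if
    it both looks like one and actually carries a weight tensor."""
    looks_linear = any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"])
    # Rotary embedding modules match "Linear" by name but have no .weight, so this filters them out.
    has_weight = getattr(module, "weight", None) is not None
    return looks_linear and has_weight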

Who can help?

@Tracin

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Try compiling deepseek-coder-6.7b-base with fp8 quantization and then running it.
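For reference, a sketch of what that might look like with the example quantization script quoted later in this thread (the paths, output directory, and calibration size are illustrative, not the exact repro command):

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
                --dtype bfloat16 \
                --qformat fp8 \
                --output_dir /data/deepseek-coder-6.7b-base-fp8 \
                --calib_size 32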

Expected behavior

I expect the model to generate tokens normally.

Actual behavior

The code throws: AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'

Additional notes

N/A

@Sanger2000 Sanger2000 added the bug Something isn't working label Feb 14, 2024
@Tracin Tracin assigned RalphMao and unassigned Tracin Mar 5, 2024
@shatealaboxiaowang

Is there a solution? I have the same problem.

@activezhao

@Sanger2000 I have the same problem with the deepseek-coder-6.7b-base model. Have you solved it?

python /data/tensorrt_llm/examples/quantization/quantize.py --model_dir /data/deepseek-coder-6.7b-base/ \
                --dtype bfloat16 \
                --qformat int4_awq \
                --batch_size 8 \
                --tp_size 2 \
                --awq_block_size 128 \
                --output_dir /data/deepseek-coder-6.7b-base-int4-awq-tp2 \
                --calib_size 32

................................................................

/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Loading extension ammo_cuda_ext...
Loading extension ammo_cuda_ext_fp8...
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:155: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  value = torch.tensor(value, device=self._pre_quant_scale.device)
/usr/local/lib/python3.10/dist-packages/ammo/torch/quantization/nn/modules/tensor_quantizer.py:153: UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
  self.register_buffer("_pre_quant_scale", torch.tensor(value))
Calibrating batch 1
Calibrating batch 2
Calibrating batch 3
Quantization done. Total time used: 65.55 s.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
torch.distributed not initialized, assuming single world_size.
Cannot export model to the model_config. The AMMO optimized model state_dict (including the quantization factors) is saved to /data/deepseek-coder-6.7b-base-int4-awq-tp2/ammo_model.0.pth using torch.save for further inspection.
Detailed export error: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 307, in export_model_config
    for model_config in torch_to_model_config(
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/model_config_export.py", line 185, in torch_to_model_config
    build_decoder_config(layer, model_metadata_config, decoder_type, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 945, in build_decoder_config
    config.attention = build_attention_config(layer, model_metadata_config, dtype, config)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 638, in build_attention_config
    config.dense = build_linear_config(layer, LINEAR_ROW, dtype)
  File "/usr/local/lib/python3.10/dist-packages/ammo/torch/export/layer_utils.py", line 581, in build_linear_config
    torch_weight = module.weight.detach()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'LlamaLinearScalingRotaryEmbedding' object has no attribute 'weight'
Quantized model exported to /data/deepseek-coder-6.7b-base-int4-awq-tp2 
Total time used 10.00 s.

@RalphMao
Collaborator

RalphMao commented Apr 2, 2024

Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.

@silverriver

silverriver commented Apr 2, 2024

I am facing the same issue with v0.8.0. Help needed.

@activezhao
Copy link

Thank you for pointing out this issue. We will add a fix to more robustly distinguish the actual dense linear layer.

Hi @RalphMao, is there a temporary way to avoid this problem for now?

@Opdoop

Opdoop commented Apr 19, 2024

@activezhao A hotfix would be to modify the is_linear function to skip the 'Rotary' layers.

def is_linear(module: nn.Module) -> bool:
    """Returns whether the module is a linear layer."""
    return (
        any(k in type(module).__name__ for k in ["Linear", "Conv1D", "NormHead"])
        and "Rotary" not in type(module).__name__
    )

@activezhao

activezhao commented Apr 25, 2024

@Opdoop OK, thanks.

@activezhao

Hi @Opdoop, I have a question.

If I set --qformat to fp8 in quantize.py, are the weights and the activations both quantized to fp8?

Thanks

@hello-11
Collaborator

@Sanger2000 Do you still have the problem? If not, we will close it soon.
