Added cache_block_outputs option to enable GPTQ for non-regular models #27032
Conversation
Thanks @AlexKoff88, let's wait for the optimum PR to be merged. We might not need this argument.
LGTM! A few nits to fix.
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
@AlexKoff88 please run
Fixed
Thanks for adding this!
Just a small nit on the docstring.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Added cache_block_outputs option to enable GPTQ for non-regular models (huggingface#27032)

* Added cache_block_outputs option to enable GPTQ for non-regular models
* Update src/transformers/utils/quantization_config.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update src/transformers/utils/quantization_config.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Fixed style
* Update src/transformers/utils/quantization_config.py
  Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
cache_block_outputs enables caching of each block's outputs to speed up the GPTQ process. However, this does not work for some models, such as ChatGLM, where a LayerNorm is the first layer of the block.
Just compare the two module structures (a small sketch for printing such listings follows them):
OPT structure:
model.decoder.layers.0.self_attn
model.decoder.layers.0.self_attn.k_proj
model.decoder.layers.0.self_attn.v_proj
model.decoder.layers.0.self_attn.q_proj
model.decoder.layers.0.self_attn.out_proj
model.decoder.layers.0.activation_fn
model.decoder.layers.0.self_attn_layer_norm
model.decoder.layers.0.fc1
model.decoder.layers.0.fc2
model.decoder.layers.0.final_layer_norm
ChatGLM structure:
transformer.encoder.layers.0
transformer.encoder.layers.0.input_layernorm
transformer.encoder.layers.0.self_attention
transformer.encoder.layers.0.self_attention.query_key_value
transformer.encoder.layers.0.self_attention.core_attention
transformer.encoder.layers.0.self_attention.core_attention.attention_dropout
transformer.encoder.layers.0.self_attention.dense
transformer.encoder.layers.0.post_attention_layernorm
transformer.encoder.layers.0.mlp
transformer.encoder.layers.0.mlp.dense_h_to_4h
transformer.encoder.layers.0.mlp.dense_4h_to_h
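For reference, listings like the ones above can be produced by iterating over a model's submodules. A minimal sketch, using `facebook/opt-125m` purely as an example checkpoint:

```python
from transformers import AutoModelForCausalLM

# Load any causal LM; "facebook/opt-125m" is used here only as an example.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# Print the qualified name of every submodule belonging to the first decoder
# block; this is a crude substring filter, but it reproduces listings like
# the OPT/ChatGLM ones above.
for name, _ in model.named_modules():
    if ".layers.0" in name:
        print(name)
```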
The solution is to disable the caching of self-attention block outputs and instead collect the inputs of the block being quantized by running the model from the beginning. This slows down the optimization a bit but is more stable.
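In rough terms, the two ways of collecting calibration inputs differ as sketched below. This is a simplified toy illustration, not the actual GPTQ implementation; `quantize_block`, `embed`, `blocks`, and `calibration_batches` are all hypothetical stand-ins.

```python
import torch
from torch import nn

def quantize_block(block, inputs):
    """Placeholder for per-block GPTQ weight quantization."""
    pass

# Toy stand-ins; real transformer blocks would be used in practice.
embed = nn.Linear(8, 16)
blocks = nn.ModuleList([nn.Linear(16, 16) for _ in range(3)])
calibration_batches = [torch.randn(2, 8) for _ in range(4)]

# cache_block_outputs=True: quantize block i, then reuse its cached outputs as
# the inputs of block i+1 (fast, but assumes each block consumes exactly the
# previous block's outputs).
hidden = [embed(batch) for batch in calibration_batches]
for block in blocks:
    quantize_block(block, hidden)
    hidden = [block(h) for h in hidden]

# cache_block_outputs=False: before quantizing block i, re-run the model from
# the embeddings up to block i to obtain its inputs (slower, but robust to
# blocks whose first layer is a LayerNorm, as in ChatGLM).
for i, block in enumerate(blocks):
    inputs = []
    for batch in calibration_batches:
        h = embed(batch)
        for prev in blocks[:i]:
            h = prev(h)
        inputs.append(h)
    quantize_block(block, inputs)
```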
Related PR to Optimum: huggingface/optimum#1479
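With the new option, this behavior can be requested through `GPTQConfig` when quantizing such a model. A hedged usage sketch, assuming the option lands as shown in this PR; the model checkpoint, calibration dataset, and device settings are illustrative only:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Example of a model whose blocks start with a LayerNorm.
model_id = "THUDM/chatglm3-6b"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Disable block-output caching so calibration inputs are collected by running
# the model from the beginning, as described above.
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    cache_block_outputs=False,  # the option added in this PR
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True,
)
```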