[BT] Add fp16 support #859
Conversation
@younesbelkada I think the proper solution is to put back:

    mask_value = torch.finfo(attn_weights.dtype).min
    mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)

in the forward instead of it being stateful (currently it is always in fp32). WDYT? Looking at the reference code in https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html?highlight=scaled_dot_product_attention#torch.nn.functional.scaled_dot_product_attention, I think casting to bool is bad. In any case, the solution I propose should avoid any casting altogether.
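A minimal sketch of that suggestion (editor's illustration; the helper name `fill_value_for` is not from the PR): recompute the fill value from the runtime tensor on every call, so it always matches the current dtype and device.

```python
import torch

def fill_value_for(attn_weights: torch.Tensor) -> torch.Tensor:
    # Derive the mask fill value from the tensor we are about to mask,
    # instead of caching a value computed at module init time.
    mask_value = torch.finfo(attn_weights.dtype).min
    return torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)

# In fp16 this yields a scalar tensor of -65504.0 on the same device as the weights.
print(fill_value_for(torch.zeros(2, 2, dtype=torch.float16)))
```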
thanks @fxmarty for the heads up, will try now
    mask_value = torch.finfo(value.dtype).min
    attention_mask = torch.full([], mask_value, dtype=value.dtype).to(value.device)
This will probably break the logits tests
@fxmarty I think the issue is that sometimes the
@@ -74,12 +76,15 @@ def wrapped_scaled_dot_product(
                 torch.bool
             )

-            causal_mask = torch.where(causal_mask, 0, self._mask_value)
+            causal_mask = torch.where(causal_mask, 0, self._mask_value).to(value.dtype)
This cast is bad. Can we instead move the definition of mask_value in the forward?
    query = query.to(value.dtype)
    key = key.to(value.dtype)
Is this really needed? It should already be of the same dtype, no?
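For context (editor's note, not from the thread): `Tensor.to(dtype)` returns the input tensor unchanged when the dtype already matches, so these casts are free in that case and only take effect if an upstream op promoted `query`/`key` to a wider dtype than `value`.

```python
import torch

q = torch.randn(2, 4, dtype=torch.float16)
v = torch.randn(2, 4, dtype=torch.float16)

# When dtypes already match, .to() is a no-op and returns the very same tensor object.
assert q.to(v.dtype) is q
```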
The documentation is not available anymore as the PR was closed or merged.
    # gpt-2
    if config.model_type == "gpt2":
        target_dtype = self.gpt_layer.c_proj.weight.dtype
    # gpt-neo-x
    elif config.model_type == "gpt_neox":
        target_dtype = self.gpt_layer.dense.weight.dtype
    # gpt-j
    else:
        target_dtype = self.gpt_layer.out_proj.weight.dtype

    self.downcast_qk = config.model_type in ["gptj", "gpt_neox"]

    mask_value = torch.finfo(target_dtype).min
    self._mask_value = torch.full([], mask_value, dtype=target_dtype)
This will IMO not work because the user may call model = model.to(torch.float16) after initializing the model. In that case, self._mask_value would still be in fp32. I think we really need it in the forward.
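A short sketch of the failure mode being described (editor's illustration; the module and names are arbitrary): a mask value materialized at init keeps its original dtype even after the whole model is later cast to fp16.

```python
import torch

layer = torch.nn.Linear(4, 4)  # fp32 weights at construction time
# Stateful variant: the fill value is frozen to the init-time dtype (fp32 here).
mask_value = torch.full([], torch.finfo(layer.weight.dtype).min, dtype=layer.weight.dtype)

layer = layer.to(torch.float16)  # user casts the model afterwards
print(layer.weight.dtype)  # torch.float16
print(mask_value.dtype)    # torch.float32 -- the cached value did not follow the cast
```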
Nice catch!
    target_dtype = self.gpt_layer.k_proj.weight.dtype
    mask_value = torch.finfo(target_dtype).min
    self._mask_value = torch.full([], mask_value, dtype=target_dtype)
same
    target_dtype = self.gpt_layer.qkv_proj.weight.dtype
    mask_value = torch.finfo(target_dtype).min
    self._mask_value = torch.full([], mask_value, dtype=target_dtype)
same
@@ -74,12 +76,18 @@ def wrapped_scaled_dot_product(
                 torch.bool
             )

-            causal_mask = torch.where(causal_mask, 0, self._mask_value)
+            causal_mask = torch.where(causal_mask, 0, self._mask_value.to(value.dtype))
This is not equivalent:

    import torch

    mask_value = torch.finfo(torch.float32).min
    mask_value = torch.full([], mask_value, dtype=torch.float32)
    casted = mask_value.to(torch.float16)

    mask_value = torch.finfo(torch.float16).min
    mask_value = torch.full([], mask_value, dtype=torch.float16)

    # this assert fails: float32's min overflows to -inf when cast to float16,
    # whereas torch.finfo(torch.float16).min is -65504.0
    assert torch.equal(casted, mask_value)

Not sure if it has any influence or not, though. I would just put the definition of mask_value in the forward directly, as in transformers.
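For reference, the concrete values behind the non-equivalence (editor's addition):

```python
import torch

print(torch.finfo(torch.float32).min)  # -3.4028234663852886e+38
print(torch.full([], torch.finfo(torch.float32).min).to(torch.float16))  # tensor(-inf, dtype=torch.float16)
print(torch.finfo(torch.float16).min)  # -65504.0
```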
Thanks, adapted as suggested!
@fxmarty btw
LGTM, thank you for iterating on this!
Co-authored-by: fxmarty <9808326+fxmarty@users.noreply.github.com>
What does this PR do?
Currently, on the main branch, fp16 inference for BetterTransformer decoder models is not supported; this PR aims to fix this.

TODO
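As a usage sketch of what this enables (editor's illustration; the model name is arbitrary, `BetterTransformer.transform` is the existing optimum entry point, and a CUDA device is assumed):

```python
import torch
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer

# Load a decoder model directly in fp16 and convert it to its BetterTransformer version.
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.float16).to("cuda")
model = BetterTransformer.transform(model)
```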
cc @fxmarty