
junk results for int8 for Flan-xl/xxl #22568

Closed
2 of 4 tasks
i-am-neo opened this issue Apr 4, 2023 · 5 comments

i-am-neo commented Apr 4, 2023

System Info

  • transformers version: 4.27.4
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 2.0.0+cu118 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.8 (gpu)
  • Jax version: 0.4.7
  • JaxLib version: 0.4.7
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: no

Who can help?

@younesbelkada and maybe @philschmid?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Made a copy of the HuggingFace_bnb_int8_T5 notebook
  2. Set the runtime hardware accelerator to GPU (standard)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "t5-3b-sharded" # NB. T5-11B does not fit into a GPU in Colab

T5-3b and T5-11B are supported!

We need sharded weights otherwise we get CPU OOM errors

model_id=f"ybelkada/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="cuda", load_in_8bit=True)

model_8bit.get_memory_footprint()

max_new_tokens = 400

text = """
Summarize: Whether out at a restaurant or buying tickets to a concert, modern life counts on the convenience of a credit card to make daily purchases. It saves us from carrying large amounts of cash and also can advance a full purchase that can be paid over time. How do card issuers know we’ll pay back what we charge? That’s a complex problem with many existing solutions—and even more potential improvements, to be explored in this competition.

Credit default prediction is central to managing risk in a consumer lending business. Credit default prediction allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Current models exist to help manage risk. But it's possible to create better models that can outperform those currently in use.

American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.

In this competition, you’ll apply your machine learning skills to predict credit default. Specifically, you will leverage an industrial scale data set to build a machine learning model that challenges the current model in production. Training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. You're free to explore any technique to create the most powerful model, from creating features to using the data in a more organic way within a model.
"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

if torch.cuda.is_available():
    input_ids = input_ids.to('cuda')

outputs = model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Resulting output below (note the series of blanks at the beginning of the result, between the periods). I also tried other prompts and the results were poor/unexpected.

My goal was to check that the int8 model reliably produces results at least similar to the non-int8 model, so that I can potentially use the int8 model for inference. Please see the comparison in the next section against results from the Hosted Inference API and a Spaces API. What am I missing?

. . You can also use a combination of techniques to create a model that can outperform the current model in production. The goal is to create a model that can outperform the current model in production. The goal is to create a model that can outperform. The
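For reference, the kind of int8-vs-non-int8 check I have in mind looks roughly like this (a sketch only; google/flan-t5-xl in fp16 is just one possible non-int8 baseline, and both models together need to fit on the GPU):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Sketch: compare the 8-bit load against a non-quantized fp16 load on the same prompt.
# `text` is the prompt defined above; the model id is only an example baseline.
baseline_id = "google/flan-t5-xl"
tok = AutoTokenizer.from_pretrained(baseline_id)

model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(baseline_id, device_map="auto", torch_dtype=torch.float16)
model_int8 = AutoModelForSeq2SeqLM.from_pretrained(baseline_id, device_map="auto", load_in_8bit=True)

for name, model in [("fp16", model_fp16), ("int8", model_int8)]:
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=100)
    print(name, "->", tok.decode(out[0], skip_special_tokens=True))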

Expected behavior

something akin to:
a)

['Challenge your machine learning skills to predict credit default.']

or b)

Challenge your machine learning skills to predict credit default.

a) is the result from trying a space API

import requests

response = requests.post(
    "https://awacke1-google-flan-t5-xl.hf.space/run/predict",
    json={
        "data": [
            text,
        ],
        "max_length": 500,
    },
).json()

data = response["data"]
print(data)

b) is the result from your Hosted inference API
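For b), the request was roughly the following (sketch only; HF_TOKEN is a placeholder for my API token):

import os
import requests

# Sketch of the Hosted Inference API call used for b).
API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xl"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

resp = requests.post(API_URL, headers=headers, json={"inputs": text})
print(resp.json())  # e.g. [{"generated_text": "..."}]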

Hope you can shed light.

i-am-neo changed the title from "odd results for int8 for Flan-xl/xxl" to "junk results for int8 for Flan-xl/xxl" on Apr 18, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@younesbelkada
Contributor

younesbelkada commented May 21, 2023

Hi @i-am-neo
You should upgrade your transformers version and re-run your inference script, as the recent releases contain a fix for T5-family models for fp16 and int8 inference:

#20683
#20760
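Something like this should confirm the environment actually picked up a recent release before re-running generation (sketch; check the release notes for the exact version that includes the fixes):

# After upgrading, e.g. `pip install -U transformers accelerate bitsandbytes`,
# verify the interpreter is really using the new version.
import transformers
print(transformers.__version__)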

@i-am-neo
Author

i-am-neo commented May 22, 2023

Thanks @younesbelkada. Still junky. Using your notebook and t5-3b-sharded, compare the two prompts below (a quick loop to reproduce the comparison is sketched after them):

text = "Summarize: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face"   # outputs "s.:s. Summarize: Hello my name is Younes."
text = "summarize: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face"   # outputs "Younes is a Machine Learning Engineer at Hugging Face."

@i-am-neo
Author

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "t5-3b-sharded"
# T5-3b and T5-11B are supported!
# We need sharded weights otherwise we get CPU OOM errors
model_id=f"ybelkada/{model_name}"

#model_id='google/flan-t5-xl'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

Updated environment:

- `transformers` version: 4.29.2
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
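The commented-out google/flan-t5-xl line above is the other model I want to check; the same kind of run would look roughly like this (sketch, assuming the xl checkpoint fits in 8-bit on the Colab GPU):

# Sketch: same 8-bit load, but against google/flan-t5-xl, with the short prompt from my previous comment.
flan_id = "google/flan-t5-xl"
flan_tok = AutoTokenizer.from_pretrained(flan_id)
flan_8bit = AutoModelForSeq2SeqLM.from_pretrained(flan_id, device_map="auto", load_in_8bit=True)

prompt = "summarize: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face"
ids = flan_tok(prompt, return_tensors="pt").input_ids.to(flan_8bit.device)
out = flan_8bit.generate(ids, max_new_tokens=50)
print(flan_tok.decode(out[0], skip_special_tokens=True))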

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
