
junk results for int8 for Flan-xl/xxl #22568

Closed
2 of 4 tasks
i-am-neo opened this issue Apr 4, 2023 · 5 comments

i-am-neo commented Apr 4, 2023

System Info

  • transformers version: 4.27.4
  • Platform: Linux-5.10.147+-x86_64-with-glibc2.31
  • Python version: 3.9.16
  • Huggingface_hub version: 0.13.3
  • PyTorch version (GPU?): 2.0.0+cu118 (True)
  • Tensorflow version (GPU?): 2.12.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.6.8 (gpu)
  • Jax version: 0.4.7
  • JaxLib version: 0.4.7
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?: no

Who can help?

@younesbelkada and maybe @philschmid?

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. Made a copy of the HuggingFace_bnb_int8_T5 notebook
  2. Set the runtime hardware accelerator to GPU (standard)

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "t5-3b-sharded" # NB. T5-11B does not fit into a GPU in Colab

T5-3b and T5-11B are supported!

We need sharded weights otherwise we get CPU OOM errors

model_id=f"ybelkada/{model_name}"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="cuda", load_in_8bit=True)

model_8bit.get_memory_footprint()

max_new_tokens = 400

text = """
Summarize: Whether out at a restaurant or buying tickets to a concert, modern life counts on the convenience of a credit card to make daily purchases. It saves us from carrying large amounts of cash and also can advance a full purchase that can be paid over time. How do card issuers know we’ll pay back what we charge? That’s a complex problem with many existing solutions—and even more potential improvements, to be explored in this competition.

Credit default prediction is central to managing risk in a consumer lending business. Credit default prediction allows lenders to optimize lending decisions, which leads to a better customer experience and sound business economics. Current models exist to help manage risk. But it's possible to create better models that can outperform those currently in use.

American Express is a globally integrated payments company. The largest payment card issuer in the world, they provide customers with access to products, insights, and experiences that enrich lives and build business success.

In this competition, you’ll apply your machine learning skills to predict credit default. Specifically, you will leverage an industrial scale data set to build a machine learning model that challenges the current model in production. Training, validation, and testing datasets include time-series behavioral data and anonymized customer profile information. You're free to explore any technique to create the most powerful model, from creating features to using the data in a more organic way within a model.
"""

input_ids = tokenizer(text, return_tensors="pt").input_ids

if torch.cuda.is_available():
    input_ids = input_ids.to('cuda')

outputs = model_8bit.generate(input_ids, max_new_tokens=max_new_tokens)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Resulting output below (note the series of blanks at the beginning of the result, between the periods). I also tried other prompts and the results were poor/unexpected.

My goal was to check that the int8 model reliably produces results at least similar to the non-int8 model, so that I can potentially use the int8 model for inference. Please see the comparison in the next section against results from the Hosted Inference API and a Spaces API. What am I missing?

. . You can also use a combination of techniques to create a model that can outperform the current model in production. The goal is to create a model that can outperform the current model in production. The goal is to create a model that can outperform. The
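For reference, the kind of int8-vs-non-int8 check I have in mind looks roughly like this (a sketch only; google/flan-t5-xl in fp16 is just one possible non-int8 baseline, and both models together need to fit on the GPU):

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

# Sketch: compare the 8-bit load against a non-quantized fp16 load on the same prompt.
# `text` is the prompt defined above; the model id is only an example baseline.
baseline_id = "google/flan-t5-xl"
tok = AutoTokenizer.from_pretrained(baseline_id)

model_fp16 = AutoModelForSeq2SeqLM.from_pretrained(baseline_id, device_map="auto", torch_dtype=torch.float16)
model_int8 = AutoModelForSeq2SeqLM.from_pretrained(baseline_id, device_map="auto", load_in_8bit=True)

for name, model in [("fp16", model_fp16), ("int8", model_int8)]:
    ids = tok(text, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=100)
    print(name, "->", tok.decode(out[0], skip_special_tokens=True))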

Expected behavior

something akin to:
a)

['Challenge your machine learning skills to predict credit default.']

or b)

Challenge your machine learning skills to predict credit default.

a) is the result from trying a space API

import requests

response = requests.post(
    "https://awacke1-google-flan-t5-xl.hf.space/run/predict",
    json={
        "data": [
            text,
        ],
        "max_length": 500,
    },
).json()

data = response["data"]
print(data)

b) is the result from your Hosted inference API
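For b), the request was roughly the following (sketch only; HF_TOKEN is a placeholder for my API token):

import os
import requests

# Sketch of the Hosted Inference API call used for b).
API_URL = "https://api-inference.huggingface.co/models/google/flan-t5-xl"
headers = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

resp = requests.post(API_URL, headers=headers, json={"inputs": text})
print(resp.json())  # e.g. [{"generated_text": "..."}]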

Hope you can shed light.

i-am-neo changed the title from "odd results for int8 for Flan-xl/xxl" to "junk results for int8 for Flan-xl/xxl" on Apr 18, 2023
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@younesbelkada
Contributor

younesbelkada commented May 21, 2023

Hi @i-am-neo
You should upgrade your transformers version and re-run your inference script, as the recent releases contain a fix for T5-family models for fp16 and int8 inference:

#20683
#20760
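Something like this should confirm the environment actually picked up a recent release before re-running generation (sketch; check the release notes for the exact version that includes the fixes):

# After upgrading, e.g. `pip install -U transformers accelerate bitsandbytes`,
# verify the interpreter is really using the new version.
import transformers
print(transformers.__version__)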

@i-am-neo
Author

i-am-neo commented May 22, 2023

Thanks @younesbelkada. Still junky. Using your notebook and t5-3b-sharded, compare the two prompts below (a quick loop to reproduce the comparison is sketched after them):

text = "Summarize: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face"   # outputs "s.:s. Summarize: Hello my name is Younes."
text = "summarize: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face"   # outputs "Younes is a Machine Learning Engineer at Hugging Face."

@i-am-neo
Author

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch

model_name = "t5-3b-sharded"
# T5-3b and T5-11B are supported!
# We need sharded weights otherwise we get CPU OOM errors
model_id=f"ybelkada/{model_name}"

#model_id='google/flan-t5-xl'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model_8bit = AutoModelForSeq2SeqLM.from_pretrained(model_id, device_map="auto", load_in_8bit=True)

Updated environment:

- `transformers` version: 4.29.2
- Platform: Linux-5.15.107+-x86_64-with-glibc2.31
- Python version: 3.10.11
- Huggingface_hub version: 0.14.1
- Safetensors version: not installed
- PyTorch version (GPU?): 2.0.1+cu118 (True)
- Tensorflow version (GPU?): 2.12.0 (True)
- Flax version (CPU?/GPU?/TPU?): 0.6.9 (gpu)
- Jax version: 0.4.8
- JaxLib version: 0.4.7
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>
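The commented-out google/flan-t5-xl line above is the other model I want to check; the same kind of run would look roughly like this (sketch, assuming the xl checkpoint fits in 8-bit on the Colab GPU):

# Sketch: same 8-bit load, but against google/flan-t5-xl, with the short prompt from my previous comment.
flan_id = "google/flan-t5-xl"
flan_tok = AutoTokenizer.from_pretrained(flan_id)
flan_8bit = AutoModelForSeq2SeqLM.from_pretrained(flan_id, device_map="auto", load_in_8bit=True)

prompt = "summarize: Hello my name is Younes and I am a Machine Learning Engineer at Hugging Face"
ids = flan_tok(prompt, return_tensors="pt").input_ids.to(flan_8bit.device)
out = flan_8bit.generate(ids, max_new_tokens=50)
print(flan_tok.decode(out[0], skip_special_tokens=True))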

@github-actions

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
