
GPT-2 example is broken? #11034

Closed
1 task
ba305 opened this issue Apr 2, 2021 · 3 comments · Fixed by #11060

Comments

@ba305

ba305 commented Apr 2, 2021

Environment info

  • transformers version: the issue occurs with both 4.3.0 and 4.4.2 (and probably other versions as well)
  • Python version: 3.7.6
  • PyTorch version (GPU?): 1.7.0
  • Using GPU in script?: No, tested on CPU only, but the issue would likely also occur on GPU
  • Using distributed or parallel set-up in script?: No

Who can help

Information

Model I am using (Bert, XLNet ...): gpt2

The problem arises when using:

  • [x] the official example scripts: (give details below)
  • [ ] my own modified scripts: (give details below)

To reproduce

Hello, I am trying to run the example here: https://huggingface.co/transformers/task_summary.html#causal-language-modeling. When I run that code exactly as it appears on the page, I get very poor results. Even when I change the input text, the output is still strange (e.g., it predicts empty spaces or odd characters). A coworker tried it on her machine as well and saw the same behavior.

I am planning to fine-tune GPT-2 for a different purpose later, but was concerned that I couldn't get even this simple example demo to work. Thanks for your help!

Steps to reproduce the behavior:

  1. Run the exact example code linked above (a rough reconstruction of that snippet is included below for reference)
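
For context, the linked snippet generated only a single token and ended the prompt with a trailing space. A rough reconstruction (inferred from the modified version in the reply below, not copied verbatim from the docs) looks like this:

from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")

# note the trailing space at the end of the prompt
sequence = "Hugging Face is based in DUMBO, New York City, and "
input_ids = tokenizer.encode(sequence, return_tensors="pt")

# logits for the last position only
next_token_logits = model(input_ids).logits[:, -1, :]
filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)

# sample a single next token and decode the result
probs = F.softmax(filtered_next_token_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(torch.cat([input_ids, next_token], dim=-1).tolist()[0]))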
@LysandreJik
Member

Hi! Sorry to hear the example doesn't work well for you. To be honest, it doesn't make much sense to generate only a single token, as that example does. I have modified the example slightly so that it generates the following 20 tokens.

Also, I've removed the space at the end of the sequence because I believe it is there by mistake:

from transformers import AutoModelWithLMHead, AutoTokenizer, top_k_top_p_filtering
import torch
from torch.nn import functional as F
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")
sequence = f"Hugging Face is based in DUMBO, New York City, and"
input_ids = tokenizer.encode(sequence, return_tensors="pt")
generated = input_ids
for i in range(20):
    # get logits for the last position of the current sequence
    next_token_logits = model(generated).logits[:, -1, :]
    # filter
    filtered_next_token_logits = top_k_top_p_filtering(next_token_logits, top_k=50, top_p=1.0)
    # sample
    probs = F.softmax(filtered_next_token_logits, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)
    generated = torch.cat([generated, next_token], dim=-1)

resulting_string = tokenizer.decode(generated.tolist()[0])

print(resulting_string)

Running this gives me the following examples (not cherry-picked):

Hugging Face is based in DUMBO, New York City, and is produced by Eltas & Co., Inc. (a wholly owned subsidiary of Eltas
Hugging Face is based in DUMBO, New York City, and focuses primarily on the music and entertainment industry, and is funded by the Hudson River Chamber of Commerce.
Hugging Face is based in DUMBO, New York City, and has aired in dozens of local, national and foreign programs, including The Brady Bunch, The Colbert
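
For reference, a roughly equivalent result can be obtained with the built-in model.generate helper instead of the manual loop; the sketch below uses generation arguments I believe are available in recent releases, so double-check against your installed version:

from transformers import AutoModelWithLMHead, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelWithLMHead.from_pretrained("gpt2")

sequence = "Hugging Face is based in DUMBO, New York City, and"
input_ids = tokenizer.encode(sequence, return_tensors="pt")

# sample 20 new tokens with top-k filtering, mirroring the manual loop above
output = model.generate(
    input_ids,
    do_sample=True,
    top_k=50,
    top_p=1.0,
    max_length=input_ids.shape[-1] + 20,
)
print(tokenizer.decode(output[0]))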

@ba305
Author

ba305 commented Apr 2, 2021

Thanks a lot for your help Lysandre!

Removing the space at the end of the example sequence solves the issue. Now I am getting normal results. It would be great if you could update the website since I imagine other people will run into the same issue at some point!

Also, thanks for adding the code to generate 20 tokens. That is helpful as well, although I believe the main problem was the space at the end of the input sequence.
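
For anyone who hits the same thing: the trailing space matters because GPT-2's byte-level BPE attaches a leading space to the following word, so a prompt that ends in a bare space leaves a standalone space token at the end of the context, which the model rarely sees during training. A quick way to inspect the difference (assuming the standard gpt2 tokenizer; exact tokens may vary across versions):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# without the trailing space the prompt ends on the token " and";
# with it, a lone space token is appended at the end of the context
print(tokenizer.tokenize("New York City, and"))
print(tokenizer.tokenize("New York City, and "))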

Thanks again for your prompt reply. Feel free to close the issue whenever you want

@LysandreJik
Member

Great, nice to hear this fixes the issue! I've updated the docs on the master branch.
