
startoftext, pad, and endoftext tokens #310

Closed

thistleknot opened this issue Jan 15, 2023 · 3 comments


thistleknot commented Jan 15, 2023

Problem:
I used happy_gen.tokenizer.add_special_tokens to add a BOS, EOS, and PAD.

After adding the 2-3 special tokens and then calling happy_gen.train('QADataset.txt'), I would see an index-out-of-range error (I assume the happytransformer wrapper isn't expecting any tokenizer changes, i.e. for custom [qa] prompts). If I train without making any changes to the tokenizer, it runs, but I only see around 13-15 training iterations (1 epoch) for a 1500-line file covering 100 contexts, so I'm assuming the model should pick up on the 100 contexts as the individual prompts. One way I thought I could deal with this is to create separate training files per context, but some contexts are too small to train on individually.

The way I was doing QA before assumed one context per line. I was [initially] using lazydatascientist's method, retooled for QA (from sentiment classification).

from datasets import load_dataset
import pandas as pd

# prep_txt = f"""<|startoftext|>{text}<|pad|>{label}<|endoftext|>"""

dataset = load_dataset("squad_v2")

df = pd.DataFrame(dataset['train'])

cq = '\nContext: ' + df['context'] + '\n' + '\nQuestion: ' + df['question'] + '\n' + '\nAnswer:'

# squad_v2 leaves the 'text' list empty for unanswerable questions
answers = ['' if a['text'] == [] else a['text'][0] for a in df['answers']]

print(cq[0], answers[0])

# per row, the prep_txt template above would be filled with:
text = cq[0]
label = answers[0]

**Proposed method for Happy Transformer:**
* Integrate a QA [prompt-engineered](https://github.com/RossSong/GPT2-Question-Answering/blob/master/QA.ipynb) use case, as done with [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple/tree/master/gpt_2_simple).

QADataset.txt

Code to generate the QADataset.txt

Sample reference data.

<|startoftext|>
Context: ... (i.e. a \n)
Question: ... (\n)
Answer: ... (...)
Question: ...
Answer: ...
Question: ...
Answer: ...
<|endoftext|>
<|startoftext|>
Context: ... (i.e. a \n)
Question: ... (\n)
Answer: ... (...)
Question: ...
Answer: ...
Question: ...
Answer: ...
<|endoftext|>
  • i.e. repeat <|startoftext|> to <|endoftext|> for each context's question/answer set (see the sketch below)
  • Note: no trailing "\nAnswer: < PAD > | {answer} <|endoftext|>"
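
For reference, a rough sketch of how a file in that format could be generated from squad_v2 (this is not the attached generation script; the grouping logic and variable names are my own illustration):

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("squad_v2")
df = pd.DataFrame(dataset["train"])

# squad_v2 marks unanswerable questions with an empty 'text' list
df["answer"] = ["" if a["text"] == [] else a["text"][0] for a in df["answers"]]

# one <|startoftext|>...<|endoftext|> block per context, with all of its Q/A pairs
blocks = []
for context, group in df.groupby("context", sort=False):
    lines = ["<|startoftext|>", f"Context: {context}"]
    for _, row in group.iterrows():
        lines.append(f"Question: {row['question']}")
        lines.append(f"Answer: {row['answer']}")
    lines.append("<|endoftext|>")
    blocks.append("\n".join(lines))

with open("QADataset.txt", "w") as f:
    f.write("\n".join(blocks))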

Token reference post here: https://colab.research.google.com/drive/1sgJWIoldQreBl8JyIyOG5LE42J1xT5Ho?usp=sharing

with open('train.txt', 'w') as f:
    f.write('SAMPLE CASE 1: Saudade Gostosa Cafuné<|endoftext|>\nSAMPLE CASE 2: test test  <|endoftext|>')

If I inspect happy_gen.tokenizer, I see a reference to a BOS token but nothing about a <|startoftext|> token; the EOS token explicitly matches <|endoftext|>, and no PAD token is specified.

Question about <|pad|> tokens: what are they used for? I see them used between, say, Context: ... (see lazydatascientist's post) \nQuestion: ...\nAnswer: <|pad|><|endoftext|>

    • Is this a signal to the tokenizer that this is the inference part? If so, how would that affect a happygen training in this context?
thistleknot (Author) commented:

I will try your suggestions:

"happy_gen.model.resize_token_embeddings(len(happy_gen.tokenizer))" before training and that might solve the error.

Also you may be able to just us<|endoftext|> in-between your cases. <|startoftext|> is not a default token.
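
In code, the first suggestion would look roughly like this (a sketch only; the specific GPT-Neo checkpoint is an assumption, and happy_gen.tokenizer / happy_gen.model are the same attributes used above):

from happytransformer import HappyGeneration

# checkpoint name is an assumption; the thread doesn't say which GPT-Neo model was used
happy_gen = HappyGeneration("GPT-NEO", "EleutherAI/gpt-neo-125M")

# add the custom tokens (GPT-Neo already uses <|endoftext|> as its EOS token)
happy_gen.tokenizer.add_special_tokens({
    "bos_token": "<|startoftext|>",
    "pad_token": "<|pad|>",
})

# resize the embedding matrix so the new token ids have embedding rows,
# which should avoid the index-out-of-range error during training
happy_gen.model.resize_token_embeddings(len(happy_gen.tokenizer))

happy_gen.train("QADataset.txt")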


thistleknot commented Jan 15, 2023

I've created a notebook. I didn't add those new tokens; I took your advice and used just <|endoftext|>, and it looks like it's training:
[ 3/11524 00:54 < 175:46:40, 0.02 it/s, Epoch 0.00/1]

However, I get a warning about some of the token lengths:
"Token indices sequence length is longer than the specified maximum sequence length for this model (25608 > 2048). Running this sequence through the model will result in indexing errors"

Notebook:
Edit: Fixed the tokens > 2048 issue by tokenizing the data and counting input_ids before adding each prompt to the text file (or rather to an array of strings that is then written to a text file). Training is going now. It says 4000 eligible candidates (i.e. I filtered the prompts down to sizes of 256-512 tokens); a rough sketch of that filter is below.
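
Roughly what that filter looks like (a sketch; it assumes the blocks list and happy_gen object from the earlier snippets, and the actual notebook code may differ):

# keep only prompts whose tokenized length falls in the 256-512 range
eligible = []
for prompt in blocks:
    n_tokens = len(happy_gen.tokenizer(prompt)["input_ids"])
    if 256 <= n_tokens <= 512:
        eligible.append(prompt)

print(len(eligible), "eligible candidates")

with open("QADataset.txt", "w") as f:
    f.write("\n".join(eligible))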

I've read I can control for this by doing something like

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your model")
# truncate to MAX_TOKENS while encoding
encoded = tokenizer(
    text, max_length=MAX_TOKENS, truncation=True
)

But I don't know how to apply that here, because the tokenization is handled by the app, and I'm not sure how I could do it effectively around the <|endoftext|> positions; I figure the text will just be cut off open-ended at that point.

Ideas on setting truncation:
https://huggingface.co/docs/transformers/pad_truncation
['only_first', 'only_second', 'longest_first'], i.e. truncation='only_second' or truncation='longest_first' to control how both sequences in the pair are truncated as detailed before.
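
In plain transformers that would look something like the snippet below for a (context, question) pair; whether Happy Transformer forwards these arguments to the tokenizer is unclear, and the checkpoint name and texts are just placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")  # checkpoint assumed

context = "Context: " + "some long passage " * 1000   # long first sequence
question = "Question: what is this about?"            # short second sequence

encoded = tokenizer(
    context,
    question,
    max_length=2048,
    truncation="only_first",   # trim only the context; 'longest_first' trims whichever is longer
)
print(len(encoded["input_ids"]))  # <= 2048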

I tried, but no dice.
happy_gen.tokenizer.set_truncation_and_padding(max_length=1792, stride=1536, truncation_strategy='longest_first', pad_to_multiple_of=1, padding_strategy='right')

thistleknot (Author) commented:

I think you can resolve this now. Using your suggestions fixed things. PAD is a token used to pad text (i.e. to pad on the left or right depending on the setting). I take it PAD isn't explicitly mentioned because the wrapper on the back end takes care of it, and/or padding is a non-issue for GPT-Neo? Either way, I have no padding token specified in the updated prompt outlined in the notebook. I manually filtered out contexts over 512 tokens (to speed up processing) and under 256.
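
For what it's worth, a common pattern with GPT-style models (an assumption on my part, not something established in this thread) is to reuse the EOS token for padding instead of adding a dedicated <|pad|> token, which avoids resizing the embeddings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")  # checkpoint assumed
tokenizer.pad_token = tokenizer.eos_token   # reuse <|endoftext|> instead of adding <|pad|>
tokenizer.padding_side = "right"            # or "left"; padding side is configurable

batch = tokenizer(
    ["short prompt", "a somewhat longer prompt"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)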
