
startoftext, pad, and endoftext tokens #310

Closed

thistleknot opened this issue Jan 15, 2023 · 3 comments


thistleknot commented Jan 15, 2023

Problem:
I used happy_gen.tokenizer.add_special_tokens to add a BOS, EOS, and PAD.

After adding the 2-3 special tokens and then calling happy_gen.train('QADataset.txt'), I would see an index-out-of-range error (I assume the happytransformer wrapper isn't expecting any tokenizer changes, i.e. for custom [qa] prompts). If I train without making any changes to the tokenizer, it runs, but I only see around 13-15 training iterations (1 epoch) for a 1500-line file covering 100 contexts, so I'm assuming the model should pick up on the 100 contexts as the individual prompts. One way I thought I could deal with this is to create separate training files per context, but some contexts are too small to train on individually.

The way I was doing QA before assumed one context per line. I was [initially] using lazydatascientist's method, retooled for QA (from sentiment classification).

from datasets import load_dataset
import pandas as pd

# prep_txt = f"""<|startoftext|>{text}<|pad|>{label}<|endoftext|>"""

dataset = load_dataset("squad_v2")

df = pd.DataFrame(dataset['train'])

cq = '\nContext: ' + df['context'] + '\n' + '\nQuestion: ' + df['question'] + '\n' + '\nAnswer:'

# squad_v2 leaves the 'text' list empty for unanswerable questions
answers = ['' if a['text'] == [] else a['text'][0] for a in df['answers']]

print(cq[0], answers[0])

# per row, the prep_txt template above would be filled with:
text = cq[0]
label = answers[0]

**Proposed method for Happy Transformer:**
* Integrate a QA [prompt-engineered](https://github.com/RossSong/GPT2-Question-Answering/blob/master/QA.ipynb) use case, as done with [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple/tree/master/gpt_2_simple).

QADataset.txt

Code to generate the QADataset.txt

Sample reference data.

<|startoftext|>
Context: ... (i.e. a \n)
Question: ... (\n)
Answer: ... (...)
Question: ...
Answer: ...
Question: ...
Answer: ...
<|endoftext|>
<|startoftext|>
Context: ... (i.e. a \n)
Question: ... (\n)
Answer: ... (...)
Question: ...
Answer: ...
Question: ...
Answer: ...
<|endoftext|>
  • i.e. repeat <|startoftext|> to <|endoftext|> for each context's question/answer set (see the sketch below)
  • Note: no trailing "\nAnswer: < PAD > | {answer} <|endoftext|>"
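
For reference, a rough sketch of how a file in that format could be generated from squad_v2 (this is not the attached generation script; the grouping logic and variable names are my own illustration):

from datasets import load_dataset
import pandas as pd

dataset = load_dataset("squad_v2")
df = pd.DataFrame(dataset["train"])

# squad_v2 marks unanswerable questions with an empty 'text' list
df["answer"] = ["" if a["text"] == [] else a["text"][0] for a in df["answers"]]

# one <|startoftext|>...<|endoftext|> block per context, with all of its Q/A pairs
blocks = []
for context, group in df.groupby("context", sort=False):
    lines = ["<|startoftext|>", f"Context: {context}"]
    for _, row in group.iterrows():
        lines.append(f"Question: {row['question']}")
        lines.append(f"Answer: {row['answer']}")
    lines.append("<|endoftext|>")
    blocks.append("\n".join(lines))

with open("QADataset.txt", "w") as f:
    f.write("\n".join(blocks))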

Token reference post here: https://colab.research.google.com/drive/1sgJWIoldQreBl8JyIyOG5LE42J1xT5Ho?usp=sharing

with open('train.txt', 'w') as f:
    f.write('SAMPLE CASE 1: Saudade Gostosa Cafuné<|endoftext|>\nSAMPLE CASE 2: test test  <|endoftext|>')

If I inspect happy_gen.tokenizer, I see a reference to a BOS token but nothing about a <|startoftext|> token; the EOS token explicitly matches <|endoftext|>, and no PAD token is specified.

Question about <|pad|> tokens: what are they used for? I see them used between, say, Context: ... (see lazydatascientist's post) \nQuestion: ...\nAnswer: <|pad|><|endoftext|>

    • Is this a signal to the tokenizer that this is the inference part? If so, how would that affect a happygen training in this context?
thistleknot (Author) commented:

I will try your suggestions:

"happy_gen.model.resize_token_embeddings(len(happy_gen.tokenizer))" before training and that might solve the error.

Also you may be able to just us<|endoftext|> in-between your cases. <|startoftext|> is not a default token.
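
In code, the first suggestion would look roughly like this (a sketch only; the specific GPT-Neo checkpoint is an assumption, and happy_gen.tokenizer / happy_gen.model are the same attributes used above):

from happytransformer import HappyGeneration

# checkpoint name is an assumption; the thread doesn't say which GPT-Neo model was used
happy_gen = HappyGeneration("GPT-NEO", "EleutherAI/gpt-neo-125M")

# add the custom tokens (GPT-Neo already uses <|endoftext|> as its EOS token)
happy_gen.tokenizer.add_special_tokens({
    "bos_token": "<|startoftext|>",
    "pad_token": "<|pad|>",
})

# resize the embedding matrix so the new token ids have embedding rows,
# which should avoid the index-out-of-range error during training
happy_gen.model.resize_token_embeddings(len(happy_gen.tokenizer))

happy_gen.train("QADataset.txt")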


thistleknot commented Jan 15, 2023

I've created a notebook. I didn't add those new tokens; I took your advice and used just <|endoftext|>, and it looks like it's training:
[ 3/11524 00:54 < 175:46:40, 0.02 it/s, Epoch 0.00/1]

However, I get a warning about some of the token lengths:
"Token indices sequence length is longer than the specified maximum sequence length for this model (25608 > 2048). Running this sequence through the model will result in indexing errors"

Notebook:
Edit: Fixed the tokens > 2048 issue by tokenizing the data and counting input_ids before adding each prompt to the text file (or rather to an array of strings that is then written to a text file). Training is going now. It says 4000 eligible candidates (i.e. I filtered the prompts down to sizes of 256-512 tokens); a rough sketch of that filter is below.
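
Roughly what that filter looks like (a sketch; it assumes the blocks list and happy_gen object from the earlier snippets, and the actual notebook code may differ):

# keep only prompts whose tokenized length falls in the 256-512 range
eligible = []
for prompt in blocks:
    n_tokens = len(happy_gen.tokenizer(prompt)["input_ids"])
    if 256 <= n_tokens <= 512:
        eligible.append(prompt)

print(len(eligible), "eligible candidates")

with open("QADataset.txt", "w") as f:
    f.write("\n".join(eligible))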

I've read I can control for this by doing something like

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your model")
# truncate to MAX_TOKENS while encoding
encoded = tokenizer(
    text, max_length=MAX_TOKENS, truncation=True
)

But I don't know how to apply that here, because the tokenization is handled by the app, and I'm not sure how I could do it effectively around the <|endoftext|> positions; I figure the text will just be cut off open-ended at that point.

Ideas on setting truncation:
https://huggingface.co/docs/transformers/pad_truncation
['only_first', 'only_second', 'longest_first'], i.e. truncation='only_second' or truncation='longest_first' to control how both sequences in the pair are truncated as detailed before.
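
In plain transformers that would look something like the snippet below for a (context, question) pair; whether Happy Transformer forwards these arguments to the tokenizer is unclear, and the checkpoint name and texts are just placeholders:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")  # checkpoint assumed

context = "Context: " + "some long passage " * 1000   # long first sequence
question = "Question: what is this about?"            # short second sequence

encoded = tokenizer(
    context,
    question,
    max_length=2048,
    truncation="only_first",   # trim only the context; 'longest_first' trims whichever is longer
)
print(len(encoded["input_ids"]))  # <= 2048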

I tried, but no dice.
happy_gen.tokenizer.set_truncation_and_padding(max_length=1792, stride=1536, truncation_strategy='longest_first', pad_to_multiple_of=1, padding_strategy='right')

thistleknot (Author) commented:

I think you can resolve this now. Using your suggestions fixed things. PAD is a token used to pad text (i.e. to pad on the left or right depending on the setting). I take it PAD isn't explicitly mentioned because the wrapper on the back end takes care of it, and/or padding is a non-issue for GPT-Neo? Either way, I have no padding token specified in the updated prompt outlined in the notebook. I manually filtered out contexts over 512 tokens (to speed up processing) and under 256.
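
For what it's worth, a common pattern with GPT-style models (an assumption on my part, not something established in this thread) is to reuse the EOS token for padding instead of adding a dedicated <|pad|> token, which avoids resizing the embeddings:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")  # checkpoint assumed
tokenizer.pad_token = tokenizer.eos_token   # reuse <|endoftext|> instead of adding <|pad|>
tokenizer.padding_side = "right"            # or "left"; padding side is configurable

batch = tokenizer(
    ["short prompt", "a somewhat longer prompt"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)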
