startoftext, pad, and endoftext tokens #310
Comments
I will try your suggestions: "happy_gen.model.resize_token_embeddings(len(happy_gen.tokenizer))" before training, and that might solve the error. Also, you may be able to just use <|endoftext|> in between your cases. <|startoftext|> is not a default token.
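For reference, a minimal sketch of that suggestion, assuming the happytransformer HappyGeneration class with an illustrative GPT-Neo checkpoint; the happy_gen attributes and the resize_token_embeddings call come from this thread, while the added <|pad|> token is just an example:

```python
from happytransformer import HappyGeneration

# Illustrative model choice; any GPT-Neo checkpoint should behave the same way.
happy_gen = HappyGeneration("GPT-NEO", "EleutherAI/gpt-neo-125M")

# If you add custom special tokens, the embedding matrix must be resized to the new
# vocabulary size before training, otherwise indexing errors can occur during train().
happy_gen.tokenizer.add_special_tokens({"pad_token": "<|pad|>"})  # example token only
happy_gen.model.resize_token_embeddings(len(happy_gen.tokenizer))

happy_gen.train("QADataset.txt")
```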
I've created a notebook. I didn't add those new tokens; I took your advice and used just <|endoftext|>, and it looks like it's training. However, I get a warning about some of the token lengths. Notebook:
I've read I can control for this by doing something like , but I don't know how to do that, because the tokenization is handled by the app, and I'm not sure how I could do it effectively with these <|endoftext|> positions; I figure it will just be open ended at that point. Ideas on setting truncation? I tried, but no dice.
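For reference, this is roughly what setting truncation looks like when calling a Hugging Face tokenizer directly; since tokenization here is handled inside the wrapper's train(), treat it as a sketch of the general mechanism rather than a fix for the app:

```python
# Standard Hugging Face truncation: tokens beyond max_length are dropped.
# The max_length value is illustrative; this call is on the tokenizer itself,
# not something happy_gen.train() necessarily exposes.
encoded = happy_gen.tokenizer(
    "Context: ...\nQuestion: ...\nAnswer: ...<|endoftext|>",
    truncation=True,
    max_length=512,
)
print(len(encoded["input_ids"]))
```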
I think you can resolve this now; using your suggestions fixed things. PAD is a token used to pad text (i.e., to left- or right-pad, depending on the setting). I take it PAD isn't explicitly mentioned because the wrapper on the back end takes care of it, and/or padding is a non-issue for GPT-Neo? Either way, I have no padding token specified in the updated prompt outlined in the notebook. I manually filtered out contexts over 512 tokens (to speed up processing) and under 256.
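A rough sketch of that manual filtering step, assuming one training case per line in the raw file; the 256-512 bounds mirror the range mentioned above, and the file names are hypothetical:

```python
# Keep only training cases whose token count falls inside a chosen window.
def filter_cases(in_path, out_path, tokenizer, min_tokens=256, max_tokens=512):
    with open(in_path, encoding="utf-8") as f_in, open(out_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            n_tokens = len(tokenizer.encode(line.rstrip("\n")))
            if min_tokens <= n_tokens <= max_tokens:
                f_out.write(line)

filter_cases("QADataset_raw.txt", "QADataset.txt", happy_gen.tokenizer)  # hypothetical file names
```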
Problem:
I used happy_gen.tokenizer.add_special_tokens to add BOS, EOS, and PAD tokens.
After adding the 2-3 special tokens and then running happy_gen.train('QADataset.txt'), I would see an error about the index being off (I assume the happytransformer wrapper isn't expecting any tokenizer changes, i.e., for custom [QA] prompts). If I train without making any changes to the tokenizer, it will run, but I will see somewhere around 13-15 training iterations (1 epoch) for a 1500-line file covering 100 contexts. So I'm assuming the model should pick up on the 100 contexts as the individual prompts. One way I thought I could deal with this is to create separate training files per context, but some contexts are too small to train on individually.
The way I was doing QA before assumed one context per line. I was [initially] using lazydatascientist's method, retooled for QA (from sentiment classification).
QADataset.txt
Code to generate the QADataset.txt
Sample reference data.
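The generation script itself isn't reproduced here, but this is a sketch of how such a file could be built from (context, question, answer) triples, with <|endoftext|> separating cases as suggested above; the field names and the exact prompt layout are assumptions:

```python
# Write one Context/Question/Answer case per record, terminated by <|endoftext|>,
# so the model can learn where each QA case ends. Field names are illustrative.
def build_dataset(records, out_path="QADataset.txt"):
    with open(out_path, "w", encoding="utf-8") as f_out:
        for rec in records:
            f_out.write(
                f"Context: {rec['context']}\n"
                f"Question: {rec['question']}\n"
                f"Answer: {rec['answer']}<|endoftext|>\n"
            )

records = [
    {"context": "Example context text.", "question": "Example question?", "answer": "Example answer."},
]
build_dataset(records)
```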
Token reference post here: https://colab.research.google.com/drive/1sgJWIoldQreBl8JyIyOG5LE42J1xT5Ho?usp=sharing
If I type happy_gen.tokenizer, I see a reference to a BOS token but nothing about a <|startoftext|> token, unlike EOS, which explicitly matches <|endoftext|>; no PAD token is specified either.
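A quick way to check exactly which special tokens the wrapped tokenizer defines, using standard Hugging Face tokenizer attributes; for a stock GPT-Neo tokenizer both BOS and EOS typically map to <|endoftext|> and there is no PAD token, which matches the observation above:

```python
# Show the special-token mapping of the tokenizer that happytransformer wraps.
print(happy_gen.tokenizer.special_tokens_map)
print("BOS:", happy_gen.tokenizer.bos_token)
print("EOS:", happy_gen.tokenizer.eos_token)
print("PAD:", happy_gen.tokenizer.pad_token)  # None unless you add one
```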
*Question about <|PAD|> tokens: what are they used for? I see them used in prompts like, say, Context: ... (see lazydatascientist's post) \nQuestion: ...\nAnswer: <|PAD|><|endoftext|>