Check the user provided eos / bos token id against the tokenizer eos / bos token id #1039
Should we also pull out |
@samhavens @codestar12 do you think |
Llama and T5 models both expect EOS and BOS; OPT, I think, is BOS only?
OK, I think the safest thing to do would be to raise an error if the EOS or BOS token ids provided in the yaml differ from what the tokenizer has. I will also add a flag that can override this error (in case someone wants to train with EOS/BOS tokens different from the ones in the tokenizer).
Adding info about the override flags in the error message.
Should we also warn/error if the tokenizer does have eos_token_id or bos_token_id but the user does not set it?
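A minimal sketch of the warning asked about here, assuming the tokenizer exposes `eos_token_id` / `bos_token_id` attributes. The helper name `warn_if_token_id_unset` and the `SimpleNamespace` stand-in tokenizer are hypothetical and only illustrate the idea; this is not the code merged in the PR.

```python
import warnings
from types import SimpleNamespace
from typing import Any, Optional


def warn_if_token_id_unset(name: str, provided_id: Optional[int],
                           tokenizer: Any) -> bool:
    """Warn when the tokenizer defines `name` but the user config omits it.

    Hypothetical helper: `name` is 'eos_token_id' or 'bos_token_id'.
    Returns True if a warning was issued.
    """
    tokenizer_id = getattr(tokenizer, name, None)
    if provided_id is None and tokenizer_id is not None:
        warnings.warn(
            f'The tokenizer has {name}={tokenizer_id}, but no {name} was '
            f'provided in the dataset config.')
        return True
    return False


# Stand-in for a real tokenizer object (assumption for illustration only).
fake_tokenizer = SimpleNamespace(eos_token_id=2, bos_token_id=None)
```

With this stand-in, an unset `eos_token_id` triggers the warning, while an unset `bos_token_id` does not, because the tokenizer has no BOS id to compare against.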
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
…/ bos token id (#1039)

* lint
* lint
* added warning and error message instead of setting the eos and bos token ids
* Update text_data.py

  Adding info about the override flags in the error message.
* Update llmfoundry/data/text_data.py

  Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
* Update llmfoundry/data/text_data.py

  Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
* adding warning if user does not provide eos or bos token id
* adding warning if user does not provide eos or bos token id

---------

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Currently, we need to specify the eos or bos token id in the dataset config for sequence id masking to work. This PR adds code to check whether the provided eos and bos token ids match the token ids in the tokenizer. If there is a mismatch, an error is raised. The error can be suppressed through the flags override_eos_token_id_mismatch_error and override_bos_token_id_mismatch_error.
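The check described above can be sketched roughly as follows. This is an illustration under stated assumptions rather than the code in llmfoundry/data/text_data.py: the helper `check_special_token_id`, its parameters, and the `SimpleNamespace` stand-in tokenizer are hypothetical, and downgrading the error to a warning when the override is set is also an assumption. Only the flag names come from the PR description.

```python
import warnings
from types import SimpleNamespace
from typing import Any, Optional


def check_special_token_id(name: str,
                           provided_id: Optional[int],
                           tokenizer: Any,
                           override_mismatch_error: bool = False) -> None:
    """Compare a user-provided token id against the tokenizer's id.

    Hypothetical helper: `name` is 'eos_token_id' or 'bos_token_id'.
    Raises ValueError on a mismatch unless the override is set, in which
    case the mismatch is reported as a warning instead (assumed behavior).
    """
    tokenizer_id = getattr(tokenizer, name, None)
    if provided_id is None or provided_id == tokenizer_id:
        return  # nothing provided, or the ids agree

    message = (f'Provided {name}={provided_id} does not match the '
               f'{name} of the tokenizer ({tokenizer_id}).')
    if override_mismatch_error:
        warnings.warn(message)
    else:
        raise ValueError(
            f'{message} To override this error, set '
            f'override_{name}_mismatch_error to True.')


# Stand-in for a real tokenizer object (assumption for illustration only).
demo_tokenizer = SimpleNamespace(eos_token_id=2, bos_token_id=1)

# Matching ids pass silently.
check_special_token_id('eos_token_id', 2, demo_tokenizer)
check_special_token_id('bos_token_id', 1, demo_tokenizer)
```

Calling `check_special_token_id('eos_token_id', 0, demo_tokenizer)` raises a `ValueError` whose message names the override flag; passing `override_mismatch_error=True` turns the same mismatch into a warning.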
Verification runs: