Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check the user provided eos / bos token id against the tokenizer eos / bos token id #1039

Conversation

ShashankMosaicML
Copy link
Contributor

@ShashankMosaicML ShashankMosaicML commented Mar 15, 2024

Currently, we need to specify the eos or bos token id in the dataset config for sequence id masking to work. This PR adds code to check if the provided eos and bos token ids match the token ids in the tokenizer. If there is a mismatch, an error will be raised. The error can be suppressed through the flags override_eos_token_id_mismatch_error and override_bos_token_id_mismatch_error

Verification runs:

125m-incorrect-eos-token-id-override-B1AOYT
125m-incorrect-eos-token-id-5DXPCI 
125m-correct-eos-token-id-Ge3qzT 
125m-incorrect-bos-token-id-override-mxBMzA
125m-incorrect-bos-token-id-EeBgKH
125m-correct-bos-token-id-I97jYG

@vchiley
Copy link
Contributor

vchiley commented Mar 15, 2024

Should we also pull out 'bos_token_id'?

@ShashankMosaicML
Copy link
Contributor Author

ShashankMosaicML commented Mar 17, 2024

Should we also pull out 'bos_token_id'?

ConcatenatedSequenceCollatorWrapper throws an error if we set both the eos_token_id and bos_token_id. That is why I only set eos_token_id here. I can add flags that dictate which of these to automatically pick. But then, we will have to explicitly set one of those flags.

@vchiley
Copy link
Contributor

vchiley commented Mar 18, 2024

@samhavens @codestar12 do you think eos_token_id or bos_token_id is more important and/or widely used?
(probably eos but wanted to check)

@samhavens
Copy link
Contributor

@samhavens @codestar12 do you think eos_token_id or bos_token_id is more important and/or widely used? (probably eos but wanted to check)

Llama and t5 models both expect both eos and bos, OPT I think is BOS only?

@ShashankMosaicML
Copy link
Contributor Author

@samhavens @codestar12 do you think eos_token_id or bos_token_id is more important and/or widely used? (probably eos but wanted to check)

Llama and t5 models both expect both eos and bos, OPT I think is BOS only?

OK, i think the safest thing to do would be to raise an error if the EOS or the BOS tokens that are provided in the yaml are different from what the tokenizer has. I will also add a flag that can override this error (in case someone wants to train with EOS/BOS tokens different from the ones in the tokenizer).

@ShashankMosaicML ShashankMosaicML changed the title [Draft] Get EOS token id from tokenizer Check eos / bos token id against the tokenizer eos / bos token id Mar 31, 2024
Adding info about the override flags in the error message.
@ShashankMosaicML ShashankMosaicML marked this pull request as ready for review March 31, 2024 21:39
@ShashankMosaicML ShashankMosaicML changed the title Check eos / bos token id against the tokenizer eos / bos token id Check the user provided eos / bos token id against the tokenizer eos / bos token id Apr 1, 2024
Copy link
Contributor

@vchiley vchiley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also warn/error if tokenizer does have eos_token_id or bos_token_id but user does not set it?

llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
llmfoundry/data/text_data.py Outdated Show resolved Hide resolved
ShashankMosaicML and others added 4 commits April 1, 2024 09:35
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
@ShashankMosaicML ShashankMosaicML merged commit d8ea2c5 into mosaicml:main Apr 1, 2024
9 checks passed
@ShashankMosaicML ShashankMosaicML deleted the get_eos_token_id_from_tokenizer branch April 1, 2024 18:47
KuuCi pushed a commit that referenced this pull request Apr 18, 2024
…/ bos token id (#1039)

* lint

* lint

* added warning and error message instead of setting the eos and bos token ids

* Update text_data.py

Adding info about the override flags in the error message.

* Update llmfoundry/data/text_data.py

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>

* Update llmfoundry/data/text_data.py

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>

* adding warning if user does not provide eos or bos token id

* adding warning if user does not provide eos or bos token id

---------

Co-authored-by: Vitaliy Chiley <6439018+vchiley@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants