@mosheisland
Contributor

Stages 1 and 2 append the '<|endoftext|>' text marker to every sample. However, some tokenizers (e.g. OPT, Bloom) encode this marker as a sequence of subword tokens rather than as a single special token.

This commit adds optional support for registering the EOT marker as a special token, which forces the tokenizer to encode it as a single token.

Note that using an EOT special token may change the dynamics of stage 3 training. Therefore, to stay backward compatible, this commit makes the behavior optional.
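
For readers unfamiliar with the mechanism, here is a minimal sketch of an opt-in EOT special token. It is not the diff from this PR. The flag name `--add_eot_token` and the helper `maybe_add_eot_token` are hypothetical names chosen for illustration. The tokenizer and model are assumed to expose the Hugging Face Transformers methods `add_special_tokens` and `resize_token_embeddings`.

```python
import argparse

EOT_MARKER = "<|endoftext|>"


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flag name. It is off by default, so existing runs
    # keep their previous tokenization.
    parser.add_argument("--add_eot_token", action="store_true",
                        help="Register the EOT marker as a special token")
    return parser.parse_args(argv)


def maybe_add_eot_token(tokenizer, model=None, enabled=False,
                        eot_marker=EOT_MARKER):
    """Register eot_marker as a special token when enabled.

    Returns the number of tokens actually added to the vocabulary
    (0 when disabled or when the marker is already a known token).
    """
    if not enabled:
        return 0
    # Assumes a Hugging Face style tokenizer: add_special_tokens takes a
    # dict and returns how many new tokens were added.
    num_added = tokenizer.add_special_tokens(
        {"additional_special_tokens": [eot_marker]})
    if model is not None and num_added > 0:
        # New vocabulary entries need matching rows in the embedding matrix.
        model.resize_token_embeddings(len(tokenizer))
    return num_added
```

With a real Hugging Face tokenizer, one way to check the effect is to compare the length of `tokenizer(EOT_MARKER, add_special_tokens=False)["input_ids"]` before and after the call. For a tokenizer that previously split the marker into subwords, the list should shrink to a single id.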

Change-Id: If98d348fcaa7d6685e755aabe305e23e7649c367

Signed-off-by: Moshe Island <misland@habana.ai>
Contributor

@lekurile lekurile left a comment

LGTM

@tjruwase tjruwase merged commit e8d879e into deepspeedai:master Oct 17, 2023
@mosheisland mosheisland deleted the 11_add_eot_special_token branch November 22, 2023 07:52
hwchen2017 pushed a commit that referenced this pull request Jun 8, 2025
Signed-off-by: Moshe Island <misland@habana.ai>
Co-authored-by: Moshe Island <misland@habana.ai>
Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>