Prevent pad ids, special tokens displaying in generate #1211
Conversation
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1211.
✅ No failures as of commit cc6ddbd with merge base 5825500. (Generated automatically by Dr. CI.)
Codecov Report
Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##             main    #1211       +/-   ##
===========================================
- Coverage   68.66%   26.73%   -41.94%
===========================================
  Files         215      219        +4
  Lines        9734     9860      +126
===========================================
- Hits         6684     2636     -4048
- Misses       3050     7224     +4174

☔ View full report in Codecov by Sentry.
@@ -153,5 +156,11 @@ def decode(
                k = None
            if k:
                token_ids = token_ids[:k]
        token_ids = [token_id for token_id in token_ids if token_id != self.bos_id]
        if not show_special:
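The hunk cuts off at the new branch; purely as a self-contained sketch of what the added filtering does (the function name and the `special_token_ids` argument below are illustrative, not the torchtune implementation):

```python
from typing import List, Set


def filter_for_decode(
    token_ids: List[int],
    bos_id: int,
    special_token_ids: Set[int],
    show_special: bool = False,
) -> List[int]:
    """Drop the BOS id and, unless show_special is set, all special-token ids."""
    token_ids = [t for t in token_ids if t != bos_id]
    if not show_special:
        token_ids = [t for t in token_ids if t not in special_token_ids]
    return token_ids


# Example with made-up ids: special tokens are stripped by default.
print(filter_for_decode([1000, 5, 7, 1001], bos_id=1000, special_token_ids={1000, 1001}))
# -> [5, 7]
```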
This is slightly far out, but could we move the logic from here to the generation recipe, and move the flag for `show_special` there too? Unless we anticipate there are other places we could use this logic, but even if so, would it make sense to throw it in a separate utility function instead, with its own tests? This way the tokenizers remain agnostic to how they are being used, and all calls to `tokenizer.decode` will behave identically.
I see what you mean, but I do like keeping this with the tokenizer so I can quickly debug even in the middle of the recipe by calling `tokenizer.decode(sample, show_special=True)`. I find myself doing this a lot when debugging in general, and it's especially useful to see the exact text that's going into the model. If it were a separate utility it wouldn't be straightforward to do this in a pdb debugger.
Okay, yeah, IGYWM.
@@ -134,6 +134,7 @@ def decode(
        self,
        token_ids: List[int],
        truncate_at_eos: bool = True,
        show_special: bool = False,
nit: `show_special_tokens` pls
Or `skip_special_tokens` like HF? See https://github.com/huggingface/transformers/blob/47c29ccfaf56947d845971a439cbe75a764b63d7/src/transformers/tokenization_utils_base.py#L4007. It would be True by default for this case.
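For reference, the Hugging Face convention being pointed to looks like this in `transformers` (sketch only; it assumes `transformers` is installed and uses the non-gated `bert-base-uncased` tokenizer purely as a convenient example):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # any tokenizer with special tokens works
ids = tok("hello world").input_ids                        # includes [CLS] ... [SEP]

print(tok.decode(ids))                            # '[CLS] hello world [SEP]'
print(tok.decode(ids, skip_special_tokens=True))  # 'hello world'
```

In `transformers` the flag defaults to False; the suggestion here is to keep the HF-style name but default it to True for this generation use case.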
One nit; lgtm
Context
What is the purpose of this PR?
Pad ID is implicitly assumed to be 0 in `utils.generate`, and any instance of 0 in the generated tokens is replaced by the tokenizer's pad id. This is very apparent for Llama3, which has a reserved special token for the pad id.

Here, I simply remove the logic that replaces 0 with the pad id. I additionally filter out any special tokens when decoding with the TikToken base tokenizer, with an option to let them be displayed in the decoded string. With these changes, the same generated output for Llama3 is now cleaned up.
Notably, the 0 ID for Llama3 is the token for "!"
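To illustrate the failure mode, a rough sketch of the kind of substitution being removed, with made-up ids (this is not the exact `utils.generate` code):

```python
import torch

# Made-up ids for illustration: 0 stands in for a legitimate token ("!" for Llama3),
# and PAD_ID stands in for tokenizer.pad_id.
PAD_ID = 12345
generated = torch.tensor([7, 0, 9, 0, 11])

# The old behavior: anything equal to 0 gets rewritten to the pad id before decoding,
# so real "!" tokens come back as pad/special tokens in the decoded string.
old_style = generated.clone()
old_style[old_style == 0] = PAD_ID
print(old_style)  # tensor([    7, 12345,     9, 12345,    11])

# This PR drops that rewrite, so `generated` is decoded as-is.
```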
Test plan
Ran `tune run generate --config generation` for Llama3, which uses tiktoken. Also tried other models that use SentencePiece, like Phi3, and confirmed the change does not impact their behavior.
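A minimal sanity check in the spirit of this test plan, using stand-in ids rather than a real tokenizer (the actual verification is the `tune run generate` output described above):

```python
BOS, EOS, PAD = 101, 102, 103  # stand-in special ids, not real vocabulary values
SPECIAL = {BOS, EOS, PAD}


def strip_special(ids, show_special=False):
    """Mimics the new decode-time filtering: keep 0, drop special ids unless asked."""
    return list(ids) if show_special else [i for i in ids if i not in SPECIAL]


assert strip_special([BOS, 5, 0, 7, EOS]) == [5, 0, 7]  # 0 survives: it is a normal token
assert strip_special([BOS, 5, 0, 7, EOS], show_special=True) == [BOS, 5, 0, 7, EOS]
print("ok")
```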