generalized chat sft prompt #7655
Signed-off-by: Yi Dong <yidong@nvidia.com>
LGTM, minor code style issues
LGTM
        "end_of_name": "\n",
    }
else:
    self.special_tokens = special_tokens
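For readers without the full diff, the fragment above reads like the tail of a fallback: a default dictionary is used when no `special_tokens` are supplied, otherwise the caller's dictionary is taken as-is. Here is a minimal sketch of that pattern. Only `"end_of_name": "\n"` is visible in the diff; the function name `resolve_special_tokens` and the other default values are assumptions made for illustration, not the PR's actual code.

```python
from typing import Dict, Optional

# Placeholder defaults. Only "end_of_name": "\n" is visible in the diff;
# the other two values are assumptions made for this sketch.
DEFAULT_SPECIAL_TOKENS: Dict[str, str] = {
    "turn_start": "<extra_id_1>",
    "end_of_turn": "\n",
    "end_of_name": "\n",
}


def resolve_special_tokens(special_tokens: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the caller's tokens if given, otherwise a copy of the defaults."""
    if special_tokens is None:
        # Copy so that callers cannot mutate the shared defaults.
        return dict(DEFAULT_SPECIAL_TOKENS)
    return special_tokens
```

Returning a copy of the defaults keeps one dataset instance from accidentally changing the defaults seen by another.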
Can we check whether the tokens in special_tokens are actually special tokens of the tokenizer? If they are not (as is the case with Llama), can we just throw a warning that the turn tokens will be treated as plain text, which might cause incorrect token merging?
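The check suggested here could look roughly like the sketch below. It is not taken from the PR: the function name `find_non_special_tokens` is hypothetical, and the tokenizer's special tokens are passed in as a plain iterable of strings because the thread does not show how the tokenizer exposes them.

```python
import warnings
from typing import Dict, Iterable, List


def find_non_special_tokens(
    special_tokens: Dict[str, str], tokenizer_special_tokens: Iterable[str]
) -> List[str]:
    """Return the config keys whose token text is not a tokenizer special token.

    Emits one warning when any such key is found, since plain-text turn
    tokens may be merged with neighbouring text during tokenization.
    """
    known = set(tokenizer_special_tokens)
    missing = sorted(name for name, text in special_tokens.items() if text not in known)
    if missing:
        warnings.warn(
            "These chat prompt tokens are not tokenizer special tokens and will be "
            f"tokenized as plain text, which may cause token merging: {missing}"
        )
    return missing
```

Returning the list of offending keys, rather than only warning, lets a caller decide whether to continue or stop.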
I have an assert in the code:
assert torch.equal(torch.tensor(target[:header_len]), torch.tensor(header_tokens))
It will throw an exception if a token merge happens.
That is different: the token merge can still happen during multi-turn conversations.
What I mean is that if the turn tokens are not special tokens, we should just tell the user that an error is possible.
The header_len stops at the "end_of_turn", and the next token is "turn_start". If a merge happens there, this assert will catch it. Multi-turn conversations behave the same way: each turn ends with "end_of_turn" and the next token is "turn_start", so this one check is enough to catch it.
Also, I don't see the point of just giving a warning, which doesn't help the user at all.
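A toy example may help make the prefix check discussed in this thread concrete. The sketch below uses a made-up greedy longest-match tokenizer, not the tokenizer used in the PR, and plain list comparison instead of `torch.equal`. The names `greedy_tokenize` and `prefix_is_stable` and both vocabularies are hypothetical. It shows that tokenizing the header alone and tokenizing the header followed by more text give different prefixes exactly when a token merges across the boundary.

```python
from typing import List, Set


def greedy_tokenize(text: str, vocab: Set[str]) -> List[str]:
    """Toy tokenizer: longest match against vocab, falling back to single characters."""
    max_len = max((len(piece) for piece in vocab), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def prefix_is_stable(header: str, rest: str, vocab: Set[str]) -> bool:
    """True if the header's tokens are an exact prefix of the full text's tokens."""
    header_tokens = greedy_tokenize(header, vocab)
    target = greedy_tokenize(header + rest, vocab)
    return target[: len(header_tokens)] == header_tokens


# The turn marker is atomic in the vocabulary: no merge across the boundary.
stable = prefix_is_stable("hi\n", "<u>", {"<u>"})
# The vocabulary merges "\n" with "<": the header's last token changes.
merged = prefix_is_stable("hi\n", "<u>", {"\n<"})
```

In this toy setup `stable` is true and `merged` is false, which is the situation an assert on `target[:header_len]` would flag. Whether one such check covers every turn boundary in a multi-turn conversation is the point the reviewers are debating, and the sketch does not settle it.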
# for key in turn['human_labels']:
#     value_set = label_values.get(key, set())
#     value_set.add(turn['human_labels'][key]['value'])
#     label_values[key] = value_set
Check notice (Code scanning / CodeQL): Commented-out code (Note)
LGTM, thanks!
* fix dataset issues
* working version
* all passed
* refactor tests
* all pass
* working version
* use end name signal for labels
* all fixed
* update doc
* style fix
* remove unused imports
* make sure nccl not timing out
* style fix
* generate example template
* generic end of name token
* style fix
* add the chat prompt format into the config
* make sure sft working
* address reviewer comment
* fix non
* try openAI prompt
* remove unused imports
* remove human labels from the data
* use hf dataset to clean
* reviewer comments
---------
Signed-off-by: Yi Dong <yidong@nvidia.com>
Signed-off-by: Sasha Meister <sasha.meister.work@gmail.com>
Signed-off-by: Elena Rastorgueva <erastorgueva@nvidia.com>
What does this PR do?
This PR generalizes the chat SFT dataset so that it can use customized turn start/end tokens, set through the chat_prompt_tokens config.
After this change, the LM is no longer required to have "extra_id" special tokens in order to use the chat SFT dataset. The PR also expands the unit tests to cover more LM tokenizers.
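As an illustration of what customized turn tokens mean in practice, the sketch below renders a short conversation from a token dictionary. The key names `turn_start`, `end_of_turn`, and `end_of_name` come from this thread. The token values, the helper `render_conversation`, and the exact layout of a turn (turn start, speaker name, end of name, text, end of turn) are assumptions for this example and may differ from what the dataset class actually produces.

```python
from typing import Dict, List, Tuple

# Hypothetical values standing in for a chat_prompt_tokens config.
chat_prompt_tokens: Dict[str, str] = {
    "turn_start": "<turn>",
    "end_of_turn": "\n",
    "end_of_name": "\n",
}


def render_conversation(turns: List[Tuple[str, str]], tokens: Dict[str, str]) -> str:
    """Join (speaker, text) pairs into one string using the configured markers."""
    parts = []
    for speaker, text in turns:
        parts.append(
            f"{tokens['turn_start']}{speaker}{tokens['end_of_name']}"
            f"{text}{tokens['end_of_turn']}"
        )
    return "".join(parts)


rendered = render_conversation([("User", "Hi"), ("Assistant", "Hello")], chat_prompt_tokens)
# rendered == "<turn>User\nHi\n<turn>Assistant\nHello\n"
```

Because the markers are plain strings looked up by key, swapping in a different tokenizer only requires a different dictionary, not a tokenizer with "extra_id" tokens.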
Another added feature overwrites the prompt_template config with the chat prompt format.
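The description does not show what the overwritten template looks like, so the following is only a sketch of the idea under stated assumptions: the config is modeled as a plain dictionary, the `{input}` and `{output}` placeholders and the speaker names are invented for the example, and both helper functions are hypothetical rather than part of NeMo.

```python
from typing import Any, Dict


def build_chat_prompt_template(tokens: Dict[str, str]) -> str:
    """Build a single-exchange template string from the chat prompt tokens."""
    return (
        f"{tokens['turn_start']}User{tokens['end_of_name']}"
        f"{{input}}{tokens['end_of_turn']}"
        f"{tokens['turn_start']}Assistant{tokens['end_of_name']}"
        f"{{output}}"
    )


def override_prompt_template(config: Dict[str, Any], tokens: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of config whose prompt_template follows the chat format."""
    updated = dict(config)
    updated["prompt_template"] = build_chat_prompt_template(tokens)
    return updated


example_tokens = {"turn_start": "<turn>", "end_of_turn": "\n", "end_of_name": "\n"}
original_config = {"prompt_template": "{input} {output}"}
new_config = override_prompt_template(original_config, example_tokens)
```

The sketch returns a new dictionary instead of mutating the input, so the original template stays available for comparison.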