Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix gpt trainer test #6915

Merged
merged 156 commits into from
Aug 10, 2023
Merged

Fix gpt trainer test #6915

merged 156 commits into from
Aug 10, 2023

Conversation

hsiehjackson
Copy link
Collaborator

@hsiehjackson hsiehjackson commented Jun 24, 2023

What does this PR do ?

  • Fix gpt trainer inference_step to compute generation results during validation and test
  • Add dataset with metadata, to output in prediction file
  • Add token_f1 metrics for Scrolls

Collection: [NLP]

Changelog

  • Add specific line by line info of high level changes in this PR.

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
@github-actions github-actions bot added the NLP label Jun 24, 2023
@hsiehjackson hsiehjackson force-pushed the fix-gpt-trainer-test branch 2 times, most recently from 70e3747 to fadf272 Compare June 26, 2023 16:47
hsiehjackson and others added 3 commits June 27, 2023 06:06
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
@hsiehjackson hsiehjackson force-pushed the fix-gpt-trainer-test branch from fadf272 to 74c6fd8 Compare June 26, 2023 22:06
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
@github-actions github-actions bot added the CI label Jun 27, 2023
@arendu arendu self-requested a review June 27, 2023 23:36
Jenkinsfile Outdated Show resolved Hide resolved
hsiehjackson and others added 4 commits June 27, 2023 20:13
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
hsiehjackson and others added 6 commits August 4, 2023 15:53
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@hsiehjackson hsiehjackson requested a review from arendu August 5, 2023 00:55
hsiehjackson and others added 4 commits August 7, 2023 10:37
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
ekmb
ekmb previously approved these changes Aug 8, 2023
arendu
arendu previously approved these changes Aug 9, 2023
yzhang123
yzhang123 previously approved these changes Aug 10, 2023
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@hsiehjackson hsiehjackson dismissed stale reviews from yzhang123, arendu, and ekmb via cf2f2b9 August 10, 2023 02:48
Copy link
Collaborator

@yidong72 yidong72 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, please don't change the default configuration behavior as the old code depends on it.

drop_last: True
context_key: 'input'
label_key: 'output'
drop_last: False
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't change the default behavior

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we drop last if we evaluate with the validation dataset?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have one use case that I have to drop the last because I am preparing a dataset that computes contrastive loss. I need to make sure all the batch has the same batch size.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set drop_last=False, we can use pad_samples_to_global_batch_size=True right? Or this doesn't fulfill your settings? https://github.com/NVIDIA/NeMo/blob/fix-gpt-trainer-test/nemo/collections/nlp/models/language_modeling/megatron_gpt_sft_model.py#L788
If we set default drop_last=True, then we may drop some samples for evaluation which may show incorrect results for comparison.

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
@hsiehjackson hsiehjackson merged commit e0f8b9b into main Aug 10, 2023
@hsiehjackson hsiehjackson deleted the fix-gpt-trainer-test branch August 10, 2023 19:44
dorotat-nv pushed a commit to dorotat-nv/NeMo that referenced this pull request Aug 24, 2023
* Add trainer.test() for GPT

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Remove unused part

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Add trainer.test() for GPT

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Remove unused part

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix training part

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix references and add CI

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config error

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix dataset

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add metadata

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix empty batch

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix max seq length

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix dataset

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix dataset

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add token f1

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA in sft

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add inference config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix pad

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix num batch

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Add query_key

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove pdb

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix write json

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix dataset bug and refactor

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add logging for prediction

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix retrain

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Add query_key in config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add inference config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mask

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mask

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Split PR

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Undo commit

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Add query_key to doc_string

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Adjust yzhang123 comments

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error and follow comments

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove query key

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Remove logic and query

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove query from model

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Remove query_key

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Fix error

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pdb

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Add default tokens_to_generate

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Revert prompt truncate re-prompt

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip generation with metric loss

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* support GPTSFTChatDataset

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add comment

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

---------

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Signed-off-by: dorotat <dorotat@nvidia.com>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024
* Add trainer.test() for GPT

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Remove unused part

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Add trainer.test() for GPT

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Remove unused part

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix training part

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix references and add CI

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config error

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix dataset

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add metadata

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix empty batch

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix max seq length

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix dataset

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix dataset

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add token f1

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add FA in sft

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add inference config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix pad

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix num batch

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Add query_key

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove pdb

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix write json

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix dataset bug and refactor

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add logging for prediction

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix retrain

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Add query_key in config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add inference config

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mask

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix mask

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>

* Split PR

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Undo commit

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Add query_key to doc_string

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Adjust yzhang123 comments

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix error and follow comments

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove query key

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Remove logic and query

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove query from model

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Remove query_key

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Fix error

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix pdb

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Add default tokens_to_generate

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* Revert prompt truncate re-prompt

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* skip generation with metric loss

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix bug

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* support GPTSFTChatDataset

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add comment

Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>

---------

Signed-off-by: hsiehjackson <c2hsieh@ucsd.edu>
Signed-off-by: Cheng-Ping Hsieh <37269846+hsiehjackson@users.noreply.github.com>
Signed-off-by: Cheng-Ping Hsieh <chsieh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants