add support for new mcore ds features #9388

dimapihtar · 2024-06-05T14:15:14Z

What does this PR do?

Adds new mcore dataset features to NeMo.

Collection: [Note which collection this PR will affect]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Related to # (issue)


ShriyaPalsamudram · 2024-06-06T15:41:51Z

@dimapihtar - can you add drop_last_partial_validation_sequence = True and add_extra_token_to_sequence = True to a github actions test?

dimapihtar · 2024-06-07T16:12:25Z

> @dimapihtar - can you add drop_last_partial_validation_sequence = True and add_extra_token_to_sequence = True to a github actions test?

@ShriyaPalsamudram these values are True by default, so I think all the tests already run this way.
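
To make the two flags in this exchange concrete, here is a toy sketch of the behavior their names suggest. It is not Megatron-Core's implementation: `split_into_samples` and its parameters are hypothetical, and the semantics (one extra token per sample so that inputs and labels can be shifted by one position, and dropping a trailing sample that is shorter than the rest) are assumptions based on the flag names.

```python
from typing import List


def split_into_samples(
    tokens: List[int],
    seq_len: int,
    add_extra_token: bool = True,
    drop_last_partial: bool = True,
) -> List[List[int]]:
    """Chunk a flat token stream into samples (toy illustration only)."""
    # With the extra token, each sample holds seq_len + 1 tokens, so that
    # inputs = sample[:-1] and labels = sample[1:] both have length seq_len.
    sample_len = seq_len + 1 if add_extra_token else seq_len
    # A sample needs at least one (input, label) pair to be usable.
    min_len = 2 if add_extra_token else 1

    samples: List[List[int]] = []
    # Samples start every seq_len tokens; with the extra token, neighbouring
    # samples therefore share exactly one token.
    for start in range(0, len(tokens), seq_len):
        sample = tokens[start : start + sample_len]
        if len(sample) < sample_len:
            # Only the final sample can be short; drop it or keep it.
            if drop_last_partial or len(sample) < min_len:
                break
        samples.append(sample)
    return samples


stream = list(range(10))
dropped = split_into_samples(stream, seq_len=4)
kept = split_into_samples(stream, seq_len=4, drop_last_partial=False)
```

With `drop_last_partial=False` the final short sample is kept, which is the case where downstream code has to pad it and mask the padding out of the loss.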


jkamalu · 2024-06-10T16:59:30Z

One necessary consistency test is to compare validation results between the legacy and mcore code paths, with all options set so that evaluation covers the entire validation dataset.

You'll have to make sure that the loss and ppl computation is able to handle partial batches and partial sequences. NeMo already does this at least for the legacy code path, but you'll want to make sure there aren't any breaking changes.
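
The concern about partial batches and partial sequences can be sketched as follows. This is a minimal illustration, not NeMo's loss code: it assumes per-token losses arrive as nested lists together with a 0/1 loss mask in which padded positions are 0, and `masked_mean_loss` and `perplexity` are hypothetical helpers. The loss is averaged over unmasked tokens across all batches, so a short final batch is weighted by its token count rather than counted as a full batch.

```python
import math
from typing import Sequence, Tuple

# One batch = (per-token losses, 0/1 loss mask), both shaped [rows][positions].
Batch = Tuple[Sequence[Sequence[float]], Sequence[Sequence[int]]]


def masked_mean_loss(batches: Sequence[Batch]) -> float:
    """Token-weighted mean loss over all batches, ignoring masked positions."""
    total_loss = 0.0
    total_tokens = 0
    for losses, masks in batches:
        for loss_row, mask_row in zip(losses, masks):
            for loss, mask in zip(loss_row, mask_row):
                total_loss += loss * mask  # padded positions contribute 0
                total_tokens += mask
    if total_tokens == 0:
        raise ValueError("no unmasked tokens to average over")
    return total_loss / total_tokens


def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean token loss."""
    return math.exp(mean_loss)


# A full batch of two sequences, then a partial batch holding one sequence
# whose second position is padding.
full_batch = ([[1.0, 1.0], [1.0, 1.0]], [[1, 1], [1, 1]])
partial_batch = ([[3.0, 0.0]], [[1, 0]])

loss = masked_mean_loss([full_batch, partial_batch])
ppl = perplexity(loss)
```

Here the token-weighted loss is (4 * 1.0 + 3.0) / 5 = 1.4, whereas averaging the two per-batch means would give 2.0. A consistency check between the two code paths could compare their results with `math.isclose` under a tolerance chosen for the run; the tolerance is an assumption left to the tester.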


* add validation_drop_last and add_extra_token params support for mcore ds
* pad samples with dummy tokens only
* Apply isort and black reformatting
* use no_seqlen_plus_one_input_tokens as mcore's add_extra_token
* revert config
* revert config
* set train_valid_test_num_samples[1] to None
* add test case when validation_drop_last is False
* revert config
* set validation_drop_last as True by default
* Update nemo/collections/nlp/data/language_modeling/megatron/data_samplers.py
* Update nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py

---------

Signed-off-by: dimapihtar <dpihtar@gmail.com>
Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>
Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com>
Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com>
Signed-off-by: Jan Lasek <janek.lasek@gmail.com>


add validation_drop_last and add_extra_token params support for mcore ds

1324518

Signed-off-by: dimapihtar <dpihtar@gmail.com>

github-actions bot added the NLP label Jun 5, 2024

dimapihtar changed the title ~~add validation_drop_last and add_extra_token params support for mcore ds~~ add support for mcore ds new features Jun 5, 2024

dimapihtar and others added 2 commits June 6, 2024 05:10

pad samples with dummy tokens only

47f6576

Signed-off-by: dimapihtar <dpihtar@gmail.com>

Apply isort and black reformatting

f042feb

Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com>

github-advanced-security bot found potential problems Jun 6, 2024

nemo/collections/nlp/data/language_modeling/megatron/data_samplers.py (dismissed)

dimapihtar marked this pull request as ready for review June 6, 2024 13:27

dimapihtar added 3 commits June 6, 2024 06:32

use no_seqlen_plus_one_input_tokens as mcore's add_extra_token

0feefaf

Signed-off-by: dimapihtar <dpihtar@gmail.com>

revert config

f78f0b2

Signed-off-by: dimapihtar <dpihtar@gmail.com>

revert config

cd6f910

Signed-off-by: dimapihtar <dpihtar@gmail.com>

dimapihtar requested a review from ShriyaPalsamudram June 6, 2024 13:39

Merge branch 'main' into dpykhtar/mcore_ds_features

6288382

dimapihtar changed the title ~~add support for mcore ds new features~~ add support for new mcore ds features Jun 6, 2024

dimapihtar added the Run CICD label Jun 6, 2024

ShriyaPalsamudram requested a review from jbaczek June 6, 2024 15:36

Merge branch 'main' into dpykhtar/mcore_ds_features

5f9534a

dimapihtar and others added 3 commits June 10, 2024 14:36

Merge branch 'main' into dpykhtar/mcore_ds_features

b9df540

set train_valid_test_num_samples[1] to None

5a355be

Signed-off-by: dimapihtar <dpihtar@gmail.com>

add test case when validation_drop_last is False

0dee6e3

Signed-off-by: dimapihtar <dpihtar@gmail.com>

github-actions bot added the CI label Jun 10, 2024

dimapihtar and others added 3 commits June 10, 2024 09:48

revert config

eaa7d17

Signed-off-by: dimapihtar <dpihtar@gmail.com>

set validation_drop_last as True by default

10cc59a

Signed-off-by: dimapihtar <dpihtar@gmail.com>

Merge branch 'main' into dpykhtar/mcore_ds_features

d8371d5

dimapihtar added Run CICD and removed Run CICD labels Jun 10, 2024

jkamalu reviewed Jun 10, 2024

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py (outdated, resolved)

jbaczek reviewed Jun 11, 2024

nemo/collections/nlp/data/language_modeling/megatron/data_samplers.py (outdated, resolved)

jbaczek reviewed Jun 11, 2024

nemo/collections/nlp/models/language_modeling/megatron_gpt_model.py (outdated, resolved)

dimapihtar and others added 3 commits June 11, 2024 15:37

Update nemo/collections/nlp/data/language_modeling/megatron/data_samp…

7896242

…lers.py Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com> Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

Update nemo/collections/nlp/models/language_modeling/megatron_gpt_mod…

54bfda3

…el.py Co-authored-by: jbaczek <45043825+jbaczek@users.noreply.github.com> Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

Merge branch 'main' into dpykhtar/mcore_ds_features

d4d0385

Signed-off-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>

dimapihtar added Run CICD and removed Run CICD labels Jun 11, 2024

ShriyaPalsamudram approved these changes Jun 11, 2024


dimapihtar merged commit 91ab412 into main Jun 11, 2024
112 checks passed

dimapihtar deleted the dpykhtar/mcore_ds_features branch June 11, 2024 15:27

ko3n1g mentioned this pull request Jul 18, 2024

Release 2.0.0rc1 #9786

Closed

2 tasks
