Add "offline" data cache generation support #9576
Signed-off-by: dimapihtar <dpihtar@gmail.com>
What is the reasoning behind overloading num_workers as num_dataset_builder_threads instead of passing num_dataset_builder_threads as a separate config parameter? I could imagine setting num_dataset_builder_threads to a very large number (64 or 128) without wanting that many workers during typical data loading. A separate parameter would let many workloads specify a high num_dataset_builder_threads without needing the explicit data_cache_generation_only offline step. I would suggest adding num_dataset_builder_threads as an explicit new parameter in megatron_gpt_config.yaml, defaulting to 1 to mirror https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/megatron/core/datasets/blended_megatron_dataset_config.py#L48
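To make the suggestion concrete, here is a minimal sketch of keeping the two settings independent. It is not the NeMo or Megatron-LM implementation: `DataConfig`, `build_datasets`, and `build_one` are hypothetical names, and only `num_workers`, `num_dataset_builder_threads`, and the default of 1 come from the comment above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class DataConfig:
    """Hypothetical stand-in for the data section of a training config."""

    num_workers: int = 2  # data-loading workers used during training
    num_dataset_builder_threads: int = 1  # threads used only while building datasets

    def __post_init__(self) -> None:
        if self.num_dataset_builder_threads < 1:
            raise ValueError("num_dataset_builder_threads must be >= 1")


def build_datasets(prefixes: list[str], cfg: DataConfig) -> list[str]:
    """Build one placeholder dataset per prefix using the builder thread pool."""

    def build_one(prefix: str) -> str:
        # Placeholder for the real index-building work (an assumption).
        return f"built:{prefix}"

    with ThreadPoolExecutor(max_workers=cfg.num_dataset_builder_threads) as pool:
        # map() returns results in input order, whichever thread finishes first.
        return list(pool.map(build_one, prefixes))


# Many builder threads, few data-loading workers: the two knobs no longer interact.
cfg = DataConfig(num_workers=2, num_dataset_builder_threads=64)
datasets = build_datasets(["shard_a", "shard_b", "shard_c"], cfg)
```

With separate fields, raising the builder thread count for a one-off dataset build does not change how many workers the training data loader starts.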
Thank you for the review, fixed.
LGTM, thanks @dimapihtar!
* add "offline" data cache generation support Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert config Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comment for data_cache_generation_only usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * add num_dataset_builder_threads param Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix comment Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com>
* add "offline" data cache generation support Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert config Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comment for data_cache_generation_only usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * add num_dataset_builder_threads param Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix comment Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
* add "offline" data cache generation support Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert config Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comment for data_cache_generation_only usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * add num_dataset_builder_threads param Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix comment Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Boxiang Wang <boxiangw@nvidia.com>
* copy of #9576 Signed-off-by: dimapihtar <dpihtar@gmail.com> * Apply isort and black reformatting Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: dimapihtar <dimapihtar@users.noreply.github.com> Co-authored-by: dimapihtar <dimapihtar@users.noreply.github.com>
* add "offline" data cache generation support Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert config Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comment for data_cache_generation_only usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * add num_dataset_builder_threads param Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix comment Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Vivian Chen <xuanzic@example.com>
* add "offline" data cache generation support Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert config Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comment for data_cache_generation_only usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * add num_dataset_builder_threads param Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix comment Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: kchike <kohei.chike@jp.ricoh.com>
* add "offline" data cache generation support Signed-off-by: dimapihtar <dpihtar@gmail.com> * revert config Signed-off-by: dimapihtar <dpihtar@gmail.com> * add comment for data_cache_generation_only usage Signed-off-by: dimapihtar <dpihtar@gmail.com> * add num_dataset_builder_threads param Signed-off-by: dimapihtar <dpihtar@gmail.com> * fix comment Signed-off-by: dimapihtar <dpihtar@gmail.com> --------- Signed-off-by: dimapihtar <dpihtar@gmail.com> Signed-off-by: Hainan Xu <hainanx@nvidia.com>
What does this PR do?
num_dataset_builder_threads param usage.
Collection: [Note which collection this PR will affect]
Changelog
Usage
# Add a code snippet demonstrating how to use this
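As a rough illustration of the "offline" step named in the PR title, the sketch below builds dataset caches and then stops before training when a `data_cache_generation_only` flag is set. Only that flag name comes from this PR. The functions `build_or_load_cache` and `run`, the cache file layout, and the return values are assumptions made for the example, not NeMo's actual code.

```python
import json
import tempfile
from pathlib import Path


def build_or_load_cache(cache_dir: Path, prefix: str) -> tuple[Path, bool]:
    """Return (cache_file, was_built); the stored "index" is a placeholder."""
    cache_file = cache_dir / f"{prefix}.index.json"
    if cache_file.exists():
        return cache_file, False  # reuse the cache from an earlier run
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps({"prefix": prefix, "indices": [0, 1, 2]}))
    return cache_file, True


def run(cache_dir: Path, prefixes: list[str], data_cache_generation_only: bool) -> str:
    """Build the caches, then either stop or continue to placeholder training."""
    for prefix in prefixes:
        build_or_load_cache(cache_dir, prefix)
    if data_cache_generation_only:
        return "cache_only"  # offline step: stop once the cache is written
    return "trained"  # stand-in for the real training loop


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "data_cache"
    # Offline step: generate the cache and stop.
    first = run(cache_dir, ["shard_a", "shard_b"], data_cache_generation_only=True)
    # A later call finds the existing cache and does not rebuild it.
    _, rebuilt = build_or_load_cache(cache_dir, "shard_a")
    second = run(cache_dir, ["shard_a", "shard_b"], data_cache_generation_only=False)
```

In this sketch the first call only writes cache files, so the later training call can reuse them instead of rebuilding.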
GitHub Actions CI
The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.
The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.