Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Implement DeepSpeed Main autotuning for NeoX (#739)
* Add autotuning Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Add autotuning config
* Need to add it to deepspeed args
* Do not calculate derived values when autotuning
* Do not calculate derived values when autotuning
* Do not calculate derived values when autotuning
* Do not calculate derived values when autotuning
* Do not calculate derived values when autotuning
* Need to set no_ssh_check argument with slurm....
* set master_address for SLURM
* set master_address for SLURM
* let json be a file ending
* Write configs to json files instead of passing them in as CL arguments Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Write configs to json files instead of passing them in as CL arguments Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Pass in slurm_comment directly to DeepSpeed Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Move slurm_comment to deepspeed args Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Move slurm_comment to deepspeed args Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Slurm comment Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Slurm comment Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Slurm comment Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Move configs out of /tmp Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Get values from ds_config when autotuning Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Get values from ds_config when autotuning Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Pass in autotuning config properly Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Debug print statement Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* lower mem requirement in tune.sh Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
* Cursed hack to pass in autotuning config properly Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Cursed hack to pass in autotuning config properly Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* More sophisticated typing for autotuning config Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* More sophisticated typing for autotuning config Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* More sophisticated typing for autotuning config Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* So much debuggin Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Small bug Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Debugging print statements... Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* json configs for DeepSpeed Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
* only two nodes
* Needed to change up the configs
* Do not actually need to do that Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Tune 6.7B model
* New types for zero stage Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* New types for zero stage Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Tuning a larger model
* Always copy autotuning args from ds_config Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Always copy autotuning args from ds_config Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Cleaner this way, I think... Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* New debug print statement
* New debug print statement
* Need to copy this over as well Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Need to copy over train_batch_size as well Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* debug print
* new configs Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
* Tests
* Sync with new method of passing in autotuning configs Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Replicate on different cluster Signed-off-by: Dashiell Stander <dashiell@slurm-login-0.slurm-login.tenant-stabilitytraining-704a100.svc.tenant.chi.local>
* Update NeoXArgs docs automatically
* Use typing `List` and fix bug in decoding Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Use checkpoint_factor Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Change autotuning config name Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Add no_ssh_check config option Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* no_ssh_check should be a configured value Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Only pass in master_addr once Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* DeepSpeed now base64 encodes ds_config Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* whoops
* still need to pass in megatron_fp Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* still need to pass in megatron_fp Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Only write to file when doing autotuning Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Update NeoXArgs docs automatically
* Remove debugging configs Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Remove test scripts Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Update NeoXArgs docs automatically
* Remove test script Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Update NeoXArgs docs automatically
* Clean up Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Update NeoXArgs docs automatically
* Run pre-commit hooks Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Update NeoXArgs docs automatically
* base64 error Signed-off-by: Dashiell Stander <dstander@protonmail.com>
* Update NeoXArgs docs automatically
* Update NeoXArgs docs automatically
* remove duplicated einops
* Move autotuning configs into their own subdir

---------

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
Signed-off-by: Dashiell Stander <dashiell@slurm-login-0.slurm-login.tenant-stabilitytraining-704a100.svc.tenant.chi.local>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
Co-authored-by: Dashiell Stander <dashiell@slurm-login-0.slurm-login.tenant-stabilitytraining-704a100.svc.tenant.chi.local>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
e897c23
It seems that this commit breaks the following claim made for the function `consume_neox_args`:

"In order not to have any problems with different configs being mismatched across machines, we instead read the .yaml configuration file from the main rank, then serialize the arguments to a dictionary, which the deepspeed launcher broadcasts to all machines (`--megatron_config`)."

After this commit, `megatron_config` is no longer broadcast. Instead, only a local file path is passed to `--megatron_config`.
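To make the concern concrete, here is a minimal sketch of the two ways a config can reach a worker: as a serialized payload carried in the launch command, or as a path to a file on the launching machine. This is not the GPT-NeoX or DeepSpeed code. The helper names (`encode_inline`, `write_to_file`, `load_megatron_config`), the file name, and the config keys are all hypothetical, and removing the temporary directory only simulates a worker that does not share a filesystem with the main rank.

```python
import base64
import json
import os
import tempfile


def encode_inline(config: dict) -> str:
    """Serialize the config into a base64 string that travels inside the launch command."""
    raw = json.dumps(config).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("utf-8")


def write_to_file(config: dict, directory: str) -> str:
    """Write the config to a JSON file and return its (machine-local) path."""
    path = os.path.join(directory, "megatron_config.json")
    with open(path, "w") as f:
        json.dump(config, f)
    return path


def load_megatron_config(arg_value: str) -> dict:
    """Worker side: accept either a JSON file path or an inline base64 payload."""
    if arg_value.endswith(".json"):
        # Only works if this machine can see the file (e.g. a shared filesystem).
        with open(arg_value) as f:
            return json.load(f)
    return json.loads(base64.urlsafe_b64decode(arg_value).decode("utf-8"))


config = {"train_batch_size": 32, "zero_stage": 1}  # hypothetical values

# Inline payload: every worker receives the full config with the command itself.
inline_result = load_megatron_config(encode_inline(config))

# File path: succeeds only while the file is visible to the worker.
with tempfile.TemporaryDirectory() as tmp:
    path = write_to_file(config, tmp)
    file_result = load_megatron_config(path)

# The directory is gone now, standing in for a worker on another machine.
try:
    load_megatron_config(path)
    missing_file_error = False
except FileNotFoundError:
    missing_file_error = True
```

Under these assumptions, the file-path variant depends on every node seeing the same file, which is the guarantee the quoted claim says the broadcast was meant to provide.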