Implement DeepSpeed Main autotuning for NeoX (#739)
* Add autotuning

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Add autotuning config

* Need to add it to deepspeed args

* Do not calculate derived values when autotuning

* Do not calculate derived values when autotuning

* Do not calculate derived values when autotuning

* Do not calculate derived values when autotuning

* Do not calculate derived values when autotuning

* Need to set no_ssh_check argument with slurm....

* set master_address for SLURM

* set master_address for SLURM

* let json be a file ending

* Write configs to json files instead of passing them in as CL arguments

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Write configs to json files instead of passing them in as CL arguments

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Pass in slurm_comment directly to DeepSpeed

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Move slurm_comment to deepspeed args

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Move slurm_comment to deepspeed args

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Slurm comment

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Slurm comment

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Slurm comment

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Move configs out of /tmp

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Get values from ds_config when autotuning

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Get values from ds_config when autotuning

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Pass in autotuning config properly

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Debug print statement

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* lower mem requirement in tune.sh

Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>

* Cursed hack to pass in autotuning config properly

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Cursed hack to pass in autotuning config properly

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* More sophisticated typing for autotuning config

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* More sophisticated typing for autotuning config

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* More sophisticated typing for autotuning config

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* So much debugging

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Small bug

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Debugging print statements...

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* json configs for DeepSpeed

Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>

* only two nodes

* Needed to change up the configs

* Do not actually need to do that

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Tune 6.7B model

* New types for zero stage

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* New types for zero stage

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Tuning a larger model

* Always copy autotuning args from ds_config

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Always copy autotuning args from ds_config

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Cleaner this way, I think...

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* New debug print statement

* New debug print statement

* Need to copy this over as well

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Need to copy over train_batch_size as well

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* debug print

* new configs

Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>

* Tests

* Sync with new method of passing in autotuning configs

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Replicate on different cluster

Signed-off-by: Dashiell Stander <dashiell@slurm-login-0.slurm-login.tenant-stabilitytraining-704a100.svc.tenant.chi.local>

* Update NeoXArgs docs automatically

* Use typing `List` and fix bug in decoding

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Use checkpoint_factor

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Change autotuning config name

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Add no_ssh_check config option

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* no_ssh_check should be a configured value

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Only pass in master_addr once

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* DeepSpeed now base64 encodes ds_config

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* whoops

* still need to pass in megatron_fp

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* still need to pass in megatron_fp

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Only write to file when doing autotuning

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Update NeoXArgs docs automatically

* Remove debugging configs

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Remove test scripts

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Update NeoXArgs docs automatically

* Remove test script

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Update NeoXArgs docs automatically

* Clean up

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Update NeoXArgs docs automatically

* Run pre-commit hooks

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Update NeoXArgs docs automatically

* base64 error

Signed-off-by: Dashiell Stander <dstander@protonmail.com>

* Update NeoXArgs docs automatically

* Update NeoXArgs docs automatically

* remove duplicated einops

* Move autotuning configs into their own subdir

---------

Signed-off-by: Dashiell Stander <dstander@protonmail.com>
Signed-off-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
Signed-off-by: Dashiell Stander <dashiell@slurm-login-0.slurm-login.tenant-stabilitytraining-704a100.svc.tenant.chi.local>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-45-20.ec2.internal>
Co-authored-by: Dashiell Stander <dashiell@ip-172-31-47-203.ec2.internal>
Co-authored-by: Dashiell Stander <dashiell@slurm-login-0.slurm-login.tenant-stabilitytraining-704a100.svc.tenant.chi.local>
Co-authored-by: github-actions <github-actions@github.com>
Co-authored-by: Quentin Anthony <qganthony@yahoo.com>
6 people authored Mar 9, 2023
1 parent 68d223c commit e897c23
Showing 9 changed files with 509 additions and 38 deletions.
78 changes: 78 additions & 0 deletions configs/autotuning_configs/small_tune.json
@@ -0,0 +1,78 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,

"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,


"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1.0e-8
}
},

"train_micro_batch_size_per_gpu": 1,
"data-impl": "mmap",
"split": "949,50,1",

"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

"gradient_clipping": 1.0,
"weight-decay": 0.0,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,

"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"train-iters": 320000,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"eval-interval": 1000,
"eval-iters": 10,

"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,
"launcher": "slurm",
"deepspeed_slurm": true,
"comment": "neox",
"autotuning": {
"enabled": true,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
},
"zero_optimization": {
"stage": [0, 1, 2, 3]
},
"train-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"],
"valid-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"],
"test-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"]
}
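
The `autotuning.arg_mappings` block in this config tells the DeepSpeed autotuner which command-line flag should carry each tuned value back into the training script. A minimal sketch of that mapping step, for illustration only (the helper and the tuned values are assumed, not DeepSpeed's actual code):

```python
# Illustration only: "arg_mappings" pairs each tuned DeepSpeed config key with
# the CLI flag used to hand the tuned value back to the training script.
tuned_values = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
}

arg_mappings = {
    "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
    "gradient_accumulation_steps": "--gradient_accumulation_steps",
}

def to_cli_args(tuned: dict, mappings: dict) -> list:
    """Build the extra launcher arguments for one autotuning experiment."""
    args = []
    for key, flag in mappings.items():
        if key in tuned:
            args += [flag, str(tuned[key])]
    return args

print(to_cli_args(tuned_values, arg_mappings))
# ['--train_micro_batch_size_per_gpu', '4', '--gradient_accumulation_steps', '8']
```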
72 changes: 72 additions & 0 deletions configs/autotuning_configs/tune.json
@@ -0,0 +1,72 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,
"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,
"scaled-upper-triang-masked-softmax-fusion": true,
"bias-gelu-fusion": true,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1.0e-8
}
},
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"allgather_bucket_size": 500000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 500000000,
"contiguous_gradients": true,
"cpu_offload": false
},
"train_micro_batch_size_per_gpu": 1,
"autotuning_config": {
"enabled": true,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
},
"data-impl": "mmap",
"split": "949,50,1",
"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,
"gradient_clipping": 1.0,
"weight-decay": 0.0,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"train-iters": 200,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,
"launcher": "slurm",
"deepspeed_slurm": true,
"comment": "neox"
}
86 changes: 86 additions & 0 deletions configs/autotuning_configs/tune_1-3B.json
@@ -0,0 +1,86 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,

"num-layers": 24,
"hidden-size": 2048,
"num-attention-heads": 16,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,
"gpt_j_residual": false,
"output_layer_parallelism": "column",
"attention_config": [[["flash"], 24]],
"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,

"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0002,
"betas": [0.9, 0.95],
"eps": 1.0e-8
}
},
"min_lr": 0.00002,

"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 500000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 500000000,
"contiguous_gradients": true
},
"train_micro_batch_size_per_gpu": 1,
"autotuning": {
"enabled": true,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
},
"data-impl": "mmap",

"checkpoint-activations": false,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

"gradient_clipping": 1.0,
"weight-decay": 0.1,
"hidden-dropout": 0,
"attention-dropout": 0,

"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"train-iters": 320000,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"launcher": "slurm",
"deepspeed_slurm": true,
"no_ssh_check": true,

"log-interval": 10,
"steps_per_print": 10,
"keep-last-n-checkpoints": 1,
"wall_clock_breakdown": true
}
77 changes: 77 additions & 0 deletions configs/autotuning_configs/tune_6-7B.json
@@ -0,0 +1,77 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 8,

"num-layers": 32,
"hidden-size": 4096,
"num-attention-heads": 32,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,


"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00012,
"betas": [0.9, 0.999],
"eps": 1.0e-8
}
},

"train_micro_batch_size_per_gpu": 1,
"zero_optimization": {
"stage": [0, 1, 2, 3]
},
"data-impl": "mmap",
"split": "949,50,1",

"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

"gradient_clipping": 1.0,
"weight-decay": 0,
"hidden-dropout": 0,
"attention-dropout": 0,

"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"train-iters": 100,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,
"launcher": "slurm",
"deepspeed_slurm": true,
"no_ssh_check": true,
"comment": "neox",
"autotuning": {
"enabled": true,
"mp_size": 8,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
}
}
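
The commits above about copying autotuning values out of `ds_config` ("Get values from ds_config when autotuning", "Need to copy over train_batch_size as well") exist because DeepSpeed ties these settings together through its batch-size identity: if the tuner changes the micro batch size or gradient accumulation steps, the copies held by NeoX must agree. A worked example of that identity, with every number assumed for illustration:

```python
# Worked example of the batch-size identity DeepSpeed enforces; all numbers
# here are assumptions, not values from this commit.
world_size = 64   # total GPUs in the job (assumed)
mp_size = 8       # "model-parallel-size" / autotuning "mp_size" above
pp_size = 1       # "pipe-parallel-size" above
micro_batch = 4   # tuned train_micro_batch_size_per_gpu (assumed)
grad_accum = 8    # tuned gradient_accumulation_steps (assumed)

data_parallel_size = world_size // (mp_size * pp_size)
train_batch_size = micro_batch * grad_accum * data_parallel_size
assert data_parallel_size == 8 and train_batch_size == 256
```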
26 changes: 24 additions & 2 deletions configs/neox_arguments.md
@@ -592,7 +592,7 @@ Optimizer Arguments



- **zero_stage**: int
- **zero_stage**: typing.Union[int, typing.List[int], typing.Literal['all']]

Default = None
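
In other words, `zero_stage` can now be a single stage, an explicit list of stages for the autotuner to sweep, or `"all"`. A hypothetical normalization helper (not code from this commit) that illustrates the accepted shapes:

```python
# Hypothetical helper showing what the widened zero_stage type admits.
from typing import List, Literal, Union

def normalize_zero_stage(stage: Union[int, List[int], Literal["all"]]) -> List[int]:
    if stage == "all":
        return [0, 1, 2, 3]
    if isinstance(stage, int):
        return [stage]
    return list(stage)

assert normalize_zero_stage("all") == [0, 1, 2, 3]
assert normalize_zero_stage(1) == [1]
assert normalize_zero_stage([0, 2]) == [0, 2]
```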

@@ -1732,6 +1732,14 @@ Args for deepspeed config



- **autotuning**: dict

Default = None

Dictionary as described in DeepSpeed autotuning documentation: https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/autotuning



## NeoXArgsDeepspeedRunner

Args for deepspeed runner (deepspeed.launcher.runner).
@@ -1801,7 +1809,7 @@ Args for deepspeed runner (deepspeed.launcher.runner).



- **launcher**: str
- **launcher**: typing.Literal['pdsh', 'openmpi', 'mvapich', 'slurm']

Default = pdsh

@@ -1817,6 +1825,12 @@ Args for deepspeed runner (deepspeed.launcher.runner).



- **autotuning_run**: str

Default = None

Either "tune", "run", or `None`.

- **no_ssh_check**: bool

Default = False
Expand All @@ -1831,3 +1845,11 @@ Args for deepspeed runner (deepspeed.launcher.runner).

Adds a `--comment` to the DeepSpeed launch command. In DeeperSpeed this is passed on to the SlurmLauncher as well. Sometimes necessary for cluster rules, or so I've heard.



- **no_ssh_check**: bool

Default = False

If `True` and running with multiple nodes, then DeepSpeed doesn't conduct a check to ensure the head node is reachable with ssh.
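
Taken together, `autotuning_run`, `no_ssh_check`, and `comment` are runner options that end up as flags on the DeepSpeed launch command. A rough sketch of that translation; apart from `--comment`, which the docs above name explicitly, the flag names are assumptions rather than confirmed launcher flags:

```python
# Rough sketch of turning the runner options above into launcher flags.
def runner_flags(autotuning_run=None, no_ssh_check=False, comment=None):
    flags = []
    if autotuning_run is not None:   # "tune" or "run", per autotuning_run above
        flags += ["--autotuning", autotuning_run]
    if no_ssh_check:                 # skip the head-node ssh reachability check
        flags.append("--no_ssh_check")
    if comment is not None:          # e.g. a SLURM comment required by cluster rules
        flags += ["--comment", comment]
    return flags

print(runner_flags(autotuning_run="tune", no_ssh_check=True, comment="neox"))
# ['--autotuning', 'tune', '--no_ssh_check', '--comment', 'neox']
```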

11 changes: 11 additions & 0 deletions configs/slurm_local.json
@@ -0,0 +1,11 @@
{
"vocab-file": "data/gpt2-vocab.json",
"merge-file": "data/gpt2-merges.txt",
"save": "checkpoints",
"checkpoint_validation_with_forward_pass": false,
"tensorboard-dir": "tensorboard",
"log-dir": "logs",
"use_wandb": true,
"wandb_host": "https://api.wandb.ai",
"wandb_project": "neox"
}

1 comment on commit e897c23

@silverriver

It seems that this commit breaks the following claim about the function `consume_neox_args`:

In order not to have any problems with different configs being mismatched across machines, we instead read the .yaml configuration file from the main rank, then serialize the arguments to a dictionary, which the deepspeed launcher broadcasts to all machines (--megatron_config).

megatron_config is not broadcast after this commit. Instead, it just passes a local file path to --megatron_config.
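
For context, the commit messages above mention both approaches ("Write configs to json files instead of passing them in as CL arguments", "DeepSpeed now base64 encodes ds_config"). A minimal sketch, not the repository's actual `consume_neox_args`, contrasting the two ways a serialized config can reach worker ranks:

```python
# Illustration only: the two hand-off styles the comment above contrasts.
import base64
import json

neox_config = {"train_micro_batch_size_per_gpu": 1, "zero_optimization": {"stage": 1}}

# (a) Serialize and base64-encode the dict so it can ride on the launcher
#     command line; every rank decodes the same bytes.
blob = base64.urlsafe_b64encode(json.dumps(neox_config).encode()).decode()
assert json.loads(base64.urlsafe_b64decode(blob)) == neox_config

# (b) Write the dict to a JSON file and pass only the path; this assumes every
#     node can read an identical copy of that file.
with open("megatron_config.json", "w") as f:
    json.dump(neox_config, f)
```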

Please sign in to comment.