Implement DeepSpeed Main autotuning for NeoX #739

Merged
101 commits merged on Mar 9, 2023

Commits
26e2255
Merge branch 'srun' into autotune
Sep 21, 2022
07f49ae
Add autotuning
dashstander Sep 26, 2022
c3611e9
Add autotuning config
Sep 26, 2022
5d5626b
Need to add it to deepspeed args
dashstander Sep 27, 2022
80c4d3d
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
dashstander Sep 27, 2022
c26a656
Do not calculate derived values when autotuning
dashstander Sep 28, 2022
80661a1
Do not calculate derived values when autotuning
dashstander Sep 28, 2022
b2de9ba
Do not calculate derived values when autotuning
dashstander Sep 28, 2022
08a6300
Do not calculate derived values when autotuning
dashstander Sep 28, 2022
f5a35da
Do not calculate derived values when autotuning
dashstander Sep 28, 2022
7ea39ee
Need to set no_ssh_check argument with slurm....
dashstander Sep 28, 2022
ee22677
set master_address for SLURM
dashstander Sep 28, 2022
eb658e3
set master_address for SLURM
dashstander Sep 28, 2022
21e1708
let json be a file ending
dashstander Sep 28, 2022
110a31f
Write configs to json files instead of passing them in as CL arguments
dashstander Sep 29, 2022
dece01b
Write configs to json files instead of passing them in as CL arguments
dashstander Sep 29, 2022
ecd8f8c
Pass in slurm_comment directly to DeepSpeed
dashstander Oct 11, 2022
c390be1
Move slurm_comment to deepspeed args
dashstander Oct 11, 2022
8033d35
Move slurm_comment to deepspeed args
dashstander Oct 11, 2022
e1d6b92
Slurm comment
dashstander Oct 11, 2022
bb0209d
Slurm comment
dashstander Oct 11, 2022
0097a39
Slurm comment
dashstander Oct 11, 2022
669ee08
Move configs out of /tmp
dashstander Oct 14, 2022
dfbe565
Get values from ds_config when autotuning
dashstander Oct 17, 2022
ef94f8d
Get values from ds_config when autotuning
dashstander Oct 17, 2022
091b115
Merge main
dashstander Oct 18, 2022
ccf1fff
Pass in autotuning config properly
dashstander Oct 19, 2022
eb45fc0
Debug print statement
dashstander Oct 19, 2022
7d093e3
lower mem requirement in tune.sh
Oct 19, 2022
36ca337
Cursed hack to pass in autotuning config properly
dashstander Oct 19, 2022
1af17de
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
Oct 19, 2022
062669f
Cursed hack to pass in autotuning config properly
dashstander Oct 19, 2022
a094e43
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
Oct 19, 2022
ddeabf3
More sophisticated typing for autotuning config
dashstander Oct 19, 2022
db3d8f7
More sophisticated typing for autotuning config
dashstander Oct 19, 2022
26843b5
More sophisticated typing for autotuning config
dashstander Oct 19, 2022
8de8d7d
So much debuggin
dashstander Oct 19, 2022
f252245
Small bug
dashstander Oct 19, 2022
d8f86e2
Debugging print statements...
dashstander Oct 19, 2022
4c4fa1a
json configs for DeepSpeed
Oct 20, 2022
b31ea65
only two nodes
Oct 20, 2022
4071530
Needed to change up the configs
Oct 21, 2022
fda171f
Do not actually need to do that
dashstander Oct 25, 2022
d1f7d25
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
dashstander Oct 25, 2022
6073a24
Tune 6.7B model
Oct 25, 2022
46c3a7a
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
Oct 25, 2022
ab724bf
New types for zero stage
dashstander Oct 25, 2022
93da03f
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
Oct 25, 2022
35d825e
New types for zero stage
dashstander Oct 25, 2022
6b81a69
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
Oct 25, 2022
cf561a1
Tuning a larger model
Oct 25, 2022
58f93c2
Always copy autotuning args from ds_config
dashstander Oct 25, 2022
3c1b999
Always copy autotuning args from ds_config
dashstander Oct 25, 2022
8611453
Cleaner this way, I think...
dashstander Oct 25, 2022
ccca0af
New debug print statement
dashstander Oct 25, 2022
b411957
New debug print statement
dashstander Oct 25, 2022
ae327cf
Need to copy this over as well
dashstander Oct 25, 2022
8a718af
Need to copy over train_batch_size as well
dashstander Oct 25, 2022
96e009f
debug print
dashstander Nov 8, 2022
5488ca9
new configs
Nov 10, 2022
5c4a9f8
Tests
Nov 10, 2022
2d4691f
Merge branch 'main' into autotune
dashstander Dec 7, 2022
7bce1fe
Sync with new method of passing in autotuning configs
dashstander Dec 12, 2022
07c891d
Merge main
Jan 6, 2023
22a04ac
Replicate on different cluster
Jan 6, 2023
32480b3
Update NeoXArgs docs automatically
invalid-email-address Jan 6, 2023
0349fad
Use typing `List` and fix bug in decoding
dashstander Jan 6, 2023
a65ba8e
Use checkpoint_factor
dashstander Jan 7, 2023
4c5d26f
Change autotuning config name
dashstander Jan 7, 2023
2f4edfd
Add no_ssh_check config option
dashstander Jan 7, 2023
4913228
no_ssh_check should be a configured value
dashstander Jan 9, 2023
c2ce245
Only pass in master_addr once
dashstander Jan 10, 2023
c8c357f
DeepSpeed now base64 encodes ds_config
dashstander Jan 10, 2023
e3cf1d4
whoops
dashstander Jan 10, 2023
018151c
still need to pass in megatron_fp
dashstander Jan 10, 2023
a725ef3
still need to pass in megatron_fp
dashstander Jan 10, 2023
ca00004
Only write to file when doing autotuning
dashstander Jan 12, 2023
8ce31d5
Merge branch 'main' into autotune
Quentin-Anthony Jan 15, 2023
f3844f8
Update NeoXArgs docs automatically
invalid-email-address Jan 15, 2023
1c2fb3f
Remove debugging configs
dashstander Jan 16, 2023
00c9df6
Remove test scripts
dashstander Jan 16, 2023
9c1d2fe
Update NeoXArgs docs automatically
invalid-email-address Jan 16, 2023
2c5dedd
Remove test script
dashstander Jan 16, 2023
917b0b1
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
dashstander Jan 16, 2023
a28f4b8
Update NeoXArgs docs automatically
invalid-email-address Jan 16, 2023
6326ca1
Clean up
dashstander Jan 16, 2023
08edb21
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
dashstander Jan 16, 2023
6cd80ab
Update NeoXArgs docs automatically
invalid-email-address Jan 16, 2023
0f2e492
Run pre-commit hooks
dashstander Jan 16, 2023
a59c9ec
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
dashstander Jan 16, 2023
b92e936
Update NeoXArgs docs automatically
invalid-email-address Jan 16, 2023
bc586c9
base64 error
dashstander Jan 16, 2023
e05c967
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
dashstander Jan 16, 2023
c19b020
Update NeoXArgs docs automatically
invalid-email-address Jan 16, 2023
d9996d7
Merge branch 'main' into autotune
Quentin-Anthony Feb 14, 2023
14565fc
Update NeoXArgs docs automatically
invalid-email-address Feb 14, 2023
a79b566
Merge branch 'deepspeed_main' into autotune
Quentin-Anthony Feb 14, 2023
b156b02
Merge branch 'main' into autotune
Quentin-Anthony Mar 9, 2023
cb7a8bc
remove duplicated einops
Quentin-Anthony Mar 9, 2023
2e69e0f
Move autotuning configs into their own subdir
Quentin-Anthony Mar 9, 2023
9a9e773
Merge branch 'autotune' of https://github.com/EleutherAI/gpt-neox int…
Quentin-Anthony Mar 9, 2023
78 changes: 78 additions & 0 deletions configs/autotuning_configs/small_tune.json
@@ -0,0 +1,78 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,

"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,


"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1.0e-8
}
},

"train_micro_batch_size_per_gpu": 1,
"data-impl": "mmap",
"split": "949,50,1",

"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

"gradient_clipping": 1.0,
"weight-decay": 0.0,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,

"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"train-iters": 320000,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"eval-interval": 1000,
"eval-iters": 10,

"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,
"launcher": "slurm",
"deepspeed_slurm": true,
"comment": "neox",
"autotuning": {
"enabled": true,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
},
"zero_optimization": {
"stage": [0, 1, 2, 3]
},
"train-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"],
"valid-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"],
"test-data-paths": ["/fsx/pile_deduped/pile_0.87_deduped_text_document"]
}
72 changes: 72 additions & 0 deletions configs/autotuning_configs/tune.json
@@ -0,0 +1,72 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,
"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,
"scaled-upper-triang-masked-softmax-fusion": true,
"bias-gelu-fusion": true,
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1.0e-8
}
},
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"allgather_bucket_size": 500000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 500000000,
"contiguous_gradients": true,
"cpu_offload": false
},
"train_micro_batch_size_per_gpu": 1,
"autotuning_config": {
"enabled": true,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
},
"data-impl": "mmap",
"split": "949,50,1",
"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,
"gradient_clipping": 1.0,
"weight-decay": 0.0,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"train-iters": 200,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,
"launcher": "slurm",
"deepspeed_slurm": true,
"comment": "neox"
}
86 changes: 86 additions & 0 deletions configs/autotuning_configs/tune_1-3B.json
@@ -0,0 +1,86 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,

"num-layers": 24,
"hidden-size": 2048,
"num-attention-heads": 16,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,
"gpt_j_residual": false,
"output_layer_parallelism": "column",
"attention_config": [[["flash"], 24]],
"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,

"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0002,
"betas": [0.9, 0.95],
"eps": 1.0e-8
}
},
"min_lr": 0.00002,

"zero_optimization": {
"stage": 1,
"allgather_partitions": true,
"allgather_bucket_size": 500000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 500000000,
"contiguous_gradients": true
},
"train_micro_batch_size_per_gpu": 1,
"autotuning": {
"enabled": true,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
},
"data-impl": "mmap",

"checkpoint-activations": false,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

"gradient_clipping": 1.0,
"weight-decay": 0.1,
"hidden-dropout": 0,
"attention-dropout": 0,

"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"train-iters": 320000,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"launcher": "slurm",
"deepspeed_slurm": true,
"no_ssh_check": true,

"log-interval": 10,
"steps_per_print": 10,
"keep-last-n-checkpoints": 1,
"wall_clock_breakdown": true
}
77 changes: 77 additions & 0 deletions configs/autotuning_configs/tune_6-7B.json
@@ -0,0 +1,77 @@
{
"pipe-parallel-size": 1,
"model-parallel-size": 8,

"num-layers": 32,
"hidden-size": 4096,
"num-attention-heads": 32,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,


"optimizer": {
"type": "Adam",
"params": {
"lr": 0.00012,
"betas": [0.9, 0.999],
"eps": 1.0e-8
}
},

"train_micro_batch_size_per_gpu": 1,
"zero_optimization": {
"stage": [0, 1, 2, 3]
},
"data-impl": "mmap",
"split": "949,50,1",

"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

"gradient_clipping": 1.0,
"weight-decay": 0,
"hidden-dropout": 0,
"attention-dropout": 0,

"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"train-iters": 100,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"checkpoint-factor": 10000,
"eval-interval": 1000,
"eval-iters": 10,
"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,
"launcher": "slurm",
"deepspeed_slurm": true,
"no_ssh_check": true,
"comment": "neox",
"autotuning": {
"enabled": true,
"mp_size": 8,
"arg_mappings": {
"train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
"gradient_accumulation_steps ": "--gradient_accumulation_steps"
}
}
}
26 changes: 24 additions & 2 deletions configs/neox_arguments.md
@@ -592,7 +592,7 @@ Optimizer Arguments



- **zero_stage**: int
- **zero_stage**: typing.Union[int, typing.List[int], typing.Literal['all']]

Default = None

@@ -1732,6 +1732,14 @@ Args for deepspeed config



- **autotuning**: dict

Default = None

Dictionary as described in DeepSpeed autotuning documentation: https://github.com/microsoft/DeepSpeed/tree/master/deepspeed/autotuning
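For reference, the `autotuning` block used by the configs added in this PR looks like the minimal sketch below; the full set of options is described in the DeepSpeed autotuning documentation linked above.

```json
{
  "autotuning": {
    "enabled": true,
    "arg_mappings": {
      "train_micro_batch_size_per_gpu": "--train_micro_batch_size_per_gpu",
      "gradient_accumulation_steps": "--gradient_accumulation_steps"
    }
  }
}
```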



## NeoXArgsDeepspeedRunner

Args for deepspeed runner (deepspeed.launcher.runner).
@@ -1801,7 +1809,7 @@ Args for deepspeed runner (deepspeed.launcher.runner).



- **launcher**: str
- **launcher**: typing.Literal['pdsh', 'openmpi', 'mvapich', 'slurm']

Default = pdsh

@@ -1817,6 +1825,12 @@ Args for deepspeed runner (deepspeed.launcher.runner).



- **autotuning_run**: str

Default = None

Either "tune", "run", or `None`.

- **no_ssh_check**: bool

Default = False
Expand All @@ -1831,3 +1845,11 @@ Args for deepspeed runner (deepspeed.launcher.runner).

Adds a `--comment` to the DeepSpeed launch command. In DeeperSpeed this is passed on to the SlurmLauncher as well. Sometimes necessary for cluster rules, or so I've heard.



- **no_ssh_check**: bool

Default = False

If `True` and running with multiple nodes, then DeepSpeed doesn't conduct a check to ensure the head node is reachable via ssh.
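The SLURM-oriented autotuning configs in this PR (for example `tune_1-3B.json` and `tune_6-7B.json` above) combine this option with the other runner settings, roughly as follows:

```json
{
  "launcher": "slurm",
  "deepspeed_slurm": true,
  "no_ssh_check": true,
  "comment": "neox"
}
```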

11 changes: 11 additions & 0 deletions configs/slurm_local.json
@@ -0,0 +1,11 @@
{
"vocab-file": "data/gpt2-vocab.json",
"merge-file": "data/gpt2-merges.txt",
"save": "checkpoints",
"checkpoint_validation_with_forward_pass": false,
"tensorboard-dir": "tensorboard",
"log-dir": "logs",
"use_wandb": true,
"wandb_host": "https://api.wandb.ai",
"wandb_project": "neox"
}