Segfault when training large GPT2 models on single GPU #679

Closed
jeffbinder opened this issue Jan 19, 2021 · 14 comments · Fixed by #735

@jeffbinder

jeffbinder commented Jan 19, 2021

I'm trying to use DeepSpeed to finetune GPT2 models on a single RTX 3090 GPU. Using the scripts included with huggingface-transformers, I have been able to get it working up through the 774M model, and the ZeRO optimizations enable me to double the batch size. However, the CPU Adam optimizer is segfaulting when I try to train the 1558M model. I am using Ubuntu 20.04, CUDA 11.2, Nvidia drivers 460.32.03, and current git master versions of PyTorch, Transformers, and DeepSpeed.

Here is the script I used:

export BATCH_SIZE=1

export CUDA_VISIBLE_DEVICES=0
export CUDA_HOME=/usr/local/cuda-11.2
export TOKENIZERS_PARALLELISM=false
export MP_SIZE=1
export NUM_WORKERS=1
export NUM_GPUS_PER_WORKER=1

rm -r test_output

USE_TF=0 deepspeed --num_gpus=1 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size $BATCH_SIZE --per_device_eval_batch_size $BATCH_SIZE --fp16 --deepspeed ds_config.json

pofo-corpus.txt is the Poetry Foundation collection in a single text file (around 18MB). Here is the config file:

{
    "fp16": {
        "enabled": true,
        "loss_scale": 0,
        "loss_scale_window": 100,
        "hysteresis": 2,
        "min_loss_scale": 1e-24,
        "initial_scale_power": -2
    },

    "zero_allow_untested_optimizer": true,
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 1.8e7,
        "reduce_scatter": true,
        "reduce_bucket_size": 1.8e7,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "cpu_offload": true
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 1e-6,
            "warmup_max_lr": 5e-5,
            "warmup_num_steps": 500
        }
    }
}

I've experimented with a number of the settings, but none of them seem to affect the issue. Here is the output:

rm: cannot remove 'test_output': No such file or directory
[2021-01-18 14:10:29,800] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2021-01-18 14:10:29,815] [INFO] [runner.py:358:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 ../../src/transformers/examples/language-modeling/run_clm.py --output_dir=test_output --model_type=gpt2 --model_name_or_path=gpt2-xl --do_train --train_file=pofo-corpus.txt --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --fp16 --deepspeed ds_config.json
[2021-01-18 14:10:30,261] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0]}
[2021-01-18 14:10:30,261] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=1, node_rank=0
[2021-01-18 14:10:30,261] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2021-01-18 14:10:30,261] [INFO] [launch.py:100:main] dist_world_size=1
[2021-01-18 14:10:30,261] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0
[2021-01-18 14:10:31,069] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
WARNING:__main__:Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: True
Using custom data configuration default
Reusing dataset text (/home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab)
[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,547 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,547 >> Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1600,
  "n_head": 25,
  "n_inner": null,
  "n_layer": 48,
  "n_positions": 1024,
  "output_past": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|configuration_utils.py:445] 2021-01-18 14:10:31,624 >> loading configuration file https://huggingface.co/gpt2-xl/resolve/main/config.json from cache at /home/jechk/.cache/huggingface/transformers/d2de8fec009fa9b9196047559bcac6c1f02a9c500718b4346bc516354965b1ca.81d9c13b9ee3f2b22faaba04ca49e09b13f9fea3a7910768ed6664ec141e3c8b
[INFO|configuration_utils.py:481] 2021-01-18 14:10:31,625 >> Model config GPT2Config {
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1600,
  "n_head": 25,
  "n_inner": null,
  "n_layer": 48,
  "n_positions": 1024,
  "output_past": true,
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.3.0.dev0",
  "use_cache": true,
  "vocab_size": 50257
}

[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/vocab.json from cache at /home/jechk/.cache/huggingface/transformers/8560a2df03f812b276794ae6935255d0590522553a4c8103155472b07591a21b.c7ed1f96aac49e745788faa77ba0a26a392643a50bb388b9c04ff469e555241f
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/merges.txt from cache at /home/jechk/.cache/huggingface/transformers/18fe27e0b70062b3e45fc4e827d5449d9fe85875937594da927e48cb657366d1.5d12962c5ee615a4c803841266e9c3be9a691a924f72d395d3a6c6c81157788b
[INFO|tokenization_utils_base.py:1766] 2021-01-18 14:10:31,989 >> loading file https://huggingface.co/gpt2-xl/resolve/main/tokenizer.json from cache at /home/jechk/.cache/huggingface/transformers/aabb8839163cd911f810ab23f5ae8c966b9b9ea60622c429020611caa389b04b.cf2d0ecb83b6df91b3dbb53f1d1e4c311578bfd3aa0e04934215a49bf9898df0
[INFO|modeling_utils.py:1027] 2021-01-18 14:10:32,129 >> loading weights file https://huggingface.co/gpt2-xl/resolve/main/pytorch_model.bin from cache at /home/jechk/.cache/huggingface/transformers/96569b907e56747ce3e593c6a13d8475b8c733a64aab8af8f602b90d94c4af71.8fbbcdf404c82c5967934d411f1462fa0574d639f2aa398aa3754fced1bb26c0
[INFO|modeling_utils.py:1143] 2021-01-18 14:10:54,131 >> All model checkpoint weights were used when initializing GPT2LMHeadModel.

[INFO|modeling_utils.py:1151] 2021-01-18 14:10:54,131 >> All the weights of GPT2LMHeadModel were initialized from the model checkpoint at gpt2-xl.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-d5e960aa227f7b5e.arrow
Loading cached processed dataset at /home/jechk/.cache/huggingface/datasets/text/default-82f776b31993d586/0.0.0/daf90a707a433ac193b369c8cc1772139bb6cca21a9c7fe83bdd16aad9b9b6ab/cache-1b9b3e2f092a373d.arrow
[INFO|trainer.py:442] 2021-01-18 14:10:55,458 >> The following columns in the training set don't have a corresponding argument in `GPT2LMHeadModel.forward` and have been ignored: .
[INFO|trainer.py:359] 2021-01-18 14:10:55,458 >> Using amp fp16 backend
[INFO|integrations.py:323] 2021-01-18 14:10:55,459 >> Keeping the `scheduler` config from ds_config.json intact, ignoring any scheduler-specific cl args
[INFO|integrations.py:368] 2021-01-18 14:10:55,459 >> Keeping the `fp16` config from ds_config.json intact, ignoring any fp16-specific cl args
[2021-01-18 14:10:55,459] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed info: version=0.3.10+7b07e12, git-hash=7b07e12, git-branch=master
[2021-01-18 14:10:55,472] [INFO] [engine.py:73:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/jechk/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module cpu_adam...
Time to load cpu_adam op: 0.20410561561584473 seconds
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.900000, 0.999000), weight_decay=0.000000, adam_w=1
[2021-01-18 14:10:57,968] [INFO] [engine.py:540:_configure_optimizer] Using DeepSpeed Optimizer param name adamw as basic optimizer
[2021-01-18 14:10:57,968] [INFO] [engine.py:545:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.9, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 5e-05
    weight_decay: 0.0
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-01-18 14:10:57,968] [INFO] [engine.py:661:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/jechk/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.1086723804473877 seconds
[2021-01-18 14:10:58,077] [INFO] [stage2.py:130:__init__] Reduce bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:131:__init__] Allgather bucket size 18000000.0
[2021-01-18 14:10:58,077] [INFO] [stage2.py:132:__init__] CPU Offload: True
group 0 param 0 = 1557611200
[2021-01-18 14:11:03,591] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-01-18 14:11:03,591] [INFO] [engine.py:575:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fe0f0994d30>
[2021-01-18 14:11:03,591] [INFO] [engine.py:405:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupLR
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupLR object at 0x7fdfbc497ee0>
[2021-01-18 14:11:03,591] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.9, 0.999]]
[2021-01-18 14:11:03,591] [INFO] [config.py:733:print] DeepSpeedEngine configuration:
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fe0f0994a60>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   allreduce_always_fp32 ........ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   amp_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   amp_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   checkpoint_tag_validation_enabled  True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   checkpoint_tag_validation_fail  False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   disable_allgather ............ False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   dump_state ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   dynamic_loss_scale_args ...... {'init_scale': 0.25, 'scale_window': 100, 'delayed_shift': 2, 'min_scale': 1e-24}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   elasticity_enabled ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fe0f0994ac0>
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   fp16_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   global_rank .................. 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   gradient_accumulation_steps .. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   gradient_clipping ............ 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   gradient_predivide_factor .... 1.0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   initial_dynamic_scale ........ 0.25
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   loss_scale ................... 0
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   memory_breakdown ............. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   optimizer_legacy_fusion ...... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   optimizer_name ............... adamw
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   optimizer_params ............. {'lr': 5e-05, 'betas': [0.9, 0.999], 'eps': 1e-08, 'weight_decay': 0.0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   pld_enabled .................. False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   pld_params ................... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   prescale_gradients ........... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   scheduler_name ............... WarmupLR
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   scheduler_params ............. {'warmup_min_lr': 1e-06, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 500}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   sparse_attention ............. None
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   sparse_gradients_enabled ..... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   steps_per_print .............. 10
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   tensorboard_enabled .......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   tensorboard_output_path ...... 
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   train_batch_size ............. 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   train_micro_batch_size_per_gpu  1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   wall_clock_breakdown ......... False
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   world_size ................... 1
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_allow_untested_optimizer  True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_config .................. {
    "allgather_bucket_size": 18000000.0,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": true,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": true,
    "reduce_bucket_size": 18000000.0,
    "reduce_scatter": true,
    "stage": 2
}
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_enabled ................. True
[2021-01-18 14:11:03,592] [INFO] [config.py:737:print]   zero_optimization_stage ...... 2
[2021-01-18 14:11:03,592] [INFO] [config.py:739:print]   json = {
    "fp16":{
        "enabled":true,
        "hysteresis":2,
        "initial_scale_power":-2,
        "loss_scale":0,
        "loss_scale_window":100,
        "min_loss_scale":1e-24
    },
    "gradient_accumulation_steps":1,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.9,
                0.999
            ],
            "eps":1e-08,
            "lr":5e-05,
            "weight_decay":0.0
        },
        "type":"AdamW"
    },
    "scheduler":{
        "params":{
            "warmup_max_lr":5e-05,
            "warmup_min_lr":1e-06,
            "warmup_num_steps":500
        },
        "type":"WarmupLR"
    },
    "train_micro_batch_size_per_gpu":1,
    "zero_allow_untested_optimizer":true,
    "zero_optimization":{
        "allgather_bucket_size":18000000.0,
        "allgather_partitions":true,
        "contiguous_gradients":true,
        "cpu_offload":true,
        "overlap_comm":true,
        "reduce_bucket_size":18000000.0,
        "reduce_scatter":true,
        "stage":2
    }
}
Using /home/jechk/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.00028705596923828125 seconds
[INFO|trainer.py:810] 2021-01-18 14:11:03,643 >> ***** Running training *****
[INFO|trainer.py:811] 2021-01-18 14:11:03,643 >>   Num examples = 4917
[INFO|trainer.py:812] 2021-01-18 14:11:03,643 >>   Num Epochs = 3
[INFO|trainer.py:813] 2021-01-18 14:11:03,643 >>   Instantaneous batch size per device = 1
[INFO|trainer.py:814] 2021-01-18 14:11:03,643 >>   Total train batch size (w. parallel, distributed & accumulation) = 1
[INFO|trainer.py:815] 2021-01-18 14:11:03,643 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:816] 2021-01-18 14:11:03,643 >>   Total optimization steps = 14751
2021-01-18 14:11:03.737646: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
  0%|                                                                                                 | 0/14751 [00:00<?, ?it/s][W reducer.cpp:1042] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())

The program then exits abruptly. The segfault is reported in dmesg:

[ 9250.120732] python3[10345]: segfault at 7fde4685c850 ip 00007fdfbc2057e0 sp 00007fdf8cd3fe40 error 6
[ 9250.120738] python3[10349]: segfault at 7fde846a9f70 ip 00007fdfbc2057e0 sp 00007fdf8ad3be40 error 6
[ 9250.120743] python3[10344]: segfault at 7fde370c9288 ip 00007fdfbc2057e0 sp 00007fdfbcce3e40 error 6
[ 9250.120745] python3[10348]: segfault at 7fde74f169a8 ip 00007fdfbc2057e0 sp 00007fdf8b53ce40 error 6
[ 9250.120749] python3[10347]: segfault at 7fde657833e0 ip 00007fdfbc2057e0 sp 00007fdf8bd3de40 error 6
[ 9250.120752]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120754]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120755]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120761] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120763] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120764] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120766]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120767]  in cpu_adam.so[7fdfbc203000+16000]
[ 9250.120772] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff 00 00 80 7f 74 30 <89> 06 c4 c1 7a 11 0c 94 c4 c1 7a 11 64 95 00 c4 c1 7a 11 1c 96 48
[ 9250.120778] Code: ff ff 7f c5 f9 7e c8 81 e1 00 80 00 00 4a 03 74 db 30 81 ff ff ff 7f 7f 0f 86 8c 00 00 00 c5 79 7e d0 81 ff

I tried training similar models using the DeepSpeed version of Megatron-LM instead of huggingface-transformers, and the same thing happens: training works correctly up to a certain number of parameters, but it segfaults with sufficiently large models.

@jeffbinder
Author

jeffbinder commented Jan 21, 2021

I did a bit more digging around, and I think I have an idea of what's going on. The loop in Adam_Optimizer::Step is violating the bounds of _doubled_buffer because, when training the 1558M model, _param_size (1557611200) exceeds TILE (1073741824).

If I edit the code to change the value of TILE to 1557611200, then I am able to train the 1558M model using the CPU Adam optimizer with no apparent problem. Is there a particular reason why this value is hardcoded? I'm new to DeepSpeed, so forgive me if I'm misunderstanding something.

[EDIT: I originally noted that it was running slower than I expected, but that seems to have resulted from some weird problem with my installation. After a clean install, it runs at around 2.19s/it.]
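
To make the overrun concrete, here is a tiny standalone check of the index arithmetic (illustrative only, not DeepSpeed source; the constants are just the TILE value and the gpt2-xl parameter count quoted above):

// Quick index-arithmetic check (not DeepSpeed code). If the scalar
// (non-AVX) path copies all parameters in a single pass into a staging
// buffer sized for one TILE, the highest buffer index it touches is
// param_size - 1, which is past the end once param_size > TILE.
#include <cstdint>
#include <cstdio>

int main() {
    const std::int64_t TILE = 1024LL * 1024 * 1024;  // 1073741824, from cpu_adam.h
    const std::int64_t param_size = 1557611200;      // gpt2-xl parameter count
    std::printf("TILE               = %lld\n", static_cast<long long>(TILE));
    std::printf("param_size         = %lld\n", static_cast<long long>(param_size));
    std::printf("elements past TILE = %lld\n", static_cast<long long>(param_size - TILE));
    return 0;
}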

@stas00
Collaborator

stas00 commented Feb 5, 2021

Your report looks similar to a segfault I'm getting w/ DeepSpeed and transformers/t5-11b #726 (comment)

See if you still get the segfault if you use the parent of e51e4d7

cd DeepSpeed
git checkout e51e4d7^

@collinarnett

collinarnett commented Feb 5, 2021

@stas00
I am also trying to do the same as Jeff, but I have a 3090 and I can get 17 steps in. My example works with the default trainer at 256 token length with Adafactor, but when using DeepSpeed with CPU offload I get the following error.

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (1391 > 1024). Running this sequence through the model will result in indexing errors
Loading cached processed dataset at stories_dataset/cache-9e99c77f1a66e6d8.arrow
Loading cached processed dataset at stories_dataset/cache-d4dd3c505bd994ca.arrow
Loading cached processed dataset at stories_dataset/cache-6754a3027a1c188d.arrow
Loading cached processed dataset at stories_dataset/cache-abdcdc774d4104a8.arrow
Loading cached processed dataset at stories_dataset/cache-04780ef610a7f621.arrow
Loading cached processed dataset at stories_dataset/cache-675f830473d7e33b.arrow
Loading cached processed dataset at stories_dataset/cache-03c8fba9e4821eba.arrow
Loading cached processed dataset at stories_dataset/cache-785f7204bef347a6.arrow
Loading cached processed dataset at stories_dataset/cache-6e48eb5db4974563.arrow
Loading cached processed dataset at stories_dataset/cache-e20b2bb271c92d56.arrow
Loading cached processed dataset at stories_dataset/cache-6870e6fbb7b1056f.arrow
Loading cached processed dataset at stories_dataset/cache-c321192ed99f2b2a.arrow
Loading cached processed dataset at stories_dataset/cache-fd255eac8a97fa7a.arrow
Loading cached processed dataset at stories_dataset/cache-50072c6b3ee8a366.arrow
Loading cached processed dataset at stories_dataset/cache-91e579e877159438.arrow
Loading cached processed dataset at stories_dataset/cache-9cc56e36692ce94a.arrow
Loading cached processed dataset at stories_dataset/cache-2bc12f15f0f59c69.arrow
Loading cached processed dataset at stories_dataset/cache-c5614a4d49752db2.arrow
Loading cached processed dataset at stories_dataset/cache-403c114bbde4ea44.arrow
Loading cached processed dataset at stories_dataset/cache-3f5a6439e47ee525.arrow
Loading cached processed dataset at stories_dataset/cache-86e3b56172a5f9ed.arrow
Loading cached processed dataset at stories_dataset/cache-8532a8d0da2ea6d0.arrow
Loading cached processed dataset at stories_dataset/cache-a8fd7f04356eeb2f.arrow
Loading cached processed dataset at stories_dataset/cache-18c6ae2df3450b33.arrow
Loading cached processed dataset at stories_dataset/cache-bf49b946fa64edbe.arrow
Loading cached processed dataset at stories_dataset/cache-657169cfa40d34ee.arrow
Loading cached processed dataset at stories_dataset/cache-c026375dcef0b4a7.arrow
Loading cached processed dataset at stories_dataset/cache-5aaab9f8820bb9e7.arrow
Loading cached processed dataset at stories_dataset/cache-d202a7ceb1842a47.arrow
Loading cached processed dataset at stories_dataset/cache-89b88a2e0ccd34b1.arrow
Loading cached processed dataset at stories_dataset/cache-89c0d84eda05cab8.arrow
Loading cached processed dataset at stories_dataset/cache-20ad411473ed2427.arrow
Loading cached split indices for dataset at stories_dataset/cache-f3f5e12ddda3ec81.arrow and stories_dataset/cache-08e4026be548935f.arrow
[2021-02-05 22:20:48,368] [INFO] [logging.py:60:log_dist] [Rank -1] DeepSpeed info: version=0.3.10+981bc7d, git-hash=981bc7d, git-branch=HEAD
[2021-02-05 22:20:48,368] [INFO] [distributed.py:29:init_distributed] Not using the DeepSpeed or torch.distributed launchers, attempting to detect MPI environment...
--------------------------------------------------------------------------
[[26768,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: gorgon

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[2021-02-05 22:20:49,134] [INFO] [distributed.py:73:mpi_discovery] Discovered MPI settings of world_rank=0, local_rank=0, world_size=1, master_addr=192.168.122.249, master_port=29500
[2021-02-05 22:20:49,135] [INFO] [distributed.py:39:init_distributed] Initializing torch distributed with backend: nccl
[2021-02-05 22:20:49,207] [INFO] [engine.py:72:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
Using /home/collin/.cache/torch_extensions as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/collin/.cache/torch_extensions/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /usr/local/cuda-11.1/bin/nvcc -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-11.1/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/TH -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.1/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_86,code=sm_86 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++14 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_86,code=compute_86 -c /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
[2/3] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda-11.1/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/TH -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/THC -isystem /usr/local/cuda-11.1/include -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -L/usr/local/cuda-11.1/lib64 -lcudart -lcublas -g -Wno-reorder -march=native -fopenmp  -c /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[3/3] c++ cpu_adam.o custom_cuda_kernel.cuda.o -shared -L/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda-11.1/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 17.01028299331665 seconds
Adam Optimizer #0 is created with scalar arithmetic capability.
Config: alpha=0.000050, betas=(0.800000, 0.999000), weight_decay=0.010000, adam_w=1
[2021-02-05 22:21:08,720] [INFO] [engine.py:533:_configure_optimizer] Using DeepSpeed Optimizer param name adam as basic optimizer
[2021-02-05 22:21:08,721] [INFO] [engine.py:538:_configure_optimizer] DeepSpeed Basic Optimizer = DeepSpeedCPUAdam (
Parameter Group 0
    amsgrad: False
    betas: [0.8, 0.999]
    bias_correction: True
    eps: 1e-08
    lr: 5e-05
    weight_decay: 0.01
)
Checking ZeRO support for optimizer=DeepSpeedCPUAdam type=<class 'deepspeed.ops.adam.cpu_adam.DeepSpeedCPUAdam'>
[2021-02-05 22:21:08,721] [INFO] [engine.py:655:_configure_zero_optimizer] Creating fp16 ZeRO stage 2 optimizer
Using /home/collin/.cache/torch_extensions as PyTorch extensions root...
Emitting ninja build file /home/collin/.cache/torch_extensions/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/2] c++ -MMD -MF flatten_unflatten.o.d -DTORCH_EXTENSION_NAME=utils -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/TH -isystem /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/include/THC -isystem /usr/include/python3.8 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -c /home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp -o flatten_unflatten.o 
[2/2] c++ flatten_unflatten.o -shared -L/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o utils.so
Loading extension module utils...
Time to load utils op: 8.627959728240967 seconds
[2021-02-05 22:21:17,349] [INFO] [stage2.py:130:__init__] Reduce bucket size 500000000.0
[2021-02-05 22:21:17,349] [INFO] [stage2.py:131:__init__] Allgather bucket size 500000000.0
[2021-02-05 22:21:17,349] [INFO] [stage2.py:132:__init__] CPU Offload: True
group 0 param 0 = 1557614400
[2021-02-05 22:21:33,802] [INFO] [stage2.py:399:__init__] optimizer state initialized
[2021-02-05 22:21:33,936] [INFO] [engine.py:568:_configure_optimizer] DeepSpeed Final Optimizer = <deepspeed.runtime.zero.stage2.FP16_DeepSpeedZeroOptimizer object at 0x7fbc2f6e2790>
[2021-02-05 22:21:33,941] [INFO] [engine.py:398:_configure_lr_scheduler] DeepSpeed using configured LR scheduler = WarmupDecayLR
[2021-02-05 22:21:33,941] [INFO] [logging.py:60:log_dist] [Rank 0] DeepSpeed LR Scheduler = <deepspeed.runtime.lr_schedules.WarmupDecayLR object at 0x7fbc2c570610>
[2021-02-05 22:21:33,942] [INFO] [logging.py:60:log_dist] [Rank 0] step=0, skipped=0, lr=[5e-05], mom=[[0.8, 0.999]]
[2021-02-05 22:21:33,942] [INFO] [config.py:708:print] DeepSpeedEngine configuration:
[2021-02-05 22:21:33,942] [INFO] [config.py:712:print]   activation_checkpointing_config  <deepspeed.runtime.activation_checkpointing.config.DeepSpeedActivationCheckpointingConfig object at 0x7fbc2f722bb0>
[2021-02-05 22:21:33,942] [INFO] [config.py:712:print]   allreduce_always_fp32 ........ False
[2021-02-05 22:21:33,942] [INFO] [config.py:712:print]   amp_enabled .................. False
[2021-02-05 22:21:33,942] [INFO] [config.py:712:print]   amp_params ................... False
[2021-02-05 22:21:33,942] [INFO] [config.py:712:print]   disable_allgather ............ False
[2021-02-05 22:21:33,942] [INFO] [config.py:712:print]   dump_state ................... False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   dynamic_loss_scale_args ...... None
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   elasticity_enabled ........... False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   flops_profiler_config ........ <deepspeed.profiling.config.DeepSpeedFlopsProfilerConfig object at 0x7fbc2f722a60>
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   fp16_enabled ................. True
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   global_rank .................. 0
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   gradient_accumulation_steps .. 1
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   gradient_clipping ............ 1.0
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   gradient_predivide_factor .... 1.0
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   initial_dynamic_scale ........ 4294967296
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   loss_scale ................... 0
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   memory_breakdown ............. False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   optimizer_legacy_fusion ...... False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   optimizer_name ............... adam
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   optimizer_params ............. {'lr': 5e-05, 'betas': [0.8, 0.999], 'eps': 1e-08, 'weight_decay': 0.01}
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   pipeline ..................... {'stages': 'auto', 'partition': 'best', 'seed_layers': False, 'activation_checkpoint_interval': 0}
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   pld_enabled .................. False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   pld_params ................... False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   prescale_gradients ........... False
[2021-02-05 22:21:33,943] [INFO] [config.py:712:print]   scheduler_name ............... WarmupDecayLR
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   scheduler_params ............. {'last_batch_iteration': -1, 'total_num_steps': 6000000, 'warmup_min_lr': 0, 'warmup_max_lr': 5e-05, 'warmup_num_steps': 120000}
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   sparse_attention ............. None
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   sparse_gradients_enabled ..... False
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   steps_per_print .............. 10
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   tensorboard_enabled .......... False
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   tensorboard_job_name ......... DeepSpeedJobName
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   tensorboard_output_path ...... 
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   train_batch_size ............. 1
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   train_micro_batch_size_per_gpu  1
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   wall_clock_breakdown ......... False
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   world_size ................... 1
[2021-02-05 22:21:33,944] [INFO] [config.py:712:print]   zero_allow_untested_optimizer  False
[2021-02-05 22:21:33,945] [INFO] [config.py:712:print]   zero_config .................. {
    "allgather_bucket_size": 500000000.0,
    "allgather_partitions": true,
    "contiguous_gradients": true,
    "cpu_offload": true,
    "elastic_checkpoint": true,
    "load_from_fp32_weights": true,
    "overlap_comm": true,
    "reduce_bucket_size": 500000000.0,
    "reduce_scatter": true,
    "stage": 2
}
[2021-02-05 22:21:33,945] [INFO] [config.py:712:print]   zero_enabled ................. True
[2021-02-05 22:21:33,945] [INFO] [config.py:712:print]   zero_optimization_stage ...... 2
[2021-02-05 22:21:33,947] [INFO] [config.py:714:print]   json = {
    "fp16":{
        "enabled":true
    },
    "gradient_accumulation_steps":1,
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "betas":[
                0.8,
                0.999
            ],
            "eps":1e-08,
            "lr":5e-05,
            "weight_decay":0.01
        },
        "type":"Adam"
    },
    "scheduler":{
        "params":{
            "last_batch_iteration":-1,
            "total_num_steps":6000000,
            "warmup_max_lr":5e-05,
            "warmup_min_lr":0,
            "warmup_num_steps":120000
        },
        "type":"WarmupDecayLR"
    },
    "train_micro_batch_size_per_gpu":1,
    "zero_optimization":{
        "allgather_bucket_size":500000000.0,
        "allgather_partitions":true,
        "contiguous_gradients":true,
        "cpu_offload":true,
        "overlap_comm":true,
        "reduce_bucket_size":500000000.0,
        "reduce_scatter":true,
        "stage":2
    }
}
Using /home/collin/.cache/torch_extensions as PyTorch extensions root...
No modifications detected for re-loaded extension module utils, skipping build step...
Loading extension module utils...
Time to load utils op: 0.0074961185455322266 seconds
  0%|                                                                                                                            | 0/6000000 [00:00<?, ?it/s][2021-02-05 22:21:36,923] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4294967296, reducing to 2147483648.0
  0%|                                                                                                               | 1/6000000 [00:02<4537:47:13,  2.72s/it][2021-02-05 22:21:38,502] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2147483648.0, reducing to 1073741824.0
  0%|                                                                                                               | 2/6000000 [00:04<3965:43:36,  2.38s/it][2021-02-05 22:21:39,799] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1073741824.0, reducing to 536870912.0
  0%|                                                                                                               | 3/6000000 [00:05<3424:30:09,  2.05s/it][2021-02-05 22:21:41,081] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 536870912.0, reducing to 268435456.0
  0%|                                                                                                               | 4/6000000 [00:06<3038:14:26,  1.82s/it][2021-02-05 22:21:42,345] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 268435456.0, reducing to 134217728.0
  0%|                                                                                                               | 5/6000000 [00:08<2758:27:05,  1.66s/it][2021-02-05 22:21:43,682] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 134217728.0, reducing to 67108864.0
  0%|                                                                                                               | 6/6000000 [00:09<2600:13:29,  1.56s/it][2021-02-05 22:21:44,951] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 67108864.0, reducing to 33554432.0
  0%|                                                                                                               | 7/6000000 [00:10<2454:04:31,  1.47s/it][2021-02-05 22:21:46,209] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 33554432.0, reducing to 16777216.0
  0%|                                                                                                               | 8/6000000 [00:12<2346:36:00,  1.41s/it][2021-02-05 22:21:47,470] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 16777216.0, reducing to 8388608.0
  0%|                                                                                                               | 9/6000000 [00:13<2273:24:11,  1.36s/it][2021-02-05 22:21:48,799] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 8388608.0, reducing to 4194304.0
  0%|                                                                                                              | 10/6000000 [00:14<2255:30:51,  1.35s/it][2021-02-05 22:21:50,084] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 4194304.0, reducing to 2097152.0
  0%|                                                                                                              | 11/6000000 [00:15<2222:12:57,  1.33s/it][2021-02-05 22:21:51,367] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 2097152.0, reducing to 1048576.0
  0%|                                                                                                              | 12/6000000 [00:17<2196:35:21,  1.32s/it][2021-02-05 22:21:52,624] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1048576.0, reducing to 524288.0
  0%|                                                                                                              | 13/6000000 [00:18<2165:49:03,  1.30s/it][2021-02-05 22:21:53,882] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 524288.0, reducing to 262144.0
  0%|                                                                                                              | 14/6000000 [00:19<2145:17:50,  1.29s/it][2021-02-05 22:21:55,148] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
  0%|                                                                                                              | 15/6000000 [00:20<2134:43:51,  1.28s/it][2021-02-05 22:21:56,418] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 131072.0, reducing to 65536.0
  0%|                                                                                                              | 16/6000000 [00:22<2129:18:34,  1.28s/it][2021-02-05 22:21:57,677] [INFO] [stage2.py:1357:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
  0%|                                                                                                              | 17/6000000 [00:23<2120:02:35,  1.27s/it]zsh: segmentation fault (core dumped)  python trainer.py

Watching htop during the segmentation fault, I can see my CPU RAM fill up completely.

However, if I switch to using AdamW with the DeepSpeed commit you mentioned, I get the following error:

AssertionError: You are using an untested ZeRO Optimizer. Please add <"zero_allow_untested_optimizer": true> in the configuration file to use it.

After adding that, I get the following error and training never starts:

[2021-02-05 23:43:50,899] [INFO] [stage2.py:130:__init__] Reduce bucket size 200000000.0
[2021-02-05 23:43:50,899] [INFO] [stage2.py:131:__init__] Allgather bucket size 200000000.0
[2021-02-05 23:43:50,899] [INFO] [stage2.py:132:__init__] CPU Offload: True
Traceback (most recent call last):
  File "trainer.py", line 66, in <module>
    trainer.train()
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/transformers/trainer.py", line 831, in train
    model, optimizer, lr_scheduler = init_deepspeed(
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/transformers/integrations.py", line 381, in init_deepspeed
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/__init__.py", line 110, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 173, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 550, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 672, in _configure_zero_optimizer
    optimizer = FP16_DeepSpeedZeroOptimizer(
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 395, in __init__
    self.initialize_optimizer_states()
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/deepspeed/runtime/zero/stage2.py", line 421, in initialize_optimizer_states
    self.optimizer.step()
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 26, in decorate_context
    return func(*args, **kwargs)
  File "/home/collin/.virtualenvs/litai/lib/python3.8/site-packages/torch/optim/adamw.py", line 112, in step
    denom = (exp_avg_sq.sqrt() / math.sqrt(bias_correction2)).add_(group['eps'])
RuntimeError: [enforce fail at CPUAllocator.cpp:65] . DefaultCPUAllocator: can't allocate memory: you tried to allocate 6230457600 bytes. Error code 12 (Cannot allocate memory)

I have 45 GB of CPU RAM.
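
A quick sanity check on that allocation size (my own arithmetic, assuming the numbers in the log above): 6,230,457,600 bytes is exactly the 1,557,614,400-parameter group times 4 bytes, i.e. one more full fp32 buffer being allocated over the whole parameter group on top of what is already resident.

// Sanity check on the failed allocation size (my arithmetic, not from the
// traceback): it is exactly one fp32 buffer over the whole parameter group.
#include <cstdint>
#include <cstdio>

int main() {
    const std::int64_t group_params = 1557614400;  // "group 0 param 0" in the log
    const std::int64_t failed_alloc = 6230457600;  // bytes from the CPUAllocator error
    std::printf("group_params * 4 bytes = %lld\n",
                static_cast<long long>(group_params * 4));
    std::printf("matches failed alloc   = %s\n",
                group_params * 4 == failed_alloc ? "yes" : "no");
    return 0;
}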

@stas00
Collaborator

stas00 commented Feb 5, 2021

It's 17 because that's when your CPU Adam optimizer actually runs for the first time, and that's when it crashes. Because of overflow, the first dozen or so steps are skipped. I was hitting steps 22 and 25 in other cases before segfaulting.
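
For illustration, a rough sketch of the dynamic loss scaling that causes those skips (not DeepSpeed's actual code; the threshold below is made up to match the log): the scale starts at 2^32 and is halved on every overflowed step, and the real optimizer.step(), which is where CPU Adam runs, only happens once the gradients come back finite.

// Rough sketch of dynamic loss scaling (illustrative, not DeepSpeed's code).
// The scale starts huge and is halved on every overflowed step; the real
// optimizer step, and therefore CPU Adam, only runs once the scaled
// gradients are finite, which in the log above is around step 17.
#include <cstdio>

int main() {
    double loss_scale = 4294967296.0;           // 2^32 initial scale from the log
    const double first_finite_scale = 32768.0;  // pretend grads overflow above this
    for (int step = 0; ; ++step) {
        if (loss_scale > first_finite_scale) {
            std::printf("step %2d: OVERFLOW, reducing scale to %.1f (optimizer skipped)\n",
                        step, loss_scale / 2);
            loss_scale /= 2;
        } else {
            std::printf("step %2d: gradients finite at scale %.1f -> CPU Adam runs\n",
                        step, loss_scale);
            break;
        }
    }
    return 0;
}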

Most likely it has to do with you running on AMD, correct?

For a temp solution please see: huggingface/transformers#9996 (comment)

The Deepspeed team is aware of the problem and working on solving this.

@collinarnett

Yes, I'm running on AMD.

Since I get the allocator issue with AdamW (which requires the "zero_allow_untested_optimizer" flag), does that mean I just have to wait until this is fixed?

If I use Adam with the rollback you mentioned, I still get the same segfault.

@stas00
Collaborator

stas00 commented Feb 6, 2021

Thank you for confirming that it's indeed an issue on the AMD platform, @collinarnett

I'm just a messenger at the moment and will let the DeepSpeed folks chime in on the specifics. But it looks like there is some bug in DeepSpeed's CPU Adam that happens only on the AMD platform. So until it's fixed, using the torch optimizer should work, that is:

   "optimizer": {
     "type": "Adam",
     "params": {
       "torch_adam": true,
       "lr": 3e-5,
       "betas": [
         0.8,
         0.999
       ],
       "eps": 1e-8,
       "weight_decay": 3e-7
     }
   },

It will be much slower, though.

The specific commit at which the breakage was first detected is the one where DeepSpeed's AdamW was made directly available, e51e4d7. The devs are trying to figure out why this commit is the cause, since supposedly the optimizer worked before and just required a whole bunch of params to enable it (zero_allow_untested_optimizer, etc.).

That's all I know as of this moment. I will keep posting updates in this thread huggingface/transformers#9996 as I get more info, so you may choose to track it.

@jeffbinder
Author

For the record, when I finally got it to run, the CPU Adam optimizer used almost 54GB of CPU memory to train gpt2-xl. I'm on AMD, too.
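
As a back-of-envelope for where that memory goes (an estimate only, assuming ZeRO-2 CPU offload keeps the fp32 master weights, fp32 gradients, and both Adam moments in host RAM):

// Rough estimate of the host memory ZeRO-2 CPU offload needs for gpt2-xl
// (assumption: fp32 master weights, fp32 gradients, and the two Adam moment
// buffers all live in CPU RAM at 4 bytes per parameter each).
#include <cstdio>

int main() {
    const double params = 1557611200.0;                    // gpt2-xl
    const double bytes_per_param = 4.0 + 4.0 + 4.0 + 4.0;  // weights + grads + exp_avg + exp_avg_sq
    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("offloaded optimizer state: %.1f GiB\n", params * bytes_per_param / gib);
    // ~23 GiB as a floor; the rest of the ~54 GB observed presumably goes to
    // pinned staging buffers, fp16 copies, and general framework overhead.
    return 0;
}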

@stas00
Collaborator

stas00 commented Feb 6, 2021

As I shared here: huggingface/transformers#9996 (comment) my CPU RAM usage stats are:

  • for t5-3b on 1x 24GB gpu: ~71GB RAM
  • for t5-11b on 1x 40GB gpu: ~234GB RAM

@jeffbinder
Author

@stas00 I finally got around to doing some more experiments. You're right: I don't get the segfault with the parent of e51e4d7. However, that version runs much slower, around 3.89s/it. I can get it to run on e51e4d7 if I modify the value of TILE, and the training time is about half as long.

@stas00
Collaborator

stas00 commented Feb 8, 2021

I can get it to run on e51e4d7 if I modify the value of TILE, and the training time is about half as long.

What's the exact change that you made, @jeffbinder? Perhaps that would be useful to the developers.

For some reason the core dump I get has its backtrace completely corrupted. The backtrace would be very helpful, but they say they have an AMD machine to test this on, so hopefully they will sort it out.

If you manage to get one and post it here, that would help. Just in case you don't know how, here is one such guide: https://jvns.ca/blog/2018/04/28/debugging-a-segfault-on-linux/
But since you were messing with TILE, I'm sure this is not new to you.

@jeffbinder
Author

Here is a diff:

--- a/csrc/includes/cpu_adam.h
+++ b/csrc/includes/cpu_adam.h
@@ -20,7 +20,7 @@
         }                                                                                      \
     }
 
-#define TILE (1024 * 1024 * 1024)
+#define TILE 1557611200
 
 #if defined(__AVX512__)
 #define SIMD_STORE(a, d) _mm512_storeu_ps(a, d)

1557611200 is the number of parameters in gpt2-xl. In itself, this patch isn't really a solution because it's specific to one model. If you're training another large model, you'd have to change it to a different number.

I'll see if I can get a proper stack trace when I have the time. The crash was, I believe, occurring in Adam_Optimizer::Step at cpu_adam.cpp:134.

@jeffbinder
Author

jeffbinder commented Feb 8, 2021

I read the code a bit more carefully and I think I can see why this segfault is only happening on AMD systems. If AVX512 or AVX256 instructions are available, Adam_Optimizer::Step copies the data in blocks, one TILE at a time, and then runs an extra loop to copy the remainder. If __AVX512__ and __AVX256__ are undefined (which appears to be the case on my system), then it just uses that last loop to copy all the data. But that loop tries to store the parameters in one half of _doubled_buffer, which is not big enough to handle models that exceed the size of TILE.

The way this loop works also appears to result in a difference in behavior between Intel and AMD. In the AVX code, launch_param_update is called once per TILE. However, without AVX, it only ends up being called once at the end, regardless of how big the parameter size is.

My hacky solution of changing the value of TILE is not the right answer, but I wonder if it might be possible to save some CPU memory, in addition to fixing the segfault, by changing the way the buffer is allocated.

There also appears to be an issue with how the availability of AVX instructions is being determined. I have a Zen 2 processor that is supposed to have AVX256, but it doesn't seem to be detected. I'm not sure if that's an issue with DeepSpeed or a configuration problem on my end, though.
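
One way to separate "the CPU can't do it" from "the build didn't enable it" is a quick check like the following (GCC/Clang only; this doesn't show how DeepSpeed sets its own __AVX256__/__AVX512__ macros, it just rules the hardware in or out):

// Distinguish hardware capability from compiler flags (GCC/Clang).
// __builtin_cpu_supports queries the CPU at runtime, while the __AVX2__ /
// __AVX512F__ macros only reflect what this file was compiled with
// (e.g. via -march=native).
#include <cstdio>

int main() {
    std::printf("CPU supports AVX2    : %s\n", __builtin_cpu_supports("avx2") ? "yes" : "no");
    std::printf("CPU supports AVX-512F: %s\n", __builtin_cpu_supports("avx512f") ? "yes" : "no");
#if defined(__AVX512F__)
    std::printf("compiled with AVX-512 enabled\n");
#elif defined(__AVX2__)
    std::printf("compiled with AVX2 enabled\n");
#else
    std::printf("compiled without AVX2/AVX-512 (scalar path)\n");
#endif
    return 0;
}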

@jeffra jeffra linked a pull request Feb 8, 2021 that will close this issue
@RezaYazdaniAminabadi
Contributor

Hi @jeffbinder, @stas00, and @collinarnett

Thank you for the constructive discussion on this issue and how to solve it.
The solution that Jeff mentioned is a plausible one when the number of parameters is not huge and there is enough memory to allocate the doubled buffers. These buffers help to copy the parameters in a tiled manner and are not intended to be as big as the model size. However, there is a bug in the CPU Adam implementation for scalar mode: these buffers are indexed beyond the TILE dimension. So, I have modified the code in this PR to keep the copy tiled when we are in scalar mode.
Could you please try this out and let me know if the problem is solved?
Thanks a lot!

Reza
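
For readers following along, the shape of a tiled scalar copy is roughly the following (a sketch under my own assumptions, not the code in the linked PR): the staging buffer is only ever indexed with an in-tile offset, and each tile is flushed to the device before the next one is filled.

// Sketch of a tiled scalar parameter copy (illustrative, not the PR's code).
// The staging buffer is indexed only with an in-tile offset, so the total
// parameter count may exceed the tile size without overrunning the buffer.
#include <algorithm>
#include <cstdio>
#include <vector>

static void tiled_copy(const std::vector<float>& params, std::size_t tile,
                       std::vector<float>& staging) {
    staging.resize(std::min(tile, params.size()));
    for (std::size_t start = 0; start < params.size(); start += tile) {
        const std::size_t count = std::min(tile, params.size() - start);
        for (std::size_t i = 0; i < count; ++i) {
            staging[i] = params[start + i];  // index always stays < tile
        }
        // the real optimizer would push this tile to the GPU here and then
        // reuse the buffer for the next tile
        std::printf("copied tile [%zu, %zu)\n", start, start + count);
    }
}

int main() {
    std::vector<float> params(10, 1.0f);  // stand-in for the 1.55B parameters
    std::vector<float> staging;
    tiled_copy(params, 4, staging);       // stand-in for TILE
    return 0;
}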

@jeffbinder
Author

Hi @RezaYazdaniAminabadi,
Yes, it works on my system! Thanks for addressing this.
