
[BUG] Error while training with Deepspeed #4295

Closed
karths8 opened this issue Sep 8, 2023 · 18 comments
@karths8

karths8 commented Sep 8, 2023

Describe the bug
DeepSpeed runs into a bug while training a CodeLlama-34B model with QLoRA using this script.

To Reproduce
Run the script with the DeepSpeed config file passed in the params. The DeepSpeed config I used is given below:

{
  "bf16": {
    "enabled": "auto"
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 16777216,
    "stage3_prefetch_bucket_size": 15099494.4,
    "stage3_param_persistence_threshold": 40960,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Expected behavior

The expected behavior is that DeepSpeed trains without any errors. Instead, the following error (RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>) pops up, with the traceback given below:

[2023-09-08 19:26:04,877] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:07,007] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-08 19:26:07,007] [INFO] [runner.py:570:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_llama2_codegen.py --bf16 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --model_name /workspace/CodeLlama-34b-Python-hf --dataset_name llama_data --save_steps 100 --num_train_epochs 2 --learning_rate 2e-5 --weight_decay 0.01 --lora_alpha 256 --lora_r 32 --use_qlora True --max_seq_length 8192 --run_name CodeGen-34B-combined-train --deepspeed_path /workspace/deepspeed_config_stage3.json
[2023-09-08 19:26:08,340] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:10,452] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-09-08 19:26:10,452] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-08 19:26:10,452] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-08 19:26:10,452] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-08 19:26:10,452] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-09-08 19:26:12,700] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:12,745] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:19,891] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:26:19,891] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-08 19:26:20,266] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:27:56,714] [INFO] [partition_parameters.py:342:__exit__] finished initializing model - num_params = 435, num_elems = 33.74B
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.65s/it]
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.61s/it]
trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map:   0%|                                 | 0/2787 [00:00<?, ? examples/s]trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 630.71 examples/s]
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 656.34 examples/s]
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu118/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 20.74338674545288 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 12.17668604850769 seconds
Parameter Offload: Total persistent parameters: 2367488 in 145 params
You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f74ee477ed0>
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.10.3+542dc0d5, 542dc0d, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 188.00 GB

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 4x A100 80GB
  • Interconnects (if applicable): N/A
  • Python version: 3.10

Launcher context
Used the DeepSpeed launcher with the Hugging Face integration.

@karths8 karths8 added bug Something isn't working training labels Sep 8, 2023
@loadams loadams self-assigned this Sep 15, 2023
@hamelsmu

hamelsmu commented Oct 8, 2023

I ran into this exact same issue as well.

@tjruwase tjruwase assigned tohtana and unassigned loadams Oct 12, 2023
@Aillian

Aillian commented Oct 15, 2023

any solution?

@tohtana
Contributor

tohtana commented Oct 16, 2023

Some code in ZeRO-3 assumes that all parameters in a model have the same dtype. This model has both uint8 and float32 parameters, which triggers the error.
Let us consider how we can fix this.
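
For context, here is a minimal, illustrative sketch (not code from this thread) of how a 4-bit QLoRA model ends up with mixed parameter dtypes, which is exactly what trips the get_only_unique_item check inside all_gather_coalesced. ToyQLoRALinear is a made-up stand-in, not a real peft/bitsandbytes module:

import torch
import torch.nn as nn

# Toy stand-in for a QLoRA-style layer: the frozen base weight is stored as a
# quantized uint8 tensor, while the trainable LoRA adapters stay in float32.
class ToyQLoRALinear(nn.Module):
    def __init__(self, in_f, out_f, r=8):
        super().__init__()
        self.base_weight = nn.Parameter(
            torch.zeros(out_f, in_f, dtype=torch.uint8), requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(r, in_f, dtype=torch.float32))
        self.lora_B = nn.Parameter(torch.zeros(out_f, r, dtype=torch.float32))

model = ToyQLoRALinear(16, 16)
dtypes = {p.dtype for p in model.parameters()}
print(dtypes)  # {torch.uint8, torch.float32}

# ZeRO-3 required all parameters in a coalesced all-gather to share one dtype,
# hence the RuntimeError reported above when the set has two elements.
if len(dtypes) != 1:
    print(f"would raise: expected there to be only one unique element in {dtypes}")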

@zyzfred

zyzfred commented Oct 30, 2023

Some code in ZeRO-3 assumes that all parameters in a model have the same dtype. This model has both uint8 and float32 parameters, which triggers the error. Let us consider how we can fix this.

Have you fixed this problem yet?

@noobmaster29

I have the same issue. I've attached my DeepSpeed config file. I'm running my training through the Axolotl library.

ds_config_zero3.json

@tohtana
Contributor

tohtana commented Nov 8, 2023

I submitted #4647 to address this issue. It is working in my environment.
I would appreciate it if anyone could try it.

github-merge-queue bot pushed a commit that referenced this issue Nov 8, 2023
This PR addresses an error reported in #4295.
When parameters in multiple data types are given, DeepSpeed performs
allgather for each data type.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
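
Schematically, the change described in that commit message amounts to bucketing the parameters by dtype and running the collective once per bucket. The sketch below only illustrates that idea (group_params_by_dtype is a made-up helper, not DeepSpeed's actual internal code):

from collections import defaultdict
import torch

def group_params_by_dtype(params):
    # Bucket parameters by dtype so each bucket can be all-gathered separately.
    buckets = defaultdict(list)
    for p in params:
        buckets[p.dtype].append(p)
    return buckets

# Example: a mix of uint8 (quantized base weights) and float32 (LoRA) tensors.
params = [torch.zeros(4, dtype=torch.uint8), torch.zeros(4, dtype=torch.float32)]
for dtype, bucket in group_params_by_dtype(params).items():
    print(dtype, len(bucket))  # each bucket would get its own all-gather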
@momozzing

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

Thank you for #4647!!
It works well in my environment, too!

@momozzing

momozzing commented Nov 9, 2023

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

Hi tohtana, I found an issue.

I switched to your code and training went fine,
but the LoRA weights didn't match when I ran inference.

model.save_pretrained(my_model) -> adapter_model.bin size -> 163KB.

I think the LoRA weights were not saved.

How can I solve this problem?

size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 13824]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).

@tohtana
Contributor

tohtana commented Nov 9, 2023

Hi @momozzing, can you share the code to reproduce this?

@momozzing

Hi @momozzing, can you share the code to reproduce this?

OK, my baseline model is LLaMA.

ZeRO stage 2 works well with this code. However, ZeRO stage 3 does not work.

Code

# Imports assumed by this snippet (not shown in the original comment):
import torch
import deepspeed
import bitsandbytes as bnb
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, LlamaConfig)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

tokenizer = AutoTokenizer.from_pretrained(config["model"]["tokenizer_path"], eos_token='<|endoftext|>', add_bos_token=False)

model_config = LlamaConfig.from_pretrained(config["model"]["model_path"])
model_config.eos_token_id = tokenizer.eos_token_id
model_config.use_cache = False

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    config["model"]["model_path"],
    config=model_config,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=config["lora"]["r"],
    lora_alpha=config["lora"]["lora_alpha"],
    target_modules=config["lora"]["target_modules"],
    lora_dropout=config["lora"]["lora_dropout"],
    bias=config["lora"]["bias"],
    task_type=config["lora"]["task_type"],
)

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()
model = prepare_model_for_kbit_training(model)

## load lora
model = get_peft_model(model, lora_config)

optimizer = bnb.optim.PagedAdam32bit(model.parameters(), lr=2e-4, betas=(0.9, 0.999))  # equivalent
print_rank_0(config, f"Trainable_parameters: {get_trainable_parameters(model)}", config["global_rank"])

model, _, _, _ = deepspeed.initialize(
    model=model,
    args={"local_rank": config["local_rank"], "global_rank": config["global_rank"]},
    config=config["ds_config"],
    optimizer=optimizer,
)

ds_config_zero3

  "ds_config":{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "bf16": {
    "enabled": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 10000
    }
  },
  "zero_optimization": {
    "stage": 3,   
    "allgather_partitions": true,
    "allgather_bucket_size":2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 2e9,
    "stage3_max_reuse_distance": 2e9,
    "stage3_gather_16bit_weights_on_model_save": true    
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 100000
  }
}

ds_config_zero2

  "ds_config":{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "bf16": {
    "enabled": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 10000
    }
  },
  "zero_optimization": {
    "stage": 2,   
    "allgather_partitions": true,
    "allgather_bucket_size":2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,  
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 100000
  }
}

@tohtana
Contributor

tohtana commented Nov 14, 2023

Hi @momozzing,
It appears that a ZeRO-3 checkpoint is partitioned, so we'll need to use DeepSpeed's loading function for it. You can find more information in the documentation.

Also, the error you mentioned seems to be distinct from the initial problem. If it persists, I suggest creating a new issue to address it.
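
As a rough sketch of what "use DeepSpeed's loading function" means in practice (the helper below, its paths, and the tag are illustrative assumptions; engine is the object returned by deepspeed.initialize()):

def save_and_reload(engine, save_dir="checkpoints/zero3-run", tag="step_100"):
    # Save: each rank writes its own partitioned model/optimizer shards.
    engine.save_checkpoint(save_dir, tag=tag)
    # Load: go back through DeepSpeed so the partitioned shards are routed to
    # their owning ranks; a plain torch.load() on the *_model_states.pt files
    # would not reassemble the full parameters.
    load_path, client_state = engine.load_checkpoint(save_dir, tag=tag)
    return load_path, client_state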

@momozzing

Hi @tohtana,
Thank you for your answer.

I'm using this code:
deepspeed.DeepSpeedEngine.save_checkpoint(save_dir=save_dir, exclude_frozen_parameters=True)

but save_checkpoint only seems to save the optimizer state; the model state is not saved.

-rw-rw-r-- 1 519K 09:32 zero_pp_rank_0_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 519K 09:32 zero_pp_rank_1_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_1_mp_rank_00_optim_states.pt

When I save the trained model, there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).

Is there any way to save LoRA's trained weights?

@tohtana
Contributor

tohtana commented Nov 15, 2023

Hi @momozzing,
I haven't run the code, but isn't zero_pp_rank_0_mp_rank_00_model_states.pt the model state?
Since you specified exclude_frozen_parameters=True, it only has parameters that are trained for LoRA.

You can find an example of the combination of ZeRO-3 and LoRA in DeepSpeed-Chat. In the following example, it saves all the parameters, including the LoRA ones.
https://github.com/microsoft/DeepSpeedExamples/blob/ccb2a3400a05ea075b643bb3aeabb02f9883c5da/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L385
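
For reference, with "stage3_gather_16bit_weights_on_model_save": true (as in the config at the top of this issue), the engine can also consolidate and write the full 16-bit weights itself. A minimal sketch, with an assumed function name and output directory:

def save_full_16bit_model(engine, output_dir="hf_model_out"):
    # Gathers the ZeRO-3 partitioned parameters and writes a single
    # pytorch_model.bin on rank 0, LoRA adapters included.
    engine.save_16bit_model(output_dir)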

@momozzing

Hi @tohtana

LLaMA + QLoRA without DeepSpeed saves adapter_model.bin at 477 MB.

LLaMA + QLoRA with DeepSpeed ZeRO-2 saves adapter_model.bin at 477 MB.

However, LLaMA + QLoRA with DeepSpeed ZeRO-3 saves adapter_model.bin at 519 KB.

So there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).

Is there any way to save LoRA's trained weights with DeepSpeed ZeRO-3?

Does DeepSpeed ZeRO-3 support bitsandbytes?

@tohtana
Contributor

tohtana commented Nov 15, 2023

Hi @momozzing

ZeRO-3 sets an empty size (torch.Size([0])) on a parameter object and keeps the real tensor data in a different attribute, so we cannot conclude that parameters were not saved just because torch.Size([0]) appears in the error message. ZeRO-3 also saves partitioned parameters, which are in a different format from a normal PyTorch checkpoint, so we need to use DeepSpeed's API to load the checkpoint.
In your code, you use AutoModelForCausalLM.from_pretrained(). This cannot properly load a checkpoint that ZeRO-3 saved.

Here is another example using the HF Trainer and LoRA. This script seems to save parameters properly. Can you check this as well?
https://github.com/tohtana/ds_repro_4295/blob/main/finetune_llama_v2.py
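
To illustrate the point about the empty placeholder: under ZeRO-3 a parameter's visible shape can be torch.Size([0]) until it is explicitly gathered. A small sketch (the helper is hypothetical; deepspeed.zero.GatheredParameters is the documented context manager used here):

import deepspeed

def inspect_param(param):
    # Outside the context, the full tensor is partitioned across ranks,
    # so the parameter may show an empty shape.
    print("before gather:", tuple(param.shape))
    with deepspeed.zero.GatheredParameters([param]):
        # Inside the context, the real data is temporarily materialized.
        print("inside gather:", tuple(param.shape))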

@momozzing

Hi, @tohtana
As you said, using DeepSpeed's API solved the problem.

Here's how I solved it.

# Consolidate the full 16-bit model weights from the ZeRO-3 partitions,
# then pull out just the LoRA weights and save them as adapter_model.bin.
state_dict = self.engine._zero3_consolidated_16bit_state_dict()
lora_state_dict = get_peft_model_state_dict(self.model, state_dict)
self.model.save_pretrained(save_dir)
torch.save(lora_state_dict, os.path.join(save_dir, "adapter_model.bin"))

Thank you very much for your reply.

@tjruwase tjruwase closed this as completed Dec 6, 2023
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue Feb 17, 2024
This PR addresses an error reported in microsoft#4295.
When parameters in multiple data types are given, DeepSpeed performs
allgather for each data type.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
@stas00
Collaborator

stas00 commented Mar 5, 2024

This is a workaround, not a proper solution, as it can be really expensive:

state_dict = self.engine._zero3_consolidated_16bit_state_dict()

get_peft_model_state_dict ideally needs to be fixed to become ZeRO-aware - it will need to handle both DeepSpeed ZeRO and FSDP. In the case of DeepSpeed, it needs to gather the weights the way it is done here:

https://github.com/huggingface/transformers/blob/81c8191b4651de216c00e25e1af607683e980614/src/transformers/modeling_utils.py#L605-L620

This is the efficient way of doing it, as it gathers one layer at a time and incurs little memory overhead.
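
A rough sketch of that layer-at-a-time approach applied to LoRA weights (the names are assumptions and this is not an actual peft patch; it just mirrors the gather-one-module-at-a-time pattern from the linked transformers code):

import deepspeed

def gather_lora_state_dict_layerwise(model):
    # Gather one module's parameters at a time so peak memory stays small,
    # instead of consolidating the whole ZeRO-3 model at once.
    lora_sd = {}
    for module_name, module in model.named_modules():
        params = list(module.parameters(recurse=False))
        if not params:
            continue
        with deepspeed.zero.GatheredParameters(params):
            for param_name, param in module.named_parameters(recurse=False):
                full_name = f"{module_name}.{param_name}" if module_name else param_name
                if "lora_" in full_name:
                    lora_sd[full_name] = param.detach().cpu().clone()
    return lora_sd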

@pacman100
Contributor

pacman100 commented Mar 5, 2024

  1. With zero.init enabled, I get the error below with the latest branches of Accelerate and Transformers and the latest release of DeepSpeed:
File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
        raise ValueError(set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)

  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
ValueError        : raise ValueError(raise ValueError(Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.


ValueErrorValueError: : Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
  2. Below is the memory usage with zero_init=False and QLoRA + DeepSpeed Stage 3 for Llama 70B. GPU memory usage per GPU: 20% of 80 GB = 16 GB per GPU. However, the initial memory per GPU during model loading would be 35 GB (0.5 * 70B) as each GPU loads the pretrained model in 4-bit. If zero_init were enabled with QLoRA, one could fine-tune a 70B model on 8x 24 GB GPUs, which would be great.
    Code: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/sft/training
    Command:
accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-lora-ds" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16"
[Screenshot: GPU memory usage, 2024-03-05]
