
[BUG] Error while training with Deepspeed #4295

Closed
karths8 opened this issue Sep 8, 2023 · 18 comments
@karths8

karths8 commented Sep 8, 2023

Describe the bug
DeepSpeed runs into a bug while training a CodeLlama-34B model with QLoRA using this script.

To Reproduce
Run the script with the DeepSpeed config file passed in the params. The DeepSpeed config I used is given below:

{
  "bf16": {
    "enabled": "auto"
  },
  "scheduler": {
    "type": "WarmupLR",
    "params": {
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": 16777216,
    "stage3_prefetch_bucket_size": 15099494.4,
    "stage3_param_persistence_threshold": 40960,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

Expected behavior

The expected behavior is that DeepSpeed trains without any errors. Instead, the following error (RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>) pops up, with the traceback given below:

[2023-09-08 19:26:04,877] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:07,007] [WARNING] [runner.py:203:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-09-08 19:26:07,007] [INFO] [runner.py:570:main] cmd = /usr/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None finetune_llama2_codegen.py --bf16 --per_device_train_batch_size 1 --per_device_eval_batch_size 2 --model_name /workspace/CodeLlama-34b-Python-hf --dataset_name llama_data --save_steps 100 --num_train_epochs 2 --learning_rate 2e-5 --weight_decay 0.01 --lora_alpha 256 --lora_r 32 --use_qlora True --max_seq_length 8192 --run_name CodeGen-34B-combined-train --deepspeed_path /workspace/deepspeed_config_stage3.json
[2023-09-08 19:26:08,340] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:10,452] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2023-09-08 19:26:10,452] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=2, node_rank=0
[2023-09-08 19:26:10,452] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2023-09-08 19:26:10,452] [INFO] [launch.py:163:main] dist_world_size=2
[2023-09-08 19:26:10,452] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2023-09-08 19:26:12,700] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:12,745] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2023-09-08 19:26:19,891] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:26:19,891] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2023-09-08 19:26:20,266] [INFO] [comm.py:637:init_distributed] cdb=None
[2023-09-08 19:27:56,714] [INFO] [partition_parameters.py:342:__exit__] finished initializing model - num_params = 435, num_elems = 33.74B
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.65s/it]
Loading checkpoint shards: 100%|█████████████| 7/7 [03:06<00:00, 26.61s/it]
trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map:   0%|                                 | 0/2787 [00:00<?, ? examples/s]trainable params: 39,321,600 || all params: 33,783,291,904 || trainable%: 0.11639363064954678
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 630.71 examples/s]
Map: 100%|█████████████████████| 2787/2787 [00:04<00:00, 656.34 examples/s]
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu118/cpu_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/cpu_adam/build.ninja...
Building extension module cpu_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/4] /usr/local/cuda/bin/nvcc  -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 --use_fast_math -std=c++17 -U__CUDA_NO_HALF_OPERATORS__ -U__CUDA_NO_HALF_CONVERSIONS__ -U__CUDA_NO_HALF2_OPERATORS__ -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/common/custom_cuda_kernel.cu -o custom_cuda_kernel.cuda.o 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
[2/4] c++ -MMD -MF cpu_adam_impl.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam_impl.cpp -o cpu_adam_impl.o 
[3/4] c++ -MMD -MF cpu_adam.o.d -DTORCH_EXTENSION_NAME=cpu_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/includes -I/usr/local/cuda/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/torch/csrc/api/include -isystem /usr/local/lib/python3.10/dist-packages/torch/include/TH -isystem /usr/local/lib/python3.10/dist-packages/torch/include/THC -isystem /usr/local/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++17 -O3 -std=c++17 -g -Wno-reorder -L/usr/local/cuda/lib64 -lcudart -lcublas -g -march=native -fopenmp -D__AVX512__ -D__ENABLE_CUDA__ -DBF16_AVAILABLE -c /usr/local/lib/python3.10/dist-packages/deepspeed/ops/csrc/adam/cpu_adam.cpp -o cpu_adam.o 
[4/4] c++ cpu_adam.o cpu_adam_impl.o custom_cuda_kernel.cuda.o -shared -lcurand -L/usr/local/lib/python3.10/dist-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda -ltorch -ltorch_python -L/usr/local/cuda/lib64 -lcudart -o cpu_adam.so
Loading extension module cpu_adam...
Time to load cpu_adam op: 20.74338674545288 seconds
Loading extension module cpu_adam...
Time to load cpu_adam op: 12.17668604850769 seconds
Parameter Offload: Total persistent parameters: 2367488 in 145 params
You're using a CodeLlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7f74ee477ed0>
Traceback (most recent call last):
  File "/workspace/finetune_llama2_codegen.py", line 545, in <module>
    trainer.train()
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1553, in train
    return inner_training_loop(
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 1835, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 2690, in training_step
    self.accelerator.backward(loss)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 1958, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/engine.py", line 1923, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/stage3.py", line 2080, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.10/dist-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py", line 141, in backward
    outputs = ctx.run_function(*detached_inputs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 697, in custom_forward
    return module(*inputs, past_key_value, output_attentions)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 424, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 321, in forward
    query_states = self.q_proj(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    result = hook(self, args)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 392, in _pre_forward_module_hook
    self.pre_sub_module_forward_function(module)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/parameter_offload.py", line 505, in pre_sub_module_forward_function
    param_coordinator.fetch_sub_module(sub_module, forward=prev_grad_state)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 284, in fetch_sub_module
    self.__all_gather_params(params_to_fetch, forward)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 428, in __all_gather_params
    self.__all_gather_params_(nonquantized_params, forward, quantize=self.zero_quantized_weights)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 446, in __all_gather_params_
    handle = partitioned_params[0].all_gather_coalesced(partitioned_params,
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/zero/partition_parameters.py", line 1132, in all_gather_coalesced
    dtype=get_only_unique_item(p.ds_tensor.dtype
  File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/utils.py", line 842, in get_only_unique_item
    raise RuntimeError(f"expected there to be only one unique element in {items}")
RuntimeError: expected there to be only one unique element in <generator object Init._convert_to_deepspeed_param.<locals>.all_gather_coalesced.<locals>.<genexpr> at 0x7ff729d61cb0>

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/usr/local/lib/python3.10/dist-packages/torch']
torch version .................... 2.0.1+cu118
deepspeed install path ........... ['/usr/local/lib/python3.10/dist-packages/deepspeed']
deepspeed info ................... 0.10.3+542dc0d5, 542dc0d, master
torch cuda version ............... 11.8
torch hip version ................ None
nvcc version ..................... 11.8
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.8
shared memory (/dev/shm) size .... 188.00 GB

System info (please complete the following information):

  • OS: Ubuntu 20.04
  • GPU count and types: 4x A100 80GB
  • Interconnects (if applicable): N/A
  • Python version: 3.10

Launcher context
Used the DeepSpeed launcher with the Hugging Face integration.

@karths8 karths8 added bug Something isn't working training labels Sep 8, 2023
@loadams loadams self-assigned this Sep 15, 2023
@hamelsmu

hamelsmu commented Oct 8, 2023

I ran into this exact same issue as well.

@tjruwase tjruwase assigned tohtana and unassigned loadams Oct 12, 2023
@Aillian

Aillian commented Oct 15, 2023

any solution?

@tohtana
Contributor

tohtana commented Oct 16, 2023

Some code in ZeRO-3 assumes that all parameters in a model have the same dtype. This model has both uint8 and float32 parameters, which triggers the error.
Let us consider how we can fix this.
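
For context, here is a minimal, illustrative sketch (not code from this thread) of how a 4-bit QLoRA model ends up with mixed parameter dtypes, which is exactly what trips the get_only_unique_item check inside all_gather_coalesced. ToyQLoRALinear is a made-up stand-in, not a real peft/bitsandbytes module:

import torch
import torch.nn as nn

# Toy stand-in for a QLoRA-style layer: the frozen base weight is stored as a
# quantized uint8 tensor, while the trainable LoRA adapters stay in float32.
class ToyQLoRALinear(nn.Module):
    def __init__(self, in_f, out_f, r=8):
        super().__init__()
        self.base_weight = nn.Parameter(
            torch.zeros(out_f, in_f, dtype=torch.uint8), requires_grad=False)
        self.lora_A = nn.Parameter(torch.zeros(r, in_f, dtype=torch.float32))
        self.lora_B = nn.Parameter(torch.zeros(out_f, r, dtype=torch.float32))

model = ToyQLoRALinear(16, 16)
dtypes = {p.dtype for p in model.parameters()}
print(dtypes)  # {torch.uint8, torch.float32}

# ZeRO-3 required all parameters in a coalesced all-gather to share one dtype,
# hence the RuntimeError reported above when the set has two elements.
if len(dtypes) != 1:
    print(f"would raise: expected there to be only one unique element in {dtypes}")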

@zyzfred

zyzfred commented Oct 30, 2023

Some code in ZeRO-3 assumes that all parameters in a model have the same dtype. This model has both uint8 and float32 parameters, which triggers the error. Let us consider how we can fix this.

Have you fixed this problem yet?

@noobmaster29

I have the same issue. I've attached my DeepSpeed config file. I'm running my training through the Axolotl library.

ds_config_zero3.json

@tohtana
Contributor

tohtana commented Nov 8, 2023

I submitted #4647 to address this issue. It is working in my environment.
I would appreciate it if anyone could try it.

github-merge-queue bot pushed a commit that referenced this issue Nov 8, 2023
This PR addresses an error reported in #4295.
When parameters in multiple data types are given, DeepSpeed performs
allgather for each data type.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
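
Schematically, the change described in that commit message amounts to bucketing the parameters by dtype and running the collective once per bucket. The sketch below only illustrates that idea (group_params_by_dtype is a made-up helper, not DeepSpeed's actual internal code):

from collections import defaultdict
import torch

def group_params_by_dtype(params):
    # Bucket parameters by dtype so each bucket can be all-gathered separately.
    buckets = defaultdict(list)
    for p in params:
        buckets[p.dtype].append(p)
    return buckets

# Example: a mix of uint8 (quantized base weights) and float32 (LoRA) tensors.
params = [torch.zeros(4, dtype=torch.uint8), torch.zeros(4, dtype=torch.float32)]
for dtype, bucket in group_params_by_dtype(params).items():
    print(dtype, len(bucket))  # each bucket would get its own all-gather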
@momozzing

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

Thank you for #4647!!
It works well in my environment, too!

@momozzing

momozzing commented Nov 9, 2023

I submitted #4647 to address this issue. It is working in my environment. I would appreciate it if anyone could try it.

Hi tohtana, I found an issue.

I switched to your code and training went fine,
but the LoRA weights didn't match when I ran inference.

model.save_pretrained(my_model) -> adapter_model.bin size -> 163KB.

I think the LoRA weights were not saved.

How can I solve this problem?

size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.q_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.k_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.v_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.self_attn.o_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.gate_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 5120]).
size mismatch for base_model.model.model.layers.10.mlp.up_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([13824, 64]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_A.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([64, 13824]).
size mismatch for base_model.model.model.layers.10.mlp.down_proj.lora_B.default.weight: copying a param with shape torch.Size([0]) from checkpoint, the shape in current model is torch.Size([5120, 64]).

@tohtana
Contributor

tohtana commented Nov 9, 2023

Hi @momozzing, can you share the code to reproduce this?

@momozzing

Hi @momozzing, can you share the code to reproduce this?

OK, my baseline model is LLaMA.

ZeRO stage 2 works well with this code. However, ZeRO stage 3 does not work.

Code

# Imports assumed by this snippet (not shown in the original comment):
import torch
import deepspeed
import bitsandbytes as bnb
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          BitsAndBytesConfig, LlamaConfig)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

tokenizer = AutoTokenizer.from_pretrained(config["model"]["tokenizer_path"], eos_token='<|endoftext|>', add_bos_token=False)

model_config = LlamaConfig.from_pretrained(config["model"]["model_path"])
model_config.eos_token_id = tokenizer.eos_token_id
model_config.use_cache = False

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    config["model"]["model_path"],
    config=model_config,
    quantization_config=bnb_config,
)

lora_config = LoraConfig(
    r=config["lora"]["r"],
    lora_alpha=config["lora"]["lora_alpha"],
    target_modules=config["lora"]["target_modules"],
    lora_dropout=config["lora"]["lora_dropout"],
    bias=config["lora"]["bias"],
    task_type=config["lora"]["task_type"],
)

for param in model.parameters():
    param.requires_grad = False  # freeze the model - train adapters later
    if param.ndim == 1:
        # cast the small parameters (e.g. layernorm) to fp32 for stability
        param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()
model.enable_input_require_grads()
model = prepare_model_for_kbit_training(model)

## load lora
model = get_peft_model(model, lora_config)

optimizer = bnb.optim.PagedAdam32bit(model.parameters(), lr=2e-4, betas=(0.9, 0.999))  # equivalent
print_rank_0(config, f"Trainable_parameters: {get_trainable_parameters(model)}", config["global_rank"])

model, _, _, _ = deepspeed.initialize(
    model=model,
    args={"local_rank": config["local_rank"], "global_rank": config["global_rank"]},
    config=config["ds_config"],
    optimizer=optimizer,
)

ds_config_zero3

  "ds_config":{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "bf16": {
    "enabled": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 10000
    }
  },
  "zero_optimization": {
    "stage": 3,   
    "allgather_partitions": true,
    "allgather_bucket_size":2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 2e9,
    "stage3_max_reuse_distance": 2e9,
    "stage3_gather_16bit_weights_on_model_save": true    
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 100000
  }
}

ds_config_zero2

  "ds_config":{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 4,
  "bf16": {
    "enabled": true
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "warmup_min_lr": 0,
      "warmup_max_lr": 2e-4,
      "warmup_num_steps": 1000,
      "total_num_steps": 10000
    }
  },
  "zero_optimization": {
    "stage": 2,   
    "allgather_partitions": true,
    "allgather_bucket_size":2e8,
    "overlap_comm": true,
    "reduce_scatter": true,
    "reduce_bucket_size": 2e8,
    "contiguous_gradients": true,  
  },
  "zero_allow_untested_optimizer": true,
  "wall_clock_breakdown": false,
  "steps_per_print": 100000
  }
}

@tohtana
Contributor

tohtana commented Nov 14, 2023

Hi @momozzing,
It appears that a ZeRO-3 checkpoint is partitioned, so we'll need to use DeepSpeed's loading function for it. You can find more information in the documentation.

Also, the error you mentioned seems to be distinct from the initial problem. If it persists, I suggest creating a new issue to address it.
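
As a rough sketch of what "use DeepSpeed's loading function" means in practice (the helper below, its paths, and the tag are illustrative assumptions; engine is the object returned by deepspeed.initialize()):

def save_and_reload(engine, save_dir="checkpoints/zero3-run", tag="step_100"):
    # Save: each rank writes its own partitioned model/optimizer shards.
    engine.save_checkpoint(save_dir, tag=tag)
    # Load: go back through DeepSpeed so the partitioned shards are routed to
    # their owning ranks; a plain torch.load() on the *_model_states.pt files
    # would not reassemble the full parameters.
    load_path, client_state = engine.load_checkpoint(save_dir, tag=tag)
    return load_path, client_state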

@momozzing

Hi @tohtana,
Thank you for your answer.

I'm using this code:
deepspeed.DeepSpeedEngine.save_checkpoint(save_dir=save_dir, exclude_frozen_parameters=True)

but save_checkpoint only seems to save the optimizer state; the model state is not saved.

-rw-rw-r-- 1 519K 09:32 zero_pp_rank_0_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_0_mp_rank_00_optim_states.pt
-rw-rw-r-- 1 519K 09:32 zero_pp_rank_1_mp_rank_00_model_states.pt
-rw-rw-r-- 1 478M 09:32 zero_pp_rank_1_mp_rank_00_optim_states.pt

When I save the trained model, there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).

Is there any way to save LoRA's trained weights?

@tohtana
Contributor

tohtana commented Nov 15, 2023

Hi @momozzing,
I haven't run the code, but isn't zero_pp_rank_0_mp_rank_00_model_states.pt the model state?
Since you specified exclude_frozen_parameters=True, it only has parameters that are trained for LoRA.

You can find an example of the combination of ZeRO-3 and LoRA in DeepSpeed-Chat. In the following example, it saves all the parameters, including the LoRA ones.
https://github.com/microsoft/DeepSpeedExamples/blob/ccb2a3400a05ea075b643bb3aeabb02f9883c5da/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py#L385
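
For reference, with "stage3_gather_16bit_weights_on_model_save": true (as in the config at the top of this issue), the engine can also consolidate and write the full 16-bit weights itself. A minimal sketch, with an assumed function name and output directory:

def save_full_16bit_model(engine, output_dir="hf_model_out"):
    # Gathers the ZeRO-3 partitioned parameters and writes a single
    # pytorch_model.bin on rank 0, LoRA adapters included.
    engine.save_16bit_model(output_dir)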

@momozzing

Hi @tohtana

LLaMA + QLoRA without DeepSpeed saves adapter_model.bin at 477 MB.

LLaMA + QLoRA with DeepSpeed ZeRO-2 saves adapter_model.bin at 477 MB.

However, LLaMA + QLoRA with DeepSpeed ZeRO-3 saves adapter_model.bin at 519 KB.

So there seems to be an issue where the LoRA parameters are saved with size torch.Size([0]).

Is there any way to save LoRA's trained weights with DeepSpeed ZeRO-3?

Does DeepSpeed ZeRO-3 support bitsandbytes?

@tohtana
Contributor

tohtana commented Nov 15, 2023

Hi @momozzing

ZeRO-3 sets an empty size (torch.Size([0])) on a parameter object and keeps the real tensor data in a different attribute, so we cannot conclude that parameters were not saved just because torch.Size([0]) appears in the error message. ZeRO-3 also saves partitioned parameters, which are in a different format from a normal PyTorch checkpoint, so we need to use DeepSpeed's API to load the checkpoint.
In your code, you use AutoModelForCausalLM.from_pretrained(). This cannot properly load a checkpoint that ZeRO-3 saved.

Here is another example using the HF Trainer and LoRA. This script seems to save parameters properly. Can you check this as well?
https://github.com/tohtana/ds_repro_4295/blob/main/finetune_llama_v2.py
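
To illustrate the point about the empty placeholder: under ZeRO-3 a parameter's visible shape can be torch.Size([0]) until it is explicitly gathered. A small sketch (the helper is hypothetical; deepspeed.zero.GatheredParameters is the documented context manager used here):

import deepspeed

def inspect_param(param):
    # Outside the context, the full tensor is partitioned across ranks,
    # so the parameter may show an empty shape.
    print("before gather:", tuple(param.shape))
    with deepspeed.zero.GatheredParameters([param]):
        # Inside the context, the real data is temporarily materialized.
        print("inside gather:", tuple(param.shape))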

@momozzing

Hi, @tohtana
As you said, using DeepSpeed's API solved the problem.

Here's how I solved it.

# Consolidate the full 16-bit model weights from the ZeRO-3 partitions,
# then pull out just the LoRA weights and save them as adapter_model.bin.
state_dict = self.engine._zero3_consolidated_16bit_state_dict()
lora_state_dict = get_peft_model_state_dict(self.model, state_dict)
self.model.save_pretrained(save_dir)
torch.save(lora_state_dict, os.path.join(save_dir, "adapter_model.bin"))

Thank you very much for your reply.

@tjruwase tjruwase closed this as completed Dec 6, 2023
mauryaavinash95 pushed a commit to mauryaavinash95/DeepSpeed that referenced this issue Feb 17, 2024
This PR addresses an error reported in microsoft#4295.
When parameters in multiple data types are given, DeepSpeed performs
allgather for each data type.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
@stas00
Collaborator

stas00 commented Mar 5, 2024

This is a workaround, not a proper solution, as it can be really expensive:

state_dict = self.engine._zero3_consolidated_16bit_state_dict()

get_peft_model_state_dict ideally needs to be fixed to become ZeRO-aware - it will need to handle both DeepSpeed ZeRO and FSDP. In the case of DeepSpeed, it needs to gather the weights the way it is done here:

https://github.com/huggingface/transformers/blob/81c8191b4651de216c00e25e1af607683e980614/src/transformers/modeling_utils.py#L605-L620

This is the efficient way of doing it, as it gathers one layer at a time and incurs little memory overhead.
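
A rough sketch of that layer-at-a-time approach applied to LoRA weights (the names are assumptions and this is not an actual peft patch; it just mirrors the gather-one-module-at-a-time pattern from the linked transformers code):

import deepspeed

def gather_lora_state_dict_layerwise(model):
    # Gather one module's parameters at a time so peak memory stays small,
    # instead of consolidating the whole ZeRO-3 model at once.
    lora_sd = {}
    for module_name, module in model.named_modules():
        params = list(module.parameters(recurse=False))
        if not params:
            continue
        with deepspeed.zero.GatheredParameters(params):
            for param_name, param in module.named_parameters(recurse=False):
                full_name = f"{module_name}.{param_name}" if module_name else param_name
                if "lora_" in full_name:
                    lora_sd[full_name] = param.detach().cpu().clone()
    return lora_sd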

@pacman100
Contributor

pacman100 commented Mar 5, 2024

  1. With zero.init enabled, I get the error below with the latest branches of Accelerate and Transformers and the latest release of DeepSpeed:
File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    model = AutoModelForCausalLM.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    return model_class.from_pretrained(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3504, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    ) = cls._load_pretrained_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 3928, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/raid/sourab/transformers/src/transformers/modeling_utils.py", line 805, in _load_state_dict_into_meta_model
        raise ValueError(set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)

  File "/raid/sourab/accelerate/src/accelerate/utils/modeling.py", line 345, in set_module_tensor_to_device
ValueError        : raise ValueError(raise ValueError(Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.


ValueErrorValueError: : Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.Trying to set a tensor of shape torch.Size([32000, 4096]) in "weight" (which has shape torch.Size([0])), this look incorrect.
  2. Below is the memory usage with zero_init=False and QLoRA + DeepSpeed Stage 3 for Llama 70B. GPU memory usage per GPU: 20% of 80 GB = 16 GB per GPU. However, the initial memory per GPU during model loading would be 35 GB (0.5 * 70B) as each GPU loads the pretrained model in 4-bit. If zero_init were enabled with QLoRA, one could fine-tune a 70B model on 8x 24 GB GPUs, which would be great.
    Code: https://github.com/pacman100/DHS-LLM-Workshop/blob/main/chat_assistant/sft/training
    Command:
accelerate launch --config_file "configs/deepspeed_config_z3_qlora.yaml"  train.py \
--seed 100 \
--model_name_or_path "meta-llama/Llama-2-70b-hf" \
--dataset_name "smangrul/ultrachat-10k-chatml" \
--chat_template_format "chatml" \
--add_special_tokens False \
--append_concat_token False \
--splits "train,test" \
--max_seq_len 2048 \
--num_train_epochs 1 \
--logging_steps 5 \
--log_level "info" \
--logging_strategy "steps" \
--evaluation_strategy "epoch" \
--save_strategy "epoch" \
--push_to_hub \
--hub_private_repo True \
--hub_strategy "every_save" \
--bf16 True \
--packing True \
--learning_rate 1e-4 \
--lr_scheduler_type "cosine" \
--weight_decay 1e-4 \
--warmup_ratio 0.0 \
--max_grad_norm 1.0 \
--output_dir "mistral-sft-lora-ds" \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--gradient_accumulation_steps 1 \
--gradient_checkpointing True \
--use_reentrant True \
--dataset_text_field "content" \
--use_flash_attn True \
--use_peft_lora True \
--lora_r 8 \
--lora_alpha 16 \
--lora_dropout 0.1 \
--lora_target_modules "all-linear" \
--use_4bit_quantization True \
--use_nested_quant True \
--bnb_4bit_compute_dtype "bfloat16"
[Screenshot: GPU memory usage, 2024-03-05]
