I used DeepSpeed ZeRO-3 to fine-tune Qwen1.5 7B, and then a mysterious error happened:
[2024-04-25 22:49:17,319] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-04-25 22:49:19,277] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-25 22:49:19,278] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-25 22:49:43,619] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 87377) of binary: /usr/bin/python3.11
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
finetune_new.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-25_22:49:43
host : I198d6e5a0001d010f2
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 87377)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 87377
======================================================
What should I do? I have already tried reducing "sub_group_size", "reduce_bucket_size", "stage3_prefetch_bucket_size", "stage3_param_persistence_threshold", "stage3_max_live_parameters", and "stage3_max_reuse_distance" to 1e3, and setting both "device" fields to "cpu", but it doesn't work.
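For reference, the ZeRO-3 part of my config looks roughly like the sketch below, written as a Python dict rather than my actual JSON file (other fields such as the optimizer and scheduler sections are omitted; I'm assuming the two "device" fields are offload_optimizer and offload_param):

# Sketch of the ZeRO-3 settings described above (remaining config fields omitted).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # the two "device" fields mentioned above, both set to "cpu"
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        # all of these were reduced to 1e3 while debugging
        "sub_group_size": 1e3,
        "reduce_bucket_size": 1e3,
        "stage3_prefetch_bucket_size": 1e3,
        "stage3_param_persistence_threshold": 1e3,
        "stage3_max_live_parameters": 1e3,
        "stage3_max_reuse_distance": 1e3,
    }
}

# The dict is handed to the HF Trainer via TrainingArguments(..., deepspeed=ds_config),
# which also accepts a path to an equivalent JSON file.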