I used DeepSpeed ZeRO-3 to fine-tune Qwen1.5 7B, and then a mysterious error happened:
[2024-04-25 22:49:17,319] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
/usr/local/lib/python3.11/dist-packages/transformers/deepspeed.py:23: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
warnings.warn(
[2024-04-25 22:49:19,277] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-25 22:49:19,278] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2024-04-25 22:49:43,619] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 0 (pid: 87377) of binary: /usr/bin/python3.11
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
finetune_new.py FAILED
------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-04-25_22:49:43
host : I198d6e5a0001d010f2
rank : 0 (local_rank: 0)
exitcode : -9 (pid: 87377)
error_file: <N/A>
traceback : Signal 9 (SIGKILL) received by PID 87377
======================================================
What should I do? I have already tried reducing "sub_group_size", "reduce_bucket_size", "stage3_prefetch_bucket_size", "stage3_param_persistence_threshold", "stage3_max_live_parameters", and "stage3_max_reuse_distance" to 1e3, and setting both "device" fields to "cpu", but it doesn't work.
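For reference, the ZeRO-3 part of my config looks roughly like the sketch below, written as a Python dict rather than my actual JSON file (other fields such as the optimizer and scheduler sections are omitted; I'm assuming the two "device" fields are offload_optimizer and offload_param):

# Sketch of the ZeRO-3 settings described above (remaining config fields omitted).
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # the two "device" fields mentioned above, both set to "cpu"
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        # all of these were reduced to 1e3 while debugging
        "sub_group_size": 1e3,
        "reduce_bucket_size": 1e3,
        "stage3_prefetch_bucket_size": 1e3,
        "stage3_param_persistence_threshold": 1e3,
        "stage3_max_live_parameters": 1e3,
        "stage3_max_reuse_distance": 1e3,
    }
}

# The dict is handed to the HF Trainer via TrainingArguments(..., deepspeed=ds_config),
# which also accepts a path to an equivalent JSON file.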