deepspeed launcher exits as successful on failure #618

@stas00

Description

Demo:

  1. prepare a failing program:
     echo garbageeeeee > fail.py
  2. run it:
     deepspeed fail.py || echo "failed"

log:

[2020-12-24 16:25:13,160] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-24 16:25:13,185] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 fail.py
[2020-12-24 16:25:13,902] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-24 16:25:13,902] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-24 16:25:13,902] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-24 16:25:13,902] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-24 16:25:13,902] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
Traceback (most recent call last):
  File "fail.py", line 1, in <module>
    garbageeeeee
NameError: name 'garbageeeeee' is not defined
Traceback (most recent call last):
  File "fail.py", line 1, in <module>
    garbageeeeee
NameError: name 'garbageeeeee' is not defined

As you can see, echo "failed" never ran, because deepspeed exited with return code 0 even though both processes raised NameError.

Another way to see the same:

deepspeed fail.py && echo $?
[...]
NameError: name 'garbageeeeee' is not defined
0
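
For comparison, the behaviour expected here (a launcher that exits nonzero when any of its child processes fails) can be sketched roughly as follows. This is only an illustration, not DeepSpeed's actual launcher code: `launch` is a hypothetical helper, and plain `python` is used for the child processes.

```python
import subprocess
import sys


def launch(script: str, num_procs: int = 2) -> int:
    """Hypothetical launcher: run `script` in `num_procs` child processes
    and return 0 only if every child exited with 0."""
    procs = [
        subprocess.Popen([sys.executable, "-u", script])
        for _ in range(num_procs)
    ]
    # Wait for all children and collect their exit codes.
    codes = [p.wait() for p in procs]
    # Propagate the first nonzero exit code, or 0 if all children succeeded.
    return next((code for code in codes if code != 0), 0)
```

A wrapper script would then end with `sys.exit(launch("fail.py"))`, so that `|| echo "failed"` fires as intended.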

This bug breaks tests that rely on the sub-process exiting with a nonzero return code on failure.
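
A test of that kind might look like the sketch below. It is a minimal illustration under assumptions: `run_script` and `test_failure_is_reported` are hypothetical names, and plain `python` stands in for `deepspeed` so the example runs without DeepSpeed installed. With the launcher in place of the interpreter, the first assertion is the one that currently fails.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(source: str) -> subprocess.CompletedProcess:
    """Write `source` to a temporary file and run it in a child interpreter."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "fail.py"
        script.write_text(source)
        # Assumption: sys.executable stands in for the `deepspeed` command.
        return subprocess.run(
            [sys.executable, str(script)], capture_output=True, text=True
        )


def test_failure_is_reported():
    result = run_script("garbageeeeee\n")
    # The test relies on the child signalling failure via its return code.
    assert result.returncode != 0
    assert "NameError" in result.stderr
```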

Thank you.
