Closed
Description
Demo:
- prepare a failing program:
echo garbageeeeee > fail.py
- run it:
deepspeed fail.py || echo "failed"
log:
[2020-12-24 16:25:13,160] [WARNING] [runner.py:117:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-12-24 16:25:13,185] [INFO] [runner.py:355:main] cmd = /home/stas/anaconda3/envs/main-38/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 fail.py
[2020-12-24 16:25:13,902] [INFO] [launch.py:78:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2020-12-24 16:25:13,902] [INFO] [launch.py:84:main] nnodes=1, num_local_procs=2, node_rank=0
[2020-12-24 16:25:13,902] [INFO] [launch.py:99:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2020-12-24 16:25:13,902] [INFO] [launch.py:100:main] dist_world_size=2
[2020-12-24 16:25:13,902] [INFO] [launch.py:102:main] Setting CUDA_VISIBLE_DEVICES=0,1
Traceback (most recent call last):
  File "fail.py", line 1, in <module>
    garbageeeeee
NameError: name 'garbageeeeee' is not defined
Traceback (most recent call last):
  File "fail.py", line 1, in <module>
    garbageeeeee
NameError: name 'garbageeeeee' is not defined
As you can see, echo "failed" didn't kick in, because deepspeed exited with return code 0 even though both worker processes crashed.
Another way to see the same:
deepspeed fail.py && echo $?
[...]
NameError: name 'garbageeeeee' is not defined
0
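For comparison, here is a minimal sketch of the behaviour I would expect from a launcher: start the workers, wait for all of them, and hand back a non-zero code if any of them failed. This is not deepspeed's actual launcher code; the function name launch_and_propagate and its parameters are made up for illustration.

```python
import subprocess
import sys


def launch_and_propagate(cmd, num_procs=2):
    """Start num_procs copies of cmd; return 0 only if all of them succeed.

    Hypothetical helper, not part of deepspeed.
    """
    procs = [subprocess.Popen(cmd) for _ in range(num_procs)]
    # wait() blocks until the child exits and returns its exit code.
    codes = [p.wait() for p in procs]
    # Report the first non-zero code, or 0 if every worker succeeded.
    return next((code for code in codes if code != 0), 0)
```

A launcher entry point could then finish with something like sys.exit(launch_and_propagate([sys.executable, "-u", "fail.py"])), so that the shell sees the failure.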
This bug breaks tests that rely on the sub-process returning returncode > 0 on failure.
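To make the impact concrete, here is roughly what such a test looks like. It is a sketch only: run_and_check and the test function are hypothetical names, and a plain python child process stands in for the launcher so that the snippet runs without deepspeed installed.

```python
import subprocess
import sys


def run_and_check(cmd):
    """Run cmd and raise if it exits with a non-zero return code (hypothetical helper)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} failed with returncode {result.returncode}:\n{result.stderr}"
        )
    return result


def test_failing_script_is_detected():
    # Same failing program as fail.py above, passed inline via -c.
    failing = [sys.executable, "-c", "garbageeeeee"]
    try:
        run_and_check(failing)
    except RuntimeError:
        return  # the failure was detected, as expected
    raise AssertionError("the child failed but the failure went undetected")
```

With the behaviour shown in the logs above, swapping the command for ["deepspeed", "fail.py"] would make run_and_check return normally, so the failure would slip through.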
Thank you.