`tests/test_trainer_distributed.py` fails occasionally on the multi-GPU GitHub runner CI and, as a result, doesn't free up the default distributed port 29500.
This could be caused by an occasional deadlock discussed in `testing_utils.py`'s `_stream_subprocess`. When debugging one such zombie it was stuck in `exec(eval(sys.stdin.readline()))`.
Note that other similar tests under `examples` don't exhibit the same behavior; perhaps it has to do with the script being launched (this test runs its own file as the distributed script).
The backtrace of the subsequent failures is long and confusing, as there are several mixed failures, but it all boils down to one failure: `Address already in use`, since the previous distributed run of the same test didn't free up this port.
A quick check should show which process is bound to it:
```
netstat -tulpn | grep :29500
```
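For scripting the same check (e.g. from a test helper), here is a minimal sketch, assuming the default port 29500 on localhost; this is just an illustration, not part of the test suite:

```python
import socket

def port_in_use(port: int = 29500, host: str = "127.0.0.1") -> bool:
    """Return True if something (e.g. a zombie TCPStore) is still bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return False  # bind succeeded, so the port is free
        except OSError:
            return True   # bind failed: something still holds the port (or it's in TIME_WAIT)

if __name__ == "__main__":
    print(f"port 29500 in use: {port_in_use()}")
```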
The full backtrace:
```
NCCL_DEBUG=INFO pytest -sv tests/test_trainer_distributed.py
================================================================ test session starts ================================================================
platform linux -- Python 3.7.4, pytest-6.2.2, py-1.10.0, pluggy-0.13.1 -- /home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python
cachedir: .pytest_cache
rootdir: /home/github_actions/actions-runner/_work/transformers/transformers
plugins: xdist-2.2.1, forked-1.3.0
collected 1 item
tests/test_trainer_distributed.py::TestTrainerDistributed::test_trainer
Running: /home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python -m torch.distributed.launch --nproc_per_node=2 /home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py --output_dir /tmp/tmp2k265qn5
stderr: Traceback (most recent call last):
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py", line 82, in <module>
stderr: training_args = parser.parse_args_into_dataclasses()[0]
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
stderr: obj = dtype(**inputs)
stderr: File "<string>", line 61, in __init__
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 491, in __post_init__
stderr: if is_torch_available() and self.device.type != "cuda" and self.fp16:
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr: return func(*args, **kwargs)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 620, in device
stderr: return self._setup_devices
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1359, in __get__
stderr: cached = self.fget(obj)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr: return func(*args, **kwargs)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 605, in _setup_devices
stderr: torch.distributed.init_process_group(backend="nccl")
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
stderr: store, rank, world_size = next(rendezvous_iterator)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
stderr: store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
stderr: RuntimeError: Address already in use
stderr: Traceback (most recent call last):
stderr: File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
stderr: "__main__", mod_spec)
stderr: File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
stderr: exec(code, run_globals)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
stderr: main()
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
stderr: cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python', '-u', '/home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py', '--local_rank=1', '--output_dir', '/tmp/tmp2k265qn5']' returned non-zero exit status 1.
stdout: *****************************************
stdout: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
stdout: *****************************************
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO Bootstrap : Using [0]ens6:10.128.0.66<0>
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
stdout:
stdout: multi-gpu-ci-runner:18062:18062 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO NET/Socket : Using [0]ens6:10.128.0.66<0>
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO Using network Socket
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout:
stdout: multi-gpu-ci-runner:18062:18089 [1] include/socket.h:403 NCCL WARN Connect to 10.128.0.66<52523> failed : Connection refused
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO bootstrap.cc:95 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO bootstrap.cc:309 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO init.cc:555 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO init.cc:840 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
stderr: Traceback (most recent call last):
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py", line 82, in <module>
stderr: training_args = parser.parse_args_into_dataclasses()[0]
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
stderr: obj = dtype(**inputs)
stderr: File "<string>", line 61, in __init__
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 491, in __post_init__
stderr: if is_torch_available() and self.device.type != "cuda" and self.fp16:
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr: return func(*args, **kwargs)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 620, in device
stderr: return self._setup_devices
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1359, in __get__
stderr: cached = self.fget(obj)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr: return func(*args, **kwargs)
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 605, in _setup_devices
stderr: torch.distributed.init_process_group(backend="nccl")
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
stderr: barrier()
stderr: File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
stderr: work = _default_pg.barrier()
stderr: RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
FAILED
===================================================================== FAILURES ======================================================================
________________________________________________________ TestTrainerDistributed.test_trainer ________________________________________________________
self = <tests.test_trainer_distributed.TestTrainerDistributed testMethod=test_trainer>
@require_torch_multi_gpu
def test_trainer(self):
distributed_args = f"""
-m torch.distributed.launch
--nproc_per_node={torch.cuda.device_count()}
{self.test_file_dir}/test_trainer_distributed.py
""".split()
output_dir = self.get_auto_remove_tmp_dir()
args = f"--output_dir {output_dir}".split()
cmd = [sys.executable] + distributed_args + args
> execute_subprocess_async(cmd, env=self.get_env())
tests/test_trainer_distributed.py:72:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cmd = ['/home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python', '-m', 'torch.distributed.launc.../github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py', '--output_dir', ...]
env = {'HOME': '/home/github_actions', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'LANG': 'C.UTF-8', ...}, stdin = None
timeout = 180, quiet = False, echo = True
def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False, echo=True) -> _RunOutput:
loop = asyncio.get_event_loop()
result = loop.run_until_complete(
_stream_subprocess(cmd, env=env, stdin=stdin, timeout=timeout, quiet=quiet, echo=echo)
)
cmd_str = " ".join(cmd)
if result.returncode > 0:
stderr = "\n".join(result.stderr)
raise RuntimeError(
> f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
f"The combined stderr from workers follows:\n{stderr}"
)
```
A short-term workaround could be to randomize the port, so this test won't trip over its previous zombie (a rough sketch of that is below).
But this is a band-aid and a real solution is needed. It will also be an issue for any other distributed tests that rely on the same default port number.
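As a sketch of that band-aid (not how the test currently works), one could ask the OS for a free port and pass it to the launcher via its `--master_port` option; the helper name `get_free_port` and the paths below are purely illustrative:

```python
import socket
import subprocess
import sys

def get_free_port() -> int:
    # Ask the OS for an ephemeral port, then release it so the launcher can bind it.
    # (There is a small race window between closing this socket and the launcher binding the port.)
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

port = get_free_port()
cmd = [
    sys.executable,
    "-m", "torch.distributed.launch",
    "--nproc_per_node=2",                 # illustrative; the test uses torch.cuda.device_count()
    f"--master_port={port}",              # avoid the default 29500 held by a zombie
    "tests/test_trainer_distributed.py",
    "--output_dir", "/tmp/illustrative_output_dir",  # illustrative path
]
subprocess.run(cmd, check=True)
```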
I will keep on monitoring the issue.
Meanwhile, PR #10281 should help prevent zombies from accumulating across scheduled runs.
It's difficult to debug w/o being able to reproduce this problem at will.
One other solution: since this is a single node, we could use a unique file rather than a port for setting up the distributed process group.
That is, `init_process_group()` with `init_method="file:///tmp/unique_file"`; but the trainer currently hardcodes the `env://` method, so we may need to make it more flexible around that.
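A minimal sketch of that idea (outside the Trainer, which currently uses the `env://` rendezvous); the rendezvous file path is illustrative, and each run would need a fresh, unique file shared by all ranks on the node:

```python
import os
import tempfile

import torch.distributed as dist

def init_with_file(rank: int, world_size: int, init_file: str) -> None:
    """File-based rendezvous: ranks on a single node agree on a unique file
    instead of a TCP port, so a zombie holding port 29500 can't block setup."""
    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{init_file}",  # instead of the default env:// (MASTER_ADDR/MASTER_PORT)
        rank=rank,
        world_size=world_size,
    )

if __name__ == "__main__":
    # Illustrative usage: with torch.distributed.launch, `rank` would come from the
    # --local_rank argument and `world_size` from the number of launched processes.
    init_with_file(rank=0, world_size=1,
                   init_file=os.path.join(tempfile.gettempdir(), "unique_file"))
```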