[tests] tests/test_trainer_distributed.py intermittent failure #10282

Closed
stas00 opened this issue Feb 19, 2021 · 2 comments
stas00 commented Feb 19, 2021

tests/test_trainer_distributed.py fails occasionally on the multi-gpu GitHub runner CI and, as a result, doesn't free up the default distributed port 29500.

This could be caused by the occasional deadlock discussed in `testing_utils.py`'s `_stream_subprocess`. When debugging one such zombie, it was found stuck in `exec(eval(sys.stdin.readline()))`.

Note that other similar tests under examples don't exhibit the same behavior; perhaps it has to do with the fact that this test runs its own file as the distributed script.

The backtrace of the subsequent failures is long and confusing, since several failures are mixed together, but it all boils down to one failure: `Address already in use`, because the previous distributed run of the same test didn't free up this port.

A quick check should show which process is bound to it:

netstat -tulpn | grep :29500
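
If netstat isn't available on the runner, a rough Python equivalent (just a sketch; 29500 is the default port used by torch.distributed.launch, and binding to 127.0.0.1 is an assumption):

import socket

def port_is_free(port: int = 29500, host: str = "127.0.0.1") -> bool:
    # Try to bind the port; failure means some process (possibly a zombie
    # from a previous run) is still holding it.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
            return True
        except OSError:
            return False

print("port 29500 free:", port_is_free())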

The full backtrace:

NCCL_DEBUG=INFO pytest -sv tests/test_trainer_distributed.py
================================================================ test session starts ================================================================
platform linux -- Python 3.7.4, pytest-6.2.2, py-1.10.0, pluggy-0.13.1 -- /home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python
cachedir: .pytest_cache
rootdir: /home/github_actions/actions-runner/_work/transformers/transformers
plugins: xdist-2.2.1, forked-1.3.0
collected 1 item                                                                                                                                    

tests/test_trainer_distributed.py::TestTrainerDistributed::test_trainer 
Running:  /home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python -m torch.distributed.launch --nproc_per_node=2 /home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py --output_dir /tmp/tmp2k265qn5
stderr: Traceback (most recent call last):
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py", line 82, in <module>
stderr:     training_args = parser.parse_args_into_dataclasses()[0]
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
stderr:     obj = dtype(**inputs)
stderr:   File "<string>", line 61, in __init__
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 491, in __post_init__
stderr:     if is_torch_available() and self.device.type != "cuda" and self.fp16:
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 620, in device
stderr:     return self._setup_devices
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1359, in __get__
stderr:     cached = self.fget(obj)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 605, in _setup_devices
stderr:     torch.distributed.init_process_group(backend="nccl")
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 436, in init_process_group
stderr:     store, rank, world_size = next(rendezvous_iterator)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 179, in _env_rendezvous_handler
stderr:     store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
stderr: RuntimeError: Address already in use
stderr: Traceback (most recent call last):
stderr:   File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
stderr:     "__main__", mod_spec)
stderr:   File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
stderr:     exec(code, run_globals)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/launch.py", line 260, in <module>
stderr:     main()
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/launch.py", line 256, in main
stderr:     cmd=cmd)
stderr: subprocess.CalledProcessError: Command '['/home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python', '-u', '/home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py', '--local_rank=1', '--output_dir', '/tmp/tmp2k265qn5']' returned non-zero exit status 1.
stdout: *****************************************
stdout: Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
stdout: *****************************************
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO Bootstrap : Using [0]ens6:10.128.0.66<0>
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
stdout: 
stdout: multi-gpu-ci-runner:18062:18062 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO NET/Socket : Using [0]ens6:10.128.0.66<0>
stdout: multi-gpu-ci-runner:18062:18062 [1] NCCL INFO Using network Socket
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO Call to connect returned Connection refused, retrying
stdout: 
stdout: multi-gpu-ci-runner:18062:18089 [1] include/socket.h:403 NCCL WARN Connect to 10.128.0.66<52523> failed : Connection refused
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO bootstrap.cc:95 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO bootstrap.cc:309 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO init.cc:555 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO init.cc:840 -> 2
stdout: multi-gpu-ci-runner:18062:18089 [1] NCCL INFO group.cc:73 -> 2 [Async thread]
stderr: Traceback (most recent call last):
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py", line 82, in <module>
stderr:     training_args = parser.parse_args_into_dataclasses()[0]
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/hf_argparser.py", line 180, in parse_args_into_dataclasses
stderr:     obj = dtype(**inputs)
stderr:   File "<string>", line 61, in __init__
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 491, in __post_init__
stderr:     if is_torch_available() and self.device.type != "cuda" and self.fp16:
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 620, in device
stderr:     return self._setup_devices
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1359, in __get__
stderr:     cached = self.fget(obj)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/file_utils.py", line 1369, in wrapper
stderr:     return func(*args, **kwargs)
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/src/transformers/training_args.py", line 605, in _setup_devices
stderr:     torch.distributed.init_process_group(backend="nccl")
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
stderr:     barrier()
stderr:   File "/home/github_actions/actions-runner/_work/transformers/transformers/.env/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
stderr:     work = _default_pg.barrier()
stderr: RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8
FAILED

===================================================================== FAILURES ======================================================================
________________________________________________________ TestTrainerDistributed.test_trainer ________________________________________________________

self = <tests.test_trainer_distributed.TestTrainerDistributed testMethod=test_trainer>

    @require_torch_multi_gpu
    def test_trainer(self):
    
        distributed_args = f"""
            -m torch.distributed.launch
            --nproc_per_node={torch.cuda.device_count()}
            {self.test_file_dir}/test_trainer_distributed.py
        """.split()
        output_dir = self.get_auto_remove_tmp_dir()
        args = f"--output_dir {output_dir}".split()
        cmd = [sys.executable] + distributed_args + args
>       execute_subprocess_async(cmd, env=self.get_env())

tests/test_trainer_distributed.py:72: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cmd = ['/home/github_actions/actions-runner/_work/transformers/transformers/.env/bin/python', '-m', 'torch.distributed.launc.../github_actions/actions-runner/_work/transformers/transformers/tests/test_trainer_distributed.py', '--output_dir', ...]
env = {'HOME': '/home/github_actions', 'KMP_DUPLICATE_LIB_OK': 'True', 'KMP_INIT_AT_FORK': 'FALSE', 'LANG': 'C.UTF-8', ...}, stdin = None
timeout = 180, quiet = False, echo = True

    def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, quiet=False, echo=True) -> _RunOutput:
    
        loop = asyncio.get_event_loop()
        result = loop.run_until_complete(
            _stream_subprocess(cmd, env=env, stdin=stdin, timeout=timeout, quiet=quiet, echo=echo)
        )
    
        cmd_str = " ".join(cmd)
        if result.returncode > 0:
            stderr = "\n".join(result.stderr)
            raise RuntimeError(
>               f"'{cmd_str}' failed with returncode {result.returncode}\n\n"
                f"The combined stderr from workers follows:\n{stderr}"
            )

A short-term workaround could be to randomize the port, so this test won't trample upon its previous zombie:

+        from random import randint
+        master_port = 2950 + randint(1, 99)
         distributed_args = f"""
             -m torch.distributed.launch
             --nproc_per_node={torch.cuda.device_count()}
+            --master_port {master_port}
             {self.test_file_dir}/test_trainer_distributed.py
         """.split()

but this is a band-aid and a real solution is needed. It will also be an issue for any other distributed test that relies on the same default port number.
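
For illustration only, and not what the test currently does: instead of randomizing within a small range, one could ask the OS for a port that is actually free at the time of the check, accepting the small race window before the launcher rebinds it:

import socket

def get_free_port() -> int:
    # Bind to port 0 and let the kernel pick an unused port; there is a
    # small race window until torch.distributed.launch rebinds it, but it
    # avoids colliding with a known stale port such as 29500.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]

master_port = get_free_port()  # then pass --master_port {master_port} to the launcher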

I will keep on monitoring the issue.

Meanwhile, PR #10281 should help prevent zombies from accumulating across scheduled runs.

It's difficult to debug without being able to reproduce this problem at will.

stas00 self-assigned this Feb 19, 2021
stas00 added the Tests label Feb 19, 2021
stas00 commented Mar 13, 2021

Another possible solution: since this is a single node, we could use a unique file rather than a port for setting up the distributed process group.

That is, `init_process_group()` with `init_method="file:///tmp/unique_file"`. However, the Trainer currently hardcodes the `env://` method, so we may need to make it more flexible there.

Reference: https://pytorch.org/docs/master/distributed.html#torch.distributed.init_process_group
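
A rough sketch of what that could look like (a hypothetical helper, not the Trainer's current code path, which hardcodes env://):

import torch.distributed as dist

def init_file_store_process_group(rank: int, world_size: int, sync_file: str):
    # Rendezvous through a shared file instead of a TCP port, so a stale
    # port from a previous run can't block initialization. The file must
    # not exist before the first process uses it and should be removed
    # once the group is torn down.
    dist.init_process_group(
        backend="nccl",
        init_method=f"file://{sync_file}",
        rank=rank,
        world_size=world_size,
    )

Each process would need to compute the same sync_file, e.g. derived from the test's temporary output dir.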

stas00 commented Mar 18, 2021

Since we are switching to Docker runs, this becomes moot, as there will be no leftover processes from previous runs.

stas00 closed this as completed Mar 18, 2021