Description
What is the problem?
When Ray is shut down, the Horovod class needs to be reset. Otherwise, the next time Ray starts, the class still carries the stale class attribute set by the previous Ray / Horovod run.
This causes failures in unit tests where the Ray cluster and the Horovod Ray executor are repeatedly brought up and shut down.
The root cause is here:
In the function below, func_cls is our Horovod class. It carries a RAY_CLIENT_MODE_ATTR attribute that holds a key. When Ray is stopped and started again, this key on the Horovod class is not cleared, so the lookup uses a stale cached key and fails.
File "ray/_private/client_mode_hook.py", line 90, in client_mode_convert_actor
def client_mode_convert_function(func_cls, in_args, in_kwargs, **kwargs):
    """Runs a preregistered ray RemoteFunction through the ray client.
    The common case for this is to transparently convert that RemoteFunction
    to a ClientRemoteFunction. This happens in circumstances where the
    RemoteFunction is declared early, in a library and only then is Ray used in
    client mode -- nescessitating a conversion.
    """
    from ray.util.client import ray
    key = getattr(func_cls, RAY_CLIENT_MODE_ATTR, None)  # <- here is the bug!
    if key is None:
        key = ray._convert_function(func_cls)
        setattr(func_cls, RAY_CLIENT_MODE_ATTR, key)
    client_func = ray._get_converted(key)
    return client_func._remote(in_args, in_kwargs, **kwargs)
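To make the failure mode concrete without a Ray cluster, here is a minimal sketch of the same caching pattern. Everything in it is a stand-in: `FakeClient`, `CACHE_ATTR`, `convert_cached`, and `convert_checked` are hypothetical names, not Ray APIs. `convert_cached` mirrors the logic quoted above, and `convert_checked` shows one possible guard (re-register when the current client does not know the cached key). It is not necessarily how Ray should or does fix this.

```python
import uuid

# Hypothetical stand-in for RAY_CLIENT_MODE_ATTR.
CACHE_ATTR = "_client_mode_key"


class FakeClient:
    """Stand-in for one client session; converted objects live only here."""

    def __init__(self):
        self._converted = {}

    def convert(self, obj):
        key = uuid.uuid4().hex
        self._converted[key] = obj
        return key

    def get_converted(self, key):
        return self._converted[key]


def convert_cached(client, func_cls):
    """Mirrors the quoted logic: trusts whatever key is cached on the class."""
    key = getattr(func_cls, CACHE_ATTR, None)
    if key is None:
        key = client.convert(func_cls)
        setattr(func_cls, CACHE_ATTR, key)
    return client.get_converted(key)


def convert_checked(client, func_cls):
    """Variant that re-registers if the cached key is unknown to this client."""
    key = getattr(func_cls, CACHE_ATTR, None)
    if key is None or key not in client._converted:
        key = client.convert(func_cls)
        setattr(func_cls, CACHE_ATTR, key)
    return client.get_converted(key)


class Worker:
    """Plays the role of the Horovod class that outlives the session."""
```

With a first `FakeClient`, `convert_cached(first, Worker)` succeeds and stores a key on `Worker`. With a fresh `FakeClient`, the same call raises `KeyError` because the key belongs to the old session, while `convert_checked` registers the class again.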
Ray 2.0.0.dev0
MacOS
Reproduction (REQUIRED)
import os
import shlex
import subprocess
import ray
ray_client_port = 31300
subprocess.Popen(shlex.split(f'python -m ray.util.client.server --port {ray_client_port}'))
ray.util.connect(f'0.0.0.0:{ray_client_port}')
from horovod.ray import RayExecutor
ray_executor = RayExecutor(RayExecutor.create_settings(timeout_s=300))
ray_executor.start()
ray_executor.shutdown()
ray.util.disconnect()
os.system('ray stop')
Running the same code a second time in the same process fails at ray_executor.start():
ray_client_port = 31300
subprocess.Popen(shlex.split(f'python -m ray.util.client.server --port {ray_client_port}'))
ray.util.connect(f'0.0.0.0:{ray_client_port}')
from horovod.ray import RayExecutor
ray_executor = RayExecutor(RayExecutor.create_settings(timeout_s=300))
ray_executor.start()
ray_executor.shutdown()
ray.util.disconnect()
os.system('ray stop')
The error message looks like this:
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 379, in start
    self.workers = self._create_workers(resources_per_host())
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 321, in _create_workers
    for node_rank in range(self.num_hosts)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 321, in
    for node_rank in range(self.num_hosts)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/actor.py", line 476, in remote
    override_environment_variables))
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/actor.py", line 587, in _remote
    override_environment_variables))
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 90, in client_mode_convert_actor
    client_actor = ray._get_converted(key)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/util/client/api.py", line 284, in _get_converted
    return self.worker._get_converted(key)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/util/client/worker.py", line 507, in _get_converted
    return self._converted[key]
KeyError: 'c829a8fc5fd543c2b20078d28a469ea6'
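As a possible stopgap in test teardown, the stale attribute could be deleted by hand between runs. The helper below is only a sketch: `clear_cached_key` is a hypothetical name, and it is an assumption that removing the attribute from the object that carries it is enough for Ray to convert it again on the next connection.

```python
def clear_cached_key(obj, attr_name):
    """Delete attr_name from obj if it is set directly on obj.

    Returns True if the attribute was removed, False if there was
    nothing to remove. Hypothetical helper, not part of Ray or Horovod.
    """
    try:
        delattr(obj, attr_name)
    except AttributeError:
        return False
    return True
```

In a test fixture this might be called after `ray.util.disconnect()` with the affected class and the `RAY_CLIENT_MODE_ATTR` constant used in `ray/_private/client_mode_hook.py`. That module is private, so relying on it could break between Ray versions.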
- [x] I have verified my script runs in a clean environment and reproduces the issue.
- [x] I have verified the issue also occurs with the latest wheels.