[core] ray client not idempotent, causing tests to fail when used together with horovod #15378

@yuduber

What is the problem?

When Ray is shut down, the cached state on the Horovod class needs to be cleared. Otherwise, the next time Ray starts, the class still carries a stale attribute set by the previous Ray / Horovod run.

This causes failures in unit tests where the Ray cluster and the Horovod Ray executor are repeatedly brought up and shut down.

The root cause is here:
In the code below, func_cls is our Horovod class. It has a RAY_CLIENT_MODE_ATTR attribute attached to it that holds a key. When Ray is stopped and started a second time, this key on the Horovod class is not cleared, so it acts as a stale cache entry and causes the failure.
 
 ray/_private/client_mode_hook.py", line 90, in client_mode_convert_actor

def client_mode_convert_function(func_cls, in_args, in_kwargs, **kwargs):
    """Runs a preregistered ray RemoteFunction through the ray client.

    The common case for this is to transparently convert that RemoteFunction
    to a ClientRemoteFunction. This happens in circumstances where the
    RemoteFunction is declared early, in a library and only then is Ray used in
    client mode -- nescessitating a conversion.
    """
    from ray.util.client import ray

    key = getattr(func_cls, RAY_CLIENT_MODE_ATTR, None)  # <- here is the bug!
    if key is None:
        key = ray._convert_function(func_cls)
        setattr(func_cls, RAY_CLIENT_MODE_ATTR, key)
    client_func = ray._get_converted(key)
    return client_func._remote(in_args, in_kwargs, **kwargs)
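
To make the failure mode concrete, here is a minimal sketch that reproduces the same pattern without Ray or Horovod. Everything in it is a stand-in and an assumption: FakeClientSession, convert_through, and the attribute name _demo_client_key are hypothetical and are not Ray APIs. It only illustrates how a key cached on a class can outlive the registry that issued it.

```python
import uuid

CACHE_ATTR = "_demo_client_key"  # hypothetical stand-in for RAY_CLIENT_MODE_ATTR


class FakeClientSession:
    """Stand-in for one client connection; its registry dies with it."""

    def __init__(self):
        self._converted = {}

    def convert(self, obj):
        key = uuid.uuid4().hex
        self._converted[key] = obj
        return key

    def get_converted(self, key):
        # Raises KeyError if the key was issued by a different session.
        return self._converted[key]


def convert_through(session, obj):
    # Same shape as the lookup quoted above: the key is cached on the object,
    # but nothing ties it to the session that issued it.
    key = getattr(obj, CACHE_ATTR, None)
    if key is None:
        key = session.convert(obj)
        setattr(obj, CACHE_ATTR, key)
    return session.get_converted(key)


class Worker:
    pass


first = FakeClientSession()
assert convert_through(first, Worker) is Worker  # first connection works

second = FakeClientSession()  # simulates disconnect + reconnect
try:
    convert_through(second, Worker)
    stale_key_failed = False
except KeyError:
    # The key cached during the first session is unknown to the second one.
    stale_key_failed = True
```

In this sketch the second call fails because Worker still carries the key from the first session, which mirrors the KeyError in the report.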

Ray 2.0.0.dev0
MacOS

Reproduction (REQUIRED)

import os
import shlex
import subprocess

import ray

ray_client_port = 31300
# Start a Ray client server in the background and connect to it.
subprocess.Popen(shlex.split(f'python -m ray.util.client.server --port {ray_client_port}'))
ray.util.connect(f'0.0.0.0:{ray_client_port}')
from horovod.ray import RayExecutor
ray_executor = RayExecutor(RayExecutor.create_settings(timeout_s=300))
ray_executor.start()
ray_executor.shutdown()
# Tear everything down again.
ray.util.disconnect()
os.system('ray stop')

Running the same code a second time in the same Python process fails at ray_executor.start():

ray_client_port = 31300
subprocess.Popen(shlex.split(f'python -m ray.util.client.server --port {ray_client_port}'))
ray.util.connect(f'0.0.0.0:{ray_client_port}')
from horovod.ray import RayExecutor
ray_executor = RayExecutor(RayExecutor.create_settings(timeout_s=300))
ray_executor.start()
ray_executor.shutdown()
ray.util.disconnect()
os.system('ray stop')

The error message looks like this:

  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 379, in start
    self.workers = self._create_workers(resources_per_host())
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 321, in _create_workers
    for node_rank in range(self.num_hosts)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/horovod/ray/runner.py", line 321, in <listcomp>
    for node_rank in range(self.num_hosts)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/actor.py", line 476, in remote
    override_environment_variables))
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/actor.py", line 587, in _remote
    override_environment_variables))
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/_private/client_mode_hook.py", line 90, in client_mode_convert_actor
    client_actor = ray._get_converted(key)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/util/client/api.py", line 284, in _get_converted
    return self.worker._get_converted(key)
  File "/Users/yud/repo/wkspace/ml-code-ws/env/py369/lib/python3.6/site-packages/ray/util/client/worker.py", line 507, in _get_converted
    return self._converted[key]
KeyError: 'c829a8fc5fd543c2b20078d28a469ea6'
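
Until this is fixed in Ray, a test suite could work around the stale key by deleting the cached attribute between runs. The helper below is a sketch under assumptions, not a confirmed fix: clear_cached_client_key is a hypothetical name, and the demo uses a dummy class and a made-up attribute name rather than real Ray or Horovod objects.

```python
def clear_cached_client_key(obj, attr_name):
    """Delete a cached conversion key from obj; return True if one was removed."""
    try:
        delattr(obj, attr_name)
    except AttributeError:
        # Nothing was cached on this object, so there is nothing to clear.
        return False
    return True


class DummyWorker:
    """Stand-in for the class that ends up carrying the cached key."""


DEMO_ATTR = "_demo_client_key"  # hypothetical; not the real attribute name

setattr(DummyWorker, DEMO_ATTR, "c829a8fc5fd543c2b20078d28a469ea6")
removed_first = clear_cached_client_key(DummyWorker, DEMO_ATTR)
removed_again = clear_cached_client_key(DummyWorker, DEMO_ATTR)
```

In a test teardown this could be called after ray.util.disconnect(), passing the object that carries the key and the value of RAY_CLIENT_MODE_ATTR. Which object that is inside horovod.ray is not shown in this report, so it would need to be checked against the installed version.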

  • [x] I have verified my script runs in a clean environment and reproduces the issue.
  • [x] I have verified the issue also occurs with the latest wheels.
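
One possible direction for a fix on the lookup side is to treat a cached key as valid only if the current connection actually knows it, and to convert again otherwise. The sketch below illustrates that idea with plain dictionaries. It is an assumption about how such a check might look, not Ray's implementation: lookup_or_reconvert and _demo_client_key are hypothetical names, and the registry dict stands in for the client worker's _converted mapping seen in the traceback.

```python
import uuid

CACHE_ATTR = "_demo_client_key"  # hypothetical attribute name


def lookup_or_reconvert(registry, obj):
    """Return the converted entry for obj, re-registering if the cached key is stale."""
    key = getattr(obj, CACHE_ATTR, None)
    if key is None or key not in registry:
        # Either never converted, or converted under an earlier connection.
        key = uuid.uuid4().hex
        registry[key] = obj  # stand-in for the real conversion step
        setattr(obj, CACHE_ATTR, key)
    return registry[key]


class Worker:
    pass


first_registry = {}
lookup_or_reconvert(first_registry, Worker)
first_key = getattr(Worker, CACHE_ATTR)

second_registry = {}  # a fresh connection starts with an empty registry
result = lookup_or_reconvert(second_registry, Worker)
second_key = getattr(Worker, CACHE_ATTR)
```

The extra membership check costs one dictionary lookup per call and removes the need for callers to clear anything on disconnect.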

Labels

P1: Issue that should be fixed within a few weeks
bug: Something that is supposed to be working; but isn't
