Conversation

@robertnishihara
Collaborator

With large clusters, monitor.py runs at very high CPU utilization. Using cProfile, I see that it spends all of its time in the line

clients = ray.global_state.client_table()

This call does N Redis queries, where N is the number of nodes in the cluster, and it is currently made once per heartbeat. With 10N heartbeats per second, that works out to 10N^2 Redis queries per second (400K per second for a 200-node cluster), which prevents the monitor from keeping up with the heartbeats.
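
To put the fix in context, the idea is to issue those N Redis queries once per monitoring round, cache the result in a client-ID-to-IP map, and serve every heartbeat in that round from the cache. Below is a minimal sketch of that approach, not the exact patch; the client record fields ("ClientType", "DBClientID", "AuxAddress") and the return format of client_table() are assumptions based on the snippets quoted later in this conversation.

import ray

def build_local_scheduler_id_to_ip_map():
    # One client_table() call (N Redis queries) per monitoring round,
    # instead of one call per heartbeat.
    id_to_ip = {}
    for node_ip, node_clients in ray.global_state.client_table().items():
        for client in node_clients:
            if client.get("ClientType") == "local_scheduler":
                # AuxAddress looks like "ip:port"; keep only the IP part.
                id_to_ip[client["DBClientID"]] = client["AuxAddress"].split(":")[0]
    return id_to_ip

def ip_for_heartbeat(id_to_ip_map, client_id):
    # Per-heartbeat work is then a dictionary lookup with no Redis round trips.
    return id_to_ip_map.get(client_id)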

As a separate note, the main loop in monitor.py is

ray/python/ray/monitor.py, lines 517 to 526 in 51fdbe3:

while True:
    # Process autoscaling actions
    if self.autoscaler:
        self.autoscaler.update()
    # Record how many dead local schedulers and plasma managers we had
    # at the beginning of this round.
    num_dead_local_schedulers = len(self.dead_local_schedulers)
    num_dead_plasma_managers = len(self.dead_plasma_managers)
    # Process a round of messages.
    self.process_messages()

This code assumes that the call to self.process_messages() will return. However, process_messages runs its own loop that only returns once there are no messages left to process, which on large clusters effectively means it never returns.

To address this, I force process_messages to return after processing a maximum of 10K messages.
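
A minimal sketch of that cap (self.subscribe_client as a redis-py pubsub object and the process_message dispatcher are illustrative names; the actual attribute and method names in monitor.py may differ):

def process_messages(self, max_messages=10000):
    """Process at most max_messages messages from Redis, then return."""
    for _ in range(max_messages):
        message = self.subscribe_client.get_message()
        if message is None:
            # Nothing left in the queue; end this round early.
            return
        self.process_message(message)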

cc @ericl

Contributor

@ericl left a comment

LGTM, nice catch here

redis_port = get_port(args.redis_address)

# Initialize the global state.
ray.global_state._initialize_global_state(redis_ip_address, redis_port)
Contributor

Is this since we don't need to get the client table any more?

Collaborator Author

I'm not 100% sure, but I think that was always redundant because the monitor creates its own global state at

self.state = ray.experimental.state.GlobalState()
self.state._initialize_global_state(redis_address, redis_port)

for ls in local_schedulers:
    if ls["DBClientID"] == client_id:
        ip = ls["AuxAddress"].split(":")[0]
ip = self.local_scheduler_id_to_ip_map.get(client_id, None)
Contributor

nit: None is already the default for get

Collaborator Author

Oh, thanks!

Collaborator Author

Fixed

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4501/
Test PASSed.

@pcmoritz merged commit de3cfa2 into ray-project:master Mar 27, 2018
@pcmoritz deleted the monitorbottleneck branch March 27, 2018 05:30
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4504/
Test PASSed.

royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* commit 'f69cbd35d4e86f2a3c2ace875aaf8166edb69f5d': (64 commits)
  Bump version to 0.4.0. (ray-project#1745)
  Fix monitor.py bottleneck by removing excess Redis queries. (ray-project#1786)
  Convert the ObjectTable implementation to a Log (ray-project#1779)
  Acquire worker lock when importing actor. (ray-project#1783)
  Introduce a log interface for the new GCS (ray-project#1771)
  [tune] Fix linting error (ray-project#1777)
  [tune] Added pbt with keras on cifar10 dataset example (ray-project#1729)
  Add a GCS table for the xray task flatbuffer (ray-project#1775)
  [tune] Change tune resource request syntax to be less confusing (ray-project#1764)
  Remove from X import Y convention in RLlib ES. (ray-project#1774)
  Check if the provider is external before getting the config. (ray-project#1743)
  Request and cancel notifications in the new GCS API (ray-project#1758)
  Fix resource bookkeeping for blocked actor methods. (ray-project#1766)
  Fix bug when connecting another driver in local case. (ray-project#1760)
  Define string prefixes for all tables in the new GCS API (ray-project#1755)
  [rllib] Update RLlib to work with new actor scheduling behavior (ray-project#1754)
  Redirect output of all processes by default. (ray-project#1752)
  Add API for getting total cluster resources. (ray-project#1736)
  Always send actor creation tasks to the global scheduler. (ray-project#1757)
  Print error when actor takes too long to start, and refactor error me… (ray-project#1747)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
#	python/ray/rllib/dqn/dqn_evaluator.py
#	python/ray/rllib/dqn/dqn_replay_evaluator.py
#	python/ray/rllib/optimizers/__init__.py
#	python/ray/rllib/tuned_examples/pong-dqn.yaml