Conversation

@robertnishihara
Collaborator

With large clusters, monitor.py runs at very high CPU utilization. Using cProfile, I see that it spends all of its time in the line

clients = ray.global_state.client_table()

This call does N Redis queries, where N is the number of nodes in the cluster, and it is currently made once per heartbeat. With 10N heartbeats per second, that works out to 10N^2 Redis queries per second (400K per second for a 200-node cluster), which prevents the monitor from keeping up with the heartbeats.
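
To put the fix in context, the idea is to issue those N Redis queries once per monitoring round, cache the result in a client-ID-to-IP map, and serve every heartbeat in that round from the cache. Below is a minimal sketch of that approach, not the exact patch; the client record fields ("ClientType", "DBClientID", "AuxAddress") and the return format of client_table() are assumptions based on the snippets quoted later in this conversation.

import ray

def build_local_scheduler_id_to_ip_map():
    # One client_table() call (N Redis queries) per monitoring round,
    # instead of one call per heartbeat.
    id_to_ip = {}
    for node_ip, node_clients in ray.global_state.client_table().items():
        for client in node_clients:
            if client.get("ClientType") == "local_scheduler":
                # AuxAddress looks like "ip:port"; keep only the IP part.
                id_to_ip[client["DBClientID"]] = client["AuxAddress"].split(":")[0]
    return id_to_ip

def ip_for_heartbeat(id_to_ip_map, client_id):
    # Per-heartbeat work is then a dictionary lookup with no Redis round trips.
    return id_to_ip_map.get(client_id)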

As a separate note, the main loop in monitor.py is

ray/python/ray/monitor.py, lines 517 to 526 in 51fdbe3:

while True:
    # Process autoscaling actions
    if self.autoscaler:
        self.autoscaler.update()
    # Record how many dead local schedulers and plasma managers we had
    # at the beginning of this round.
    num_dead_local_schedulers = len(self.dead_local_schedulers)
    num_dead_plasma_managers = len(self.dead_plasma_managers)
    # Process a round of messages.
    self.process_messages()

This code assumes that the call to self.process_messages() will return. However, process_messages runs its own loop that only returns once there are no messages left to process, which on large clusters effectively means it never returns.

To address this, I force process_messages to return after processing a maximum of 10K messages.
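
A minimal sketch of that cap (self.subscribe_client as a redis-py pubsub object and the process_message dispatcher are illustrative names; the actual attribute and method names in monitor.py may differ):

def process_messages(self, max_messages=10000):
    """Process at most max_messages messages from Redis, then return."""
    for _ in range(max_messages):
        message = self.subscribe_client.get_message()
        if message is None:
            # Nothing left in the queue; end this round early.
            return
        self.process_message(message)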

cc @ericl

Contributor

@ericl left a comment

LGTM, nice catch here

redis_port = get_port(args.redis_address)

# Initialize the global state.
ray.global_state._initialize_global_state(redis_ip_address, redis_port)
Contributor

Is this since we don't need to get the client table any more?

Collaborator Author

I'm not 100% sure, but I think that was always redundant because the monitor creates its own global state at

self.state = ray.experimental.state.GlobalState()
self.state._initialize_global_state(redis_address, redis_port)

for ls in local_schedulers:
    if ls["DBClientID"] == client_id:
        ip = ls["AuxAddress"].split(":")[0]
ip = self.local_scheduler_id_to_ip_map.get(client_id, None)
Contributor

nit: None is already the default for get

Collaborator Author

Oh, thanks!

Collaborator Author

Fixed

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4501/
Test PASSed.

@pcmoritz merged commit de3cfa2 into ray-project:master Mar 27, 2018
@pcmoritz deleted the monitorbottleneck branch March 27, 2018 05:30
@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/4504/
Test PASSed.

royf added a commit to royf/ray that referenced this pull request Apr 22, 2018
* commit 'f69cbd35d4e86f2a3c2ace875aaf8166edb69f5d': (64 commits)
  Bump version to 0.4.0. (ray-project#1745)
  Fix monitor.py bottleneck by removing excess Redis queries. (ray-project#1786)
  Convert the ObjectTable implementation to a Log (ray-project#1779)
  Acquire worker lock when importing actor. (ray-project#1783)
  Introduce a log interface for the new GCS (ray-project#1771)
  [tune] Fix linting error (ray-project#1777)
  [tune] Added pbt with keras on cifar10 dataset example (ray-project#1729)
  Add a GCS table for the xray task flatbuffer (ray-project#1775)
  [tune] Change tune resource request syntax to be less confusing (ray-project#1764)
  Remove from X import Y convention in RLlib ES. (ray-project#1774)
  Check if the provider is external before getting the config. (ray-project#1743)
  Request and cancel notifications in the new GCS API (ray-project#1758)
  Fix resource bookkeeping for blocked actor methods. (ray-project#1766)
  Fix bug when connecting another driver in local case. (ray-project#1760)
  Define string prefixes for all tables in the new GCS API (ray-project#1755)
  [rllib] Update RLlib to work with new actor scheduling behavior (ray-project#1754)
  Redirect output of all processes by default. (ray-project#1752)
  Add API for getting total cluster resources. (ray-project#1736)
  Always send actor creation tasks to the global scheduler. (ray-project#1757)
  Print error when actor takes too long to start, and refactor error me… (ray-project#1747)
  ...

# Conflicts:
#	python/ray/rllib/__init__.py
#	python/ray/rllib/dqn/dqn.py
#	python/ray/rllib/dqn/dqn_evaluator.py
#	python/ray/rllib/dqn/dqn_replay_evaluator.py
#	python/ray/rllib/optimizers/__init__.py
#	python/ray/rllib/tuned_examples/pong-dqn.yaml