Fix monitor.py bottleneck by removing excess Redis queries. #1786
Conversation
ericl left a comment:
LGTM, nice catch here
redis_port = get_port(args.redis_address)

# Initialize the global state.
ray.global_state._initialize_global_state(redis_ip_address, redis_port)
Is this since we don't need to get the client table any more?
I'm not 100% sure, but I think that was always redundant because the monitor creates its own global state at
python/ray/monitor.py, lines 85 to 86 in 51fdbe3:

self.state = ray.experimental.state.GlobalState()
self.state._initialize_global_state(redis_address, redis_port)
python/ray/monitor.py (Outdated)
- for ls in local_schedulers:
-     if ls["DBClientID"] == client_id:
-         ip = ls["AuxAddress"].split(":")[0]
+ ip = self.local_scheduler_id_to_ip_map.get(client_id, None)
nit: None is already the default for get
Oh, thanks!
Fixed
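For reference, the behavior behind the nit above: dict.get already returns None when the key is missing, so passing None explicitly is redundant.

d = {"a": 1}
assert d.get("b") is None          # default is already None
assert d.get("b", None) is None    # equivalent; the explicit None is redundant
assert d.get("b", 0) == 0          # only a non-None default changes the result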
Test PASSed.

Test PASSed.
With large clusters, monitor.py is at very high utilization. Using cProfile, I see that it's spending all of its time in the line at
ray/python/ray/monitor.py
Line 272 in 51fdbe3
This call does N Redis queries, where N is the number of nodes in the cluster. It is currently called once per heartbeat, and there are 10N heartbeats per second, for a total of 10N^2 Redis queries per second (400K for a 200-node cluster). This prevents the monitor from keeping up with all of the heartbeats.
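The fix caches this mapping instead of rebuilding it from Redis on every heartbeat, as in the local_scheduler_id_to_ip_map diff quoted earlier in the conversation. A minimal sketch of that idea, assuming local_schedulers() returns dicts with "DBClientID" and "AuxAddress" keys as in the old loop (the PR's actual refresh logic may differ):

def build_ip_map(local_schedulers):
    # One pass over the client table: map each local scheduler's
    # client ID to the IP portion of its "ip:port" aux address.
    return {
        ls["DBClientID"]: ls["AuxAddress"].split(":")[0]
        for ls in local_schedulers
    }

# Refresh the map occasionally (still N Redis reads, but only when
# refreshed, not on every heartbeat):
#     id_to_ip = build_ip_map(self.state.local_schedulers())
# Per-heartbeat lookups then become plain dict accesses:
#     ip = id_to_ip.get(client_id)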
As a separate note, the main loop in monitor.py is at
ray/python/ray/monitor.py
Lines 517 to 526 in 51fdbe3
This code assumes that the call to self.process_messages() will return. However, process_messages runs its own infinite loop that only returns when there are no more messages left to process, so for large clusters it may never return. To address this, I force process_messages to return after processing a maximum of 10K messages.

cc @ericl
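A minimal sketch of the bounded drain described above, using redis-py's pub/sub get_message(); the subscribe_client and handle_message names are illustrative, not the PR's exact code:

def process_messages(subscribe_client, handle_message, max_messages=10000):
    # Drain pending pub/sub messages, but stop after max_messages so the
    # caller's outer loop regains control even under a constant backlog.
    for _ in range(max_messages):
        message = subscribe_client.get_message()
        if message is None:
            return  # Queue drained; nothing left to process.
        handle_message(message)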