Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync rh to all workers in addition to head node #278

Merged
merged 2 commits into from
Jan 4, 2024
Merged

Conversation

jlewitt1
Copy link
Collaborator

No description provided.

Copy link
Collaborator Author

jlewitt1 commented Dec 28, 2023

Current dependencies on/for this PR:

This stack of pull requests is managed by Graphite.

@jlewitt1 jlewitt1 force-pushed the sync-rh-to-all-nodes branch 4 times, most recently from 50fd384 to 4542418 Compare December 28, 2023 19:27

# Update the cluster config on the cluster
self.save_config_to_cluster()
with ProcessPoolExecutor() as executor:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongreenberg you think this is overkill?

@jlewitt1 jlewitt1 force-pushed the sync-rh-to-all-nodes branch 2 times, most recently from 3b15c58 to 13254f1 Compare December 31, 2023 19:45
runhouse/resources/hardware/cluster.py Show resolved Hide resolved
@@ -842,6 +858,25 @@ def disconnect(self):
if self._rpc_tunnel:
self._rpc_tunnel.stop()

def _sync_across_all_nodes(self, _rh_install_url):
loop = asyncio.new_event_loop()
executor = ProcessPoolExecutor()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to also call: asyncio.set_event_loop(loop)?

According to the documentation:

If there’s need to set this loop as the event loop for the current context, [set_event_loop()](https://python.readthedocs.io/en/latest/library/asyncio-eventloops.html#asyncio.set_event_loop) must be called explicitly.

Source:
https://python.readthedocs.io/en/latest/library/asyncio-eventloops.html#asyncio.AbstractEventLoopPolicy.new_event_loop

If the existing code works, we can probably skip it for now, but it would be good to be aware of it in case of any exceptions or weird behavior that we may face in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, my understanding is that since this loop's usage is confined to this function we don't necessarily need to set it, but if we do then we should also close the loop as you added in your other comment.

Seems like it's the recommended way to do it like this, so I'll update it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great now, thanks for updating!

runhouse/resources/hardware/cluster.py Show resolved Hide resolved
@@ -703,10 +717,12 @@ def restart_server(
logger.info(f"Restarting Runhouse API server on {self.name}.")

if resync_rh:
self._sync_runhouse_to_cluster(_install_url=_rh_install_url)
if not self.address:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a comment on what is the logical condition here e.g. in what cases would self.address be None?

runhouse/resources/hardware/cluster.py Show resolved Hide resolved
@@ -994,13 +1033,23 @@ def run(
if not run_name:
# If not creating a Run then just run the commands via SSH and return
return self._run_commands_with_ssh(
commands, cmd_prefix, stream_logs, port_forward, require_outputs
commands,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should node be first?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

made it optional since we can always default to using the head node's address, do you think that API should be more explicit?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional sounds great to me, thanks

runhouse/resources/hardware/sagemaker_cluster.py Outdated Show resolved Hide resolved
@jlewitt1 jlewitt1 force-pushed the sync-rh-to-all-nodes branch 3 times, most recently from 543f642 to 04ec710 Compare January 3, 2024 11:43
@jlewitt1 jlewitt1 merged commit d7ef339 into main Jan 4, 2024
6 of 9 checks passed
@jlewitt1 jlewitt1 deleted the sync-rh-to-all-nodes branch January 4, 2024 17:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants