-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consolidate periodic loops into one function updating Den and updating autostop. #873
Consolidate periodic loops into one function updating Den and updating autostop. #873
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
9cd3695
to
d53931e
Compare
3cfdc6a
to
20d04a5
Compare
d53931e
to
61bcf59
Compare
20d04a5
to
baeb9bd
Compare
runhouse/servers/cluster_servlet.py
Outdated
print("hello") | ||
await asyncio.sleep(STATUS_CHECK_DELAY) | ||
print("delayed") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ding
runhouse/rns/rns_client.py
Outdated
@@ -659,3 +671,25 @@ def contents(self, name_or_path, full_paths): | |||
return folder(name=name_or_path, path=folder_url).resources( | |||
full_paths=full_paths | |||
) | |||
|
|||
def send_status(self, status: ResourceStatusData, cluster_rns_address: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to catch all other errors so any failures here don't nuke the loop or cluster servlet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be async?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Guess async doesn't rly matter because its in its own thread, but can for consistency
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to catch all other errors so any failures here don't nuke the loop or cluster servlet.
The cluster servlet has everything in a try: except: so it'll catch it
@@ -260,63 +246,45 @@ async def aclear_all_references_to_env_servlet_name(self, env_servlet_name: str) | |||
# Cluster status functions | |||
############################################## | |||
|
|||
async def asend_status_info_to_den(self): | |||
async def aperiodic_status_check(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Funny name
system_cpu_usage: float | ||
system_memory_usage: Dict[str, Any] | ||
system_disk_usage: Dict[str, Any] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does the GPU info go?
def __init__(self): | ||
self._last_activity = time.time() | ||
self._last_register = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we need the last_registered last_activity stuff anymore. We can just set the activity directly in SkyPilot, there's no reason to set it lazily like we did before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah currently we also do set_last_active_time_to_now
in get_env_servlet_name_for_key
. This happens quite often, we don't need to re-set it everytime, and the loop is always running, so we can just do it there.
runhouse/servers/cluster_servlet.py
Outdated
if self.cluster_config.get("den_auth", False) and configs.token: | ||
logger.debug("Creating send_status_info_to_den thread.") | ||
if self.cluster_config.get("den_auth", False): | ||
logger.info("Creating aperiodic_status_check thread.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok this has to be on purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks solid, could be simplified a little and some extra fault tolerance
baeb9bd
to
544fab0
Compare
544fab0
to
65473ec
Compare
autostop.