[ux] cache cluster status of autostop or spot clusters for 2s #4332
Conversation
Thanks @cg505!
sky/backends/backend_utils.py
if 0 <= cluster_status_lock_timeout < time.perf_counter() - start_time:
    logger.debug(
        'Refreshing status: Failed to get the lock for cluster '
        f'{cluster_name!r}. Using the cached status.')
    return record
Maybe I misread something, but doesn't the presence of this if block violate correctness? E.g., if two concurrent requests (named r1, r2) come in, and say r1 acquires the lock and takes a long time to refresh. Shouldn't r2 wait for r1 to complete, especially since `_must_refresh_cluster_status` was evaluated to True previously?
You're correct, but this is the current behavior. It allows us to at least get something back in the case where, e.g., a long-running provision is holding the lock.
I see - can we add a quick comment noting this behavior?
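For illustration, here is a minimal sketch of the fallback pattern under discussion; the lock path, timeout value, and the `get_cached_record`/`query_cloud_and_update_db` helpers are hypothetical stand-ins, not SkyPilot's actual API:

```python
import time

import filelock

CLUSTER_STATUS_LOCK_TIMEOUT = 20  # seconds (hypothetical value)


def refresh_cluster_status(cluster_name: str):
    """Refresh from the cloud, but fall back to the cached record if the
    per-cluster lock stays held past the timeout."""
    record = get_cached_record(cluster_name)  # hypothetical helper
    lock = filelock.FileLock(f'/tmp/.{cluster_name}.status.lock')
    start_time = time.perf_counter()
    while True:
        try:
            # Poll in short slices so we can enforce an overall timeout.
            with lock.acquire(timeout=1):
                return query_cloud_and_update_db(cluster_name)  # hypothetical
        except filelock.Timeout:
            if 0 <= CLUSTER_STATUS_LOCK_TIMEOUT < (time.perf_counter() -
                                                   start_time):
                # Give up waiting and return the possibly-stale cached
                # status instead of blocking behind a long-running
                # operation (e.g. a provision) that holds the lock.
                return record
```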
'select name, launched_at, handle, last_use, status, autostop, '
'metadata, to_down, owner, cluster_hash, storage_mounts_metadata, '
'cluster_ever_up, status_updated_at from clusters '
'order by launched_at desc').fetchall()
Any reason to move away from `*`?
It is robust against column order.
Edit: to expand on this - there's no guarantee on the order of columns that `select *` will give back. For instance, if you and I are both developing features that each add a column, and we then merge both changes, this will break: each of us will have added our own new column before the other's, so the column order in our global state databases will differ.
I've already hit this bug a few times between this change, #4289, and the stuff we were testing yesterday.
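To illustrate the failure mode, a self-contained toy (not SkyPilot's actual schema):

```python
import sqlite3

# Two developers each add a column to the same table, in different orders.
db_a = sqlite3.connect(':memory:')
db_a.execute('create table clusters (name text, status text)')
db_a.execute('alter table clusters add column autostop integer')
db_a.execute('alter table clusters add column status_updated_at integer')

db_b = sqlite3.connect(':memory:')
db_b.execute('create table clusters (name text, status text)')
db_b.execute('alter table clusters add column status_updated_at integer')
db_b.execute('alter table clusters add column autostop integer')

# 'select *' returns columns in table order, which now differs:
print([c[1] for c in db_a.execute('pragma table_info(clusters)')])
# ['name', 'status', 'autostop', 'status_updated_at']
print([c[1] for c in db_b.execute('pragma table_info(clusters)')])
# ['name', 'status', 'status_updated_at', 'autostop']
# Positionally unpacking 'select *' rows would swap the two new columns,
# while an explicit column list is stable across both databases.
```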
See also #4211, same class of issues.
I see. Is backward compatibility still preserved? E.g., if a cluster was launched before this PR and state.db doesn't contain `status_updated_at`, but on upgrading to this branch this line tries to select `status_updated_at`, will that work?
(I think it should still work because `create_table` is called at module initialization, but just want to double check.)
Yes, I believe that should be fine. In fact, even with `select *`, this method would crash if `status_updated_at` was missing, because we would not have enough columns to unpack.
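For reference, a minimal sketch of the module-initialization migration pattern being relied on here; the helper name and connection handling are assumptions, not SkyPilot's exact code:

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str,
                          column: str, decl: str) -> None:
    """Idempotent migration: add the column only if it doesn't exist yet."""
    cols = [row[1] for row in conn.execute(f'pragma table_info({table})')]
    if column not in cols:
        conn.execute(f'alter table {table} add column {column} {decl}')
        conn.commit()


# Hypothetically run at module import, before any query touches the table,
# so an old state.db gains the new column before it is ever selected:
conn = sqlite3.connect('state.db')
conn.execute('create table if not exists clusters (name text primary key)')
add_column_if_missing(conn, 'clusters', 'status_updated_at', 'integer')
```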
Thanks @cg505! Left some minor comments on previous threads, but otherwise lgtm!
Previously, every time we wanted the status of a cluster with spot VMs or with autostop set, we would fetch the latest actual status from the cloud. This is needed because these clusters can be terminated from "outside", so the state in the local state database may be out of date.
However, we often end up fetching the status multiple times in a single invocation. For instance, `sky launch` will check the status in cli.py, then again almost immediately afterwards as part of the provisioning codepath.

To mitigate this, we can keep track of the last time we fetched the status from the cloud. If it is within the past 2 seconds, assume that it's still accurate (that is, the cluster hasn't been terminated/stopped since then).
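A minimal sketch of the idea; the constant and the two helpers are hypothetical names, not this PR's exact code:

```python
import time

_STATUS_CACHE_SECONDS = 2  # hypothetical name for the 2s window


def get_status(cluster_name: str) -> dict:
    record = get_cached_record(cluster_name)  # hypothetical DB helper
    updated_at = record.get('status_updated_at')
    if (updated_at is not None and
            time.time() - updated_at < _STATUS_CACHE_SECONDS):
        # Fetched from the cloud within the last 2s: assume no external
        # stop/termination has happened since, and skip the cloud query.
        return record
    # Note: this check and the later timestamp update are not atomic, so
    # parallel invocations may all decide to refresh (see caveat below).
    return query_cloud_and_update_db(cluster_name)  # hypothetical
```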
Caveats:
- The updated timestamp check/set is not atomic, so if multiple parallel invocations check the status, they may all see that it is out of date, and then all try to refresh the status.

Performance impact:
- `sky launch --fast` skips one status check, saving ~3s (from ~10.5s to ~7.5s if the cluster is already up).
- `sky jobs launch --fast` is the same, but this will mitigate the performance hit from "make --fast robust against credential or wheel updates" (#4289).

Tested (run the relevant ones):
- `bash format.sh`
- `sky launch --fast` on many autostop clusters to try and make it fail
- `pytest tests/test_smoke.py`
- `pytest tests/test_smoke.py --managed-jobs`
- `conda deactivate; bash -i tests/backward_compatibility_tests.sh`