[ws-manager] Provide ordering of status updates #5223
Conversation
Looks like we're still lacking the processing of the new `status_version` field.
Codecov Report

```diff
@@            Coverage Diff             @@
##             main    #5223       +/-   ##
===========================================
+ Coverage   19.04%   39.42%   +20.37%
===========================================
  Files           2       13       +11
  Lines         168     3716     +3548
===========================================
+ Hits           32     1465     +1433
- Misses        134     2129     +1995
- Partials        2      122      +120
```
Indeed, we do not. This change just introduces the value. Once we have it we can observe its behaviour and eventually introduce logic based on it.
/werft run 👍 started the job as gitpod-build-cw-wsstatus-version.4
/werft run 👍 started the job as gitpod-build-cw-wsstatus-version.5
/lgtm
LGTM label has been added. Git tree hash: ad434b3d9fd7d54f48615e9356afdce4d36133d5
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: aledbf. Associated issue requirement bypassed by: aledbf.
This PR introduces a `status_version` field on the workspace status which imposes a partial order on status updates. Using this field we can determine which update "came first", rather than having to rely on heuristics or best-effort ordering. Due to the prior lack of such an order, we have seen otherwise functional workspaces revert back to a non-running state (e.g. "Opening IDE").

More formally speaking, this PR introduces `status_version` so that for two status updates `s1` and `s2`, `s1` was a status before `s2` if `s1.statusVersion < s2.statusVersion`.
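As a purely illustrative, hypothetical sketch of how a consumer of these updates could use the field (none of the names below come from this PR), dropping every update whose `status_version` is not strictly greater than the one we already hold yields exactly this ordering:

```go
package main

import "fmt"

// statusUpdate is a stand-in for the ws-manager workspace status message;
// only the field relevant here is shown (the proto field is status_version).
type statusUpdate struct {
	StatusVersion uint64
	Phase         string
}

// store keeps the most recent update per workspace.
type store struct {
	byWorkspace map[string]statusUpdate
}

// apply returns true if the update was accepted, false if it was stale.
func (s *store) apply(workspaceID string, u statusUpdate) bool {
	cur, seen := s.byWorkspace[workspaceID]
	if seen && u.StatusVersion <= cur.StatusVersion {
		// Not newer than what we already have - ignore it, so the workspace
		// cannot "revert" to an older phase such as "Opening IDE".
		return false
	}
	s.byWorkspace[workspaceID] = u
	return true
}

func main() {
	s := &store{byWorkspace: map[string]statusUpdate{}}
	fmt.Println(s.apply("ws-1", statusUpdate{StatusVersion: 2, Phase: "running"}))  // true
	fmt.Println(s.apply("ws-1", statusUpdate{StatusVersion: 1, Phase: "creating"})) // false (stale)
}
```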
How could this be solved?
We have several options available for producing such a field. One option is to derive it from the Kubernetes `resourceVersion`. However, there are a number of issues with this approach. For one, the set of objects is not stable, e.g. when ports are opened a new service may be created. Some variant of a vector clock, where we consider the Kubernetes objects as the processes, might be used - if there is such a thing as a "dynamic vector clock". However, the Kubernetes resource version must not be interpreted (it comes from etcd/postgresql) - it does not even impose an order, but merely tells us whether two versions differ.

How do "hybrid logical clocks" (HLC) work?
HLCs work around the limitations of real wall clocks by throwing a logical clock into the mix, which steps in when the wall-clock time does funny things, e.g. runs backwards. Whenever the real time moves backwards we rely on the logical clock for each status update until the real clock has caught up. Wall-clock time and logical time are encoded in a single `uint64` as described in the paper linked above, with 48 bits for the real time and 16 bits for the logical time.
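A minimal sketch of that encoding, assuming the wall-clock part is kept in milliseconds (the unit used by the actual implementation is not stated here) - `encode` and `decode` are illustrative helpers, not ws-manager code:

```go
package main

import (
	"fmt"
	"time"
)

// Bit layout as described above: the upper 48 bits carry wall-clock time,
// the lower 16 bits carry the logical counter.
const logicalBits = 16

// encode packs wall-clock time and the logical counter into a single uint64.
func encode(wallMillis uint64, logical uint16) uint64 {
	return wallMillis<<logicalBits | uint64(logical)
}

// decode splits a status version back into its wall-clock and logical parts.
func decode(v uint64) (wallMillis uint64, logical uint16) {
	return v >> logicalBits, uint16(v & 0xffff)
}

func main() {
	now := uint64(time.Now().UnixMilli())
	v := encode(now, 3)
	wall, logical := decode(v)
	fmt.Println(wall == now, logical) // true 3
}
```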
What happens when ws-manager restarts and lands on a node with a different time?

In that case it could happen that causally newer status updates have a "lower" `status_version`. This would violate the order we want to impose. Suppose, however, that the wall-clock times of all machines ws-manager could run on are guaranteed (or at least very likely) not to differ by more than `eps`; then waiting for `eps` prior to producing new updates is sufficient to guarantee that the new status updates will have a newer status version. In GKE, machines are time-synced using Google's NTP servers and can hence be expected to drift less than 1 second apart (see NTP performance). Hence, waiting for `eps = 2 seconds` should be plenty to avoid this issue.
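A sketch of that start-up behaviour, continuing the illustrative code above (the constant value mirrors the `eps = 2 seconds` from the text; the names are made up):

```go
// clockSkewEps is the assumed maximum wall-clock skew between nodes that
// ws-manager could be scheduled on.
const clockSkewEps = 2 * time.Second

// waitOutClockSkew would run once on start-up, before the first status update
// is produced, so that every update emitted after a restart carries a larger
// wall-clock component than anything emitted before it.
func waitOutClockSkew() {
	time.Sleep(clockSkewEps)
}
```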
What happens if for some reason the wall clock jumped ahead or ran backwards too far?

We spend 16 bits of the HLC on the logical clock, hence can support 65535 "ticks" until the real wall time has recovered. In a workspace cluster with 1000 workspaces, each producing 10 events per minute (port opening, start, stop, prebuilds, etc.), we would consume the logical clock in about 6.5 minutes. Once the logical clock is "consumed", i.e. would overflow, the implementation panics to avoid breaking the order guarantee we wish to impose. In that case, ws-manager would start afresh with a newly initialized wall-clock time. Whether we can maintain the order guarantee then is not certain, but it is likely (see the point above).
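Continuing the sketch above (reusing its `encode` helper), the tick behaviour described here - the logical counter stepping in while the wall clock has not advanced, and a panic when those 16 bits would overflow - could look roughly like this; again a hypothetical illustration, not the actual implementation:

```go
// hlc hands out monotonically increasing status versions.
type hlc struct {
	lastWall uint64 // wall-clock part of the last version, in milliseconds
	logical  uint16 // logical counter used while the wall clock is not advancing
}

// tick returns the next status version.
func (c *hlc) tick() uint64 {
	wall := uint64(time.Now().UnixMilli())
	if wall > c.lastWall {
		// The wall clock moved forward: reset the logical counter.
		c.lastWall, c.logical = wall, 0
		return encode(wall, 0)
	}
	// The wall clock stalled or ran backwards: rely on the logical counter.
	if c.logical == 0xffff {
		// An overflow would break the ordering guarantee - better to crash
		// and restart with a fresh wall-clock time.
		panic("status version: logical clock overflow")
	}
	c.logical++
	return encode(c.lastWall, c.logical)
}
```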
Wait, shouldn't I be on vacation?
Yes, but I really like this stuff.