
[ws-manager] Provide ordering of status updates #5223

Merged: 2 commits merged into main from cw/wsstatus-version on Sep 29, 2021
Conversation

@csweichel (Contributor) commented Aug 16, 2021

This PR introduces a status_version field on the workspace status which imposes a partial order on status updates. Using this field we can determine which update "came first", rather than having to rely on heuristics or best-effort ordering. Due to the prior lack of such an order, we have seen otherwise functional workspaces revert to a non-running state (e.g. "Opening IDE").

More formally, this PR introduces status_version such that for two status updates s1 and s2, s1 happened before s2 if s1.statusVersion < s2.statusVersion.
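For illustration only (none of the code or names below come from this PR), a consumer of these updates merely needs to remember the highest statusVersion it has seen per workspace and ignore anything older. A minimal sketch in Go:

```go
// Hypothetical consumer-side use of statusVersion (illustrative only).
package main

import "fmt"

// WorkspaceStatus is a pared-down stand-in for the real status message.
type WorkspaceStatus struct {
	ID            string
	StatusVersion uint64
	Phase         string
}

// statusTracker keeps the highest statusVersion observed per workspace.
type statusTracker struct {
	latest map[string]uint64
}

// Accept reports whether the update is newer than anything seen so far;
// stale or duplicate updates are dropped instead of overwriting newer state.
func (t *statusTracker) Accept(s WorkspaceStatus) bool {
	if s.StatusVersion <= t.latest[s.ID] {
		return false
	}
	t.latest[s.ID] = s.StatusVersion
	return true
}

func main() {
	t := &statusTracker{latest: map[string]uint64{}}
	fmt.Println(t.Accept(WorkspaceStatus{ID: "ws1", StatusVersion: 2, Phase: "running"}))      // true
	fmt.Println(t.Accept(WorkspaceStatus{ID: "ws1", StatusVersion: 1, Phase: "initializing"})) // false: arrived late
}
```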

How could this be solved?

We have several options available for producing such a field:

  • real wall time, which looks like the obvious choice because we're only producing status updates from a single process. However, real wall time is a tricky beast: it can go backwards (e.g. after an NTP re-sync), can run slower or faster (depending on the machine's quartz or NTP clock skew), and can differ between machines by up to 1 second even when they're synced to the same NTP server.
  • logical clocks/Lamport clocks are distributed systems 101 (see the paper or the explanations here and here). However, we have no high-frequency state storage available in ws-manager: we cannot update a Kubernetes object every time we receive a status update, as that would quickly exhaust the k8s API rate limits - it certainly would not scale. Introducing local state just for this feature seems like overkill as well.
  • logical clocks derived from our existing state seem like a good idea, especially because Kubernetes already offers a resourceVersion. However, there are a number of issues with this approach. For one, the set of objects is not stable, e.g. when ports are opened a new service may be created. Some variant of a vector clock that treats the Kubernetes objects as the processes might work - if there is such a thing as a "dynamic vector clock". However, Kubernetes resource versions must not be interpreted (they come from etcd/postgresql) - they do not even impose an order, but merely indicate whether two versions differ.
  • hybrid logical clocks mix real-world time with logical clocks. While we do not (yet?) have different processes to consider, we can use the guarantees afforded by even a naive implementation to overcome the limitations of real-world and logical clocks used independently.

How do "hybrid logical clocks" (HLC) work?

HLCs work around the limitations of real wall clocks by throwing a logical clock into the mix, which steps in when the wall time does funny things, e.g. runs backwards. Whenever the real time moves backwards we rely on the logical clock for each status update until the real clock has caught up. Wall time and logical time are encoded in a single uint64 as described in the paper linked above, with 48 bits for the wall time and 16 bits for the logical time.
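As a rough sketch of what such a clock can look like in Go - this is not a copy of the implementation in this PR (which, per the Codecov report below, lives in components/ws-manager/pkg/clock/clock.go) and only illustrates the packing and tick logic described above:

```go
// Minimal hybrid logical clock sketch: the upper 48 bits of each timestamp
// carry wall-clock milliseconds, the lower 16 bits a logical counter.
package clock

import (
	"sync"
	"time"
)

type HLC struct {
	mu   sync.Mutex
	last uint64 // last timestamp handed out
}

// physicalNow returns the current wall time in milliseconds, truncated to 48 bits.
func physicalNow() uint64 {
	return uint64(time.Now().UnixMilli()) & (1<<48 - 1)
}

// Tick returns the next timestamp. Timestamps are strictly increasing even if
// the wall clock stalls or runs backwards, because the logical counter takes over.
func (c *HLC) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()

	if pt := physicalNow() << 16; pt > c.last {
		// wall time has moved past everything we have issued: logical part resets to 0
		c.last = pt
		return c.last
	}
	if c.last&0xFFFF == 0xFFFF {
		// the 16-bit logical counter would overflow into the wall-time bits;
		// panicking here mirrors the exhaustion behaviour described further below
		panic("hlc: logical clock overflow")
	}
	c.last++
	return c.last
}
```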

What happens when ws-manager restarts and lands on a node with a different time?

In that case it could happen that causally newer status updates have a "lower" status_version, which would violate the order we want to impose. Suppose, however, that the wall times of all machines ws-manager could run on are guaranteed (or at least very likely) not to differ by more than eps; then waiting for eps before producing new updates is sufficient to guarantee that the new status updates carry a newer status version. In GKE, machines are time-synced using Google's NTP servers and can hence be expected to drift less than 1 second apart (see NTP performance). Waiting for eps = 2 seconds should therefore be more than enough to avoid this issue.
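A sketch of the resulting startup guard; the names maxSkew and WaitForSkew are made up for this example and not taken from the PR:

```go
package clock

import "time"

// maxSkew is the assumed upper bound (eps) on wall-clock drift between the
// nodes ws-manager can be scheduled on; the text above argues 2s is plenty on GKE.
const maxSkew = 2 * time.Second

// WaitForSkew blocks for eps before any new status updates are produced after a
// restart. If this node's clock is at most eps behind the node the previous
// instance ran on, its wall time afterwards exceeds any timestamp that instance
// could have issued, so new status versions sort strictly later.
func WaitForSkew() {
	time.Sleep(maxSkew)
}
```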

What happens if for some reason the real-wall clock jumped ahead/ran backwards too far?

We spend 16 bits of the HLC on the logical clock, hence can support 65535 "ticks" until the real wall time has recovered. In a workspace cluster with 1000 workspaces, each producing 10 events per minute (port opening, start, stop, prebuilds, etc.), we would consume the logical clock in about 6.5 minutes. Once the logical clock is "consumed", i.e. would overflow, the implementation panics to avoid breaking the order guarantee we wish to impose. In that case, ws-manager would start afresh with a newly initialized wall time. Whether we can maintain the order guarantee then is not certain, but likely (see the point above).
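For reference, the roughly 6.5-minute figure falls straight out of the stated assumptions:

```go
// Back-of-envelope check of the logical-clock budget quoted above.
package main

import "fmt"

func main() {
	const logicalTicks = 1<<16 - 1    // 65535 ticks in the 16-bit logical clock
	const eventsPerMinute = 1000 * 10 // 1000 workspaces x 10 events/minute each

	// ≈ 6.55 minutes before the logical clock would overflow if wall time never recovers
	fmt.Printf("%.2f minutes\n", float64(logicalTicks)/eventsPerMinute)
}
```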

Wait, shouldn't I be on vacation?

Yes, but I really like this stuff.

@JanKoehnlein (Contributor) commented:

Looks like we're still lacking the processing of the new statusVersion on the ws-man-bridge side, or am I missing something?

@csweichel force-pushed the cw/wsstatus-version branch from 68068b1 to 9873235 on August 23, 2021 14:08
@codecov (bot) commented Aug 23, 2021

Codecov Report

Merging #5223 (f26a1bb) into main (8a0195a) will increase coverage by 20.37%.
The diff coverage is 66.10%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #5223       +/-   ##
===========================================
+ Coverage   19.04%   39.42%   +20.37%     
===========================================
  Files           2       13       +11     
  Lines         168     3716     +3548     
===========================================
+ Hits           32     1465     +1433     
- Misses        134     2129     +1995     
- Partials        2      122      +120     
Flag                               Coverage Δ
components-local-app-app-linux     ?
components-local-app-app-windows   ?
components-ws-manager-app          39.42% <66.10%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
components/ws-manager/pkg/manager/metrics.go            11.11% <0.00%> (ø)
components/ws-manager/pkg/clock/clock.go                68.62% <68.62%> (ø)
components/ws-manager/pkg/manager/manager.go            25.35% <100.00%> (ø)
components/ws-manager/pkg/manager/status.go             73.37% <100.00%> (ø)
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go
components/ws-manager/pkg/manager/annotations.go        65.11% <0.00%> (ø)
components/ws-manager/pkg/manager/create.go             78.83% <0.00%> (ø)
...-manager/pkg/manager/internal/workpool/workpool.go   100.00% <0.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a0195a...f26a1bb.

@csweichel (Contributor, Author) commented:
> Looks like we're still lacking the processing of the new statusVersion on the ws-man-bridge side, or am I missing something?

Indeed, we do not. This change just introduces the value. Once we have it we can observe its behaviour and eventually introduce logic based on it.

@stale (bot) commented Sep 28, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "meta: stale" label Sep 28, 2021
@stale stale bot removed the "meta: stale" label Sep 29, 2021
@aledbf (Member) commented Sep 29, 2021

/werft run

👍 started the job as gitpod-build-cw-wsstatus-version.4

@aledbf (Member) commented Sep 29, 2021

/werft run

👍 started the job as gitpod-build-cw-wsstatus-version.5

@aledbf (Member) commented Sep 29, 2021

/lgtm
/approve no-issue

@roboquat (Contributor) commented:
LGTM label has been added.

Git tree hash: ad434b3d9fd7d54f48615e9356afdce4d36133d5

@roboquat (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf

Associated issue requirement bypassed by: aledbf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat merged commit a89702b into main Sep 29, 2021
@roboquat roboquat deleted the cw/wsstatus-version branch September 29, 2021 10:37