
[ws-manager] Provide ordering of status updates #5223

Merged: 2 commits merged into main from cw/wsstatus-version on Sep 29, 2021
Conversation

@csweichel (Contributor) commented Aug 16, 2021

This PR introduces a status_version field on the workspace status which imposes a partial order on status updates. Using this field we can determine which update "came first", rather than having to rely on heuristics or best-effort ordering. Due to the prior lack of such an order, we have seen otherwise functional workspaces revert to a non-running state (e.g. "Opening IDE").

More formally, this PR introduces status_version such that for two status updates s1 and s2, s1 happened before s2 if s1.statusVersion < s2.statusVersion.
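For illustration only (none of the code or names below come from this PR), a consumer of these updates merely needs to remember the highest statusVersion it has seen per workspace and ignore anything older. A minimal sketch in Go:

```go
// Hypothetical consumer-side use of statusVersion (illustrative only).
package main

import "fmt"

// WorkspaceStatus is a pared-down stand-in for the real status message.
type WorkspaceStatus struct {
	ID            string
	StatusVersion uint64
	Phase         string
}

// statusTracker keeps the highest statusVersion observed per workspace.
type statusTracker struct {
	latest map[string]uint64
}

// Accept reports whether the update is newer than anything seen so far;
// stale or duplicate updates are dropped instead of overwriting newer state.
func (t *statusTracker) Accept(s WorkspaceStatus) bool {
	if s.StatusVersion <= t.latest[s.ID] {
		return false
	}
	t.latest[s.ID] = s.StatusVersion
	return true
}

func main() {
	t := &statusTracker{latest: map[string]uint64{}}
	fmt.Println(t.Accept(WorkspaceStatus{ID: "ws1", StatusVersion: 2, Phase: "running"}))      // true
	fmt.Println(t.Accept(WorkspaceStatus{ID: "ws1", StatusVersion: 1, Phase: "initializing"})) // false: arrived late
}
```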

How could this be solved?

We have several options available for producing such a field:

  • real wall time, which looks like the obvious choice because we're only producing status updates from a single process. However, real wall time is a tricky beast: it can go backwards (e.g. after an NTP re-sync), can run slower or faster (depending on the machine's quartz or NTP clock skew), and can differ between machines by up to 1 second even when they're synced to the same NTP server.
  • logical clocks/Lamport clocks are distributed systems 101 (see the paper or the explanations here and here). However, we have no high-frequency state storage available in ws-manager: we cannot update a Kubernetes object every time we receive a status update, as that would quickly exhaust the k8s API rate limits - it certainly would not scale. Introducing local state just for this feature seems like overkill as well.
  • logical clocks derived from our existing state seem like a good idea, especially because Kubernetes already offers a resourceVersion. However, there are a number of issues with this approach. For one, the set of objects is not stable, e.g. when ports are opened a new service may be created. Some variant of a vector clock that treats the Kubernetes objects as the processes might work - if there is such a thing as a "dynamic vector clock". However, Kubernetes resource versions must not be interpreted (they come from etcd/postgresql) - they do not even impose an order, but merely indicate whether two versions differ.
  • hybrid logical clocks mix real-world time with logical clocks. While we do not (yet?) have different processes to consider, we can use the guarantees afforded by even a naive implementation to overcome the limitations of real-world and logical clocks used independently.

How do "hybrid logical clocks" (HLC) work?

HLCs work around the limitations of real wall clocks by throwing a logical clock into the mix, which steps in when the wall time does funny things, e.g. runs backwards. Whenever the real time moves backwards we rely on the logical clock for each status update until the real clock has caught up. Wall time and logical time are encoded in a single uint64 as described in the paper linked above, with 48 bits for the wall time and 16 bits for the logical time.
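As a rough sketch of what such a clock can look like in Go - this is not a copy of the implementation in this PR (which, per the Codecov report below, lives in components/ws-manager/pkg/clock/clock.go) and only illustrates the packing and tick logic described above:

```go
// Minimal hybrid logical clock sketch: the upper 48 bits of each timestamp
// carry wall-clock milliseconds, the lower 16 bits a logical counter.
package clock

import (
	"sync"
	"time"
)

type HLC struct {
	mu   sync.Mutex
	last uint64 // last timestamp handed out
}

// physicalNow returns the current wall time in milliseconds, truncated to 48 bits.
func physicalNow() uint64 {
	return uint64(time.Now().UnixMilli()) & (1<<48 - 1)
}

// Tick returns the next timestamp. Timestamps are strictly increasing even if
// the wall clock stalls or runs backwards, because the logical counter takes over.
func (c *HLC) Tick() uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()

	if pt := physicalNow() << 16; pt > c.last {
		// wall time has moved past everything we have issued: logical part resets to 0
		c.last = pt
		return c.last
	}
	if c.last&0xFFFF == 0xFFFF {
		// the 16-bit logical counter would overflow into the wall-time bits;
		// panicking here mirrors the exhaustion behaviour described further below
		panic("hlc: logical clock overflow")
	}
	c.last++
	return c.last
}
```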

What happens when ws-manager restarts and lands on a node with a different time?

In that case it could happen that causally newer status updates have a "lower" status_version, which would violate the order we want to impose. Suppose, however, that the wall times of all machines ws-manager could run on are guaranteed (or at least very likely) not to differ by more than eps; then waiting for eps before producing new updates is sufficient to guarantee that the new status updates carry a newer status version. In GKE, machines are time-synced using Google's NTP servers and can hence be expected to drift less than 1 second apart (see NTP performance). Waiting for eps = 2 seconds should therefore be more than enough to avoid this issue.
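A sketch of the resulting startup guard; the names maxSkew and WaitForSkew are made up for this example and not taken from the PR:

```go
package clock

import "time"

// maxSkew is the assumed upper bound (eps) on wall-clock drift between the
// nodes ws-manager can be scheduled on; the text above argues 2s is plenty on GKE.
const maxSkew = 2 * time.Second

// WaitForSkew blocks for eps before any new status updates are produced after a
// restart. If this node's clock is at most eps behind the node the previous
// instance ran on, its wall time afterwards exceeds any timestamp that instance
// could have issued, so new status versions sort strictly later.
func WaitForSkew() {
	time.Sleep(maxSkew)
}
```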

What happens if for some reason the real-wall clock jumped ahead/ran backwards too far?

We spend 16 bits of the HLC on the logical clock, hence can support 65535 "ticks" until the real wall time has recovered. In a workspace cluster with 1000 workspaces, each producing 10 events per minute (port opening, start, stop, prebuilds, etc.), we would consume the logical clock in about 6.5 minutes. Once the logical clock is "consumed", i.e. would overflow, the implementation panics to avoid breaking the order guarantee we wish to impose. In that case, ws-manager would start afresh with a newly initialized wall time. Whether we can maintain the order guarantee then is not certain, but likely (see the point above).
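For reference, the roughly 6.5-minute figure falls straight out of the stated assumptions:

```go
// Back-of-envelope check of the logical-clock budget quoted above.
package main

import "fmt"

func main() {
	const logicalTicks = 1<<16 - 1    // 65535 ticks in the 16-bit logical clock
	const eventsPerMinute = 1000 * 10 // 1000 workspaces x 10 events/minute each

	// ≈ 6.55 minutes before the logical clock would overflow if wall time never recovers
	fmt.Printf("%.2f minutes\n", float64(logicalTicks)/eventsPerMinute)
}
```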

Wait, shouldn't I be on vacation?

Yes, but I really like this stuff.

@JanKoehnlein (Contributor) commented:

Looks like we're still lacking the processing of the new statusVersion on the ws-man-bridge side, or am I missing something?

@csweichel force-pushed the cw/wsstatus-version branch from 68068b1 to 9873235 on August 23, 2021 14:08
@codecov (bot) commented Aug 23, 2021

Codecov Report

Merging #5223 (f26a1bb) into main (8a0195a) will increase coverage by 20.37%.
The diff coverage is 66.10%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main    #5223       +/-   ##
===========================================
+ Coverage   19.04%   39.42%   +20.37%     
===========================================
  Files           2       13       +11     
  Lines         168     3716     +3548     
===========================================
+ Hits           32     1465     +1433     
- Misses        134     2129     +1995     
- Partials        2      122      +120     
Flag                               Coverage Δ
components-local-app-app-linux     ?
components-local-app-app-windows   ?
components-ws-manager-app          39.42% <66.10%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files                                          Coverage Δ
components/ws-manager/pkg/manager/metrics.go            11.11% <0.00%> (ø)
components/ws-manager/pkg/clock/clock.go                68.62% <68.62%> (ø)
components/ws-manager/pkg/manager/manager.go            25.35% <100.00%> (ø)
components/ws-manager/pkg/manager/status.go             73.37% <100.00%> (ø)
components/local-app/pkg/auth/auth.go
components/local-app/pkg/auth/pkce.go
components/ws-manager/pkg/manager/annotations.go        65.11% <0.00%> (ø)
components/ws-manager/pkg/manager/create.go             78.83% <0.00%> (ø)
...-manager/pkg/manager/internal/workpool/workpool.go   100.00% <0.00%> (ø)
... and 8 more

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a0195a...f26a1bb.

@csweichel (Contributor, Author) commented:
> Looks like we're still lacking the processing of the new statusVersion on the ws-man-bridge side, or am I missing something?

Indeed, we do not. This change just introduces the value. Once we have it we can observe its behaviour and eventually introduce logic based on it.

@stale (bot) commented Sep 28, 2021

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the "meta: stale" label Sep 28, 2021
@stale stale bot removed the "meta: stale" label Sep 29, 2021
@aledbf (Member) commented Sep 29, 2021

/werft run

👍 started the job as gitpod-build-cw-wsstatus-version.4

@aledbf (Member) commented Sep 29, 2021

/werft run

👍 started the job as gitpod-build-cw-wsstatus-version.5

@aledbf (Member) commented Sep 29, 2021

/lgtm
/approve no-issue

@roboquat (Contributor) commented:
LGTM label has been added.

Git tree hash: ad434b3d9fd7d54f48615e9356afdce4d36133d5

@roboquat (Contributor) commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf

Associated issue requirement bypassed by: aledbf

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@roboquat roboquat merged commit a89702b into main Sep 29, 2021
@roboquat roboquat deleted the cw/wsstatus-version branch September 29, 2021 10:37