Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure stateOK is reported only when all components have sent updates #11249

Merged
merged 34 commits into from
Apr 7, 2022

Conversation

vitorenesduarte
Copy link
Contributor

@vitorenesduarte vitorenesduarte commented Mar 18, 2022

Fixes #11065.

This PR:

Since it seems difficult to know when TeleportProcess.registerTeleportReadyEvent should be updated, with the goal of quickly detecting a bug when it's introduced we have that:

  1. if componentCount is lower than it should, then the service fails to start (due to Throw startup error if TeleportReadyEvent is not emitted #11725)
  2. if componentCount is higher than it should, then an error is logged in function processState.getStateLocked.

Testing

None.

@vitorenesduarte vitorenesduarte marked this pull request as ready for review March 18, 2022 13:02
@vitorenesduarte
Copy link
Contributor Author

vitorenesduarte commented Mar 18, 2022

@vitorenesduarte I'm not certain, but does process.ServiceCount() work?

Thanks! I initially tried this but process.ServiceCount() is not exactly what's needed here. There are more services than components (e.g. the auth component has 5 services: "auth.tls", "auth.heartbeat.broadcast", "auth.broadcast", "auth.heartbeat", and "auth.shutdown").

@russjones russjones added the cloud Cloud label Mar 18, 2022
@vitorenesduarte vitorenesduarte self-assigned this Mar 21, 2022
if proxyConfig.Enabled {
componentCount++
}
if proxyConfig.Kube.Enabled && !proxyConfig.Kube.ListenAddr.IsEmpty() && !proxyConfig.DisableReverseTunnel {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the proxyConfig.DisableReverseTunnel impact kubernetes_service ?

Also I think that there are some cases where proxyConfig.Kube.ListenAddr is empty but kube service should be started

For example when teleport is stared as kube agent:

teleport:
    kubernetes_service:
    enabled: true
    kubeconfig_file: /path/to/kubeconfig

    auth_service:
      enabled: false
    
    ssh_service:
      enabled: false
    
    proxy_service:
      enabled: false
    
    app_service:
      enabled: false
    
    db_service:
      enabled: false

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the proxyConfig.DisableReverseTunnel impact kubernetes_service ?

I'm not familiar enough with Teleport to able to answer this.

I came up with if proxyConfig.Kube.Enabled && !proxyConfig.Kube.ListenAddr.IsEmpty() && !proxyConfig.DisableReverseTunnel { by putting the following two conditions together:

if listeners.kube != nil && !process.Config.Proxy.DisableReverseTunnel {

if cfg.Proxy.Kube.Enabled && !cfg.Proxy.Kube.ListenAddr.IsEmpty() {
process.log.Debugf("Setup Proxy: turning on Kubernetes proxy.")
listener, err := process.importOrCreateListener(listenerProxyKube, cfg.Proxy.Kube.ListenAddr.Addr)
if err != nil {
return nil, trace.Wrap(err)
}
listeners.kube = listener
}

ComponentCount counts the number of components that send heartbeats. When the condition above holds, it seems that the proxy kube component will be started and will send heartbeats.

Also I think that there are some cases where proxyConfig.Kube.ListenAddr is empty but kube service should be started

For example when teleport is stared as kube agent:

teleport:
    kubernetes_service:
    enabled: true
    kubeconfig_file: /path/to/kubeconfig

    auth_service:
      enabled: false
    
    ssh_service:
      enabled: false
    
    proxy_service:
      enabled: false
    
    app_service:
      enabled: false
    
    db_service:
      enabled: false

I think that in this case this function will correctly return 1 due to this piece of code below:

	if cfg.Kube.Enabled {
		componentCount++
	}

process *TeleportProcess
mu sync.Mutex
states map[string]*componentState
componentCount int
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we emphasise in variable name what is the difference between current component calculated based on len(states
and desired stateOK componentCount ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tried to improve this in 4c0fca2.

@vitorenesduarte vitorenesduarte force-pushed the vitor/fix-process-state-ok branch from 839b1fc to 4c0fca2 Compare March 23, 2022 12:07
@vitorenesduarte vitorenesduarte force-pushed the vitor/fix-process-state-ok branch from da12339 to 734795f Compare April 6, 2022 15:19
@vitorenesduarte
Copy link
Contributor Author

@zmb3 @smallinsky Since it seems difficult to have a compile-time error when this new function Config.ComponentCount should be updated, maybe we can have the following two run-time errors instead (with the goal of quickly detecting a bug when it's introduced):

  1. if Config.ComponentCount undercounts, then the service fails to start
  2. if Config.ComponentCount overcounts, then an error is logged

This PR already solves 2. with the new error added to function processState.getStateLocked.

While researching how to solve 1., I noticed #11724. The fix for #11724 in #11725 ensures that a startup error is thrown if the TeleportReadyEvent is not emitted.

So, if #11725 makes sense, we can also solve 1. by making sure that the Config.ComponentCount function has equal (not similar) conditions to the ones in the code that creates the TeleportReadyEvent event mapping. (We can't ensure this with a compile-time error, but we can leave a note that the two should be updated together.)

With #11725 merged, I pushed 734795f so that this PR now implements both 1. and 2. @zmb3 @smallinsky

@@ -3440,6 +3451,9 @@ func (process *TeleportProcess) waitForAppDepend() {

// registerTeleportReadyEvent ensures that a TeleportReadyEvent is produced
// when all components have started.
// Note that this function should be kept in sync with the Config.ComponentCount function so that
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the mapping is 1:1, can't we just calculate the value here and store it in the process?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think this is a good idea. Implemented something similar in 16b4228.

@vitorenesduarte
Copy link
Contributor Author

Updated the top-level comment to reflect the latest changes.

@vitorenesduarte vitorenesduarte merged commit b749302 into master Apr 7, 2022
@vitorenesduarte vitorenesduarte deleted the vitor/fix-process-state-ok branch April 7, 2022 11:28
vitorenesduarte pushed a commit that referenced this pull request Apr 7, 2022
…#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.
vitorenesduarte pushed a commit that referenced this pull request Apr 7, 2022
…#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.
vitorenesduarte pushed a commit that referenced this pull request Apr 8, 2022
* Throw startup error if `TeleportReadyEvent` is not emitted (#11725)

* Throw startup error if `TeleportReadyEvent` is not emitted

Before this commit, the `TeleportReadyEvent` was only waited for when a
process reload occurred. Thus, if a bug exists in the code that emits
this event (as it's currently the case since the `MetricsReady` and
`WindowsDesktopReady` events are never emitted), such a bug may go
unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for
on startup, and throws an error if the event is not emitted (after some
timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a
  component that sends heartbeats, which is the case of every other
  event required by the `TeleportReadyEvent` event mapping)
- ensures that `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function

* Ensure stateOK is reported only when all components have sent updates (#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.

* Make `PortList.Pop()` thread-safe (#11799)
vitorenesduarte pushed a commit that referenced this pull request Apr 8, 2022
* Throw startup error if `TeleportReadyEvent` is not emitted (#11725)

* Throw startup error if `TeleportReadyEvent` is not emitted

Before this commit, the `TeleportReadyEvent` was only waited for when a
process reload occurred. Thus, if a bug exists in the code that emits
this event (as it's currently the case since the `MetricsReady` and
`WindowsDesktopReady` events are never emitted), such a bug may go
unnoticed for a while.

This commit ensures that the `TeleportReadyEvent` is always waited for
on startup, and throws an error if the event is not emitted (after some
timeout).

This commit also:
- removes the `MetricsReady` event (as this is not produced by a
  component that sends heartbeats, which is the case of every other
  event required by the `TeleportReadyEvent` event mapping)
- ensures that `WindowsDesktopReady` event is emitted
- refactors some of the code in `lib/service/supervisor.go`
- moves the event mapping registration to a new `registerTeleportReadyEvent` function

* Ensure stateOK is reported only when all components have sent updates (#11249)

Fixes #11065.

This commit:
- ensures  that `TeleportReadyEvent` is only produced when all components that send heartbeats (i.e. call [`process.onHeartbeat`](https://github.com/gravitational/teleport/blob/16bf416556f337b045b66dc9c3f5a3e16f8cc988/lib/service/service.go#L358-L366)) are ready
- changes `TeleportProcess.registerTeleportReadyEvent` so that it returns a count of these components (let's call it `componentCount`)
- uses `componentCount` to also ensure that `stateOK` is only reported when all the components have sent their heartbeat, thus fixing #11065

Since it seems difficult to know when `TeleportProcess.registerTeleportReadyEvent` should be updated, with the goal of quickly detecting a bug when it's introduced we have that:
1. if `componentCount` is lower than it should, then the service fails to start (due to #11725)
2. if `componentCount` is higher than it should, then an error is logged in function `processState.getStateLocked`.

* Make `PortList.Pop()` thread-safe (#11799)
@webvictim webvictim mentioned this pull request Apr 19, 2022
r0mant added a commit that referenced this pull request Apr 26, 2022
r0mant added a commit that referenced this pull request Apr 26, 2022
r0mant added a commit that referenced this pull request Apr 26, 2022
@r0mant r0mant mentioned this pull request Apr 26, 2022
r0mant added a commit that referenced this pull request Apr 26, 2022
…2242)

* Revert "Backport #11725 #11249 #11799 to branch/v8 (#11794)"

This reverts commit 48eedcd.

* Revert "Fix ProxyKube not reporting its readiness (#12153)"

This reverts commit b924b40.
r0mant added a commit that referenced this pull request Apr 26, 2022
* Revert "Make `PortList.Pop()` thread-safe (#11799)"

This reverts commit a17337d.

* Revert "Ensure stateOK is reported only when all components have sent updates (#11249)"

This reverts commit b749302.

* Revert "Throw startup error if `TeleportReadyEvent` is not emitted (#11725)"

This reverts commit 933e247.

* Revert "Fix ProxyKube not reporting its readiness (#12150)"

This reverts commit 6cdcfe7.
r0mant added a commit that referenced this pull request Apr 26, 2022
* Revert "Backport #11725 #11249 #11799 to branch/v9 (#11795)"

This reverts commit ea8ee94.

* Revert "Fix ProxyKube not reporting its readiness (#12152)"

This reverts commit ce301f0.
@webvictim webvictim mentioned this pull request Jun 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud Cloud
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stateOK reported in processState.getState() before all components have sent updates
6 participants