
Periodically resync proxies to agents #18050

Merged

rosstimothy merged 4 commits into master from tross/refresh_agent_proxies on Nov 4, 2022

Conversation

rosstimothy
Contributor

Prior to #14262, resource watchers would periodically close their watcher, create a new one, and refetch the current set of resources. It turns out that the reverse tunnel subsystem relied on this behavior to periodically broadcast the list of proxies to agents during steady state. Now that watchers are persistent and no longer perform a refetch, an agent that is unable to connect to a proxy expires it after a period of time, and since it never receives the periodic refresh, it never attempts to connect to said proxy again.

To remedy this, a new ticker is added to the localsite that grabs the current set of proxies from its proxy watcher and sends a discovery request to the agent. The ticker is set to fire before the tracker would expire the proxy, so that if a proxy exists in the cluster, the agent will continually try to connect to it.

@rosstimothy rosstimothy force-pushed the tross/refresh_agent_proxies branch 2 times, most recently from d7dbb72 to 3523747 Compare November 2, 2022 18:42
@rosstimothy rosstimothy marked this pull request as ready for review November 2, 2022 18:42
Contributor

@espadolini espadolini left a comment

How much data is this, given the relatively inefficient marshaling of the mostly-empty ServerV2?

@rosstimothy
Contributor Author

> How much data is this, given the relatively inefficient marshaling of the mostly-empty ServerV2?

The table below shows the size of a marshaled discoveryRequest for master and the two commits in this PR:

| Commit | Size |
| --- | --- |
| Master | 4766 |
| 3523747 | 3671 |
| 7439815d93 | 801 |
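As a rough illustration of how such payload sizes can be compared, the sketch below marshals a trimmed-down, hypothetical stand-in for ServerV2 (the real type and its marshaling are far richer than this) and measures the byte length of the resulting request:

```go
package main

import "encoding/json"

// serverV2 is a drastically trimmed, hypothetical stand-in for the real
// ServerV2 resource; most of its fields are empty in a discovery request,
// which is why marshaling the full struct is relatively wasteful.
type serverV2 struct {
	Kind     string `json:"kind,omitempty"`
	Version  string `json:"version,omitempty"`
	Name     string `json:"name,omitempty"`
	Addr     string `json:"addr,omitempty"`
	Hostname string `json:"hostname,omitempty"`
}

// discoveryRequestSize marshals a request-like payload and returns its size
// in bytes, mirroring how one might compare payloads across commits.
func discoveryRequestSize(proxies []serverV2) int {
	b, err := json.Marshal(struct {
		Proxies []serverV2 `json:"proxies"`
	}{Proxies: proxies})
	if err != nil {
		panic(err)
	}
	return len(b)
}
```

Dropping or omitting the unused fields is what produces the size reductions shown in the table.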

Prior to #14262, resource watchers would periodically close their watcher,
create a new one, and refetch the current set of resources. It turns out
that the reverse tunnel subsystem relied on this behavior to periodically
broadcast the list of proxies to agents during steady state. Now that
watchers are persistent and no longer perform a refetch, an agent that is
unable to connect to a proxy expires it after a period of time, and
since it never receives the periodic refresh, it never attempts to
connect to said proxy again.

To remedy this, a new ticker is added to the `localsite` that grabs
the current set of proxies from its proxy watcher and sends a discovery
request to the agent. The ticker is set to fire before the tracker
would expire the proxy, so that if a proxy exists in the cluster, the
agent will continually try to connect to it.
@rosstimothy rosstimothy force-pushed the tross/refresh_agent_proxies branch from 5bb90a5 to 66e48d3 Compare November 3, 2022 19:21
@rosstimothy rosstimothy enabled auto-merge (squash) November 3, 2022 19:23
@rosstimothy rosstimothy force-pushed the tross/refresh_agent_proxies branch from b46bd52 to 66523d9 Compare November 3, 2022 21:28
@rosstimothy rosstimothy force-pushed the tross/refresh_agent_proxies branch from 15e4581 to fdde20d Compare November 3, 2022 21:43
@rosstimothy rosstimothy merged commit 3b4c144 into master Nov 4, 2022
@github-actions

github-actions bot commented Nov 4, 2022

@rosstimothy See the table below for backport results.

| Branch | Result |
| --- | --- |
| branch/v10 | Failed |
| branch/v11 | Create PR |
| branch/v8 | Failed |
| branch/v9 | Failed |

rosstimothy added a commit that referenced this pull request Nov 4, 2022
rosstimothy added a commit that referenced this pull request Nov 4, 2022
@rosstimothy rosstimothy deleted the tross/refresh_agent_proxies branch November 4, 2022 15:34
rosstimothy added a commit that referenced this pull request Nov 4, 2022
rosstimothy added a commit that referenced this pull request Nov 7, 2022
* Periodically resync proxies to agents (#18050)

rosstimothy added a commit that referenced this pull request Nov 7, 2022
rosstimothy added a commit that referenced this pull request Nov 7, 2022
rosstimothy added a commit that referenced this pull request Nov 16, 2022
Moves `UpdateTrustedCluster` logging from debug to info so default
logging level includes when admin operations are performed to
establish or remove trust. Alters `remoteSite` such that it logs
in the same manner as `localSite`

Cherry-picks some of the availability changes made in #18050 to
ensure that agents spawned for trusted clusters are more robust to
connection issues.
rosstimothy added a commit that referenced this pull request Nov 18, 2022
* Improve site and trusted cluster logging and availability

Moves `UpdateTrustedCluster` logging from debug to info so default
logging level includes when admin operations are performed to
establish or remove trust. Alters `remoteSite` such that it logs
in the same manner as `localSite`

Cherry-picks some of the availability changes made in #18050 to
ensure that agents spawned for trusted clusters are more robust to
connection issues.

* Ensure metric `remote_cluster` reflects current state

The metric wasn't properly updated when remote sites went offline
or when remote cluster resources were removed. Any change to the
remoteSite state or the remoteCluster resource is now accurately
reflected in the metric.

* Add tracking of outbound connections to remote clusters

The metric `trust_clusters` existed and was exported, but was never
used anywhere. Now when the `RemoteClusterTunnelManager` starts
and stops agent pools it will create and delete a counter for the
cluster. Within the `AgentPool` the metric is set to the number
of connected proxies within `updateConnectedProxies`.
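The lifecycle described above (create a per-cluster counter when the agent pool starts, set it in `updateConnectedProxies`, delete it when the pool stops) can be sketched as follows. The real code presumably wires this through a Prometheus gauge; to stay self-contained, this stand-in uses a mutex-guarded map, and all names besides `updateConnectedProxies` are hypothetical:

```go
package main

import "sync"

// clusterGauges is a minimal stand-in for a per-cluster connected-proxies
// gauge. A mutex-guarded map plays the role of the metric registry.
type clusterGauges struct {
	mu     sync.Mutex
	gauges map[string]int
}

func newClusterGauges() *clusterGauges {
	return &clusterGauges{gauges: make(map[string]int)}
}

// startPool registers a counter for the cluster when its agent pool starts.
func (c *clusterGauges) startPool(cluster string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.gauges[cluster] = 0
}

// stopPool deletes the cluster's counter when its agent pool stops, so
// stale clusters stop being reported.
func (c *clusterGauges) stopPool(cluster string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.gauges, cluster)
}

// updateConnectedProxies sets the counter to the current number of
// connected proxies for the cluster.
func (c *clusterGauges) updateConnectedProxies(cluster string, n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.gauges[cluster]; ok {
		c.gauges[cluster] = n
	}
}

// get reports the counter value and whether the cluster is tracked.
func (c *clusterGauges) get(cluster string) (int, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	n, ok := c.gauges[cluster]
	return n, ok
}
```

Tying creation and deletion to pool start/stop is what keeps the metric from reporting clusters that no longer have an agent pool.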
github-actions bot pushed a commit that referenced this pull request Nov 18, 2022
rosstimothy added a commit that referenced this pull request Nov 18, 2022
rosstimothy added a commit that referenced this pull request Nov 18, 2022
zmb3 pushed a commit that referenced this pull request Nov 18, 2022
zmb3 pushed a commit that referenced this pull request Nov 18, 2022
zmb3 pushed a commit that referenced this pull request Nov 21, 2022